What is the ICS grid?
The ICS grid contains groups of distributed compute clusters, both public and research. These clusters are unified by the Sun Grid Engine (SGE), through which jobs are submitted, scheduled, and dispatched according to available resources. Jobs may embed requests for certain types of resources, or these can be requested at submission time. Your job will be queued and dispatched when the appropriate resources are available.
Research clusters are owned by various faculty and have dedicated nodes to run their computing jobs. Jobs submitted directly to these clusters are rejected with an access-denied error unless permission has been granted beforehand. The ICS grid also has public clusters for general use. These have wall clock limits so that you, the general public, may have a turn on the grid. Below is a listing of the public queues.
Public Queue       OS                  Runtime Limit   Slots                    Arch
12hour_cluster.q   CentOS 5 (RHEL 5)   12 hours        16 (Offline after F12)   x86_64
15day_cluster.q    CentOS 5 (RHEL 5)   15 days         36 (Offline after F12)   x86_64
12hour.q           CentOS 6 (RHEL 6)   12 hours        0 (Online W13)           x86_64
15day.q            CentOS 6 (RHEL 6)   15 days         0 (Online W13)           x86_64

Note: The 12hour_cluster.q and 15day_cluster.q will be retired after Fall quarter, 2012.
For a list of all queues, you can query SGE:
$ qconf -sql
To get details on a particular queue, such as time, nodes, access, etc.:
$ qconf -sq 12hour.q
How do I request an account?
All ICS accounts can access the public queues. Contact the ICS helpdesk in order to get access to research queues.
How do I access my account?
Linux/Solaris/MacOS:
To access the grid, simply log in via ssh; these systems have ssh installed by default.
$ ssh -Y icsaccount@openlab.ics.uci.edu
Windows:
On Windows, use the free PuTTY application or one of these alternatives. Telnet access is NOT available. If you use ssh without the -Y or -X option, you won't be able to view X11 graphics.
I'm logged in, now what?
Once you are logged in, you have a shell from which you can access ICS computing resources, such as our software stack and the grid. To use them, you need to know a few commands before you begin.
$ module load sge
Let's go over the command above. The module command is our method of loading software applications from our software stack. The application in question here is sge. Loading it sets up the environment variables needed to run the commands provided by SGE. You may edit your shell rc files to load sge automatically if it is your primary use, or simply so you don't have to run the command at every login. Having the module loaded takes up no additional resources even if you don't use it; the command only modifies your environment variables.
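If you want sge loaded at every login, a guarded line in your shell rc file is one way to do it. The following is a minimal sketch, assuming your login shell is bash and your rc file is ~/.bashrc; the function name load_sge_module is made up for this example.

```shell
# Hypothetical lines for ~/.bashrc (assumes bash).
# Load the sge module only when the "module" command is available,
# so the rc file stays harmless on hosts without the ICS software stack.
load_sge_module() {
    if type module >/dev/null 2>&1; then
        module load sge
    fi
}
load_sge_module
```

The guard matters mostly if the same home directory is shared with machines where the module command is not set up.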
For more on modules and our software stack please visit our page here.
I've got SGE loaded, what command can I run?
For starters, the first command to get accustomed to is qstat. Running qstat alone shows nothing the first time, because by default it displays the status of your own submitted jobs. To get a full listing, run:
$ qstat -f
Once you've submitted a job, you will be given a job id, which you can use to track the details of your job:
$ qstat -j «job_id»
That command gives you more detail about why your job is sitting in the queue waiting to be dispatched. Resources may be busy and in use, so you'll have to sit tight. If your job sits in the queue for too long, especially in our 12 hour or 15 day public queues, there's definitely something fishy going on, so please email helpdesk so we can look at the queues more thoroughly.
Now that that is out of the way, let's submit your first job! The simplest method is to not request any particular resource constraints, such as memory, platform, or SGE queues for that matter.
$ qsub /auto/sge-6.2u5/examples/jobs/sleeper.sh
Exciting, I know, given the name. Once dispatched, this simple script runs on an SGE-selected host and sleeps for a given number of seconds. It is mostly used to test that basic SGE scheduling is working, and it is a good thing to run when your custom scripts or applications don't run but you think they should. You can test particular SGE queues by specifying a queue request during submission, as follows:
$ qsub -q 12hour.q /auto/sge-6.2u5/examples/jobs/sleeper.sh
There are various requests that can be made. You can either read the man pages or visit this usage page here. The usage page is a valuable resource for using the grid more effectively in your routine tasks. One example is job arrays, for scenarios that require many identical jobs working on different datasets: you submit a single job array of a set size to do just that. Otherwise the scheduler gets bombarded with hundreds of thousands of jobs and can be brought down. That is not fun when you have to explain to your colleagues on a deadline that you single-handedly brought SGE down.
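A job array of the kind described above might look like the sketch below. The -t option and the SGE_TASK_ID variable are standard SGE features; the script name array_job.sh, the input file naming scheme, and the range 1-100 are assumptions made for illustration only.

```shell
# Write a hypothetical array job script; the file names are illustrative.
cat > array_job.sh <<'EOF'
#!/bin/bash
#$ -cwd
#$ -N array_demo
#$ -t 1-100
# SGE sets SGE_TASK_ID to this task's index (1..100 here).
INPUT="input.${SGE_TASK_ID}.txt"
echo "Task ${SGE_TASK_ID} would process ${INPUT}"
EOF

# Submit all 100 tasks as a single job (only if qsub is on the PATH).
if command -v qsub >/dev/null 2>&1; then
    qsub array_job.sh
else
    echo "qsub not found; run 'module load sge' first" >&2
fi
```

The scheduler then tracks one job with 100 tasks rather than 100 separate submissions.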
How do I get files/scripts to and from my account?
For systems managed by ICS Computing Support, you will find that your Unix home directory has already been mounted via NFS. All of our nodes on the grid automount your home directory, as well as the software stack mentioned above that you access through modules. If this applies to you, then you are already off on the right foot and have it easy when it comes to transporting files to and from your ICS account for use on the grid.
For systems not managed by ICS, the easiest and most widely available method is secure copy, such as scp. Besides the command line scp utility bundled with all Linux, Solaris and Mac hosts, there are GUI clients for MacOS and Windows, and of course, Linux. If you have large collections of files or large individual files that change only partially, you might be interested in using rsync as well.
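On the command line, the scp and rsync transfers mentioned above could be wrapped as in the sketch below. It assumes that the openlab login host from the ssh example also accepts scp and rsync over ssh; the function names and the example paths are made up, and icsaccount stands in for your own login.

```shell
# Assumption: replace "icsaccount" with your own ICS login name.
REMOTE="icsaccount@openlab.ics.uci.edu"

# push_file FILE: copy one file into your ICS home directory with scp.
push_file() {
    scp "$1" "${REMOTE}:"
}

# sync_dir DIR: mirror a local directory to the same name under your ICS
# home directory. rsync sends only the parts of files that changed.
sync_dir() {
    rsync -avz --partial "$1/" "${REMOTE}:$1/"
}

# Example calls (hypothetical paths):
#   push_file data.tar.gz
#   sync_dir results
```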
For Windows users, we recommend the free WinSCP application, which gives you a graphical interface for SCP, SFTP and FTP. Those who are on our Windows UCI-ICS domain will already have their H: drive mounted, so you can just drag and drop directly to your home directory.
For Mac OS X users, we recommend the free, though oddly named, Cyberduck application, which provides graphical file browsing via FTP, SCP/SFTP, WebDAV, and even Amazon S3(!). Macs also have access to your Unix home so like Windows, it's also a matter of drag and drop.
For Linux and Solaris users, we recommend using the built-in capabilities of KDE's Swiss Army knife browser Konqueror or the twin-panel file manager Krusader, both of which support the secure file browsing kio plugin called fish. Advanced users should read the document HOWTO_move_data, which discusses in detail how to transfer large amounts of data over the network. Again, if you have a Linux or Solaris install from ICS, you will already have your home directory mounted and accessible from your desktop.
How do I use the grid?
There are basically two ways to use SGE: batch submission and interactive sessions, both described below. The Sun Grid Engine scheduler controls all access to the grid's compute nodes. All jobs must use the qsub or qrsh commands, which submit jobs to the grid in an orderly fashion. There are policies in place to prevent a single user from dominating the machine by flooding the queue with jobs, in particular wall clock limits on our public queues of 12 hours, or 15 days for longer jobs. THIS SYSTEM IS NOT FOOLPROOF. Please be courteous and run as few simultaneous jobs as possible, particularly if you notice (with the qstat command) that there is a lot of usage.
You can check the status of the batch queue backlog using the qstat and qhost commands. For more information see the section on Monitoring and Controlling Jobs from the wiki.
Send email to helpdesk@ics.uci.edu if you have complaints about your job turnaround or if you need special scheduling considerations to meet a project deadline.
Batch submission:
Use the qsub command to submit a job script to the grid. A job script consists of UNIX directives, comments and executable statements. It is important to remember that all the commands in the job script execute serially on the node that runs your script.
TIP: When your script begins execution the working directory is your home directory. Use the -cwd option with qsub to use the current working directory (wherever you currently are). Output and error will be directed wherever you happen to be, allowing for a cleaner environment. Otherwise stdout and stderr go to your home directory.
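As a rough illustration of a job script that uses -cwd, consider the sketch below. The directive lines starting with #$ are standard SGE syntax; the script name hello_job.sh, the job name, and the log file name are assumptions chosen for the example.

```shell
# Write a hypothetical job script; names are illustrative only.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N hello_grid
#$ -j y
#$ -o hello_grid.log
# The commands below run serially on the node SGE selects.
echo "Running on $(hostname)"
echo "Working directory: $(pwd)"
EOF

# Submit from the directory where you want the output to land.
if command -v qsub >/dev/null 2>&1; then
    qsub hello_job.sh
else
    echo "qsub not found; run 'module load sge' first" >&2
fi
```

Here -j y merges stderr into stdout, and with -cwd the relative path given to -o is resolved against the directory you submitted from.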
SGE provides an abundant number of examples for you to start out with. You can find them located at $SGE_ROOT/examples or just Google SGE examples and you'll have plenty of references.
There are several things to look out for when using batch submissions.
- Submitting a large set of jobs. This has the effect of slowing the scheduler to a crawl, or at times making it inaccessible to you and everyone else on the grid. The best solution is to use job arrays.
- If you chain jobs, remember to check the status of the previous job before spawning another qsub. This will prevent your scripts from flooding the scheduler with failed jobs.
- Remember that, for your convenience, we automount your home directory along with our software stack. Please be aware that if you flood the scheduler with jobs that mount the software stack or access data from your home directory, your scripts may fail while trying to look up a file or directory. If you encounter this, please make your scripts more robust by adding random timers or by giving the lookup some time to return your mounts.
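The last two points can be sketched as a pair of shell helpers. This is only one possible approach, assuming bash; the function names wait_for_path and chain_two_steps, the retry count, and the 1-5 second delay are all made up for illustration.

```shell
# wait_for_path PATH [TRIES]: return 0 as soon as PATH exists. Between
# attempts, sleep a random 1-5 seconds so that many jobs starting together
# do not all hit the automounter at the same moment. Assumes bash ($RANDOM).
wait_for_path() {
    path=$1
    tries=${2:-5}
    attempt=1
    while :; do
        if [ -e "$path" ]; then
            return 0
        fi
        if [ "$attempt" -ge "$tries" ]; then
            echo "giving up on $path after $tries tries" >&2
            return 1
        fi
        sleep $(( (RANDOM % 5) + 1 ))
        attempt=$((attempt + 1))
    done
}

# chain_two_steps FIRST SECOND: submit SECOND only if FIRST succeeded.
# "qsub -sync y" waits for the job to finish and exits with its status.
chain_two_steps() {
    if qsub -sync y "$1"; then
        qsub "$2"
    else
        echo "$1 failed; not submitting $2" >&2
        return 1
    fi
}
```

Inside a job script, a call such as wait_for_path "$HOME/data" || exit 1 (with a hypothetical path) would stop the job early instead of letting it fail halfway through.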
Interactive:
The submission of interactive jobs instead of batch jobs is useful in situations where a job requires your direct input to influence the job results. Such situations are typical for X Windows applications or for tasks in which your interpretation of immediate results is required to steer further processing.
You can create interactive jobs in three ways:
- qlogin – An rlogin-like session that is started on a host selected by the Grid Engine software.
- qrsh – The equivalent of the standard UNIX rsh facility. A command is run remotely on a host selected by the Grid Engine system. If no command is specified, a remote rlogin session is started on a remote host.
- qsh – An xterm that is displayed from the machine that is running the job. The display is set corresponding to your specification or to the setting of the DISPLAY environment variable. If the DISPLAY variable is not set, and if no display destination is defined, the Grid Engine system directs the xterm to the 0.0 screen of the X server on the host from which the job was submitted.
The default handling of interactive jobs differs from the handling of batch jobs. Interactive jobs are not queued if the jobs cannot be executed when they are submitted. When a job is not queued immediately, the user is notified that the cluster is currently too busy.
You can change this default behavior with the -now no option to qsh, qlogin, and qrsh. If you use this option, interactive jobs are queued like batch jobs. When you use the -now yes option, batch jobs that are submitted with qsub can also be handled like interactive jobs. Such batch jobs are either dispatched for running immediately, or they are rejected.
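For instance, a small wrapper around qrsh with -now no could look like the sketch below. The function name queued_shell is made up, and 12hour.q is simply the public queue from the listing earlier on this page.

```shell
# queued_shell [QUEUE]: request an interactive shell and wait in the queue
# (-now no) instead of being turned away when the cluster is busy.
# Defaults to the 12hour.q public queue if no queue name is given.
queued_shell() {
    qrsh -q "${1:-12hour.q}" -now no
}

# Example calls:
#   queued_shell            # wait for a slot on 12hour.q
#   queued_shell 15day.q    # wait for a slot on 15day.q
```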
Where can I get more information on grid computing?
The best resource is always the user documentation located here. Your peers are the next best since they may already have experience using the grid for what may be the very purpose you might need.
Where do I report problems?
Please contact ICS helpdesk if you have problems accessing the grid or if there is a failure beyond normal script logic failures. We are here to make sure the grid is operational and that you have access so that you can make full use of the grid. When emailing please articulate what problem(s) you experience, what scripts were run and how they were submitted to the grid. Please copy/paste any relevant error messages. This will help us troubleshoot the problem in a quick and timely fashion.