Introduction
SGE or Sun Grid Engine is typically used on a computer farm or computer cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses.
- How do I use SGE in the ICS environment?
- What version of SGE is ICS running?
- How many machines are in the cluster?
- How many CPUs are in each machine?
- How much memory and disk space is available?
- How can I find out more information about SGE?
- How do I submit/remove/modify a job?
- Why do my jobs get stuck in queue with an error state?
- Who is allowed to use SGE?
- Which cluster nodes are available for me to use, since some are private?
- Would it be possible to run other grid frameworks on these clusters?
Q: How do I use SGE in the ICS environment?
SGE is installed as a module, so you can load it with the command below:
% module load sge
NOTE: refer to the modules help for further detail.
Q: What version of SGE is ICS running?
We are currently running SGE 6u8.
Q: How many machines are in the cluster?
There are 346 hosts in the computing grid (qconf -sel).
Q: How many CPUs are in each machine?
The total number of CPUs is 751 (qconf -sep), which includes both sol-amd64 and lx24-x86 architectures; the same command lists the processor count for each host. The grid has 20 cluster queues (qconf -sql), each grouping hosts that share a similar hardware configuration.
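The host and queue counts quoted above come from qconf, so they can be re-checked at any time. The following is a minimal sketch, assuming the SGE tools are already on your PATH (for example after loading the sge module); the function names count_lines and grid_summary are made up for illustration.

```shell
# Count the non-empty lines arriving on standard input.
count_lines() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Print how many execution hosts and cluster queues the grid reports.
# qconf -sel and qconf -sql each print one name per line.
grid_summary() {
  if ! command -v qconf >/dev/null 2>&1; then
    echo "qconf not found; try 'module load sge' first" >&2
    return 1
  fi
  echo "execution hosts: $(qconf -sel | count_lines)"
  echo "cluster queues: $(qconf -sql | count_lines)"
}
```

Calling grid_summary on a host where SGE is loaded prints the two counts. For per-host CPU numbers, read the table printed by qconf -sep directly.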
Q: How much memory and disk space is available?
Memory varies from host to host, but should be the same within a cluster group. If your job requires a certain amount of memory, you can specify soft or hard resource limits when you submit it, or use qstat to query the resources that are available.
As for disk space, SGE jobs have access to local tmp space on each host for working data, but you can also specify an NFS share that your department leases or owns. SGE provides job parameters to set soft and hard limits for this as well. Please refer to the user manual.
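As an illustration of hard and soft resource requests, here is a sketch of a job script with embedded qsub options. It assumes that h_vmem and mem_free (standard Grid Engine resource attributes) can be requested on this grid, which is not confirmed here; the 2G values and the job name example_job are purely illustrative. Check the user guide or helpdesk for the attributes that apply to your queue.

```shell
#!/bin/sh
# Embedded qsub options: lines starting with "#$" are read by qsub
# and ignored by the shell.
#$ -N example_job
#$ -cwd
#$ -hard -l h_vmem=2G
#$ -soft -l mem_free=2G

# SGE sets TMPDIR to a per-job scratch directory on the execution host.
# Fall back to /tmp so the script can also be tried outside the grid.
scratch="${TMPDIR:-/tmp}"
workfile="$scratch/example_job.$$"

# Stand-in for real work: write intermediate data to local scratch.
echo "running on $(uname -n)" > "$workfile"
lines=$(wc -l < "$workfile")

# Clean up the scratch file and report what was done.
rm -f "$workfile"
echo "wrote $((lines)) line(s) to $scratch"
```

A hard request must be satisfied before the job is scheduled, while a soft request is a preference the scheduler tries to honor. The same options can be given on the qsub command line instead of inside the script.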
Q: How can I find out more information about SGE?
PDF documentation is supplied with the installation; see /auto/sge-6.0/doc/N1GE6Update4_User_Guide.pdf
Q: How do I submit/remove/modify a job?
All of these operations are covered in the user guide.
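For orientation, the three operations map onto the qsub, qalter, and qdel commands. The wrappers below are a hedged sketch rather than a recommended workflow: the function names are hypothetical, and submit_job assumes qsub reports success with a message of the form Your job <id> ("<name>") has been submitted.

```shell
# Submit a job script from the current directory and print its job ID.
# Assumes qsub prints: Your job <id> ("<name>") has been submitted
submit_job() {
  # $1 = job name, $2 = path to the job script
  qsub -cwd -N "$1" "$2" | awk '$1 == "Your" && $2 == "job" { print $3 }'
}

# Modify a pending job: ask for it to run in a different queue.
move_job() {
  # $1 = job ID, $2 = queue name
  qalter -q "$2" "$1"
}

# Remove a job from the grid.
remove_job() {
  # $1 = job ID
  qdel "$1"
}
```

For example, id=$(submit_job demo job.sh) followed by remove_job "$id" submits a script named job.sh and then deletes the resulting job.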
Q: Why do my jobs get stuck in queue with an error state?
A job can be stuck for many reasons. To find the exact cause, run the following command:
% qstat -j <job_id>
NOTE: if the error is related to an infrastructure problem, it helps to include that output when you submit a trouble report to helpdesk.
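The qstat -j output can be long, so it may help to pull out only the lines that explain the error. A minimal sketch, assuming the relevant lines contain the text "error reason" (the helper name job_error_reason is made up for illustration):

```shell
# Print the "error reason" lines that qstat -j reports for a job,
# or a short notice when there are none.
job_error_reason() {
  # $1 = job ID
  qstat -j "$1" 2>&1 | grep -i 'error reason' \
    || echo "no error reason reported for job $1"
}
```

The filtered lines are a convenient summary to paste into a helpdesk report, alongside the full output if requested.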
Q: Who is allowed to use SGE?
SGE is used mostly by researchers running simulations or experiments, but anyone is welcome to request access at helpdesk@ics.uci.edu.
Q: Which cluster nodes are available for me to use, since some are private?
We currently have simpsons_cluster.q and hayes_cluster.q for open use, but you can collaborate and coordinate with other research groups to share their resources. For further information about who owns which private cluster groups, please contact helpdesk.
Q: Would it be possible to run other grid frameworks on these clusters?
No. We already have an infrastructure solution in SGE and will not be running other grid computing frameworks in conjunction with it.