Sun Grid Engine Documentation

Quick start

This document is intended for new users of the service. It gives information on such basics as how to log in, where your work space is and where to get help.

The simple principle is that if you have a job, or sequence of jobs, that can be run unattended on a suitable machine, it can be queued for submission on the Grid Engine. The system is designed to make the queuing process simple, and to provide most of the support you need to just do it!

Getting Help

More information is available in the SGE FAQ section of these web pages and in the SGE 'man' pages, or you may wish to contact ithelp@geos.ed.ac.uk.

Getting Started

To access the cluster, you need to be logged in to a "submit" host such as ssh.geos.ed.ac.uk. Other Linux machines can be turned into submit hosts on request, though there is little advantage to this: opening a window on a submit host is a simple operation.

All significant disk usage during a job should be on the local scratch disk (/scratch/local) for efficiency. The system prepares a dedicated piece of local scratch for each job, which the job can reach through the environment variable $TMPDIR. This special directory is cleaned up by the system after your job exits, so you need to copy any results elsewhere before the job finishes.

What a job looks like

To help the Grid Engine run jobs efficiently, you need to tell it what resources you will use. When the system is busy, the fewer resources you request, the sooner your job is likely to run, so it is best to describe your job as accurately as you can. Always err on the generous side, though, since a job that runs past the run time it requested is stopped by the system.

Such job parameters can be passed on the qsub command line, or can be written into the job script itself (on lines beginning #$), which keeps things together better. Each job runs independently, and (without trying to cheat the system) you generally do not know which machine it will run on.
Thus you should probably begin by copying any files you need to read more than once into $TMPDIR, then run the program, put the output somewhere safe and you're done.

jobscript.sh

    #!/bin/sh
    # This is an example job script that can be used to submit a job to an SGE cluster.
    # It is also possible to run it stand-alone,
    # since the SGE specific commands are shell comments.

    # Arguments to pass to qsub whenever this script is used:

    # Set the maximum run time to 1 hour, 0 minutes, 0 seconds:
    #$ -l h_rt=01:00:00

    # Redirect stdout/stderr files to the directory the job is submitted from:
    #$ -cwd
    # This will also be the current directory when the job starts

    # Copy a data file into the job-provided temporary directory TMPDIR
    cp /home/user/data/dir/datafile.dat.gz "$TMPDIR"

    # Unzip the data we collected (this all happens on the local node)
    gunzip "$TMPDIR/datafile.dat.gz"

    # Now run myprogram on the datafile, recording the results locally
    ./myprogram "$TMPDIR/datafile.dat" >"$TMPDIR/results"

    # Compress the results before transfer
    gzip "$TMPDIR/results"

    # Now copy the results out - here we are using the current directory '.'
    cp "$TMPDIR/results.gz" .
    # Alternatively we might have used a home directory, or shared space

Submitting a Job

Once your job is ready, you can submit it to the cluster with the command:

    qsub jobscript.sh

where jobscript.sh is the script that you have prepared. Output (stdout) and errors (stderr) from the job will be redirected to files. These files will be in the directory you submitted the job from if you use the -cwd option, and in your home directory otherwise. The names are constructed from the name of the job script, .o for output or .e for errors, and the job number. So the command above might create files such as 'jobscript.sh.o12345' and 'jobscript.sh.e12345', where '12345' is the number assigned to your job when you submit it.

Cancelling a Job

To cancel a job, use the command:

    qdel 12345

where 12345 is the job number assigned to your job.
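The job number ties the submit, output and cancel steps together, so it can be useful to capture it. The following is a minimal sketch, not part of SGE: `job_number_from_message` and `job_output_files` are hypothetical helpers, and the confirmation message format is an assumption about what qsub typically prints. It uses a canned message so that it runs without a cluster, and builds the file names by the naming rule described under Submitting a Job.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not SGE commands.

# Extract the job number from a qsub confirmation message.
# Assumption: the message looks like
#   Your job 12345 ("jobscript.sh") has been submitted
job_number_from_message() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Print the expected stdout and stderr file names for a job,
# assuming the names use the script's file name without its directory.
job_output_files() {
    script_name=$(basename "$1")
    printf '%s.o%s\n%s.e%s\n' "$script_name" "$2" "$script_name" "$2"
}

# Canned message standing in for the output of 'qsub jobscript.sh'.
message='Your job 12345 ("jobscript.sh") has been submitted'
job_number=$(job_number_from_message "$message")

# Lists jobscript.sh.o12345 and jobscript.sh.e12345
job_output_files jobscript.sh "$job_number"
```

On a real submit host the canned message would be replaced by the text that qsub prints, and the extracted number is the one to pass to qdel.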

Job status

To query the status of your submitted jobs, use the command:

    qstat

or for further information,

    qstat -f

Parallel Jobs

The SGE is aware of some types of parallel job control. Where suitable support is available, the SGE can allocate an appropriate number of "slots" to allow the job to run effectively.

OpenMP

To use OpenMP, you need to specify that you wish to use that Parallel Environment, and declare the number of slots. This is done with an argument to qsub (probably within the job script): -pe openmp n (where n is the number of slots). Note that this only prepares the SGE to accept an OpenMP job; the job itself must still make suitable declarations so that the OpenMP libraries know what is required. A typical script might thus include:

openmp script extract

    #$ -pe openmp 4
    export OMP_NUM_THREADS=4
    ./myprog

Array Jobs

When you want to run a number of mostly identical jobs, with the only difference being input parameters or data sets, you should submit an Array Job. This feature lets you manage a whole series of jobs with one command. You might have considered submitting each job separately:

    qsub job.sh data.1

Instead, use the following command to submit an array job:

    qsub -t 1-100 job.array.sh data

The SGE provides the number of each task in the array job in the environment variable SGE_TASK_ID. Here 'job.array.sh' looks like:

job.array.sh

    # This is an example job array script that can be used to submit an array job to an SGE cluster

    # Arguments to pass to 'qsub' whenever this script is used:

    # Set the maximum run time to 1 hour, 0 minutes, 0 seconds:
    #$ -l h_rt=01:00:00

    # Redirect stdout/stderr files to the directory the job is submitted from:
    #$ -cwd

    # The job to be run once per task
    ./job.sh "$1.$SGE_TASK_ID"

This will schedule 100 tasks, identical except that each one takes its input from data.N, with N counting up from 1 to 100. $1 represents the string passed on the command line (in this case 'data'), and $SGE_TASK_ID represents the counter.
The job.sh could be the same as the earlier jobscript.sh example, adapted to take the name of its data file from its first argument ($1) rather than from a fixed path.
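
Before submitting a large array job, it can help to check which input names the tasks would be given. The sketch below does this as a dry run on any machine, with no cluster needed. `task_input` is a hypothetical helper that joins the command line string and the counter in the same way as job.array.sh; the fallback to task 1 when SGE_TASK_ID is unset is an assumption made here so that the sketch can run stand-alone.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; it is not an SGE command.

# Build the input name for one array task, joining the string passed on the
# command line and the task counter as job.array.sh does with $1.$SGE_TASK_ID.
task_input() {
    # Assumption: outside SGE the variable is unset, so fall back to task 1.
    printf '%s.%s\n' "$1" "${SGE_TASK_ID:-1}"
}

# Dry run: list the inputs that the first three tasks of
# 'qsub -t 1-100 job.array.sh data' would be given.
for SGE_TASK_ID in 1 2 3; do
    task_input data
done
```

Under SGE the loop is not needed: each task of the array runs the script once with its own value of SGE_TASK_ID already set.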
© School of GeoSciences
Last modified: 27 Feb, 2009