Job Submission Examples (LSF)
Users submit jobs to the server using the bsub command. The current state of the queue on the server can be viewed using bjobs. LSF users also have a host of other utilities available, such as bkill, bmod, bstop, bmig, and bresume. bsub can be used for batch as well as interactive submission of jobs. Interactive job submission should be used only when a user needs to run and debug their code, and for short-duration jobs.
The basic syntax for bsub is simply
- bsub < batchfilename.bat
where batchfilename.bat is a file with shell commands that are to be executed. The first few lines of the batch file should contain BSUB directives (lines starting with #BSUB) that specify the resources that the job requires (e.g., number of nodes, number of processors, memory required, etc.).
A simple batch job example
Suppose you have an R program runme.R in your home directory that runs for a long time and that you would like to run on the cluster. It requires a single CPU for, say, no more than 10 hours. Here's a batch file that would do the trick:
#BSUB -W 10:00
#BSUB -n 1
#BSUB -e <some directory>/%J.err
#BSUB -o <some directory>/%J.out
module load R
cd ~
# execute program
R CMD BATCH runme.R
Here's a breakdown of what the lines in this batch file mean:
- #BSUB -L /bin/csh (an optional directive, not included in the example above) tells LSF to use the C shell as the shell (/bin/bash or /bin/tcsh are other options)
- #BSUB -W 10:00 tells LSF that your job will require no more than 10 hours of walltime to complete. The time format is HH:MM. Some schedulers prioritize short jobs over long jobs, so the less time you ask for, the more likely it is that your job will be scheduled sooner rather than later. Should the actual job length exceed what you requested, your job will be killed. (This feature is currently not used in our implementation, but a default running time will likely be implemented at some point.)
- #BSUB -n 1 asks LSF for one CPU core. This means that when your job starts you will have exclusive access to one CPU. If you want something like 4 nodes each with exactly 2 CPU cores (a total of 8 cores), you would use -n 8 together with -R "span[ptile=2]". If instead you just want any 8 cores in the cluster, you would request only -n 8.
- #BSUB -e ~/lsf_logs/%J.err tells LSF to store all output that would normally be put in stderr into a file in your lsf_logs directory. This file's name will contain the LSF job number and will have suffix .err. This enables you to check whether there were any errors running your R program.
- #BSUB -o ~/lsf_logs/%J.out tells LSF to redirect standard output to a .out file in your lsf_logs directory, named in the same way as the error file in the previous line.
- comment lines: The other lines in the sample script that begin with '#' are comments. The '#' for comments and BSUB directives must be in column one of your script file. The remaining lines in the sample script are executable commands.
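As a rough sketch of how the pieces above fit together, the following shell snippet writes the example batch file and counts its #BSUB directives as a sanity check. The helper name make_batch_file, the output path, and the use of ~/lsf_logs as the log directory are illustrative assumptions, and the bsub call is left commented out so that the snippet also runs on a machine without LSF.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): writes the example batch file.
#   $1 = directory for the .err/.out logs, $2 = batch file to create
make_batch_file() {
    logdir=$1
    outfile=$2
    # Unquoted here-document: $logdir is expanded, %J and ~ are kept literally.
    cat > "$outfile" <<EOF
#BSUB -W 10:00
#BSUB -n 1
#BSUB -e $logdir/%J.err
#BSUB -o $logdir/%J.out
module load R
cd ~
# execute program
R CMD BATCH runme.R
EOF
}

# Assumed locations, chosen only for illustration.
batch="${TMPDIR:-/tmp}/runme.bat"
make_batch_file "$HOME/lsf_logs" "$batch"

# Directives must start in column one; count the lines that do.
grep -c '^#BSUB' "$batch"

# On the cluster, submit with:
# bsub < "$batch"
```

The count printed should be 4, one for each directive in the example.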
A parallel batch job example
Suppose now you have a parallel MPI job that needs 4 processors and you would like to have 2 processors on each of 2 nodes. Here's the corresponding batch file you can submit with bsub:
#BSUB -W 10:00
#BSUB -n 4
#BSUB -R "span[ptile=2]"
#BSUB -e <some directory>/%J.err
#BSUB -o <some directory>/%J.out
module load mpich1/gnu
cd ~
# execute program
mpiexec -np 4 myprogramname
The line #BSUB -n 4 together with -R "span[ptile=2]" requests 2 nodes with 2 processors per node. You could also have requested only -n 4 if you didn't care about their location.
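The relationship between -n, ptile, and the number of nodes is simple arithmetic: the total core count divided by the cores per node, rounded up. A minimal sketch follows; the function name nodes_needed is made up for illustration and is not an LSF command.

```shell
#!/bin/sh
# Hypothetical helper: number of nodes implied by "-n <cores>" combined
# with -R "span[ptile=<per_node>]", i.e. ceil(cores / per_node).
nodes_needed() {
    cores=$1
    per_node=$2
    echo $(( (cores + per_node - 1) / per_node ))
}

nodes_needed 4 2   # -n 4 with ptile=2, prints 2
nodes_needed 8 2   # -n 8 with ptile=2, prints 4
```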
LSF is configured to measure memory in megabytes. So, if you know that your job requires a lot of memory (say, 16 GB), you can request it with the #BSUB -M 16000 directive.
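For convenience, the conversion can be scripted. This small sketch assumes the same convention as the example above (1 GB counted as 1000 MB); mem_directive is a hypothetical helper name, not an LSF utility.

```shell
#!/bin/sh
# Hypothetical helper: print the -M directive for a memory request given
# in whole gigabytes, using 1 GB = 1000 MB as in the example above.
mem_directive() {
    gb=$1
    echo "#BSUB -M $(( gb * 1000 ))"
}

mem_directive 16   # prints: #BSUB -M 16000
```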
An interactive job example
There are several types of interactive access when using LSF. If you just need interactive shell access, use bsub -Is <shell>. If you need to run an interactive batch job, use bsub -Is <script>. This is useful for small debugging or test runs. See the discussion of the -I options in the bsub manpage for additional information. You should use this only for short, interactive runs. If there are no nodes free, the bsub command will wait until they become available. This can be a long wait, even hours, depending on the mix of running and queued jobs. Please check that nodes are available before issuing bsub -Is. You can determine whether there are free nodes by using the bslots command.
bsub -Is -n 2 -W 30 myscript
This requests interactive access to 2 processors for thirty minutes. Change the number of nodes and processors and the time to suit your needs.
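The advice to check for free slots before requesting an interactive job could be wrapped up roughly as follows. This is only a sketch: the function name interactive_job and the DRY_RUN variable are assumptions introduced here. By default the function only prints the bsub command it would run, so it can be tried on a machine without LSF.

```shell
#!/bin/sh
# Hypothetical wrapper around an interactive request.
#   $1 = number of processors, $2 = walltime in minutes, rest = command
# DRY_RUN (an assumed variable, default 1) makes it print the command only.
interactive_job() {
    cores=$1
    minutes=$2
    shift 2
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "bsub -Is -n $cores -W $minutes $*"
    else
        bslots                                   # show free slots first
        bsub -Is -n "$cores" -W "$minutes" "$@"  # then request the job
    fi
}

interactive_job 2 30 myscript   # prints: bsub -Is -n 2 -W 30 myscript
```

Setting DRY_RUN=0 on the cluster runs bslots and then the actual bsub request.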
X-Forwarding in interactive jobs
LSF also provides X-Forwarding through a special -XF switch that you can use in an interactive job. Please note that the -XF switch always needs a -I switch (for example, -Is). LSF will complain, correctly so, if your DISPLAY environment variable is not set properly for a job request that has the -XF switch. Example job requests with X-Forwarding look like this:
bsub -Is -XF -n 2 <shell>
bsub -Is -XF -n 2 <script>
Once nodes are allocated to you, you will receive a command prompt. Type ^C (control-c) or "exit" to exit the job.
The default wall clock time for jobs is 1 hour, and this includes jobs submitted using bsub -I. When you use bsub -I you hold your processors whether you compute or not. Thus, as soon as you are done with your job commands you should type ^D to end your interactive job. If you submit an interactive job and do not specify a wall clock time, you will hold your processors for 1 hour or until you type ^D.
There are several software licenses managed by the Division of Biomedical Informatics including Matlab, Totalview, Discovery Studio, and GeneSpring.
Although most of these packages use a floating license model, some do not. LSF can talk natively to a FlexLM-based license manager to retrieve information on the available licenses for a package. In cases where this is not possible, either because the license manager is not FlexLM-based or because the licenses are node-locked, LSF can be configured to treat the package as a Generic Resource rather than as software.
Some of these packages can be accessed from the compute nodes, if required. LSF has been configured to allocate a node only when licenses for the specific software requested by the user are available. With the option below added to the other LSF options on the command line or in the batch file, the <software_name> software will be reserved out of the FlexLM server for a duration of <number> minutes in LSF, by which time your application is expected to start up the software and thereby obtain the license from the FlexLM servers directly.
Please note that if your job is slow to start up the application and it takes more than <number> minutes before the application license is checked out, LSF might assume that the license is available and give it out to another job.
Software that can be requested as a software:
Software that can be requested as a generic resource:
• We have implemented a set of scripts for some commonly-used interactive applications, which will transparently perform job submission.
• The following applications currently have a wrapper script:
◦ Autodock (3 and 4)
• The scripts accept the following parameters, which are for the LSF batch manager:
◦ lsf-walltime (in hours:min or min format)
◦ lsf-mem (in MB)
Please note that these parameters are not mandatory, and if not given, a site-defined default will be applied.
• To start any of these applications, please do as below:
<software_name> <parameters for LSF> <application parameters>
• plink --lsf-mem=2000 --lsf-walltime=24:00 --script _scr
• GeneSpring --lsf-mem=8000
• R CMD BATCH Rbatch.bat
• R --lsf-mem=12000 --lsf-walltime=10:00 CMD BATCH Rbatch.bat
• matlab --lsf-walltime=600 example1.m
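How the site's wrapper scripts work internally is not documented here. The sketch below is only a guess at the general idea: it separates the --lsf-* parameters from the application's own parameters and prints a corresponding bsub command line. The function name build_bsub_command and the mapping of --lsf-mem to -M and --lsf-walltime to -W are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of what a wrapper might do (not the site's script).
#   $1 = application name, remaining arguments = LSF and application options
build_bsub_command() {
    app=$1
    shift
    opts=""
    rest=""
    for arg in "$@"; do
        case $arg in
            --lsf-mem=*)      opts="$opts -M ${arg#--lsf-mem=}" ;;
            --lsf-walltime=*) opts="$opts -W ${arg#--lsf-walltime=}" ;;
            *)                rest="$rest $arg" ;;
        esac
    done
    # Printed for display only; arguments containing spaces are not re-quoted.
    echo "bsub$opts $app$rest"
}

build_bsub_command R --lsf-mem=12000 --lsf-walltime=10:00 CMD BATCH Rbatch.bat
build_bsub_command matlab --lsf-walltime=600 example1.m
```

When neither parameter is given, no -M or -W option is emitted, which corresponds to the site-defined defaults mentioned above being applied.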
Other useful hints
What if the firewall keeps kicking me out? How can I preserve my session across such freezes?
1. ssh ssh.research.cchmc.org -l<username>
2. ssh bmiclusterp.chmcres.cchmc.org
3. screen (and then enter when you get a message in your screen).
4. In the command prompt, do your usual bsub thing.
5. If you lose your network connection after some time, just repeat steps 1 and 2 and you will get a new session on bmiclusterp.
Once you are there, just type 'screen -d -r' and that should get you your interactive LSF session back.