
Policies

Read about policies and scheduling limits before using the cluster.

The DMZ cluster maintained by the Division of Biomedical Informatics is for authorized users only. If you have any questions regarding usage or would like to submit any feedback, please email the BMI Help Desk.

Also, please make sure you are subscribed to the bmi-cluster-users mailing list so that you receive important announcements about downtimes, policy changes, etc. You can subscribe at the following web page: http://mailman.cchmc.org/mailman/listinfo.

By using this cluster, you acknowledge that you understand and accept the policies and scheduling limits outlined below.

Current Limits

  • Currently only users who have a job running on a node can gain SSH access to that node. This access policy prevents users from submitting jobs outside of the batch controller. This necessitates the following changes for you as a user:
    • When you submit a job, please make sure that your job copies all of its temporary files back to your home directory or to another location that you can access after the job completes.
  • There is a limit of 140 CPUs for jobs in the large queue (> 24 hours). This is approximately 1/3 of the total number of CPUs in the cluster. This limit
    • ensures that at least some CPUs are always available for short (<= 2 hours) and medium-duration (<= 24 hours) jobs.
    • encourages users not to over-estimate their job's walltime requirements.
  • There is a maximum walltime limit of 8640000 seconds (2400 hours or 100 days) per user for jobs to get scheduled to run. For example:
    • User A has 100 jobs with walltime requests of 24 hours each, so the total requested walltime is 2400 hours. Suppose these jobs have all been running for 2 hours, so the total remaining walltime is 2400 - (2 x 100) = 2200 hours. At this point, user A can only submit new jobs with a cumulative walltime request of less than 2400 - 2200 = 200 hours. If the new jobs request more than 200 hours in total, the jobs that go beyond the limit stay queued. If they request less than 200 hours, the jobs are eligible to be scheduled, subject to CPU availability.
  • The above policies are ignored if the incoming request is an interactive job request. In such a case, there is a maximum wall time limit of 12 hours. This limit is to prevent users from abusing this policy.
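
The walltime arithmetic in the User A example can be sketched in a few lines of Python. This is only an illustration of the calculation, not the scheduler's actual implementation; the function name and the (requested, elapsed) input format are assumptions made for this example.

```python
# Per-user cap on total remaining walltime, from the policy text
# (8640000 seconds = 2400 hours).
WALLTIME_LIMIT_HOURS = 2400


def remaining_budget_hours(running_jobs):
    """Return the walltime hours still available for new job requests.

    running_jobs is a list of (requested_hours, elapsed_hours) pairs,
    a hypothetical input format chosen for this sketch.
    """
    committed = sum(max(requested - elapsed, 0)
                    for requested, elapsed in running_jobs)
    return WALLTIME_LIMIT_HOURS - committed


# User A: 100 jobs of 24 hours each, all running for 2 hours so far.
jobs = [(24, 2)] * 100
print(remaining_budget_hours(jobs))  # 200
```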

Access policies

  • You need privileges to submit jobs to the cluster. So, if your job submission fails with an error like
    prakash@fructose:~> qsub -I -lnodes=1
    qsub: Job rejected by all possible destinations
    it is highly likely that you have not requested job submission privileges for the cluster. If you need this access, please email the BMI Help Desk.

Queue policies

  • By default, all requests from users go to a routing queue called "routing" which in turn routes those jobs to execution queues based on the resource requests.
  • Any request from users belonging to the 'wwwgroup' system group is routed to the 'www' Torque queue. Requests from other users fall into several other queues named 'small', 'medium', 'large', 'interactive', etc.
  • An upper limit of 140 CPUs is enforced on the 'large' queue. This is to encourage users to provide accurate walltimes for their jobs instead of overestimating. This number will change based on the total number of CPUs available in the cluster.
  • Please remove any reference to the 'users' queue from your job batch file (or from your qsub command line if you submit interactively) as soon as convenient. It is no longer required, as the routing queue routes requests to the appropriate queues automatically.
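
As a rough illustration of how a walltime request relates to the execution queues, the sketch below applies the duration thresholds given under Current Limits (<= 2 hours, <= 24 hours, > 24 hours). Mapping "short" jobs to the 'small' queue is an assumption, the function name is hypothetical, and the real routing rules live in the Torque configuration and may consider other resources as well.

```python
def expected_queue(walltime_hours, interactive=False):
    """Guess which execution queue a request would be routed to.

    Thresholds come from the policy text; the mapping of short jobs
    to 'small' is an assumption made for this sketch.
    """
    if interactive:
        return "interactive"
    if walltime_hours <= 2:
        return "small"
    if walltime_hours <= 24:
        return "medium"
    return "large"


print(expected_queue(1))    # small
print(expected_queue(24))   # medium
print(expected_queue(48))   # large
```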

Node/Resource requests

  • All users have a default CPU limit of 60. If your jobs currently running in the cluster account for 60 CPUs, any new jobs you submit will go into Q status and will not be scheduled to run until one or more of your current jobs complete. For example, if you have five parallel jobs running in the cluster, each with 12 CPUs, you have 60 CPUs allocated to your jobs. If you then submit a serial job requesting one CPU, that job will not run until one of your parallel jobs completes.
    • Reasoning: This policy is set to prevent a single user from flooding the entire cluster with jobs. Even though 60 CPUs is the limit, please be considerate of others and break your jobs down into smaller (CPU time) jobs.
    • If you do require more than 60 CPUs at any one time, please contact Michal Kouril to make special arrangements.
  • By default, all jobs are assigned a resource requirement of 900 MB of memory and one CPU hour. If your job exceeds these limits and you did not explicitly request different values for these resources, your job may be terminated (some exceptions apply).
  • We have nodes with different resources (some have faster CPUs, some slower; memory ranges from 4 GB on some nodes to 32 GB on others). This has a direct effect on how resources are scheduled for jobs.
  • If all you want is 4 CPUs in the cluster, and if it does not matter how those processors are allocated (all in 1 node, or 1 CPU per node in 4 nodes, or 2 per node in 2 different nodes etc.), please use
    prakash@fructose:~> qsub -l nodes=4
  • If you need a specific number of CPUs per node, specify it like this:
    prakash@fructose:~> qsub -l nodes=4:ppn=2
    In this case, you will get a total of 8 CPUs, but this does NOT guarantee that those CPUs come from 4 DIFFERENT nodes. It only tells the scheduler to allocate at least 2 processors per node. So, if you run cat $PBS_NODEFILE from an interactive session, you might find the nodes allocated as below, which may be surprising. The scheduler allocated a total of 8 CPUs and made sure there are at least 2 CPUs per allocated node (but there are only 2 DIFFERENT nodes).
    prakash@bmi-opt2-11:~> cat $PBS_NODEFILE
    bmi-opt2-11
    bmi-opt2-11
    bmi-opt2-11
    bmi-opt2-11
    bmi-opt2-12
    bmi-opt2-12
    bmi-opt2-12
    bmi-opt2-12
  • If you want a specific number of CPUs to be allocated from each DIFFERENT node, use
    qsub -l nodes=20,tpn=2
    This will allocate a total of 20 CPUs, but will also guarantee that only 2 CPUs are allocated from each DIFFERENT node. So a total of 10 DIFFERENT nodes will be allocated in this case. The output of cat $PBS_NODEFILE is shown here.
    prakash@bmi-opt2-07:~> cat $PBS_NODEFILE
    bmi-opt2-07
    bmi-opt2-07
    bmi-opt2-08
    bmi-opt2-08
    bmi-opt2-09
    bmi-opt2-09
    bmi-opt2-10
    bmi-opt2-10
    bmi-opt2-11
    bmi-opt2-11
    bmi-opt2-12
    bmi-opt2-12
    bmi-opt2-13
    bmi-opt2-13
    bmi-opt2-14
    bmi-opt2-14
    bmi-opt2-15
    bmi-opt2-15
    bmiwebd1
    bmiwebd1
  • When nodes are allocated to jobs, they are assigned on a first-available basis in a predefined order. The current order of systems can be obtained by executing
    pbsnodes -a | grep  "^[^  ]"
    on the cluster headnode (currently fructose). Generally, the nodes with the least memory and CPU come first in this order. If you do not specify any memory requirements in your job request, you get the default value, which may not be what you want. If your job has specific memory requirements, you can request them like this:
    qsub -l nodes=4,mem=<number><mb/kb/gb>
    and that should allocate the correct resources for your job, if they are available.
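
To check how a request such as nodes=4:ppn=2 was actually laid out, you can count how often each host appears in the node file. The following is a minimal Python sketch; the function name is hypothetical, and the sample text simply reproduces the 8-CPU listing shown above.

```python
from collections import Counter


def cpus_per_node(nodefile_text):
    """Count how many CPU slots each node contributes to a job."""
    hosts = [line.strip() for line in nodefile_text.splitlines()
             if line.strip()]
    return Counter(hosts)


# Same layout as the nodes=4:ppn=2 listing above: 8 CPUs on 2 nodes.
sample = "\n".join(["bmi-opt2-11"] * 4 + ["bmi-opt2-12"] * 4)
counts = cpus_per_node(sample)
print(len(counts), sum(counts.values()))  # 2 8
```

Inside a running job, the same function could be applied to the contents of the file named by the PBS_NODEFILE environment variable.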

Software requests

There are several software licenses managed by the Division of Biomedical Informatics including Matlab, Totalview, Discovery Studio, and GeneSpring.

Although most of these packages use a floating license model, some do not. Moab can talk natively to a FlexLM-based license manager to retrieve information on the licenses available for a package. Where this is not possible, either because the license manager is not FlexLM-based or because the licenses are node-locked, Moab can be configured to treat these packages as Generic Resources rather than as software.

Some of these packages can be accessed from the compute nodes, if required. Moab has been configured to allocate a node only when licenses for the software requested by the user are available. This is how to request software using Torque:

qsub <-I> -lnodes=<number_of_nodes>,software=<software_name>

If you want to request software that has been configured as a generic resource, use the following form instead:

qsub <-I> -lnodes=<number_of_nodes>,gres=<software_name>

Software that can be requested with the software= option:

Bioinformatics_Toolbox
Compiler
Curve_Fitting_Toolbox
MATLAB
Neural_Network_Toolbox
Optimization_Toolbox
Signal_Toolbox
Spline_Toolbox
Statistics_Toolbox
Wavelet_Toolbox
Biopolymer
CHARMM
CNX_XRAY
DS_Analysis
DS_ModelingVisualizer
DS_ProjectKM
DS_ProteinFamilies
DS_ProteinSimSrch
DS_SequenceAnalysis
DelPhi
LIGANDFIT
LIGSCORE
License_Holder
Ludi
Ludi_Genfra
Ludi_Score
ProteinFamilies_Client
health
modeler
xbuild

Software that can be requested as a generic resource:

genespring
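
The choice between software= and gres= can be captured in a small helper that builds the resource string passed to qsub -l. This is a sketch only: the function and constant names are hypothetical, and it does not check that the name appears in the lists above.

```python
# Packages configured as generic resources, per the list above.
GENERIC_RESOURCES = {"genespring"}


def license_request(nodes, software_name):
    """Build the value for 'qsub -l' for a licensed package."""
    key = "gres" if software_name in GENERIC_RESOURCES else "software"
    return f"nodes={nodes},{key}={software_name}"


print(license_request(1, "CHARMM"))      # nodes=1,software=CHARMM
print(license_request(1, "genespring"))  # nodes=1,gres=genespring
```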

Note

Matlab and its toolboxes are currently not available on the compute nodes. If you have a strong need for them, please let Prakash Velayutham know.

NEWSFLASH

  • We have implemented a set of scripts for some commonly-used interactive applications, which will transparently perform job submission.
  • The following applications currently have a wrapper script:
    • GeneSpring
    • R
    • Matlab
    • plink
    • Autodock (3 and 4)
  • The scripts accept the following parameters which are for the Torque batch manager:
    • pbs-walltime
    • pbs-mem
    • pbs-pmem
    • pbs-vmem

Please note that these parameters are not mandatory, and if not given, a site-defined default will be applied.

  • To start any of these applications, use the following form:

<software_name> <parameters for torque> <application parameters>

Example

  • plink --pbs-mem=2gb --pbs-walltime=24:00:00 --script _scr
  • GeneSpring --pbs-mem=8gb
  • R CMD BATCH Rbatch.bat
  • R --pbs-mem=12gb --pbs-walltime=10:00:00 CMD BATCH Rbatch.bat
  • matlab --pbs-walltime=10:00:00 example1.m
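
To illustrate how a command line mixes Torque parameters with application parameters, the sketch below splits the --pbs-* options from everything else. It is a hypothetical illustration of the convention described above, not the code of the actual wrapper scripts, and the function name is an assumption.

```python
PBS_PREFIX = "--pbs-"


def split_wrapper_args(argv):
    """Separate --pbs-NAME=VALUE options from application arguments."""
    pbs_options, app_args = {}, []
    for arg in argv:
        if arg.startswith(PBS_PREFIX) and "=" in arg:
            name, value = arg[len(PBS_PREFIX):].split("=", 1)
            pbs_options[name] = value
        else:
            app_args.append(arg)
    return pbs_options, app_args


opts, rest = split_wrapper_args(
    ["--pbs-mem=12gb", "--pbs-walltime=10:00:00", "CMD", "BATCH", "Rbatch.bat"]
)
print(opts)  # {'mem': '12gb', 'walltime': '10:00:00'}
print(rest)  # ['CMD', 'BATCH', 'Rbatch.bat']
```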