Policies

The DMZ cluster maintained by the Division of Biomedical Informatics is for authorized users only. If you have any questions regarding usage or would like to submit any feedback, please email the BMI Help Desk.

Also, please make sure you are subscribed to the bmi-cluster-users mailing list so that you receive important announcements regarding downtimes, policy changes, etc. You can subscribe yourself by browsing the following web page: http://mailman.cchmc.org/mailman/listinfo.

By using this cluster, you acknowledge that you understand and accept the policies and scheduling limits outlined below.

  • Cluster walltime quota
  1. Starting October 2010, every cluster user has a default walltime quota of 10,000 hours per quarter.
  2. Any additional walltime hours should be requested ahead of time (by emailing help@bmi.cchmc.org). They are charged at 1¢/hour, and any unused hours will be credited to your account for the next quarter.
  3. All compute nodes and queues are currently charged at the same rate, but this may change in the future as QoS requirements differ.
  4. The Gold allocation system has been integrated with LSF, so your jobs will go into the PENDING (queued) state if you have exhausted your walltime allocation for the current quarter on the project you are charging against.
  5. The system sends hourly reports on PENDING jobs to the BMI RT ticket system.
  6. Documentation on how to use Gold to check your current walltime usage, among other things, is available here.
  • We have implemented wrapper scripts for some commonly used interactive applications; these scripts transparently perform job submission.
  • The following applications currently have a wrapper script:
    • GeneSpring
    • R
    • Matlab
    • plink
    • Autodock (3 and 4)
  • The scripts accept the following parameters, which are passed to the LSF batch manager:
    • lsf-walltime
    • lsf-mem
    • lsf-pmem
    • lsf-vmem

Please note that these parameters are optional; if they are not given, a site-defined default will be applied.

  • To start any of these applications, use the following form:

<software_name> <parameters for LSF> <application parameters>

Examples

  • plink --lsf-mem=2gb --lsf-walltime=24:00:00 --script _scr
  • GeneSpring --lsf-mem=8gb
  • R CMD BATCH Rbatch.bat
  • R --lsf-mem=12gb --lsf-walltime=10:00:00 CMD BATCH Rbatch.bat
  • matlab --lsf-walltime=10:00:00 example1.m
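Internally, each wrapper separates the `--lsf-*` flags from the application's own arguments and assembles a batch submission. The site-maintained scripts are not shown here; the following is a minimal sketch of the idea, in which the mapping of `--lsf-mem` and `--lsf-walltime` onto `bsub` options is an illustrative assumption, not the actual implementation:

```shell
#!/bin/sh
# Sketch of a wrapper script: split --lsf-* flags from application
# arguments and build a bsub command line. The flag-to-option
# mappings below are illustrative assumptions, not the site's
# actual wrapper logic.
build_bsub_cmd() {
    app="$1"; shift
    bsub_opts=""
    app_args=""
    for arg in "$@"; do
        case "$arg" in
            --lsf-walltime=*) bsub_opts="$bsub_opts -W ${arg#--lsf-walltime=}" ;;
            --lsf-mem=*)      bsub_opts="$bsub_opts -R rusage[mem=${arg#--lsf-mem=}]" ;;
            *)                app_args="$app_args $arg" ;;
        esac
    done
    # Dry run: print the command instead of executing it.
    echo "bsub$bsub_opts $app$app_args"
}

build_bsub_cmd R --lsf-mem=12gb --lsf-walltime=10:00:00 CMD BATCH Rbatch.bat
```

A real wrapper would execute the assembled command rather than print it, and would likely normalize the memory and walltime values into the units LSF expects.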

Current Limits

  • Currently, only users who have a job running on a node can gain SSH access to that node. This access policy prevents users from running jobs outside of the batch controller. It requires the following change in how you work:
    • When you submit a job, make sure that the job copies any temporary files back to your home directory, or to another location you can access after the job completes.
  • All users have a default CPU limit of 75. If your currently running jobs account for 75 processors, any new jobs you submit will go into the PENDING (queued) state and will not be scheduled until one or more of your current jobs complete. For example, if you have five parallel jobs running in the cluster, each with 15 CPUs, all 75 of your allowed CPUs are allocated; a new serial job requesting one CPU will not run until one of the parallel jobs completes.
    • Reasoning: This policy is set to prevent a single user from flooding the entire cluster with jobs. Even though 75 CPUs is the limit, please be considerate to others and break down your jobs into smaller (CPU time) jobs.
    • If you plan to use more than 75 CPUs at one time, please contact us to make a reservation.
    • If you want to run jobs on more than 75 CPUs (subject to availability) without going through a reservation policy, follow the procedure below.
    1. First off, know that the following condition applies to these additional jobs:
      1. If you are over your 75-CPU limit and the cluster is full, and another user then wants to run jobs within their own 75-CPU quota, the scheduler will stop and requeue your over-quota jobs. Now that you know the policy, read on to understand how to submit additional jobs over your quota.

Queue policies

  • By default, all user jobs go to the "normal" queue.

Node/Resource requests

  • By default, all jobs are assigned a resource requirement of 384 MB of memory and one CPU hour. If your job exceeds these limits and you did not explicitly request different values, it may be terminated (some exceptions apply).
  • Our nodes have different resources (some have faster CPUs, some slower; memory ranges from 4 GB on some nodes to 32 GB on others). This has a direct effect on how resources are scheduled to jobs.
  • If all you need is 4 CPUs in the cluster, and it does not matter how those processors are allocated (all in 1 node, 1 CPU per node across 4 nodes, 2 per node in 2 different nodes, etc.), use
    prakash@bmiclusterp1:~> bsub -n 4
  • If you need a specific number of CPUs per node, specify it as follows:
    prakash@bmiclusterp1:~> bsub -n 8 -R "span[ptile=2]"

    In this case, you will get a total of 8 CPUs, 2 processors per node.
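Since the default allocation is only 384 MB of memory, jobs that need more should request it explicitly in the `bsub` resource string. One caveat when bypassing the wrappers: the wrapper flags accept sizes like `2gb`, while a raw `rusage[mem=...]` value is commonly interpreted in MB (this depends on the site's LSF configuration, so treat it as an assumption and verify locally). A small converter helper is sketched below:

```shell
# Convert a human-readable size such as "2gb" or "512mb" into the
# MB figure commonly expected by rusage[mem=...]. Assumes MB-based
# LSF limits, which is site-dependent.
to_mb() {
    size=$(echo "$1" | tr 'A-Z' 'a-z')
    case "$size" in
        *gb) echo $(( ${size%gb} * 1024 )) ;;
        *mb) echo "${size%mb}" ;;
        *)   echo "$size" ;;   # bare number: assume it is already MB
    esac
}

# e.g. reserve 2 GB of memory for a single-CPU job
# (myjob.sh is a placeholder for your own job script):
#   bsub -n 1 -R "rusage[mem=$(to_mb 2gb)]" ./myjob.sh
to_mb 2gb
```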

Time-shared node

  • Justification

This should be used when all of the following are true.

  • You want to run a quick job for a project.
  • You find that the cluster is 100% reserved (by both jobs and user reservations).
  • You know your job would take only 10 minutes to complete, and it is not feasible to wait a long time to run such a short job.
  • You are willing to run the job on a node that is overcommitted (you may be competing for resources on that node with other users doing the same).
Solution

We will have one 8-core node that is overcommitted to the scheduler as having 24 cores.

Usage
  • In your batch script (or on the qsub command line), use "-l advres=ts1" and you should get the time-shared node (provided the overcommitted 24 cores are not already requested by other users).
  • No more than 2 processors can be requested per job from the time-shared node.
  • The walltime of a job on this node cannot exceed 2 days.
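Putting the constraints above together, a submission to the time-shared node might look like the following sketch, where `myjob.sh` is a placeholder job script and the `nodes`/`ppn`/`walltime` resource syntax is an assumption about the site's batch front end (only the `-l advres=ts1` part is given by the policy above):

```shell
# Request the ts1 reservation, the per-job maximum of 2 processors,
# and a walltime at the 2-day cap (myjob.sh is a placeholder).
qsub -l advres=ts1 -l nodes=1:ppn=2,walltime=48:00:00 myjob.sh
```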