06 October 2009

Queues and charging for resource usage on lightning

This section provides information about lightning queues and charging. Charges depend on the queue in which your job runs, so it is important to consider the charging formula when selecting a queue.

To make informed decisions about queues and charging, you need a basic understanding of lightning's system architecture. It is worthwhile to review the lightning main page if you are not already familiar with the system.

Charging

Charges for jobs run on lightning are assessed in General Accounting Units (GAUs).

Charging formula

This formula specifies how your computing account is charged for running jobs on lightning.

GAUs charged = wallclock hours used * number of nodes used * number of processors per node * computer factor * queue charging factor

The "computer factor" is a multiplier that equalizes the way GAUs are consumed on different computing platforms. Faster computers have higher computer factors. The computer factor for lightning is 0.34.

The "queue charging factor" is a multiplier that reflects the priority given to jobs in a queue: higher-priority jobs are charged more.

Exceeding allocation threshold limits*

Jobs from NCAR divisions or CSL proposal groups that have exceeded either the 30-day or 90-day usage limits* will be placed in the hold queue and run at a priority below jobs in the economy queue. Affected jobs will be charged at 1/3 the rate they would have been charged if they had been run in a regular queue ("rg").
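
Using the same hypothetical job as in the example above, a run that would have cost 2 * 4 * 2 * 0.34 * 1.0 = 5.44 GAUs in the regular queue would instead be charged about 5.44 / 3 ≈ 1.81 GAUs when run from the hold queue.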

Jobs from NCAR divisions or CSL proposal groups that have exceeded both the 30-day and 90-day usage limits* will be rejected, and users will receive an email suggesting that they submit their jobs to a standby queue. Note that standby queue time limits are six hours, so users may need to change their job's time limit before resubmitting to a standby queue.
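
As a minimal sketch of such a resubmission (this assumes the standard LSF #BSUB script directives used with the bsub command; the script name, processor count, and executable are hypothetical), the queue and wallclock-limit lines of a job script might be changed to:

#!/bin/sh
# standby_job.lsf -- hypothetical resubmission script
#BSUB -q standby
# wallclock limit reduced to the six-hour standby maximum (hours:minutes)
#BSUB -W 6:00
# hypothetical processor count and executable
#BSUB -n 32
./my_model

and the job resubmitted with:

bsub < standby_job.lsf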

* The IBM SP-cluster systems reference manual provides details about how jobs are scheduled for execution when an allocation threshold limit has been exceeded. See the section Allocation thresholds for projects influence job scheduling.

Queue names and uses

The queue names and uses are listed below; a sample submission command showing how to select a queue follows the list:

  • The special queue is used for large, long-running jobs that must be approved by the SCD Division Director. To request use of this queue, send email to cal@ucar.edu.
  • The premium queue is for jobs that need to run before jobs in the lower queues.
  • The regular queue is for medium-priority jobs.
  • The economy queue is for low-priority jobs.
  • The standby queue is for jobs that can be run only when idle nodes are available. This is a very low-priority queue that is designed to make otherwise unused cycles available to users. Therefore, turnaround time in the standby queue may be extremely long. Caution: If use of the standby queue increases to the point where it interferes with normal work, the standby jobs will be suspended and will remain in a "hold" state until unused cycles become available.
  • The share queue runs on the two interactive nodes only. Examples of jobs appropriate to run in the share queue include:
    • Short interactive jobs
    • Pre-staging data to the MSS before a simulation run
    • Post-staging data from the MSS after a simulation run
    • General script or application development or checkout
    • General simulation runs that do not require large amounts of memory, I/O, or CPU time.
  • The hold queue is an automated queue for jobs from divisions or groups that have exceeded their allocation threshold limit. Jobs in the hold queue have priority only over jobs in the standby queue.
  • The debug queue is a daytime queue dedicated to debugging applications that will be run on lightning. This queue is for jobs being converted to run on lightning and jobs being developed to run on lightning.
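
As a hedged illustration of queue selection at submission time (this assumes the standard LSF bsub command; the processor count, wallclock limit, and executable are hypothetical), a low-priority job could be directed to the economy queue to take advantage of its lower charging factor with:

bsub -q economy -n 16 -W 6:00 ./my_model

Substituting "-q premium" would place the same job in the higher-priority, higher-cost premium queue.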

Queues for batch jobs on lightning

The queue structure for lightning is:

Queue      CPUs     Maximum wallclock hours   Memory   Queue charging factor   Availability
special    240***   Unlimited                 4 GB     1.0                     By special permission
premium    240***   6 hrs.                    4 GB     1.5                     Any time
regular    240***   6 hrs.                    4 GB     1.0                     Any time
economy    240***   6 hrs.                    4 GB     0.5                     Any time
standby    240***   6 hrs.                    4 GB     0.1                     Any time
share*     4        6 hrs.                    8 GB     1.0                     Any time
hold**     256      6 hrs.                    4 GB     0.33                    Automated
debug      16       0.5 hrs.                  4 GB     1.0                     10 am - 6 pm daily
* The share queue is available on the login nodes only.
** Jobs are automatically moved to the lower-priority hold queue when allocation threshold limits are exceeded.
*** Lightning originally offered 256 batch processors (128 nodes). However, the system is no longer under IBM maintenance, so when a node develops a severe problem it is removed rather than replaced. In addition, some projects require dedicated nodes, which are also taken from the general-usage batch pool for the duration of the project. Users who need to run their jobs on the largest possible number of lightning processors are encouraged to contact us via CISL Customer Support to learn how many processors are available, or to run the command "bhosts | grep ok | wc" to learn how many nodes are immediately available.
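
Building on that command (and assuming two processors per batch node, as described elsewhere on this page), a quick sketch for estimating the immediately available capacity is:

bhosts | grep -c ok

which prints the number of nodes currently in the "ok" state; multiplying that count by two gives an approximate number of available processors.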

Full-capacity issues

Be aware that usage issues arise when nearly all of lightning's nodes are in use; users running jobs that require the entire system are also affected.

To provide the debug queue in the daytime, 8 compute nodes (16 processors) are removed from the special, premium, regular, economy, and standby queues, and only 120 nodes (240 processors) are available to these other queues.

Jobs that need more than 240 processors therefore cannot run until after 6:00 pm, and when the system is saturated with jobs, more processors become available in the evenings.

For more detailed system information

To obtain queue information, type:
bqueues -l
while logged on to lightning.

To check the status of all users' jobs on lightning, type:
bjobs -u all
while logged on to lightning, or to see a summary of all running jobs, type:
lsfq
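
As a small additional sketch (this assumes the standard PEND and RUN job states reported by bjobs), you can count the running and waiting jobs with:

bjobs -u all | grep -c RUN
bjobs -u all | grep -c PEND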

To see your GPFS disk quota, type:
/usr/lpp/mmfs/bin/mmlsquota
while logged on to lightning. Note that disk space is oversubscribed to maximize the amount that is used; because of this, not everyone can use all of their disk space simultaneously.
