
LaPalma3 (3b): Useful commands

Please note that all the SIEpedia's articles address specific issues or questions raised by IAC users, so they do not attempt to be rigorous or exhaustive, and may or may not be useful or applicable in different or more general contexts.

Executing your applications

Like other supercomputers, LaPalma3 uses a batch-queuing system (SLURM v17.11.0) to manage users' executions. To run your application, you therefore have to submit it to the queue, specifying some parameters, and the system will execute it when resources allow. The most useful commands for managing your jobs are described here, but we also recommend that you check the SLURM Quick Start User Guide.

We have gathered a list of useful commands when working with LaPalma3:

  1. Submitting jobs
  2. Checking the status of jobs
  3. Deleting jobs
  4. Modifying jobs (updating/holding/suspending)
  5. Other useful commands (consumed resources, status of queues, etc.)

Useful commands that you may need before executing your application (connecting to LaPalma3, transferring files, compiling, etc.) are listed at Useful Commands (preparations).

Submitting jobs

Jobs are submitted with the sbatch command, specifying information such as the number of processors, the parallel environment, the location of the executable file, etc. Although this information can be given on the command line, the usual (and recommended) way is to write all the parameters in a script file that is then specified when submitting. To learn how to prepare your script files, please check these examples. Once your script file is ready, run the following command:

  [lapalma1]$ sbatch <script_file>

When you submit the script file, your parameters will be checked and you will be informed if there are errors. If the job is accepted, you will receive some info, like the job id that you may need later to get further details about that job or even cancel it.

  [lapalma1]$ sbatch mytest.sub
  Submitted batch job 1234           # Your job id is 1234
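
If you want to reuse the job id in later commands, it can be extracted from the message printed by sbatch. The following is only a minimal sketch: job_id_from_sbatch is a hypothetical helper (not a SLURM command), and it assumes the message has the form "Submitted batch job <id>" shown above.

```shell
# Hypothetical helper: extract the numeric job id from the message printed by sbatch.
# Assumes the message looks like "Submitted batch job <id>".
# Prints nothing if the message does not match that form.
job_id_from_sbatch() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Usage on the login node (shown as comments, not executed here):
#   jobid=$(job_id_from_sbatch "$(sbatch mytest.sub)")
#   scontrol show job "$jobid"
```

sbatch also has a --parsable option that prints only the job id; see man sbatch to check how it behaves on LaPalma3.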

If you just want to check that your submit script has no errors, or to get an estimate of when it will probably be executed, you can use the --test-only flag. This simulates the submission without actually performing it:

  [lapalma1]$ sbatch --test-only mytest.sub

IMPORTANT: All parallel programs must be executed by the queue system. Do NOT attempt to run your parallel applications interactively on the login nodes. (see this FAQ)



There are a couple of scripts that show information about how many nodes (and cores) are idle at that moment. You might want to use that information when asking for resources:

  # Show number of idle nodes:
  [lapalma1]$ idlenodes

  # Show number of idle cores (basically 16 times the number of nodes):
  [lapalma1]$ idlecores

Note: The information shown by these commands may lag a few seconds behind the real current status of the queue.
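
Since the number of idle cores is basically 16 times the number of idle nodes, the relation can also be used the other way round, to estimate how many nodes a job needs for a given number of cores. This is only an illustrative sketch: nodes_for_cores is a hypothetical helper, and the default of 16 cores per node is taken from the comment above.

```shell
# Hypothetical helper: number of nodes needed for a given number of cores.
# Assumes 16 cores per node unless a second argument says otherwise.
nodes_for_cores() {
    cores=$1
    per_node=${2:-16}
    # Integer division rounded up, so a partially used node still counts.
    echo $(( (cores + per_node - 1) / per_node ))
}

# Usage: nodes_for_cores 40    (prints the value you could pass to -N)
```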



Checking the status of jobs

You can check the status of jobs using the following commands:

  • Check status of jobs (you will see ONLY your own jobs)
   [lapalma1]$ squeue

Using this command you will get useful information:

  • JOBID: The ID of each job. You will need this value to perform operations on a job (ask for further details, cancel it, etc.). If you are using array jobs, JOBID will have the format XX_YY, where XX is the array job ID and YY is the task ID
  • PARTITION: The queue where a job is being or will be executed (it will usually be express or batch)
  • NAME: The name of the job, given by -J parameter in the script file
  • USER: Owner of that job
  • ST: Status of the job. The most common are R for running and PD for pending, but there are many more possible states; you can check the job state codes here
  • TIME: running time
  • NODES: number of nodes that are being used or will be used (you can specify it using -N parameter in your script file)
  • NODELIST(REASON): if the job is running without problems, it shows the list of nodes that are being used. If the job is not running, it shows a short description of the reason (you can check the complete list of job reason codes). The most common reasons are the following:
    • PartitionTimeLimit or PartitionNodeLimit: you are asking for more time or more nodes than are available in the partition (queue). Your job will probably never run, so change the walltime or the number of nodes, respectively.
    • Resources (or None): at this moment there are not enough free resources (nodes) to satisfy your job, so it will wait until the needed resources become available
    • Dependency: this job depends on other job(s) that have not finished yet
    • Priority: the system is running jobs with higher priority
    • AssociationJobLimit: the global limit of hours may already have been reached

The squeue command has many useful options (use man squeue to see all of them):

   [lapalma1]$ squeue -t RUNNING   # List only my running jobs
   [lapalma1]$ squeue -t PENDING   # List only my pending jobs
   [lapalma1]$ squeue -r           # When running array jobs, list one task per line
   [lapalma1]$ squeue -o ...       # Specify the output format
   [lapalma1]$ squeue -S ...       # Specify listing order
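
As an example of the -o option, the output of squeue can be reduced to the state code of each job and then summarized. The following is a sketch under stated assumptions: count_job_states is a hypothetical helper, and it expects one state code per line, as printed by squeue -h -o "%t" (-h suppresses the header, %t is the compact job state).

```shell
# Hypothetical helper: count jobs per state code.
# Reads one state code per line on stdin (for example R or PD)
# and prints "<state> <count>" lines sorted by state.
count_job_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the login node (shown as a comment, not executed here):
#   squeue -h -o "%t" | count_job_states
```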

You can also use the jobtimes script, which displays timing information about your jobs: the estimated start and end times of pending jobs, and the total, used and remaining time of running jobs.

   # Show times of your jobs:
   [lapalma1]$ jobtimes


  • Show status and info of the job with id <job_id>
   [lapalma1]$ scontrol show job <job_id>

This command shows detailed information about the job. If your job is not running, search for the text Reason= to find out why it is still pending. Also take a look at StartTime=, which gives an estimate of when your job could be executed. Adding -d or -dd will show more details when available.
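
Searching for Reason= and StartTime= can be automated with standard text tools. This is a minimal sketch, not part of SLURM: job_field is a hypothetical helper, and it assumes that scontrol show job prints space-separated Key=Value pairs whose values contain no spaces.

```shell
# Hypothetical helper: print the value of one Key=Value field.
# $1 is the field name (for example Reason or StartTime);
# stdin is the output of "scontrol show job <job_id>".
job_field() {
    # Put every Key=Value pair on its own line, then keep the first match.
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Usage on the login node (shown as comments, not executed here):
#   scontrol show job 1234 | job_field Reason
#   scontrol show job 1234 | job_field StartTime
```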

   [lapalma1]$ sstat -j <job_id>

With this command you can get a large set of information about the status of running jobs and the consumed hardware resources, like: CPU time, Virtual Memory size, I/O operations size, page faults, Resident Set size, etc. With sstat -e you get a complete list of parameters that can be displayed, and then you can use -o ... or --format=... options to specify which one(s) you want to show (see man sstat for more details).

   [lapalma1]$ sacct -j <job_id>
   [lapalma1]$ sacct --format=JobID,JobName,NNodes,NCPUs,AllocCPUs,MAXRSS,Elapsed,TotalCPU,State -j <job_id>

This command displays accounting information about running or completed jobs (you can specify a time range). With sacct -e you get the complete list of fields that can be displayed, and you can then use the -o ... or --format=... options to choose which ones to show (see man sacct for more details). This command is especially useful for monitoring memory usage and checking that the job is not trying to use more memory than is available.
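
When monitoring memory, it can help to convert the MaxRSS values to a single unit before comparing them with the memory of a node. The sketch below is only illustrative: rss_to_mb is a hypothetical helper, and it assumes that MaxRSS is printed as a number followed by a K, M or G suffix (a value without suffix is treated as bytes).

```shell
# Hypothetical helper: convert a MaxRSS value such as 2048K, 3M or 1G
# to whole megabytes (truncated). Assumes a K, M or G suffix;
# a value without suffix is assumed to be in bytes.
rss_to_mb() {
    printf '%s\n' "$1" | awk '{
        unit = substr($0, length($0))   # last character: the unit suffix, if any
        num = $0 + 0                    # leading numeric part
        if (unit == "K")      mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else                  mb = num / (1024 * 1024)
        printf "%d\n", mb
    }'
}

# Usage on the login node (shown as a comment, not executed here):
#   sacct -n -P --format=JobID,MaxRSS -j 1234
#   rss_to_mb 2048K
```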

Deleting jobs

  • Remove running or waiting jobs (you need to be the owner of those jobs):
   [lapalma1]$ scancel <job_id>

   # Examples:
   [lapalma1]$ scancel 1234        # Cancel job 1234
   [lapalma1]$ scancel 123[4-6]    # Cancel jobs 1234, 1235 and 1236
   [lapalma1]$ scancel 1234 1236   # Cancel jobs 1234 and 1236

Modifying jobs (updating/holding/suspending)

  • If you need to change some of the options of a job after submitting it (TimeLimit, Partition, etc.), you can cancel the job, edit the script and resubmit it, or you can update the job with the scontrol command. Use the same option names as those displayed by scontrol show job <job_id>; they are case-insensitive. Not all options can be changed after submission, and some of them also depend on the current state of the job:
   [lapalma1]$ scontrol update JobID=<job_id> <option>=<value> 

   # Examples:
   [lapalma1]$ scontrol update JobID=1234 TimeLimit=02:00:00       # Update job 1234: set new walltime to 2 hours
   [lapalma1]$ scontrol update JobID=1234 Partition=express        # Update job 1234: set new queue to express
  • Sometimes you may be interested in holding/suspending some jobs (you need to be the owner of those jobs, <job_id> could be one id or a list of them):
   [lapalma1]$ scontrol hold <job_id>               # Hold a pending job
   [lapalma1]$ scontrol release <job_id>            # Release a previously held job
   [lapalma1]$ scontrol suspend <job_id>            # Suspend a running job
   [lapalma1]$ scontrol resume <job_id>             # Resume a previously suspended job
   [lapalma1]$ scontrol requeue <job_id>            # Cancel a running job and queue it again
   [lapalma1]$ scontrol requeuehold <job_id>        # Cancel a running job and hold it



Other useful commands (consumed resources, status of queues, etc.)

Resources

You can use commands like sreport, sacct and sstat to see how many resources have been consumed by your jobs (time, memory, etc.). Some examples:

  • See how many hours you have used in a given period (for example, in March 2018):
   [lapalma1]$ sreport -t hour cluster UserUtilizationByAccount Start=2018-03-01T00:00:00 End=2018-03-31T23:59:59
  • See how much CPU time has been consumed by each of your jobs in a given period (for example, in March 2018):
   [lapalma1]$ sacct -T -X -D -S 2018-03-01T00:00:00 -E 2018-03-31T23:59:59 -o JobID,JobName,NCPUs,Submit,Start,End,CPUTime -s running

There are many options to show and format the results: try man sacct for more information, or sacct -e to display the complete list of fields. For instance, the last command shows the time in HH:MM:SS format, but you can use CPUTimeRaw instead of CPUTime to get the time in seconds, which is useful if you want to perform further calculations. The sstat command also displays useful information, but it only works while jobs are running.
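
As an example of operating on CPUTimeRaw, the seconds consumed by each job can be added up and converted to hours. This is a sketch under stated assumptions: sum_cpu_hours is a hypothetical helper, and it expects lines of the form JobID|CPUTimeRaw, as produced by sacct with the -n (no header) and -P (fields separated by |) options.

```shell
# Hypothetical helper: add up CPU seconds and print the total in hours.
# Reads "JobID|CPUTimeRaw" lines on stdin (sacct -n -P output).
sum_cpu_hours() {
    # LC_ALL=C keeps the decimal separator as a point.
    LC_ALL=C awk -F'|' '{ total += $2 } END { printf "%.2f\n", total / 3600 }'
}

# Usage on the login node (shown as a comment, not executed here):
#   sacct -X -n -P -S 2018-03-01T00:00:00 -E 2018-03-31T23:59:59 -o JobID,CPUTimeRaw | sum_cpu_hours
```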

  • See the utilization and fairshare:
   [lapalma1]$ sshare