
Please note that all the SIEpedia's articles address specific issues or questions raised by IAC users, so they do not attempt to be rigorous or exhaustive, and may or may not be useful or applicable in different or more general contexts.

FAQs about HTCondor

General Information:

  1. What is Condor? How can Condor help me? Who can use it? Who can help me if I have any problems?
  2. How does Condor work? My machine is in Owner/Unclaimed/Claimed state, what does it mean?
  3. Sometimes Condor runs jobs on my computer when I am working on it, can I avoid that?
  4. I am using Condor, should I add an acknowledgement text in my publications?

Preparing and submitting your jobs:

  1. I have developed a program, do I need to make any modification to run it with Condor?
  2. How do I run my application with Condor? (submitting jobs to the queue)
  3. How do I check the Condor queue and my submitted jobs?
  4. Where should I put my input files so Condor will be able to find them?
  5. If Condor runs my program on different machines, how can I obtain my output files?
  6. Can I specify that Condor uses a different directory for input/output files?
  7. I would like to use a loop to send my jobs with different arguments... can I do that?
  8. Do I need to rename my input/output files to work with Condor?
  9. Should I delete temporary files?
  10. What are the technical specifications of machines running Condor? Can I restrict and/or prioritize those specifications in my jobs?
  11. How can I get/set environment variables when running with Condor (python and other programs may need them)?
  12. Can I change my jobs attributes after submitting them?

Having some troubles with your jobs:

  1. I cannot submit jobs, what is wrong?
  2. How can I check that my program is running fine? Can I access the inputs/outputs on the remote machine?
  3. My submitted jobs are always in Idle status, why do they never run (and what about users' priority)?
  4. Condor has problems with input and/or output files... are my paths wrong?
  5. My jobs are in on hold status and never finish, what does it mean?
  6. Condor is copying unwanted files to my machine, how can I avoid that?
  7. Some of my jobs are failing due to wrong inputs, can I fix that problem and then run again only those jobs that failed?
  8. Some of my jobs randomly fail (or I know they will fail in some specific machines)... how can I prevent that?
  9. I want to repeat (resubmit) ONLY some of my jobs, is that possible?
  10. I see that my jobs complete after running for N+X minutes/hours, when they only need N to finish. Is that normal?
  11. I have submitted jobs that need some hours to finish. They have been running for days and just a few have finished... what is going on?
  12. Some of my jobs that execute python programs work fine, but others fail...
  13. I receive an error when running HTCondor jobs that use python and matplotlib...
  14. I would like to get more information about the execution, is there an easy way to see the logs created by Condor?

Special needs:

  1. I am running many jobs, but some are more important than others, how can I prioritize them?
  2. I am receiving hundreds of emails from Condor, can I stop that?
  3. What happens with my IDL or Matlab jobs that require licences to run?
  4. I need to run some scripts before/after my executable, is that possible?
  5. Is it possible to limit the maximum number of concurrent running jobs?
  6. I need to do some complex operations in my submit file, is that possible?
  7. I would like to submit my jobs now, but they should run at a programmed time, can I do that?
  8. Jobs leave the queue after finishing. If something went wrong... could they be held or automatically re-executed instead?
  9. I want to do checkpoints of my normal programs (without using Condor) so I can restart them, is that possible?
  10. I have a fault tolerant application, can I save the state and restore it when executing with Condor?
  11. My jobs have some dependencies, is it possible to specify that?

More info:

  1. My question is not in this list or I need further information, where can I find it?



Responses:

Q1: What is Condor? How can Condor help me? Who can use it? Who could help me if I have any problems? ^ Top

A: Condor is software that can help you get your computational results in much less time. The underlying idea is to use idle computers to run your programs while their owners are not using them. When running your programs with Condor, you only need to specify the name and location of your program, where to find the inputs and where to place the outputs, and in most cases that is almost all; everything else will be done by Condor. IAC researchers with access to a linux desktop PC should be able to use Condor; the SIE will give you support if you have any issue with it. Please visit our introduction page for more general information about Condor.



Q2: How does Condor work? My machine is in Owner/Unclaimed/Claimed state, what does it mean? ^ Top

A: Condor has to deal with complex situations, but here we will just give some outlines of its basic operation. Condor uses several daemons to essentially manage a queue of submitted jobs and a pool of slots where jobs can be executed (usually each slot is a core of the machines in the pool). Jobs have several requirements (requested memory, disk space, etc.) and slots have different specifications: what Condor does is to match jobs with suitable slots, and then execute those jobs on them.

You can try the condor_status command to check the status of the pool of machines. The first column shows the names of the slots and machines, followed by some more info, like the Operating System, Architecture, System load, Memory, etc. Here we will focus on the State and Activity columns, which can be used to understand how Condor works:

STATE (Activity): situation and description

  • OWNER (Idle): the user is working on his/her machine. If a user is working on a machine, Condor will detect mouse/keyboard activity, active remote connections, etc. In that case, all slots of the machine get the Owner state and Condor will not use it to run any job. The activity shown by Condor will be Idle, since the slot is not doing any work for Condor; it does not mean the machine is idle, as most likely it is busy working for its owner. The Owner state can also be assigned in other situations, for instance when the system load is high (Condor will not run any job, to avoid interfering with the user's programs) or when there are time restrictions. When the user finishes working with the machine, Condor will still wait a grace period (by default, 15 minutes) after the last detected activity before using it.
  • UNCLAIMED (Idle): the slot is idle. If the machine has not been used by its owner for a while, Condor will run some benchmarks to update its performance information, and all slots will get the Unclaimed state and Idle activity. That means Condor is allowed to run jobs on the idle slots, and the job queue will be checked for suitable matches.
  • CLAIMED (Busy): the slot is running Condor jobs. If there is a positive match, Condor will begin to run the matched job on the slot. Condor will copy the executable and input files to the remote machine and run the program there. The slot(s) running Condor jobs get the Claimed state and Busy activity, and the jobs get the Running state.
  • CLAIMED (Suspended): a user begins to work on a machine that is running Condor jobs. When Condor is running jobs on a machine and any user activity is detected, Condor will immediately suspend all running jobs in all the slots. This operation may take some seconds (or a few minutes), depending on the job(s), the number of files involved, etc. The machine may become unresponsive during that time, but after a short while it should be ready again for the user. The machine then has the Claimed state and Suspended activity, and it will keep this state for a period of time (by default, 15 minutes). This is done to prevent killing jobs when there is no real activity (for instance, the cleaning service accidentally moved the mouse). If it was an isolated activity, the machine becomes idle again, Condor "wakes up" the jobs and continues running them from the last point, and the slots recover the Claimed state and Busy activity.
  • OWNER (Idle): the user keeps working on his/her machine. If there were suspended jobs on a machine and the user works on it again for a period of time (it was not an isolated activity), Condor will kill all suspended jobs and the machine will get the Owner state. All killed jobs go back to the queue with the Idle state, to be executed again when possible.

The states mentioned above are the most common and representative ones, but there are other possible states, like Matched (only shown for a few seconds when there is a successful match), Preempting (a job is being killed or vacated from the slot), Backfill (the slot is idle and the queue is empty, so it can run some low-priority jobs assigned by the administrators), etc. Visit the Useful commands page to get more information about commands in Condor.



Q3: Sometimes Condor runs jobs on my computer when I am working on it, can I avoid that? ^ Top

A: Condor should run jobs only on idle computers that are not being used by their owners. Idle computers are those where there has not been keyboard/mouse activity for more than 15 minutes, system load is low enough (to avoid interfering with owner's programs), there are no active remote ssh connections, there are no time restrictions, it has enough free space, etc.

If Condor is running job(s) on your computer when you begin to use it, Condor will detect your activity and immediately suspend all running jobs. That process is usually quite fast, and most users do not even notice it, but some jobs are heavy and complex, and suspending them could take a while (from several seconds to a few minutes). If that happens, your machine could become unresponsive for a few moments; you just need to wait a bit and it will be ready soon (this is a normal process, sorry for the inconvenience).

In any case, performance problems can be caused by a wide range of situations, like an exceeded home or disk quota, heavy load (check that your browser is not consuming a lot of CPU time if you have a large number of open tabs), configuration problems, etc. Please use the df -h and quota -s commands to get information about your available space, and the htop command to find out which processes are using your CPU and memory; that may help a lot when diagnosing a low-performance problem.

If you want to check whether Condor has been executing jobs on your machine at any time, you can use the Stats Web we have developed: http://carlota:81/condor_stats/. There you can get some stats about which machines have been used by Condor, when and for how long, etc. Anyway, if you still think that you are experiencing any kind of problems related to Condor, just contact us and we will find a solution.



Q4: I am using Condor, should I add an acknowledgement text in my publications? ^ Top

A: Yes, you should mention it in the acknowledgments of your papers or any other publications where you have used HTCondor. Although there is no standard format, we suggest the following:

"This paper made use of the IAC Supercomputing facility HTCondor (http://research.cs.wisc.edu/htcondor/)".

If you have used any other IAC Supercomputing facilities (LaPalma, TeideHPC, etc.), please also add them to the acknowledgments:

LaPalma: "The author thankfully acknowledges the technical expertise and assistance provided by the Spanish Supercomputing Network (Red Española de Supercomputación), as well as the computer resources used: the LaPalma Supercomputer, located at the Instituto de Astrofísica de Canarias."

TeideHPC: "The author(s) wish to acknowledge the contribution of Teide High-Performance Computing facilities to the results of this research. TeideHPC facilities are provided by the Instituto Tecnológico y de Energías Renovables (ITER, SA). URL: http://teidehpc.iter.es/"



Q5: I have developed a program, do I need to make any modification to run it with Condor? ^ Top

A: For a basic execution in Condor, you do not need to compile your program with any special library or add calls to external functions to be executed by Condor, the program runs as is. According to our experience, in most cases you will not need to change anything in your program, or only a few minor modifications may be required:

  1. Your program should accept arguments, since changing the arguments is the way to specify different jobs with the same executable. For instance, if your program reads a file to make the same computations with the values of each line, you can modify it to accept the number of the line as an argument, and then Condor will launch a different job for each line. Arguments can also be data values, paths to files, or whatever your application uses as input.
  2. Paths to your input/output files may change when executing with Condor, so you should be able to change them in your application if needed.
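
As a minimal sketch of point 1 (assuming a hypothetical executable process_line that takes a line number as its only argument), the submit file turns the loop over lines into a loop over jobs:

 # Hypothetical sketch: one job per line of the input file
 should_transfer_files = YES
 executable            = process_line
 arguments             = "$(Process)"
 queue 100

Here $(Process) takes the values 0 to 99, so each job processes a different line.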



Q6: How do I run my application with Condor? (submitting jobs to the queue) ^ Top

A: All the information needed by Condor to run your program should be written in a Condor submit file. You must include in that file one (and only one) executable command to specify what your program is (the path can be either absolute or relative to the directory where the submission is done). Additionally, if your executable is not accessible from other machines, use the should_transfer_files = YES command and Condor will copy it to the remote machines. With the arguments command you can specify your parameters (they can be either fixed values or depend on a counter), and then use the queue <N> command to launch N jobs. You can repeat the arguments and queue commands as many times as needed. A very basic submit file could be the following one, which assumes your application is located in the same directory where you will run the condor_submit command. Please visit the Condor submit file page for more info and examples.

  # Condor submit file
  # Running myprogram with arguments "-v 235" and "-kf 'cgs' -v 6543"
  universe = vanilla
  should_transfer_files = YES

  executable = myprogram

  arguments  = "-v 235"
  queue 

  arguments  = "-kf 'cgs' -v 6543"
  queue 



Once the submit file is ready, you can submit your jobs to the Condor queue using the following command in your shell console:

  condor_submit submit_file

To check your jobs, use the following command:

  condor_q

Visit the Useful commands page to get more information about commands in Condor.

Caution!: Before submitting your real jobs, always do some simple tests to make sure that both your submit file and your program work properly: if you are going to submit hundreds of jobs and each job takes several hours to finish, first try with just a few jobs, changing the input data so they finish in minutes. Then check the results to see if everything went fine before submitting the actual jobs. Bear in mind that submitting untested files and/or jobs may waste time and resources if they fail, and your priority will also be worse in subsequent submissions.



Q7: How do I check the Condor queue and my submitted jobs? ^ Top

A: You can check the general status of the queue using condor_status: you will see how many slots are being used by their owners (their state will be Owner), how many are free to be used by Condor (Unclaimed state) and how many are already executing Condor jobs (Claimed state); these states are explained in this FAQ. If you use condor_status -submitters, you will get a summary of who has jobs in the queue and their status; there are many other useful commands and options, please check them. To see some graphs and stats about Condor, you can visit nectarino (there you can also find information about the Condor queue and machine states) and also http://carlota:81/condor_stats/.

If you want to check only your submitted jobs, then use condor_q. It will show the info related to your jobs, like the cluster and process ID, owner, submission date, time they have been running, state, priority, size, command, etc. For instance, the following lines show a possible output of this command:

  [...]$ condor_q

  ID      OWNER     SUBMITTED    RUN_TIME    ST  PRI SIZE   CMD             
  418.0   jsmith    3/13 17:00   0+00:37:32  I   0   317.4  myprogram -c 7
  418.1   jsmith    3/13 17:00   0+00:30:25  <   0   488.3  myprogram -c 14
  418.2   jsmith    3/13 17:00   0+01:12:10  R   0   231.4  myprogram -c 21
  418.3   jsmith    3/13 17:00   0+02:15:52  S   0   423.5  myprogram -c 62
  418.4   jsmith    3/13 17:00   0+06:31:34  >   0   623.1  myprogram -c 28
  418.5   jsmith    3/13 17:00   0+03:41:52  H   0   432.6  myprogram -c 35

The first value is the Job ID; it is composed of two numbers. The first one is the Cluster ID that identifies the submission: all jobs submitted with the same submit file will share this Cluster ID (in this example the Cluster ID is 418). The second number is the Process ID, a consecutive number from 0 for the first job to N-1 for the last one when N jobs are submitted. To understand what is happening to your jobs, check the State column (ST); the common values are:

  • I: idle job, waiting for a slot to be executed on (it can take a while before your jobs are executed, but if they are always in this state, check this FAQ)
  • <: your job is about to be executed, executable and input files are being transferred to the remote machine
  • R: running, your job is being executed at this moment
  • S: suspended, the machine that was running this job is being used by its owner; the job is suspended until the machine becomes idle again
  • >: execution is finished, output files are being transferred
  • H: on hold, there are problems with your job that have to be solved (check this FAQ)
  • <q or >q: if you see those symbols, your transfers are waiting for the completion of other active transfers. This is done to avoid an excessive use of the available bandwidth.

Once your jobs are finished, they will leave the queue, so they will not be listed by condor_q (use the condor_history command instead). There are other states that normally do not appear in basic executions, like C (completed) or X (removed). If you have a special need and want your jobs to stay in the queue after completion with these or other states, you can force that by using some commands in your submit files (check this FAQ or this one).
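
For instance, a hedged sketch using the leave_in_queue submit command (the expression is just one possible example): completed jobs (JobStatus == 4) would stay listed in condor_q for an hour after finishing:

 # Example only: keep completed jobs visible for 3600 seconds after completion
 leave_in_queue = (JobStatus == 4) && ((time() - EnteredCurrentStatus) < 3600)

Remember to remove such jobs with condor_rm if you no longer need them in the queue.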



Q8: Where should I put my input files so that Condor will be able to find them? ^ Top

A: If you are using stdin as input (i.e. you directly specify your input data using the keyboard, or you run your program using ./myprogram < /path/to/input_file.txt), then you should use the input command to specify the file that contains the input data (you can use either an absolute path or a relative one to the submitting directory):

  input = /path/to/input_file.txt



If your program needs to read some input files, they have to be transferred to all the remote machines on which your application will be executed, so your program will be able to find them. You do not need to deal with copying files, Condor will do all the work; the only thing you need to do is to use transfer_input_files to specify the name and location of your files. For instance, suppose that your executable myprogram needs two input files as arguments: data1.in (located in /home/myuser/mydata) and data2.in (located in the same directory where you will do the submission). Then, use the following commands:

  ...
  should_transfer_files   = YES
  ...
  transfer_input_files = /home/myuser/mydata/data1.in, data2.in
  executable = myprogram
  arguments = "data1.in data2.in"
  ...

Although those input files are in different locations on your machine, Condor will copy them to the same directory where the executable will be placed on the remote machine, that is why we have used no paths when specifying files in the arguments command. You can also use transfer_input_files to copy directories (if you add a / at the end of the directory, then Condor will copy the content of the directory, but it will not create the directory itself). There are many possibilities when working with input and output files. Please, visit Condor submit file page where there is an example that explains how to work with files, step by step.

If you have a huge amount of input files and/or they are very big (GB or so), there is another solution to avoid a copy process that could take a long time. In these situations, you can place your files in a shared location (like /net/yourmachine/scratch) so all machines can directly access the files without copying them. But that is not recommended at all, since an intensive use of the shared network file system can produce blocking accesses and possibly a significant slowdown of your machine's performance and others'. Files should always be copied to the remote machines so jobs work locally. Only when you are dealing with really huge files might it be better to use shared locations, but then you should limit the number of concurrent running jobs to avoid stressing the network. Please contact us before submitting your jobs if you have doubts about this.
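
As a hedged sketch of limiting concurrency in that scenario, HTCondor provides the concurrency_limits submit command: each job declares the named limits it consumes, and the pool configuration (set by the administrators) caps the total. The limit name below is hypothetical and would have to be agreed with us:

 # Each job consumes one unit of the (hypothetical) NAS_ACCESS limit;
 # the pool-wide cap for that limit is defined in the pool configuration.
 concurrency_limits = NAS_ACCESS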



Q9: If Condor runs my program on different machines, how can I get my output files? ^ Top

A: Condor will copy your output files back to your machine after the execution is finished, you only need to use some commands to specify those files and Condor will do everything else.

If your output is written to stdout (printed on the screen), then you have to use the output command in your submit file to specify a file where Condor will write the output of each job. Obviously, the filenames have to be different, or all jobs will write to the same file and it will not be valid. To avoid that, you can use the ID of each job to write to distinct files. This ID is composed of two numbers (X.Y): the first one is the cluster ID (it changes every time you do a submission) and the second one is the process ID (it goes from 0 to N-1, where N is the number of queued jobs). You should also indicate a file where Condor will write the errors (those in stderr) and a log file. Therefore, all your submit files should include the following commands (note that ID and FNAME are not commands, but macros we have defined to make it clearer):

 ID     = $(Cluster).$(Process)
 FNAME  = filename

 output = $(FNAME).$(ID).out
 error  = $(FNAME).$(ID).err
 log    = $(FNAME).$(Cluster).log


If your program also generates output files, in most cases you do not need to use any extra command, since after the execution Condor will copy back to your machine all files created or modified by your job that are located in the same directory where your application was executed. You only need to check your submit file to make sure that the file transfer mechanism is active, with the following commands:

 should_transfer_files   = YES
 when_to_transfer_output = ON_EXIT


But sometimes we want Condor to transfer only some specific files (and avoid transferring useless files, like temporary ones), or we also want to transfer whole directories or specific files placed inside some sub-directories. In those situations you should use the transfer_output_files command to specify which files or directories(*) you want Condor to copy back to your machine (paths should be relative to the executable):

 transfer_output_files   = data$(Process).out, dir$(Process)_out, dir_outputs/ 

(*): if your directory ends with a slash /, Condor will copy the content of the directory, but it will not create the directory itself

Of course, output files should have distinct names (if you use the same name, files will be overwritten when copying them to your machine). If your application always uses the same name for output files, you can use the transfer_output_remaps command to change their names at destination (it only works with files, not with directories). For instance, suppose that your application creates an output file named data.out and you want to use distinct names to avoid overwriting those files; then you could use the $(Process) macro to include the process ID of the job and generate different names (data0.out, data1.out, data2.out, ...):

 transfer_output_files   = data.out
 transfer_output_remaps  = "data.out=data$(Process).out"


If you only want to get the output written to the screen (using the output command), but no other generated or modified files, you can use the should_transfer_files = NO command. But this command will also affect your input files. If you want to copy the input files, but NOT the output files, you should use:

 should_transfer_files  = YES
 +TransferOutput        = ""


Bear in mind that the transfer_output_files command is not used to specify where you would like Condor to place the output files on your machine (you can use the initialdir command for that, check this FAQ), but where the output files will be located on the remote machine so Condor can find them (paths to your output files must be relative to the directory where your program will be run). There are many possibilities when working with input and output files. Please visit the Condor submit file page, where there is an example that explains how to work with files, step by step.

Read this when dealing with huge input/output files: If your program generates a huge amount of output files and/or they are very big (GB or so), there is another solution to avoid a copy process that could take a long time. In these situations, you can prepare your program to write the output files directly to a shared location (like /net/yourmachine/scratch). But that is not recommended at all, since an intensive use of the shared network file system can produce blocking accesses and possibly a significant slowdown of your machine's performance and others'. Files should always be copied back from the remote machines so jobs work locally. Only when you are dealing with really huge files might it be better to use shared locations, but then you should limit the number of concurrent running jobs to avoid stressing the network. Please contact us before submitting your jobs if you have doubts about this.



Q10: Can I tell Condor to use a different directory for input/output files? ^ Top

A: Sometimes you are submitting your jobs from the directory where your executable is located, but your input/output files are placed in a different location. You could add the path to that location every time you have to specify a file, but it is much easier to use the initialdir command. For instance, if your input data is located in /home/myuser/mydata and you want your output data to be placed there as well, you can add this statement to your submit file:

 initialdir = /home/myuser/mydata

Bear in mind that it will affect both your input and output files, but it has no effect over the executable file.
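
Putting it together, a sketch with hypothetical file names (only the data files are resolved relative to initialdir):

 initialdir           = /home/myuser/mydata
 executable           = myprogram                 # still relative to the submission directory
 transfer_input_files = data$(Process).in         # looked up in /home/myuser/mydata
 output               = run.$(Cluster).$(Process).out  # written back to /home/myuser/mydata
 queue 10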



Q11: I would like to use a loop to send my jobs with different arguments... can I do that? ^ Top

A: Yes, a loop is the most natural way of submitting different jobs in Condor. Many users have created shell scripts to generate different submit files, one per set of arguments, but this is unnecessary in most cases and it is not recommended: the shell script can get quite complex; managing dozens, hundreds or even thousands of submit files is bothersome, as is managing all those independent jobs; and, even worse, efficiency will be reduced (every time you submit, Condor creates a cluster for that execution, which involves an overhead, so we should try to create only one cluster with N jobs rather than N clusters with only one job each, which also makes it easier to manage all the generated jobs).

The easiest way to work with loops is to use the predefined $(Process) macro in your submit file. Condor will assign the ID of each job to this macro, so if you are submitting N jobs, $(Process) will be 0 in the first job, 1 in the second one, and so on up to N-1 in the last job. This is the loop we need. For instance, the following simple submit file will use perl to calculate the cube of the first N numbers, creating one job per number (in this example we will use N = 50, beginning with 0):

 N = 50

 should_transfer_files   = YES
 when_to_transfer_output = ON_EXIT 

 output  = cube.$(Cluster).$(Process).out
 error   = cube.$(Cluster).$(Process).err                                                                                     
 log     = cube.$(Cluster).log  

 executable          = /bin/perl
 transfer_executable = False
 arguments           = "-e 'print $(Process)**3'"

 queue $(N)

As you can see, it is very easy to simulate a loop, we only need to use the predefined macro $(Process) to get the iteration value. We can use it in our arguments, inputs, outputs, name of files, etc... Since we have only submitted once, just one cluster will be created. Please, visit Condor submit file page to see more detailed examples. If you need a more complex loop including some arithmetic operations using the iteration value, then you can define your own macros using Condor syntax, see this example. Condor also has more predefined macros to generate random numbers, randomly choose one value among several of them, etc.



Q12: Do I need to rename my input/output files to work with Condor? ^ Top

A: Using filenames with a known pattern makes it much easier to specify the files to transfer in your Condor submit file. When possible, we recommend you use a common text and then an index to refer to your files, for instance: data0.in, data1.in, data2.in, data3.in, ..., data154.in. Then it will be very easy to specify each input file: you only need to add a command similar to the following one: transfer_input_files = data$(Process).in. This is the easiest situation, but this approach is also valid in more complex scenarios, like when the index depends on an expression and/or has some leading zeros, like 0001, 0002, 0003, ... (see this example). Also remember that you can run scripts (or other programs) before and after executing your main application (see this FAQ), so you could use this feature to rename your files as needed (for instance, using shell commands or scripts).
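
For the leading-zeros case, a sketch using the $INT() macro expansion (which accepts a printf-style format, as described in the HTCondor manual):

 PaddedID             = $INT(Process, %04d)
 transfer_input_files = data$(PaddedID).in   # data0000.in, data0001.in, ...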

But sometimes you cannot use a known pattern in your filenames, or you have a variable number of files to transfer, or maybe your program does not generate any output file under certain conditions... In those situations, it is much better to transfer directories rather than deal with individual files. Then, you only need to place your input and/or output files in directories, and specify that Condor has to transfer these directories and their content.

Remember that you can use transfer_input_files and transfer_output_files to specify which files and directories to transfer. Paths on the local machine can be absolute or relative to the directory where the submission is performed (or the one set using the initialdir command). Paths on remote machines should be relative to the directory where your program is placed and executed (be careful if you use absolute paths, they have to exist on every machine).
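
A short sketch of the trailing-slash behaviour, with hypothetical directory names:

 transfer_input_files  = inputs/    # trailing slash: copies the content of inputs/, not the directory itself
 transfer_output_files = results    # no trailing slash: the results directory itself is copied back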



Q13: Should I delete temporary files? ^ Top

A: No, that is not needed if you are only using local directories (those that belong to Condor). Condor will run your program on remote machines, and once the execution is finished and the output files are transferred, Condor will delete the files and directories related to that execution, so you do not need to delete any file. If you are using other locations, like external or shared file systems (/scratch, /home, /net/nas3, etc.), then you need to delete all unnecessary files yourself, since Condor will not check those directories.

Condor also performs periodic checks on every machine and will analyze all directories belonging to Condor in order to remove extraneous files and directories that may be left over from Condor processes that terminated abnormally due to either internal errors or a system crash.



Q14: What are the technical specifications of machines running Condor? Can I restrict and/or prioritize those specifications in my jobs? ^ Top

A: To see an overview of the hardware and software available when running with Condor, please visit the introduction page. You can also use the following commands to get information about slots:

 condor_status -server          #List attributes of slots, like memory, disk, load, flops, etc.
 condor_status -sort Memory     #Sort slots by Memory, you can try also with other attributes

If your application has hardware/software limitations, you can add restrictions or directly specify which machines you want it to run on. Typical limitations are the OS version (due to library dependencies), RAM, disk space, etc., but there are many more parameters. To apply your restrictions, use the requirements command and the conditions (you can use several operators, predefined functions, etc.) in your submit file.

To see the complete list of parameters, try the command condor_status -l <your_machine> and the values for each slot of your machine will be displayed. Most of the parameters shown in that list can be used to add requirements. You can also use other commands in your submit file, like request_memory, request_disk, etc.

If you want to express preferences for one or several parameters, use the rank command in your submit file (Condor will always give preference to higher values of the specified parameters). For instance, add the following lines to your submit file if you want to run your jobs only on slots with Fedora19 and more than 1GB of RAM, prioritizing the slots with the highest amount of RAM:

 rank           = Memory
 request_memory = 1024
 requirements   = (OpSysAndVer == "Fedora19")

Rank is evaluated as a floating-point expression, so you can weight several parameters in different ways. For instance, suppose we want to choose slots with more RAM, but slots with at least 15GB of disk are also important to us, so we will give them 100 extra points:

 rank         = Memory + (100 * (Disk >= 15120000))


Caution!: Be careful when choosing your restrictions: using them reduces the number of slots available for your jobs, so it will be more difficult to execute them. Also check that your restrictions can be satisfied by our current machines, or your jobs will stay in Idle status forever. Before adding a requirement, always check that there are enough slots that satisfy it. For instance, to see which slots have more than 1GB of RAM, try the following command in your shell (add the -avail flag to see only the available ones):

 [...]$ condor_status -constraint '((Memory > 1024) && (OpSysAndVer == "Fedora19"))'

Please visit the Condor submit file page for more info and examples.



Q15: How can I get/set environment variables when running with Condor (python and other programs may need them)? ^ Top

A: If you are running a python program, it is likely that it will need to access environment variables when importing some modules. Other programs and scripts also need to get or set environment variables to work properly. If you use the getenv = True command in your submit file, Condor will copy your current shell environment variables and they will be available when running your job (the copy is performed at submission time). If you need to declare a variable or change its value, you can use the environment command in the submit file, as in the following example:

 environment = "var1=val1 var2=""val2"" var3=val3"

If you use both commands, variables specified with the environment command will override those copied by getenv if they have the same name. Using $ENV(variable) allows access to environment variables in the submit file itself (for example, $ENV(HOME)).
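For instance, a submit file could use $ENV() to build paths from the submitting environment. This is only a sketch: the DATA_DIR variable and the condor_logs directory are hypothetical, and the directory must exist before submitting:

 environment = "DATA_DIR=$ENV(HOME)/data"
 log         = $ENV(HOME)/condor_logs/job_$(Cluster).$(Process).log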

Please also see this FAQ about python for more details on defining environment variables with python, and visit the Condor submit file page for more info and examples.



Q16: Can I change my jobs attributes after submitting them? ^ Top

A: Yes, most job attributes can be changed after submission (but not all of them: for example, you cannot change the owner, clusterId, procId, jobStatus, etc.). Of course, you can only change your own jobs.

To change attributes, use the condor_qedit command and specify the name of the attribute and its new value (new attributes can be defined, too). See the following examples (the attribute to be changed is given first, followed by its new value):

  [...]$ condor_qedit 1234 NiceUser TRUE                                # Enable NiceUser in all jobs belonging to cluster 1234 
  [...]$ condor_qedit 1234.6 Requirements '(UtsnameNodename != "arco")' # Job 1234.6 will not be executed on machine "arco"
  [...]$ condor_qedit -constraint 'JobStatus == 1' Environment '""'     # Clean environment variables in all idle jobs

Notes:

  • Remember to quote strings. For instance, to specify that attribute A has a value "foo" you should use: condor_qedit ... A '"foo"'.
  • Use condor_q with the -long option to get the full list of attributes of each job and their current values. Depending on which attributes you have changed/added, new values may only take effect after those attributes are re-evaluated, usually when jobs restart (so changing attributes of running jobs may not work until those jobs are stopped and executed again).
  • Be careful when changing attributes like Requirements, Environment, etc., since new values will replace the old ones (they will not be appended to the previous values, so you may need to get them before and add them to your expression).
  • As you can see in the examples above, you can select which jobs to edit by specifying their clusterId, clusterId.procId and/or a constraint (use your username to select all your jobs), with the same syntax used in condor_q (and other commands like condor_release, condor_hold, condor_rm, etc.). We recommend that you use condor_q to check the selection of jobs before editing, to avoid making unwanted changes to other jobs.



Q17: I cannot submit jobs, what is wrong? ^ Top

A: If you use condor_submit <submit_file> and your jobs do not appear in the list shown by condor_q, there might be errors in your submit file. If so, Condor should print those errors and some related information when you submit (use the -debug flag if you do not see that info). Most problems are easy to fix since they are related to wrong paths, lack of permissions, missing required commands, etc., but if you have no idea how to fix an error, please contact us.



Q18: How can I check that my program is running fine? Can I access the input/output files on the remote machine? ^ Top

A: There are several ways to check in real time what is happening on the remote machine while it executes your jobs, so you can see how results are being generated and whether they are fine or not. All these methods only work while processes are running. Remember that you can get the job_id using the condor_q command; make sure the job you choose is running, with state "R" (you can select running jobs using condor_q -run).

A) You can check what your program is printing to the "screen" (stdout and stderr) and/or to output files on the remote machine while it runs. To display outputs, use the condor_tail <job_id> command: it shows the latest lines of the specified output, like the linux command tail does (you can also add the -f option to keep showing new content). For instance, if you want to check the running job 123.45, just use the following commands:

 [...]$ condor_tail 123.45                       # Show stdout (normal output on screen)
 [...]$ condor_tail -f 123.45                    # Show stdout, it keeps showing new content
 [...]$ condor_tail -no-stdout -stderr 123.45    # Show stderr (errors on screen)
 [...]$ condor_tail -no-stdout 123.45 file.out   # Show output file file.out (*) 

(*) The output file must be listed in the transfer_output_files command in the submit file.



B) You can also establish SSH connections to the remote machines where your jobs are being executed, using the command condor_ssh_to_job <job_id> (again, make sure the job is running). Once the SSH connection is established, you will be placed in the directory where your program is running, so you can check input and output files to see whether the program is working properly (you should NOT modify any file, to avoid errors). To open an ssh connection you only need to specify the jobId; see this example for job 123.45:

 [...]$ condor_ssh_to_job 123.45

If you need to upload or download files, you can open an sftp connection using the flag -ssh sftp, or you can also use rsync; see the following examples:

 [...]$ condor_ssh_to_job -ssh sftp 123.45
 [...]$ rsync -v -e condor_ssh_to_job 123.45:<remote filename> <local directory>

Important: close the ssh connection when you are not using it. Jobs with open connections cannot leave the queue, so they will appear as "running" even if they are already done, waiting until you close the connection.



C) There is a third method, but it is not recommended: the directory where Condor executes jobs is usually located on the scratch disk, so in most cases it will be directly accessible; you only need to know the names of the machines running your jobs. Use condor_q -run to get those names and then access the working directory located in /net/<remote_machine>/scratch/condor/execute/dir_XXXXX (where XXXXX changes in every execution, but it should be easy to recognize thanks to the owner's name). Note that this is the default configuration: some machines are configured differently and/or you may not have permission to access those directories.



Q19: My submitted jobs are always in Idle status, why do they never run (and what about users' priority)? ^ Top

A: If your jobs are always in Idle status, it may be caused by several reasons, like restrictions that you have specified in the submit file, a low user priority, etc. With the condor_q command you can find out the reason: just choose one of your idle jobs and use the following commands:

  condor_q -analyze <job_id>
  condor_q -better-analyze <job_id>

Condor will then display detailed information about machines that rejected your job (because of your job's requirements, or because they are not idle but being used by their owners), machines that could run your job but are busy executing other users' jobs, machines available to run your job if any, etc. It will also display the reason why that job is idle, and some suggestions if your requirements cannot be met.

Check that information and make sure that your requirements can be satisfied by some of the current machines (pay attention to the suggestions, they may help a lot!). For instance, if you ask for slots with more than 6GB of RAM, there are just a few of them and they need to be idle to run Condor jobs, so you may need to wait a long while before running there (also check that there are no impossible values, like asking for machines with 16GB per slot: we have none of them). Before adding a requirement, always check that there are enough slots that satisfy it (for example, to see which slots have more than 6GB of RAM, try the following command in your shell: condor_status -constraint 'Memory > 6144'). Please visit the Condor submit file page for more info and examples.

You may also get messages like Reason for last match failure: insufficient priority. Bear in mind that Condor executes jobs according to users' priority, so that message means that Condor is currently executing jobs submitted by users with a better priority than yours, and you will have to wait a bit longer. You can check your own and other users' priority by running condor_userprio -all -allusers: all users begin with a priority value of 0.5, the best one possible, and once you begin to run jobs with Condor, your priority value will increase (which means a worse priority) according to the number of machines you are using and the time consumed (the more you use Condor's resources, the faster your priority value increases). On the other hand, your priority value gradually decreases while you are not using Condor.

If your Condor priority is important to you and you want to run some non-urgent jobs, you can submit them with the nice_user = True command in your submit file: those jobs will be run by another user called nice_user.<your_user> and they will not affect your real user's priority. But this new user has an extremely low priority, so its jobs may stay in the queue for a long while before being executed (although they can run very quickly if the Condor queue is almost empty).

Besides the user's priority, every job also has its own priority, and you can change it to specify that some jobs are more important than others and should be executed first (please check this FAQ).



Q20: Condor has problems with input and/or output files... are my paths wrong? ^ Top

A: If Condor is not able to find your input files, your jobs will probably get the "on hold" status (see this FAQ). You do not need to place your input files in any special location, but you do need to specify the path to each file (absolute, or relative to the directory where you do the submission) and make sure that the path is correct and that you have access permissions. Check this FAQ to see which commands you can use to specify the input files.

On the other hand, you also have to specify the output files that your program will generate so Condor can copy them from the remote machines to your computer. Check this FAQ to see which commands you can use for this purpose.



Q21: My jobs are in 'on hold' status and never finish, what does it mean? ^ Top

A: When there is an error, jobs change their state to "on hold". It means that Condor is expecting an action from the user before continuing with those jobs. Most of the time, jobs are held for reasons related to permissions or missing files. A common problem is specifying files that cannot be accessed from other machines, like those in your home or desktop directories (use Condor commands to copy files instead), or a destination directory for output files that does not exist or is not reachable, etc. You can list all your held jobs and the reason they were held by running the following command in your shell: condor_q -hold. Once you have fixed the problems, run the command condor_release -all and Condor will check all held jobs again and change their status accordingly.



Q22: Condor is copying unwanted files to my machine, how can I avoid that? ^ Top

A: By default, Condor will copy back all files generated or modified by your application that are located in the same directory where your program was executed on the remote machine, which could include unwanted content like temporary files, etc. If you want to avoid that, you can use the transfer_output_files command (see this FAQ) to specify which files and/or directories you want Condor to copy from the remote machine to your machine once your application has finished (Condor will then copy only those files and ignore all the rest).

If you only want to get the output printed to the screen (using the output command), but no other generated or modified file, you can use the should_transfer_files = NO command. That command deactivates the Condor transfer mechanism, affecting both your input and output files, so it can only be used when you have neither. If you want to copy input files, but NOT output files, then you should use the following commands:

 should_transfer_files  = YES
 +TransferOutput        = ""



Q23: Some of my jobs are failing due to wrong inputs, can I fix that problem and then run again only those jobs that failed? ^ Top

A: First of all, we strongly recommend you always perform some simple tests before submitting your actual jobs, in order to make sure that both your submit file and your program work properly: if you are going to submit hundreds of jobs and each job takes several hours to finish, first try with just a few jobs, changing the input data so that they finish in minutes. Then check the results to see if everything went fine, and if so, submit your real jobs. Bear in mind that submitting untested files and/or jobs may waste time and resources if they fail, and your priority will also be worse in subsequent submissions.

Sometimes we discover too late that there were problems, most often related to the executable and/or the input files. If some of the jobs ran correctly while others failed, we will try to fix the problems and execute again only those that failed, to avoid wasting time and resources re-executing jobs that worked fine. For the same reason, we should stop as soon as possible all running (or idle) jobs that are going to fail. Every submission is different and it is not possible to give general advice, but the following steps should help you (and you can always contact us to study your particular situation):

  1. Identify the jobs that failed: if your queue only contains failing jobs because all the correct ones have already finished, then it will be very easy to manage them. But most of the time you will have a mixture of jobs in your queue: correct ones that are running, incorrect ones also running, some held, others idle so we do not yet know whether they are correct or not, etc. The first thing to do is to find an expression that identifies all the failing jobs. Usually when jobs fail there is a way to recognize them: for example, they have been executing for a very long time (many hours when they only need a few minutes to finish), or for a very short time, or the exit code is not the expected one, etc. Use the condor_q command with the -constraint option and an expression to list them; here are some tips to find those jobs:
  • All held jobs (JobStatus == 5) from the cluster with ID 453 (ClusterID == 453) are incorrect. To list them we simply use the following command:
    condor_q -constraint '((JobStatus == 5) && (ClusterID == 453))'
  • Our jobs need less than 10 minutes to finish, so those that have been running (JobStatus == 2) for more than 30 minutes ((CurrentTime - JobStartDate) > 1800) are incorrect. We can list them with the following command:
    condor_q -constraint '((JobStatus == 2) && ((CurrentTime - JobStartDate) > 1800))'
  • All idle jobs (JobStatus == 1) that have already accumulated more than 2 hours of run time (CumulativeSlotTime > 7200) are wrong:
    condor_q -constraint '((JobStatus == 1) && (CumulativeSlotTime > 7200))'
Note the difference between cumulative time (the sum of the time consumed in the different executions, if the job has been evicted) and the time consumed in the present execution. To see the consumed time of all running jobs you can use condor_q -run -currentrun (or use -cputime to see the real CPU time consumed without being suspended), and you can also use condor_ssh_to_job to connect and see what is happening (check this FAQ).
  • Jobs have many other attributes that can be used in the constraints: just choose an incorrect job (for example, job with ID XXX.YYY) and run condor_q -long XXX.YYY to get all its attributes. Then try to find which attributes can be used to identify all the wrong jobs. All valid JobStatus values can be displayed using the shell command: condor_q -help status
  2. Stop all failing jobs and run them again with correct data: once you have all your failing jobs listed and the problem with the input is fixed, we will try to stop those wrong jobs and re-execute them with the right data. There are two situations, depending on how you solved the problem:
  • Situation A: to solve the problem you only needed to correct the executable and/or the input files, and the submit file was NOT changed. This is the easiest situation: make sure that the new input files are in the same location as before and have exactly the same names. Then you only need to hold all the wrong jobs and release them again, so the new executable and input files will be copied. To do it, use the same expression you used before to list the jobs, but with the condor_hold command:
    condor_q -constraint '(XXX)'      List all failing jobs, XXX is the expression to identify them
    condor_hold -constraint '(XXX)'   Hold all failing jobs
    condor_release -all               Execute again all held jobs (we assume that all held jobs are those failing,
                                      if not, just find and use a -constraint expression)
And that should be all: the released jobs will now use the correct input files, so the executions should go fine.
  • Situation B: to solve the problem you need to change the submit file. Sometimes we cannot avoid changing the submit file, because we have to modify the commands to add or remove input files, change the arguments, etc. In those situations, holding and releasing the failing jobs will not work, because the submit file is only processed at submission time. Then we need to use condor_qedit (see this FAQ) to change the values of the attributes specified in the submit file; or you can also remove the wrong jobs and submit them again. For the latter option, simply follow the next steps (we are assuming here that all jobs belong to the same cluster; if you have done several submissions, you will have to repeat these steps several times):
    1. Get the list of Process IDs of all failing jobs (use the same expression (XXX) that you obtained in the first step):
    condor_q -constraint '(XXX)' -format "%d," ProcID
For example, assume that the output of the last command is 0,4,67,89,245,
    2. Remove all those failing jobs:
    condor_rm -constraint '(XXX)'
    3. Change your submit file as needed and add the following command to execute only the failing jobs:
    noop_job = !stringListMember("$(Process)","0,4,67,89,245")
Important!! When re-submitting, the output, log and error files of ALL jobs (even the correct ones) may be overwritten, so save the old ones if they are important.
    4. Submit again:
    condor_submit your_submit_file
    5. Jobs that are not in the noop_job list will not be executed, but they may stay in the queue with Complete status; use the following command to remove them from the queue (read this FAQ for more info):
    condor_rm -constraint 'JobStatus == 4'

As you can see, the steps to follow strongly depend on each particular problem, so it might be easier if you just come to our office.



Q24: Some of my jobs randomly fail (or I know they will fail in some specific machines)... how can I prevent that? ^ Top

A: If you see that some of your jobs fail for no apparent reason, but run properly when resubmitted, the problem might not be in your program, but on the machine(s) where they were executed (for example, an application or library that your program uses is not installed on those machines, or its version is too old/new, or it is misconfigured, etc.). To detect this, simply check the machine where the failing job was executed, which is written in your condor log file, although it is easier to check with the condor_history command. For instance, to check where job XXX.YYY was run, launch the following command on the same machine where you did the submission:

 [...]$ condor_history XXX.YYY -af LastRemoteHost

Maybe some of your jobs finished with no problems, but others finished abnormally soon. You can use condor_history to get a list of those jobs. For instance, suppose you have submitted some jobs with clusterId=XXX and each job needs at least 30 minutes to finish properly, so you are sure that those that lasted less than 10 minutes (600 seconds) failed. Then you can use the following commands to find those jobs (the first command gives you the list of jobs that failed, and the second one shows, for each of them, its procId and the machine where it was executed):

 [...]$ condor_history -constraint '((ClusterId==XXX) && ((CompletionDate-JobStartDate) < 600))'
 [...]$ condor_history -constraint '((ClusterId==XXX) && ((CompletionDate-JobStartDate) < 600))' -af ProcId LastRemoteHost


Most of the time these problems are solved simply by forcing the failing jobs to go back into the queue after an unsuccessful execution so they are re-executed (see the last paragraph). If you see that all failing jobs were executed on the same machine(s), or you already know that your application cannot run on some machines, then you can force Condor to avoid sending your jobs to those machines. For instance, suppose that your jobs have problems on machines named agora, lapiz and madera and you want to avoid them. Then add either of the following lines (both are equivalent) to your Condor submit file (if you had previous requirements, append the new ones to them):

 requirements = ((UtsnameNodename =!= "agora") && (UtsnameNodename =!= "lapiz") && (UtsnameNodename =!= "madera"))
 requirements = !stringListMember(UtsnameNodename, "agora,lapiz,madera")

You can also block all machines whose names match a pattern. For instance, to avoid executing your jobs on machines with names beginning with "a", "l" or "m", add the next lines (you can specify more complex patterns using the predefined functions and macros):

 letter       = substr(toLower(Target.Machine),0,1)
 requirements = !stringListMember($(letter), "a,l,m")

In the opposite situation, if your application can ONLY run on those machines, you just need to negate the previous expressions (or remove the negation):

 requirements = ((UtsnameNodename == "agora") || (UtsnameNodename == "lapiz") || (UtsnameNodename == "madera"))  
 requirements = stringListMember(UtsnameNodename, "agora,lapiz,madera")
 ...
 letter       = substr(toLower(Target.Machine),0,1)
 requirements = stringListMember($(letter), "a,l,m")


Then you should execute again only those jobs that failed (check this FAQ to see how); please do not execute all your jobs again, to avoid wasting time and resources. If your program could fail and never end (for example, for some sets of data it never converges), you can use utilities like the linux command timeout to limit its running time. Failing machines can cause a problem called a black hole, which could make most of your jobs fail. Please visit the Condor submit file section for more info and examples of how to avoid that. In that section we also describe some Condor commands that you can add to your submit file to deal with failing machines, like on_exit_hold and on_exit_remove. For instance, using these commands you can specify that any job finishing with an invalid exit code and/or before X minutes has to be held, or sent back to the queue to be re-executed, respectively. Some examples (before using these commands, make sure that the problem is on the remote machines and not in your code, in order to avoid re-executing failing jobs):

 # Hold jobs if they finished in less than 10 minutes. Later we can check what was wrong with those jobs and re-execute them
 # using condor_release (we can also use periodic_release to automatically release held jobs every X minutes)
 on_exit_hold = ((CurrentTime - JobStartDate) < (10 * 60))

 # Remove from the queue only those jobs that ran for more than 10 minutes. If a job finished before that,
 # it will be sent back to the queue with 'Idle' status to be re-executed (most probably on a different machine)
 on_exit_remove = ((CurrentTime - JobStartDate) > (10 * 60))
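The timeout utility mentioned above can be tried directly in the shell; in this sketch (not Condor-specific), sleep 5 stands in for a hanging program:

```shell
# 'timeout' runs a command with a time limit; if the limit is reached,
# the command is killed and timeout exits with status 124
rc=0
timeout 2 sleep 5 || rc=$?
echo "exit code: $rc"    # prints: exit code: 124
```

In a submit file you would typically make timeout the executable, for example executable = /usr/bin/timeout with arguments = 600 ./my_program (my_program is a placeholder for your real program).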



Q25: I want to repeat (resubmit) ONLY some of my jobs, is that possible? ^ Top

A: If you submit a large number of jobs and for some reason a few of them fail and leave the queue, you should not waste time and resources running all of them again; just retry the ones that failed (after solving the problems they had). Unfortunately, there is no condor_resubmit command to easily resubmit jobs that have already left the queue. You could try to obtain the ClassAd of those jobs using condor_history -l <job.id>, but Condor will not accept it as input to condor_submit.

If there are just a few jobs to resubmit, you could add pairs of arguments and queue commands to execute only those jobs, but there is an easier way, using the noop_job command. For instance, suppose you want to repeat the jobs with Process IDs 0, 4, 9, 14 and those from 24 to 32. Then add the following line to your submit file and submit it again:

  noop_job = !( stringListMember("$(Process)","0,4,9,14") || (($(Process) >= 24) && ($(Process) <= 32)) )

Condor will not run jobs for which that expression is True, so only the jobs in the list will be executed. Note that we have added an exclamation mark (!) before the expression to negate it: noop_job = !(...). When using noop_job, Condor will still create output and error files for all jobs, but they will be empty for the jobs that are not executed (be careful that new executions do not overwrite the output files of previous ones).

Jobs that are not executed may stay in the queue with Complete status (when using condor_q you will see C in the ST column). To remove all Complete jobs from the queue, try the first of the following commands in your shell (use the second one to remove only the Complete jobs that belong to cluster XXX):

  condor_rm -constraint 'JobStatus == 4'
  condor_rm -constraint 'JobStatus == 4 && clusterID == XXX'



Q26: I see that my jobs complete after running for N+X minutes/hours, when they only need N to finish. Is that normal? ^ Top

A: Yes, it is normal. Bear in mind that a Condor job can only execute on a machine while that machine is not being used by its owner. If Condor detects any user activity on a machine while executing jobs, they will be suspended or moved to other machines, increasing the consumed time (and that may happen several times, so the extra time could be quite long).

Condor has several ways to show the time that jobs have been running. If you use condor_q, the time shown is the cumulative one by default (the result of adding up the time consumed in all executions), so it could be really high if the job has been killed and restarted several times. If you use the -currentrun option, Condor will only display the time consumed in the current execution, which is a more realistic figure (although if the job has been suspended, that time is also included). You can also use the -cputime option to get only the CPU time (but if the job is currently running, the time accumulated during the current run is not shown).

If your jobs finish in a reasonable amount of time, everything is fine. If they never finish, or need an excessive amount of time to complete, please read this FAQ.



Q27: I have submitted jobs that need some hours to finish. They have been running for days and just a few have finished... what is happening? ^ Top

A: First of all, check that your program is running properly. Maybe there are problems with the data, input files, etc. You can open a shell and check the running job using the condor_ssh_to_job command (see this FAQ). If you discover that there are problems with your job and it will not produce valid results, you should stop it as soon as possible to avoid wasting more time and resources; see this FAQ for more details. If your job is working fine, it may have been killed and restarted several times. Condor shows the cumulative running time by default; you can see the time consumed by the present execution using the condor_q -run -currentrun command.

The reason why Condor kills and restarts jobs is that it has several runtime environments called universes. By default, all your jobs go to the most basic (and simplest) one, called the vanilla universe. In that universe, when Condor detects that a machine is no longer idle, it suspends all running jobs for a while, waiting for the machine to become idle again. If that does not happen within a given (short) time interval, Condor kills the jobs and sends them back to the queue with Idle status, so those jobs will start again from the beginning. If your jobs need some hours to finish, probably some of them will be killed before completion and restarted on other machines, and that can happen several times.

To avoid that, you can use the standard universe, where checkpoints are created so that killed jobs can continue on other machines starting from the last checkpoint. But the standard universe is a little more complex: you have to compile your application using condor_compile together with your normal compiler, and not all compilers are supported... Please check the official documentation and/or contact us if you need this feature.

However, most of the time we can solve this problem simply by changing the arguments of our jobs. For instance, suppose you have to process 10,000 inputs and each input needs about 2 minutes. You could create 100 jobs that process 100 inputs each, but they would need more than 3 hours to finish and would likely be killed several times. It is better to choose faster jobs that finish in about 15-30 minutes, so they have more chances of completing on the same machine without being killed and restarted elsewhere. If each job processes 10 inputs, you will have 1,000 jobs needing about 20 minutes each, which could be a good approach.
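The arithmetic above can be sketched in a few lines of python (the figures are the hypothetical ones from the example, not measured values):

```python
import math

# Rough sizing of Condor jobs: given the total number of inputs and the
# time one input takes, choose how many inputs per job fit a target runtime.
def inputs_per_job(minutes_per_input, target_minutes):
    """Number of inputs that fit in one job of roughly target_minutes."""
    return max(1, target_minutes // minutes_per_input)

total_inputs = 10000      # hypothetical figures from the example above
minutes_per_input = 2

per_job = inputs_per_job(minutes_per_input, target_minutes=20)
n_jobs = math.ceil(total_inputs / per_job)   # how many jobs to submit

print(per_job, n_jobs)    # 10 inputs per job -> 1000 jobs of ~20 minutes
```

The target of 15-30 minutes is a trade-off: long enough that per-job overhead is negligible, short enough that an evicted job loses little work.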



Q28: Some of my python programs work fine, but others fail... ^ Top

A: If you are executing python programs with HTCondor and some jobs work fine while others fail, you are most probably experiencing problems related to the Fedora version. Most of the old Linux desktop PCs have Fedora21 installed, but newer machines run a more recent version, mostly Fedora26 (although we also have a few with Fedora25). Paths to python libraries differ between the old and new machines, so your programs will only work properly on machines with the same Fedora version as the machine from which you submitted the jobs.

To fix this issue, you can force HTCondor to execute jobs only on machines with your same Fedora version, so your environment variables and paths will work. For instance, if you are working on a machine with Fedora21, add the next requirement to force all your jobs to be executed on machines running Fedora21:

  requirements =  (OpSysMajorVer == 21)

But adding that requirement limits the number of machines available to execute your program: if you only run on machines with Fedora21, you will miss all the newer and faster machines, and if you only run on machines with Fedora26, you will lose a large number of slots, since most machines still run Fedora21. We recommend you change your submit script a bit to be able to run your python programs on all machines, independently of their O.S. (we will only avoid the few machines whose names begin with 'f', since they have a special purpose and the python installation there is not the usual one):

  ... 
  transfer_input_files = your_program.py

  getenv       = True
  environment  = "PYTHONPATH=/usr/pkg/python/python2.7/lib/python2.7/site-packages"
  requirements = (!stringListMember(substr(toLower(Target.Machine),0,1), "f"))

  transfer_executable = False
  executable   = /usr/bin/env
  arguments    = python your_program.py

  queue ...

The example above is just a basic one; you might need to adapt it by adding other commands to transfer your input/output files, add requirements, etc., and, of course, all the common commands (see the common template). Contact us if you have any doubts.



Q29: I receive an error when running HTCondor jobs that use python and matplotlib... ^ Top

A: If you are running some python jobs that use matplotlib (for example, to make some plots and save them to png images) and receive errors like:

  • no display name and no $DISPLAY environment variable
  • : cannot connect to X server :0

it is probably because matplotlib (and/or some other packages) needs a DISPLAY environment variable, which means it has to run under an X server, and that is not available when running on HTCondor. In this case, simply use another backend that does not need an X server, like Agg. For instance, you can adapt the next python code when using matplotlib:

  import matplotlib as mpl
  mpl.use('Agg')
  import matplotlib.pyplot as plt

  # Now use plt as usual
  ...
  fig = plt.figure()
  ...
  fig.savefig('image.png')
  ...

You can find more info about this issue here.



Q30: I would like to get more information about the execution, is there an easy way to see the logs created by Condor? ^ Top

A: Yes, there are several possibilities. The first step is to create the condor log file by adding the next command to your submit file:

  log = file.log      #(we recommend you use your_executable_name.$(Cluster).log as name for your log file)

Once you have your condor log file, you can display the information using the following options:

  1. Directly check the content of the condor log file with any text editor (not recommended)
  2. Use condor_userlog <file.log> to get a summary of the execution.
  3. Run condor_history -userlog <file.log> command in your shell to list basic information contained in the log file.
  4. Use condor_logview <file.log> to open the Condor log viewer and see more detailed information in graphical mode, showing the timeline of your jobs and allowing you to perform zooms, filter jobs, etc.
  5. There is also an online tool to analyze your log files and get more information: Condor Log Analyzer (http://condorlog.cse.nd.edu/).

If you just want some general information about the Condor queue, the pool of machines, where jobs have been executed, etc., you can also try our online Condor stats: http://carlota:81/condor_stats/ and nectarino.

Q31: I am running many jobs, but some are more important than others, how can I prioritize them? ^ Top

A: You can prioritize your jobs (and only your jobs, not other users' jobs) using the priority = <value> command in your submit files (the higher the value, the higher the priority). Once you have submitted your jobs, you can check or modify their priority by running condor_prio in a console. Please check the Condor submit file page for more examples, and also this FAQ for more info about users' priorities.
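A minimal sketch (the values are illustrative): in the submit file,

```
  # Give these jobs higher priority than my other jobs (default is 0)
  priority = 10
```

and, for a job already in the queue, something like condor_prio -p 15 <cluster>.<proc> from a console raises its priority. Remember this only reorders your own jobs, not those of other users.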



Q32: I am receiving hundreds of emails from Condor, can I stop that? ^ Top

A: Yes, by default Condor sends an email notifying you of any event related to each job (termination, errors, etc.). If you launch 1000 jobs, that can be really annoying. To avoid it, use the next command in your submit file: notification = Never (use Complete if you only want to know when jobs finish, Error when they fail, or Always to receive all notifications; we recommend Error). You can also change the email address using notify_user = <email>.

Please, visit Condor submit file page for more info and examples.



Q33: What happens with my IDL or Matlab jobs that require licences to run? ^ Top

A: There is a limited number of IDL licences, so if you try to run a large number of IDL jobs, some could fail because there may not be enough licences. But the IDL Virtual Machine does not consume any licence, so there is no limit on the number of simultaneously running IDL jobs other than the number of available slots. See detailed information here.

There is a similar limitation with Matlab licences, which can be avoided if you can create Matlab executables using the Matlab Compiler. There is more info about this topic here.



Q34: I need to run some commands or scripts before/after my executable, is that possible? ^ Top

A: Yes, it is possible by adding the +PreCmd and +PostCmd commands to your submit file, respectively. Running scripts before/after jobs can be useful if you need to perform some operations on your input or output files, like renaming them or moving or copying them to other locations. You can also use these commands for debugging purposes, like running the shell command tree to check where your input/output files are placed:

 +PreCmd        = "preScript.sh"
 +PreArguments  = "-iv"
 +PostCmd       = "tree"
 +PostArguments = "-o tree.out"

 should_transfer_files = YES
 transfer_input_files  = preScript.sh, /usr/bin/tree

Generally, you also have to add or update the transfer_input_files command to include your scripts in the list of files to be copied to the remote machines (make sure the command should_transfer_files = YES is present, too). These commands are intended to be used with users' scripts. If you want to run shell commands (like tree in the example), you have to transfer that command as well (use which <cmd> to find its location).



Q35: Is it possible to limit the maximum number of concurrent running jobs? ^ Top

A: There are some situations where it can be useful to limit the number of jobs that run concurrently. For instance, when your application needs licences and only a few are available, or when your jobs access a shared resource (like directly reading/writing files located at /scratch: too many concurrent accesses could produce locks and a considerable slowdown in your and other users' computers). Please visit the Condor submit file page for details and examples of how to add these limits.
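As a sketch, the submit-file command involved is concurrency_limits; the limit name and its maximum are configured by the administrators, so "matlab" here is just an assumed example:

```
  # Count each of these jobs against the (hypothetical) "matlab" limit;
  # HTCondor will not run more jobs than the configured maximum allows.
  concurrency_limits = matlab
```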



Q36: I need to do some complex operations in my submit file, is that possible? ^ Top

A: Yes, Condor has some predefined functions and some special macros that you can use in your submit file:

  • evaluate expressions: eval(), ...
  • flow control: ifThenElse(), ...
  • manipulate strings : size(), strcat(), substr(), strcmp(), ...
  • manipulate lists: stringListSize(), stringListSum(), stringListMember(), ...
  • manipulate numbers: round(), floor(), ceiling(), pow(), ...
  • check and modify types: isReal(), isError(), int(), real()...
  • work with times: time(), formatTime(), interval(), ...
  • random: random(), $RANDOM_CHOICE(), $RANDOM_INTEGER(), ...
  • etc.

Check the documentation to see the complete list of predefined functions, and also the predefined macros.
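As an illustration (values are made up), a submit file could combine a random macro with the same substr/stringListMember trick used in the Fedora example above:

```
  # Pass a random integer between 1 and 1000 as an argument (illustrative)
  arguments = "--seed $RANDOM_INTEGER(1,1000)"

  # ClassAd functions can appear in expressions such as requirements:
  requirements = (!stringListMember(substr(toLower(Target.Machine),0,1), "f"))
```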



Q37: I would like to submit my jobs now, but they should run at a programmed time, can I do that? ^ Top

A: Sometimes it may be useful to run your jobs at a specific time: maybe your application depends on data that are automatically generated at a given time and you want to run your jobs after that moment; or you want to submit your jobs now but they have to run X hours after submission; or you want to run the same jobs regularly, several times a day or every week... Condor has several commands to deal with these situations. Please visit the Condor submit file page for details and examples of how to specify that jobs begin at a scheduled time, and also how to program periodic executions.
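Roughly, the commands involved are deferral_time (start no earlier than a given Unix timestamp) and the cron_* family (crontab-like periodic runs); a sketch with made-up values:

```
  # Start no earlier than this Unix timestamp (illustrative value)
  deferral_time = 1577836800

  # Or, alternatively, run periodically: here, every day at 03:30
  cron_minute = 30
  cron_hour   = 3
```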



Q38: Jobs leave the queue after finishing. If something went wrong... could they be held or automatically re-executed instead? ^ Top

A: By default, all your jobs leave the queue after completion. But it can happen that some jobs reach completed status because they failed (for instance, due to bad inputs, or a missing software package on a specific machine, etc.; see also this FAQ). If that happens and you are able to detect it, you can force them to stay in the queue with 'on hold' status, or to return to 'Idle' status so they will be executed again. You can control which jobs change status according to their execution time (if it is abnormally short or long), their exit code, etc. Use the on_exit_hold command to change a job's state to "on hold", or the on_exit_remove command to re-execute it (it will get the Idle status again), adding a reason and/or subcode if you wish. Please visit the Condor submit file section for detailed info and examples about this feature.
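Two sketches of such expressions (the one-hour threshold is illustrative):

```
  # Hold the job if it ran for less than an hour (suspiciously short):
  on_exit_hold = ((CurrentTime - JobStartDate) < 3600)

  # Leave the queue only after a clean exit; otherwise return to Idle:
  on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
```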



Q39: I want to do checkpoints of my normal programs (without using Condor) so I can restart them, is that possible? ^ Top

A: Yes, if your program is written in C, C++ or Fortran (and you compile/link it using cc, c89, CC, f77, gfortran, gcc, g++, g77 or ld), it is likely that you can do checkpointing without using Condor to run it. This is a powerful feature: you can make specific or periodic checkpoints of your program while it is running and, if something happens, restart it from any of those checkpoints.

We will present a full example. Suppose we have a C program named myprogram.c; then follow the next steps:

  1. First, compile it using the condor_compile command and your normal compiler; this is the only step where Condor is involved. Compile as you normally do, just add condor_compile before your compilation line:
    condor_compile gcc myprogram.c -o myprogram
  2. Optional: you will now have an executable called myprogram. Executables created by condor_compile are usually quite big, since some extra information is added (debugging info, symbol tables, etc.). This extra information is not needed to run your program, so if you want a smaller executable you can remove it with the linux strip command, which may also improve performance:
    strip myprogram
  3. Our program is ready to be executed. We will not use condor_submit; we will run it directly in our shell like any other program: this is called the Standalone Checkpointing Mechanism. Since Fedora has a feature called address space randomization which is not compatible with the checkpointing mechanism, we have to use the linux setarch command to disable it:
    setarch x86_64 -R ./myprogram   
The application will then run and we may see some Notice lines, which is normal; in this case they tell us that checkpoints will be written to ./myprogram.ckpt:
    Condor: Notice: Will checkpoint to ./myprogram.ckpt
    Condor: Notice: Remote system calls disabled.
  4. To deal with checkpoints, we need to know the Process ID (PID) of the running program. We can get it using the linux ps command (the PID should appear in the second column, after the username):
    ps aux | grep myprogram    #Suppose that the PID is 12233
  5. Once we have the PID, we can force the application to write a checkpoint at any time by sending it a SIGUSR2 signal:
    kill -USR2 12233
After sending that signal, the checkpoint should have been created and the application should still be running; check that a file called myprogram.ckpt exists in the same directory. If you want periodic checkpoints, you can write a simple shell script or use cron to regularly send SIGUSR2 signals to your program. Bear in mind that all checkpoint files are created with the same name, so if you want to keep them all, rename each one before creating a new one.
  6. We can also force the application to write a checkpoint and stop the execution immediately afterwards. To do that, use the SIGTSTP signal instead, or press Ctrl+Z:
    kill -TSTP 12233
Now you can check that the checkpoint was created and your program is not running.
  7. To restart the execution from a specific checkpoint, run your program again adding the -_condor_restart option and the name of the checkpoint:
    setarch x86_64 -R ./myprogram -_condor_restart myprogram.ckpt
Your application should now be running from the same point where the checkpoint was created.

Notes:

  • If you have problems creating the checkpoints or running/restarting your application, add -L and/or -B options to setarch.
  • Bear in mind that programs should meet some limitations.
  • If your application is written in the C language and you want a deeper control of the checkpoint feature, you can add some functions provided by the Condor Checkpoint Library to your program.
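The periodic-checkpoint idea from step 5 can be sketched as a small shell script. Since a condor_compile'd binary is not available here, a dummy bash process that traps SIGUSR2 and writes a fake checkpoint file stands in for myprogram (names and timings are illustrative):

```shell
#!/bin/bash
set -e
workdir=$(mktemp -d)
export workdir

# Dummy stand-in for the condor_compile'd program: on SIGUSR2 it writes
# a fake checkpoint file, just as the real binary writes myprogram.ckpt.
bash -c '
  trap "echo state > $workdir/myprogram.ckpt" USR2
  for i in $(seq 1 100); do sleep 0.1; done
' &
pid=$!

# Periodic checkpointing: signal, give the program time to write the
# checkpoint, then rename it so the next one does not overwrite it.
for n in 1 2; do
  sleep 0.5
  kill -USR2 "$pid"          # same signal the real program expects
  sleep 0.5
  mv "$workdir/myprogram.ckpt" "$workdir/myprogram.ckpt.$n"
done

kill "$pid" 2>/dev/null || true
ls "$workdir"
```

With the real program, the loop body would simply be `kill -USR2 <pid>` plus the `mv`, driven by cron or a long-running script.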



Q40: I have a fault tolerant application, can I save the state and restore it when executing with Condor?^ Top

A: Yes, Condor allows you to use your fault-tolerant programs. You only need the next command to specify that Condor has to save files when your program fails or Condor needs to evict it:

 when_to_transfer_output = ON_EXIT_OR_EVICT

You have more information in the Condor manual: The ON_EXIT_OR_EVICT option is intended for fault tolerant jobs which periodically save their own state and can restart where they left off. In this case, files are spooled to the submit machine any time the job leaves a remote site, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. The files spooled back are placed in a directory defined by the value of the SPOOL configuration variable. Any output files transferred back to the submit machine are automatically sent back out again as input files if the job restarts.



Q41: My jobs have some dependencies, is it possible to specify that? ^ Top

A: Yes, if you have dependencies in your inputs, outputs or execution order, you can specify them using a directed acyclic graph (DAG). Condor has a manager (called DAGMan) to deal with these jobs, but you must use special commands, like submitting your jobs with condor_submit_dag. Please visit the Condor submit file page for more info and examples.
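As a minimal illustration (file names invented), a DAG description file makes job B wait until job A has finished:

```
  # diamond.dag -- illustrative; a.sub and b.sub are ordinary submit files
  JOB A a.sub
  JOB B b.sub
  PARENT A CHILD B
```

It would then be submitted with condor_submit_dag diamond.dag instead of condor_submit.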



Q42: My question is not in this list or I need further information, where can I find it? ^ Top

A: There are more FAQs and How-to recipes available at the Condor site, and the official Users' Manual is useful, too. You can also visit other Condor sections at the SIEpedia, like useful commands or Submit files. If you need further information, please contact us (you can find our contact details here).

Page last modified on December 04, 2017, at 04:27 PM