Here we list the most relevant news and announcements about the Beowulf cluster (Diodo) and the Condor queue management system, sent to the Beowulf and Condor mailing lists.
After the failure of several of its aging computing nodes, we decided to retire Diodo from service.
All users are encouraged to make use of the LaPalma supercomputer instead, which has recently been expanded and now boasts 1024 CPU cores and a total of 1 TB of RAM. The percentage of time reserved for IAC users is now 50%.
Diodo is "open for business" again. I intend to make some more changes to Diodo, but those will have to wait (provisionally scheduled for the week starting April 25th), so as not to leave Diodo out of service for too long.
The main changes to Diodo during this downtime were:
Diodo (http://diodo/ formerly Chimera) is showing signs of its age, so I'm going to reinstall it from scratch with a more modern version of its operating system (and libraries, compilers, etc.). A tentative schedule (if no complications arise) is:
So, it is IMPORTANT that you back up any data in Diodo that you want to keep BEFORE Sunday February 13th. Otherwise, all data in ALL partitions (INCLUDING HOME DIRECTORIES IN DIODO) will be lost. Also, if you want to start using the new Diodo as soon as possible, please let me know what software/libraries/compilers you would like installed, so that I can prepare their installation, if possible, by March 1st.
As we do every six months, it is time to publish the usage statistics of the Supercomputing resources at the IAC for the second semester of 2010. In total, 814768.9 CPU hours were delivered during this period. By resource, Condor delivered 577898.8 CPU hours, LaPalma 218551.43 and Diodo 18318.67. Full details of the breakdown by users can be found at the SIE Forum for Condor, LaPalma and Diodo. If you want a piece of this pie and don't know how to start, just let us know.
Tradition dictates that it is now time to publish the usage statistics of the Supercomputing resources at the IAC for the first semester of 2010. In total, 914353.9 CPU hours were delivered during this period. By resource, Condor delivered 440990 CPU hours, LaPalma 282620 and Diodo 190743.90. Full details of the breakdown by users can be found at the SIE Forum for Condor, LaPalma and Diodo. If you want a piece of this pie and don't know how to start, just let us know.
We have published the Condor usage statistics for the first semester of 2009 at http://venus/SIE/forum/viewtopic.php?f=8&t=38&p=686#p686
At the same time, just a reminder that due to the Condor license "Any academic report, publication, or other academic disclosure of results obtained with this Software will acknowledge this Software's use by an appropriate citation." (http://research.cs.wisc.edu/htcondor/license.html). A description of what "an appropriate citation" means can be found at https://lists.cs.wisc.edu/archive/htcondor-users/pre-2004-June/msg00542.shtml
We have published the Chimera usage statistics for the first semester of 2009 at http://venus/SIE/forum/viewtopic.php?f=8&t=154&p=687#p687
Due to lack of sufficient air-conditioning in the servers' room, some machines had to be turned off. Since the 32-bit partition in Chimera is quite old now, and not many people were using it, I'm afraid those machines had to go... They will remain switched off for the foreseeable future. But remember that if you have software compiled for 32 bits, in many cases it should run without changes on the 64-bit machines. If you have problems or any doubts, please do get in touch.
With dual- and quad-core workstations becoming the norm these days, we have revisited the list of machines available to Condor, and we have managed to set a new record, breaking the 400-CPU barrier.
Roughly 75% are 64-bit CPUs, the remaining ones being 32-bit. The details
The future is 64 bits, but while we still have 32-bit machines, you should know that the 64-bit CPUs can also run 32-bit codes. This is important because with Condor it is very easy to do a "Heterogeneous Submit": for example, submit from a 32-bit machine, but ask Condor to execute the code on either a 32-bit or a 64-bit machine. This is not the default behaviour (the default is to execute on those machines with exactly the same architecture and operating system as the one from which you submit), so if you want to make use of this feature have a look at section 2.5.6 of the manual http://research.cs.wisc.edu/htcondor/manual/v7.0/2_5Submitting_Job.html or ask me, if in doubt.
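As an illustration (the executable name my_code is hypothetical, assumed compiled for 32 bits), a submit file for such a heterogeneous submit might look something like this:

```
# Sketch of a heterogeneous submit: accept both 32-bit (INTEL)
# and 64-bit (X86_64) Linux machines instead of only the
# architecture of the submitting machine.
universe     = vanilla
executable   = my_code
requirements = (Arch == "INTEL" || Arch == "X86_64") && (OpSys == "LINUX")
output       = my_code.out
error        = my_code.err
log          = my_code.log
queue
```

The key line is the explicit requirements expression: once you set it, Condor no longer adds its default same-architecture constraint.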
Of the various file systems/partitions available at Chimera, the /scratch partition (which is shared by all nodes in the cluster) is filling up very quickly, so it is time to set up a program that automatically deletes files that have not been accessed within a given period of time. Those of you who used Beoiac will remember that its /scratch partition worked the same way.
To give you enough time to recover any data that might be useful, this automatic deletion mechanism will not be activated until the 9th of November (in about 15 days). From that date on, all files in the /scratch partition that have not been accessed (read, modified, etc.) in the last 60 days will be automatically deleted (the check works file by file, so accessing a directory does not save its individual files). This time limit will be reconsidered later if the measure proves ineffective or excessive.
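To see which of your files would fall under the 60-day rule, you can use find with the -atime test. A minimal sketch (demoed here on a temporary directory; on the cluster you would point it at your area under /scratch):

```shell
# Demo of the 60-day access-time rule on a temporary directory.
tmp=$(mktemp -d)
touch -a -d "90 days ago" "$tmp/old_data.dat"   # simulate a file not accessed in 90 days
touch "$tmp/fresh_data.dat"                     # accessed just now
# Files whose last access time is more than 60 days ago,
# i.e. the ones the cleanup would delete:
stale=$(find "$tmp" -type f -atime +60)
echo "$stale"
rm -rf "$tmp"
```

Running the same find over your own /scratch directories before the 9th of November will show you exactly what is at risk.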
You should consider the /scratch partition only as temporary storage, and back up all important data somewhere else outside the cluster.
Sorry for the inconvenience, but it is the only way to keep the cluster operational.
We have published the Condor usage statistics for the first semester of 2008 at http://venus/SIE/forum/viewtopic.php?p=456#456
At the same time, just a reminder that due to the Condor license "Any academic report, publication, or other academic disclosure of results obtained with this Software will acknowledge this Software's use by an appropriate citation." (http://www.cs.wisc.edu/condor/license.html). A description of what "an appropriate citation" means can be found at https://lists.cs.wisc.edu/archive/htcondor-users/pre-2004-June/msg00542.shtml
We have published the Chimera usage statistics for the first semester of 2008 at http://venus/SIE/forum/viewtopic.php?p=457#457
As you can see in the post, the cluster has been used to roughly 73% of its capacity. This is indeed pretty good, given that some jobs need to occupy a full node but only use two of the four available CPUs (due to memory constraints), that one of the nodes is often reserved for testing and thus not fully occupied, etc.
Until recently Chimera had a static reservation on one of the nodes that prevented long-running jobs from executing there during working hours. This was meant for code development or small tests, but since it was seldom used and meant that 4 CPUs sat idle most of the time, I have deleted this static reservation. So Chimera is close to its full 128 CPUs again (one of the nodes needs to be repaired, which should happen during the coming days).
If you need a cluster for tests, you can now use our new mini-cluster Xerenade. This mini-cluster has 16 AMD CPUs, and should be perfect for code development. This is not yet in full production, but if you would be interested in trying it out, please do get in touch.
Perhaps the news about the newly installed compilers has gone unnoticed, but we have recently installed the Portland Group's C/C++ and Fortran compilers on the IAC network. For those of you developing parallel codes, it may be interesting to know that these compilers support "High Performance Fortran (HPF)" (http://www.pgroup.com/doc/pghpf_ug/hpfug.htm).
Our version of Condor at the IAC had become a bit old, and it was having some problems with multi-core PCs, so we have upgraded to the latest stable Condor version, 7.0.1, still warm from the oven (released February 27, 2008).
The upgrade went smoothly and without any problems, but if you find anything odd during the coming days, please let us know.
As is now usual towards the end of each year, I have calculated the yearly usage of Chimera, which during 2007 (up to the 26th of December) delivered 447262.71 CPU hours. You can see all the details at the SIE Forum (http://venus/SIE/forum/viewtopic.php?t=154).
And by the way, if you missed our last SIEminar (Supercomputing resources at the IAC), you can see the slides at http://www.iac.es/sieinvens/SINFIN/Sie_Courses_PDFs/resources_supercomputing.pdf. Happy holidays, and let's try to make Chimera work even harder during 2008!
Just to let you know (in case it is of interest) that, due to a user request, I have installed the ABINIT software in Chimera (chi64, the 64-bit machines). According to its main page (http://www.abinit.org/):
"ABINIT is a package whose main program allows one to find the total energy, charge density and electronic structure of systems made of electrons and nuclei (molecules and periodic solids) within Density Functional Theory (DFT), using pseudopotentials and a planewave basis. ABINIT also includes options to optimize the geometry according to the DFT forces and stresses, or to perform molecular dynamics simulations using these forces, or to generate dynamical matrices, Born effective charges, and dielectric tensors. Excited states can be computed within the Time-Dependent Density Functional Theory (for molecules), or within Many-Body Perturbation Theory (the GW approximation). In addition to the main ABINIT code, different utility programs are provided."
If you are interested in trying it out, do get in touch.
As planned, Beoiac is gone. In its place we now have Chimera. It will have 32 nodes, but as of today only 30 have been configured. If you have used Beoiac before, much will be familiar, but there are new things to be learnt, like compiling for 64 or 32 bits, using PVFS, etc. Instructions (still very preliminary) on how to use the cluster are on the Cluster Documentation Page.
Next week I will be away, so if you would like to use the cluster during that time and you don't have an account yet, try to get in touch with me tomorrow.
Overall the cluster is working OK, but there are still a number of things to be configured, so if you find any problems please get in touch. Also, remember that although the cluster is primarily for parallel codes, serial codes are also permitted, though they will have a very low priority in the queueing system.
At last the air-conditioning in the machine room has been installed and is working fine, so we got the green light to start the installation of the new 64-bit Beowulf. This installation is relatively complex, as there are many things to test and to install before putting it into production. This is especially true since the transition from 32 to 64 bits adds a number of new challenges, so I cannot tell you when it will be ready for regular use, but I will keep you informed.
In order to make this new cluster a better experience for everyone, I would ask you two things:
Condor is now beyond its testing phase, as it has proven very stable and useful, but in order to avoid affecting other users, we have written a small Code of Conduct to which you should stick when using it. I include a copy below.
Condor is a terrific tool for performing parametric studies and other types of jobs that can run simultaneously and independently on a number of machines. Nevertheless, under certain circumstances, if you are not careful you can bring the network to a crawl. To avoid these situations, please stick to this simple code of conduct:
Please stick to these basic rules so that we can avoid Condor affecting other users' work.
Until now the priority policies in the cluster took into consideration a number of parameters, but not how much each user had used the cluster during the preceding days, as the cluster was not heavily used and this didn't seem necessary.
As you might have noticed, this has changed, and queueing times can now be large. Thus, to ensure fairness amongst all users, a new "fairshare" parameter is now taken into account when calculating job priorities. Basically, the less you have used the cluster during the preceding days (currently 7 days), the greater this fairshare component will be, giving your jobs an advantage over those of users who have used the cluster recently. This should make the use of the cluster fairer for everyone.
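To give an idea of how such a component behaves, here is an illustrative sketch only (this is NOT the actual formula or weights used by the scheduler; the function name and numbers are made up):

```python
# Illustrative sketch of a fairshare boost: NOT the real scheduler formula.
# The less a user consumed in the recent window, the bigger the boost.
def job_priority(base_priority, user_hours_last_7d, max_hours_7d, fs_weight=1000):
    """Add a fairshare component that shrinks with recent usage."""
    fairshare = 1.0 - min(user_hours_last_7d / max_hours_7d, 1.0)
    return base_priority + fs_weight * fairshare

light = job_priority(100, 250, 1000)  # light user: big fairshare boost
heavy = job_priority(100, 750, 1000)  # heavy user: small fairshare boost
print(light, heavy)
```

The point is simply that, all else being equal, the light user's job now outranks the heavy user's, so recent heavy usage no longer monopolises the queue.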
As always, don't hesitate to let me know any suggestions/comments/etc.
PS. The details about the current scheduling policies for the Beoiac can be found at http://beoiac/Maui/policies.html
As you know, Condor is great for short jobs, but when running long jobs the efficiency can decrease due to evictions. One type of eviction happens when your Condor job is running in a machine and the "owner" comes back to his/her workstation. If your job had been running for only 20 minutes it is not a big deal, but if your job was about to complete after running for 10 hours, then the efficiency suffers a bit.
You could avoid sending long jobs to Condor by splitting the execution into smaller parts. If this is not possible, then you could consider submitting your jobs to the Standard Universe. From the Condor manual: "Jobs submitted to the standard universe may produce checkpoints. A checkpoint can then be used to start up and continue execution of a partially completed job." More info about the Standard Universe at:
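As a rough sketch (the job name my_job is hypothetical), a standard-universe submission has two steps: first relink your code with condor_compile, then submit it with universe = standard:

```
# Step 1 (shell): relink the executable for checkpointing, e.g.
#   condor_compile gcc -o my_job my_job.c
# Step 2: submit file
universe   = standard
executable = my_job
output     = my_job.out
error      = my_job.err
log        = my_job.log
queue
```

Note that not all codes can be relinked this way (there are restrictions on threads, some system calls, etc.), which is why the alternatives below may be needed.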
If the standard universe is not possible either, then you can make use of the brand-new feature I have just implemented. The idea is to submit your jobs to those workstations whose owner has shown very little activity in the past, thus (assuming that past behaviour is a predictor of future behaviour) reducing the risk of evictions. So, how does it work? Very simple: just add to your submit file
"Rank = owner_inactivity"
owner_inactivity is a value that is increased by one every 15 minutes whenever the machine is not being used by its "owner". If you are curious about these values for all the machines in the Condor pool, you can run the command:

condor_status -format "%d " owner_inactivity -format "%s \n" Machine -sort owner_inactivity

which will print all the machines with their corresponding owner_inactivity values in ascending order (right now the values are still very small because I just started the feature, but you will see them grow).
If you want to mix this rank expression with another rank expression, check the examples at:
With this feature, your jobs will try to run first on machines that have been unused by their owners for a long time, which should improve your chances of avoiding eviction. As always, let me know if you find any issues with this.
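Putting it all together, a complete submit file using the new rank expression might look like this (the executable name my_code is hypothetical):

```
# Sketch: prefer the machines whose owners have been inactive longest.
universe   = vanilla
executable = my_code
Rank       = owner_inactivity
output     = my_code.out
error      = my_code.err
log        = my_code.log
queue
```

Rank only expresses a preference, not a requirement: if all low-activity machines are busy, the job will still run elsewhere.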