Benchmarking and Tuning the GROMACS Molecular Dynamics Package on Beowulf Clusters
Decide where to publish, Crossroads of the ACM? Guidelines, deadlines, etc. Charlie ok as an author? JoshH.
References tour. Charlie.
http://www.earlham.edu/~charliep/mt/archives/002711.html, other entries?
PDF rendered very unusually on bubba, PS ok. JoshH.
Reconcile database results for 1-1-1, 2-2-2, 4-4-4, 6-6-6, 8-8-8; bazaar and cairo (only audit cairo now but we'll ultimately need bazaar as well); villin, dppc, then proteasome; use detailed version of the scheduler; run under b-and-t-g user; use c10-13 for now; gromacs-optimal-3.2.1 and gromacs-baseline-3.2.1; rerun 1-1-1 and 4-4-4 for villin and dppc, baseline and optimal (2^3 runs total); report back. JoshM.
Confusion between Methodology and Experimental Design. Some material from each needs to move to the other. Are they parallel universes at some level? We all need to think about this a bit. J^2 and Charlie.
How to handle definitions? Journal, Zobel? JoshH.
Other Considerations merged into Results with appropriate changes. JoshH.
How to handle general citations (rather than specific ones). I don't even know if there is a commonly accepted way of doing this in scientific literature. It may be that re-reading the articles, after our prose is relatively stable, looking for specific citations is the way to handle this. JoshH.
MP_Lite is a subset of the MPI-1 standard. Here are some of my notes from porting GROMACS from MPI to MP_Lite:
For workstations, type 'make tcp' and link libmplite.a into your code.
If you are sure you won't pass messages larger than the TCP buffer
size, you can use the synchronous version by doing 'make tcp_sync'
which may increase performance by a few %. The TCP buffer size is reported
in the .nodeX log files after each run.
Hi All,
Charlie mentioned last Thursday that he had edited the number of picoseconds that the simulation runs for. How did he do that? I ask because many of the molecules in the folder JoshM pointed me towards have default run times of 1000.0 ps, which would take days. I would really like to cut that down to 10-20 ps so that I can put together my table of "the most flop consuming subroutines" with a wide selection of molecules.
Thanks.
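These notes do not record how Charlie changed the run length. One likely route, stated here as an assumption, is the GROMACS .mdp parameter file, where the simulated time is nsteps multiplied by dt (the time step, in picoseconds). The Python sketch below only does that arithmetic; the function name is hypothetical and dt = 0.002 ps is an example value, not one taken from our molecule files.

```python
def nsteps_for(target_ps, dt_ps):
    """Return the number of MD steps needed to cover target_ps picoseconds.

    Assumes simulated time = nsteps * dt, with dt given in picoseconds.
    """
    if target_ps <= 0 or dt_ps <= 0:
        raise ValueError("target_ps and dt_ps must be positive")
    return round(target_ps / dt_ps)


if __name__ == "__main__":
    # dt = 0.002 ps is only an example; read the real dt from each molecule's .mdp file.
    for target in (10, 20, 1000):
        print(f"{target} ps -> nsteps = {nsteps_for(target, 0.002)}")
```

The resulting value would go on the nsteps line of the molecule's .mdp file before rerunning grompp.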
Here is an outline of how the current scheduler works. First, note that there are two versions of the scheduler [I have forgotten exactly why]: detailed-scheduler.pl, which is the current version and the one that should be used, and scheduler.pl, which is old and should not be used. The latter does not have the 'find the dominant inner loop' code.
Some General Notes:
I submitted the following to the Ohio Linux Fest 2004:
http://cluster.earlham.edu/detail/project/b-and-t-gromacs/presentations/linux-fest-2004.html
I may try to convert this to actual HTML in the near future, but I may wait until we start pounding out some prose/presentation materials for it first, so we can define its structure.
In order for GROMACS (3.1.x and 3.2.x) to build and use AltiVec instructions on PowerPC chips running Yellow Dog Linux/gcc 3.3.2, a header file needs to be added to two files in the distribution.
In configure "#include <altivec.h>" should be added before main() in the generated C code in the AltiVec support test section. You can find this by searching for "supports altivec".
In include/ppc_altivec.h "#include <altivec.h>" should be added before the first function definition.
I have manually compiled a list of the molecule runs that [have | have not | will not be] completed for both bazaar and cairo.
Bazaar Cluster
Cairo Cluster
These are automatically updated from the database when you refresh the page. The key has changed a bit from previous iterations of this chart. I am working on a Time Approximation scheme to place on the page as well.
The pages list the runs in 4 categories:
I placed the "villin and URE in 6 A cubic box in water" molecule set in the b-and-t-gromacs CVS repository with the other molecules. It is under the directory villin-urea or via the softlink 'urea'.
I have been doing some testing with this molecule, and cannot seem to get it to span more than 3 processes without major failure [i.e. application death]. This is a very large molecule, and if we can use at most 3 processes with it, I am interested in finding out why. This is one of those questions that we need to answer in order to produce some stable F@C code. grompp is able to split it fine, but mdrun chokes. I am playing around with the other versions of GROMACS to see if there is any difference; specifically, I am interested in testing with 3.2.1.
I have installed the latest stable release of GROMACS (3.2.1) on the clusters. I also installed GROMACS 3.1.4 on cairo. For both versions I installed a Baseline and an Optimal Config.
So on both clusters we have the following versions of GROMACS with both Baseline and Optimal configurations:
Here is a table of the Future runs that I would like to run to answer the question:
For a given molecule, what is the optimal number of processes, taking into consideration SMP vs uniprocessor machines running both x86 and PPC hardware [Bazaar and Cairo respectively]?
The file is here:
chart.html
Here are some notes about reading SMP vs Uniprocessor runs in the Database.
SMP Collection
 cpus | nodes | processes |           label           | molecule | cluster_name |     finish_time
------+-------+-----------+---------------------------+----------+--------------+---------------------
    2 |     1 |         2 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-02-24 21:52:37
    4 |     2 |         4 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-03-02 16:24:51
    6 |     3 |         6 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-03-02 16:10:04
    8 |     4 |         8 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-03-02 15:58:00
    2 |     1 |         2 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-02-24 20:44:34
    4 |     2 |         4 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-03-02 20:59:57
    6 |     3 |         6 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-03-02 20:51:44
    8 |     4 |         8 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-03-02 20:45:20
Uniprocessor Collection
 cpus | nodes | processes |                label                | molecule | cluster_name |     finish_time
------+-------+-----------+-------------------------------------+----------+--------------+---------------------
    2 |     2 |         2 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 16:28:06
    4 |     4 |         4 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 16:06:36
    6 |     6 |         6 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 15:50:56
    8 |     8 |         8 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 15:37:34
    2 |     2 |         2 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:40:55
    4 |     4 |         4 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:27:23
    6 |     6 |         6 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:18:47
    8 |     8 |         8 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:11:57
The difference is in the combination of nodes and processes. In the uniprocessor runs, nodes = processes; in the SMP (dual CPU) runs, nodes = processes/(cpus per node), i.e. nodes = processes/2.
Note that in these runs cpus = processes, but this may not be so in the future. It holds only because we have tested with one process per cpu; we may find that running more than one process per cpu is the optimal configuration.
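The node arithmetic described in these notes can be sketched in Python as follows. This is only an illustration: the function names and the procs_per_cpu parameter are hypothetical and are not part of the scheduler or the database schema.

```python
def nodes_needed(processes, cpus_per_node, procs_per_cpu=1):
    """Number of nodes a run occupies.

    With one process per cpu this reduces to processes / cpus_per_node,
    e.g. processes / 2 on the dual-CPU SMP runs.
    """
    per_node = cpus_per_node * procs_per_cpu
    if processes <= 0 or per_node <= 0:
        raise ValueError("all arguments must be positive")
    return -(-processes // per_node)  # ceiling division


def run_kind(nodes, processes):
    """Classify a database row by comparing its nodes and processes columns.

    A 1-node, 1-process row cannot be told apart this way and is
    reported as uniprocessor.
    """
    return "uniprocessor" if nodes == processes else "smp"


if __name__ == "__main__":
    for procs in (2, 4, 6, 8):
        print(procs, "processes:",
              nodes_needed(procs, cpus_per_node=2), "SMP nodes,",
              nodes_needed(procs, cpus_per_node=1), "uniprocessor nodes")
```

If we later try more than one process per cpu, procs_per_cpu is the knob that would change, and cpus would no longer equal processes.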
I am going to set up some runs on Cairo and Bazaar to fill out our table.