Folding@Cluster
This is a description of the new process architecture developed after our experiences working with the a1 release.
Overall Plan:
Mother:
Nanny:
mdrun:
Notes:
2. We want to limit the changes we make in mdrun, but script based changes that are easy to apply are ok.
3. There are some kludges in the way that the nanny 'finds' the mdrun
process it is matched with. There are better ways to do this, but for the
moment the kludges allow for a proof of concept and quick solution.
4. We are using MPI_Comm_spawn() instead of system(mpirun ...) because the former
allows a bit more control over the MPI_COMM group for the mdrun process(es),
whereas the latter completely separates the processes and adds
challenges that are harder to overcome.
Questions:
2. Redirect stdout to a log file [via freopen].
Answer: Init_FATC function in mdrun. The nannies will transmit the logs back to the mother on completion.
3. Get Work Unit from mother [tpr, gro(?) files]
Answer: Mother pre-populates files on the nanny0 node as JoshH suggested.
This works if nanny0 and the mother are on different nodes and on NFS and
non-NFS systems.
4. Notification to the mother that we are finished.
Answer: Finialize_FATC sends message to mother from mdrun0
o The importance of SMP resources is increasing, make sure we install easily and scale well in this environment. What is the current level of support for threads in GROMACS? How does GROMACS running threaded compare in terms of performance to F@C? The CCG will locate SMP boxes that we can test on.
(Postscript: GROMACS currently has only partial support for threads (non-bonded interactions). F@C, i.e. MPI, scales pretty well on SMP boxes.)
o Ease of installation, what can we do to make it as simple as possible to install and use F@C? LAM is a hurdle in this respect. Simple how-to documents for ssh (assumed), key exchange, LAM, using shared file systems, other topics?
o Does hyperthreading do anything for F@C?
(Postscript: Not really. HT is duplicated processor state resources which facilitate very fast context switches between threads in a single process. Since GROMACS doesn't use threads F@C can't leverage this.)
o Windows port, Vijay has a person that can work on this with us.
#include http://www.earlham.edu/~charliep/mt/archives/002762.html
There is value in converting to MP Lite since it is much easier to install than LAM/MPI.
FFTW uses MPI functions that MP Lite doesn't support. This would limit this port to a subset of the analysis methods currently used by GROMACS.
MP Lite is limited to a single communicator, MPI_COMM_WORLD. F@C currently uses three communicators.
MP Lite requires either a shared filesystem or installation of the application binaries on all the nodes. The latter is a problem we were able to solve in F@C by using LAM's ability to launch a binary from the rank 0 node on all the other nodes in an MPI world (it just ships the binary to each LAM daemon before startup). There is still the chicken-and-egg problem of having to install LAM on each node, or to have a shared filesystem with the LAM binaries.
-------------
man mpirun
-------------
* LAM directs UNIX standard input to /dev/null on all remote nodes.
* LAM directs UNIX standard output and error to the LAM daemon on all
remote nodes. LAM ships all captured output/error to the node that
invoked mpirun and prints it on the standard output/error of mpirun.
Local processes inherit the standard output/error of mpirun and transfer
to it directly.
------------
/cluster/cairo/src/lam-7.0.6/HISTORY
------------
* stdout/stderr of the local lamd is left open so that tstdio(3) will work properly
- tstdio -> trillium stdio file
------------
Useful files to look at
------------
share/kreq/clientio.c
otb/mpirun/mpirun.c
- set_stdio()
- lam_mktmpid
- Create a temporary file name based on an id [/tmp/lam-12]
- lam_lfopenfd
- sfh_send_fd
- pass a single file descriptor over a stream
share/include/kio.h
#include <stdio.h>

int main(void)
{
    char str[256] = "This is a string of text\n";
    FILE *fp;

    // Print to stdout
    printf("%s", str);

    // Redirect stdout
    if ((fp = freopen("file.txt", "w", stdout)) == NULL) {
        perror("Unable to open file.txt");
        return 1;
    }

    // Try to print to stdout again.
    // This goes into "file.txt" directly, and is NOT printed to the terminal.
    printf("%s", str);

    return 0;
}
ACS Funding
F@C
Vijay is having lunch with Adam Begerg (now a grad student in CS at Stanford); he's going to see if Adam has the time/interest to work on COSM, etc.
MP-Lite would greatly simplify the installation of F@C. We'll need to look at FFTW/MPI and see if we can build GROMACS with FFTW libraries that don't have MPI calls (it's also possible to not use FFTW, how much science would that exclude?), stdout/stderr mapping from mdrun/child to mother via LAM's filehandle mapping, and possibly other areas. See JoshH's MT entry for a start on this.
export GMXLIB=/full/path/to//release/top
export GMXLIBDIR=/full/path/to//release/top
lamboot -v
Set these variables before running lamboot: once lamboot is executed it sets the environment variables, and any changes or additions are not propagated until the lamd is restarted.
There are now complete instructions for building Folding-at-Clusters in folding-at-clusters/source/README.
If you have any problems building with those get in touch with charliep.
* What should we plan on sending back as a result?
science - trr, xtc (if present); last frame in trr used to generate next, xtc for analysis; mdp file controls whether an xtc file is generated.
log (unique extension) - CPU time, wall time, # restarts, cluster characteristics
8.3 file naming restrictions still apply
Distribution - package for Un*x systems? For beta do a tarball, later do packages.
Windows port - POSIX issues are handled by the C compiler
Future - can we use MP-Lite or something similar so that we can embed it with F@C? The nanny could be a Windows Service.
Build all static binaries.
More large molecules? Ribosome? Not needed now.
* Where we are:
Capability discovery
Startup - molecule preparation (grompp), lam startup, GROMACS startup
Progress monitoring
Checkpointing
Restarting
Results
* Tracked down 2 of the 4 compiler/optimizer errors we spoke about, and we have a fix for them.
* Load distribution with GROMACS and MPI coming soon. Don't let this hold up the beta.
* How does hyper-threading affect us? Probably not at all.
* Using COSM for communication between mother and nannies with HTTP.
* Beta around the middle of the month?
Build a roadmap for the future.
How to test the quality of a GROMACS result? Not needed now.
How are we going to get molecules when in production? Folding@Home template core.
grompp -f mdout.mdp -c d.villin.tpr -p topol.top -e ener.edr -t d.villin.trr -np 8 -o new.villin.tpr
tools used:
files initially required:
files created:
Process:
Notes:
General Notes
Here is an outline of how the current scheduler works. First note that there are two versions of the scheduler [I have forgotten exactly why]: detailed-scheduler.pl, which is the current version and the one that should be used, and scheduler.pl, which is old and should not be used. The latter does not have the 'find the dominant inner loop' code.
Some General Notes:
So I have been investigating why I am seeing some weird behaviour in the F@C framework when using POSIX signals. I found a couple of bits:
Signal Catching Changes in 6.5.9 Release
Signal catching
LAM MPI now catches the signals SEGV, BUS, FPE, and ILL. The signal handler terminates the application. This is useful in batch jobs to help ensure that mpirun returns if an application process dies. To disable the catching of signals use the -nsigs option to mpirun.
Internal signal
The signal used internally by LAM has been changed from SIGUSR1 to SIGUSR2 to reduce the chance of conflicts with the Linux pthreads library. The signal used is configurable. See the installation guide for the specific ./configure flag that can be used to change the internal signal.
2.9.2. Interaction with Signals
MPI does not specify the interaction of processes with signals and does not require that MPI be signal safe. The implementation may reserve some signals for its own use. It is required that the implementation document which signals it uses, and it is strongly recommended that it not use SIGALRM, SIGFPE, or SIGIO. Implementations may also prohibit the use of MPI calls from within signal handlers. In multithreaded environments, users can avoid conflicts between signals and the MPI library by catching signals only on threads that do not execute MPI calls. High quality single-threaded implementations will be signal safe: an MPI call suspended by a signal will resume and complete normally after the signal is handled.
In short if we use LAM-MPI then we should stay away from the following signals:
SEGV,BUS,FPE,ILL,TERM,USR2
I have changed the code from using SIGUSR2 to using SIGCHLD (a signal that is currently ignored by default according to signal(7) manpage), and things are working much better.
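As a rough illustration of the switch described above, here is a minimal sketch of installing a SIGCHLD handler with sigaction(). All the fac_* names are hypothetical and are not taken from the F@C source; the handler only sets a flag that the main loop can poll. One caveat worth keeping in mind: the kernel also delivers SIGCHLD whenever a child process exits or stops, so a handler like this cannot tell a deliberate notification from an ordinary child exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler. volatile sig_atomic_t is the type the C standard
   guarantees can be written safely from a signal handler. */
static volatile sig_atomic_t fac_notified = 0;

/* Hypothetical handler: record the event and return immediately. */
static void fac_on_sigchld(int signo)
{
    (void)signo;
    fac_notified = 1;
}

/* Install the handler for SIGCHLD. Returns 0 on success, -1 on error. */
static int fac_install_notify_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fac_on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Returns 1 if a notification arrived since the last call, then clears it. */
static int fac_consume_notification(void)
{
    int seen = fac_notified;
    fac_notified = 0;
    return seen;
}

/* Small self-check: deliver the signal to ourselves and observe the flag. */
static void fac_notify_demo(void)
{
    assert(fac_install_notify_handler() == 0);
    raise(SIGCHLD);
    assert(fac_consume_notification() == 1);
}
```

Keeping the handler down to a single flag write sidesteps the question of which calls are safe inside a signal handler, which matters here because MPI implementations may prohibit MPI calls from handlers.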
The goal of this post is to start the conversation about MD modules in F@C, and the requirements that new modules must adhere to in order to be classified as a potential module.
In order for GROMACS (3.1.x and 3.2.x) to build and use AltiVec instructions on PowerPC chips running Yellow Dog Linux/gcc 3.3.2 there are two files in the distribution which need a header file added to them.
In configure "#include <altivec.h>" should be added before main() in the generated C code in the AltiVec support test section. You can find this by searching for "supports altivec".
In include/ppc_altivec.h "#include <altivec.h>" should be added before the first function definition.
Here is a link to the MPI Forum's section on the error handler:
Communicator Error Handler
We need to create a function of the type:
typedef void MPI_Comm_errhandler_fn(MPI_Comm *, int *, ...);
int MPI_Comm_create_errhandler(MPI_Comm_errhandler_fn *function, MPI_Errhandler *errhandler)
I have manually compiled a list of the molecule runs that [have | have not | will not be] completed for both bazaar and cairo.
Bazaar Cluster
Cairo Cluster
These are automatically updated from the database when you refresh the page. The key has changed a bit from previous iterations of this chart. I am working on a Time Approximation scheme to place on the page as well.
The pages list the runs in 4 categories:
I added the Project1012 Molecule to CVS. Here is the link if you want to view the files:
http://cluster.earlham.edu/project/b-and-t-gromacs/tests/etc/orignal-molecules/project1012/
It is a bit smaller than the last molecule that we received, but this will 'stretch' to 100 processes without error [in one set of testing].
So I have typed up a protocol outline for v0.5 which is a bit more detailed than what we have on the board now. There are some notes at the bottom which we should address when we meet next. Here is the link:
http://cluster.earlham.edu/home/joshh/dev/fac/protocal.txt
Here is a link to the current framework:
http://cluster.earlham.edu/home/joshh/dev/fac/
I placed the "villin and URE in 6 A cubic box in water" molecule set in the b-and-t-gromacs CVS repository with the other molecules. It is under the directory villin-urea or via the softlink 'urea'.
I have been doing some testing with this molecule, and cannot seem to get it to span more than 3 processes without major failure [i.e. application death]. This is a very large molecule, and if we can only use at most 3 processes with it I am interested in finding out why. This is one of those questions that we need to answer in order to produce some stable F@C code. grompp is able to split it fine, but mdrun chokes. I am playing around with the other versions of GROMACS to see if there is any difference; specifically, I am interested in testing with 3.2.1.
I have installed the latest stable release of GROMACS (3.2.1) on the clusters. I also installed GROMACS 3.1.4 on cairo. For both versions I installed a Baseline and an Optimal Config.
So on both clusters we have the following versions of GROMACS with both Baseline and Optimal configurations:
MPI_Comm_spawn sequentially assigns processes to the configured processors. For our application we would like the user to define exactly which machines the program will run on. MPI_Info can name a file that specifies those nodes. The file needs to look something like:
< MPI_Info File >
n1 -np 1 nanny
n4 -np 1 nanny
n5 -np 1 nanny
< /MPI_Info File >
To shield the user from this level of detail, I have created a function that generates the Nanny and Child MPI_Info files from a hostfile.conf containing a comma-separated list of nodes in the lamnodes format above. For example:
< hostfile.conf >
n1,n4,n5
< /hostfile.conf >
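The conversion from hostfile.conf to the MPI_Info file can be sketched roughly as below. This is not the actual F@C function: fac_expand_hostlist and its parameters are hypothetical names, and the sketch assumes node names contain no surrounding whitespace and that every node gets exactly one process, as in the example file above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expand a comma-separated node list ("n1,n4,n5") into one
   "<node> -np 1 <binary>" line per node, written into out.
   Returns the number of nodes written, or -1 if out is too small. */
static int fac_expand_hostlist(const char *hosts, const char *binary,
                               char *out, size_t outlen)
{
    assert(hosts != NULL && binary != NULL && out != NULL && outlen > 0);

    size_t used = 0;
    int count = 0;
    const char *p = hosts;

    out[0] = '\0';
    while (*p != '\0') {
        size_t len = strcspn(p, ",\n");    /* length of the next node name */
        if (len > 0) {
            int n = snprintf(out + used, outlen - used, "%.*s -np 1 %s\n",
                             (int)len, p, binary);
            if (n < 0 || (size_t)n >= outlen - used)
                return -1;                 /* output buffer too small */
            used += (size_t)n;
            count++;
        }
        p += len;
        if (*p != '\0')
            p++;                           /* skip the separator */
    }
    return count;
}
```

Calling fac_expand_hostlist("n1,n4,n5", "nanny", buf, sizeof buf) fills buf with the three lines of the example MPI_Info file; the same call with a different binary name would produce the Child file.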
It is vital to note that MPI_Comm_spawn and MPI_Info are part of the MPI-2 standard, but many implementations do not support them fully. LAM-MPI is one of the few that support both MPI_Comm_spawn and MPI_Info. However, LAM-MPI does not currently implement some additional functions for intercommunicators, among them MPI_Bcast and other functions that send/receive/reduce messages globally to a group. Also, there is no function [that I have found thus far] to poll the size of a group via an intercommunicator. The workaround here is to have the first member of the Child group (rank == 0) send its size to the mother over the intercommunicator. The man pages provide useful information about each of these commands and their limitations.
In order to have separate Mother, Nanny, and Child processes that all play in the MPI sandbox, we will need to use MPI_Comm_spawn to launch our Child and Nanny binaries from the mother. I have been playing around with this and produced a basic framework that uses this functionality. You can play with the files, which are located here:
http://cluster.earlham.edu/home/joshh/dev/discover/dev/spawn/
Some points to mention before running to make a huge sandcastle:
1. There is a difference between intracommunicators (the standard usage of MPI_COMM_WORLD) and intercommunicators. Intracommunicators are used to speak with members of your own tribe (the Children that are spawned are in a tribe all of their own, and the mother is in a separate tribe). Intercommunicators allow tribes to talk together. So the Children need to know the handle (MPI_Comm mother) to reach the mother, and the mother needs to know the handle (MPI_Comm everyone) to reach the children.
2. The Children are able to call MPI_COMM_WORLD directly and it will allow the children to talk amongst themselves without talking to the Mother. If they want to talk to the mother they need to use the mother MPI_Comm 'channel'.
We should be able to spawn Nannies and Children as separate tribes and join them as necessary. After creating one or more intercommunicators, it is possible to join them into one big, happy intracommunicator.
To start the program you only need to start the mother, and pass it one node to start from. MPI_Comm_spawn wakes up the additional processors.
$ mpirun n0 mother
The installation of cflowd requires the arts and GNU flex libraries. There seems to be some problem with arts communicating with flex at the moment. I'll take a closer look asap.
It seems that cflowd analyzes network flow files in a format developed by Cisco. Is that what we were looking for?
I cleaned up the Capability Discovery code a bit.
As near as I can tell, the SIGTRAP error is an interaction between MPI and the AltiVec code, although that doesn’t sound reasonable on the surface of it. I made a copy of discovery.c sans all the MPI calls and it works fine.
I think we will have to trace down exactly where in inl1130() the trap is occurring in order to ultimately fix this. One approach would be to have our signal handler tell us where it was invoked from. Another would be to put some code in to pause stresscpu() long enough for us to attach with gdb before the offending call is made.
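The first approach, having the signal handler report where it was invoked from, might look something like the sketch below. It assumes Linux, where si_addr is filled in with the address of the fault for a real SIGTRAP; the fac_* names are hypothetical and none of this comes from discovery.c. The handler avoids stdio and uses only write(), since printf is not safe to call from a signal handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t fac_trap_seen = 0;

/* Append v as hexadecimal to buf starting at pos; uses no stdio. */
static size_t fac_put_hex(char *buf, size_t pos, unsigned long v)
{
    char tmp[2 * sizeof v];
    size_t n = 0;
    do {
        tmp[n++] = "0123456789abcdef"[v & 0xf];
        v >>= 4;
    } while (v != 0);
    while (n > 0)
        buf[pos++] = tmp[--n];
    return pos;
}

/* Report the address recorded in siginfo to stderr and note the signal. */
static void fac_on_trap(int signo, siginfo_t *info, void *context)
{
    static const char head[] = "trap: si_addr=0x";
    char buf[2 * sizeof(unsigned long) + 1];
    size_t pos;

    (void)context;
    pos = fac_put_hex(buf, 0, (unsigned long)(uintptr_t)info->si_addr);
    buf[pos++] = '\n';
    if (write(STDERR_FILENO, head, sizeof head - 1) < 0 ||
        write(STDERR_FILENO, buf, pos) < 0) {
        /* nothing safe to do about a failed write inside a handler */
    }
    fac_trap_seen = signo;
}

/* Install the reporter for SIGTRAP. Returns 0 on success, -1 on error. */
static int fac_install_trap_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fac_on_trap;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;   /* ask for the three-argument handler */
    return sigaction(SIGTRAP, &sa, NULL);
}

/* Self-check using raise(); here si_addr is not a meaningful address. */
static void fac_trap_demo(void)
{
    assert(fac_install_trap_reporter() == 0);
    raise(SIGTRAP);
    assert(fac_trap_seen == SIGTRAP);
}
```

For a trap raised by the hardware, the printed address could then be looked up in gdb or with addr2line to see which instruction in inl1130() is responsible. A real diagnostic handler would probably exit after reporting rather than return to the faulting code.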
I have finished the mother/nanny Capability Discovery code. I have an MPI version and a singleton version (separate mother and nanny programs).
These are located here:
MPI Version:
http://cluster.earlham.edu/home/joshh/dev/discover/mpi/
Singleton Version:
http://cluster.earlham.edu/home/joshh/dev/discover/single/
I have a version for the x86 SSE enabled Linux systems, PPC Altivec Linux systems, and a generic Linux version. Currently the Altivec MPI Version is broken due to a Trap/Breakpoint error that I am struggling to track down.
In the directories above there is a create.pl script, which will build the appropriate collection of source for the system you specify on the command line. I need to make this into a configure script in the future.