September 30, 2004
Update - Josh H - Sept. 30, 2004
Regard this entry as fluid if accessed before Sept. 30, 2004
Worked on: Folding@Clusters
- Mother/Child [Rank 0]/Nanny [Rank 0] now print a banner when they start which indicates the version of the code they are running, both Tag and Revision from CVS.
- Implemented a command line option that just prints the tag and revision from cvs [in a pretty format] to the command line, then exits. This is implemented for mother/child/nanny binaries.
./mother --version
./mother -v
- I cleaned up the chattiness of the print statements by adding a logging level argument to the Print command, and taking a tour through the code to put each print statement at the right log level.
- Nanny no longer sends the same checkpoint file multiple times. The decision is based on the size of the file. If the file has not changed size since the last time it was sent, it is not sent again. Since the file always grows (accumulates data), this is a good metric to use.
- Implemented switched sections for using assignment server OR local testing directory.
Uncomment the tag USE_ASSIGNMENT_SERVER in mother.h to toggle this option. Currently this is a stub if enabled, since we don't have the assignment server stuff.
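A minimal sketch of how the switched sections might look, assuming a plain #ifdef on USE_ASSIGNMENT_SERVER. The function name molecule_source and the LOCAL_MOLECULE_DIR path are hypothetical, made up for illustration; only the define name comes from mother.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* In mother.h this line would be uncommented to enable the server path. */
/* #define USE_ASSIGNMENT_SERVER */

/* Hypothetical local testing directory; the real path is not given here. */
#define LOCAL_MOLECULE_DIR "testing/molecule"

/*
 * Write a description of where the molecule comes from into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int molecule_source(char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
#ifdef USE_ASSIGNMENT_SERVER
    /* Stub: the assignment server side does not exist yet. */
    n = snprintf(buf, len, "assignment-server (stub)");
#else
    n = snprintf(buf, len, "%s", LOCAL_MOLECULE_DIR);
#endif
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With the define left commented out, the local directory branch is the only one compiled, so users never see a runtime flag for it.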
To Do:
- Clean up error checking WRT conf files, grompp/mdrun errors, etc. Useful error messages.
- Mother/Child/Nanny should clean up after themselves
- Have child send the result files to the mother when finished. Put in the infrastructure for when we know exactly the number and type of files to send.
Posted by hursejo at
02:23 PM
September 24, 2004
Meeting Minutes - Sept. 24, 2004
Folding@Clusters
- Can we put a logging level entry in mother.conf? We shouldn't use a command line option because they may not be very portable.
Should we implement --version in mother, child, nanny or should we always display this no matter what? The latter means that they are more portable to systems without command lines (Windows?) [Josh H]
Clean up the chattyness (word?) of the display. Possibly add an extra int to the Print function.
--version in both places [command line + exit, always print].
- Need to put in a copyright and developer info [Charlie?]
- Josh M has F@C running.
- Should we put a moleculeDirectory entry in mother.conf for use during testing or just leave it hardcoded? [Josh H]
One way it could work would be if moleculeDirectory is present and populated in mother.conf then the molecule comes from there, if not then we go to the assignment server.
Keeping this as a compile-time option is the best way to keep users away from this flag. Use defines to switch between the two.
- Tabs and 80 column tour through the source code. [Charlie]
- It appears that molecules use different sets of input files, both in number and type.
- How should we handle this?
Assignment server should send us a list of filenames to create, then download them all.
- Do the command lines we issue to grompp and mdrun need to adapt?
Yes. We need to adapt to how they want to run. 2 classes of input, via input and output flags. Josh M can elaborate.
- We need to get a list of what they want run, and what they need sent back to the server.
- Mother/Child/Nanny should clean up after themselves [Josh H]
- Testing
- The README in testing needs to be updated and a grid of tests to be performed should be developed. [Charlie]
- JohnS needs a tour and list of testing to do starting RSN. If JoshM can give him the tour, then CharlieP will get a list of tests to perform ready.
- Is the molecule repository complete and correct now? Complete with pathless cpp, checkpointing, all original files, all molecules present, etc.? [Josh M]
Yes, except for project-1012. We will incorporate the new molecules as they come in from Stanford.
- Hopper immutable function is chflags.
- Checkpointing and restarting with same number of nodes. [Josh M]
- Instructions for developers for making a release. [Charlie]
- Timezone problem with COSM. [Charlie]
- CPU count detection problem with COSM. [Charlie]
- Building and running correctly with AltiVec instructions on PPC. [Charlie]
- Clean up error checking WRT conf files, grompp/mdrun errors, etc. Useful error message. [Josh H]
- Nanny should avoid sending the same checkpoint file multiple times. [Josh H]
- Have child send the result files to the mother when finished. Do we know exactly the files that we need? [Josh H]
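For the logging-level item above (an extra int on the Print function), here is one possible shape. This is only a sketch: the level names, the log_threshold variable, and the exact Print signature are assumptions, not the code in the repository. The threshold could later be filled in from a mother.conf entry.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical level names; the minutes only ask for "an extra int". */
enum { LOG_QUIET = 0, LOG_NORMAL = 1, LOG_VERBOSE = 2 };

/* Messages above this level are suppressed. */
static int log_threshold = LOG_NORMAL;

/*
 * Print the message only if its level is at or below the threshold.
 * Returns the vprintf result (characters written, negative on an
 * output error), or 0 if the message was suppressed.
 */
static int Print(int level, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(fmt != NULL);
    if (level > log_threshold)
        return 0;
    va_start(ap, fmt);
    n = vprintf(fmt, ap);
    va_end(ap);
    return n;
}
```

A call such as Print(LOG_VERBOSE, "pinging child %d\n", rank) then stays silent at the default threshold, while banner-style messages at LOG_QUIET always appear.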
Conferences
- LinuxFest
- Review the presentation outline from Grab-a-Byte and get your feedback to CharlieP. We should talk about this next week. [All]
John is not going to attend this with us.
- SC2004
- The projector, table-top screen, etc. arrived on Thursday. Purdue University is going to give us space in their booth and possibly a presentation slot. Next week we should discuss the logistics of the trip.
Why not Indiana's Booth?
General
- MT updates pre-meeting.
- Plumbing: Bugzilla setup on admin. [Charlie]
- Logistics: Left over money ideas: power supply/backup infrastructure -- In light of the situation the other day with the power outage.
Remote console would be nice, but the first point is more important.
Plane tickets to California for F2F meetings at Stanford.
Food is always good
Posted by hursejo at
07:35 AM
September 23, 2004
Update - Josh H - Sept. 23, 2004
Worked on: Folding@Clusters
- Fixed One Node problem with network capability discovery.
Note that this problem may or may not arise depending upon network configuration. On cairo I was able to execute the test fine using just n0 [or the localhost]; however, this case could fail depending upon the platform. I put in a check so that if we are running on one node we don't run the network tests, and just zero out the results for those two items.
- Tested on non-NFS cairo, and posted a few notes in a previous message.
- The grompp shared library issue has been fixed with an environment variable. This is also in the previous post to MT about non-NFS systems.
- Tested on Bazaar [NFS and Non-NFS], everything was fine. The Non-NFS takes a while to complete the network capability discovery stuff when working with the Hub in bazaar annex.
To Do List
- Clean up error checking WRT conf files, grompp/mdrun errors, etc.
- Nanny should check to make sure that the checkpoint file has changed since the last time that it sent it to the mother. This is to avoid sending the same file multiple times.
- Have child send the result files to the mother when finished. Do we know exactly the files that we need?
- Implement --version [--help as well?] in mother [consider child and nanny as well]
- [With Charlie] Fix the altivec linking issue with the child on PPC/linux. Does this happen on PPC/Mac OS X?
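For the --version item above, a rough sketch of one way to do it, assuming the version comes from expanded CVS keywords ($Name$ for the tag, $Revision$ for the revision). Both function names are hypothetical, and accepting -v as a short form is an assumption too.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the value out of an expanded CVS keyword such as
 * "$Revision: 1.29 $" into out. Returns 0 on success, -1 if the
 * keyword is unexpanded, malformed, or does not fit in out.
 */
static int cvs_keyword_value(const char *kw, char *out, size_t len)
{
    const char *start, *end;
    size_t n;

    assert(kw != NULL && out != NULL && len > 0);
    start = strstr(kw, ": ");
    if (start == NULL)
        return -1;
    start += 2;
    end = strstr(start, " $");
    if (end == NULL)
        return -1;
    n = (size_t)(end - start);
    if (n >= len)
        return -1;
    memcpy(out, start, n);
    out[n] = '\0';
    return 0;
}

/* Returns 1 if the arguments ask for the version (--version or -v). */
static int wants_version(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++)
        if (strcmp(argv[i], "--version") == 0 || strcmp(argv[i], "-v") == 0)
            return 1;
    return 0;
}
```

The same two helpers could back both the startup banner and the print-and-exit option, so mother, child, and nanny report the version the same way.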
Posted by hursejo at
10:05 PM
September 22, 2004
Non-NFS notes
So I have tested with the Non-NFS nodes on cairo and compiled a few notes. Overall everything worked fine. I built on c1, and tested on c12-c15 with the release binaries.
- On compute nodes we (by default) set our working directory to $HOME. So you get directories created like:
Creating Directory /home/test1/work/nanny/
Creating Directory /home/test1/work/nanny/
We should allow the user to set this directory. Currently they can via mpirun -wd DIR Which will set the working directory on each remote node to the argument (DIR), before starting work.
We could build in the capability to allow the user to set the working directory per node in the configuration.
For the moment the mpirun option is the best choice, but we should consider this question a bit.
- GROMPP Running problems/Fix:
You must export the GMXLIB and GMXLIBDIR environment variables BEFORE starting lamd via lamboot.
export GMXLIB=/full/path/to//release/top
export GMXLIBDIR=/full/path/to//release/top
lamboot -v
This is because once lamboot is executed it sets the environment variables, and any changes or additions are not propagated until the lamd is restarted.
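Since a forgotten export only shows up later as a grompp failure, a startup check might help fail early. The sketch below assumes the mother (or whichever binary launches grompp) would call it before doing any work; the function name missing_env is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Count how many of the named environment variables are unset or
 * empty, printing a hint for each one. Returns 0 when all are set.
 */
static int missing_env(const char *const names[], size_t count)
{
    size_t i;
    int missing = 0;

    assert(names != NULL);
    for (i = 0; i < count; i++) {
        const char *value = getenv(names[i]);

        if (value == NULL || value[0] == '\0') {
            fprintf(stderr, "%s is not set; export it before lamboot\n",
                    names[i]);
            missing++;
        }
    }
    return missing;
}
```

Called with the names GMXLIB and GMXLIBDIR, a non-zero result would mean the lamd was booted without them and needs a restart after the exports.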
Posted by hursejo at
08:38 PM
September 17, 2004
Meeting Minutes - Sept. 17, 2004
Folding@Clusters
- Do a checksum before using any plain text files, to ensure no bad things happened since we downloaded them.
- Fix One Node problem with network capability discovery.
- Restart command and logic in code. Top of Josh M's list.
- Clean up the /cluster/molecule directory. Josh M reports that he is making progress.
Make read-only and immutable
Josh H gave him the Villin-Urea stuff and put it in ~joshh/dev/molecule
- grompp shared issue (#3 for Josh M). Look for the following files at minimum
release/work/mother/topol.top:11:23: ffG43a1.itp: No such file or directory
release/work/mother/topol.top:2494:19: spc.itp: No such file or directory
- CPP (C pre-processor) in mdp configuration. [#4 Josh M]
How does F@H do this? They may just use option 1, can we confirm this.
- Why hard code the full path? make this just "cpp"
- If we have to hard code the full path, document to the user that they have to edit the mdp file before running.
- Look for cpp on their system, and edit the mdp file.
- Josh H pushed the configure.ac file that charlie made to CVS.
- Node map to users: File in generic/doc/node-usage.html
Keep this up to date.
- Josh M & Charlie need to build and Run the folding-at-clusters stuff on, at least, cairo. To get into the change-build-test mantra.
- Charlie is working on documentation for install.
- Packaging Binaries:
- Decide OK
- Tag
- checkout tag
- Build
- Make tarball (out of release directory -- cleaned, meaning no working directories or files)
We can get rid of the release directory in CVS. The makefile can create this for the developers. [Charlie]
- See mail message about expected runs
- TimeZone & detection of CPUs in COSM
- Press:
- Web presence
- 1 Page overview
- T-Shirts?
- Poster
General
- Fix consistency when make'ing on Bazaar. check autoconf, automake, etc...
- There was a bit of ownership changes as a result of Josh M's script using [from root]:
chown -R :users *
on your directory should fix any residual problems.
CVSROOT has been fixed by Josh H. Someone else should check security here, just to make sure.
- Post F@C Beta:
- Bazaar and Cairo images all nodes. This involves getting the systemimager server running. Think Consistency.
- --version in mother
- Need to finish up presentation for Ohio LinuxFest
Posted by hursejo at
10:10 AM
September 16, 2004
Update - Josh H - Sept. 16, 2004
Folding@Cluster
- Fixed the child binary on cairo by not linking in the altivec assembly code.
This needs to be corrected, but for now we have a working setup for PPC.
Pending
- Test code on Bazaar NFS
there may be a problem with the discovery code on x86.
- Test code on Non-NFS Bazaar and Cairo
- Nanny should check to make sure that the checkpoint file has changed since the last time that it sent it to the mother.
Now it sends the file every time it pings the child, which leaves open the opportunity that we send the same file twice to the mother.
- When Child is finished and ready to report it should:
- Notify the mother that the children are finished
- Mother then asks for the file(s)
- Child transfers Final files to Mother via MPI
- Children are released
- Have the midwife (what will be the F@C Server) send grompp, child, nanny to the mother via HTTP.
- Fix the altivec linking issue with the child on PPC/linux. Does this happen on PPC/Mac OS X?
Posted by hursejo at
12:35 PM
September 13, 2004
Building Folding-at-Clusters
There are now complete instructions for building Folding-at-Clusters in folding-at-clusters/source/README.
If you have any problems building with those get in touch with charliep.
Posted by charliep at
12:06 AM
September 10, 2004
Running GROMACS
It seems most of our runs (including the scheduler and some of the command lines in the gromacs overview doc) were tripped up by inconsistent command line flags between grompp and mdrun. The '-c' flag generally deals with .gro files, but it does different things depending on the executable: grompp uses the -c flag for input, while mdrun uses it for output. Simply put, we were overwriting our .gro input files with the .gro output from the simulation.
Things used to look like this:
grompp -f grompp.mdp -p topol.top -c conf.gro -o villin.tpr
mdrun -s villin.tpr -o villin.trr -c conf.gro -g villin.log
When they should've looked like this:
grompp -f grompp.mdp -p topol.top -c conf.gro -o villin.tpr
mdrun -s villin.tpr -o villin.trr -c output.gro -g villin.log
Realistically, we should cut the -c option from mdrun unless we determine
.gro output is the way to go.
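If the command lines end up being built in code, a small guard could keep this mistake from coming back. A sketch, assuming the villin file names from the example above; build_mdrun_cmd is a hypothetical helper, not something in the scheduler today.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the mdrun command line, refusing to reuse the grompp input
 * .gro name as the mdrun -c output name. Returns 0 on success, -1 if
 * the names collide or the buffer is too small.
 */
static int build_mdrun_cmd(char *buf, size_t len,
                           const char *input_gro, const char *output_gro)
{
    int n;

    assert(buf != NULL && input_gro != NULL && output_gro != NULL);
    if (strcmp(input_gro, output_gro) == 0)
        return -1;              /* would overwrite the input structure */
    n = snprintf(buf, len,
                 "mdrun -s villin.tpr -o villin.trr -c %s -g villin.log",
                 output_gro);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

Passing conf.gro for both names is rejected, while conf.gro and output.gro produce the corrected command shown above.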
Posted by mccoyjo at
11:54 AM
Meeting Minutes - Sept. 10, 2004
Numerical Methods
- With Josh M's solution to the grompp/mdrun file, John (as well as the rest of us) needs to revisit all previous tests that use the previous setup.
- Josh M will do a find/exec on the cluster file system, searching for mdrun or grompp, marking these as potential failures.
- Josh M will post the grompp/mdrun info from the mail message to MT. John will pick this up and change all the necessary things.
- John needs to check out the gromacs source from CVS via
cvs checkout gromacs-3.2.1
Do not use the --enable-fac flag in the GROMACS configure script.
- Place known good copies of the molecules in to the molecule repository Read Only.
Folding@Clusters
- Discovery: Use COSM for everything that we can. Write our own code for network stuff.
- Makefile fixup is Charlie's token, Josh H will stay off it until the go flag is set.
- Charlie will add mdrun.o to the list of generated object files.
- Josh M has notes about how to get rid of generated pedantic errors
- In folding-at-clusters/source/Makefile add "-lm" to the F_AT_C_CFLAGS list.
- Remove the "charliep" hardcoded stuff from the includes for GROMACS in mother.h, and child.h
the child.h file is the only file that needs it since we no longer link grompp into the mother.
- Josh H cleaned the /cluster/project/folding-at-clusters
- Josh H is going to push his folding-at-clusters repository, and Charlie is going to push the gromacs-3.2.1
- Running directory Structure.
- log
- work
- bin
- conf
- mother.conf
- nannyHost.conf
- childHost.conf
- folding-at-clusters
- source
- documentation
- release (empty directory -- populated by make release in source directory)
- Josh M is waiting for a mail message from Josh H.
- Religion:
- All code and comments, where possible [not function calls], to 80 columns.
- Use tabs, not spaces when editing.
- COSM -NO-THREADS ? Take out due to HTTPD on mother which needs at least x2 threads for max load.
- Josh M is going to work on the restarting in folding-at-clusters
- Talked about packaging. Primary concern: developers build binaries, and testers use binaries to simulate the user experience (testers use tools to ensure the tagged code is current). Two/Three phase tagging (v1-29, v1-30, tester bless, v2-0, ...). Charlie will post more details.
Posted by hursejo at
10:07 AM
September 09, 2004
New F@C Repository Building Notes
Here are some notes that I had about building the F@C + COSM + gromacs-3.2.1:
I did all of my work on Cairo, there may be details that need to be changed for Bazaar/OSX.
- gromacs-3.2.1 Repository
- ./configure
--prefix=/cluster/home/joshh/cvs/gromacs-cairo-bin/
--enable-f_at_c
--enable-mpi
--enable-mpi-environment=GROMACS_MPI
--enable-float
--disable-software-recip
--enable-software-sqrt
--disable-x86-asm
--enable-ppc-altivec
--disable-cpu-optimization
CPPFLAGS=-I/cluster/cairo/software/fftw-2.1.5-Baseline/include
LDFLAGS=-L/cluster/cairo/software/fftw-2.1.5-Baseline/lib
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin:/bin:/cluster/cairo/software/lam-7.0.2/bin:/cluster/cairo/software/lam-7.0.2/bin
- Don't use make -j 2 it will not build out of the box (at least for me), use the single threaded version make
- You will need to do a make install in order to get the grompp binary (and maybe others).
- cosm repository
- cd cosm/v3
- ls make
- ./build linux-ppc -DNO_THREADS
Why NO_THREADS ??
- folding-at-clusters repository
- I fixed the /cluster/project/folding-at-clusters repository so it now represents the current structure of the folding-at-clusters repository.
- I updated the graph in the documentation
- I added the discovery code to the source tree
- I changed the Makefile to build the child in place instead of in the gromacs directory.
Final Mark: Unable to link to the gromacs stuff, because it does not build properly. The problem is that we have the child now compiled in GROMACS. The child needs the defines from COSM. COSM is currently not linked into the GROMACS Makefiles (specifically the src/kernel/Makefile).
If we add the COSM stuff to GROMACS will it break GROMACS (requiring that we convert it to COSM)?
Does it break our model of separation between COSM, folding-at-clusters, and gromacs-3.2.1?
Is it easier to use the old model of compiling the child within folding-at-clusters?
Posted by hursejo at
08:32 PM
September 04, 2004
Building GROMACS
Generally you should build GROMACS using ./configure, make, etc. rather than a Perl script. This gives you control and feedback often needed.
There are three variables that we generally need to modify when building GROMACS: CFLAGS, CPPFLAGS, and LDFLAGS. CPPFLAGS and LDFLAGS can be modified through the shell's environment before calling ./configure. CFLAGS must be modified in acinclude.m4, and then the configure environment must be rebuilt.
The following three step procedure sometimes does more work than required but it always produces the correct result (or an error), rather than silent failure.
1) CPPFLAGS and LDFLAGS are normally used to specify the location of the FFTW include files and libraries.
2) CFLAGS is normally used to control profiling, debugging, etc.
Edit acinclude.m4, near line 770 you will see where to modify xCFLAGS.
3) Rebuild the configure environment, configure, and build.
make distclean
aclocal; autoheader; automake; autoconf
./configure [configure options]
Check Makefile to be sure it looks like it should WRT CFLAGS, etc.
make -j 2
make install (optional)
Posted by charliep at
10:54 AM
September 03, 2004
Meeting Minutes - Sept. 3, 2004
Numerical Methods
- Default (with SSE) & without SSE (both bazaar and cairo) time is from running the code without profiling data.
Rerun those 4 sets to fix this issue.
- Add innerloops per protocol from earlier meeting minutes. (~= top 2, over 20% always stays in list)
- Figure out gprof in the context of the output files in the chart.
- "compiler flag" => "option"
- Research how to use a perl script to automate the chart data.
- Line items:
- Use the compiler option to Not Unroll loops. (man gcc, look at CFLAGS in the scripts). With -pg this may be turned off by default (95% confirm).
Want to add the -fno-unroll-loops option to the compiler CFLAGS
CFLAGS, GROMACS, and configure Problem
- NB See Building GROMACS entry for the fix to this CFLAGS problem.
- Pitfall Compiler options: If !CFLAGS set then get set A, if CFLAGS set then get set B. Where A != B. Research uniformity.
- We want the -pg option to be appended to the default setup, not replacing the default setup.
- Replicate fix in 'all the right places' WRT multiple platforms.
- Numerical Methods: Think about this and report back with new ideas.
- B-and-T-GROMACS: table for now, but revisit later.
- GROMACS WRT F@C: Edit configure, and add everything we need directly into the script. (FFTW, Optimization Options specific to ppc -- other archs? --, )
Running GROMACS
- Figure this out! Use one of the following 2 options:
- How to use command line options to preserve the .gro file, not overwriting the original. (Josh M)
- If that is not possible, rework every script to get the fresh molecule set every time we run grompp/mdrun (cvs update -C ? Must overwrite the file you are using with the good file in CVS).
We need to re-run the NM tests to behave according to the above.
F@C
Posted by hursejo at
09:32 AM
September 02, 2004
Update - Josh H - Sept. 2, 2004
Folding@Cluster
- Extracted grompp from the mother. It is now executed as a separate binary.
- Checkpoints are now sent by the nanny upon the request of the mother.
Details: The Nanny notifies the Mother whenever it has a checkpoint. When the mother decides that it wants to pull said checkpoint it tells the nanny to send it.
- The mother will keep a backup of the previous checkpoint in case we need it. The checkpoints are placed in the mother's molecule directory.
- Restarting works in the Basic setup: Restart from the beginning.
I am waiting on a final decision about how to restart properly before implementing it.
- Removed the Irecv and Isend calls in the mother and nanny and replaced them with Iprobe calls.
This makes more sense for the situations in which we use them. Also it is safer in our context.
- Instead of decrementing the number of children used upon every restart, we now try to restart with the same number of nodes N times (where N = 2 for now) before decrementing the number of children.
- Fixed bug: Nanny used to try to send a non-existent checkpoint file to the mother. Checks are now in place to ensure that the file exists and is larger than 0 bytes.
- Transfer gro and tpr files via MPI instead of via HTTP between child and mother.
Questions
- What files should the Child send to the Mother when it finishes computation? Just the tpr file?
Pending
- When Child is finished and ready to report it should:
- Notify the mother that the children are finished
- Mother then asks for the file(s)
- Child transfers Final files to Mother via MPI
- Children are released
- Have the midwife (what will be the F@C Server) send grompp, child, nanny to the mother via HTTP.
We can place them in the root path directory for now. We may want to put them in another directory [bin], but F@H just dumps them in the working directory and that is the least overhead.
- Remove all hardcoded values. There are a few -- search for joshh
- Test on NFS Cluster
Posted by hursejo at
09:00 AM