Earlham College Cluster Computing Lab
General Project
February 04, 2005
Meeting Minutes - February 4th, 2005
- General
- Poster
- Travel: the plane leaves at 11a Thursday; leave Richmond 9ish. Return 5:30p on Tuesday.
- Food and work Wednesday night at the ranch.
- Stuff to take
- projector
- printer? (blow off if the resort has color printing)
- gear bag (not whole thing; access point)
- screen for projector
- Poster tube (link in clustcomp archives)
- Long jumper
- Numerical Methods
- converting GROMACS files to use with NAMD. Will document in cvs when done. see clustcomp message. John and Josh
- Look at dihedrals (what are they and how do they manifest in our configuration files). do we have any? John
- John's literature search is going well. More goodness to come.
- Literature search - JohnS
- inverse square root, molecular dynamics, vector
MT
Link
Create a Literature Notes file in CVS
(numerical-methods/doc/literature-search-notes.txt)
- get BibTeX from John's IEEE articles. John & Charlie
- Charlie has info on 1/x^2 via Peter. He'll share soon.
- Folding@Clusters
- testing: good db of runs (platform, molecule, and node combinations). next is non-NFS testing in /tmp/
- always use .gro files for input so we don't have to deal with converting the pdb files
- 3 flavors of failing. Let's start to think about how to catch these errors.
- die due to LINCS error
- fail a bunch of times and max out restarts
- activity stops, we use cpu, no new checkpoints being received, stalls in simulation
- stalls during capability discovery
- bug in capability discovery (netpipe); look at joshh's kludge
- to catch failures, mother forks a process outside of the mpi world for the netpipe bandwidth test; mother has a timeout feature on nannies and, when triggered, resets the world.
- list of molecules that consistently fail. charlie
- wrap mpi calls for value or kill to avoid race condition created by listen for message with certain flags.
- DVC <-> results conversation. charlie & joshm
- get ps/day information from mdrun output. get from log file on child 0
- preserve log in a mother.conf or molecule.conf option
- how many mass points in a system. can we tell how many mass points we have given our config files? wc -l on top file is a possible solution. John
- package up a2 after siam for pande
- 1-8 nodes on bazaar old image, 1-4 on new images testing done.
- two types of testing a2 needs before release: Failure testing and Non-NFS testing.
- F@C vs F@H - use b19 for x86 single cpu run
- Failure and Recovery (with fault tolerance checking) - not done.
- Look at folding-at-clusters/documentation/index.html
- Killing lam on non n0 nodes.
- SMP testing - Charlie will contact Henry Neeman and get the details
for access to OSCER's SMP box to JoshH who will do the testing.
- Review/Update protocol.txt - Revisit after SIAM. Make a phase diagram
so we can figure out how to test checkpointing, failure, and recovery
completely.
- Cleanup of files on non-NFS systems - Revisit after SIAM.
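The timeout idea noted above for the netpipe bandwidth test (run the test in a process outside the mpi world and give up if it stalls) could look roughly like the sketch below. It is only an illustration in Python, not the mother's actual code: the function name run_with_timeout, the status strings, and the stand-in commands are all hypothetical, and the real netpipe invocation is not shown here.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a separate OS process and return (status, stdout).

    status is "ok", "failed" (non-zero exit), or "timeout".
    Hypothetical helper; the names are not from the F@C code.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a stalled
        # test does not linger; the caller can then reset the world.
        return "timeout", ""
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout


if __name__ == "__main__":
    # Stand-in commands (assumptions): a quick test and a stalled one.
    quick = [sys.executable, "-c", "print('bandwidth test done')"]
    stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(quick, 10.0))
    print(run_with_timeout(stalled, 0.5))
```

Because the test runs as an ordinary child process, a stall in it cannot block an MPI call in the parent; the "timeout" status is the point where a reset would be triggered.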
- Plumbing
- reimage all cluster nodes save 0th nodes.
- cluster names in cexec
- Gaussian working in the near future (next week)
- hopper kernel parameters
- move hdd out of bazaar golden client (b19) to another machine.
- usb <> ps2 for cairo part of kvm
- How to decide if /cluster/... vs locally on each node for a particular
package? Let's discuss this in more detail after SIAM.
- Other
- Instrument NAMD and GROMACS to print out the x to find its range. Hard part is finding where to put the instrumentation in the code. Maybe GNUplot it.
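Once the instrumentation prints x, finding its range is a small post-processing step. The sketch below assumes, purely for illustration, that the instrumented code writes one value of x per line, possibly mixed with other log lines; the function name value_range is hypothetical and nothing here comes from NAMD or GROMACS.

```python
import math


def value_range(lines):
    """Return (min, max, count) over the numeric lines, or None if none.

    Assumes one value per line; blank, non-numeric, and non-finite
    lines are skipped.
    """
    lo, hi = math.inf, -math.inf
    count = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue
        try:
            x = float(text)
        except ValueError:
            continue  # unrelated log output
        if not math.isfinite(x):
            continue  # ignore nan/inf so they do not distort the range
        lo = min(lo, x)
        hi = max(hi, x)
        count += 1
    if count == 0:
        return None
    return lo, hi, count
```

Usage would be along the lines of value_range(open(path)) for a captured output file; the same values could also be plotted with gnuplot to see the distribution rather than just the extremes.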
- Post SIAM
- midwife - see jan 28th meeting minutes
- Ganglia
- JoshM will post a software list for us to review and update.
- codeviz
- cruise last 3-4 weeks of meetings for items lost in the pre-SIAM madness.
Posted by charliep at February 4, 2005 09:50 AM