General Project

February 04, 2005

Meeting Minutes - February 4th, 2005

  • General
    • Poster
    • Travel: the plane leaves at 11a Thursday; leave Richmond 9ish. Return 5:30p on Tuesday.
    • Food and work Wednesday night at the ranch.
    • Stuff to take
      • projector
      • printer? (blow off if the resort has color printing)
      • gear bag (not whole thing; access point)
      • screen for projector
      • Poster tube (link in clustcomp archives)
      • Long jumper
  • Numerical Methods
    • Converting GROMACS files for use with NAMD. Will document in CVS when done. See clustcomp message. John and Josh
    • Look at dihedrals (what are they and how do they manifest in our configuration files?). Do we have any? John
    • John's literature search is going well. More goodness to come.
    • Literature search - JohnS
      • inverse square root, molecular dynamics, vector MT Link
        Create a Literature Notes file in CVS (numerical-methods/doc/literature-search-notes.txt)
    • Get BibTeX from John's IEEE articles. John & Charlie
    • Charlie has info on 1/x^2 via Peter. He'll share soon.
  • Folding@Clusters
    • testing: good db of runs (platform, molecule, and node combinations). Next is non-NFS testing in /tmp/
    • always use .gro files for input so we don't have to deal with converting the pdb files
    • 3 flavors of failing. Let's start to think about how to catch these errors.
      • die due to LINCS error
      • fail a bunch of times and max out restarts
      • activity stops, we use cpu, no new checkpoints being received, stalls in simulation
      • stalls during capability discovery
    • bug in capability discovery (netpipe); look at joshh's kludge
    • to catch failures, mother forks a process outside of the MPI world for the netpipe bandwidth test; mother has a timeout feature on nannies and, when triggered, resets the world.
    • list of molecules that consistently fail. Charlie
    • wrap MPI calls for value or kill to avoid the race condition created by listening for a message with certain flags.
    • DVC <-> results conversation. charlie & joshm
    • get ps/day information from mdrun output. get from log file on child 0
    • preserve log in a mother.conf or molecule.conf option
    • how many mass points in a system? Can we tell how many mass points we have given our config files? wc -l on the top file is a possible solution. John
    • package up a2 after SIAM for Pande
    • 1-8 nodes on bazaar old image, 1-4 on new images testing done.
    • two types of testing a2 needs before release: Failure testing and Non-NFS testing.
    • F@C vs F@H - use b19 for x86 single cpu run
    • Failure and Recovery (with fault tolerance checking) - not done.
      • Look at folding-at-clusters/documentation/index.html
      • Killing lam on non n0 nodes.
    • SMP testing - Charlie will contact Henry Neeman and get the details for access to OSCER's SMP box to JoshH who will do the testing.
    • Review/Update protocol.txt - Revisit after SIAM. Make a phase diagram so we can figure out how to test checkpointing, failure, and recovery completely.
    • Cleanup of files on non-NFS systems - Revisit after SIAM.
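
The failure-catching idea noted above (run the bandwidth test as a process outside the MPI world and have the mother enforce a timeout) can be sketched roughly as follows. This is a minimal illustration, not the project's mother/nanny code: the function name `run_with_timeout` and the stand-in commands are hypothetical, and a real run would launch the actual netpipe test and reset the world on a timeout.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a separate OS process and classify the outcome.

    Returns (status, output), where status is "ok", "failed"
    (non-zero exit code), or "timeout" (child killed after timeout_s).
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising,
        # so a stalled test cannot hang the caller.
        return "timeout", ""
    if done.returncode != 0:
        return "failed", done.stderr
    return "ok", done.stdout


if __name__ == "__main__":
    # Hypothetical stand-ins for a quick test and a stalled test.
    quick = [sys.executable, "-c", "print('bandwidth test finished')"]
    stalled = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_timeout(quick, 10.0))
    print(run_with_timeout(stalled, 0.5))
```

Because the child is an ordinary process rather than an MPI rank, a stall in it shows up as a timeout in the caller instead of blocking a collective call.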
  • Plumbing
    • reimage all cluster nodes save 0th nodes.
    • cluster names in cexec
    • Gaussian working in the near future (next week)
    • hopper kernel parameters
    • move hdd out of bazaar golden client (b19) to another machine.
    • usb <> ps2 for cairo part of kvm
    • How to decide if /cluster/... vs locally on each node for a particular package? Let's discuss this in more detail after SIAM.
  • Other
    • Instrument NAMD and GROMACS to print out x to find its range. Hard part is finding where to put the instrumentation in the code. Maybe plot it with gnuplot.
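
The range-finding instrumentation mentioned above amounts to keeping a running minimum and maximum of every value seen. The sketch below shows only that pattern, in Python; the class name `RangeTracker` is hypothetical, and the real instrumentation would have to live in the NAMD and GROMACS sources at whatever point x is computed.

```python
import math


class RangeTracker:
    """Track the smallest and largest value observed so far."""

    def __init__(self):
        self.lo = math.inf    # smallest value seen
        self.hi = -math.inf   # largest value seen
        self.count = 0        # number of observations

    def observe(self, x):
        """Record one value of x."""
        self.count += 1
        if x < self.lo:
            self.lo = x
        if x > self.hi:
            self.hi = x

    def report(self):
        """One-line summary, suitable for a log file."""
        return f"n={self.count} min={self.lo:.6g} max={self.hi:.6g}"


if __name__ == "__main__":
    tracker = RangeTracker()
    for value in [0.25, 9.0, 1.5]:
        tracker.observe(value)
    print(tracker.report())
```

Printing only the summary keeps the log small; dumping every value instead would give data that can be plotted directly.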
  • Post SIAM
    • midwife - see jan 28th meeting minutes
    • Ganglia
    • JoshM will post a software list for us to review and update.
    • codeviz
    • cruise last 3-4 weeks of meetings for items lost in the pre-SIAM madness.
Posted by charliep at February 4, 2005 09:50 AM