January 28, 2005

Meeting Minutes - January 28, 2005

Folding@Clusters


  • Poster outline. Hopefully this weekend - Charlie

  • Molecular systems in /cluster/project/molecules are ready to use. nsteps
    = 500, checkpointing, Readme. Still to be done is to remove the other options
    that are causing lots of extra files to be created. Charlie.

  • Major Bugs. See source/TODO for these (review weekly).

  • A2 Release Testing - Running tests on 1 through 4 nodes for about 12 molecular
    systems on bazaar and cairo. So far about 500 or so runs.

    • NFS/Non-NFS - not done.
    • Different Mol types - done.
    • Platforms: x86/Linux, PPC/Linux, PPC/OSX - done.
    • Number of nodes [2-16] - 1 through 4 nodes currently.
    • Failure and Recovery (with fault tolerance checking) - not done.

      • Look at folding-at-clusters/documentation/index.html
      • Killing lam on non n0 nodes.

    • Matrices

      1. NFS/Non-NFS + Platform
      2. Mol. Types + Number of nodes + Platform
      3. Failure/Recovery + Platform (high number of nodes)



  • SMP testing - Charlie will contact Henry Neeman and pass the access
    details for OSCER's SMP box to JoshH, who will do the testing.

  • Less focus on code for the next 2 weeks, work on poster.

  • Review/Update protocol.txt - Revisit after SIAM. Make a phase diagram
    so we can figure out how to test checkpointing, failure, and recovery
    completely.

  • Cleanup of files on non-NFS systems - Revisit after SIAM.


Numerical Methods


  • Nota bene - This section is unchanged from last week, which was unchanged
    from the week before that, ...

  • Basic MidWife - JoshM

    JohnS will send pointer to JoshM about what is there so far, and what
    needs to happen. [use the folding-at-cluster/testing directory]

  • Extensions to MidWife concept (new program and schema?) that supports
    building FFTW, GROMACS, etc. with particular configure and compiler/linker
    options. Table until next meeting...

  • Literature search - JohnS

    • inverse square root, molecular dynamics, vector
      MT
      Link


      Create a Literature Notes file in CVS
      (numerical-methods/doc/literature-search-notes.txt)

  • Poster outline - Charlie (this weekend)

  • NAMD Report: looking for benchmark molecules, Keep working on how to get
    Charm working - JoshM

Plumbing


  • Cairo image - F@C build problem, F@C run unknown, NAMD no, ganglia no

  • Bazaar image - F@C builds, F@C runs, NAMD no, GAUSSIAN no, ganglia no

  • DVC - results integration - JoshM and Charlie

  • New image notes

    • How to decide if /cluster/... vs locally on each node for a particular
      package? Let's discuss this in more detail after SIAM.

    • Postgres client libraries, binaries, DBI, and DBD::Pg on all client nodes.

    • JoshM will post a software list for us to review and update.

  • NAMD/CHARM - can run binary under OSX but can't install source and build
    under PPC-Linux. JoshM will try building on bazaar.

  • CodeViz stuff is installed on Bazaar /cluster/, JoshM will post instructions.
    Incompatibility between compilers, //tag and graphing. After SIAM.

  • Charlie has had two odd CVS experiences recently, TODO and gromacs. He'll
    look and see what if anything is going wrong here.

  • Cleaned-up DVC file spamming - JoshM.

  • distcc doesn't work with //tag option that GROMACS uses. Don't worry about
    it, distcc is really only useful for BCCD. JoshM.

Other


  • New copy of B-and-T GROMACS poster. After SIAM.

  • Mary Lou changed the plane ticket to have Josh's correct last name.

Posted by charliep at 09:49 AM

January 21, 2005

Meeting Minutes - January 21, 2005

Folding@Clusters


  • Poster outline. Hopefully this weekend - Charlie

  • Review/Update protocol.txt - Revisit next week. Make a phase diagram
    so we can figure out how to test checkpointing, failure, and recovery
    completely.

  • Major Bugs. See source/TODO for these (review weekly).

  • A2 Release Testing - Charlie has a simple framework and is running
    tests on 1-4 nodes for about 8 molecular systems on bazaar and cairo. So
    far about 200 or so runs. The only problems that have come up are either
    known or fixed, at least so far.

    • NFS/Non-NFS
    • Different Mol types: All...
    • Platforms: x86/Linux, PPC/Linux, PPC/OSX
    • Number of nodes [2-16]
    • Failure and Recovery - with fault tolerance checking.
    • Matrices

      1. NFS/Non-NFS + Platform
      2. Mol. Types + Number of nodes + Platform
      3. Failure/Recovery + Platform (high number of nodes)



  • SMP testing - Charlie will contact Henry Neeman and pass the access
    details for OSCER's SMP box to JoshH, who will do the testing.

  • Checkpointing - nstxout (in number of steps) is the mdp file option
    that controls checkpointing. Charlie will update documentation/README.

  • Cleanup of files on non-NFS systems - Revisit next week.

Numerical Methods


  • Nota bene - This section is unchanged from last week.

  • Basic MidWife - JoshM

    JohnS will send pointer to JoshM about what is there so far, and what
    needs to happen. [use the folding-at-cluster/testing directory]

  • Extensions to MidWife concept (new program and schema?) that supports
    building FFTW, GROMACS, etc. with particular configure and compiler/linker
    options. Table until next meeting...

  • Literature search - JohnS

    • inverse square root, molecular dynamics, vector
      MT
      Link


      Create a Literature Notes file in CVS
      (numerical-methods/doc/literature-search-notes.txt)

  • Poster outline - Charlie (this weekend)

  • NAMD Report: look for benchmark molecules, Keep working on how to get
    Charm working - JoshM

Plumbing


  • Cairo image - can't compile GROMACS, haven't tried running F@C. Re-imaging
    is working and is easy. distcc works. ntp working. updated c3 doc with
    version 4 syntax.

  • Bazaar image - isn't imaging yet. JoshM will contact Skylar.

  • Keep Non-NFS nodes [Cairo c13-15, Bazaar will be b13-b15]

    New images on Cairo c9-12, and Bazaar b9-12

    First test of new Images:

    • Bazaar - F@C, GAUSSIAN, NAMD, ganglia
    • Cairo - F@C, NAMD, ganglia


  • CHARM - can run binary under OSX but can't install source and build
    under PPC-Linux. JoshM will try building on bazaar.

  • CodeViz stuff is installed on Bazaar /cluster/, Josh M will post instructions.

  • Humidity solution on its way from John Walker. Not sure if it will
    be mounted in the ceiling with the HVAC or in the cluster closet.

  • GAUSSIAN on bazaar add to /cluster/

Other


  • Poster design meeting - 4p Wednesday

  • The hotel reservations are correct for SIAM CSE 05 now, checkin on Thursday
    and checkout on Tuesday (5 nights). According to Mary Lou the misspelling on
    the airline ticket can only be corrected when we get to the airport. They
    have put a note in the file about it, and we shouldn't have any problems, but
    I think we'll make sure to leave with plenty of time for the airport that day.
    JoshH, if you have a passport bring it.

  • IU tour - Yes, JoshH and Charlie will work this out later in the spring.

Posted by charliep at 02:32 PM

January 14, 2005

Design Meeting - Jan. 14, 2005

  • TODO File: see file for updates.
  • protocol.txt: revisit next week
  • Molecule configuration file: (itp files must go in top/, only used by grompp/preprocessing phase) - JoshH
    • GRO_FILE (mother & child)
    • MDP_FILE (mother & child)
    • TOP_FILE (mother & child)
    • ITP_FILES (mother) CSV [Optional] (top)
    • NDX_FILES (mother) CSV [Optional]
    • MAX_NODES_CLUSTER [Optional] appschema
    • PROCESSES_PER_NODE_CLUSTER [Optional] -np
    • MAX_NODES_SMP [Optional] appschema
    • PROCESSES_PER_NODE_SMP [Optional] -np
  • CPU Count - Charlie
  • Progress Meter - joshh
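
The molecule configuration keys listed above could be read along these lines. This is only a sketch in Python: the notes do not fix the file syntax, so the KEY=VALUE layout, the "#" comments, the function name parse_molecule_config, and the sample file names are all assumptions made for illustration.

```python
# Hypothetical reader for the molecule configuration file.
# Assumed syntax (not specified in the notes): one KEY=VALUE per line, "#" starts a comment.
REQUIRED_KEYS = ("GRO_FILE", "MDP_FILE", "TOP_FILE")
CSV_KEYS = ("ITP_FILES", "NDX_FILES")            # optional, comma-separated lists
INT_KEYS = ("MAX_NODES_CLUSTER", "PROCESSES_PER_NODE_CLUSTER",
            "MAX_NODES_SMP", "PROCESSES_PER_NODE_SMP")  # optional integers


def parse_molecule_config(text):
    """Return a dict of settings; raise ValueError on bad lines or missing required keys."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
        if not line:
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError("expected KEY=VALUE, got: %r" % raw_line)
        key, value = key.strip(), value.strip()
        if key in CSV_KEYS:
            config[key] = [item.strip() for item in value.split(",") if item.strip()]
        elif key in INT_KEYS:
            config[key] = int(value)
        else:
            config[key] = value                    # file names and any other keys stay strings
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    return config


example = """
GRO_FILE = villin.gro      # hypothetical file names
MDP_FILE = villin.mdp
TOP_FILE = villin.top
ITP_FILES = a.itp, b.itp
MAX_NODES_CLUSTER = 4
"""
settings = parse_molecule_config(example)
```

Keeping the optional keys out of the result when they are absent lets the caller fall back to its own defaults for node and process counts.
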
Posted by hursejo at 12:45 PM

Meeting Minutes - January 14, 2005

Folding@Clusters
  • Poster outline. Will progress this weekend - Charlie
  • Go through TODO file -- See next entry on F@C Design
  • Progress meter added via MPI between mother/child - joshh
  • Review/Update protocol.txt -- Revisit next week
  • Major Bugs:
    • Handle strings properly.
    • Command line option dropping - COSM kludge "-c" thing.
    • Fault Tolerance checking
    • Cairo villin bug. -- joshh can demo
    • Check lamnodes for correct string (n0,n1,n2 vs n3.b32,n10) - strip whitespace, strip n, run atoi.
    • mdrun should have as an argument the LamHosts list (n1,n2,n5,n6) and command line generation.
    • CPU count in COSM for Linux
  • A2 Release To Do:
    • Testing
      • NFS/Non-NFS
      • Different Mol types: All...
      • Platforms: x86/Linux, PPC/Linux, PPC/OSX
      • Number of nodes [2-16]
      • Failure and Recovery
      • Matrices
        1. NFS/Non-NFS + Platform
        2. Mol. Types + Number of nodes + Platform
        3. Failure/Recovery + Platform (high number of nodes)
    • FATC.conf file for molecule names. Basic stuff, and Other files -- see F@C Design for details
    • See Bugs above...
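
The lamnodes fix listed under Major Bugs (strip whitespace, strip n, run atoi) might look like the following. It is a sketch in Python rather than the project's code, and the function name parse_lam_nodes is hypothetical. The regular expression stands in for atoi by reading only the leading digits, so "n3.b32" yields 3. Unlike atoi, it raises an error when no digits are present instead of returning 0.

```python
import re


def parse_lam_nodes(spec):
    """Turn a node string such as "n0, n1,n2" into a list of node numbers."""
    nodes = []
    for token in spec.split(","):
        token = token.strip()                 # strip whitespace
        if not token:
            continue                          # tolerate empty fields such as a trailing comma
        if token.startswith("n"):
            token = token[1:]                 # strip the leading "n"
        match = re.match(r"\d+", token)       # atoi-like: leading digits only
        if match is None:
            raise ValueError("not a node id: %r" % token)
        nodes.append(int(match.group()))
    return nodes
```

For example, parse_lam_nodes("n0,n1,n2") and parse_lam_nodes(" n3.b32 , n10") both return plain integer lists that can then be compared against the expected node set.
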
Numerical Methods
  • Basic MidWife - JoshM
    JohnS will send pointer to JoshM about what is there so far, and what needs to happen. [use the folding-at-cluster/testing directory]
  • Extensions to MidWife concept (new program and schema?) that supports building FFTW, GROMACS, etc. with particular configure and compiler/linker options. Table until next meeting...
  • Literature search - JohnS
    • inverse square root, molecular dynamics, vector MT Link
      Create a Literature Notes file in CVS (numerical-methods/doc/literature-search-notes.txt)
  • Poster outline - Charlie (this weekend)
  • NAMD Report: look for benchmark molecules, Keep working on how to get Charm working - JoshM
Plumbing
  • New images for bazaar and Cairo, leave some subset of each functional until the new images are capable of running F@C.
    Cairo image setup is working. Hassan is working on Bazaar.
    Keep Non-NFS nodes [Cairo c13-15, Bazaar will be b13-b15]
    Image Cairo c9-12, and Bazaar b9-12
    First test of new Images:
    • Bazaar - F@C, GAUSSIAN, NAMD, ganglia
    • Cairo - F@C, NAMD, ganglia
  • c3 tools on all nodes working. send pointer to where these live for path - Josh M
  • Make b13-b15 Non-NFS - JoshM
  • GAUSSIAN on bazaar add to /cluster/
  • Send John an access point/NAT router. - Charlie
  • CodeViz stuff is installed on Bazaar /cluster/, Josh M will post instructions
Other
  • Review SIAM travel details:
    • JohnS is coming [up|down] Feb. 3rd
    • Charlie will fix date problems with Hotel reservation.
  • IU tour - yes JoshH and Charlie will work this out.
Posted by mccoyjo at 12:04 PM

January 07, 2005

Meeting Minutes - January 7, 2005

  • General
    • Meeting next Friday 10a-1p.
    • Joshm will follow up with John to make sure he has a working environment.
    • New camera setup.
    • SIAM accommodations are good, apart from fixing the typo.
    • Life will be good for all involved if we finish the poster before we get to Florida. Money for food; not poster.
  • Plumbing
    • SystemImager: Joshm will email Hassan asking for some help in learning SystemImager.
    • Work on general and 0th node docs.
    • Get 4 nodes with new images on each cluster for testing FATC. Joshm
    • Update node list. Joshm
    • Check out CVS email notification on commit. Look at it for the node list and FATC. Joshh
  • FATC
    • Time on SMP box via Henry.
    • non-NFS testing on c13-15. Joshh
    • Poster roundup. Charliep
    • No threads in the near future.
    • Making MPI easy to install is important in the near term and until we can remove MPI.
    • Signals for checkpointing. Hook in mdrun. Check out old cvs for info. Joshh
    • Add heartbeat protocol between mother and nannies to see if the nannies are still functional. 5 minute timer in mother. When time has expired, we check nannies.
    • GROMACS 3.3 beta fixed errno problem.
  • Numerical Methods
    • Charmm installation.
    • CodeViz on mdrun.
    • NAMD doc keep moving. Joshm
    • John literature search. John and Charliep
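
The heartbeat item under FATC above (a 5 minute timer in the mother, then a check of the nannies) can be sketched as below. This is an illustration in Python under stated assumptions: the class name HeartbeatTimer, the ping callback, and the injectable clock are all hypothetical, and the real check would go over MPI rather than a Python function call.

```python
import time

HEARTBEAT_INTERVAL = 5 * 60  # seconds; the 5 minute timer from the notes


class HeartbeatTimer:
    """Mother-side timer: when it expires, every nanny is checked once."""

    def __init__(self, interval=HEARTBEAT_INTERVAL, clock=time.monotonic):
        self.interval = interval
        self.clock = clock                      # injectable so the timer can be tested
        self.deadline = clock() + interval

    def expired(self):
        return self.clock() >= self.deadline

    def check_nannies(self, nannies, ping):
        """Ping each nanny, restart the timer, and return the ones that did not answer.

        `ping` is a hypothetical callable returning True if the nanny responded.
        """
        silent = [nanny for nanny in nannies if not ping(nanny)]
        self.deadline = self.clock() + self.interval
        return silent
```

The mother's main loop would call expired() between its other duties and only then pay the cost of contacting the nannies.
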
Posted by mccoyjo at 11:33 AM

January 02, 2005

Design Notes - process architecture

This is a description of the new process architecture developed after our experiences working with the a1 release.

Overall Plan:


    Mother MPI_Spawn()'s Nannies
    Mother MPI_Spawn()'s mdrun (we have removed the child, and just have
    mdrun)

Mother:


    1. Spawn Nannies, 1 per node in mother.conf
    2. Capability Discovery with Nannies
    3. Make mdrun<->Nanny assignments

      a) Get PID and Hostname information from mdrun (Init_FATC())
      b) Make nanny/mdrun assignments and distribute to nannies
      c) Reap any unused nannies
      d) Tell each nanny the # of children assigned to them

    4. Run grompp
    5. Spawn mdrun
    6. Collect periodic checkpoint files from nanny0
    7. When mdrun completes

      a) Completion of mdrun is indicated by mdrun0 sending a message to the mother. This message will pass the exit code (success or flavor of failure).
      b) Nanny0 will send all the necessary files to the mother

    8. Reap all nannies
    9. Report result to F@C server, get a new molecule, and restart with the new molecule
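
Step 3 of the mother's list (collect PID and hostname from each mdrun, assign them to nannies, reap unused nannies) can be sketched as follows. This is a minimal illustration in Python, not the project's code: the function name assign_mdrun_to_nannies, the (pid, hostname) tuple layout, and matching purely on hostname are assumptions made for the example.

```python
def assign_mdrun_to_nannies(nanny_hosts, mdrun_procs):
    """Group mdrun processes by the node they run on.

    nanny_hosts: hostnames with one nanny each (one per node, as in mother.conf).
    mdrun_procs: (pid, hostname) pairs reported by the mdrun processes.
    Returns (active, unused): active maps a nanny's host to its mdrun PIDs,
    unused lists hosts whose nanny has no children and can be reaped.
    """
    assignments = {host: [] for host in nanny_hosts}
    for pid, host in mdrun_procs:
        if host not in assignments:
            raise ValueError("no nanny on %s for mdrun pid %d" % (host, pid))
        assignments[host].append(pid)
    active = {host: pids for host, pids in assignments.items() if pids}
    unused = [host for host, pids in assignments.items() if not pids]
    return active, unused
```

The number of children to report to each nanny in step 3d is then just len(active[host]).
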

Nanny:


    1. Get # of children to look for with PID information from mother
    2. When the checkpoint file is updated nanny0 will send it to the mother
    3. When a nanny checkpoints/checks in with its mdrun process it will
    compare the cpu time from the last checkpoint with the cpu time from this
    checkpoint and

      a) if it has not changed then it will report the stale state to the
      mother
      b) if the process goes away then report that to the mother.

    *Still not sure how to do this in an elegant way.
    5. When mdrun finishes [mother tells all the nannies when this happens] nanny0 will send all of the files to the mother
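
One possible shape for the stale-process check in item 3 is sketched below in Python. It is not the project's code and does not settle the open question about an elegant approach. The function names are hypothetical, and reading CPU time from /proc/<pid>/stat is an assumption that only holds on Linux, so the PPC/OSX platform would need a different reader.

```python
def read_cpu_ticks(pid):
    """Return user + system CPU time in clock ticks, or None if the process is gone.

    Assumption: a Linux /proc filesystem is available.
    """
    try:
        with open("/proc/%d/stat" % pid) as stat_file:
            content = stat_file.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name may contain spaces, so split after its closing parenthesis.
    fields = content.rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])   # utime and stime (fields 14 and 15 of stat)


def classify_mdrun(previous_ticks, current_ticks):
    """Compare CPU time across two check-ins and name the state to report to the mother."""
    if current_ticks is None:
        return "gone"        # item 3b: the process went away
    if previous_ticks is not None and current_ticks == previous_ticks:
        return "stale"       # item 3a: no CPU time used since the last check-in
    return "running"
```

A nanny would keep the last reading per mdrun PID and call classify_mdrun at each checkpoint, reporting only "stale" and "gone" results to the mother.
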

mdrun:


    0. The mdrun binary will be renamed to fatc_child (as part of "$ make release")
    1. No source code changes except:

    • stderr -> stdout
    • error codes instead of exit()
    • Init_FATC code for PID/hostname communication, and freopen. This is called just after MPI_Init by all mdrun processes.
    • Finalize_FATC code for "finished" message to the mother. This is called just before MPI_Finalize by all mdrun processes.

Notes:


    1. The child.[c,h] files will be moved to folding-at-clusters/source/old

    2. We want to limit the changes we make in mdrun, but script based changes that are easy to apply are ok.

    3. There are some kludges in the way that the nanny 'finds' the mdrun
    process it is matched with. There are better ways to do this, but for the
    moment the kludges allow for a proof of concept and quick solution.

    4. We are using MPI_Spawn() instead of system(mpirun ...) because the former
    allows a bit more control over the MPI_COMM group for the mdrun process(s),
    whereas the latter completely separates the processes and adds some more
    challenges that are harder to overcome.


Questions:


    1. Set Nice/ProcessPriority level
    Answer: mdrun already has a command line option for setting the nice level.

    2. Redirect stdout to a log file [via freopen].
    Answer: Init_FATC function in mdrun. nannies will transmit the logs back to the mother on completion.

    3. Get Work Unit from mother [tpr, gro(?) files]
    Answer: Mother pre-populates files on the nanny0 node as JoshH suggested.
    This works if nanny0 and the mother are on different nodes and on NFS and
    non-NFS systems.

    4. Notification to the mother that we are finished.
    Answer: Finalize_FATC sends message to mother from mdrun0

Posted by charliep at 06:01 PM

January 01, 2005

New Image Plan

A couple of days ago I replied to JoshM's message about testing the new cairo image with questions about where to find what was installed, etc. I think I was headed down the wrong road and that we should instead focus on making it easy to add/change things in the image, re-image machines, and re-boot machines. More like the release early and often mantra.

  • 1) Get a basic image built. Done for cairo, to be done for bazaar.

  • 2) Check/update the list of 0th node items. Create the file /cluster/project/sna/0th-node.html, check documents in that directory and MT for items. Include how to rebuild with a new image and then apply changes.

  • 3) Check/update /cluster/project/sna/cluster-imaging.html, particularly the parts about modifying an existing image and forcing an update on one or a couple of nodes and then forcing an update on the whole cluster.

  • 4) Use F@C on a couple of nodes as a first-order test before calling an image ready to deploy on all the nodes in a cluster.

  • 5) Install the new images on all the nodes (including the 0th nodes) and we'll start using them and fix what's broken. We should make sure to keep all the documents referenced above updated as we go.

    Posted by charliep at 05:35 PM