August 31, 2004
Conversation with Vijay Pande
* What should we plan on sending back as a result?
science - trr, xtc (if present); last frame in trr used to generate next, xtc for analysis; mdp file controls whether an xtc file is generated.
log (unique extension) - CPU time, wall time, # restarts, cluster characteristics
8.3 file naming restrictions still apply
Distribution - package for Un*x systems? For beta do a tarball, later do packages.
Windows port - POSIX issues are handled by the C compiler
Future - can we use MPI-Lite or something similar so that we can embed it with F@C. The nanny could be a Windows Service.
Build all static binaries.
More large molecules? Ribosome? Not needed now.
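The 8.3 file naming restriction noted above could be checked before any result file is sent back. A minimal sketch, assuming names are limited to letters, digits, underscore, and hyphen (a simplification of the real 8.3 rules); is_8_3_name is a hypothetical helper, not part of F@C:

```python
import re

# 8.3 convention: a base name of 1-8 characters, optionally followed by a dot
# and an extension of 1-3 characters. The allowed character set here is an
# assumption made for illustration.
_EIGHT_DOT_THREE = re.compile(r"[A-Za-z0-9_-]{1,8}(\.[A-Za-z0-9_-]{1,3})?")


def is_8_3_name(filename: str) -> bool:
    """Return True if filename fits the 8.3 pattern described above."""
    return _EIGHT_DOT_THREE.fullmatch(filename) is not None
```

For example, villin.trr passes, while new.villin.tpr (two dots) and proteasome.trr (ten-character base name) do not.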
* Where we are:
Capability discovery
Startup - molecule preparation (grompp), lam startup, GROMACS startup
Progress monitoring
Checkpointing
Restarting
Results
* Tracked down 2 of the 4 compiler/optimizer errors we spoke about; we have a fix for them.
* Load distribution with GROMACS and MPI coming soon. Don't let this hold up the beta.
* How does hyper-threading affect us? Probably not at all.
* Using COSM for communication between mother and nannies with HTTP.
* Beta around the middle of the month?
Build a roadmap for the future.
How to test the quality of a GROMACS result? Not needed now.
How are we going to get molecules when in production? Folding@Home template core.
Posted by charliep at
02:54 PM
update
Folding@Clusters
- Found a new method for restarting that is one call to grompp. Check out the restart post.
- The latest cut of the code looks great. Once the canonized restart method is verified, it will be easy to integrate with F@C.
- Learned DBI and IO::sockets to use in the midwife.
- We have a molecule testing tool in the tools page that is connected to a database that is very useful for the midwife. It stores molecule names and directories, and gives each molecule an id. Connecting a midwife table seems to be the way to go.
- I've been having a hard time getting F@C to compile. More to come after I've explored this more thoroughly.
Todo List
- plumbing
- ssh without nfs.
- finish Images.
- fix system imager on admin. The version was updated and it stopped working.
- Status check on Athena.
- Fix KVM wiring problems.
- System imager on gentoo. Maybe newest version will work.
- Check out async NFS.
- run DBI::proxy daemon on hopper.
- Folding@Clusters
- Reaper
- Work on todo items without stepping on Joshh's toes.
- Place mechanisms for wall time, number of restarts, and number of nodes in F@C for testing with the midwife.
Posted by mccoyjo at
10:06 AM
August 29, 2004
John - Sunday update
I have been slowly filling in the table. I find it is easy/alright to keep runs going while doing other homework. I haven't done any planning ahead or looked at the big picture I am so busy assembling. Yep, that is about it.
Posted by schaejo at
08:05 PM
August 24, 2004
Meeting Notes - Aug. 24
General:
- SC2004: Charlie found money to cover everything 'else' in addition to the Student Program. Josh H will register when more is known about his schedule.
- SIAM: Need to get Num. Methods abstract in. 5 authors. F@C Abstract will go as is.
- Ohio LinuxFest: Need to only register John. Josh^2 and Charlie are already registered.
F@C:
- Restarting: Josh M has the method. He believes that the program does all the checking to make sure we don't propagate errors.
Check with Pande Labs about the accuracy of our method. Point them to the MT entry. (Josh M, Charlie P)
- Testing:
Need a piece of script to: (midwife??) HTTP communications, this should be a separate process
- Get next molecule that has not been run from the Database: TestID, Molecule Name, names of the conf/mdp/etc files
- Put those files on the Annex
- Start mother (not the process)
- When mother is finished running, extract running information and put it into the database. Number of restarts, size of universe (number of nodes), wall time from molecule start to molecule end, in the future transmit the tpr file.
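The testing script steps above (fetch the next unrun molecule, then record run information once mother finishes) can be sketched as follows. This is only an illustration: it uses Python's sqlite3 with an in-memory database as a stand-in for the group's database and DBI, and the table layout, column names, and function names are all hypothetical.

```python
import sqlite3


def open_db():
    """Create an in-memory stand-in for the molecule test database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE tests (
               test_id INTEGER PRIMARY KEY,
               molecule TEXT, conf_file TEXT, mdp_file TEXT,
               restarts INTEGER, nodes INTEGER, wall_seconds REAL,
               finished INTEGER NOT NULL DEFAULT 0)"""
    )
    return db


def next_unrun(db):
    """Return (test_id, molecule, conf_file, mdp_file) for the next unrun test, or None."""
    return db.execute(
        "SELECT test_id, molecule, conf_file, mdp_file FROM tests "
        "WHERE finished = 0 ORDER BY test_id LIMIT 1"
    ).fetchone()


def record_result(db, test_id, restarts, nodes, wall_seconds):
    """Store the run information and mark the test as finished."""
    db.execute(
        "UPDATE tests SET restarts = ?, nodes = ?, wall_seconds = ?, finished = 1 "
        "WHERE test_id = ?",
        (restarts, nodes, wall_seconds, test_id),
    )
    db.commit()
```

Copying the files to the Annex and starting mother would sit between the two calls; those steps are left out because they depend on the cluster setup.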
Random Reaper: Randomly kill an mdrun process -- Gremlin -- Coroner.
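A minimal sketch of the Random Reaper idea, assuming the list of mdrun process ids is gathered elsewhere. The function name and the injectable kill and rng parameters are hypothetical and exist only so the sketch can be exercised without killing real processes:

```python
import os
import random
import signal


def reap_one(pids, kill=os.kill, rng=random):
    """Pick one pid at random, send it SIGTERM, and return the pid chosen.

    Returns None when there is nothing to kill.
    """
    candidates = list(pids)
    if not candidates:
        return None
    victim = rng.choice(candidates)
    kill(victim, signal.SIGTERM)
    return victim
```

Passing a recording function as kill gives a dry run, which may be useful while the restart logic is still being tested.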
- Fully implement error handling
- Annex: Test Users (test1-3).
Will add documentation about ssh key exchange on non-NFS system.
LAM is installed on all nodes.
- Scheduler (B-and-T-G) should be additional Perl scripts.
- Need to pursue struct to pass around arguments in mother/child/nanny
General
- Tobias: Charlie will chat with him WRT the Group
- NFS Async option may give a performance boost.
- Regular weekly meeting on hold until Josh H has a definite schedule.
Posted by hursejo at
02:03 PM
update
- Plumbing
- Folding@Clusters
- See my earlier mt entry for the latest details on restarting.
- Completed more code review.
- General
- I plan to increase my presence in Dennis for both accessibility and focus issues.
- John and I have talked about the student volunteer option and SC2004. Coordination on that front is happening.
Posted by mccoyjo at
12:57 AM
August 23, 2004
restarting method for varying numbers of nodes
grompp -f mdout.mdp -c d.villin.tpr -p topol.top -e ener.edr -t d.villin.trr -np 8 -o new.villin.tpr
I believe I have found the proper gromacs tools and what to use them on for restarting a run with a different number of nodes.
tools used:
files initially required:
- original mdout.mdp
- original .tpr
- original mdout.mdp
- result or checkpoint file (.trr)
files created:
Process:
- Create a .gro file that incorporates the result and the topology files.
trjconv -s original.tpr -f checkpoint.trr -o new.gro
- Use the .gro as input to grompp to create a new .tpr file configured for the new number of nodes.
grompp -f mdout.mdp -c new.gro -np 4 -o new.tpr
Notes:
- A simulation can only be run for a set amount of time without modification. Even after restarting, the simulation cannot continue past the simulation time specified when grompp was initially executed. To extend the simulation, tpbconv must be used with the -until or -extend option (each takes a number of picoseconds as its argument):
tpbconv -s topol.tpr -f traj.trr -e ener.edr -o new.tpr -extend 10
- If something is wrong with this method, I do not know enough chemistry to test the accuracy effectively. Tips leading to these results were taken from the gromacs users and developers lists.
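The two-step process above can be sketched as a small helper that only assembles the command lines. It uses just the flags shown in this post; the function name, the default file names, and the idea of driving the tools from Python are assumptions for illustration, not part of F@C.

```python
def restart_commands(original_tpr, checkpoint_trr, mdp, nodes,
                     new_gro="new.gro", new_tpr="new.tpr"):
    """Build the trjconv and grompp command lines for a restart on `nodes` nodes.

    Nothing is executed here; the caller decides how to run the commands.
    """
    # Step 1: turn the checkpoint trajectory into a .gro structure file.
    trjconv = ["trjconv", "-s", original_tpr, "-f", checkpoint_trr, "-o", new_gro]
    # Step 2: preprocess again for the new node count.
    grompp = ["grompp", "-f", mdp, "-c", new_gro, "-np", str(nodes), "-o", new_tpr]
    return [trjconv, grompp]
```

Each list can be handed to subprocess.run(cmd, check=True) on a machine with GROMACS installed, so a failure in either step stops the restart.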
Posted by mccoyjo at
03:03 PM
August 22, 2004
Schaefer - Sunday Update
I'm back. When is our next meeting?
Posted by schaejo at
11:56 AM
August 17, 2004
Meeting Notes - August 17, 2004
General
- Dawit is headed to Columbia for the 3-2 program in computer engineering. He would like to keep working with us remotely; we will need to identify tasks which are easily partitionable and trackable.
- Recruiting, start looking for a sophomore or junior, CS/math type.
Folding@Clusters
- Charlie has GROMACS with -O2 working under PPC now, trying for -O3 and possibly -O4 next. The default for GROMACS is -O2 which causes seg faults under PPC.
- JoshH has seen the light, we'll be using the native GROMACS/MPI capabilities for distributing work to the children.
- Checkpointing - GROMACS can save checkpoints at periodic intervals. This happens on the 0 rank node which isn't necessarily on the same node as the mother. Need to have a mechanism to move that file from rank 0 child to mother and test the viability of it. JoshM is still looking for the right method for restarting with a different number of nodes, he'll send a message to the developers list.
- Only one restart procedure is required, same number of nodes is just one case. Make sure we test this with large molecules such as proteasome and the other new large one (Charlie has this).
- All of GROMACS' printf's need to be managed, for the short term we can consider piping all of them to files which we process. Interaction with COSM is most of the issue.
- Code changes:
- Replace our stress CPU with COSM calls.
- New printing mechanism is in place now.
- COSM has a test script that can be used to verify the subset of the API which we are using.
- Code to check the quality of master.conf file.
- Signals and COSM. Is it possible to have more than 2? LAM and COSM conflict in their usage. Use diff in COSM directories to see where changes have been made.
- Code review:
- Consider use of structures to organize data elements.
- Why divide by 4 WRT LAM hosts?
- Documentation
- Same path name must exist on mother and all child nodes. Does COSM offer a way around this? Why is this?
Posters and Presentations
- Student Ambassador program - JoshM is interested, will coordinate with John.
- SIAM poster submissions due next Friday (the 27th). F@C ready to go, need N-M and CPs education presentation.
- Ohio LinuxFest, CP will put something together once the Grab-A-Byte presentation materials are ready.
Plumbing
- Bazaar Annex without NFS - weird BIOS setting that requires keypress before boot? LAM is the only software that we should install. Keep notes during the install process (LAM, F@C user, etc.). JoshM will finish this up.
- Cairo Annex without NFS - c12 through c15. No NFS, local password file, home directories local.
- Bazaar slowness. Charlie.
Posted by charliep at
01:56 PM
update
- General
- Spent a good amount of time reading and learning about routing, dns, dhcp, and the proper ways to admin networks.
- I am signing up for the SC2004 student volunteer program. They have sections on funding and requests, so I have some minor questions.
- Plumbing
- Bazaar connectivity problems fixed.
- Some annex nodes are failing to boot. The reasons are not consistent. b16 still needs to be fixed.
- Still no progress on Bazaar slowness.
- Cairo image was made successfully. The actual imaging is complaining about having no boot loader installed when there is definitely one on the golden client.
- Folding@Clusters
- Code review ready for our meeting.
- Restarting is coming along.
Posted by mccoyjo at
10:24 AM
August 16, 2004
Update - Josh H - Aug 16, 2004
Worked/Working on
- General
- Took photos of Chalk Board in Cluster Mtg room and posted them here.
- Installed LAM 7.0.6 on Cairo.
The errors were:
../../share/.libs/liblam.so: undefined reference to `_ioexit'
../../share/.libs/liblam.so: undefined reference to `_getbuf'
../../share/.libs/liblam.so: undefined reference to `_tiob'
I fixed it by adding to the rest of our build script:
LDFLAGS="-L/usr/bin -lutil"
--enable-shared
- Installed LAM 7.1 Beta 16 on Cairo, noting the above.
- F@C
- Started a Developer's Documentation
- Nanny Stalls: Some of the first signals sent by the child are lost, so a timeout on a loop (denoted in the code by NOTE: JOSHH 1A) will ensure that it always finishes this loop instead of waiting forever.
- Built and ran on Bazaar. Needs to be tested on non-NFS setup.
You can use the configure.pl script to switch between x86 and ppc setups easily.
- Nearly finished implementing a Print Command that will let us either print to stdout or print to a Log file (which is what we want to do in production). I have the mother and nanny finished, and the child will be finished soon.
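The nanny stall fix noted above (a timeout on the loop that waits for child signals) can be illustrated with a minimal sketch. This is Python rather than the C used in F@C, and wait_for_signals, poll, and the parameter names are hypothetical:

```python
import time


def wait_for_signals(expected, poll, timeout_s, interval_s=0.01):
    """Wait until `expected` signals have been counted or the timeout expires.

    `poll` is a caller-supplied function returning how many new signals
    arrived since the last call. The count actually seen is returned, so the
    caller can tell a complete wait from a timed-out one.
    """
    deadline = time.monotonic() + timeout_s
    seen = 0
    while seen < expected and time.monotonic() < deadline:
        seen += poll()
        if seen < expected:
            time.sleep(interval_s)
    return seen
```

If some signals are lost, the loop still ends when the deadline passes instead of waiting forever, and the caller can decide how to treat the shortfall.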
To Do/Pending
- F@C:
- Keep working on to do items.
- Track memory address with malloc/free issue on mother and child. Is GROMACS playing with the memory that I malloc'ed?
I put in a work-around in the code that keeps the mother from segfaulting when freeing the arguments passed to mdrun. We should track this down in the near future though.
- Ensure security of signal handling. --> Instead of signals maybe a HTTP handshake??
- Test code on Bazaar Annex using non-NFS filesystem.
Posted by hursejo at
06:13 PM
August 11, 2004
Meeting Notes - Aug. 11
Plumbing
- Cairo Image is complete for c1-15, c0 is next in line.
- Bazaar is having issues with systemimager (silent failure) which is holding the image. Internet routing (see next note) may help this.
- Outside logins to Bazaar are not getting routed correctly. Is this residual from the cluster move? Check with Kevan...
- Bazaar slowness: The fix Skylar sent did not work as reported by Josh M. Charlie takes this token.
- Bazaar Annex becomes a non-NFS system to use as a testing ground for F@C.
Make home files to point to /home instead of /cluster/home
- DDT is still running out of ~mccoyjo, Josh M will move to /cluster/cgi-bin and update html page (Note both of these are repositories in CVS.)
- DBI::Proxy Daemon on hopper.
- Josh H is checking out installing LAM 7.0.6 on Cairo in his spare cycles.
General
- Do MT Entries...
- SC2004 Student Volunteer Apps.
- Need to finish up the presentation for Ohio LinuxFest.
- Next formal meeting Tuesday 1p.
F@C
- Bugzilla: No progress. Charlie will work on this soon.
- Checkpointing and Restarting: tpbconv -- may allow us to take last state of mdrun and produce input to mdrun without grompp. Generates a new tpr file.
Could we use this to monitor that it is running correctly? No, because of various loads over the time of the run.
How are we going to do this accurately?
Only need to keep last Checkpoint file at any point in time for this method.
Is there a way to check the validity of a given checkpoint file? Sanity Check... May be just checking for errors in tpbconv?
2 restart situations:
- Restart with same number of nodes
- Restart with different number of nodes
- Should compare notes for restarting with FAH GROMACS core restarting.
- Memory tracking: no progress.
- See if GDB may be of help in tracking stalled nanny problem.
- Need to build on Bazaar.
- Code Review. Next week we should meet Tuesday at 1p
Numerical Methods
- Dawit is working on completing the Chart of Runs. #1 Priority...
- Chart listing rules:
- Min of 2
- 20% or more
- Once something appears in list it stays in list
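One possible reading of the chart listing rules above, written out as a sketch. It assumes each configuration supplies a mapping from function name to percent of wall time; that interpretation, the function name chart_functions, and the example function names are assumptions, not the group's actual procedure.

```python
def chart_functions(profiles, threshold=20.0, minimum=2):
    """Return the sorted function names that belong in the chart.

    profiles maps a configuration name to {function name: percent of time}.
    """
    tracked = set()
    for percents in profiles.values():
        ranked = sorted(percents, key=percents.get, reverse=True)
        # Rule "20% or more": keep every function at or above the threshold.
        chosen = [name for name in ranked if percents[name] >= threshold]
        # Rule "min of 2": pad with the next most expensive functions.
        missing = max(0, minimum - len(chosen))
        chosen += [name for name in ranked if name not in chosen][:missing]
        # Rule "stays in list": a function selected once is tracked everywhere.
        tracked.update(chosen)
    return sorted(tracked)
```

With a hypothetical profile of {"bazaar": {"inl1130": 45.0, "invsqrt": 25.0, "pme": 5.0}, "cairo": {"inl1130": 60.0, "pme": 15.0, "invsqrt": 10.0}}, all three functions end up in the chart: pme is pulled in by the minimum-of-2 rule on cairo and then stays listed for bazaar as well.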
- Dawit noticed an interesting pattern in the results. Will put this information as a footnote to the bottom of the chart.
- Read all links that John posted to MT from Aug 5, 2004
- Josh M may start working with Numerical Methods folk in conjunction with F@C
- Josh H will take photo of Chalk Board in Cluster Mtg room and post to web.
B and T - GROMACS Paper -- On the shelf
Posted by hursejo at
02:50 PM
August 10, 2004
update
- Plumbing
- Bazaar Image - Everything seems to be in order save the installed system imager. It fails silently when creating the master imaging script.
- Cairo Image - A new and lightweight image has been installed. The installation is quite functional. DBI::proxy, systemimager, and c3 tools have been installed.
- Bazaar Slowness - I would like a fresh pair of eyes to stare at this problem with me. The delay only happens when ssh'ing above b0. There is no delay on DNS queries or pings. I've exhausted my knowledge of how ssh works.
- Bazaar Lack of Connectivity - Bazaar has lost the ability to see the world outside of hopper. DNS queries succeed. Pings out from bx fail. Pings from quark/acl give: Redirect Host (New nexthop: 159.28.230.232). The problem may be with our last config on quark. The problem is still being investigated by Dawit and me.
- Folding@Clusters
- Code Review - I have waded through the code excluding some capability discovery code. Things are looking good. I want to hit some XXX's after doing a quick review of the latest cut that includes COSM calls.
- Restarting - Things are proceeding well on this front. In the last couple of days new ways of implementing restarting have come to my attention. Instead of doing the kludge of converting the output file to an input format and generating a new conf file using trjconv and grompp, I have found good utilities for restarting after a crash and recycling output to input. I'm prepared to discuss these methods in detail at tomorrow's meeting.
- Checkpointing - When looking at the restarting problem, I stumbled across the mechanism in gromacs to periodically write .trr, .xtc, .log, and .edr files to disk on the head node. Combined with the tools used for restarting, I believe we have a simple solution to both problems. Again, more detail at tomorrow's meeting.
Posted by mccoyjo at
07:03 PM
Update - Josh H - Aug 10, 2004
Worked/Working on
- F@C
- Finished a set of examples for COSM:
Capability Discovery
File I/O
HTTP client/server (Currently with 3 Threads)
- Mother is now shipping binaries via MPI for child/nanny
- Converted much of mother/nanny/child to COSM
- Added HTTPD to mother, and HTTP to child/nanny for transfer of results files.
To make this easier/faster/more secure we may want to think about harnessing the zip feature in COSM to Zip up our results and send them via HTTP...
Currently the GetWork function requests the tpr and gro files from the mother, the Result function should push the results file to the mother. We should also make a Checkpoint function that pushes a checkpoint to the mother.
There are still some bugs with the Nanny. Every once in a while one or more of the nannies will not move into the Checkpointing stage. This causes the mother to stall when trying to free the children. I need to look into this. I think it is just an MPI programming error somewhere.
- If you want to play then check out the cvs tree and build it. It is currently primed for Cairo. Read the README for how to set it up.
- The master.conf file has changed a bit since we are using COSM's built in Config library.
- The Head child creates a directory $WORKING_PATH/work into which it moves all of its files and runs mdrun. Mother works out of $WORKING_PATH/molecule
- So by the looks of things we should be NFS agnostic at the moment.
Do we have a testing environment to confirm this?
Could we use Bazaar annex when Josh M and Dawit are finished?
To Do/Pending
- F@C:
- Keep working on to do items.
- Track memory address with malloc/free issue on mother and child. Is GROMACS playing with the memory that I malloc'ed?
- Nanny Unexplained stall in pre-checkpoint stage
Posted by hursejo at
03:15 PM
August 05, 2004
Potential Reading Material
Here are the links to the articles:
1. Improving Goldschmidt Division, Square Root...
2. High Speed inverse square root
3. Pseudo Division and Pseudo Mult...
4. digit-recurrent arithmetic
Yeah, it is kind of a grab bag, topically speaking.
Posted by schaejo at
02:51 PM
Meeting Notes - Aug. 5
General
- Next Meeting: Wed., Aug. 11 @ 2pm
- Cheers to Charlie for the Shirts.
- Post your MT updates.
Plumbing
- All plumbing is tabled until next week due to Abstract work??
- Cairo image: DBI::Proxy??
- Move Bazaar Annex to Cluster Closet:
F@C
- Abstract: Submitted to SC2004. Will want to submit to SIAM as Poster by Aug 27th as well. May want to change format.
- COSM Addition: Josh H Working on test case, will soon integrate into code root.
- Memory Leaks: Josh H has not looked at this yet.
- Restarting: Josh M needs to report on this...
- MPI shipping of binaries: Josh H is going to do this soon.
- Bugzilla: Charlie no progress.
Numerical Methods
- Abstract: Due Aug 27. In development...
Don't know if they are printing proceedings.
Focus on collecting data and resources
- Coordination of Runs:
- The Chart: Separated for Bazaar and Cairo. Wall time is the time represented in the chart.
Track functions taking 20% or more time, at least 2. Track significant functions throughout all configurations.
John saw a ghost. Will report future work items.
look into unrolling of inner loops, and cache eff.
- Literature search: John will put them in MT, we will all look into them.
B and T - GROMACS Paper
Posted by hursejo at
12:35 PM
August 04, 2004
Update
I have been focusing on the abstract paper. I am currently looking into the different major numerical methods that are implemented in MD packages (Ewald Corrections, Monte Carlo, LJ, Fourier) and trying to understand their theoretical base so it'll make sense when I compare to implementation in gromacs. I am also looking more into the invsqrt routine and doing tests to provide solid answers to the abstract start up questions I have put together (look in cvs).
Posted by bekelda at
09:37 PM
Update - Josh H - Aug 4, 2004
Worked/Working on
- F@C
- Finishing up COSM HTTPD Client/Server Example.
I am using this to learn how we want to use the COSM library. Hope to finish this before the meeting tomorrow, and start integrating it with the F@C core.
- B-and-T-GROMACS
- Finished up dumping of text into Extended Results Section. The paper has been shelved for the next couple of weeks.
To Do/Pending
- F@C:
- Keep working on to do items.
- Track memory address with malloc/free issue on mother and child. Is GROMACS playing with the memory that I malloc'ed?
- Have mother ship the mother/nanny/child binaries via lam's mpirun.
- Add HTTP support in F@C
- B-and-T-GROMACS:
- [Tabled] How to handle general citations (rather than specific ones).
Posted by hursejo at
06:00 PM
Schaefer's Wednesday Update
There is a new chart (more correctly, a pair of charts) in the cvs directory now, next to the old one. The apparent lack of PME runs is misleading, I just haven't gotten Dawit to put his runs in yet. The chart also has a couple of typos, etc, that will get worked out tomorrow.
I think reading the PoCo book is good background stuff. Meaning I am learning new things which deepen my overall understanding of what is happening, but none of it jumps out and says "read me! I am topical to gromacs!". I think it is useful.
The shirts are cool.
I did a little bit of a literature search, and found some potentially interesting articles, where should I park them?
Posted by schaejo at
04:08 PM
August 02, 2004
Meeting Notes - Aug. 2nd
General
The meeting for this wednesday is moved to this thursday at 1pm. Charlie will attend by phone.
Wednesday the eleventh at 2pm is the next meeting, for those still around.(hahaha!)
JoshH is to bring fancy shirts from the ranch, to share with the rest of the group.
The reading list was updated a while ago, how is it going?
Post your MT updates.
Plumbing
Most Important Item: the Cairo image. add DBI proxy to the list, and such.
Charlie will visit Fry's for cable gender converters. (our problem is female to female)
Move the annex downstairs. Set it up on the smaller cart.
B and T - GROMACS Paper
Topic tabled until early september.
Charlie will deal with poster printing at Kinkos (while driving in a fire truck?)
Decision on venue to print in is also tabled till september.
JoshM's audit runs went fine, except for dppc.
F@C
This is priority One for JoshM and JoshH (and Charlie).
Looks like porting to COSM will be a good idea. JoshH will scoop out the details.
Memory Leaks were not caught by electric fence, Charlie suggests trying a small test case and setting a watchpoint with gdb.
JoshM seems to be making progress with Restarting by hand, now he gets to make it automagic.
Bugzilla - Charlie?
Numerical Methods
The abstract is priority One for John and Dawit (and Charlie). Write blind versions of it tonight (monday) so that we can meet about them tomorrow (tuesday) afternoon with Charlie.
Coordination on the testing runs, so we can go faster. Finish the Chart by wednesday/thursday to finish the abstract by thursday/friday.
Literature search, look for good information on comparing Newton-Raphson to other methods of division.
Posted by schaejo at
04:34 PM
Update - Josh H - Aug 1, 2004
Worked/Working on
- F@C/Numerical Methods paper
- Numerical Methods
- Cleaned up formatting of Chart.
- B-and-T-GROMACS
- Sent mail to ACM Crossroads regarding publishing our article.
- General
- WeatherDuck is now running in an infinite loop polling data. This has seemed to help with the sound metric.
- Added dawit to software group so he can fully use cvs.
- I have installed DBI::Proxy on many of the cairo machines. Some of them are failing nonuniformly. I am leery about changing the scheduler to use Proxy instead of Pg until the cluster is uniform. Could we push an image on cairo that has it installed and running? Is the Cairo Image ready (less this addition)?
To Do/Pending
- General
- DBI::Proxy needs to be installed on Cairo for the scheduler. Bazaar may already have this.
- Folding@Cluster:
- Keep working on to do items.
- Track memory address with malloc/free issue on mother and child. Is GROMACS playing with the memory that I malloc'ed?
- Have mother ship the mother/nanny/child binaries via lam's mpirun.
- HTTP client/server: Look into COSM.
- B-and-T-GROMACS:
- Talk about printing Poster...
- Add 'Extended Results' Text.
- How to handle general citations (rather than specific ones).
Posted by hursejo at
09:55 AM
Update
I read the NASA article.
I've re-run the bazaar tests for sse after F@C was killed on the node (difference only on wall time).
I did a test on enable-software-sqrt and did not get any noticeable change in performance. I will investigate this further as this seems like a crucial function specifically implemented in software to improve performance.
Dawit
Posted by bekelda at
09:35 AM