July 28, 2004
Meeting Notes - July 28, 2004
Plumbing
- Bazaar: Josh M and Dawit are still making progress. Able to ssh directly to b0. Cannot access b0 from hopper. Maybe chat with Rowan about this.
- Bazaar slowness: Unable to work on this until above issue covered. Josh M
- Authoritative list of 0th nodes tested with athena and in CVS: ibid. waiting for athena to come up before pushing forward to this. Josh M and Dawit
- Bugzilla: No movement yet. Charliep
- Check WeatherDuck and post comment. Try leaving it on. Josh H
- DBI::Proxy needs to be installed on Cairo for the scheduler. Bazaar may already have this. Josh H
Folding@Clusters
- Check out 'Electric Fence' to track memory leaks between F@C and GROMACS. Josh H
- Josh M and Charlie have started code review. Will make MT entries.
- Charlie is tracking Seg Fault problem with Capability Discovery in F@C. He is checking into Optimization levels via gcc to fix this.
- Restarting: trjconv may be the program that we need for taking output from mdrun to grompp. Is this the only file we need? How do we use it?
- HTTP client/server: Josh H is researching how to do this. Look into COSM.
Numerical Methods
- Wall Time & Gprof Chart: Add borders and fix formatting issues. Josh H
- Results are nearly complete. John and Dawit
- Dawit is having some mpicc compiling errors. Josh H will take a look.
- Josh H found some text regarding Long range interactions in the mdp file for GROMACS. Passed info to John.
Papers and Presentations
- Some additions to the Reading list. Check it out!
General
- Dawit and John need key CAB13. Fill out forms and leave them on Charlie's Desk.
- Need combo for keypad access to Basement access to Cluster Closet. Charlie
- Need keycore changed for Cluster Closet from CAD3 to something reasonable. Charlie
- Where is the space for the Cluster Computing group during the Academic year? North end of Recompute in Basement.
B-and-T-Gromacs
- Send mail to CrossRoads regarding publishing our article in the next Quarter or two. Mention status of authors.
- Josh M is having some problems running DPPC on cairo with the scheduler. Going to see if a reboot on selected nodes will help (This may be a memory leak problem).
- Audit runs - Search for 'Audit' in the label in the DB. Josh M will reconcile the Audits with the previous runs.
- Josh H will add 'Extended Results' Text.
- Not going to pursue PDF problems on Hopper.
Posted by hursejo at
01:13 PM
John's Wednesday Update
I have been working on the gathering of the data for the different compilation flags. The data generation step is almost over, and I will be switching to analyzing it more closely today.
I have made no real progress on the writing of the abstract, because, while I now know fairly clearly what the work of the rest of the project will be, I do not know what we are going to find. And since I think that abstracts are a summary of the findings, I don't feel confident enough in my guesses of what we are going to find to write an abstract.
Posted by schaejo at
10:27 AM
update
- Spent a lot of time working on bazaar. This has been a lengthy affair between the dhcp and the dns problems.
- Bazaar slowness is on hold until bazaar is usable.
- The audit runs have been chugging away. There are still some failures in the scheduler for some of the runs.
- Started the F@C code review.
- Read the abstracts and the b-and-t-g paper again.
Posted by mccoyjo at
09:44 AM
July 27, 2004
Update - Josh H - July 28, 2004
Worked/Working on
- F@C/Numerical Methods paper
- B-and-T-GROMACS
- Other Considerations merged into Results and renamed to Extended Results
- Made Readme for detailed-scheduler.pl
- PDF rendered very unusually on PowerBook, PS ok:
It seems that the problem lies in the way latex is parsing the geometry.sty file on Linux machines and hopper. I tried building the tex file on my RedHat server and it produced the same result. However, when I built the tex file via the teTeX tools on OS X it displayed fine on an OS X PowerBook and the RedHat server. I am not sure how to proceed.
- Poster Links
SIAM
ACM Crossroads
- Crossroads of the ACM:
- Writers Guide provides some useful details.
- This is a quarterly magazine
- Submit in HTML format
- Between 1500 and 6000 words
- The only call for articles at the moment is for SPAM. Here is the Call for Articles site to monitor.
- Link to previous magazines.
- I could not find anything regarding who should/[should not] submit articles. So it is probably fine for Charlie to be an author.
- The site is undergoing some work, and many of the links are broken or outdated. It is a bit hard to navigate and find useful information. Maybe we should just e-mail them with questions?
It could be a while before we get a chance to submit anything here. Should we consider other places?
- General
- perl/postgres/DBI/DBD install is inconsistent on Cairo. It seems that c0, c3, c4 and c5 have postgres installed along with DBD::Pg. DBD::Pg is a dependency of detailed-scheduler.pl. 95% of the time we run it off of c0 and use MPI; if we run with only 1 process then we use ssh instead of MPI to run the job, to save the overhead (if any) of the MPI calls. Because of this the scheduler will die if you run it on any node other than those 4. To install DBD::Pg you need a local postgres install. Also, these 4 nodes have perl 5.8.3 installed, where the rest of the cluster has 5.8.0.
Is there a way to NOT have a local install since we always use the one off of hopper?
This should be something for our next image once we figure out all the details.
To Do/Pending
- Folding@Cluster:
- Keep working on to do items.
- Track memory address with malloc/free issue on mother and child. Is GROMACS playing with the memory that I malloc'ed?
- Have mother ship the mother/nanny/child binaries via lam's mpirun.
- Build in HTTP Client/Server
- B-and-T-GROMACS:
- Talk about printing Poster...
- How to handle general citations (rather than specific ones).
Posted by hursejo at
09:29 PM
July 26, 2004
Meeting Notes - July 26, 2004
Plumbing
- Bringing Bazaar and Athena back on-line: Can see b0 from hopper. Stumbling with dhcrelay; currently Quark is collecting all of the packets. Suggestion: Make Quark a temp dhcp host for bazaar and athena. Josh M and Dawit will pursue this solution.
- Bazaar slowness: Unable to work on this until above issue covered.
- Authoritative list of 0th nodes tested with athena and in CVS: ibid. waiting for athena to come up before pushing forward to this.
- Bugzilla: No word yet from charlie. No rush at the moment.
- Fixed ppc_altivec.h in the 'mygromacs' directory that people have been using. John and Dawit confirmed the fix.
Folding@Clusters
- Restarting: Been playing with the files a bit, but nothing definite yet.
- HTTP client/server: Josh H is researching how to do this.
Numerical Methods
- How many nodes do Dawit and John need for the tests we discussed on Sunday? Charlie wants the rest.
- John: c15, c14
- Dawit: c12, c13
- Charlie: c0-c11
- Josh M needs a few nodes for running the audits.
- Wall time, hotspot chart to HTML: Needs a bit more work [Wall times, links to gprof stuff]
- list of long range interaction routines (PME, CUT-OFF, ??,??): Nothing yet, going to look through the GROMACS manual next.
- Working on expanding the chart for different configure flags. Doing installs at the moment while walking through the GROMACS source.
- Altivec flags now working and seeing a big performance gain.
Papers and Presentations
- Review, improve, etc. the F@C abstract. Don't delete text rather move it to a deprecated section.
- Review, improve, etc. the N-M abstract. Don't delete text rather move it to a deprecated section.
- Josh H will find link to SIAM poster instruction site.
General
- Agenda for July 28. 1p = General, 2p = 2 Abstracts, 3p = B-and-T-GROMACS -- Charlie to meet via phone.
- Need to calibrate the weatherduck, but need more datapoints, and another source of information.
- Need keycore changed for Cluster Closet from CAD3 to something reasonable.
- Need combo for keypad access to Basement access to Cluster Closet.
- Dawit and John need key CAB13.
- Where is the space for the Cluster Computing group during the Academic year?
B-and-T-Gromacs
- Molecule runs: On hold until he has access to some cairo nodes. Having some problems with the scheduler; Josh H will help debug.
- Josh H will put instructions for using the detailed-scheduler.pl script in a README file in CVS' b-and-t-gromacs repository.
- Josh H is tracking down problem with PDF rendering.
- Josh H is going to move Other Considerations section into Results section and rename it to Extension??
Posted by charliep at
01:30 PM
sunday update
A week full of meetings and the cluster move prevented me from doing much "work". However, what to do next is quite a bit clearer.
To Do List: Compile versions of GROMACS with various configure flags, run them with gprof to determine behavioral differences and without to determine run times, isolate the inner loop "kernel", read new items, and understand the algorithm of the inner loops.
Posted by schaejo at
10:56 AM
Update - Josh H - July 26, 2004
Worked/Working on
- B-and-T-GROMACS
- Posted information about how to use the current scheduler. This information should probably make it to a README in the near future.
- How to handle definitions? Journal, Zobel?
Zobel says: When the new term is first introduced, define it in the paper. No footnotes or separate sections.
- Numerical Methods: Fixed my Gromacs tar file to have the fixed version of the configure script and the ppc_altivec.h.
To Do/Pending
- Folding@Cluster:
- Keep working on to do items.
- Track memory address with malloc/free issue on mother and child. Is GROMACS playing with the memory that I malloc'ed?
- Have mother ship the mother/nanny/child binaries via lam's mpirun.
- Build in HTTP Client/Server
- B-and-T-GROMACS:
- Talk about printing Poster...
- Crossroads of the ACM? Guidelines, deadlines, etc. Charlie ok as an author? Submission timetable and guidelines
- Make Readme for detailed-scheduler.pl
- PDF rendered very unusually on PowerBook, PS ok.
- Other Considerations merged into Results with appropriate changes.
- How to handle general citations (rather than specific ones). I don't even know if there is a commonly accepted way of doing this in scientific literature. It may be that re-reading the articles, after our prose is relatively stable, looking for specific citations is the way to handle this.
- F@C/Numerical Methods paper
- Get link from SIAM regarding Poster Setup, post it in MT until we find a stable place to put it.
- Review/add to Abstracts
Posted by hursejo at
07:10 AM
July 25, 2004
WeatherDuck Calibration
From JoshH on Saturday July 24th - Just some numbers from the Oregon Scientific (Left) and WeatherDuck (Right)
Before Reboot:
76.1 -> 79.2
29 -> 49
After Reboot:
74.1 -> 76.8
38 -> 41
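The readings above can be reduced to per-row offsets between the two sensors. The sketch below is only an illustration in Python: the row labels `first` and `second` are placeholders, since the post does not name the quantities or their units, and four data points are too few to calibrate anything.

```python
# Readings copied from the post as (Oregon Scientific, WeatherDuck) pairs.
# The row labels "first" and "second" are placeholders; the post gives no units.
readings = {
    "before reboot": {"first": (76.1, 79.2), "second": (29, 49)},
    "after reboot": {"first": (74.1, 76.8), "second": (38, 41)},
}


def offsets(pairs):
    """Return WeatherDuck minus Oregon Scientific for each row, rounded to 0.1."""
    return {row: round(duck - oregon, 1) for row, (oregon, duck) in pairs.items()}


for when, pairs in readings.items():
    print(when, offsets(pairs))
```

If more data points are collected, they can be added to `readings` in the same shape and compared the same way.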
Posted by charliep at
08:16 AM
July 23, 2004
Meeting Notes - July 23, 2004
People Present: Josh H, Josh M, Dawit, John, Charlie
Generic
- Cluster Move: We are not going to move Bazaar into the room until the Duck is calibrated. Need to watch the room for a while still.
- Reading: Lindal ClusterWorld Article
- Agenda for July 28. 1p = General, 2p = 2 Abstracts, 3p = B-and-T-GROMACS
Plumbing
- Get Bazaar and Athena wired up and working - Josh M and Dawit
- Bazaar Slowness: No Progress - Josh M
- Make Athena a cluster once again. Dawit.
- Zero'ith list still in development.
- Charlie has BugZilla token.
- Josh H will fix the gromacs tarball in the mygromacs directory so that ppc_altivec.h adds the altivec.h include.
Paper
- F@C to SC2004, Both F@C and Numerical Methods to SIAM.
- Get link from SIAM regarding Poster Setup, post it in MT until we find a stable place to put it.
- Numerical Methods folks will meet Sat July 24 at 3 pm
Numerical Methods
- Are the Wall time values in the chart based on the non-profiled AND non-debug binaries? They are not using '-gdb' so no debugging. No to profiling.
- The table from the White Board is coming along WRT completion and posting of the HTML version.
- Reading: GCC 3.3 Manual has some good info about AltiVec and SSE.
- cflow might be handy. Talk to Josh M.
- Need list of long range interaction routines (PME, CUT-OFF, ??,??)
Folding@Cluster
- Josh M reporting on Restarting: Has some good leads, but no solution as of yet.
- Code review by Charlie and Josh M.
- Getting files to children and from nannies. HTTP (which will be in the mother) or code up some TFTP. TFTP is out because of encryption, so use SSL via HTTP on a designated port -- separate listening process for mother. Should present configuration html, push/pull files in house, and communicate with Pande Labs. Apache model is reasonable.
- Josh H will send Josh M a sample conf file and talk about create-configure.pl script.
Posted by hursejo at
04:01 PM
July 22, 2004
Update - dawit
Ran lzm/pme on cairo with altivec; compiled fine.
Took an image off of b20 for bazaar image.
Posted by bekelda at
09:34 AM
July 21, 2004
Update - Josh H - July 21, 2004
Worked/Working on
- B-and-T-GROMACS
- Talked about paper with Josh and Charlie.
- Will work on this more Thursday Morning.
- General:
- Passed Bugzilla token to Charlie.
- Cleaned up the install scripts on cairo and bazaar in $CLUSTER/bin/
- Numerical Methods: Fixed my Gromacs tar file to have the fixed version of the configure script.
To Do/Pending
- Folding@Cluster:
- Keep working on to do items.
- Track memory address with malloc/free issue on mother and child. Is GROMACS playing with the memory that I malloc'ed?
- Have mother ship the mother/nanny/child binaries via lam's mpirun.
- B-and-T-GROMACS:
- Talk about printing Poster...
- Check CrossRoads Submission timetable and guidelines
Posted by hursejo at
07:36 PM
Inner loop Identification
The following information is taken from /include/types/nrnb.h:
The four digits following an inner loop name (inl) say a good deal about that loop. In inl(1st)(2nd)(3rd)(4th), the first digit gives the Coulomb type, the second the Van der Waals type, the third the solvent optimization being used, and the fourth the free energy option. The number 0 (zero) in any of these positions stands for no, none, or turned off.
position  meaning           1              2               3            4
1st       coulomb type      normal         Reaction field  Table
2nd       Van der Waals     Lennard-Jones  Buckingham      Table        Bham-Table
3rd       solvent optimiz.  general        water           water-water
4th       free energy       Lambda         Softcore
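The table above can be read mechanically. The following Python sketch decodes a loop name into its four options; the function name `decode_inl` and the dictionary keys are made up for illustration, and the example name `inl1130` was chosen only to exercise the table, so it may not correspond to a loop that GROMACS actually builds.

```python
# Meaning of each digit position, transcribed from the table above.
# A 0 in any position means "none / turned off".
FIELDS = [
    ("coulomb", {1: "normal", 2: "Reaction field", 3: "Table"}),
    ("van_der_waals", {1: "Lennard-Jones", 2: "Buckingham", 3: "Table", 4: "Bham-Table"}),
    ("solvent", {1: "general", 2: "water", 3: "water-water"}),
    ("free_energy", {1: "Lambda", 2: "Softcore"}),
]


def decode_inl(name):
    """Decode a loop name such as 'inl1130' into its four options."""
    if not (name.startswith("inl") and len(name) == 7 and name[3:].isdigit()):
        raise ValueError(f"expected 'inl' followed by four digits, got {name!r}")
    result = {}
    for digit, (field, meanings) in zip(name[3:], FIELDS):
        value = int(digit)
        if value == 0:
            result[field] = "none"
        elif value in meanings:
            result[field] = meanings[value]
        else:
            raise ValueError(f"digit {value} is not defined for {field}")
    return result


print(decode_inl("inl1130"))
```

Digits that the table does not define raise an error rather than being guessed at, which seems the safer choice when reading gprof output by hand.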
Posted by schaejo at
04:03 PM
July 20, 2004
B-and-T GROMACS Minutes - July 21, 2004
Decide where to publish, Crossroads of the ACM? Guidelines, deadlines, etc. Charlie ok as an author? JoshH.
References tour. Charlie.
http://www.earlham.edu/~charliep/mt/archives/002711.html, other entries?
PDF rendered very unusually on bubba, PS ok. JoshH.
Reconcile database results for 1-1-1, 2-2-2, 4-4-4, 6-6-6, 8-8-8; bazaar and cairo (only audit cairo now but we'll ultimately need bazaar as well); villin, dppc, then proteasome; use detailed version of the scheduler; run under b-and-t-g user; use c10-13 for now; gromacs-optimal-3.2.1 and gromacs-baseline-3.2.1; rerun 1-1-1 and 4-4-4 for villin and dppc, baseline and optimal (2^3 runs total); report back. JoshM.
Confusion between Methodology and Experimental Design. Some material from each needs to move to the other. Are they parallel universes at some level? We all need to think about this a bit. J^2 and Charlie.
How to handle definitions? Journal, Zobel? JoshH.
Other Considerations merged into Results with appropriate changes. JoshH.
How to handle general citations (rather than specific ones). I don't even know if there is a commonly accepted way of doing this in scientific literature. It may be that re-reading the articles, after our prose is relatively stable, looking for specific citations is the way to handle this. JoshH.
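The rerun request in the reconcile item above (1-1-1 and 4-4-4, villin and dppc, baseline and optimal) works out to 2^3 = 8 runs. A minimal Python sketch that enumerates them follows; the variable names are illustrative, and the minutes do not spell out what the three numbers in "1-1-1" stand for, so they are treated here as opaque labels.

```python
from itertools import product

# Values taken from the rerun item in the minutes; names are illustrative.
configs = ["1-1-1", "4-4-4"]
molecules = ["villin", "dppc"]
builds = ["gromacs-baseline-3.2.1", "gromacs-optimal-3.2.1"]

# Every combination of the three two-valued choices: 2 * 2 * 2 = 8 runs.
reruns = list(product(configs, molecules, builds))

for config, molecule, build in reruns:
    print(config, molecule, build)
```

A list like this could serve as a checklist when reconciling the audit runs against the database.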
Posted by charliep at
06:38 PM
Meeting Notes - July 19, 2004
Plumbing
- Always check optimization levels and profiling options in _all_ our build scripts. You only want to use debugging or profiling when you specifically need it; they will distort the runtime statistics if enabled for timed runs. Later on we need to figure out what optimization levels PPC and x86 will safely support. PPC has trouble with O3 when doing proteasome over 4 nodes. JoshH will check/fix /cluster//bin to see if they are right.
- Bazaar slowness - If ssh is the problem why is Cairo so fast with ssh? For JoshM's plumbing item he will look at this soon. Did DNS change propagate?
- State of Athena image? 12 nodes hopefully, Dawit will check them. Getting closer...
- Node 0 list? List from Hassan + list from Charlie's blog entry = new document in SNA CVS project. Dawit and JoshM.
- New scheduler. Flags to F@C? Sounds possible. J^2 and CP meeting sometime later this week.
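The first Plumbing item above, checking optimization and profiling options before timed runs, could be partly automated. The Python sketch below is an assumption about how such a check might look, not a description of the group's build scripts: it flags gcc's `-pg` (gprof profiling), any option starting with `-g` (debugging information), and `-O3` (reported above as troublesome on PPC). The function name `audit_flags` is hypothetical.

```python
import shlex


def audit_flags(cflags):
    """Return warnings for compiler options that should not be in a timed build."""
    warnings = []
    for flag in shlex.split(cflags):
        if flag == "-pg":
            warnings.append("-pg: gprof profiling is enabled")
        elif flag.startswith("-g"):
            warnings.append(f"{flag}: debugging information is enabled")
        elif flag == "-O3":
            warnings.append("-O3: check that this level is safe on this platform")
    return warnings


for line in audit_flags("-O3 -pg -ggdb -Wall"):
    print(line)
```

The flag string would have to be pulled out of each build script by hand or by a small parser; that part is left out here because the scripts' layout is not described in these notes.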
Folding@Clusters
- JoshM and Charlie should look at the TODO in CVS.
- NFS dependency needs to be broken, problems with stress CPU code and optimization, configure script (good HowTo available), freeing memory causes a seg fault. Charlie will add these to the TODO.
- Code review this week, JoshM and Charlie.
- Starting processes on particular nodes with LAM-MPI, we need to figure this out so that we can use it for load balancing. Charlie.
- Checkpointing - Skip LAM-MPI for now, JoshM will look at GROMACS to see what the relationship is between what it currently writes-out and what grompp takes as input. Look at all those tools that come with GROMACS, does one of them do this? JoshM.
- Bugzilla setup on admin. JoshH.
Numerical Methods
- Problems with AltiVec on Cairo, fixed with altivec.h in configure. Was this submitted as a bug?
- Dawit and John will finish the table and the call diagram. Wall time, config.log, gprof pointers, etc. See last meeting's notes. Legend on call diagram. Check J^2's diagrams from last year (white board pictures, cflow, etc.) Where are the calls to FFTW?
- Find full list of long-range interactions (PME, cutoff, others) and add those to the chart. John and Dawit.
- Reading list - Charlie will work on this during the week and update it.
B-and-T GROMACS
- JoshH's trim and substitute suggestion. Maybe just report preliminary results? Depends on the publishing venue, talk to Jim and see what he thinks. Looks like we should carefully state what we learned based on tests we actually did and leave the rest for future work. We are heading towards Crossroads (ACM) as the venue.
- JoshH added a new sub-sub-section on choosing a benchmark, we should all review this.
- J^2 and CP will meet Wednesday at 12p - decide on venue, review current draft.
Papers and Presentations
- Calendar tour - July 26th deadline for SC2004, August 11 deadline for SIAM. Both will require an abstract. Sounds like Numerical Methods and F@C for each.
Numerical Methods for Molecular Dynamics on Commodity Vector Arch.
Folding@Clusters: Using the Parallel Grid Resources for Large Molecule Molecular Dynamics.
All of us should be thinking about those this week, they will be the focus of our meeting on Thursday.
- Changes to make for LinuxFest. Some of Charlie's Grab-a-Byte sounds like it may be appropriate for this, he will put a copy of it with the original submission when it's ready (in a week or so). 1 hour with questions.
General
- Move the clusters this Thursday at 11a. Network connection? Computer and WeatherDuck in the new space.
- Forward scheduling - We will meet on Mon and Wed at 1p during the last week of July and first week of August.
For next meeting items see the unpublished entry for that date.
Posted by charliep at
07:33 AM
July 19, 2004
Update
This weekend has been spent primarily searching for information about the parallel I/O in the MPI2 standard.
First of all, file I/O as specified by MPI2 gives processes using MPI both basic file I/O and parallel file I/O regardless of the underlying system. There are some other MPI implementations that would be interesting to check out. Here are some free, source-available, commercial implementations and some non-commercial implementations.
The implementations of MPI I checked for MPI2 file I/O support are: MPICH, LAM-MPI and MPI-LITE. After a bit of digging, I found that MPI-LITE has little to no support for the MPI2 I/O standard and both LAM-MPI and MPICH rely on the same software implementation of the standard: ROMIO. Yeah, remember ROMIO? It is a software package designed to fit into any MPI implementation to provide the I/O. When dealing with file I/O, ROMIO is dependent on another software package, ADIO, to provide an abstract interface to many different underlying filesystems.
For a general overview and some performance analysis, check out this paper: A Case for Using MPI's Derived Datatypes to Improve I/O Performance
That's great, but how does this apply to checkpointing? The collective parallel I/O is shared in a single, logical file (aka: it looks like the same file to all the processes sharing the I/O file). The big question is where is this information stored on disk in a way that we can use it for checkpointing? I have yet to find a source that gives implementation details. In order to find the answer to this question, more research and/or looking at ROMIO/ADIO's code is needed.
links:
MPI-LITE
LAM-MPI
MPICH
MPI2 Standard
ROMIO
ADIO
LAM-MPI User's Guide
Posted by mccoyjo at
09:47 AM
Sunday Update
Well, I didn't get as much done over the weekend as I had hoped, which means the table has yet to be made. I am still getting results for cairo that the ppc-altivec flag has no effect, which means that I need to pay yet more attention to the configure file and compile output (I guess I had a false success on the last build). Dawit and I will create the table this morning.
Posted by schaejo at
08:36 AM
Update - dawit
Did new runs both on bazaar and cairo with time command.
Spent time with athena plumbing, not much progress but I've switched a11 and a0 and I'm configuring gentoo for a0.
Posted by bekelda at
07:54 AM
July 18, 2004
Update - Josh H - July 18, 2004
Worked on/Working on
- Folding@Cluster:
- Working on debugging the code.
- B-and-T-GROMACS
- Worked on cleaning up the paper. I have been through most of it, and am keeping an updated ps and pdf in CVS for all to look through. I suggest that before committing the text document to CVS we always make sure that the postscript and pdf versions are up to date. I have a script to do this if any are interested. src/perl/LatexMake.pl
- I am considering dropping the GROMACS Ports for the 'Other Considerations' section, and replacing it with a short discussion on the possibility for this type of exploration. This way we can finish the paper, and put the porting stuff in Future work. Also this allows me to focus more on the F@C stuff. Thoughts?
- General: Figured out why Charlie was having the CVS problem where he would not receive any new directories when doing a 'cvs update'. By default 'cvs update' does not check for new directories and only checks those files in the directories that you have. A 'cvs update -d' will get any new directories.
To Do/Pending
- B-and-T-GROMACS:
- MPICH Port: Find out why it stalls on MPI_Finalize with SMP runs
- MP-Lite Port: Run gdb on it to find where it segfaults
- Possibly use another MPI package?
Posted by hursejo at
03:10 PM
July 16, 2004
IPMI and SOL in image
I wanted to start a discussion on this so I placed this entry here. Please feel free to comment.
We mentioned this set of software that we might be able to harness with our equipment. Mostly we need something to allow us to reboot a stalled/failed node over the Ethernet connection or some interface that we can access remotely. This way we don't have to travel to campus to reboot a compute node if it stops responding.
Below are two links regarding IPMI and one user's experience in setting it up under Debian:
OpenIPMI
Debian HowTo Document
IPMI calls for a specialized linux kernel in many/all cases, but IPMI addresses an Intel standard which allows access to the BIOS even if the system has been shutdown -- as long as the power is flowing through the power supply.
I know that I would find this useful. This is just one solution out there, but it is gaining popularity. There is also a piece of hardware that would let us do this as well, if I recall correctly.
Thoughts?
Posted by hursejo at
04:10 PM
July 15, 2004
Meeting Notes - July 15, 2004
Numerical Methods
- Make sure that the right options are being used for each GROMACS build, reconcile the results. We should have documented command lines for GROMACS' ./configure for each platform and test configuration that we are running.
Josh H pulled the following information from the b-and-t-g database with the following SQL command:
SELECT * from option_profile where layer = 'Gromacs' and code_root ~* 'bazaar' and options ~* 'Optimal';
Bazaar
--enable-mpi --enable-mpi-environment --enable-float --disable-software-recip --enable-software-sqrt --enable-x86-asm --disable-ppc-altivec --disable-cpu-optimization
Cairo
--enable-mpi --enable-mpi-environment --enable-float --disable-software-recip --enable-software-sqrt --disable-x86-asm --enable-ppc-altivec --disable-cpu-optimization
- Develop a simple chart (in html) like the one I drew on the whiteboard last week. If there is any "fine-print" add that as text to the page. For each cell in the matrix provide the wall time and a pointer to the gprof output files (one flat, one hierarchical) that correspond to that run. My memory is that between Dawit and John we should have two molecule/methods on two clusters both with and without architecture specific optimizations. Post an entry to MT with the URL and send email notification of the post.
Will also put this chart in CVS. Will post both Total wall time and average wall time over the 10 runs. Will start using the UNIX time command to measure time.
- Need more to read if there is more available.
- Continue refining John's blackboard diagram. Leave it up until at least Monday. It is in a stable state at the moment, and will likely not change before Monday. May consider putting this in XFig, dia, or other.
- Dawit noted that invsqrt was called many times, but still ranks low on the CPU Utilization time. Why might this be?
- Generally we are trying to confirm our impression about where time is spent (both in "generic" mode and architecture specific mode) so that we can identify candidate code for the benchmarking kernel.
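The two ./configure option sets recorded in the first item above (Bazaar and Cairo) are long enough that differences are easy to miss by eye. A minimal Python sketch for comparing them follows; the option strings are copied from the notes, while the helper name `flag_diff` is made up for illustration.

```python
# Option strings copied from the database query results in the notes.
bazaar = (
    "--enable-mpi --enable-mpi-environment --enable-float "
    "--disable-software-recip --enable-software-sqrt --enable-x86-asm "
    "--disable-ppc-altivec --disable-cpu-optimization"
)
cairo = (
    "--enable-mpi --enable-mpi-environment --enable-float "
    "--disable-software-recip --enable-software-sqrt --disable-x86-asm "
    "--enable-ppc-altivec --disable-cpu-optimization"
)


def flag_diff(a, b):
    """Return (flags only in a, flags only in b), each sorted alphabetically."""
    set_a, set_b = set(a.split()), set(b.split())
    return sorted(set_a - set_b), sorted(set_b - set_a)


only_bazaar, only_cairo = flag_diff(bazaar, cairo)
print("only on bazaar:", only_bazaar)
print("only on cairo: ", only_cairo)
```

For these two strings the only differences are the x86-asm and ppc-altivec switches, which matches the intent of one architecture-specific optimization per cluster. The same comparison could be run against the options recorded in config.log for each build.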
Folding@Clusters
- BugZilla will be needed before letting anyone outside of our group test.
- Code is looking good. Charlie will do a review of it and post the results later this week.
- What's up with the LAM-MPI checkpointing add-in? Unless it is very simple and powerful it's likely to be easier to manipulate GROMACS' checkpoint capability to meet our needs (IMHO).
Josh H seems to remember that this is only if lamd dies on the machine, moving the process [from the last checkpoint] to another machine.
- Parallel File I/O is said to be supported by our flavor of MPI. Josh M is still looking in to this.
- Need to think about CVS tags when a stable version arises.
B-and-T GROMACS
- Exactly which version(s) of MPICH are we using or have we used?
mpich-1.2.5.2 is the one we are using, and mpich2-0.96p2 is the one we dropped.
- Josh H still needs to review that paper.
Plumbing
- Why is it that when JoshH created the src directory in the folding-at-clusters module and I ran "cvs update" on my client that I don't get the new directory and its contents? (Unless I do a "cvs release" and "cvs checkout".)
Josh H will look into this.
- What's up with bazaar WRT network lag time?
Josh M is still working on this. He is fairly sure that it is still something with SSH. He is reading up on this at the moment.
- What's up with the canonical zero node list?
Dawit and Josh M are going to image some bazaar annex nodes, and start working on building a small cluster with a head node. dhcrelay, systemimager, and something else. They have a start on the list, but are working on confirming it. Will post the list(s) to sna CVS.
- Josh M and Dawit have been updating the sna plumbing list.
- What else needs to be done to athena to complete the imaging project and have a useful cluster for F@C testing?
Systemimager is the only stumbling block at the moment. Dawit thinks that he and Josh M are close to figuring this out for a11. They have been able to pull an image from bazaar.
General
- WeatherDuck has yet to arrive. Josh H will send a chaser e-mail inquiring about the replacement for Cluster's WeatherDuck.
- LinuxFest has accepted us. We should get more information soon.
- Need IPMI research for serial/IP BIOS access.
- Did we get the poster? Do we need to get to Dayton to do this?
Below are some additional items that were talked about in and out of the meeting WRT Monday's Meeting.
Numerical Methods
- Josh M gave John and Dawit a small scheduler that they have been using to run tests.
General
- Still need to do some forward Scheduling on Monday
Plumbing
- User Accounts on Bazaar annex seem to work fine with the new image, so the trouble that was previously recorded is being tabled until it comes up again.
- Josh M has not had a chance to clean the image on Cairo yet.
- No word on Firewall status, other than that the imaging folks have noted that no firewall should be on the compute nodes, only on hopper's external interface.
Posted by hursejo at
12:32 PM
July 14, 2004
Update - Dawit
I have collected gprof results from bazaar and cairo, both with and without vector and x86-asm disabled on bazaar and altivec disabled on cairo.
Have fixed systemimager problem on athena and bazaar and I'm taking image from athena.
Posted by bekelda at
11:35 PM
Update - Josh H - July 14, 2004
Worked on/Working on
- Folding@Cluster:
- Put code in CVS: folding-at-clusters/src/
- Added the Capability Discovery code to the Framework.
Note that we are getting the segfault on the cpu tests. We have seen this before, and zero'ed out a field to fix it. Now it is back and we may have to do the same thing. We need to run the debugger on the code and find where it dies.
- Mother is calling our version of grompp
- Child is calling our version of mdrun
- Seems to run in limited circles. I have tested it with Villin. There are some serious issues that I am working through at the moment. I will send mail concerning these shortly.
- There are a few ToDo items listed in the TODO Document I am sure there are more. This should get us moving towards our goal.
To Do/Pending
- B-and-T-GROMACS:
- MPICH Port: Find out why it stalls on MPI_Finalize with SMP runs
- MP-Lite Port: Run gdb on it to find where it segfaults
- Possibly use another MPI package?
- Work on Paper!!
Posted by hursejo at
07:41 PM
Sunday Update July 14th
I have completed gprof runs of the following: bazaar with default flags, bazaar with --disable-vector, bazaar with --disable-x86-asm, cairo with default, and cairo with --disable-ppc-altivec. I suspect something fishy is going on, because both cairo batches look far too similar, and the bazaar default and the --disable-vector batches have a similar aspect. This worries me, and I will have to spend some time looking at the configure script to see what is going on.
I have made a picture of a blackboard using the gprof call graph as data. It doesn't quite capture the flow of the program, but it at least shows who is calling who, most of the time. I will leave it up for a while (till we reach consensus on its usefulness).
Otherwise, I have been monkeying around with compiling different copies of gromacs in different directories. And learning very basic perl script (although the learning there took 15min.)
I am wondering what I should do next.
Posted by schaejo at
04:32 PM
July 12, 2004
Meeting Notes - July 12, 2004
Present: Josh H, Josh M, John, Dawit
Numerical Methods
- gprof runs coming along well. John is working on profiling the gmond files; all of his runs for LZM/CUT are finished. Dawit has finished his runs for LZM/CUT and is working through some gprof errors. John and Dawit will work together to overcome these errors.
- There may be a problem disabling SSE instructions on Bazaar in the configure script. There isn't a flag explicitly labeled for it. Try the x86-asm, --enable-vectorized-recip, --enable-vectorized-sqrt, and --enable-vector flags. The latter may not benefit us, but it is doubtful that it will hurt us.
- Scheduler may be useful when running these tests in bulk. Josh M has a predecessor to the scheduler in CVS that will be useful for this.
F@C
- Josh M reported on load balancing and checkpointing. Load balancing is all left to the programmer, i.e. it is not in the standard. Josh M has some pointers to papers on this.
- Checkpointing is not mentioned in any of his reading on MPI. LAM-MPI has an addon that he will look into.
- Parallel File I/O may be useful for Checkpointing files. Josh M will do some investigations.
- LAM-MPI has a fairly complete list of implementations other than theirs, both commercial and open. This may be a B-and-T-GROMACS item as well. May be able to support other implementations in F@C core.
- I put the F@C framework in CVS. It does not have the GROMACS code or all of the capability discovery yet. Should be ready by the end of the week.
General
- Charlie should post Agenda items in MT for next meeting.
- Need to do some Forward Scheduling early next week.
Plumbing
- Image: Athena DNS, NIS, ssh, and all other resources should be working. a11 is working for client image testing, not head node. SystemImager is only distributed [as of late] as deb and rpm files. Dawit is going to either download another version from the sourceforge site and use it, or unpack the rpm version on a redhat machine and copy the source over. If we get this to work we should think about contributing it to the Gentoo site. Dawit is also looking for a dhcrelay port to Gentoo.
- Dawit will show Josh M how to use systemimager to image bazaar annex. Need to make a golden client, etc.
- It seems that user accounts are not consistent across the bazaar annex. b16 and b20 are correct, but the others are not.
- Josh M and Dawit will make a list of additional packages for client and head node from the base install.
- Clean up the Cairo image, use c15.
- Make sure firewall is off on all nodes. Only hopper's external interface should have one.
- 'The Slowness on bazaar'. Josh M does not think it is DNS. telnet has no delay, but ssh does. This may be an ssh problem. Josh M will check this out.
B-and-T-GROMACS
- Need to look at the paper. Josh H will take the next look and do some updates.
Posted by hursejo at
02:02 PM
Update
I'm back from vacation and ready to get back to work.
The majority of what I accomplished while gone was reading. All the MPI reading was completed (the two MPI books and the MPI articles). This reading taught me two important things about MPI. First, MPI leaves all load balancing to the programmer; there is no real internal load balancing. Second, there is no explicit checkpointing mechanism. That being said, the parallel file I/O specified in MPI2 could be an excellent method through which to implement checkpointing.
Posted by mccoyjo at
10:15 AM
Update - Dawit
I did 10 runs on b18 and c13 for lzm/pme. I now have enough gprof output to do comparison.
Made the mistake of using the USE utility to add nis support on athena image and it took 3 days compiling since it rebuilds the whole tree. Need to look into their cross compile option.
a11 is ready with all network capability so we can ssh and test image. Systemimager download on a11 needs a source tarball and I could not find any current ones without rpm.
Posted by bekelda at
12:05 AM
July 11, 2004
Sunday Update
I have finished the first half of the gromacs gprof runs. I have ten for cairo and eleven for bazaar (I can't count). I have re-compiled with disabling flags, and am beginning to gather data on those. I think I really ought to learn to use cron, or write perl scripts to automate this stuff. It was a good weekend.
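The kind of automation wished for here can be quite small. Below is a minimal sketch in Python (rather than perl or cron) that runs one command a fixed number of times and records the wall-clock time of each run. The function name timed_runs is hypothetical, and the command shown is a do-nothing placeholder standing in for the real mdrun invocation, which is not given in these notes.

```python
import subprocess
import sys
import time


def timed_runs(command, repeats):
    """Run `command` `repeats` times; return the elapsed seconds of each run."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True stops the batch if a run exits with a non-zero status.
        subprocess.run(command, check=True)
        elapsed.append(time.perf_counter() - start)
    return elapsed


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the real run command.
    placeholder = [sys.executable, "-c", "pass"]
    for i, seconds in enumerate(timed_runs(placeholder, repeats=3), start=1):
        print("run %d: %.3f s" % (i, seconds))
```

A binary built with -pg writes its profile to gmon.out in the working directory and overwrites it on every run, so a real batch script would also need to rename or move that file after each run before starting the next one.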
Posted by schaejo at
10:52 PM
July 08, 2004
Useful Websites
Here are some educational links (I hope):
1. http://numericalmethods.eng.usf.edu/siteindex.html
2. http://www.amara.com/papers/nbody.html
3. http://www.ap.univie.ac.at/users/ves/cp0102/dx/node107.html
The first is a link to the University of South Florida's engineering page, which contains a good explanation of how to do division without doing division. The explanation is listed as an example of a "Physical Problem" for CS Engineers in (nonlinear?) Differential Equations (the first topic, not interpolation). The third link is hosted by the University of Vienna (Wien) in Austria, so it is probably reliable.
The second is an individual's webpage, and it contains brief explanations of different ways of doing molecular dynamics. Probably not directly useful, but informative.
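For readers who have not seen "division without division": a common way to do it is Newton-Raphson iteration on f(x) = 1/x - a, which gives the update x <- x * (2 - a * x) and uses only multiplication and subtraction. The sketch below illustrates that idea; that the linked page uses exactly this scheme is an assumption, and the function names and starting guess are hypothetical. The iteration converges only when the starting guess lies between 0 and 2/a.

```python
def reciprocal(a, x0, iterations=6):
    """Approximate 1/a using only multiply and subtract (Newton-Raphson).

    x0 is the starting guess and must satisfy 0 < x0 < 2/a to converge.
    """
    x = x0
    for _ in range(iterations):
        # Newton step for f(x) = 1/x - a; the error is squared at each step.
        x = x * (2.0 - a * x)
    return x


def divide(numerator, denominator, x0):
    """Compute numerator / denominator as numerator * (1 / denominator)."""
    return numerator * reciprocal(denominator, x0)


if __name__ == "__main__":
    print(divide(10.0, 4.0, x0=0.2))
```

Because the error is squared at every step, a rough starting guess (for example from a small lookup table) reaches full double precision in a handful of iterations.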
Posted by schaejo at
02:00 PM
Meeting Notes - July 8, 2004
Present: Charlie, John, Dawit
Numerical Methods
- Readings all done, at least at a high level. When we have a better sense of what code we are looking at we can do the in-line, un-rolling analysis.
- Building GROMACS on bazaar and cairo seems to be working for both Dawit and John. Need to confirm that -pg is actually there.
- Wait to see if gprof gives us enough for call graphs before crafting our own.
- Use gprof to generate flat and hierarchical data based on 10 runs of one molecule/method on bazaar (with and without SSE) and cairo (with and without AltiVEC). John - LZM/cutoff, Dawit - LZM/pme.
- On Monday talk about the gprof results and consider benchmarking kernel.
- Sort-out the performance numbers at the end of the GROMACS log. Speak with Josh^2 about our earlier notes about this. Why is PS/Node hour the same for cairo and bazaar with the same molecule/method?
- Stop the literature search for now.
- John will document what he has learned about what the differential equations are doing in GROMACS.
F@C
- JoshH and Charlie met and discussed the code and the overall approach. Josh will put the code in CVS at some point (soon!).
General
- JoshH will "organize" and take notes for the two meetings next week.
Plumbing
- Athena - network is ok now, image is coming along but not done yet.
- List for 0th nodes, see earlier meeting notes.
Posted by charliep at
12:19 PM
Wednesday Update
I think I finally found a good way (easy & accurate) to compare the run times between bazaar and cairo. I was wrong about there not being any timing in the .log files, but I don't quite understand or trust what I have seen so far (the log files claim that bazaar is sometimes faster than cairo, which does not match my experience).
I read all three articles. And the first half of the FFTW doc.
The literature search is going poorly, I have never been good at finding stuff, but I will continue trying.
I don't think I understand how to diagram the gromacs code. Is this a graphic representation of gprof's call graph? Or is it something else?
If it is a visually structured call graph, it would be good to know where the -pg flag goes. Dawit and I were wrestling with it yesterday.
Posted by schaejo at
08:44 AM
July 07, 2004
Update - Dawit
Read the assigned articles.
Built personal gromacs on cairo. Still having problems building it on bazaar.
Worked on athena image.
Posted by bekelda at
11:03 PM
July 04, 2004
Update - Josh H - July 4, 2004
Worked On/Working On
- Weather Duck: They are sending a replacement. Once it arrives then we will swap out ours.
- GROMACS Port PVM: All levels finished.
- GROMACS Port MPICH: After testing on x86 and seeing the same error (stalling and timing out on MPI_Finalize), I removed the call from the GROMACS source and it is now running. It leaves the mdrun slaves running, so I have to manually kill them before each test. I am re-running the NxNxN tests since the environment/code has changed.
- GROMACS Port MPICH2: Dropped for the time being. It has build problems under ppc.
- GROMACS Port MP_Lite: Need to run GDB to see what is causing the core dump
- F@C development: Starting to merge GROMACS mdrun with framework. Keeping notes on any changes and what needed to be extracted.
- B-and-T-GROMACS Paper: Installed Latex, dvipdf, dvips, aspell, ispell on hopper to aid in LaTeX development.
To do
- B-and-T-GROMACS Paper
Posted by hursejo at
04:22 PM
Meeting Notes - July 5, 2004
Numerical Methods
- Three articles from Dr Dobbs to read, first two more generally applicable, third one more specifically.
- Build and run gromacs in home dir? Not yet, will work on this soon.
- Are MFLOPS useful as counters? No, particular values are manipulated in more than one way and at multiple levels. Consider other approaches for measuring load: gprof? hand scaffolding? After some discussion we decided to go the gprof route. Test both native short vector instructions and generic GROMACS on both x86 and PPC. John and Dawit.
- Elapsed time comparison between cairo and bazaar, not yet. John.
- Learn about nohup and &. Dawit and John.
- Diagram of call structure of GROMACS source module/function dependencies on white-board. Will make electronic later. Dawit and John
- Literature search for vector algorithms, nothing yet. John
Plumbing
- Develop and practice 0 node install on a0. We need a canonical published list of changes made to 0th nodes after imaging:
- routing
- dhcrelay
- ssh keys (which need to be preserved in /cluster//etc?
- others?
- Dawit returned b15 to its former glory as a cluster node.
- iptables and ipchains should both be completely removed from all images. Hopper's external NIC is the only place we should have any firewall.
- Merge SNA and plumbing list. Charlie
- We need to do a backup audit soon.
Other
- Make MT log entries on Sunday evening and Wednesday evening!
Posted by charliep at
03:26 PM
July 01, 2004
Meeting Notes - July 1, 2004
Present: Charlie, JoshH, John, Dawit
General
- Dawit and John work together more. Move one more workstation into seminar room. (I was unable to reproduce the little hand motion from lunch in this format.)
- We'll meet from 11a-12p EST on Monday and Thursday of next week.
- Fried UPS and cairo disk drive RMAs. Charlie
B & T GROMACS
- Latex on hopper is set. JoshH
- MPICH - PPC problem, try it under x86. For stalling consider removing MPI_Finalize() call. JoshH
- MPICH-2 - not installing, needs debugging. Dropped due to hassle. JoshH
- Intra collective communicators for next LAM-MPI version (7.1.x), sometime this Fall maybe.
Folding@Cluster
- JoshH working on child merge with GROMACS.
- With 10 processes, 4 nodes are highly utilized, one marginally and 5 minimally. Consider manual load balancing à la OpenMosix. Does MPI offer anything here (dynamic process launching)? OpenMosix itself is too complex to consider. Charlie
- Scaling model coming along, still lots of configurations that fail under PPC and x86. Charlie
- MPI - learn about application schemas, launching a binary from one node, load balancing options. JoshH, JoshM, Charlie
Numerical Methods
- John discovered the megaflop accounting is bogus. We can consider a couple of different approaches to fixing it: re-calculate constants and keep going, make accurate counters, use gprof, others?
- Determine wall (elapsed) time accounting on bazaar and cairo. John
- Draw picture of structure of GROMACS source module/function dependencies on white-board. Will make electronic later. Dawit and John
- How to build GROMACS in home directory, JoshH to show Dawit and John
- Literature search for vector algorithms. John
Plumbing
- b15 returned to its former glory as a cluster node. Dawit
- b16 under image test for a week. Dawit
- Merge SNA and plumbing list. Charlie
- We need a canonical published list of changes made to 0th nodes after imaging:
- routing
- dhcrelay
- ssh keys (which need to be preserved in /cluster//etc?
- others?
- We need to do a backup audit soon.
Cluster Closet Move
- Watch temperature at North and South ends of room, are there hot spots? Does the intake vent need to be relocated?
- Speak to BillB about remaining outlet and jack relocations, lighted switches. Charlie
- New WeatherDuck on its way. JoshH
Conferences and Presentations
- No word from LinuxFest yet.
Posted by bekelda at
01:19 PM
Update
I finished reading HPC and the gromacs manual chapter 1; Chapter 3 I skimmed through.
Doing some output staring, will be running more molecules to check for mflop accounting.
Trying to figure out how to yank a computing process like LJ + Coul(WW) from main code and run on own for kernel benchmark.
Posted by bekelda at
08:59 AM
Wednesday Update
I put some of my grep'ing notes into the numerical-methods folder.
Charlie gave me info to start a literature search for papers involving modest-sized-vector algorithms, but I haven't found any real goods yet. I read more of K&R, which is slower going as I get to things that look less like C++.
I am going to finish reading the GROMACS manual appendix B on 1/sqrt(x) and then do more grep'ing, with a focus on the dependencies of the inner loops I found.
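As background for the 1/sqrt(x) reading: the usual software approach is to start from a rough guess (often a table lookup) and refine it with Newton-Raphson steps of the form y <- y * (1.5 - 0.5 * x * y * y), which avoids both the square root and the division. The sketch below shows only the refinement step; the function name inv_sqrt and the hand-picked starting guess are hypothetical, and this is an illustration of the general technique rather than a copy of the GROMACS routine.

```python
def inv_sqrt(x, y0, iterations=5):
    """Approximate 1/sqrt(x) by Newton-Raphson refinement of the guess y0.

    y0 should be reasonably close to the answer (e.g. from a lookup table).
    """
    y = y0
    for _ in range(iterations):
        # Newton step for f(y) = 1/y**2 - x, using multiplies and one subtract.
        y = y * (1.5 - 0.5 * x * y * y)
    return y


if __name__ == "__main__":
    # 1/sqrt(4) is 0.5; the guess 0.4 stands in for a table lookup.
    print(inv_sqrt(4.0, y0=0.4))
```

Each step roughly doubles the number of correct digits, which is why a small table plus one or two iterations can be enough for single precision in an inner loop.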
Posted by schaejo at
08:08 AM