April 30, 2004

Meeting Notes - Apr 29, 2004

CCG Meeting - cp, jh, jm, js
New power outage schedule, May 15-16 and June 12.

Bazaar annex is fully functional. Both Joshes are running code on it.
See MT for JoshH's MPI config file.

No graphing tool yet, maybe this weekend. JoshM.

Some progress on mdrun/MPI, starting parallel runs. JoshM.

All of us (particularly JM and CP) need to be more regular about using
MT.

The new testing users are set up on the cluster, f-at-c and b-and-t-g.
Passwords are the same as the switches.

Fix rid-lookup.php and install in cgi-bin, link to it and the new tool
at c.e.e/html/resources (all of those files are in CVS) and in the
Links section in MT. JoshM.

Villin/Urea molecule failure. Charlie.

Implement the logic and code to catch a child failure, recover, and
restart with one fewer process in v0.5 of F-at-C. JoshH.

We'll need to set up Bugzilla before too long.

Reading list for 1/x, sqrt(x), and vector processing. Charlie.

Posted by charliep at 06:18 AM | Comments (0)

April 28, 2004

Bazaar Annex Nodes

As I was waiting for some GROMACS code to finish compiling I took the liberty to finish imaging the new bazaar nodes (b16-20). They are all running now, waiting eagerly for work.
Since you need a special lam-bhost.conf file to work with these nodes, I have created one that we can all use. Instructions on how to use it are at the top of the file.
http://cluster.earlham.edu/home/joshh/src/lam-mpi/bazaar-annex.conf

Posted by hursejo at 03:59 PM | Comments (0)

April 27, 2004

Added Project 1012 to CVS

I added the Project1012 Molecule to CVS. Here is the link if you want to view the files:
http://cluster.earlham.edu/project/b-and-t-gromacs/tests/etc/orignal-molecules/project1012/

It is a bit smaller than the last molecule that we received, but this one will 'stretch' to 100 processes without error [in one set of testing].

Posted by hursejo at 05:38 PM | Comments (0)

April 19, 2004

Protocol for v0.5

So I have typed up a protocol outline for v0.5 which is a bit more detailed than what we have on the board now. There are some notes at the bottom which we should address when we meet next. Here is the link:
http://cluster.earlham.edu/home/joshh/dev/fac/protocal.txt

Here is a link to the current framework:
http://cluster.earlham.edu/home/joshh/dev/fac/

Posted by hursejo at 11:59 AM | Comments (0)

April 16, 2004

Villin with Urea Molecule Set

I placed the "villin and URE in 6 A cubic box in water" molecule set in the b-and-t-gromacs CVS repository with the other molecules. It is under the directory villin-urea or via the softlink 'urea'.

I have been doing some testing with this molecule, and cannot seem to get it to span more than 3 processes without major failure [i.e. application death]. This is a very large molecule, and if we can only use at most 3 processes with it I am interested in finding out why. This is one of those questions that we need to answer in order to produce some stable F@C code. grompp is able to split it fine, but mdrun chokes. I am playing around with the other versions of GROMACS to see if there is any difference; specifically, I am interested in testing with 3.2.1.

Posted by hursejo at 12:26 PM | Comments (2)

GROMACS 3.1.4 & 3.2.1 Install

I have installed the latest stable release of GROMACS (3.2.1) on the clusters. I also installed GROMACS 3.1.4 on cairo. For both versions I installed a Baseline and an Optimal Config.

So on both clusters we have the following versions of GROMACS with both Baseline and Optimal configurations:


  • 3.1.4
  • 3.1.5_pre1
  • 3.2.0
  • 3.2.1

Posted by hursejo at 10:35 AM | Comments (0)

April 14, 2004

MPI_Info

MPI_Comm_spawn sequentially assigns processes to the configured processors. For our application we would like the user to define exactly which machines the program will run on. MPI_Info can name a file that specifies those nodes. The file needs to look something like:


< MPI_Info File >
n1 -np 1 nanny
n4 -np 1 nanny
n5 -np 1 nanny
< /MPI_Info File >

Here n* is the node reference reported by lamnodes; it can also be c* to refer to a specific CPU in the configuration. The -np * option specifies the number of processes to start on that resource.

To shield the user from this level of detail, I have created a function that generates the Nanny and Child MPI_Info files from a hostfile.conf containing a comma-separated list of nodes in the lamnodes format above. For example:


< hostfile.conf >
n1,n4,n5
< /hostfile.conf >
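
A minimal sketch of what such a generator could look like, assuming one process per listed node and writing the lines into a caller-supplied buffer rather than a file. The name hostlist_to_info and its signature are hypothetical and are not taken from the F@C source; whitespace around node names is not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: turn a comma-separated node list ("n1,n4,n5")
 * into MPI_Info file lines ("n1 -np 1 nanny\n...") for the given binary.
 * Assumes one process per node. Returns the number of nodes written,
 * or -1 if the output buffer is too small.
 */
int hostlist_to_info(const char *hostlist, const char *binary,
                     char *out, size_t outsz)
{
    assert(hostlist != NULL && binary != NULL && out != NULL && outsz > 0);

    size_t used = 0;
    int count = 0;
    const char *p = hostlist;

    out[0] = '\0';
    while (*p != '\0') {
        /* Length of the current node name, up to the next comma or newline. */
        size_t len = strcspn(p, ",\n");
        if (len > 0) {
            int n = snprintf(out + used, outsz - used, "%.*s -np 1 %s\n",
                             (int)len, p, binary);
            if (n < 0 || (size_t)n >= outsz - used)
                return -1;              /* line would not fit */
            used += (size_t)n;
            count++;
        }
        p += len;
        if (*p != '\0')
            p++;                        /* skip the separator */
    }
    return count;
}
```

Calling hostlist_to_info("n1,n4,n5", "nanny", buf, sizeof buf) fills buf with the three lines of the MPI_Info file shown earlier in this post; the same call with "child" as the binary would produce the Child file.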

It is vital to note that MPI_Comm_spawn and MPI_Info are part of the MPI-2 standard, but many implementations do not support them fully. LAM-MPI is one of the few that support both MPI_Comm_spawn and MPI_Info. However, LAM-MPI does not currently implement some functions for intercommunicators, among them MPI_Bcast and the other functions that send, receive, or reduce messages globally across a group. Also, there is no function [that I have found thus far] to poll the size of a group via an intercommunicator. The workaround here is to have the first member of the Child group (rank == 0) send its group's size to the mother over the intercommunicator. The man pages provide useful information about each of these functions and their limitations.

Posted by hursejo at 10:12 AM | Comments (0)

April 12, 2004

installation of cflow

cflow is installed on bazaar in /cluster/bazaar/bin/cflow
and on cairo in /cluster/cairo/bin/cflow

The rpm's failed silently, so I had to grab the source. A PPC version seems to be hard to locate, but I am still looking. *Just after posting I found a diff for ppc and installed cflow on cairo*

After looking up usage documents online, it seems that cflow is difficult to run on larger programs due to instability. I have had no success after 40 minutes of playing with it on GROMACS, but the given examples and smaller C programs work fine.

documentation:
http://www.opengroup.org/onlinepubs/007904975/utilities/cflow.html
http://www.freealter.org/doc_distrib/cflow-2.0/#sect6

Posted by mccoyjo at 10:54 AM | Comments (0)

April 09, 2004

MPI_Comm_spawn

In order to have separate Mother, Nanny, and Child processes that all play in the MPI sandbox, we will need to use MPI_Comm_spawn to launch our Child and Nanny binaries from the mother. I have been playing around with this and produced a basic framework that uses this functionality. You can play with the files, which are located here:
http://cluster.earlham.edu/home/joshh/dev/discover/dev/spawn/

Some points to mention before running to make a huge sandcastle:
1. There is a difference between intracommunicators (the standard usage, such as MPI_COMM_WORLD) and intercommunicators. Intracommunicators are used to speak with members of your own tribe (the Children that are spawned are in a tribe all of their own, and the mother is in a separate tribe). Intercommunicators allow tribes to talk to each other. So the Children need to know the handle (MPI_Comm mother) to reach the mother, and the mother needs to know the handle (MPI_Comm everyone) to reach the children.
2. The Children can use MPI_COMM_WORLD directly to talk amongst themselves without talking to the Mother. If they want to talk to the mother they need to use the mother MPI_Comm 'channel'.

We should be able to spawn Nannies and Children as separate tribes, and join them as necessary. After creating one or more intercommunicators, it is possible to merge them into one big, happy intracommunicator.

To start the program you only need to start the mother, and pass it one node to start from. MPI_Comm_spawn wakes up the additional processors.
$ mpirun n0 mother

Posted by hursejo at 11:49 PM | Comments (0)

April 08, 2004

cflowd

The installation of cflowd requires the arts and GNU flex libraries. There seems to be some problem with arts communicating with flex at the moment. I'll take a closer look asap.

It seems that cflowd analyzes flow files from network communication, in a format developed by Cisco. Is that what we were looking for?

Posted by mccoyjo at 12:49 PM | Comments (1)

April 07, 2004

Capability Discovery

I cleaned up the Capability Discovery code a bit.


  • Passing structs as pointer reference arguments, reserving the return value for a status code.
  • Cleaned up some of the code (removing unused variables and misleading print statements).
  • Ensured appropriate 3rd party licenses are at the top of their respective files.
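
The first point can be illustrated with a small sketch. The struct, its fields, the status codes, and the function name below are all hypothetical stand-ins; the real discovery code may look quite different. The idea is only that results travel through the pointer argument while the return value reports success or failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability record; not the struct used in the discovery code. */
struct capability {
    int cpus;        /* number of CPUs found on the node */
    int has_simd;    /* 1 if SSE or AltiVec is available, else 0 */
};

/* Status codes carried by the return value (hypothetical names). */
enum { CAP_OK = 0, CAP_EINVAL = -1 };

/*
 * Fill *out with placeholder values. The result is passed back through
 * the pointer argument; the return value is only a status code.
 */
int discover_capability(struct capability *out)
{
    if (out == NULL)
        return CAP_EINVAL;

    out->cpus = 1;       /* placeholder: a real probe would query the system */
    out->has_simd = 0;   /* placeholder: a real probe would test the CPU */

    assert(out->cpus >= 1);   /* a node always has at least one CPU */
    return CAP_OK;
}
```

A caller declares a struct capability, passes its address, and checks the return value against CAP_OK before reading any field.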

These pieces of code can be found here:
MPI Version
Single Version

Posted by hursejo at 04:35 PM | Comments (1)

April 06, 2004

Future Runs

Here is a table of the Future runs that I would like to run to answer the question:

For a given molecule, what is the optimal number of processes, taking into consideration SMP vs. uniprocessor machines running both x86 and PPC hardware [Bazaar and Cairo respectively]?

The file is here:
chart.html

Posted by hursejo at 10:22 AM | Comments (6)

Reading SMP

Here are some notes about reading SMP vs Uniprocessor runs in the Database.

SMP Collection


 cpus | nodes | processes |           label           | molecule | cluster_name |     finish_time
------+-------+-----------+---------------------------+----------+--------------+---------------------
    2 |     1 |         2 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-02-24 21:52:37
    4 |     2 |         4 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-03-02 16:24:51
    6 |     3 |         6 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-03-02 16:10:04
    8 |     4 |         8 | Gromacs-SMP-Optimal-3.2.0 | villin   | bazaar       | 2004-03-02 15:58:00
    2 |     1 |         2 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-02-24 20:44:34
    4 |     2 |         4 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-03-02 20:59:57
    6 |     3 |         6 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-03-02 20:51:44
    8 |     4 |         8 | Gromacs-SMP-Optimal-3.2.0 | villin   | cairo        | 2004-03-02 20:45:20

Uni Processor Collection

 cpus | nodes | processes |                label                | molecule | cluster_name |     finish_time
------+-------+-----------+-------------------------------------+----------+--------------+---------------------
    2 |     2 |         2 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 16:28:06
    4 |     4 |         4 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 16:06:36
    6 |     6 |         6 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 15:50:56
    8 |     8 |         8 | Gromacs-Optimal-Configuration-3.2.0 | villin   | bazaar       | 2004-01-20 15:37:34
    2 |     2 |         2 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:40:55
    4 |     4 |         4 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:27:23
    6 |     6 |         6 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:18:47
    8 |     8 |         8 | Gromacs-Optimal-Configuration-3.2.0 | villin   | cairo        | 2004-01-17 20:11:57

The difference is in the combination of nodes and processes. In the uniprocessor runs nodes = processes; in the SMP (dual CPU) runs nodes = processes/(cpus per node), that is, nodes = processes/2.

Note that in these runs cpus = processes, but this may not be so in the future. It is only true because we tested with one process per CPU, and we may find that running more than one process on a CPU is the optimal configuration.
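
The node counts in the two tables follow from that relationship, which can be written as a small helper. This is only a sketch: the function name is hypothetical, it assumes one process per CPU as in the runs above, and rounding up for process counts that do not divide evenly is my assumption, since the runs only used even counts.

```c
#include <assert.h>

/*
 * Nodes needed to place `processes` processes with one process per CPU.
 * cpus_per_node is 1 for the uniprocessor runs and 2 for the SMP runs.
 * Rounding up for uneven counts is an assumption, not something the
 * recorded runs exercised.
 */
int nodes_needed(int processes, int cpus_per_node)
{
    assert(processes >= 0 && cpus_per_node > 0);
    return (processes + cpus_per_node - 1) / cpus_per_node;
}
```

For 8 processes this gives 8 nodes with cpus_per_node = 1 and 4 nodes with cpus_per_node = 2, matching the last row of each cluster in the tables.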

I am going to set up some runs on Cairo and Bazaar to fill out our table.

Posted by hursejo at 07:59 AM | Comments (0)

April 05, 2004

AltiVec error

As near as I can tell the SIGTRAP error is an interaction between MPI and the AltiVec code, although that doesn’t sound reasonable on the surface of it. I made a copy of discovery.c sans all the MPI calls and it works fine.

I think we will have to trace down exactly where in inl1130() the trap is occurring in order to ultimately fix this. One approach would be to have our signal handler tell us where it was invoked from. Another would be to put some code in to pause stresscpu() long enough for us to attach with gdb before the offending call is made.
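
A rough sketch of the first approach, assuming a Linux system where sigaction() with SA_SIGINFO fills in si_addr with the address of the fault for SIGTRAP. The function and variable names are hypothetical and this is not the handler currently in the discovery code. It formats the address by hand because printf is not safe to call inside a signal handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Set to 1 by the handler so the main program can tell a trap occurred. */
static volatile sig_atomic_t trap_seen = 0;

/* Write "SIGTRAP at 0x<hex address>\n" to stderr using only write(). */
static void trap_handler(int sig, siginfo_t *info, void *context)
{
    static const char digits[] = "0123456789abcdef";
    char msg[64];
    size_t pos = 0;
    uintptr_t addr = (uintptr_t)info->si_addr;

    (void)sig;
    (void)context;

    for (const char *p = "SIGTRAP at 0x"; *p != '\0'; p++)
        msg[pos++] = *p;
    /* Emit the address one hex digit at a time, most significant first. */
    for (int shift = (int)(sizeof addr * 8) - 4; shift >= 0; shift -= 4)
        msg[pos++] = digits[(addr >> shift) & 0xf];
    msg[pos++] = '\n';

    ssize_t rc = write(STDERR_FILENO, msg, pos);
    (void)rc;
    trap_seen = 1;
}

/* Install the handler for SIGTRAP; returns 0 on success, -1 on error. */
int install_trap_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = trap_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTRAP, &sa, NULL);
}
```

The printed address could then be matched against the disassembly of inl1130() in gdb or objdump. For a signal sent with raise() or kill() the address is not meaningful; it only helps for a trap raised by the faulting instruction itself.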

Posted by charliep at 11:09 PM | Comments (4)

Capability Discovery

I have finished the mother/nanny Capability Discovery code. I have an MPI version and a singleton version (separate mother and nanny programs).

These are located here:
MPI Version:
http://cluster.earlham.edu/home/joshh/dev/discover/mpi/
Singleton Version:
http://cluster.earlham.edu/home/joshh/dev/discover/single/

I have a version for the x86 SSE enabled Linux systems, PPC Altivec Linux systems, and a generic Linux version. Currently the Altivec MPI Version is broken due to a Trap/Breakpoint error that I am struggling to track down.

In the directories above there is a create.pl script, which will build the appropriate collection of source for the system you specify on the command line. I need to make this into a configure script in the future.

Posted by hursejo at 01:06 PM | Comments (0)