General Cluster Administration
A couple of days ago I replied to JoshM's message about testing the new cairo image with questions about where to find what was installed, etc. I think I was headed down the wrong road; we should instead focus on making it easy to add or change things in the image, re-image machines, and reboot machines. That is closer to the "release early, release often" mantra.
From JoshH on Saturday, July 24th: just some numbers from the Oregon Scientific (left) and the WeatherDuck (right).
Before Reboot:
76.1 -> 79.2
29 -> 49
After Reboot:
74.1 -> 76.8
38 -> 41
I wanted to start a discussion on this, so I placed this entry here. Please feel free to comment.
We mentioned this set of software that we might be able to harness with our equipment. Mostly we need something to allow us to reboot a stalled/failed node over the Ethernet connection or some interface that we can access remotely. This way we don't have to travel to campus to reboot a compute node if it stops responding.
Below are two links regarding IPMI and one user's experience setting it up under Debian:
OpenIPMI
Debian HowTo Document
IPMI calls for a specialized Linux kernel in many or all cases, but IPMI is an Intel-led standard that allows out-of-band access to a machine through its baseboard management controller (BMC), even if the system has been shut down, as long as power is still reaching the power supply.
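As a rough sketch of what the remote reboot could look like, assuming our nodes have IPMI-capable BMCs reachable over the network and that we use ipmitool (a separate command-line client, not part of the OpenIPMI links above), the helper below wraps ipmitool's `chassis power` commands. The default user name `admin`, the BMC host names, and the `ipmi_node` helper itself are hypothetical. The `-E` flag makes ipmitool read the password from the `IPMI_PASSWORD` environment variable so it never appears on the command line, and `DRY_RUN=1` prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: power-cycle a stalled node through its BMC with ipmitool.
# Assumptions (hypothetical): BMCs are reachable over the LAN (-I lan, IPMI 1.5),
# the BMC user defaults to "admin", and the password is in $IPMI_PASSWORD (-E).

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ipmi_node ACTION BMC_HOST
# ACTION is a "chassis power" subcommand: status, on, off, cycle or reset.
ipmi_node() {
    action="$1"
    bmc="$2"
    case "$action" in
        status|on|off|cycle|reset) ;;
        *)
            echo "usage: ipmi_node status|on|off|cycle|reset BMC_HOST" >&2
            return 2
            ;;
    esac
    if [ -z "$bmc" ]; then
        echo "ipmi_node: missing BMC_HOST" >&2
        return 2
    fi
    run ipmitool -I lan -H "$bmc" -U "${IPMI_USER:-admin}" -E chassis power "$action"
}
```

For example, `ipmi_node status c05-bmc` to check a node (the host name is made up), then `ipmi_node cycle c05-bmc`. Per the ipmitool manual, `cycle` is not meant for a node whose power is already off (use `on` there), and `reset` is a hard reset without a power-off.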
I know that I would find this useful. This is just one solution out there, but it is gaining popularity. If I recall correctly, there is also a piece of hardware that would let us do this as well.
Thoughts?
| Distribution | Latest Stable Kernel | Notes |
| Red Hat 9.0 | 2.4.26 | No longer supported, and the 2.6 kernel upgrade options look sketchy. |
| Fedora Core 2 | 2.6 | Rumors of instability persist. |
| Gentoo 2004.1 | 2.6 | Updated often; supports many kernels (Hurd, Windows compatibility kernel). |
| SUSE 9.1 | 2.6 | Updated often, with an extremely large package library. It is becoming less "free-friendly". |
| Debian 3.0 | 2.2.20 | Can be upgraded to a 2.6 kernel, but that is not an official Debian stable release. |
At this point, SUSE and Gentoo look the most promising, followed by Debian. Red Hat 9.0 seems like a really bad idea. Fedora could be a real wild card: it has good update times and the kernel we need, but it is still considered flaky.
Per our conversation on Monday, I changed where our domain points so that it is more intuitive. Now http://cluster.earlham.edu/ points to the HTML pages we have created; to reach the details, click the Browse Files link. The old link, http://cluster.earlham.edu/html, still works as before.
In the process I had to move /cluster/icons to /cluster/html/icons. The icons are not in CVS; should they be?
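If we do decide the icons belong in CVS, adding them might look roughly like this. Image files need `cvs add -kb` so CVS treats them as binary and does no keyword expansion or line-ending conversion. The sketch assumes it is run from the top of a CVS working copy (for example /cluster/html) and that every regular file in the directory is an image; the `cvs_add_binary_dir` helper and the commit message are hypothetical.

```shell
#!/bin/sh
# Sketch: put a directory of icons under CVS as binary files.
# Assumptions: run from the top of a CVS working copy (e.g. /cluster/html),
# and every regular file in the directory is binary (images).
cvs_add_binary_dir() {
    dir="${1:-icons}"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # A new directory must be added before the files inside it.
    cvs add "$dir" || return 1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # -kb: binary, so CVS does no keyword expansion or line-ending conversion.
        cvs add -kb "$f" || return 1
    done
    cvs commit -m "Add $dir as binary files" "$dir"
}
```

Usage would be `cd /cluster/html && cvs_add_binary_dir icons`.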
I also noticed that cvsweb was not working, so I fixed that as well.
I fixed the links in the HEADER file as well.
So here are some of my notes from imaging the Cairo nodes that were running OS X.
While I was waiting for some GROMACS code to finish compiling, I took the liberty of finishing the imaging of the new bazaar nodes (b16-20). They are all running now, eagerly waiting for work.
Since you need a special lam-bhost.conf file to work with these nodes, I have created one that we can all use. Instructions on how to use it are at the top of the file.
http://cluster.earlham.edu/home/joshh/src/lam-mpi/bazaar-annex.conf
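For anyone who wants a private boot schema covering only some of the new nodes, a minimal sketch that writes one is below. A LAM boot schema lists one host per line, optionally followed by `cpu=N`, and `#` starts a comment. The sketch assumes the hostnames b16 through b20 resolve from wherever `lamboot` is run and that each node has one CPU unless told otherwise; the `make_bhost` helper is hypothetical, and the shared bazaar-annex.conf may carry settings this sketch knows nothing about, so prefer it when in doubt.

```shell
#!/bin/sh
# Sketch: write a LAM/MPI boot schema for bazaar nodes b<first>..b<last>.
# Assumptions (hypothetical): hostnames are "b" plus the node number and
# every node has the same CPU count (default 1).
make_bhost() {
    out="$1"
    first="$2"
    last="$3"
    cpus="${4:-1}"
    if [ -z "$out" ] || [ -z "$first" ] || [ -z "$last" ]; then
        echo "usage: make_bhost FILE FIRST LAST [CPUS]" >&2
        return 2
    fi
    {
        echo "# LAM boot schema for b$first-b$last"
        i="$first"
        while [ "$i" -le "$last" ]; do
            echo "b$i cpu=$cpus"
            i=$((i + 1))
        done
    } > "$out"
}
```

Typical use: `make_bhost ~/lam-b16-20.conf 16 20`, then `lamboot -v ~/lam-b16-20.conf`, run jobs with `mpirun`, and shut LAM down with `lamhalt` when finished.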
cflow is installed on bazaar at /cluster/bazaar/bin/cflow and on cairo at /cluster/cairo/bin/cflow.
The RPMs failed silently, so I had to grab the source. A PPC version seems to be hard to locate, but I am still looking. *Just after posting I found a diff for PPC and installed cflow on cairo.*
From the usage documents I found online, cflow seems to be unstable on larger programs. After 40 minutes of trying it on GROMACS I have had no success, but the bundled examples and smaller C programs work fine.
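Since smaller programs work, one workaround that might be worth trying (untested on GROMACS) is to run cflow on one source file at a time and keep a record of which files it fails on. This is a sketch assuming `cflow` is on the PATH (for example /cluster/bazaar/bin or /cluster/cairo/bin) and accepts a single C file as its argument; the `cflow_each` helper and its output layout are hypothetical. Per-file graphs will not show calls that cross file boundaries.

```shell
#!/bin/sh
# Sketch: run cflow on each C file in a directory separately and record
# which files it fails on. Assumptions: cflow is on PATH and takes a single
# C source file as its argument; the output layout is hypothetical.
cflow_each() {
    src_dir="$1"
    out_dir="$2"
    if [ -z "$src_dir" ] || [ -z "$out_dir" ]; then
        echo "usage: cflow_each SRC_DIR OUT_DIR" >&2
        return 2
    fi
    mkdir -p "$out_dir" || return 1
    failed=0
    for f in "$src_dir"/*.c; do
        [ -e "$f" ] || continue   # no .c files: the glob stays unexpanded
        name=$(basename "$f" .c)
        # Graph goes to NAME.cflow; NAME.err is kept only when cflow fails.
        if cflow "$f" > "$out_dir/$name.cflow" 2> "$out_dir/$name.err"; then
            rm -f "$out_dir/$name.err"
        else
            echo "cflow failed on $f" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed file(s) failed"
    [ "$failed" -eq 0 ]
}
```

For example, `cflow_each ~/gromacs/src/mdlib ~/cflow-out` (paths hypothetical) would leave one `.cflow` file per source file plus `.err` files for the ones cflow chokes on.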
Documentation:
http://www.opengroup.org/onlinepubs/007904975/utilities/cflow.html
http://www.freealter.org/doc_distrib/cflow-2.0/#sect6
I was not able to fully upgrade gcc on hopper through the ports collection. Instead, the port installed the new compiler under a different binary name:
$ gcc --version
2.95.4
$ gcc33 --version
gcc33 (GCC) 3.3.1 20030707 (prerelease) [FreeBSD]
Is there a way to drop the 2.95.4 release of gcc and make the gcc33 binary available under a more common name (e.g. gcc)?
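One common approach, sketched here rather than tested on hopper: leave gcc 2.95.4 in /usr/bin alone (it is part of FreeBSD's base system, and rebuilding the base system expects it) and put a symlink named gcc, pointing at the port's gcc33, in a directory that comes earlier in PATH. The sketch assumes gcc33 (and, for C++, g++33) are on the PATH and that ~/bin is an acceptable per-user directory; `link_compiler` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: make a familiar compiler name (e.g. gcc) run a versioned port
# binary (e.g. gcc33) for one user, leaving /usr/bin/gcc untouched.
# Assumptions: the target is on PATH and BINDIR will be put first in PATH.
link_compiler() {
    name="$1"     # name to type, e.g. gcc
    target="$2"   # versioned binary installed by the port, e.g. gcc33
    bindir="$3"   # per-user directory, e.g. $HOME/bin
    if [ -z "$name" ] || [ -z "$target" ] || [ -z "$bindir" ]; then
        echo "usage: link_compiler NAME TARGET BINDIR" >&2
        return 2
    fi
    # Resolve the full path so the link does not depend on PATH later.
    target_path=$(command -v "$target") || {
        echo "link_compiler: $target not found on PATH" >&2
        return 1
    }
    mkdir -p "$bindir" || return 1
    ln -sf "$target_path" "$bindir/$name"
}
```

For example, `link_compiler gcc gcc33 "$HOME/bin"` (and `link_compiler g++ g++33 "$HOME/bin"` if the port installed g++33), then put `$HOME/bin` at the front of PATH in ~/.profile (`set path = (~/bin $path)` in ~/.cshrc for csh users) and run `hash -r` (`rehash` in csh) before checking `gcc --version`. For a single build, `make CC=gcc33` or `env CC=gcc33 ./configure` avoids renaming anything.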