Solaris patches
I’m getting bitten by Solaris patches. In particular, several recent patches have decided to muck with /etc/rc3.d scripts in the postpatch script. I remove a number of these scripts at Jumpstart time so that the systems they start won’t be started (things like automounter and volume management, which mostly serve to bite one in the hindquarters). Sun adding them back in messes with my systems. They have every right to modify the main file in /etc/init.d, but they should keep their hands off of my territory in /etc/rc?.d.
So I copied out the portion of my Jumpstart script that disables these, and I’ve made it a standalone script that can be run after any patches are applied. Just run it. Good maintenance, and swat those pesky Sun idjits.
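A minimal sketch of that kind of post-patch cleanup, written in Python for illustration (the original script isn't shown here and was probably shell). The service names `autofs` and `volmgt` and the function name `disable_rc_scripts` are assumptions; substitute whatever the Jumpstart finish script actually removes.

```python
import os
import tempfile

# Hypothetical service names: the post only says "automounter and volume management".
UNWANTED = ("autofs", "volmgt")


def disable_rc_scripts(rc_dir, unwanted=UNWANTED, dry_run=False):
    """Remove S/K scripts in rc_dir whose service name is in `unwanted`.

    rc script names look like S74autofs: a letter, two digits, the service name.
    Returns the sorted list of file names that were (or would be) removed.
    """
    removed = []
    for name in sorted(os.listdir(rc_dir)):
        is_rc_name = len(name) > 3 and name[0] in "SK" and name[1:3].isdigit()
        if is_rc_name and name[3:] in unwanted:
            if not dry_run:
                os.remove(os.path.join(rc_dir, name))
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Demonstrate against a scratch directory rather than a live /etc/rc3.d.
    scratch = tempfile.mkdtemp()
    for fake in ("S74autofs", "S81volmgt", "S50apache"):
        open(os.path.join(scratch, fake), "w").close()
    print(disable_rc_scripts(scratch, dry_run=True))
```

Running it with `dry_run=True` first shows what a patch put back before anything is deleted.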
EYEWI tape drive failure
The standalone DLT drive on EYEWI seems to have lost its mind. I’ve replaced it with the DLT1 that used to be on MIR, and it seems happy now. I had to reconfigure NetBackup a bit to make it not complain about the drives, but the snapshot dump of the catalog filesystem works fine (although there seems to be something wacky with /tmp that I need to look into).
RAHU disk failure
Apparently the first system disk on RAHU died at some point. Since we’re mirroring, we’re still ok, but the VxFS snapshot that was using it has been failing. Sun is sending out a replacement drive.
ROJ crash
ROJ crashed last night at about 21:20 — apparently a memory error of some sort. It saved a crash dump and rebooted, and SHANTI recorded full system logs during the event. I’ve opened a case with Sun support and sent logs and crash dump on to them for analysis. Meanwhile, ROJ seems to be running well again.
ROJ RAID battery
The batteries on the A1000s are nearing the end of their life again. Some notes on what to look for.
- Hardware error 3FD9 is the warning that the battery is getting old but not yet expired. Look for this in the GUI.
- /usr/sbin/osa is the home for rm6, raidutil, and other goodies. rm6 is the GUI. raidutil can show battery age, reset battery age, and enable/disable write caching.
LDAP restart
I applied the Microsoft solution to an ailing LDAP situation. Rebooted one box and restarted Sun ONE on the other.
We’d been having spurts of failed logins, with no indication in the LDAP logs of what was wrong, since the failures could all plausibly be read as bad passwords (some of these were cached passwords in SquirrelMail sessions, for example). Apparently something started to rot in at least one of the LDAP servers, and restarting the apps cleared things up.
Spam engine
Never a dull moment. I was wrangling with the PacketShaper all day, trying to get the student network to behave in any kind of reasonable fashion, and now I find that part of my Internet bandwidth problems were due to a veritable flood of spam that managed to be sent from a library computer through our e-mail gateway.
A few tens of thousands of spam messages were sent from LLYA019 through TAIKA, starting around 3:00 this afternoon. I don’t know how the spam malware gets the IP address of a good gateway to send its wares through, but I have two good guesses (and in case any nasties are reading this, I’m not going to elaborate — full disclosure of my thoughts, this ain’t). I didn’t discover this until just recently. I managed to clean up all the messages from the box still queued up on TAIKA (plenty, since I’d throttled outgoing SMTP from campus), and now the shaper reports a reasonable bandwidth utilization for that traffic class. I also blocked the infected box at both the shaper and the firewall, and I’ve seen no further mal-traffic from it (I also gave it a bogus IP in NetReg, with the hopes of kicking it off the net the next time it tries to renew its lease).
Tomorrow I need to stop mucking with the shaper and the firewall and get a handle on some other tasks in desperate need of my attention; perhaps Friday or next week I can get back to this and see if my tunings are having any kind of positive effect (without killing the rest of campus in their wake…). I can’t say I enjoy being on the student network, but it is instructive to feel their pain.
More LDAP Indices
Last night I rebuilt the LDAP indices on ASHTI after adding an index for uidNumber.
Everything went fine, and the index rebuilt within about 7 minutes. Total downtime was maybe 20 minutes, what with turning off services and restarting them. The steps going in:
- Turn off LDAP synchronization cron jobs on BARIS, KE, and SHANTI.
- Turn off Sendmail on BARIS and TAIKA (they’re LDAP-capable, and I’d prefer no Sendmail to LDAP connection errors during the downtime).
- Upload new index LDIF.
- Block LDAP access on ASHTI to ports 389 and 636 in the packet filter.
- Run the indexing command.
Then reverse as appropriate to come back out of it.
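The "reverse as appropriate" part can be sketched as a small runbook: each step going down is paired with the step that undoes it, and the way back up is just the undo steps in reverse order. This only prints an ordering and runs nothing; the undo wording and the names `STEPS`, `going_down`, and `coming_up` are assumptions for illustration.

```python
# Each entry: (step on the way down, step that undoes it, or None if nothing to undo)
STEPS = [
    ("Turn off LDAP synchronization cron jobs on BARIS, KE, and SHANTI",
     "Turn LDAP synchronization cron jobs back on"),
    ("Turn off Sendmail on BARIS and TAIKA",
     "Start Sendmail on BARIS and TAIKA"),
    ("Upload new index LDIF", None),
    ("Block LDAP ports 389 and 636 on ASHTI in the packet filter",
     "Unblock LDAP ports 389 and 636 on ASHTI"),
    ("Run the indexing command", None),
]


def going_down(steps):
    """Steps in the order they are performed before and during the rebuild."""
    return [down for down, _ in steps]


def coming_up(steps):
    """Undo steps in reverse order, skipping steps that have nothing to undo."""
    return [up for _, up in reversed(steps) if up is not None]


if __name__ == "__main__":
    for line in going_down(STEPS) + coming_up(STEPS):
        print("-", line)
```

Unblocking the ports comes first on the way back so that LDAP is reachable again before Sendmail and the sync jobs start hitting it.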
Storm control
Another storm passed through last night. Not much this time, other than a power flicker.
But that took out the servers outside of the machine room.
- The Spectra tape library doesn’t seem to power on automatically after a power failure. It also seems I don’t recall the access codes for the front panel.
- Changed a couple of Solaris boxes to use UFS logging with the hopes that they’ll be a little more graceful about power flicker resets.
Machine room A/C
Went out Saturday around 5. The paging network has been flaky, so I didn’t get in until midnight. No immediate damage, but I’m expecting a few disks to die in the next while.
Index note in LDAP log file
From the Sun blueprint book: to determine whether a search in the LDAP directory server was not answered by an index, look for notes=U in the RESULT section. Then look for the corresponding SRCH and find out what it was searching on; create an index for that.
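A rough sketch of that lookup in Python: remember each SRCH line by its conn/op pair, and when the matching RESULT carries notes=U, count the filter it was searching on. The log-line layout in the comments is an assumption about the access log format, and `unindexed_filters` is a made-up name; check both against a real log before trusting the output.

```python
import re
from collections import Counter

# Assumed access-log shape:
#   [ts] conn=12 op=3 SRCH base="..." scope=2 filter="(uidNumber=1001)" attrs=ALL
#   [ts] conn=12 op=3 RESULT err=0 tag=101 nentries=1 etime=2 notes=U
SRCH_RE = re.compile(r'conn=(\d+) op=(\d+) SRCH .*?filter="([^"]*)"')
RESULT_RE = re.compile(r'conn=(\d+) op=(\d+) RESULT (.*)')


def unindexed_filters(lines):
    """Count search filters whose RESULT line is flagged notes=U."""
    pending = {}      # (conn, op) -> filter string from the SRCH line
    hits = Counter()  # filter string -> number of unindexed searches
    for line in lines:
        m = SRCH_RE.search(line)
        if m:
            pending[(m.group(1), m.group(2))] = m.group(3)
            continue
        m = RESULT_RE.search(line)
        if m:
            # Drop the pending entry on every RESULT so the dict stays small.
            flt = pending.pop((m.group(1), m.group(2)), None)
            if flt is not None and "notes=U" in m.group(3).split():
                hits[flt] += 1
    return hits
```

Feeding it an open log file and printing `unindexed_filters(f).most_common(10)` gives a short list of attributes worth indexing.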
ASHTI belly up
I tried replacing the temporary disk on ASHTI this morning with one of the old 18 G drives from KE. I think in the end it’ll work fine, but in the short term I toasted some filesystems and essentially killed ASHTI.
I managed to install the Sun ONE directory server on SITH and copy over all the data properly, so we at least have LDAP service restored (which meant updating the DNS records for directory.earlham.edu). SITH is now part of the regular NetBackup rotation again.
When I return I’ll salvage anything else off of ASHTI that I need and then jumpstart it. I’m going to try making it and SITH into a dual-master LDAP cluster so that hardware failures don’t take us out quite this way again.
EYEWI disk magic
EYEWI’s /home partition was full again this weekend, causing backups to fail. I eked out another 12 GB of spare space by doing some partition shuffling and moved NetBackup to the new partition.
When I installed the two extra drives for the RAID 5 data partition, I had to use partitions the same size as the two partitions on the first two drives. That meant I had several slices left over, as well as about 12 GB on each drive. I rearranged the slices and made slice 0 a 12 GB slice on each of those drives. I then mirrored them and formatted them. Then I copied the NetBackup installation and catalog backup area onto the new partition and let things go. This should hold NetBackup for a while, although we may want to replace those drives with 72 or 143 GB drives at some point (or get an actual hardware RAID LUN for the data).
ASHTI recovered
Looks like /usr/local was indeed toast, but I recreated the filesystem, reinstalled the packages, and restarted some services, and it seems better now.
Forcibly umounting /usr/local, fscking the partition, and remounting still generated I/O errors on remount, so I simply newfs’d the partition. I generated a list of SMC and ECS packages installed into /usr/local, and reinstalled them from ROJ (and three from my home directory on ASHTI: Net-SNMP and its dependencies of OpenSSL and libgcc).
I had to recreate the Net-SNMP config file, which also meant changing the path to perl in /usr/local/bin/snmpconf. Whatever.
I restarted Net-SNMP using the script in /etc/rc3.d, and it was fine. I restarted /etc/rc2.d/S72inetsvc to restart inetd, and NetBackup is now fine.
I’m probably going to reboot the whole machine at some point just to make sure things are peachy, but for now it’s running well.
Now we just need to see if we can scare up an 18 GB replacement hard drive. Sun doesn’t seem to want to sell me one; wouldn’t I much rather have 36 or 73 GB?
ASHTI disk failure
You’d think we had an air conditioning failure recently, the way disks are melting down. ASHTI’s first disk died and I had to scramble to get a replacement in.
ASHTI was still running ok, but swap was unhappy and thus not allowing logins.
I took one of the hot spare 18G disks out of PACO’s A1000 unit, stuck it in SITH for partitioning, and then swapped it for the dead drive in ASHTI. I had the partitions slightly wrong, so I had to repartition (I’d swapped the 0 and 1 partitions), but it at least was running well enough to let me log in and do that.
Right now, the mirrors are still rebuilding. /usr/local has some errors and complains that it needs to be fsck’d. This may be true, or it may be fallout from the incorrect initial partition. I’ll wait until we’re done rebuilding, and then see if the errors are still there. If so, I’ll have to see about recovery there.
In the meantime, I’ve fired off a request for a new 18G drive.
A1000 disk failure
ROJ’s A1000 had a disk failure this morning.
It failed over to the hot spare, and I have a case open with Sun to get me a replacement part.
Disk failure on Xserve RAID
Disk 8 (first disk on right side) failed at 5:58 this morning.
It’s currently rebuilding onto the hot spare disk 14 and should be done in an hour or so. I replaced the failed disk with the spare from the parts kit and will call Apple for a replacement spare.
I hope this infernal beep turns off when the array is finished rebuilding.
More comment spam attacks
MovableType seems to suffer when it gets a comment spam attack.
I don’t know if they’re designed to be DoS attacks or just nasty spam attacks, but when they happen they spike the load average on Heiwa and bring it to its knees. Fortunately today I was already logged in there and had a pretty good idea of what was happening before I could confirm it.
Steps to fix:
- Block access to the web server at the shaper (to get the load back under control).
- Grab the Apache log file and identify the comment spam offending IP address(es) (look for “mt-comment” in the request log).
- Block the offending address(es) at the shaper.
- Unblock the web server.
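The log-grepping step is easy to sketch in Python. This assumes the Apache log is in common or combined format, so the client address is the first whitespace-separated field; the function name `comment_spammers` and the threshold of 20 are made up for illustration.

```python
from collections import Counter


def comment_spammers(log_lines, marker="mt-comment", threshold=20):
    """Return (address, hits) pairs for clients hammering the comment script.

    Counts lines containing `marker`, keyed by the first field of the line,
    and keeps only clients with at least `threshold` hits, busiest first.
    """
    counts = Counter()
    for line in log_lines:
        if marker in line:
            fields = line.split(None, 1)
            if fields:
                counts[fields[0]] += 1
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Synthetic sample lines; the addresses are from the documentation range.
    sample = ['192.0.2.10 - - [x] "POST /mt/mt-comments.cgi HTTP/1.1" 200 512'] * 25
    sample.append('192.0.2.20 - - [x] "GET /index.html HTTP/1.1" 200 1024')
    for ip, n in comment_spammers(sample):
        print(ip, n)
```

The threshold keeps the odd legitimate commenter off the block list; tune it to whatever a normal day looks like.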
Trackback spam
Cleaned out a few trackback spams.
Apparently I’d left pings turned on for a few posts. Silly me. I usually mean to disable comments and trackbacks in here, since I’m really just talking to myself.
So this post is partly to see if deleting those trackbacks will make ecto happy. It can’t seem to update its list of recent posts, nor figure out that it actually posted the one yesterday.
PACO and a slightly unhappy disk
PACO seems to have a slightly unhappy disk, but probably nothing to worry about yet.
Got this in the error log today:
[Five log entries stamped Feb 28 13:44:13; the message text was not preserved.]
As far as I can tell, it’s just found a new bad block. Keep an eye on it, but don’t call Sun for replacement just yet.
LDAP Indexing
Yesterday I rebuilt the LDAP indices.
It takes about 30 to 45 minutes for a full rebuild, assuming slapd doesn’t crash in the middle. Be sure to turn off the heavy hitters for LDAP: Samba and RADIUS (small amounts of LDAP traffic are ok).
I’m not sure it really solved the problem, though. And I’m not sure what the problem was, other than ASHTI showing a load average of around 1 for long periods of time and Samba complaining about not being able to connect to LDAP a lot. It may be time to upsize the LDAP server.
PacketShaper Replaced
The replacement for our dead PacketShaper arrived today, and with Aaron and Kevan’s help, I got it online.
Aaron and Kevan got it rackmounted and connected to the appropriate cables. At that point I ran through the guided basic setup to give it its network identity. With that, I was able to connect over FTP, transfer the config.ldi file, and then load the current configuration as the previous shaper had it. All in all, quite straightforward. The new unit has 512MB of RAM, which bodes well for NetFlow reports.
PacketShaper Replacement
While upgrading our main shaper from OS 6.0.1 to 7.0.1, we found a dead hard drive.
After rebooting, the unit dropped into the debugger and completely failed to respond to anything. Packeteer support says it’s a dead hard drive, and they’ll be shipping us a replacement unit. The side benefit of this is that we may get a unit that has (or can be upgraded to) enough memory to run adaptive response and/or NetFlow reports.
The student shaper got upgraded to 7.0.1 without incident.

