Worklog | Work, notes, and projects. Thursday, September 1, 2005
 
March 14, 2006

Solaris patches

I’m getting bitten by Solaris patches. In particular, several recent patches have decided to muck with /etc/rc3.d scripts in the postpatch script. I remove a number of these scripts at Jumpstart time so that the systems they start won’t be started (things like automounter and volume management, which mostly serve to bite one in the hindquarters). Sun adding them back in messes with my systems. They have every right to modify the main file in /etc/init.d, but they should keep their hands off of my territory in /etc/rc?.d.

So I copied out the portion of my Jumpstart script that disables these, and I’ve made it a standalone script that can be run after any patches are applied. Just run it. Good maintenance, and swat those pesky Sun idjits.
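A minimal sketch of what such a post-patch cleanup could look like, written here in Python rather than whatever the Jumpstart script actually uses. The function name and the script names passed to it are hypothetical; the real list depends on which services were removed at Jumpstart time.

```python
import os


def disable_rc_scripts(rc_dir, names, dry_run=False):
    """Remove the named start scripts from rc_dir and return the ones found.

    rc_dir and names are supplied by the caller; nothing here knows which
    scripts a given patch puts back.
    """
    removed = []
    for name in names:
        path = os.path.join(rc_dir, name)
        # lexists() is also true for dangling symlinks, which should go too.
        if os.path.lexists(path):
            if not dry_run:
                os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Hypothetical script names, for illustration only.
    for name in disable_rc_scripts("/etc/rc3.d", ["S99example"], dry_run=True):
        print("would remove", name)
```

Running with dry_run=True first shows what a patch put back before anything is deleted.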

Posted by Rowan Littell at 03:51 PM
November 16, 2005

EYEWI tape drive failure

The standalone DLT drive on EYEWI seems to have lost its mind. I’ve replaced it with the DLT1 that used to be on MIR, and it seems happy now. I had to reconfigure NetBackup a bit to make it not complain about the drives, but the snapshot dump of the catalog filesystem works fine (although there seems to be something wacky with /tmp that I need to look into).

Posted by Rowan Littell at 12:04 PM

RAHU disk failure

Apparently the first system disk on RAHU died at some point. Since we’re mirroring, we’re still ok, but the VxFS snapshot that was using it has been failing. Sun is sending out a replacement drive.

Posted by Rowan Littell at 12:02 PM
November 10, 2005

ROJ crash

ROJ crashed last night at about 21:20 — apparently a memory error of some sort. It saved a crash dump and rebooted, and SHANTI recorded full system logs during the event. I’ve opened a case with Sun support and sent logs and crash dump on to them for analysis. Meanwhile, ROJ seems to be running well again.

Posted by Rowan Littell at 09:18 AM
October 20, 2005

ROJ RAID battery

The batteries on the A1000s are nearing the end of their life again. Some notes on what to look for.

  • Hardware error 3FD9 is the warning that the battery is getting old but not yet expired. Look for this in the GUI.
  • /usr/sbin/osa is the home for rm6, raidutil, and other goodies. rm6 is the GUI. raidutil can show battery age, reset battery age, and enable/disable write caching.

Posted by Rowan Littell at 09:36 AM
October 06, 2005

LDAP restart

I applied the Microsoft solution to an ailing LDAP situation. Rebooted one box and restarted Sun ONE on the other.

We’d been having spurts of failed logins. The LDAP logs gave no indication of what was wrong, since every failure could plausibly have been a bad password (some of these were cached passwords in SquirrelMail sessions, frex). Apparently something started to rot in at least one of the LDAP servers, and restarting the apps cleared things up.

Posted by Rowan Littell at 01:51 PM
September 14, 2005

Spam engine

Never a dull moment. I was wrangling with the PacketShaper all day, trying to get the student network to behave in any kind of reasonable fashion, and now I find that part of my Internet bandwidth problems was due to a veritable flood of spam that managed to be sent from a library computer through our e-mail gateway.

A few tens of thousands of spam messages were sent from LLYA019 through TAIKA, starting around 3:00 this afternoon. I don’t know how the spam malware gets the IP address of a good gateway to send its wares through, but I have two good guesses (and in case any nasties are reading this, I’m not going to elaborate — full disclosure of my thoughts, this ain’t). I didn’t discover this until just recently. I managed to clean up all the messages from the box still queued up on TAIKA (plenty, since I’d throttled outgoing SMTP from campus), and now the shaper reports a reasonable bandwidth utilization for that traffic class. I also blocked the infected box at both the shaper and the firewall, and I’ve seen no further mal-traffic from it (I also gave it a bogus IP in NetReg, with the hopes of kicking it off the net the next time it tries to renew its lease).

Tomorrow I need to stop mucking with the shaper and the firewall and get a handle on some other tasks in desperate need of my attention; perhaps Friday or next week I can get back to this and see if my tunings are having any kind of positive effect (without killing the rest of campus in their wake…). I can’t say I enjoy being on the student network, but it is instructive to feel their pain.

Posted by Rowan Littell at 09:32 PM
August 25, 2005

More LDAP Indices

Last night I rebuilt the LDAP indices on ASHTI after adding an index for uidNumber.

Everything went fine, and the index rebuilt within about 7 minutes. Total downtime was maybe 20 minutes, what with turning off services and restarting them.

  • Turn off LDAP synchronization cron jobs on BARIS, KE, and SHANTI.
  • Turn off Sendmail on BARIS and TAIKA (they’re LDAP-capable, and I’d prefer no Sendmail to LDAP connection errors during the downtime).
  • Upload new index LDIF.
  • Block LDAP access on ASHTI to ports 389 and 636 in the packet filter.
  • Run the indexing command.

Then reverse as appropriate to come back out of it.
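The "do the steps, then reverse them" pattern above can be sketched as a small runner. This is only an illustration: the step descriptions are taken from the list, but the function name is hypothetical and the steps just print, since the actual commands are not given here.

```python
def run_with_rollback(steps, main_action):
    """Run (do, undo) pairs in order, then main_action, then undo in reverse.

    A step whose undo is None (such as uploading the LDIF) has nothing to
    reverse. Undo actions run even if a later step raises.
    """
    to_undo = []
    try:
        for do, undo in steps:
            do()
            if undo is not None:
                to_undo.append(undo)
        main_action()
    finally:
        for undo in reversed(to_undo):
            undo()


if __name__ == "__main__":
    def say(message):
        # Placeholder action: prints instead of touching any service.
        return lambda: print(message)

    steps = [
        (say("stop LDAP sync cron jobs"), say("re-enable LDAP sync cron jobs")),
        (say("stop Sendmail"), say("start Sendmail")),
        (say("upload new index LDIF"), None),
        (say("block ports 389 and 636"), say("unblock ports 389 and 636")),
    ]
    run_with_rollback(steps, say("run the indexing command"))
```

Keeping the undo actions on a stack means services come back in the opposite order from which they were stopped, so the packet filter is opened before Sendmail and the sync jobs start talking to LDAP again.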

Posted by Rowan Littell at 09:49 AM
June 28, 2005

Storm control

Another storm passed through last night. Not much this time, other than a power flicker.

But that took out the servers outside of the machine room.

  • The Spectra tape library doesn’t seem to power on automatically after a power failure. Also seems I don’t recall the access codes for the front panel.
  • Changed a couple of Solaris boxes to use UFS logging with the hopes that they’ll be a little more graceful about power flicker resets.

Posted by Rowan Littell at 08:55 AM
June 27, 2005

Machine room A/C

Went out Saturday around 5. The paging network has been flakey, so I didn’t get in until midnight. No immediate damage, but I’m expecting a few disks to die in the next while.

Posted by Rowan Littell at 10:12 AM
June 09, 2005

Index note in LDAP log file

From the Sun blueprint book: to determine whether a search in the LDAP directory server was not answered by an index, look for notes=U in the RESULT section. Then look for the corresponding SRCH and find out what it was searching on; create an index for that.
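A rough sketch of that lookup in Python. It assumes the access log pairs each SRCH line with its RESULT line through matching conn= and op= values; that layout, the function name, and the sample lines are assumptions to check against a real log.

```python
import re

# Assumed: every operation line carries "conn=N op=M".
CONN_OP = re.compile(r"conn=(\d+) op=(\d+)")


def unindexed_searches(lines):
    """Return the SRCH lines whose matching RESULT line carries notes=U."""
    searches = {}
    hits = []
    for line in lines:
        match = CONN_OP.search(line)
        if not match:
            continue
        key = match.groups()
        if " SRCH " in line:
            searches[key] = line.strip()
        elif " RESULT " in line and "notes=U" in line and key in searches:
            hits.append(searches[key])
    return hits


if __name__ == "__main__":
    # Made-up sample lines in the assumed format.
    sample = [
        'conn=5 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uidNumber=1000)"',
        "conn=5 op=1 RESULT err=0 tag=101 nentries=1 etime=0 notes=U",
    ]
    for line in unindexed_searches(sample):
        print(line)
```

The filter= part of each reported SRCH line names the attribute that is a candidate for a new index.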

Posted by Rowan Littell at 08:39 PM
May 25, 2005

ASHTI belly up

I tried replacing the temporary disk on ASHTI this morning with one of the old 18 G drives from KE. I think in the end it’ll work fine, but in the short term I toasted some filesystems and essentially killed ASHTI.

I managed to install the Sun ONE directory server on SITH and copy over all the data properly, so we at least have LDAP service restored (updating the DNS records for directory.earlham.edu). SITH is now part of the regular NetBackup rotation again.

When I return I’ll salvage anything else off of ASHTI that I need and then jumpstart it. I’m going to try making it and SITH into a dual-master LDAP cluster so that hardware failures don’t take us out quite this way again.

Posted by Rowan Littell at 03:59 PM
May 16, 2005

EYEWI disk magic

EYEWI’s /home partition was full again this weekend, causing backups to fail. I eked out another 12 GB of spare space by doing some partition shuffling and moved NetBackup to the new partition.

When I installed the two extra drives for the RAID 5 data partition, I had to use partitions the same size as the two partitions on the first two drives. That meant I had several slices left over, as well as about 12 GB on each drive. I rearranged the slices and made slice 0 a 12 GB slice on each of those drives. I then mirrored them and formatted them. Then I copied the NetBackup installation and catalog backup area onto the new partition and let things go. This should hold NetBackup for a while, although we may want to replace those drives with 72 or 143 GB drives at some point (or get an actual hardware RAID LUN for the data).

Posted by Rowan Littell at 01:38 PM
May 11, 2005

ASHTI recovered

Looks like /usr/local was indeed toast, but I recreated the filesystem, reinstalled the packages, and restarted some services, and it seems better now.

Forcibly umounting /usr/local, fscking the partition, and remounting still generated I/O errors on remount, so I simply newfs’d the partition. I generated a list of SMC and ECS packages installed into /usr/local, and reinstalled them from ROJ (and three from my home directory on ASHTI: Net-SNMP and its dependencies, OpenSSL and libgcc).

I had to recreate the Net-SNMP config file, which also meant changing the path to perl in /usr/local/bin/snmpconf. Whatever.

I restarted Net-SNMP using the script in /etc/rc3.d, and it was fine. I restarted /etc/rc2.d/S72inetsvc to restart inetd, and NetBackup is now fine.

I’m probably going to reboot the whole machine at some point just to make sure things are peachy, but for now it’s running well.

Now we just need to see if we can scare up an 18 GB replacement hard drive. Sun doesn’t seem to want to sell me one; wouldn’t I much rather have 36 or 73 GB?

Posted by Rowan Littell at 10:06 AM
May 10, 2005

ASHTI disk failure

You’d think we had an air conditioning failure, the way disks are melting down recently. ASHTI’s first disk died and I had to scramble to get a replacement in.

ASHTI was still running ok, but swap was unhappy and thus not allowing logins.

I took one of the hot spare 18G disks out of PACO’s A1000 unit, stuck it in SITH for partitioning, and then swapped it for the dead drive in ASHTI. I had the partitions slightly wrong, so I had to repartition (I’d swapped the 0 and 1 partitions), but it at least was running well enough to let me log in and do that.

Right now, the mirrors are still rebuilding. /usr/local has some errors and complains that it needs to be fsck’d. This may be true, or it may be fallout from the incorrect initial partition. I’ll wait until we’re done rebuilding, and then see if the errors are still there. If so, I’ll have to see about recovery there.

In the meantime, I’ve fired off a request for a new 18G drive.

Posted by Rowan Littell at 10:31 AM
May 02, 2005

A1000 disk failure

ROJ’s A1000 had a disk failure this morning.

It failed over to the hot spare, and I have a case open with Sun to get me a replacement part.

Posted by Rowan Littell at 04:55 PM
April 20, 2005

Disk failure on Xserve RAID

Disk 8 (first disk on right side) failed at 5:58 this morning.

It’s currently rebuilding onto the hot spare disk 14 and should be done in an hour or so. I replaced the failed disk with the spare from the parts kit and will call Apple for a replacement spare.

I hope this infernal beep turns off when the array is finished rebuilding.

Posted by Rowan Littell at 07:46 AM
March 31, 2005

More comment spam attacks

MovableType seems to suffer when it gets a comment spam attack.

I don’t know if they’re designed to be DoS attacks or just nasty spam attacks, but when they happen they spike the load average on Heiwa and bring it to its knees. Fortunately today I was already logged in there and had a pretty good idea of what was happening before I could confirm it.

Steps to fix: block access to the web server at the shaper (to get the load back under control), grab the Apache log file and identify the comment spam offending IP address(es) (look for “mt-comment” in the request log), block the offending address(es) at the shaper, and then unblock the web server.
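The log-combing step can be sketched in Python as below. It assumes the Apache log is in the common or combined format, where the client address is the first whitespace-separated field; the function name and the sample addresses are made up for illustration.

```python
from collections import Counter


def comment_spam_sources(lines, marker="mt-comment"):
    """Count requests containing marker per client address, busiest first."""
    counts = Counter()
    for line in lines:
        if marker in line:
            fields = line.split()
            if fields:
                # Assumed: first field of the log line is the client address.
                counts[fields[0]] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Made-up sample lines using documentation addresses.
    sample = [
        '192.0.2.10 - - [31/Mar/2005:19:40:01 -0500] "POST /mt/mt-comments.cgi HTTP/1.0" 200 512',
        '192.0.2.10 - - [31/Mar/2005:19:40:02 -0500] "POST /mt/mt-comments.cgi HTTP/1.0" 200 512',
        '192.0.2.20 - - [31/Mar/2005:19:40:03 -0500] "GET /index.html HTTP/1.0" 200 1024',
    ]
    for address, hits in comment_spam_sources(sample):
        print(address, hits)
```

The addresses at the top of the list are the ones to block at the shaper before unblocking the web server.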

Posted by Rowan Littell at 07:56 PM
March 08, 2005

Trackback spam

Cleaned out a few trackback spams.

Apparently I’d left pings turned on for a few posts. Silly me. I usually mean to disable comments and trackbacks in here, since I’m really just talking to myself.

So this post is partly to see if deleting those trackbacks will make ecto happy. It can’t seem to update its list of recent posts, nor figure out that it actually posted the one yesterday.

Posted by Rowan Littell at 08:01 AM
February 28, 2005

PACO and a slightly unhappy disk

PACO seems to have a slightly unhappy disk, but probably nothing to worry about yet.

Got this in the error log today:

Feb 28 13:44:13 paco scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Feb 28 13:44:13 paco scsi: [ID 107833 kern.notice] Requested Block: 2889 Error Block: 2889
Feb 28 13:44:13 paco scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0416A7KZH6
Feb 28 13:44:13 paco scsi: [ID 107833 kern.notice] Sense Key: Hardware Error
Feb 28 13:44:13 paco scsi: [ID 107833 kern.notice] ASC: 0x19 (defect list error), ASCQ: 0x0, FRU: 0x2

As far as I can tell, it’s just found a new bad block. Keep an eye on it, but don’t call Sun for replacement just yet.
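For keeping an eye on it, a small Python sketch that pulls the device, error block, and sense key out of messages shaped like the ones above. The patterns are fitted to this one excerpt, so treat them as assumptions; the function name is hypothetical.

```python
import re

# Patterns fitted to the excerpt above; other drivers may log differently.
WARNING = re.compile(r"WARNING: (\S+) \((\w+)\):")
ERROR_BLOCK = re.compile(r"Error Block: (\d+)")
SENSE_KEY = re.compile(r"Sense Key: (.+)$")


def summarize_scsi_errors(lines):
    """Group each WARNING line with the detail lines that follow it."""
    events = []
    current = None
    for line in lines:
        match = WARNING.search(line)
        if match:
            current = {"path": match.group(1), "device": match.group(2)}
            events.append(current)
            continue
        if current is None:
            continue
        match = ERROR_BLOCK.search(line)
        if match:
            current["error_block"] = int(match.group(1))
        match = SENSE_KEY.search(line)
        if match:
            current["sense_key"] = match.group(1).strip()
    return events
```

A growing set of distinct error blocks on the same device over a few days would be a better argument for calling Sun than a single event.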

Posted by Rowan Littell at 03:31 PM
February 25, 2005

LDAP Indexing

Yesterday I rebuilt the LDAP indices.

It takes about 30 to 45 minutes for a full rebuild, assuming slapd doesn’t crash in the middle. Be sure to turn off the heavy hitters for LDAP: Samba and RADIUS (small amounts of LDAP traffic are OK).

I’m not sure it really solved the problem, though. And I’m not sure what the problem was, other than ASHTI showing a load average of around 1 for long periods of time and Samba complaining about not being able to connect to LDAP a lot. It may be time to upsize the LDAP server.

Posted by Rowan Littell at 11:52 AM
February 02, 2005

PacketShaper Replaced

The replacement for our dead PacketShaper arrived today, and with Aaron and Kevan’s help, I got it online.

Aaron and Kevan got it rackmounted and connected to the appropriate cables. At that point I ran through the guided basic setup to give it its network identity. With that, I was able to connect over FTP, transfer the config.ldi file, and then load the current configuration as the previous shaper had it. All in all, quite straightforward. The new unit has 512MB of RAM, which bodes well for NetFlow reports.

Posted by Rowan Littell at 03:48 PM
January 28, 2005

PacketShaper Replacement

While upgrading our main shaper from OS 6.0.1 to 7.0.1, we found a dead hard drive.

After rebooting, the unit dropped into the debugger and completely failed to respond to anything. Packeteer support says it’s a dead hard drive, and they’ll be shipping us a replacement unit. The side benefit of this is that we may get a unit that has (or can be upgraded to have) enough memory to run adaptive response and/or NetFlow reports.

The student shaper got upgraded to 7.0.1 without incident.

Posted by Rowan Littell at 10:37 AM