The sounds of silence,

published at 11:07am on 07/26/06, with 4 Comments

For the past several months they have been doing construction on the outside of my building. This is once-in-a-decade style work, and we have been told that it’ll all be worth it and that we won’t have this kind of work for quite some time to come. That is small comfort when I wake up to the sound of grinders cutting away the brickwork on the facade outside my window.

I would now like to share this experience with you.



Filed under: Personal, with 4 Comments

Tools of the Trade,

published at 12:07pm on 07/04/06, with 2 Comments

The New York Times recently ran a photo feature about the tools of the trade of today’s traveling chefs, and the carrying cases they use to transport their collections of knives, spoons and other implements. These are wonderful collections, both in the tools themselves (says Ian Lai, “There is a lot of energy in the tools”) and in the cases that these chefs have found to carry them.

I was thinking that these days, it is becoming increasingly rare for people to have their own tools or, for that matter, their own trade. My friends the traders (to trade is their trade) live on their Bloomberg terminals at their offices (“I have six screens on my desk!”) and this technology is essential for the proper execution of their daily work, but these tools could never carry the same emotional weight as my great grandfather’s cooperage tools hanging above the fireplace at my parents’ house. In half a century, nobody will be hanging a computer terminal over their fireplace and pondering it as a family heirloom. Of course one could argue that nostalgia only goes so far, and the people who once worked in the factories that dotted all of downtown New York felt about the same about the tools of their particular trade as, say, the real estate broker does about her computer terminal.
Read the rest of this entry »

Filed under: Observations, with 2 Comments

Too many choices, too little skin,

published at 11:06pm on 06/29/06, with 8 Comments

I recently fell off my bicycle.

It was a pretty hard fall, and I have nobody to blame but myself. I was coming off the bridge, thinking to myself, “self, you are going pretty fast, you should definitely slow down.” I remember thinking “wow, you’re going to hit that fence.” And I remember thinking “shit” as I overcompensated, the back tire kicked out, and I drove the front of my bike into the asphalt. I have a nice hole in my arm and some pretty nasty road rash.

When I was younger, I would get a Bandaid, stick it on my booboo and be done with it. I probably rubbed some aloe from the aloe plant in the kitchen onto the wound, all the while wondering how this plant was going to help heal me any faster. I was little, and my little body was pretty adept at patching itself up. Not much changed when I was in college, nor since then. In fact, I think that I’ve been using the same mega box of Bandaids since I first stocked up my medicine cabinet in my first post-school apartment. So here I am, with a bloody arm and a scraped up leg, and I realize that a Bandaid just is not going to cut it.

Read the rest of this entry »

Filed under: Observations, with 8 Comments

What do we have here?,

published at 1:05am on 05/09/06, with No Comments

Over two years ago, I’d written to danah about starting a blog, which prompted a discussion about the difference between journals and blogs. Though there is probably some sort of etiquette surrounding the quoting of one’s self, I certainly do not know what it is, so I will now proceed to quote with reckless abandon.

Read the rest of this entry »

Filed under: pith, with No Comments

How I Fixed My Raid-1 Partition Size Error,

published at 12:07am on 07/23/05, with 17 Comments

How did it start?

The first indication that something was wrong with the server came on July 10, 2005, in the form of error messages mailed to me by the command I run hourly to report system anomalies.

	Jul 10 04:16:11 loco kernel: attempt to access beyond end of device
	Jul 10 04:16:11 loco kernel: 09:03: rw=0, want=56050716, limit=56050688

Every hour, at around the same time, these errors started cropping up. I looked through all the crontabs and found one command that ran at that time: a bounced-mail queue processor that I run for one of my projects. Turning off the process stopped the errors, and I thought that perhaps we just had a couple of corrupted files. The next morning, the errors started cropping up again, one or two at a time.
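For anyone who wants to pull the numbers out of kernel lines like the ones quoted above, here is a minimal sketch. The helper name overrun_from_log is made up for illustration. It assumes the log line has the same layout as the one in this post: a major:minor device pair followed by want= and limit= fields. Major number 9 is the Linux md (software RAID) driver, so 09:03 lines up with /dev/md3, the device that comes up later in this post. The units of want and limit depend on the kernel version, so the sketch only reports the raw difference.

```shell
# Sketch: decode one "attempt to access beyond end of device" detail line.
# overrun_from_log is a hypothetical helper name, not part of any tool.
overrun_from_log() {
    # $1 is a log line such as:
    #   Jul 10 04:16:11 loco kernel: 09:03: rw=0, want=56050716, limit=56050688
    printf '%s\n' "$1" | awk '
        {
            want = ""; limit = ""; dev = ""
            for (i = 1; i <= NF; i++) {
                field = $i
                gsub(/,/, "", field)                 # drop trailing commas
                if (field ~ /^want=/)  want  = substr(field, 6)
                if (field ~ /^limit=/) limit = substr(field, 7)
                # the device shows up as "major:minor:" (two hex numbers)
                if (field ~ /^[0-9a-fA-F]+:[0-9a-fA-F]+:$/)
                    dev = substr(field, 1, length(field) - 1)
            }
            if (want != "" && limit != "")
                printf "device=%s overrun=%d\n", dev, want - limit
        }'
}
```

Calling it with the second log line from this post reports how far past the end of the device the request reached, in the kernel's own units.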

Realizing that this could be a sign that the drives were eating themselves, I decided to head to the data center for a bit of one-on-one time with the server.

So what did we do?

The first thing I did was drop the system into single-user mode. We’re running ext3 filesystems on software RAID-1 on two 73 GB SCSI drives. I decided that I would try e2fsck on the partition that was giving me problems, but I kept running into the following error:

	The filesystem size (according to the superblock) is xxx 
	The physical size of the device is xxx
	Either the superblock or the partition table is likely to be corrupt!

OK, so that’s a bit puzzling. I spent more time puzzling over it and found absolutely nothing on Google that gave any indication of what might be going on, until I came across the following gem in an article about converting a running system into a RAID-1 system:

Step-11 – resize filesystem
When we created the raid device, the physical partition became slightly smaller because a second superblock is stored at the end of the partition. If you reboot the system now, the reboot will fail with an error indicating the superblock is corrupt.

https://raid.wiki.kernel.org/index.php/Tweaking,_tuning_and_troubleshooting#Step-11_-_resize_filesystem
http://howtos.linux.com/howtos/Software-RAID-HOWTO-7.shtml#ss7.6 (source has moved)

Eureka!

It appears that when we originally set up the RAID, we never resized the partitions. For the past year or so, the system has been running along without any problems because it just never wrote to that part of the disk. A couple of files must have made it out to this portion of the disk where the RAID superblock is stored, and the RAID system wouldn’t let it write and was throwing the errors that I saw. However, resizing the partitions without repairing them first will throw the following error:

	attempt to read block from filesystem resulted in short read while trying to resize

Obviously there was a problem with the drive that needed to be addressed.
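The mismatch that e2fsck complained about can also be checked by hand. This is a rough sketch, not what I ran at the time: fs_overhang is a hypothetical helper, and the usage lines assume that dumpe2fs and blockdev are installed and that the device is /dev/md3 as in this post.

```shell
# fs_overhang is a hypothetical helper. It takes three numbers:
#   $1  filesystem block count   (the "Block count:" line of dumpe2fs -h)
#   $2  filesystem block size in bytes (the "Block size:" line)
#   $3  device size in bytes     (blockdev --getsize64)
# It prints how many bytes the filesystem extends past the device, or 0.
fs_overhang() {
    fs_bytes=$(( $1 * $2 ))
    if [ "$fs_bytes" -gt "$3" ]; then
        echo $(( fs_bytes - $3 ))
    else
        echo 0
    fi
}

# Usage on a real system (as root), left commented out on purpose:
#   blocks=$(dumpe2fs -h /dev/md3 2>/dev/null | awk -F: '/^Block count/ {gsub(/ /, "", $2); print $2}')
#   bsize=$(dumpe2fs -h /dev/md3 2>/dev/null | awk -F: '/^Block size/ {gsub(/ /, "", $2); print $2}')
#   fs_overhang "$blocks" "$bsize" "$(blockdev --getsize64 /dev/md3)"
```

A non-zero result would mean the filesystem believes it is bigger than the RAID device underneath it, which is the situation described above.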

Fixing the problem

The solution was actually quite straightforward, once I got all the steps in place. There were two time-consuming parts to this process. First, I had to figure out what was wrong. And second, I needed to wait to repair the drive. In the process of trying to write out beyond the RAID partition, some inconsistencies were introduced to the drive. e2fsck was the way to fix this. The solution is as follows:

	1. Unmount all partitions
	2. Repair the partitions
	3. Resize the partitions

Unmounting the partitions in single-user mode is a matter of running:

	umount -a

I’m not really sure how this works, but it doesn’t matter what services are running or what happens to be in use – it just unmounts everything for you.
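Since e2fsck should not be run on a mounted filesystem, it may be worth confirming that the device really is gone from the mount table before moving on. A minimal sketch, assuming a Linux system with /proc/mounts; is_mounted is a hypothetical helper name.

```shell
# is_mounted is a hypothetical helper: it succeeds (exit status 0) if
# device $1 appears in the mount table $2, which defaults to /proc/mounts.
is_mounted() {
    awk -v dev="$1" '
        $1 == dev { found = 1 }          # first column is the device
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

# Usage, left commented out:
#   if is_mounted /dev/md3; then echo "/dev/md3 is still mounted"; fi
```

The second argument is only there so the check can be tried against a saved copy of a mount table.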

Once the partitions were unmounted, it was a simple matter of telling e2fsck to check for bad blocks when run on the offending partition. man e2fsck tells us the following:

       -c     This  option  causes  e2fsck to run the badblocks(8) program to
              find any blocks which are bad on the filesystem, and then marks
              them  as  bad  by  adding them to the bad block inode.  If this
              option is specified twice, then the bad block scan will be done
              using a non-destructive read-write test.

By running e2fsck -cc /dev/md3 we were able to do the repairs non-destructively. However, as expected, on our 53 gig /home partition, this badblocks scan took about 7 hours to run. The good news is that in that time, it did find errors, it did seem to fix them, and running e2fsck following that run seemed to return no other errors.

After the partitions were repaired, I ran resize2fs following the instructions in the above article. I first ran e2fsck again (but not in badblocks mode), just to make sure everything was clean, then I resized the partition and then I ran e2fsck again.

	e2fsck -f /dev/md3
	resize2fs /dev/md3
	e2fsck -f /dev/md3

This worked like a charm, and I did not get the “short read” error from earlier. I was not able to unmount the root partition, however, since the running system needed access to it, and I was not able to mount it read-only as was suggested in the article.
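If you want to rehearse the sequence before touching a disk, the three commands above can be wrapped in something like the following. This is only a sketch: repair_and_resize and the RUN variable are made up for illustration, and by default the function prints the commands instead of running them.

```shell
# repair_and_resize is a hypothetical wrapper around the three commands above.
# By default it only prints what it would run; set RUN=1 to execute for real.
repair_and_resize() {
    dev=$1
    for cmd in "e2fsck -f $dev" "resize2fs $dev" "e2fsck -f $dev"; do
        if [ "${RUN:-0}" = "1" ]; then
            # Any non-zero exit status stops the sequence. That includes
            # e2fsck's status 1 ("errors corrected"), so a filesystem that
            # needed fixing gets a second look before it is resized.
            $cmd || return 1
        else
            echo "would run: $cmd"
        fi
    done
}

# Usage, left commented out:
#   repair_and_resize /dev/md3          # dry run, prints the three commands
#   RUN=1 repair_and_resize /dev/md3    # really runs them (as root, unmounted)
```

Stopping on the first non-zero status is a deliberately cautious choice; rerunning the function after e2fsck has fixed things continues from a clean check.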

How to resize the root partition

Resizing the root partition turned out to be less of a pain than I might have expected, though it was by no means obvious when first thinking about the problem. The solution would be as follows:

	1. Copy all files from the root partition to another, empty partition (/tmp worked nicely)
	2. Reboot the server passing in the new, fake root partition to the boot loader
	3. Unmount all partitions (including the real root partition, which is not running)
	4. Repair and resize as above

Fortunately, /tmp had its own partition. I deleted the contents out of /tmp (which should be temporary anyway) and copied all of the files out of the root partition into this new, temporary root. Remember that you can copy /dev files, but should avoid /proc. The idea here is to copy all of the files out of /, excluding anything that is mounted from another partition. [Looking at the man page again, after the fact, -x would probably be exactly what’s needed here. -jcn]

	1. cp -ax / /tmp
	(can't actually remember the cp command, but this should work)
	2. Edit /tmp/etc/fstab to not mount the partition that /tmp resides on
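The fstab edit in step 2 can be done in any editor. For the record, here is one way it might be scripted. It is a sketch under assumptions: comment_out_mount is a hypothetical helper, and it assumes a standard fstab layout where the mount point is the second whitespace-separated field.

```shell
# comment_out_mount is a hypothetical helper: print fstab file $1 with any
# active entry whose mount point (second field) equals $2 commented out.
comment_out_mount() {
    awk -v mp="$2" '
        /^[ \t]*#/ { print; next }          # leave existing comments alone
        $2 == mp   { print "#" $0; next }   # disable the matching entry
        { print }                           # pass everything else through
    ' "$1"
}

# Usage, left commented out. Write to a new file first, then swap it in:
#   comment_out_mount /tmp/etc/fstab /tmp > /tmp/etc/fstab.new &&
#       mv /tmp/etc/fstab.new /tmp/etc/fstab
```

Commenting the line out rather than deleting it makes it easy to see afterwards what was changed in the temporary copy.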

Once that is done, it is simply a matter of rebooting. At the LILO prompt, tell the existing kernel to use the new partition (which is normally /tmp) as the root partition.

	LILO: kernel root=/dev/sd5 single

Once booted, I ran umount -a and proceeded as above.

Done!

This seems to have worked. resize2fs is, in fact, non-destructive and now when I run e2fsck, it just runs – it does not give me the error about a mismatched physical vs. filesystem partition.

Followup

Did this document help you? If so, I’d love it if you would let me know, and tell me if there is anything I left out or that was confusing. Thanks!

Filed under: Technology, with 17 Comments