published at 12:07am on 07/23/05
How did it start?
The first indication that there was something wrong with the server came on July 10, 2005, in the form of error messages reported to me by a command that I have running hourly to mail me system anomalies.
Jul 10 04:16:11 loco kernel: attempt to access beyond end of device
Jul 10 04:16:11 loco kernel: 09:03: rw=0, want=56050716, limit=56050688
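For what it’s worth, the numbers in that message line up. Assuming the offsets are in 1 KiB blocks (my reading of the message, not something I verified at the time), the filesystem was asking for a block just past the end of the md device:

```shell
# want/limit from the kernel message, assumed to be in 1 KiB blocks
want=56050716       # block the filesystem tried to access
limit=56050688      # last block the md device will allow
overshoot_kib=$((want - limit))
echo "the read was $overshoot_kib KiB past the end of the device"
```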
Every hour, at around the same time, these errors kept cropping up. I looked through all the crontabs and found one command running at that time: a bounced-mail queue processor that I run for one of my projects. Turning off the process stopped the errors, and I thought that perhaps we just had a couple of corrupted files. The next morning, though, the errors started cropping up again, one or two at a time.
Realizing that this could be a sign that the drives were eating themselves, I decided to head to the data center for a bit of one-on-one time with the server.
So what did we do?
The first thing I did was drop the system into single-user mode. We’re running ext3 filesystems on software RAID-1 across two 73 GB SCSI drives. I decided to try e2fsck on the partition that was giving me problems, but I kept running into the following error:
The filesystem size (according to the superblock) is xxx
The physical size of the device is xxx
Either the superblock or the partition table is likely to be corrupt!
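The two numbers e2fsck is comparing can be checked by hand: the block count from the superblock times the block size, against the actual size of the md device. A sketch with made-up values (I didn’t record the real dumpe2fs output, so these are invented to be consistent with the kernel messages above):

```shell
# Hypothetical values of the kind dumpe2fs -h would report;
# the device size is the "limit" from the kernel messages.
fs_blocks=14012679      # "Block count:" (invented)
fs_block_size=4096      # "Block size:" (invented)
dev_kib=56050688        # md device size in 1 KiB units
fs_kib=$((fs_blocks * fs_block_size / 1024))
if [ "$fs_kib" -gt "$dev_kib" ]; then
  echo "superblock claims $((fs_kib - dev_kib)) KiB more than the device has"
fi
```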
That was a bit puzzling, and I spent quite a while poring over it, finding absolutely nothing on Google that gave any indication of what might have been going on, until I found the following gem in an article about converting a running system into a RAID-1 system:
Step-11 – resize filesystem
When we created the raid device, the physical partition became slightly smaller because a second superblock is stored at the end of the partition. If you reboot the system now, the reboot will fail with an error indicating the superblock is corrupt.
http://howtos.linux.com/howtos/Software-RAID-HOWTO-7.shtml (source has moved)
It appears that when we originally set up the RAID, we never resized the partitions. For the past year or so, the system had been running along without any problems because it simply never wrote to that part of the disk. Eventually a couple of files must have spilled into the portion of the disk where the RAID superblock is stored; the RAID layer refused the writes and threw the errors I was seeing. However, resizing the partitions without repairing them first will throw the following error:
attempt to read block from filesystem resulted in short read while trying to resize
Obviously there was a problem with the drive that needed to be addressed.
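If I understand the old (0.90) md superblock format correctly, the array size is the partition size rounded down to a 64 KiB boundary, minus 64 KiB for the superblock itself – which is exactly why a filesystem built to the full partition size ends up a little larger than the md device. A sketch (the raw partition size here is hypothetical):

```shell
# My understanding of the 0.90 md superblock layout -- treat as a sketch:
# usable size = (partition size rounded down to 64 KiB) - 64 KiB
raw_kib=56050780                      # hypothetical raw partition size, in KiB
md_kib=$(( raw_kib / 64 * 64 - 64 ))
echo "md device size: $md_kib KiB"    # consistent with the kernel's limit
```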
Fixing the problem
The solution was actually quite straightforward, once I got all the steps in place. There were two time-consuming parts to this process: first, figuring out what was wrong, and second, waiting for the drive repairs to run. In the process of trying to write beyond the RAID partition, some inconsistencies had been introduced on the drive, and e2fsck was the way to fix them. The solution is as follows:
1. Unmount all partitions
2. Repair the partitions
3. Resize the partitions
Unmounting the partitions in single-user mode is a matter of running:

umount -a

I’m not really sure how this works, but it doesn’t matter what services are running or what happens to be in use – it just unmounts everything for you.
Once the partitions were unmounted, it was a simple matter of telling e2fsck to check for bad blocks when run on the offending partition. man e2fsck tells us the following:
-c This option causes e2fsck to run the badblocks(8) program to find any blocks which are bad on the filesystem, and then marks them as bad by adding them to the bad block inode. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.
By running e2fsck -cc /dev/md3 we were able to do the repairs non-destructively. As expected, though, on our 53 gig /home partition the badblocks scan took about 7 hours to run. The good news is that in that time it did find errors, it did seem to fix them, and a follow-up e2fsck run reported no further problems.
After the partitions were repaired, I ran resize2fs following the instructions in the above article. I first ran e2fsck again (but not in badblocks mode), just to make sure everything was clean, then I resized the partition and then I ran e2fsck again.
e2fsck -f /dev/md3
resize2fs /dev/md3
e2fsck -f /dev/md3
This worked like a charm, and I did not get the “short read” error from earlier. I was not able to unmount the root partition, however, since the running system needed access to it, and I was not able to mount it read-only as was suggested in the article.
How to resize the root partition
Resizing the root partition turned out to be less of a pain than I had expected, though the approach was by no means obvious when I first thought about the problem. The solution was as follows:
1. Copy all files from the root partition to another, empty partition (/tmp worked nicely)
2. Reboot the server passing in the new, fake root partition to the boot loader
3. Unmount all partitions (including the real root partition, which is not running)
4. Repair and resize as above
Fortunately, /tmp had its own partition. I deleted the contents out of /tmp (which should be temporary anyway) and copied all of the files out of the root partition into this new, temporary root. Remember that you can copy /dev files, but should avoid /proc. The idea here is to copy all of the files out of /, excluding anything that is mounted from another partition. [Looking at the man page again, after the fact, -x would probably be exactly what's needed here. -jcn]
1. cp -ax / /tmp (can't actually remember the cp command, but this should work)
2. Edit /tmp/etc/fstab to not mount the partition that /tmp resides on
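For step 2, something along these lines works (the device names and fstab contents here are invented for illustration; on the real system you’d be editing /tmp/etc/fstab, by hand or with sed):

```shell
# Comment out the /tmp mount in the copied fstab so the temporary root
# doesn't try to mount a filesystem over itself. Devices are made up.
fstab=$(mktemp)                        # stand-in for /tmp/etc/fstab
cat > "$fstab" <<'EOF'
/dev/md1   /      ext3   defaults   1 1
/dev/md5   /tmp   ext3   defaults   1 2
EOF
sed -i 's|^[^#].*[[:space:]]/tmp[[:space:]]|#&|' "$fstab"
cat "$fstab"
```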
Once that is done, it is simply a matter of rebooting. At the LILO prompt, tell the existing kernel to use the new partition (which is normally /tmp) as the root partition.
LILO: kernel root=/dev/sd5 single
Once booted, I ran umount -a and proceeded as above.
This seems to have worked. resize2fs is, in fact, non-destructive, and now when I run e2fsck it just runs – it no longer gives me the error about the filesystem and physical device sizes not matching.
Did this document help you? If so, I’d love if you would let me know, and let me know if there is anything I left out or was confusing. Thanks!
Filed under: Technology