How I Fixed My Raid-1 Partition Size Error

published at 12:07am on 07/23/05

How did it start?

The first indication that something was wrong with the server came on July 10, 2005, in the form of error messages mailed to me by the command I run hourly to report system anomalies.

	Jul 10 04:16:11 loco kernel: attempt to access beyond end of device
	Jul 10 04:16:11 loco kernel: 09:03: rw=0, want=56050716, limit=56050688

Every hour, at around the same time, these errors cropped up. I looked through all the crontabs and found one command running at that time: a bounced mail queue processor that I run for one of my projects. Turning off the process stopped the errors, and I thought that perhaps we just had a couple of corrupted files. The next morning, the errors started cropping up again, one or two at a time.
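
The want and limit fields in that kernel message can be compared directly to see how far past the end of the device the request reached. The following is only a sketch: overshoot is a hypothetical helper name, and the result is in whatever units the kernel used for those two fields.

```shell
#!/bin/sh
# overshoot is a hypothetical helper (not part of any standard tool).
# It pulls want= and limit= out of a "beyond end of device" log line
# and prints the difference between them.
overshoot() {
    line=$1
    want=$(printf '%s\n' "$line" | sed -n 's/.*want=\([0-9][0-9]*\).*/\1/p')
    limit=$(printf '%s\n' "$line" | sed -n 's/.*limit=\([0-9][0-9]*\).*/\1/p')
    # Give up if either field is missing from the line.
    if [ -z "$want" ] || [ -z "$limit" ]; then
        return 1
    fi
    echo $((want - limit))
}

# The log line from this post; prints 28.
overshoot 'Jul 10 04:16:11 loco kernel: 09:03: rw=0, want=56050716, limit=56050688'
```

A small, constant overshoot like this one is a hint that the request is just barely past the end of the device rather than wildly out of range.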

Realizing that this could be a sign that the drives were eating themselves, I decided to head to the data center for a bit of one-on-one time with the server.

So what did we do?

The first thing I did was drop the system into single-user mode. We’re running ext3 filesystems on software RAID-1 on two 73 GB SCSI drives. I decided that I would try e2fsck on the partition that was giving me problems, but I kept running into the following error:

	The filesystem size (according to the superblock) is xxx 
	The physical size of the device is xxx
	Either the superblock or the partition table is likely to be corrupt!

OK, so that’s a bit puzzling. I spent a while longer puzzling over it and found absolutely nothing in Google that gave any indication of what might be going on, until I found the following gem in an article about converting a running system into a RAID-1 system:

Step-11 – resize filesystem
When we created the raid device, the physical partion became slightly smaller because a second superblock is stored at the end of the partition. If you reboot the system now, the reboot will fail with an error indicating the superblock is corrupt.

https://raid.wiki.kernel.org/index.php/Tweaking,_tuning_and_troubleshooting#Step-11_-_resize_filesystem
~~http://howtos.linux.com/howtos/Software-RAID-HOWTO-7.shtml#ss7.6~~ (source has moved)

Eureka!

It appears that when we originally set up the RAID, we never resized the filesystems to fit the slightly smaller RAID devices. For the past year or so, the system has been running along without any problems because it just never wrote to that part of the disk. A couple of files must have made it out to the portion of the disk where the RAID superblock is stored, and the RAID system wouldn’t let the filesystem write there and threw the errors that I saw. However, resizing the partitions without repairing them first will throw the following error:

	attempt to read block from filesystem resulted in short read while trying to resize

Obviously there was a problem with the drive that needed to be addressed.
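
The size mismatch behind all of this can also be checked by hand, by comparing the block count recorded in the filesystem superblock with the size of the RAID device. This is a rough sketch, not the procedure I actually followed: fs_overrun is a hypothetical helper, and the commented commands assume dumpe2fs and blockdev are available and run as root.

```shell
#!/bin/sh
# fs_overrun is a hypothetical helper. Given the filesystem block count,
# the filesystem block size in bytes, and the device size in bytes, it
# prints how many filesystem blocks lie beyond the end of the device
# (0 when the filesystem fits).
fs_overrun() {
    fs_blocks=$1
    block_size=$2
    dev_bytes=$3
    if [ "$block_size" -le 0 ]; then
        return 1
    fi
    dev_blocks=$((dev_bytes / block_size))
    if [ "$fs_blocks" -gt "$dev_blocks" ]; then
        echo $((fs_blocks - dev_blocks))
    else
        echo 0
    fi
}

# On a real system the inputs could be gathered like this
# (assumption: /dev/md3 is the array in question):
#   fs_blocks=$(dumpe2fs -h /dev/md3 2>/dev/null | awk -F: '/^Block count/ {gsub(/ /, "", $2); print $2}')
#   block_size=$(dumpe2fs -h /dev/md3 2>/dev/null | awk -F: '/^Block size/ {gsub(/ /, "", $2); print $2}')
#   dev_bytes=$(blockdev --getsize64 /dev/md3)
#   fs_overrun "$fs_blocks" "$block_size" "$dev_bytes"

# Made-up numbers for illustration: a 1000-block filesystem on a
# device that only holds 990 blocks of 4096 bytes. Prints 10.
fs_overrun 1000 4096 4055040
```

Anything other than 0 means the filesystem thinks it is bigger than the device underneath it.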

Fixing the problem

The solution was actually quite straightforward once I got all the steps in place. There were two time-consuming parts to this process: first, I had to figure out what was wrong, and second, I needed to wait for the drive repair to finish. In the process of trying to write out beyond the RAID partition, some inconsistencies were introduced to the drive. e2fsck was the way to fix this. The solution is as follows:

	1. Unmount all partitions
	2. Repair the partitions
	3. Resize the partitions

Unmounting the partitions in single-user mode is a matter of running:

	umount -a

In single-user mode almost nothing is holding files open, so this unmounts every filesystem it can in one go. Anything that is still in use (the root partition, for example) is reported as busy and left mounted.

Once the partitions were unmounted, it was a simple matter of telling e2fsck to check for bad blocks when run on the offending partition. man e2fsck tells us the following:

       -c     This  option  causes  e2fsck to run the badblocks(8) program to
              find any blocks which are bad on the filesystem, and then marks
              them  as  bad  by  adding them to the bad block inode.  If this
              option is specified twice, then the bad block scan will be done
              using a non-destructive read-write test.

By running e2fsck -cc /dev/md3 we were able to do the repairs non-destructively. However, as expected, on our 53 gig /home partition, this badblocks scan took about 7 hours to run. The good news is that in that time, it did find errors, it did seem to fix them, and running e2fsck following that run seemed to return no other errors.

After the partitions were repaired, I ran resize2fs following the instructions in the above article. I first ran e2fsck again (but not in badblocks mode), just to make sure everything was clean, then I resized the partition and then I ran e2fsck again.

	e2fsck -f /dev/md3
	resize2fs /dev/md3
	e2fsck -f /dev/md3

This worked like a charm, and I did not get the “short read” error from earlier. I was not able to unmount the root partition, however, since the running system needed access to it, and I was not able to mount it read-only as was suggested in the article.
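
For the non-root partitions, the check, resize, check sequence above can be wrapped so that the resize is skipped when the first check leaves errors behind. This is a sketch only: repair_and_resize is a hypothetical function name, and it relies on the e2fsck exit codes documented in its man page (0 for no errors, 1 or 2 for errors corrected, 4 and above for uncorrected errors or failures).

```shell
#!/bin/sh
# repair_and_resize is a hypothetical wrapper around the three commands
# from this post. The device must be unmounted before calling it.
repair_and_resize() {
    dev=$1

    # First forced check. Exit codes below 4 mean clean or corrected.
    rc=0
    e2fsck -f "$dev" || rc=$?
    if [ "$rc" -ge 4 ]; then
        echo "e2fsck left errors on $dev (exit $rc); not resizing" >&2
        return 1
    fi

    # With no size argument, resize2fs fits the filesystem to the device.
    resize2fs "$dev" || return 1

    # Second forced check to confirm the resized filesystem is sane.
    rc=0
    e2fsck -f "$dev" || rc=$?
    [ "$rc" -lt 4 ]
}

# Usage (as root, with the partition unmounted):
#   repair_and_resize /dev/md3
```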

How to resize the root partition

Resizing the root partition turned out to be less of a pain than I might have expected, though it was by no means obvious when first thinking about the problem. The solution would be as follows:

	1. Copy all files from the root partition to another, empty partition (/tmp worked nicely)
	2. Reboot the server passing in the new, fake root partition to the boot loader
	3. Unmount all partitions (including the real root partition, which is not running)
	4. Repair and resize as above

Fortunately, /tmp had its own partition. I deleted the contents out of /tmp (which should be temporary anyway) and copied all of the files out of the root partition into this new, temporary root. Remember that you can copy /dev files, but should avoid /proc. The idea here is to copy all of the files out of /, excluding anything that is mounted from another partition. [Looking at the man page again, after the fact, -x would probably be exactly what’s needed here. -jcn]

	1. cp -ax / /tmp
	(can't actually remember the cp command, but this should work)
	2. Edit /tmp/etc/fstab to not mount the partition that /tmp resides on
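
The fstab edit in step 2 can be done by hand in any editor. As an illustration only, here is a sketch that comments out the entry for a given mount point; comment_out_mount is a hypothetical helper, and it assumes the usual whitespace-separated fstab layout with the mount point in the second field.

```shell
#!/bin/sh
# comment_out_mount is a hypothetical helper. It prints the given fstab
# with every active entry for the given mount point commented out.
# The original file is not modified.
comment_out_mount() {
    fstab=$1
    mount_point=$2
    awk -v mp="$mount_point" '
        # Active (uncommented) entry whose second field is the mount point.
        $1 !~ /^#/ && $2 == mp { print "#" $0; next }
        { print }
    ' "$fstab"
}

# Usage, writing to a new file so the result can be reviewed first:
#   comment_out_mount /tmp/etc/fstab /tmp > /tmp/etc/fstab.new
#   mv /tmp/etc/fstab.new /tmp/etc/fstab
```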

Once that is done, it is simply a matter of rebooting. At the LILO prompt, tell the existing kernel to use the new partition (which is normally /tmp) as the root partition.

	LILO: kernel root=/dev/sd5 single

Once booted, I ran umount -a and proceeded as above.

Done!

This seems to have worked. resize2fs is, in fact, non-destructive and now when I run e2fsck, it just runs – it does not give me the error about a mismatched physical vs. filesystem partition.

Followup

Did this document help you? If so, I’d love it if you would let me know, and tell me if there is anything I left out or anything that was confusing. Thanks!

Filed under: Technology

At 6:28 pm on 09.29.07, Jeremy Truax said,

Hey there

Thanks a ton for this. It helped me when I ran into a bind with the same type of errors.

~Jeremy

At 10:20 pm on 11.18.07, Jeremy Bongio said,

haha, yes, a different Jeremy but same problem. Thanks a lot!!

At 6:18 pm on 12.03.07, Yannis Tsop said,

Thank you, you saved me a lot of time!

At 12:10 pm on 12.23.07, Thomas said,

ha! had the exact same problem! the resize2fs hint was exactly what I needed!

Cheers

Thomas

At 9:02 pm on 01.01.08, SanjiLYH said,

Yo~ It is a pretty old post.

But anyway, it saved me too! So thank you!

At 10:56 am on 01.13.08, vladimir_v said,

nice one, you’re a day saver 😉

At 12:31 am on 03.05.08, alan said,

Thanks… this worked fine. In the end, after a lot of messing around, it was the e2fsck -cc /dev/md0 that was required to clear the errors that were giving me the ‘short read’ problem.

Thanks!

At 12:03 am on 03.30.08, dombessi said,

This guide helped me fix a 2G eeepc issue after removing UnionFS and getting a superblock / partition table error.
I ran
e2fsck -cc /dev/hdc1
and
resize2fs /dev/hdc1
while booted into gparted live usb version gparted-liveusb-0.2.5-3
the newer gparted(gparted-livecd-0.3.4-11) had a kernel panic when booting on my asus 2g surf eeepc.
thanks.

At 10:49 am on 05.05.09, Paul said,

Thanks for the guide, I had discovered the same problem but had not figured out marking bad blocks was the way to go about fixing it. Since all the “bad blocks” will be at the end of the partition you can considerably speed up the bad block scan by doing the following.
1) badblocks -b 4096 -n -o badfile /dev/md3 end-block start-block
Creates a list of badblocks and stores in file badfile.
2) e2fsck -l badfile -f /dev/md3
Uses badfile as source for new badblocks, does not require -cc

In badblocks -b is the block size and was 4096 for my case. You’ll also have to find out end-block and start-block needed for your drive, they show up when you run e2fsck and it reports

The filesystem size (according to the superblock) is xxx blocks
The physical size of the device is xxx blocks

I started 10 blocks before the size of physical device. And ended at filesystem size.
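
Paul's range calculation could be sketched as follows. This is an illustration, not something from the original comment: badblocks_tail_cmd is a hypothetical helper that only prints the badblocks command line (last block first, then first block, as badblocks expects) so it can be reviewed before running, and the sizes in the example call are made up.

```shell
#!/bin/sh
# badblocks_tail_cmd is a hypothetical helper. It prints a badblocks
# command that scans only the tail of the filesystem: from a few blocks
# before the physical end of the device up to the filesystem size.
#   $1 device, $2 filesystem size in blocks, $3 device size in blocks,
#   $4 block size (default 4096), $5 margin in blocks (default 10)
badblocks_tail_cmd() {
    dev=$1
    fs_blocks=$2
    dev_blocks=$3
    block_size=${4:-4096}
    margin=${5:-10}
    first=$((dev_blocks - margin))
    if [ "$first" -lt 0 ]; then
        first=0
    fi
    echo "badblocks -b $block_size -n -o badfile $dev $fs_blocks $first"
}

# Made-up sizes: e2fsck reports a 1000000-block filesystem on a
# 999990-block device.
badblocks_tail_cmd /dev/md3 1000000 999990
# prints: badblocks -b 4096 -n -o badfile /dev/md3 1000000 999980
```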

At 7:44 am on 05.08.09, georges said,

thank you very much. saved me a lot of time! 😉

At 10:08 pm on 05.30.09, Ryan said,

I know this is an old post now, but it saved my life! I was really struggling and had no idea what was going on. This post clarified my raid problem and solved… all of my problems. Thank you.

At 9:57 am on 08.19.09, turerkan said,

old old article but still goes on to save lives.. this time it happened to be me:)

At 2:54 pm on 01.15.10, Tyler said,

Thanks for this write-up!

I *stupidly* ran the gparted check-disk function from an ubuntu boot disk and it left my system unbootable. Somehow it threw the superblock info out of sync with the actual size of the disk(!). So, the problem was a bit different but reading this page helped me figure out what happened and the fix is essentially the same.

At 10:27 pm on 03.03.10, Otavio said,

The same as above, except for the stupidly part 🙂

I owe you some beers!

At 4:48 pm on 08.21.11, hutch said,

The e2fsck -cc scan took the whole weekend, but my Fedora is finally booting up without errors. Thanks!

At 9:24 pm on 08.24.11, Oliver said,

I’ve got a LUKS encrypted filesystem (that’s why it’s in /dev/mapper) which is a 1TB data backup that wouldn’t mount on boot after repartitioning the original data partition.
I was a LITTLE bit terrified after I found out this:

# e2fsck /dev/mapper/wd1tb
e2fsck 1.42-WIP (02-Jul-2011)
The filesystem size (according to the superblock) is 244190269 blocks
The physical size of the device is 244189236 blocks
Either the superblock or the partition table is likely to be corrupt!

I skip some hours of hope and failure and come to:
# badblocks -b 4096 -n -o ./badfile /dev/mapper/wd1tb 244190269 244189226

with the generated badfile I did:
# e2fsck -l badfile /dev/mapper/wd1tb
Several errors were reported and got fixed but I had to run e2fsck twice before resizing the partition:

# resize2fs /dev/mapper/wd1tb
resize2fs 1.42-WIP (02-Jul-2011)
Resizing the filesystem on /dev/mapper/wd1tb to 244189236 (4k) blocks.
The filesystem on /dev/mapper/wd1tb is now 244189236 blocks long.

GREAT!

# e2fsck /dev/mapper/wd1tb
e2fsck 1.42-WIP (02-Jul-2011)
WD10EVCS: clean, 178496/61054976 files, 231945548/244189236 blocks

And thats it. Disk repaired!

This old post probably saved me 1TB of data and the comments saved me estimated 34 hours of waiting.

Thank You!

At 7:25 pm on 09.16.11, Toby Bartels said,

Yea, helped me too!
