Slightly OT - RAID corruption



mctubster
2008-06-14, 18:08
Hi all,

Apologies if this is a bit off topic for some, but all of my music is on this device and, well, I have run out of ideas for diagnosing the fault .... I'm hoping someone here may have seen something like this in their travels.

I have a FC7 server, running 2.6.23.17-88.fc7 (just updated).

I have a md raid5 set, of four 300GB volumes.

The problem is that data is coming off this md device inconsistently.

For example, if I run md5sum on say 1000 files and then run it again on the same 1000 files, I will have 20 that are different. If I run it again, I will have 28 that are different, and generally those will be different files from the 20 that differed the first time. If one of the same files happens to differ in both the second and third md5sum runs, it will have three different checksum values.
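For anyone who wants to try the same comparison, this is roughly what I am doing (the path is just a placeholder for my music directory):

# take a sample of 1000 files and record their checksums
find /mnt/music -type f | head -n 1000 > filelist
xargs -d '\n' md5sum < filelist > run1.md5

# re-read the same files and compare against the recorded checksums;
# only files whose checksum has changed should show up
md5sum -c run1.md5 | grep -v ': OK$'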

From a purely abstract viewpoint, I am quite curious as to what is going on. From a personal viewpoint I am fairly worried, as I certainly don't have everything on this device backed up, and more to the point, I have no idea when this issue first appeared.

Diagnosis

I have run a lot of tests, and I am loath to use any of their repair / write options, as I would actually be corrupting what I believe to be consistent data on the array.

For example, running a consistency check on the array:


echo check > /sys/block/md1/md/sync_action
and then waiting for a couple of hours.

To check the progress:


cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdc1[0] sdf1[3] sde1[2] sdd1[1]
879100416 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
[=====>...............] check = 25.6% (75090008/293033472) finish=128.4min speed=28286K/sec
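To keep an eye on it without retyping the command, I just leave something like this running (the 60-second refresh is arbitrary):

watch -n 60 cat /proc/mdstat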

Once finished:


cat /sys/block/md1/md/mismatch_cnt

gives back around 10000 blocks with mismatches, and each time I run this I get a different count. For those who are interested, each block is 128KB (from what I have read), so it is a fair whack. This test compares the actual data on the array to the parity data.
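To put a rough figure on that, taking the 128KB-per-block reading at face value:

10000 blocks x 128 KB/block = 1,280,000 KB, i.e. roughly 1.2 GB that doesn't match its parity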

My theory

Under high load, or above some load threshold, errors are being introduced into the data coming off the array. I have noticed when doing checks with larger files that whenever there is a corruption, the reading process has stuttered while pulling data from the array.
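One way I'm thinking of testing this (the file name is just an example, and the dd is only there to generate extra read load from a second shell):

# checksum the same large file 20 times and count the distinct results;
# on a healthy array there should be exactly one
for i in $(seq 1 20); do md5sum /mnt/music/some_large_file.flac; done | awk '{print $1}' | sort | uniq -c

# then repeat while this runs in another terminal to add read load on the array
dd if=/dev/md1 of=/dev/null bs=1M count=4096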

The big question is why?! And what is causing it?

I'd appreciate anyone else's ideas / angles.

Cheers
Steve

syburgh
2008-06-15, 07:41
Don't have any specific advice, but you do have my sympathies... :(

Did you run the md sync_action check with the filesystem unmounted? What filesystem are you using? Is there a swap partition on this md?
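A quick way to check the last two, if it helps:

# shows any swap areas, including ones on the md device
cat /proc/swaps
# shows what filesystem (if any) is mounted from md1
mount | grep md1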

OT: 2008 may not be the year, but ZFS is supposed to make these sorts of consistency issues much less likely

sebp
2008-06-15, 09:20
OT: 2008 may not be the year, but ZFS is supposed to make these sorts of consistency issues much less likely
Still OT, but I would wait for Sun to make it the default filesystem on Solaris before even thinking of trying it (even on Solaris) ...