View Full Version : Linux RAID, Which options are best?



thattommyhall
2006-11-12, 09:49
I started this topic because I have done some testing with Linux's mdraid subsystem and various filesystems; here are my views.

RAID5 or RAID10?
Assuming more than 2 drives these are basically your 2 choices.
RAID5:
http://www.acnc.com/04_01_05.html
+ only uses 1/n of total space for parity (n is number of drives)
+ good read speed
+ can add drives and "reshape" array
- (relatively) slow writes
RAID10
http://www.acnc.com/04_01_10.html
+ better write speed
- Uses 1/2 of total space for parity
For a 4-drive setup I recommend it for situations where write speed is important. For more than 4 drives I recommend a combination of RAID0, RAID10 and/or RAID5, depending on what fraction needs to have fast writes, or RAID50 if the whole lot needs fast writes (see http://www.acnc.com/04_01_50.html).
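
To put numbers on the space trade-off, here is a rough sketch of usable capacity for each level. The helper names and the 250GB drive size are made up for illustration, and the RAID10 helper assumes an even number of drives with two copies of every block.

```shell
#!/bin/sh
# Usable capacity, in GB, of n identical drives (hypothetical helper names).
raid5_usable() {   # $1 = number of drives, $2 = size of each drive in GB
    echo $(( ($1 - 1) * $2 ))      # one drive's worth of space is lost
}
raid10_usable() {  # assumes an even number of drives, two copies of each block
    echo $(( $1 * $2 / 2 ))        # half the raw space holds the second copy
}

echo "RAID5,  8 x 250GB: $(raid5_usable 8 250) GB usable"
echo "RAID10, 8 x 250GB: $(raid10_usable 8 250) GB usable"
```

The gap widens as drives are added: RAID5 always gives up one drive's worth of space, RAID10 always gives up half.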

Filesystem:
XFS - Tested journaling file-system with good support for big files.
ZFS - I really want this since I read http://www.sun.com/2004-0914/feature/ , particularly "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans."
ext3 - backwards compatible but dodgy on big files
reiser4 - I used reiserfs a few years ago and loved it. There is some debate about the author's benchmarking procedures, but it looks cool (I eschewed it as it's not in the standard kernel and I want rescue tools to work easily)

Tweaking mdraid:
The most important parameter is the chunk size. I use 512k, as that's what benched best for me.

mdadm --create /dev/md0 --chunk=512k -l 5 -n 8 /dev/sd[abcdefgh]5
(-n is the number of devices; the final argument is the set of partitions used for the RAID, each set to type FD in fdisk)

Tweaking xfs to match
Based on SGI's advice here http://oss.sgi.com/archives/xfs/2006-06/msg00100.html

mkfs.xfs -d su=512k,sw=7 /dev/md0
(sw=one less than number of drives)
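
As a sketch of where those two numbers come from: su is the md chunk size and sw is the number of data-bearing drives, which for RAID5 is one less than the drive count. The helper names below are hypothetical, and the script only prints the mkfs.xfs line rather than running it.

```shell
#!/bin/sh
# Derive mkfs.xfs stripe options from md RAID5 geometry (hypothetical helpers).
xfs_stripe_opts() {  # $1 = drives in the RAID5 array, $2 = md chunk size in KiB
    data_disks=$(( $1 - 1 ))           # RAID5 loses one drive's worth to parity
    echo "su=${2}k,sw=${data_disks}"
}
full_stripe_kib() {  # KiB of data held by one full stripe
    echo $(( ($1 - 1) * $2 ))
}

# Print the command for review instead of running it.
echo "mkfs.xfs -d $(xfs_stripe_opts 8 512) /dev/md0"
echo "full stripe: $(full_stripe_kib 8 512) KiB"
```

For the 8-drive, 512k-chunk array above this gives su=512k,sw=7, so a full stripe carries 3584 KiB of data.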


Got some pretty good benchmarks with bonnie++


Version 1.03      ------Sequential Output------ --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
NAS            2G 38997  84 81471  15 29529   8 33606  75 175495 27 321.4   1
NAS            2G 39392  85 65338  12 28627   8 35113  77 175635 27 322.2   1
NAS            2G 37908  82 58694  11 28801   8 34849  78 176082 27 317.3   1
NAS            2G 39275  85 59746  11 28959   8 35068  77 175534 27 332.1   1
NAS            2G 38264  83 84587  16 28633   8 33381  75 175577 28 327.8   1
                  ------Sequential Create------ --------Random Create--------
                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min      /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
NAS            16  2220  21 +++++ +++  1927  17  2009  19 +++++ +++  1554  15
NAS            16  2271  21 +++++ +++  1913  17  2190  22 +++++ +++  1574  16
NAS            16  2275  21 +++++ +++  1916  15  2189  22 +++++ +++  1589  15
NAS            16  2300  21 +++++ +++  1972  17  2297  22 +++++ +++  1175  10
NAS            16  2412  23 +++++ +++  2147  19  2336  23 +++++ +++  1708  18
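
Since the block-write figure moves around quite a bit between the five runs, averaging helps when comparing against someone else's numbers. A small sketch, where `mean` is a throwaway helper (not part of bonnie++) that truncates to a whole number:

```shell
#!/bin/sh
# Integer mean of the arguments (hypothetical helper, drops the fraction).
mean() {
    sum=0
    for v in "$@"; do
        sum=$(( sum + v ))
    done
    echo $(( sum / $# ))
}

# Block write and block read columns from the five runs, in K/sec.
echo "block write: $(mean 81471 65338 58694 59746 84587) K/sec"
echo "block read:  $(mean 175495 175635 176082 175534 175577) K/sec"
```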

nobspangle
2006-11-12, 11:19
It's important to note that XFS is very heavy on caching so it should really only be used with a UPS otherwise power cuts could cause data loss.

thattommyhall
2006-11-12, 11:24
Interesting, and a slight worry, as I can't stop folk doing a hard power-off of my NASes (though I have configured acpid to catch a press of the power button and go into a safe shutdown).
JFS then?



Havoc
2006-11-12, 12:05
Being able to reshape and add drives depends more on the controller than the type of raid used. And raid10 doesn't use parity, but an identical copy.

How important is write speed for you?

thattommyhall
2006-11-12, 12:31
I favor RAID5 as it's the most efficient.
Re: reshaping; I was talking about Linux's mdraid (software RAID), but some (mainly expensive) cards offer it too.
You are of course correct, I should have said
"RAID10 - Uses 1/2 of total space for REDUNDANCY"

Cheers, Tom




georgem
2006-11-13, 09:31
"Got some pretty good benchmarks with bonnie++"




Would you share the options for the bonnie++ command? I would like to compare my results to yours.

thattommyhall
2006-11-13, 09:56
I believe the only option I passed was for size of file (to avoid RAM caching)
Off the top of my head (check man page for exact syntax)


bonnie++ -s 2G -x 5 -r 1024 -m NAS

RAM was 1G at the time. bonnie++ may not accept 2G as a file size; it might need to be 2048.
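
The idea is just to make the test file twice the size of RAM so the page cache can't hide the disks. A sketch that builds the command line from the RAM size in MB; `bonnie_cmd` is a made-up helper, the size is given as a plain number of MB in case the 2G form isn't accepted, and the script only prints the command:

```shell
#!/bin/sh
# Print a bonnie++ command line with a test file twice the size of RAM.
bonnie_cmd() {  # $1 = RAM in MB, $2 = machine label, $3 = number of runs
    size_mb=$(( $1 * 2 ))              # file size: double the RAM
    echo "bonnie++ -s ${size_mb} -x $3 -r $1 -m $2"
}

bonnie_cmd 1024 NAS 5
```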

Havoc
2006-11-13, 12:20
"Re: reshaping; I was talking about linux's mdraid (software raid) but some (mainly expensive) cards offer it too."

Well, I was thinking about filesystem and wrote controller instead. Must have been the Laphroaig.

Anyway, I'm going to see if I can get bonnie working. And run it once on my raid5 hardware setup. Have been looking for a reason to do that, so why not.

Edit: got it running, but where is the output???

thattommyhall
2006-11-13, 13:33
"Well, I was thinking about filesystem and wrote controller instead. Must have been the Laphroaig."

"Anyway, I'm going to see if I can get bonnie working. And run it once on my raid5 hardware setup. Have been looking for a reason to do that, so why not."

"Edit: got it running, but where is the output???"

It outputs a csv file to standard out (usually the screen you type the command in). You can redirect it with

bonnie++ -s 2G -x 5 -r 1024 -m NAS > bonnie.csv
You can get text out with the following

bonnie++ -s 2G -x 5 -r 1024 -m NAS | bon_csv2txt > bonnie.txt
"|" is <shift>+\ on a UK keyboard

Get back soon, I'm very curious.

georgem
2006-11-13, 13:58
"It's important to note that XFS is very heavy on caching so it should really only be used with a UPS otherwise power cuts could cause data loss."

Once in a while there is a discussion at http://www.gossamer-threads.com/lists/mythtv/users/ regarding the optimal fs for a RAID and LVM combo. MythTV, being about video, has similar, even more demanding requirements than slimserver. Besides some technical issues under some distros, the common complaint is that you cannot shrink xfs, you can only grow it. You can do both in ext3.

Personally, I've been running xfs in an LVM configuration without RAID for mythtv for more than a year and have had no issues. File delete is instant. For slimserver I've chosen ext3 as the files are much smaller.

thattommyhall
2006-11-13, 14:30
I have heard LVM is quite a bit slower than just mdraid. I have never really got the point of LVM, to be honest. Anybody using EVMS?



georgem
2006-11-13, 15:23
"I have never really got the point of LVM to be honest."

I think that LVM complements RAID splendidly. Here is the scenario: I have one RAID 5 device called md0 with, let's say, 1TB. On this I have a few partitions for video, audio and perhaps a backup. Sooner or later 1TB is not sufficient, and the only way to add additional storage is via an add-on external box that uses some kind of high-speed connection (InfiniBand, eSATA?). I will add an additional 2TB (drives are cheaper by now...) as an md1 device. If I'm running LVM, I can add the new physical device to the existing logical device (LVM) that I already have and grow my partitions using the additional 2TB. You can do the same without RAID, of course. Exciting possibilities, especially if you consider that all this magic is free...
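
Roughly, the scenario above comes down to four commands. The sketch below only prints them so they can be reviewed first; the volume group `vg0`, the logical volume `music`, the mount point `/srv/music` and the helper name are all made up for the example, and the last step assumes an xfs filesystem (ext3 would use resize2fs on the logical volume instead).

```shell
#!/bin/sh
# Print (do not run) the steps to add a new md array to an existing volume group.
lvm_grow_plan() {  # $1 = new device, $2 = volume group, $3 = logical volume, $4 = mount point
    echo "pvcreate $1"                         # mark the new array as an LVM physical volume
    echo "vgextend $2 $1"                      # add it to the existing volume group
    echo "lvextend -l +100%FREE /dev/$2/$3"    # give all the new space to one logical volume
    echo "xfs_growfs $4"                       # grow the mounted xfs filesystem to match
}

lvm_grow_plan /dev/md1 vg0 music /srv/music
```

Printing rather than executing is deliberate: each step changes on-disk metadata, so it is worth reading the plan before pasting it into a root shell.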

Havoc
2006-11-14, 13:01
Have set up the slimserver with LVM2. Just with the idea that whatever the future, I can expand the music folder. Very useful since slim only recognises a single music folder.

Patrick Dixon
2006-11-15, 03:34
"Very useful since slim only recognises a single music folder."

Yes, but since you can have shortcuts to other folders in that one folder, it's a bit of a non-issue.

Havoc
2006-11-18, 08:05
Here are the bonnie++ numbers for the main disk and the RAID 5 hardware array. I had some problems before I could get them, PC related and not...

Main disk:


Version 1.93c     ------Sequential Output------ --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
daw            2G   186  98 40371  18 19889   6  1368  97 55714  11  1299  11
Latency               143ms    1206ms     253ms   70178us   40916us   85782us
                  ------Sequential Create------ --------Random Create--------
                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min      /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
daw            16 11597  65 +++++ +++ 15194  99 16862  96 +++++ +++ 14071  99
Latency              7625us    1572us    1710us     440us      19us     164us

Array:


Version 1.93c     ------Sequential Output------ --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
daw            2G   190  99 79063  37 43389  15   928  78 161557 31  5256  34
Latency             61791us     821ms    1067ms     612ms   48681us   60301us
                  ------Sequential Create------ --------Random Create--------
                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min      /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
daw            16  5000  29 +++++ +++ 14310  99 16343  99 +++++ +++ 13273 100
Latency              1091ms    1551us    1648us     439us      14us     700us


Find them rather disappointing really. Would have expected better. But I never did any optimising or so.

thattommyhall
2006-11-19, 10:57
Havoc, what sort of RAID card is it?

Havoc
2006-11-20, 12:35
Areca ARC1120 PCI-X. It's sitting in a 133MHz PCI slot. I'm running it under Gentoo and compiled the drivers in 64 bit. There are four 160GB Hitachis attached in a RAID 5 array. All the same drives, all sataII.

It isn't a problem, but I had expected more of it. Not that I have anything to compare to. So it is all a bit meaningless.

MrD
2007-02-08, 09:48
You doubled your write performance, tripled your read performance, and seeks increased by a factor of 4 and you're disappointed?

Is your RAID stripe size tuned to the filesystem stripe (assuming an xfs filesystem)?
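
For reference, a quick sketch of the ratios behind those claims, using the block write, block read and seek figures from the main-disk and array tables posted above. `ratio` is a throwaway helper built on awk, not anything from bonnie++.

```shell
#!/bin/sh
# Array figure divided by main-disk figure, to two decimal places.
ratio() {  # $1 = array value, $2 = main disk value
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "block write: $(ratio 79063 40371)x"
echo "block read:  $(ratio 161557 55714)x"
echo "seeks:       $(ratio 5256 1299)x"
```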

Havoc
2007-02-08, 12:44
"You doubled your write performance, tripled your read performance, and seeks increased by a factor of 4 and you're disappointed?"

I admit I can't make a meaningful comparison to other RAID setups, so maybe it isn't that bad. But relative to the cost it is a "minor" improvement, I feel. Most important is that a disk may fail, since I did have a disk like the main disk crash before.


"Is your raid stripe size tuned to the filesystem stripe (assuming xfs filesystem)."

It is Reiserfs3. And since I don't know what you are talking about I guess it isn't :)

thattommyhall
2007-02-09, 00:15
I think you may be thinking of software RAID provided by linux's mdraid subsystem.
The SGI mailing list outlines settings here (and they wrote xfs)
http://oss.sgi.com/archives/xfs/2006-06/msg00100.html
I find a chunk size of 512k in mdraid is nice for big files.
It's easier to specify su=???,sw=?? than to use sunit and swidth (as in the example I gave at the start).

For hardware RAID the performance gains are not as pronounced
http://oss.sgi.com/archives/xfs/2003-04/msg00137.html
is one discussion about it, but there have been more.

Possibly trivial point: For true hardware RAID it must have a processor onboard. For reliability gains you should have battery backed RAM on there too. I keep telling folk this as low end RAID cards are not true hardware, they are partial "RAID accelerators" at best - fakeraid (completely worthless) at worst. See
http://linux-ata.org/faq-sata-raid.html

Yours, Tom




tommypeters
2007-02-09, 01:13
"You doubled your write performance, tripled your read performance, and seeks increased by a factor of 4 and you're disappointed?"

"I admit I can't make a meaningful comparison to other RAID setups. So maybe it isn't that bad. But relative to the cost it is a "minor" improvement I feel. Most important is that a disk may fail. Since I did have a disk like the main disk crash before."

On the contrary, those stats seem "too good to be true"...
Read performance is of course helped by RAID-5, but write performance usually gets a tad slower. That's if you have data sets large enough to make the tests meaningful (so that any caches don't affect the result).

Havoc
2007-02-09, 12:42
The comparison is between the main system drive (Hitachi sataII 80GB) and a 4-disk raid5 (4x Hitachi sataII 160GB). The card has an accelerator and 128MB cache. I don't have the specs here but the main disk may be a generation older than the raid disks. Both use the same filesystem.

Anyway, this system doesn't feel as fast as my old setup with two 10k U160 disks (system and data).

MrD
2007-02-09, 14:39
Actually, my performance is a lot higher with this card (Supermicro H8DAE MB, Opteron 270, 2Gigs RAM, 2.6.19.2 kernel, xfs filesystem)

The Areca card is hardware RAID5 (RAID6 actually), it has an Intel IOP331 processor on it. I have this card too.

Not sure what motherboard you are using, but it is best to make sure the PCI-X bus is running at 133MHz (sometimes they run slower).

The Areca card allows one to set the RAID stripe size when the array is created.

I like the xfs filesystem because it allows very large files and can be grown while the filesystem is on-line.

For xfs filesystems:

"For a RAID device, the default stripe unit is 0, indicating that the feature is disabled. It is prudent of the sysadmin to configure the stripe unit and width sizes of RAID devices. This should be done to avoid unexpected performance anomalies caused by the filesystem doing non-optimal I/O operations to the RAID unit. For example, if a block write is not aligned on a RAID stripe unit boundary and is not a full stripe unit, the RAID will be forced to do a read/modify/write cycle to write the data. This can have a significant performance impact. By setting the stripe unit size properly, XFS will avoid unaligned accesses."

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/LX_XFS_AG/sgi_html/ch02.html
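
A very simplified sketch of the condition that passage describes: a write that neither starts on a stripe unit boundary nor covers whole stripe units forces the extra read. The helper name is made up and sizes are in KiB; a real RAID5 array also cares about the full stripe width, so treat this as an illustration of the arithmetic only.

```shell
#!/bin/sh
# Classify a write against the stripe unit; all sizes in KiB (hypothetical helper).
write_kind() {  # $1 = offset, $2 = length, $3 = stripe unit
    if [ $(( $1 % $3 )) -eq 0 ] && [ $(( $2 % $3 )) -eq 0 ]; then
        echo aligned                   # starts on a boundary, covers whole units
    else
        echo read-modify-write         # partial unit: old data must be read first
    fi
}

write_kind 1024 512 512    # starts at 1024k, 512k long, 512k stripe unit
write_kind 100 64 512      # starts mid-unit and is shorter than one unit
```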

egd
2007-02-09, 15:19
"I think that LVM complements RAID splendidly. Here is the scenario: I have one RAID 5 device called md0 with, let's say, 1TB. On this I have a few partitions for video, audio and perhaps a backup. Sooner or later 1TB is not sufficient, and the only way to add additional storage is via an add-on external box that uses some kind of high-speed connection (InfiniBand, eSATA?). I will add an additional 2TB (drives are cheaper by now...) as an md1 device. If I'm running LVM, I can add the new physical device to the existing logical device (LVM) that I already have and grow my partitions using the additional 2TB. You can do the same without RAID, of course. Exciting possibilities, especially if you consider that all this magic is free..."

I'm running two external RAID5 boxes connected to my PC via infiniband. Works pretty well but I can see myself getting to the point where combining using LVM could be pretty handy. One question though, if, for some reason, LVM is corrupted, what are your chances of recovering the data? Before someone launches into RAID not being a substitute for backups, I know this and use an LTO2 to backup to tape, however, I don't backup daily so I could lose a fortnight or so at any given point.

Havoc
2007-02-10, 09:53
"Actually, my performance is a lot higher with this card"

I would have expected that. Thanks for the confirmation. Only problem is I haven't any clue how to start troubleshooting it. (Yes, I know this means I shouldn't be running it...)

MrD
2007-02-11, 10:53
Array with Maxtor 250 drives

Version 1.03      ------Sequential Output------ --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pompei         5G 48553  83 69925  13 41061   8 51035  83 155904 13 207.5   0
                  ------Sequential Create------ --------Random Create--------
                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files              /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               16  4299  23 +++++ +++  5954  27  7379  40 +++++ +++  7874  40


Array with Western Digital / Samsung drives

Version 1.03      ------Sequential Output------ --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pompei         5G 55349  92 94432  16 42812   8 53227  86 170185 14 205.7   0
                  ------Sequential Create------ --------Random Create--------
                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files              /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               16  2456  12 +++++ +++  8430  37 11118  55 +++++ +++  7277  34


Given that these are static-file (FLAC, photos, DVD) media server arrays, I only care about read performance.

Both arrays have 4 drives

-Mrd

davis
2007-02-28, 08:52
"if, for some reason, LVM is corrupted, what are your chances of recovering the data?"

I've been using Linux LVM-2 on everything from laptops to systems with big storage systems for $some_number of years now, and LVM-based systems for longer than that.

I've come to the conclusion that you're far, far more likely to lose data due to hardware failures and human error than to software failure. If Linux LVM gets its knickers in a twist then there's always /etc/lvm/backup. I should point out that I've never, ever had to use it, and I've done some fairly eccentric stuff with LVM.[1]

Talking about the backup part, for home use I really like rsync to sync to either a local external USB/FireWire disk, or perhaps a remote site -- stick rsync in a nightly cronjob and you'll only ever be a day behind. For home use, IMHO, it stomps all over tape based storage (Disclaimer: I've never liked single-spool tape systems, and yes, I admit this is prejudiced :) ).
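
For what it's worth, the nightly job can be a single crontab line. A sketch that prints one; the paths, the 02:30 schedule and the helper name are placeholders, not a recommendation:

```shell
#!/bin/sh
# Print a crontab line for a nightly rsync mirror (hypothetical paths and schedule).
nightly_rsync_cron() {  # $1 = source directory, $2 = destination directory
    # -a keeps permissions and timestamps; --delete removes files that
    # have disappeared from the source, so the copy stays an exact mirror.
    echo "30 2 * * * rsync -a --delete $1/ $2/"
}

nightly_rsync_cron /srv/music /mnt/usbdisk/music
```

Bear in mind that --delete also mirrors accidental deletions at the next run, so leave it out if the backup should keep files that were removed from the source.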

My opinion: LVM is exceptionally stable, and it gives you all kinds of advantages.

1: I mean manually root around in the files. I have, however, used vgcfgrestore et al.