*
* 2006-07-27 -- 2006-07-28
* Nathaniel Taylor.
* Tests of disk and array speed on the new ETS fileserver.
*

PROBLEM: a `hardware' RAID controller supplied as part of the new server
seems to have very low performance. Several tests have been performed to
establish where the problem lies.

EXPECTATION: given that a single 10000 rpm U320 SCSI disk can be read at
about 65-85 MBps, several disks in RAID5 should be readable in the same
test at a considerably higher speed (one might hope for about double
this). Such speeds have been reported by someone using an LSI card,
although at least four disks were needed in order to get the good speeds:
about 110 MBps with four disks, 130 MBps with five.

LIMITATION: the speed quoted throughout this report is the buffered disk
read speed given by the hdparm command with the options -tT. This may not
represent well, for example, a series of small accesses to different
parts of the disk, or the speed of writing. But since reading non-cached
files quickly is expected to be an important part of this fileserver's
duties, a large difference in the measured speed is taken as significant.
Is there some other aspect, e.g. how the command flushes caches, that
would make comparisons using hdparm very unfair? (A sketch of the
measurement loop is given under Test 2, below.)

HARDWARE: the system is an H8DA8 motherboard with a Supermicro AOC-LPZCR2
"All in one zero channel RAID" card mounted in a (special, green) PCI-X
66 MHz 64-bit slot and coupled via the motherboard's SCSI controller to
six U320 SCSI hot-swappable disks. For the later tests the RAID card was
removed. The system was supplied, ready built, by "southpole.se".

DISK ID: the physical disks are referred to here as 0, 1, ... 5, which
are their SCSI IDs. The first one, 0, is 73 GB, 10 krpm, and is used
directly (single-disk volume) for the system, so it is not directly
involved in these tests. The next three, 1,2,3, are all Maxtor Atlas
10K5147SCA, 10 krpm, reported by the RAID card after initialisation as
136.9 GB. The next one, 4, is a Maxtor Atlas 10K4147SCA, bought a while
before the previous three, also 10 krpm, reported as 136.6 GB. The last,
5, is a Seagate ST3146707LC, 10 krpm, also reported as 136.6 GB.

------------------------------------------------
Test 1: 5 whole disks, RAID5 through controller
------------------------------------------------

Initially all 5 non-system disks were in a 540 GB RAID5 configuration
through the controller card, making them appear as a single SCSI disk,
but even after a full build/verify the read speed achieved (hdparm) was
only about 75 to 80 MBps, rather lower than the 80 to 85 MBps of the
single system disk.

Then setting the PCI jumper to PCI-X instead of Auto was tried, in case
the bus had been running as plain PCI. This made no difference. No jumper
or BIOS setting was found for forcing the bus frequency to its maximum of
66 MHz. The SCSI setup seemed to report all the devices as using U320.

------------------------------------------------
Test 2: many small arrays through controller
------------------------------------------------

To narrow the field of search a little, the RAID5 array was removed and
the five disks 1,2,3,4,5 were configured to provide several small virtual
disks through the controller. From these it was possible to try the three
identical disks in RAID5, various numbers of disks in mirror or stripe,
and access to individual disks, as well as linux md (software) RAID.
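All the speeds in this report were measured in the same way, as described
under LIMITATION above. A minimal sketch of the measurement loop run
against each array or disk is shown here; the exact script was not
recorded, so the loop itself is an assumption (/dev/sdX stands for
whichever device was under test).

    # five successive hdparm runs on the device under test; each run
    # prints a cached and a buffered timing, and only the buffered lines
    # ("Timing buffered disk reads") are quoted in the raw results
    for i in 1 2 3 4 5; do
        hdparm -tT /dev/sdX
    done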
The total space used was quite a small part (~20%) of the available disk
capacity, which one may hope keeps the variation of speed with disk
radius between the different arrays to not too many percent -- this
variation really ought to be checked in order to make more use of these
figures!

The configuration and results were as in the following table. The result
is the speed reported by hdparm for five 3 s buffered reads, of which the
first is reported separately. In some cases there was a notable
alternation between high and low speeds on successive tests of a
particular array! -- in these cases low-high has been written instead of
a single figure. The raw results, in chronological order of the tests,
are given below the table.

--------------------------------------------------------------------------------
reported name    physical disks  type of array   size of   initial   subsequent
of array (as     used, by        raid0 - stripe  array     speed     speed
shown by card    SCSI ID         raid1 - mirror  (GB)      (MBps)    (MBps)
to system)
--------------------------------------------------------------------------------
sdb              1,2,3           raid0           10        112       116
sdc              1,2,3           raid5           10         53       52-78
sdd              1               vol              5         51       100
sde              2               vol              5         51        99
sdf              3               vol              5         51       100
sdg              4,5             raid0           10         82        94
sdh              4,5             raid1           10         51       51-79
sdi              4               vol              5         52       52-79
sdj              5               vol              5         71        71
sdk              1,2,3,4,5       raid5            1         75        75
sdl              1,2,3,4,5       raid0            1         97       107
--------------------------------------------------------------------------------

Significant points:

* the RAID5 speed with either all the disks or just the three matched
  ones is still low, with a maximum about the same as was noticed before
  when all space on all disks formed the array -- about 75 MBps; this
  does not really say anything definite, as it could be that more than
  the minimum three disks are needed to get a fast RAID5, but that adding
  disks that aren't exactly matched prevents the advantage from being
  seen

* the speed at which the controller manages to deliver data to the
  computer is higher than this, as seen in the striped case (sdb,
  116 MBps) -- so it is not a controller transport problem

* the peculiar result of alternating speed occurred on arrays with no
  member disks in common, so it cannot easily be blamed on a particular
  device; the alternation was confirmed (on sdc) to happen consistently
  over ten tests

-----------------------------------------------------------
Test 3: software (linux md) RAID on the volumes of Test 2
-----------------------------------------------------------

Software RAID (RedHat Enterprise 4, kernel 2.6.9-34.ELsmp, mdadm v1.6.0,
4 June 2004) was then tried on the single-disk-partition volumes
presented by the RAID controller as shown in Test 2 (/dev/sd{d,e,f,i,j}).
The creation commands are quoted in the raw results; a sketch of the
sequence is given at the end of this section.

Taking just the volumes on the three identical disks, 1,2,3, a RAID5
array was created, which gave very varied results: the speed was measured
at from 29 to 98 MBps, with a mean around 60 MBps. Then, after removing
that array and taking volumes on all five disks, 1,2,3,4,5, to make a new
RAID5 array, the speed of this array also came out at about 60 MBps, but
with much less variation -- only a few MBps.

Significant points:

* it seems that using the controller's hardware RAID was rather quicker
  than doing the RAID in the OS, acting upon volumes presented by the
  controller

* it is therefore not just a matter of a processing bottleneck for the
  RAID5 algorithm on the controller

* but does the slow speed of this test reflect the SCSI bus, or a bad
  linux md algorithm, or slowness caused by the abstraction of the
  `volumes' by the controller? -- the next test answers this!
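Sketch of the sequence used in Test 3, for reference. The two mdadm
--create lines are quoted verbatim in the raw results; the --stop step
and the direct hdparm calls are assumed reconstructions of details that
were not recorded.

    # RAID5 across the three controller-presented single-disk volumes
    mdadm --create /dev/md0 -l raid5 -n 3 /dev/sd[def]
    hdparm -tT /dev/md0

    # remove that array and rebuild it across all five volumes
    mdadm --stop /dev/md0
    mdadm --create /dev/md0 -n 5 -l 5 /dev/sd[defij]
    hdparm -tT /dev/md0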
-----------------------------------------------------------
Test 4: software (linux md) RAID without RAID card present
-----------------------------------------------------------

The RAID controller card was removed, to give plain SCSI disk access.
This of course meant that the already installed RedHat OS could not be
used, as the physical disks contained the RAID controller's own format.
A Gentoo livecd was therefore used, with kernel 2.6.12-gentoo-r10 #1 SMP
and mdadm v1.11.0, 11 April 2005.

The five disks 1,2,3,4,5 were cleansed of the RAID controller's format
(dd if=/dev/zero of=/dev/sd$n count=1000 bs=1024) and partitioned with a
10000 MB primary partition at the start of each disk. Before each array's
creation and testing, old arrays were stopped and the kernel modules for
raid[015] and md were removed and reinserted. Between creation and
testing of raid1 or raid5 arrays, the syncing was allowed to finish.
mdadm defaults were used for all parameters other than those on the
command line: the RAID5 chunk size is therefore 64K. (A sketch of the
per-array procedure is given just before the raw results, below.)

First, each disk (/dev/sd{b,c,d,e,f}, now physical disks) was tested
alone, giving an interesting difference from the tests on single-disk
volumes through the RAID controller: the three similar disks were seen to
be faster than the others, and the measurements on each disk were very
consistent.

    Disks 1,2,3:  84 MBps   (new Maxtor)
    Disk 4:       67 MBps   (older Maxtor)
    Disk 5:       75 MBps   (Seagate)

Then just the fast, similar three, 1,2,3, were put in RAID0 (stripe),
RAID1 (mirror) and RAID5. Again, the results were considerably more
consistent than with the RAID controller, though not as consistent as for
the individual disks.

    RAID0: 1,2,3:  160 MBps
    RAID1: 1,2,3:   84 MBps
    RAID5: 1,2,3:  160 MBps

Then all five disks were put in the same RAID configurations as above.
The results were less consistent than with the similar three disks. The
RAID0 became faster, while the RAID5 became a bit slower.

    RAID0: 1,2,3,4,5:  176 MBps
    RAID1: 1,2,3,4,5:   77 MBps
    RAID5: 1,2,3,4,5:  153 MBps

Significant points:

* the SCSI bus has no difficulty in reaching the sorts of speeds that had
  been expected

* access to a single disk (volume) is faster without the RAID controller
  present, and linux md RAID is much faster without the RAID controller

* linux md RAID5 gives about twice the (large, buffered) read speed of
  the quite expensive RAID controller card, without using much CPU --
  around 10% of one CPU was taken by [md0_raid5] during the five-disk
  test

* it might, perhaps, be the case that this controller would perform
  better with more than three exactly _similar_ disks, which we don't
  have available to try; but this isn't very helpful, as we may sometime
  wish to change or add a disk without having to find a particular model
  of a particular manufacturer in order to avoid halving our performance!

Further, unkind point: using linux md RAID, we can use the superbly
simple mdadm command and the monitoring tools that come with the system
-- no extra software is needed; using the card, we have to install some
java-based GUI configuration and monitoring program that didn't seem to
want to start and that put a nice display of error messages down the
console. Id est: why bother paying a lot of money for a badly performing
card with bad software? This is not just a rhetorical question: I can
well imagine there might be some situations where the card's performance
comes closer to linux md's -- I am just interested to hear of any test
for which this happens, or of stories of unreliability of linux md RAID.
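Sketch of the per-array procedure in Test 4, pieced together from the
description above and the commands quoted in the raw results. The dd and
mdadm --create lines are as used; the module names, the partitioning
comment and the use of watch to follow the resync are assumptions about
details that were not recorded.

    # wipe the RAID card's format from the start of each disk
    dd if=/dev/zero of=/dev/sd$n count=1000 bs=1024   # for each n in b..f
    # (each disk was then given a single 10000 MB primary partition)

    # before each new array: stop any old one and reload the md modules
    mdadm --stop /dev/md0
    rmmod raid0 raid1 raid5 md-mod
    modprobe raid5          # or raid0 / raid1, as appropriate

    # create the array with default parameters (64K chunks for RAID5),
    # let the initial sync finish, then time it as in the other tests
    mdadm --create /dev/md0 --auto=yes -l 5 -n 3 /dev/sd[bcd]1
    watch cat /proc/mdstat
    hdparm -tT /dev/md0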
--------------------------------------------------------------------------
RAW RESULTS
--------------------------------------------------------------------------

The raw results: test 2 (comparing arrays made by the hardware raid card)
--------------------------------------------------------------------------

/dev/sda
 Timing buffered disk reads:  252 MB in  3.01 seconds =  83.68 MB/sec
 Timing buffered disk reads:  296 MB in  3.00 seconds =  98.52 MB/sec
 Timing buffered disk reads:  298 MB in  3.02 seconds =  98.56 MB/sec
 Timing buffered disk reads:  290 MB in  3.01 seconds =  96.49 MB/sec
 Timing buffered disk reads:  304 MB in  3.01 seconds = 100.88 MB/sec

/dev/sdb
 Timing buffered disk reads:  336 MB in  3.01 seconds = 111.57 MB/sec
 Timing buffered disk reads:  348 MB in  3.01 seconds = 115.48 MB/sec
 Timing buffered disk reads:  350 MB in  3.01 seconds = 116.41 MB/sec
 Timing buffered disk reads:  356 MB in  3.01 seconds = 118.17 MB/sec
 Timing buffered disk reads:  354 MB in  3.00 seconds = 117.82 MB/sec

/dev/sdc
 Timing buffered disk reads:  160 MB in  3.01 seconds =  53.09 MB/sec
 Timing buffered disk reads:  238 MB in  3.02 seconds =  78.82 MB/sec
 Timing buffered disk reads:  158 MB in  3.02 seconds =  52.36 MB/sec
 Timing buffered disk reads:  234 MB in  3.00 seconds =  77.93 MB/sec
 Timing buffered disk reads:  158 MB in  3.02 seconds =  52.27 MB/sec

/dev/sdd
 Timing buffered disk reads:  154 MB in  3.02 seconds =  50.92 MB/sec
 Timing buffered disk reads:  294 MB in  3.00 seconds =  97.98 MB/sec
 Timing buffered disk reads:  306 MB in  3.00 seconds = 101.85 MB/sec
 Timing buffered disk reads:  302 MB in  3.00 seconds = 100.58 MB/sec
 Timing buffered disk reads:  298 MB in  3.01 seconds =  99.12 MB/sec

/dev/sde
 Timing buffered disk reads:  154 MB in  3.02 seconds =  51.03 MB/sec
 Timing buffered disk reads:  294 MB in  3.02 seconds =  97.49 MB/sec
 Timing buffered disk reads:  304 MB in  3.02 seconds = 100.58 MB/sec
 Timing buffered disk reads:  300 MB in  3.01 seconds =  99.68 MB/sec
 Timing buffered disk reads:  298 MB in  3.02 seconds =  98.59 MB/sec

/dev/sdf
 Timing buffered disk reads:  154 MB in  3.01 seconds =  51.12 MB/sec
 Timing buffered disk reads:  306 MB in  3.01 seconds = 101.51 MB/sec
 Timing buffered disk reads:  302 MB in  3.00 seconds = 100.65 MB/sec
 Timing buffered disk reads:  300 MB in  3.02 seconds =  99.45 MB/sec
 Timing buffered disk reads:  296 MB in  3.01 seconds =  98.35 MB/sec

/dev/sdg
 Timing buffered disk reads:  248 MB in  3.01 seconds =  82.27 MB/sec
 Timing buffered disk reads:  290 MB in  3.00 seconds =  96.58 MB/sec
 Timing buffered disk reads:  284 MB in  3.03 seconds =  93.84 MB/sec
 Timing buffered disk reads:  282 MB in  3.01 seconds =  93.61 MB/sec
 Timing buffered disk reads:  280 MB in  3.00 seconds =  93.32 MB/sec

/dev/sdh
 Timing buffered disk reads:  156 MB in  3.03 seconds =  51.53 MB/sec
 Timing buffered disk reads:  244 MB in  3.01 seconds =  80.99 MB/sec
 Timing buffered disk reads:  154 MB in  3.04 seconds =  50.73 MB/sec
 Timing buffered disk reads:  236 MB in  3.01 seconds =  78.44 MB/sec
 Timing buffered disk reads:  154 MB in  3.04 seconds =  50.60 MB/sec

/dev/sdi
 Timing buffered disk reads:  156 MB in  3.00 seconds =  51.92 MB/sec
 Timing buffered disk reads:  238 MB in  3.01 seconds =  78.98 MB/sec
 Timing buffered disk reads:  156 MB in  3.02 seconds =  51.66 MB/sec
 Timing buffered disk reads:  244 MB in  3.02 seconds =  80.81 MB/sec
 Timing buffered disk reads:  156 MB in  3.03 seconds =  51.44 MB/sec

/dev/sdj
 Timing buffered disk reads:  214 MB in  3.02 seconds =  70.87 MB/sec
 Timing buffered disk reads:  216 MB in  3.02 seconds =  71.44 MB/sec
 Timing buffered disk reads:  216 MB in  3.02 seconds =  71.63 MB/sec
 Timing buffered disk reads:  216 MB in  3.03 seconds =  71.35 MB/sec
 Timing buffered disk reads:  216 MB in  3.02 seconds =  71.42 MB/sec

/dev/sdk
 Timing buffered disk reads:  226 MB in  3.01 seconds =  74.97 MB/sec
 Timing buffered disk reads:  224 MB in  3.01 seconds =  74.41 MB/sec
 Timing buffered disk reads:  226 MB in  3.03 seconds =  74.48 MB/sec
 Timing buffered disk reads:  224 MB in  3.01 seconds =  74.41 MB/sec
 Timing buffered disk reads:  230 MB in  3.00 seconds =  76.55 MB/sec

/dev/sdl
 Timing buffered disk reads:  292 MB in  3.01 seconds =  97.09 MB/sec
 Timing buffered disk reads:  320 MB in  3.02 seconds = 106.12 MB/sec
 Timing buffered disk reads:  324 MB in  3.01 seconds = 107.48 MB/sec
 Timing buffered disk reads:  326 MB in  3.02 seconds = 107.89 MB/sec
 Timing buffered disk reads:  320 MB in  3.01 seconds = 106.40 MB/sec

----------------------------------------------------------------------------
The raw results: test 3 (linux md on single volumes from hardware raid card)
----------------------------------------------------------------------------

# mdadm --create /dev/md0 -l raid5 -n 3 /dev/sd[def]
 Timing buffered disk reads:  240 MB in  3.00 seconds =  79.91 MB/sec
 Timing buffered disk reads:  296 MB in  3.01 seconds =  98.48 MB/sec
 Timing buffered disk reads:  244 MB in  3.00 seconds =  81.32 MB/sec
 Timing buffered disk reads:  296 MB in  3.00 seconds =  98.62 MB/sec
 Timing buffered disk reads:   92 MB in  3.03 seconds =  30.41 MB/sec
 Timing buffered disk reads:  146 MB in  3.06 seconds =  47.78 MB/sec
 Timing buffered disk reads:   92 MB in  3.05 seconds =  30.14 MB/sec
 Timing buffered disk reads:  138 MB in  3.05 seconds =  45.22 MB/sec
 Timing buffered disk reads:   92 MB in  3.08 seconds =  29.91 MB/sec
 Timing buffered disk reads:  136 MB in  3.05 seconds =  44.64 MB/sec

# mdadm --create /dev/md0 -n 5 -l 5 /dev/sd[defij]
 Timing buffered disk reads:  168 MB in  3.04 seconds =  55.29 MB/sec
 Timing buffered disk reads:  192 MB in  3.00 seconds =  63.97 MB/sec
 Timing buffered disk reads:  170 MB in  3.02 seconds =  56.32 MB/sec
 Timing buffered disk reads:  166 MB in  3.02 seconds =  55.01 MB/sec
 Timing buffered disk reads:  188 MB in  3.02 seconds =  62.28 MB/sec

-----------------------------------------------------------------------------
The raw results: test 4 (linux md directly on disks -- no hardware raid card)
-----------------------------------------------------------------------------

Individual disk speeds (these are now, respectively, SCSI IDs 1,2,3,4,5,
accessed directly as SCSI devices instead of being hidden behind the
controller's abstraction):
/dev/sdb:
 Timing buffered disk reads:  254 MB in  3.01 seconds =  84.34 MB/sec
 Timing buffered disk reads:  254 MB in  3.02 seconds =  84.17 MB/sec
 Timing buffered disk reads:  254 MB in  3.01 seconds =  84.40 MB/sec

/dev/sdc:
 Timing buffered disk reads:  254 MB in  3.02 seconds =  84.15 MB/sec
 Timing buffered disk reads:  254 MB in  3.02 seconds =  84.09 MB/sec
 Timing buffered disk reads:  254 MB in  3.02 seconds =  84.09 MB/sec

/dev/sdd:
 Timing buffered disk reads:  254 MB in  3.01 seconds =  84.37 MB/sec
 Timing buffered disk reads:  254 MB in  3.01 seconds =  84.40 MB/sec
 Timing buffered disk reads:  254 MB in  3.01 seconds =  84.45 MB/sec

/dev/sde:
 Timing buffered disk reads:  204 MB in  3.03 seconds =  67.27 MB/sec
 Timing buffered disk reads:  202 MB in  3.02 seconds =  66.90 MB/sec
 Timing buffered disk reads:  202 MB in  3.00 seconds =  67.30 MB/sec

/dev/sdf:
 Timing buffered disk reads:  228 MB in  3.01 seconds =  75.68 MB/sec
 Timing buffered disk reads:  226 MB in  3.00 seconds =  75.32 MB/sec
 Timing buffered disk reads:  228 MB in  3.01 seconds =  75.68 MB/sec

Note: partition 1 on each disk (b,c,d,e,f == SCSI-ID 1,2,3,4,5) is a
10000 MB partition at the start of the disk.

# mdadm --create /dev/md0 --auto=yes -l 0 -n 3 /dev/sd[bcd]1
 Timing buffered disk reads:  476 MB in  3.01 seconds = 158.27 MB/sec
 Timing buffered disk reads:  510 MB in  3.01 seconds = 169.69 MB/sec
 Timing buffered disk reads:  494 MB in  3.01 seconds = 163.98 MB/sec
 Timing buffered disk reads:  466 MB in  3.01 seconds = 154.74 MB/sec

# mdadm --create /dev/md0 --auto=yes -l 1 -n 3 /dev/sd[bcd]1
 Timing buffered disk reads:  252 MB in  3.02 seconds =  83.46 MB/sec
 Timing buffered disk reads:  254 MB in  3.01 seconds =  84.26 MB/sec
 Timing buffered disk reads:  252 MB in  3.01 seconds =  83.62 MB/sec
 Timing buffered disk reads:  252 MB in  3.01 seconds =  83.73 MB/sec

# mdadm --create /dev/md0 --auto=yes -l 5 -n 3 /dev/sd[bcd]1
 Timing buffered disk reads:  492 MB in  3.01 seconds = 163.59 MB/sec
 Timing buffered disk reads:  484 MB in  3.01 seconds = 160.72 MB/sec
 Timing buffered disk reads:  474 MB in  3.00 seconds = 157.97 MB/sec
 Timing buffered disk reads:  486 MB in  3.01 seconds = 161.70 MB/sec

# mdadm --create /dev/md0 --auto=yes -l 0 -n 5 /dev/sd[bcdef]1
 Timing buffered disk reads:  506 MB in  3.00 seconds = 168.64 MB/sec
 Timing buffered disk reads:  542 MB in  3.00 seconds = 180.57 MB/sec
 Timing buffered disk reads:  546 MB in  3.00 seconds = 181.79 MB/sec
 Timing buffered disk reads:  532 MB in  3.01 seconds = 176.89 MB/sec

# mdadm --create /dev/md0 --auto=yes -l 1 -n 5 /dev/sd[bcdef]1
 Timing buffered disk reads:  226 MB in  3.02 seconds =  74.87 MB/sec
 Timing buffered disk reads:  252 MB in  3.01 seconds =  83.62 MB/sec
 Timing buffered disk reads:  254 MB in  3.01 seconds =  84.40 MB/sec
 Timing buffered disk reads:  202 MB in  3.02 seconds =  66.79 MB/sec

# mdadm --create /dev/md0 --auto=yes -l 5 -n 5 /dev/sd[bcdef]1
 Timing buffered disk reads:  454 MB in  3.00 seconds = 151.21 MB/sec
 Timing buffered disk reads:  450 MB in  3.00 seconds = 149.87 MB/sec
 Timing buffered disk reads:  460 MB in  3.01 seconds = 152.85 MB/sec
 Timing buffered disk reads:  484 MB in  3.01 seconds = 161.04 MB/sec

--------------------------------------------------------------------------------