I've been testing various commercial solid-state drives (SSDs) for a project that involves stateless, mostly Windows-based virtual machines (VMs).
One thing about Windows VMs is that they can consume tens of thousands of IOPS while booting, which is a big part of why we are investigating SSDs.
Because SSDs are so fast, there are a few kernel settings, and even scheduler changes, that can be used to improve performance further.
Most kernel parameters have sensible defaults, but those defaults are really tuned for generic workloads. Once we start getting over 100,000 IOPS, it might make sense to change a few kernel parameters, or even swap the scheduler. At the very least, these changes and their effect on performance are what I'd like to investigate.
Note that there are many variables involved here, including the page size each SSD uses and the chunk size of the mdadm device, among others. Some variables are simply unknown.
The setup
We have a single Dell C6220 node with 128GB of RAM and 32 cores.
For SSDs, we have three 512GB Crucial M4s and three 480GB Sandisk drives, mostly configured in a stripe. (Yes, we've striped six SSDs using mdadm!) That said, because this setup is for stateless VMs, if the stripe fails (which is technically about six times more likely than a single drive failing), we will have to replace the failed drive and completely rebuild the OpenStack compute node to put it back into production.
According to the manufacturers' documentation, here is what each SSD is individually capable of:
SSD model                               | 4K Read IOPS | 4K Write IOPS | Read BW  | Write BW
Crucial M4 512GB                        | 45,000       | 50,000        | 450M     | 260M
Sandisk Extreme 480GB                   | 44,000       | 46,000        | 540M     | 480M
Max in a stripe (assuming all 6.0 Gbps) | 267,000      | 288,000       | 2970M    | 2220M
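As a quick sanity check, the "max in a stripe" row is just three of each drive's rated figures summed (read and write IOPS first, then read and write bandwidth in MB/s):
# echo $((3*45000 + 3*44000)) $((3*50000 + 3*46000))
267000 288000
# echo $((3*450 + 3*540)) $((3*260 + 3*480))
2970 2220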
The output below shows that we have three mdadm devices: md0 is a RAID1 for /boot, md1 is a stripe (RAID0) for the server's root filesystem, and md2 is an LVM physical volume backed by an mdadm stripe of partitions on all six drives. Also note that we are using 256K chunks.
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid0 sda2[0] sdf5[5] sdb2[1] sde5[4] sdc2[2] sdd5[3]
2757445632 blocks super 1.2 256k chunks
md1 : active raid0 sdf2[2] sde2[1] sdd2[0]
97516800 blocks super 1.2 256k chunks
md0 : active raid1 sde1[1] sdd1[0]
524224 blocks [2/2] [UU]
unused devices: <none>
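For reference, a stripe like md2 above could be created with something along these lines; this is a sketch reconstructed from the member partitions shown in /proc/mdstat, not necessarily the exact commands we ran:
# mdadm --create /dev/md2 --level=0 --chunk=256 --raid-devices=6 \
    /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd5 /dev/sde5 /dev/sdf5
# pvcreate /dev/md2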
Unfortunately, it seems that the C6220 only has 6.0 Gbps SATA interfaces for the first two drives, and 3.0 Gbps for the last four, so that will limit how fast we can go with this stripe of SSDs.
# dmesg | grep -i gbps | grep ata
[ 4.779898] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4.779928] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 4.779961] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 4.779988] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4.780016] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 4.780050] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
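As a rough, back-of-the-envelope estimate (the per-link figures here are assumptions, not measurements): a 6.0 Gbps SATA link delivers roughly 550 MB/s of usable throughput and a 3.0 Gbps link roughly 270 MB/s, and since a RAID0 read pulls evenly from every member, the four slower links effectively cap the whole array:
# echo $((2*550 + 4*270))   # total usable link bandwidth across all six drives, MB/s
2180
# echo $((6*270))           # stripe read ceiling if every member is held to the slowest link, MB/s
1620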
Baseline
Here are some baseline performance measurements of six 1TB SATA drives in an mdadm-based RAID10 configuration (this test was run on top of an LVM device, unlike the SSD tests, which run on pure mdadm devices). The numbers are all averages; note again that this is RAID10, not a stripe.
I'm not suggesting this is necessarily a fair comparison to the striped SSDs, but it's something to contrast against. Getting 1,331 IOPS out of a software RAID10 of SATA drives is not bad!
Test            | Bandwidth | IOPS | Latency
Read IOPS       | 5326KB    | 1331 | 96 msec
Read Bandwidth  | 313222KB  | 305  | 418 msec
Write IOPS      | 3073KB    | 767  | 166 msec
Write Bandwidth | 164564KB  | 160  | 795 msec
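For reference, a 4K random-read IOPS test of this general shape can be run with fio; the target device, queue depth, job count, and runtime below are illustrative choices, not necessarily the parameters behind the numbers in these tables:
# fio --name=randread --filename=/dev/md2 --direct=1 --rw=randread --bs=4k \
     --ioengine=libaio --iodepth=32 --numjobs=4 --runtime=60 --time_based \
     --group_reporting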
Default SSD Stripe
This test shows the performance of the SSD stripe with the default kernel parameters and scheduler (the mdadm device was created with 256K chunks). Note that usecs are much smaller than msecs (a usec is 1/1000 of a msec). I ran each of these tests four times and averaged the results.
Test            | Bandwidth | IOPS   | Latency
Read IOPS       | 523424KB  | 130855 | 964 usec
Read Bandwidth  | 1245MB    | 1244   | 102 msec
Write IOPS      | 373699KB  | 93423  | 1359 usec
Write Bandwidth | 1128MB    | 1128   | 111 msec
With 'tuned' kernel parameters and scheduler
Again, I ran each of these tests four times and averaged the results.
Test            | Bandwidth | IOPS   | Latency
Read IOPS       | 538894KB  | 134722 | 943 usec
Read Bandwidth  | 1258MB    | 1258   | 101 msec
Write IOPS      | 580296KB  | 145073 | 869 usec
Write Bandwidth | 1172MB    | 1171   | 107 msec
What we 'tuned'
Scheduler
The current scheduler is cfq:
# for i in $(echo {a..f}); do cat /sys/block/sd$i/queue/scheduler; done
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
Let's change it to noop, which may be faster in this situation.
# for i in $(echo {a..f}); do echo noop > /sys/block/sd$i/queue/scheduler; done
# for i in $(echo {a..f}); do cat /sys/block/sd$i/queue/scheduler; done
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
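None of these sysfs changes survive a reboot on their own. One way to make the scheduler change persistent is a small udev rule; the file name below is just an example:
# cat /etc/udev/rules.d/60-ssd-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-f]", ATTR{queue/scheduler}="noop"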
add_random
Let's stop the drives from contributing to the kernel's entropy pool.
# for i in $(echo {a..f}); do echo 0 > /sys/block/sd$i/queue/add_random; done
rq_affinity
And set rq_affinity to 0, which disables request completion affinity, so completed I/O is processed on whichever CPU handles the interrupt rather than being steered back to the CPU that issued the request.
# for i in $(echo {a..f}); do echo 0 > /sys/block/sd$i/queue/rq_affinity; done
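Alternatively, all three tweaks can be rolled into one small script and run at boot (for example from rc.local); the path and filename below are just examples:
#!/bin/bash
# /usr/local/sbin/tune-ssds.sh (example path)
# Apply the noop scheduler, disable the entropy contribution, and disable
# request completion affinity on the six SSDs backing the stripe.
for i in {a..f}; do
    echo noop > /sys/block/sd$i/queue/scheduler
    echo 0    > /sys/block/sd$i/queue/add_random
    echo 0    > /sys/block/sd$i/queue/rq_affinity
done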
Write IOPS are affected immensely
As can be seen in the results, changing the scheduler, add_random, and rq_affinity settings dramatically improves write performance: write bandwidth and write IOPS both rise by about 55%, and write latency drops by roughly a third.
Is it worth making these changes? I think so, but more testing should be done, especially with regard to the scheduler. Perhaps in a future post I'll try to determine which of these changes is responsible for the performance increase. These are interesting results nonetheless, and ones that may help speed up Windows VMs.
Test                 | Bandwidth | IOPS   | Latency
Write IOPS (default) | 373699KB  | 93423  | 1359 usec
Write IOPS (tuned)   | 580296KB  | 145073 | 869 usec
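The "about 55%" figure quoted above falls straight out of those two rows:
# echo "scale=1; (145073 - 93423) * 100 / 93423" | bc
55.2
# echo "scale=1; (580296 - 373699) * 100 / 373699" | bc
55.2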
As usual, if you have any questions, concerns, or criticisms, please let me know in the comments. I would love to hear about better methodologies for these tests or other tuning ideas.