Kernel settings, schedulers, and solid state drives

By Curtis Collicutt, Cloud Developer, Edmonton

I've been testing various commercial solid state drives for a project that involves stateless, mostly Windows-based, virtual machines (VMs).

One thing about Windows VMs is that they can use tens of thousands of IOPS while booting, which is part of why we are investigating solid state drives.

Because SSDs are so fast, there are a few kernel settings, and even scheduler changes, that can be used to improve performance.

I find that most kernel parameters have sensible defaults, but they are tuned for generic use. Once we start pushing past 100,000 IOPS, it might make sense to change a few kernel parameters, or even the I/O scheduler. At least, these changes and their effect on performance are what I'd like to investigate.

Note that there are many variables involved here, including the page size the SSDs use and the chunk size of the mdadm device, among others. Some variables are simply unknown.

The setup

We have a single Dell C6220 node with 128GB of RAM and 32 cores.

For SSDs, we have three 512GB Crucial M4s and three 480GB Sandisk drives, mostly configured in a stripe. (Yes, we've striped six SSDs using mdadm!) That said, because this setup is for stateless VMs, if the stripe fails (which is roughly six times more likely than a single-drive failure), we will have to replace the failed drive and completely rebuild the OpenStack compute node to put it back into production.

According to their documentation, here is what each SSD is individually capable of:

SSD model                                  4k Read IOPS   4k Write IOPS   Read BW/s   Write BW/s
Crucial M4 512GB                           45,000         50,000          450M        260M
Sandisk Extreme 480GB                      44,000         46,000          540M        480M
Max in a stripe (assuming all 6.0 Gbps)    267,000        288,000         2970M       2220M

The output below shows that we have three mdadm devices: md0 is a RAID1 for /boot; md1 is a stripe (RAID0) for the server's root filesystem; and md2 is an LVM physical volume built from an mdadm stripe of partitions across all six drives. Also note that we are using a 256K chunk size.

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid0 sda2[0] sdf5[5] sdb2[1] sde5[4] sdc2[2] sdd5[3]
     2757445632 blocks super 1.2 256k chunks
    
md1 : active raid0 sdf2[2] sde2[1] sdd2[0]
     97516800 blocks super 1.2 256k chunks
    
md0 : active raid1 sde1[1] sdd1[0]
     524224 blocks [2/2] [UU]
    
unused devices: <none>
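For reference, a 256K-chunk stripe like md2 can be built along these lines. This is only a sketch: the partitions match the mdstat output above, but the LVM volume group name (vg_vms) is an assumption rather than the name actually used.

# Sketch: build a six-disk RAID0 with a 256K chunk size, then use it as an
# LVM physical volume. The volume group name "vg_vms" is an assumption.
mdadm --create /dev/md2 --level=0 --raid-devices=6 --chunk=256 \
    /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd5 /dev/sde5 /dev/sdf5
pvcreate /dev/md2
vgcreate vg_vms /dev/md2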

Unfortunately, it seems that the C6220 only has 6.0 Gbps SATA interfaces for the first two drives, and 3.0 Gbps for the last four, so that will limit how fast we can go with this stripe of SSDs.

# dmesg | grep -i gbps | grep ata
[    4.779898] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    4.779928] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    4.779961] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    4.779988] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    4.780016] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    4.780050] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Baseline

Here are some baseline performance measurements of six 1TB SATA drives in an mdadm-based RAID10 configuration (unlike the SSD tests, which run on plain mdadm devices, this test was run on top of an LVM device). The numbers are all averages, and note that this is a RAID10 configuration rather than a stripe.

I'm not suggesting this is necessarily a fair comparison to the striped SSDs, but it's a useful point of contrast. Getting 1331 read IOPS out of a software RAID10 of SATA drives is not bad!

Test              Bandwidth   IOPS   Latency
Read IOPS         5326KB      1331   96 msec
Read Bandwidth    313222KB    305    418 msec
Write IOPS        3073KB      767    166 msec
Write Bandwidth   164564KB    160    795 msec
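For reference, numbers like these are typically gathered with a benchmarking tool such as fio. A rough sketch of a 4k random read job follows; the target file, size, queue depth, and runtime are illustrative assumptions rather than the exact configuration used for these results.

# Rough sketch of a 4k random read IOPS test with fio; the file name, size,
# and job parameters are illustrative only. direct=1 bypasses the page cache,
# and libaio with a queue depth keeps the drives busy.
fio --name=randread-4k --filename=/mnt/md2/fio.test --size=8G \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
    --numjobs=4 --group_reporting --runtime=60 --time_based

Swapping --rw=randread for randwrite, read, or write would give the other three tests in the tables.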

Default SSD Stripe

This test shows the performance of the SSD stripe with the default kernel parameters and scheduler; the mdadm device itself was created with a 256K chunk size. Note that microseconds (usec) are much smaller than milliseconds (msec): a usec is 1/1000 of a msec. I ran each of these tests four times and averaged the results.

Test              Bandwidth   IOPS     Latency
Read IOPS         523424KB    130855   964 usec
Read Bandwidth    1245MB      1244     102 msec
Write IOPS        373699KB    93423    1359 usec
Write Bandwidth   1128MB      1128     111 msec
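Averaging the four runs can be scripted. The sketch below assumes an fio job file named randwrite-4k.fio and fio's older "iops=" output format; both are illustrative assumptions, and the parsing will need adjusting for other fio versions.

# Run the same fio job four times and average the reported IOPS.
# The job file name and the "iops=" output format are assumptions.
total=0
for run in 1 2 3 4; do
    iops=$(fio randwrite-4k.fio | grep -o 'iops=[0-9]*' | head -1 | cut -d= -f2)
    total=$((total + iops))
done
echo "average IOPS: $((total / 4))"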

With "tuned" kernel parameters and scheduler

Again, I ran each of these tests four times and averaged the results.

Test              Bandwidth   IOPS     Latency
Read IOPS         538894KB    134722   943 usec
Read Bandwidth    1258MB      1258     101 msec
Write IOPS        580296KB    145073   869 usec
Write Bandwidth   1172MB      1171     107 msec

What we "tuned"

Scheduler

The current scheduler is cfq:

# for i in $(echo {a..f}); do cat /sys/block/sd$i/queue/scheduler; done
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]

Let's change it to noop, which may be faster in this situation.

# for i in $(echo {a..f}); do echo noop > /sys/block/sd$i/queue/scheduler; done
# for i in $(echo {a..f}); do cat /sys/block/sd$i/queue/scheduler; done
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq
[noop] deadline cfq

add_random

Let's disable the entropy contribution from disk I/O (add_random), which saves a small amount of overhead that fast devices don't need.

# for i in $(echo {a..f}); do echo 0 >  /sys/block/sd$i/queue/add_random; done

rq_affinity

And turn off request completion affinity (rq_affinity), letting I/O completions be handled on whichever CPU services the interrupt rather than being migrated back to the CPU that issued the request.

# for i in $(echo {a..f}); do echo 0 > /sys/block/sd$i/queue/rq_affinity; done
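Note that these echo commands do not persist across a reboot. A small boot-time script (run from rc.local, a udev rule, or whatever mechanism the distribution provides) is one way to reapply everything; a minimal sketch:

# Apply all three tweaks to sda..sdf; run this at boot (e.g. from rc.local or
# a udev rule, depending on the distribution) so the settings persist.
for i in {a..f}; do
    echo noop > /sys/block/sd$i/queue/scheduler
    echo 0 > /sys/block/sd$i/queue/add_random
    echo 0 > /sys/block/sd$i/queue/rq_affinity
done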

Write IOPS are affected immensely

As can be seen in the results, changing the scheduler, add_random, and rq_affinity settings greatly improves write performance: roughly 55% more IOPS and bandwidth on the random write test, along with noticeably lower latency.

Is it worth making these changes? I think so, but more testing should be done, especially around the scheduler change. Perhaps in a future post I'll try to determine which of these changes is responsible for the performance increase. These are interesting results nonetheless, and they may help speed up Windows VMs.

Test                    Bandwidth   IOPS     Latency
Write IOPS (default)    373699KB    93423    1359 usec
Write IOPS (tuned)      580296KB    145073   869 usec

As usual, if you have any questions, concerns, or criticisms, please let me know in the comments. I would love to hear about better methodologies for these tests or other tuning ideas.