Linux bcache with writeback cache (how it works and doesn’t work)

bcache is a simple and effective way to make large disks (typically rotating and slow) perform much like an SSD, using a small SSD or a small part of an SSD as a cache.

In general, bcache is a system for having devices composed of slow and large disks, with fast and small disks attached as a cache.

This article will discuss performance and some optimization tips as well as configuration of bcache.

The following terms are used in bcache to describe how it works and the parts of bcache:

backing device: slow and large disk (the disk intended to actually hold the data)
cache device: fast and small disk (the cache)
dirty cache: data present only in the cache device
writeback: writing to the cache device and only later (much later) to the backing device
writeback rate: speed at which the cache content is written to the backing device

A disk data cache has always existed: it is the free RAM of the operating system. When data is read from the disk it is copied to RAM; if the data is already in RAM, it is read from there rather than from the disk again. When data is written to the disk, it is first written to RAM and a few moments later written to the disk as well. The time data spends only in RAM is very short, since RAM is volatile.

bcache is similar, except that it has several cache operating modes. The mode that is fastest at writing data is writeback. It works like the RAM cache, only instead of RAM there is a SATA or NVME SSD device. The data may reside only in the cache for much longer, even forever, so it is somewhat riskier: if the SSD fails, the data that resided only in the cache is lost, and there is a good chance that the whole filesystem becomes inaccessible.

Performance Comparison

It is very difficult to gather reliable data from any test, whether with real workloads or with benchmark programs. They always give extremely variable and unstable values. The various caches involved and the type of filesystem (btrfs, journaled, etc.) make the values vary widely. It is advisable to ignore small differences (say 5-10%).

The following performance data refers to the test below (random, mixed reads and writes), trying to always maintain the same conditions and repeating each test three times in immediate sequence.

$ sysbench fileio --file-total-size=2G --file-test-mode=rndrw --time=30 --max-requests=0 run
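
Note that sysbench needs its test files created before the run phase. A typical full sequence (a sketch; option names may vary slightly between sysbench versions) is:

$ sysbench fileio --file-total-size=2G prepare
$ sysbench fileio --file-total-size=2G --file-test-mode=rndrw --time=30 --max-requests=0 run
$ sysbench fileio --file-total-size=2G cleanup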

The tables below show the performance of the separate devices:

Performance of the backing device (RAID 1 with 1TB rotating disks)

Throughput (three runs):
    read, MiB/s:     0.22 | 0.23 | 0.19
    written, MiB/s:  0.15 | 0.16 | 0.13
Latency, ms (three runs):
    max:             174.92 | 879.59 | 1335.30
    95th percentile: 87.56 | 87.56 | 89.16

Performance of the cache device (SSD SATA 100GB)

Throughput (three runs):
    read, MiB/s:     7.28 | 7.21 | 7.51
    written, MiB/s:  4.86 | 4.81 | 5.01
Latency, ms (three runs):
    max:             126.55 | 102.39 | 107.95
    95th percentile: 1.47 | 1.47 | 1.47

The theoretical expectation that a bcache device will be as fast as the cache device is (physically) impossible to achieve. On average, bcache is significantly slower and only sometimes approaches the same performance as the cache device. Improved performance almost always requires various compromises.

Consider an example: assume a 1TB bcache device with a 100GB cache. When writing a 1TB file, the cache device fills up, is partially emptied to the backing device, fills again, and so on until the file is fully written.

Because of this (and also because part of the cache has to keep serving reads), there is a limit (the sequential cutoff) on how much sequential data from a file is written to the cache. Once the limit is exceeded, the file data is written (or read) directly to the backing device, bypassing the cache.
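
Whether this bypass is happening can be checked through the statistics bcache exposes in sysfs (a sketch based on the kernel bcache documentation; exact entries may vary between kernel versions):

# cat /sys/block/bcache0/bcache/sequential_cutoff
# cat /sys/block/bcache0/bcache/stats_total/bypassed
# cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio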

bcache also monitors the response latency of the devices and bypasses the cache when it considers a device congested; the default thresholds are disproportionately low, especially for SATA SSDs, and this degrades the performance of the cache.

The dirty cache should be emptied, both to reduce the risk of data loss and to have free cache available when it is needed. However, this should only be done when the devices show little or no activity, otherwise the performance available for normal use collapses.

Unfortunately, the default settings are too conservative, and the built-in writeback rate adjustment is crude. To improve the writeback rate adjustment it is necessary to write a program (I wrote a script for this).
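
The script itself is not reproduced here, but a minimal sketch of the idea (not the actual script; it uses /dev/md0 as the backing device as in the example later in this article, treats the idle check as a placeholder, and assumes, as is commonly reported, that writing 0 to writeback_percent lets dirty data flush without the usual throttling) could look like this:

#!/bin/bash
# Sketch: let writeback run unthrottled while the backing device looks idle,
# and restore the tuned writeback_percent as soon as there is activity.
BCDEV=/sys/block/bcache0/bcache

while true; do
    # I/Os currently in progress on the backing device (field 12 of /proc/diskstats)
    inflight=$(awk '$3 == "md0" {print $12}' /proc/diskstats)
    if [ "${inflight:-0}" -eq 0 ]; then
        echo 0  > "$BCDEV/writeback_percent"   # idle: flush the dirty cache quickly
    else
        echo 40 > "$BCDEV/writeback_percent"   # busy: preserve normal-use performance
    fi
    sleep 60
done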

The following commands provide the necessary tuning (it must be repeated at each startup) to get better performance from the bcache device: the first two disable the congestion thresholds that make bcache bypass the cache, the third raises the sequential cutoff, and the fourth raises the percentage of the cache allowed to hold dirty data.

# echo 0 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
# echo 0 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us
# echo 600000000 > /sys/block/bcache0/bcache/sequential_cutoff
# echo 40 > /sys/block/bcache0/bcache/writeback_percent

The following tables compare the performance with the default values and the optimization results.

Performance with default values (SSD SATA 100GB cache)

Throughput (three runs):
    read, MiB/s:     3.37 | 2.67 | 2.61
    written, MiB/s:  2.24 | 1.78 | 1.74
Latency, ms (three runs):
    max:             128.51 | 102.61 | 142.04
    95th percentile: 9.22 | 10.84 | 11.04

Performance with optimizations (SSD SATA 100GB cache)

Throughput (three runs):
    read, MiB/s:     5.96 | 3.89 | 3.81
    written, MiB/s:  3.98 | 2.59 | 2.54
Latency, ms (three runs):
    max:             131.95 | 133.23 | 117.76
    95th percentile: 2.61 | 2.66 | 2.66

Performance with the writeback rate adjustment script (SSD SATA 100GB cache)

Throughput (three runs):
    read, MiB/s:     6.25 | 4.29 | 5.12
    written, MiB/s:  4.17 | 2.86 | 3.41
Latency, ms (three runs):
    max:             130.92 | 115.96 | 122.69
    95th percentile: 2.61 | 2.66 | 2.61

For single operations on large files (with nothing else happening in the system), adjusting the writeback rate makes no relevant difference.

Prepare the backing, cache and bcache device

To create a bcache device you need to install the bcache-tools. The command for this is:

# dnf install bcache-tools

bcache devices are visible as /dev/bcacheN (for example /dev/bcache0 ). Once created, they are managed like any other disk.

More details are available at https://docs.kernel.org/admin-guide/bcache.html

CAUTION: Any operation performed can immediately destroy the data on the partitions and disks on which you are operating. Backup is advised.

In the following example /dev/md0 is the backing device and /dev/sda7 is the cache device.

WARNING: A bcache device cannot be resized.
NOTE: bcache refuses to use partitions or disks with a filesystem already present.

To delete an existing filesystem you can use:
# wipefs -a /dev/md0 
# wipefs -a /dev/sda7 

Create the backing device (and therefore the bcache device)

# bcache make -B /dev/md0
If necessary (when the device status is inactive):
# bcache register /dev/md0

Create the cache device (and attach the cache to the backing device)

# bcache make -C /dev/sda7
If necessary (when the device status is inactive):
# bcache register /dev/sda7
# bcache attach /dev/sda7 /dev/md0
# bcache set-cachemode /dev/md0 writeback
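
To verify that writeback mode is active, the current cache mode can also be read from sysfs (the selected mode appears in brackets); the output should look roughly like this:

# cat /sys/block/bcache0/bcache/cache_mode
writethrough [writeback] writearound none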

Check the status

# bcache show

The output from this command includes information similar to the following:
(if the status of a device is inactive, it means that it must be registered)

Name         Type        State            Bname      AttachToDev
/dev/md0     1 (data)    clean(running)   bcache0    /dev/sda7
/dev/sda7    3 (cache)   active           N/A        N/A
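
Once active, /dev/bcache0 can be formatted and mounted like any other block device. For example (btrfs is only an example here; any filesystem and mount point will do):

# mkfs.btrfs /dev/bcache0
# mount /dev/bcache0 /mnt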

Optimize

# echo 0 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
# echo 0 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us
# echo 600000000 > /sys/block/bcache0/bcache/sequential_cutoff
# echo 40 > /sys/block/bcache0/bcache/writeback_percent
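
These settings are lost at every reboot. One way to reapply them automatically (a sketch with a hypothetical unit name; adapt the device paths and values to your system) is a small oneshot systemd unit:

# /etc/systemd/system/bcache-tune.service (hypothetical name)
[Unit]
Description=Apply bcache performance tuning
Requires=dev-bcache0.device
After=dev-bcache0.device

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us'
ExecStart=/bin/sh -c 'echo 0 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us'
ExecStart=/bin/sh -c 'echo 600000000 > /sys/block/bcache0/bcache/sequential_cutoff'
ExecStart=/bin/sh -c 'echo 40 > /sys/block/bcache0/bcache/writeback_percent'

[Install]
WantedBy=multi-user.target

Enable it with:

# systemctl enable bcache-tune.service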

In closing

Hopefully this article will provide some insight on the benefits of bcache if it suits your needs.

As always, nothing fits all cases and all preferences. However, understanding (even roughly) how things work, and especially how they don't work, as well as how to adapt them, makes the difference between satisfactory results and disappointment.


Addendum

The following tables show the performance with an SSD NVME cache device rather than the SATA SSD used above.

Performance of the cache device (SSD NVME 100GB)

Throughput (three runs):
    read, MiB/s:     16.31 | 16.17 | 15.77
    written, MiB/s:  10.87 | 10.78 | 10.51
Latency, ms (three runs):
    max:             17.50 | 15.30 | 46.61
    95th percentile: 1.10 | 1.10 | 1.10

Performance with optimizations (SSD NVME 100GB cache)

Throughput (three runs):
    read, MiB/s:     7.96 | 6.87 | 7.73
    written, MiB/s:  5.31 | 4.58 | 5.15
Latency, ms (three runs):
    max:             50.79 | 84.40 | 108.71
    95th percentile: 2.00 | 2.03 | 2.00

Performance with the writeback rate adjustment script (SSD NVME 100GB cache)

Throughput (three runs):
    read, MiB/s:     8.43 | 7.52 | 7.34
    written, MiB/s:  5.62 | 5.02 | 4.89
Latency, ms (three runs):
    max:             72.71 | 78.60 | 50.61
    95th percentile: 2.00 | 2.03 | 2.11
Comments

  1. Andriy

    Do I understand correctly that this is the ideal solution for old laptops, with a small SSD and a slow hard drive?

    • For an old laptop there are various conditions:
      – Being able to connect two internal disks (backing and cache).
      – Already having an SSD to use at least partially as a cache
      (otherwise it may be better to buy a bigger SSD and use it as the main disk).
      – If bcache is used for /root, knowing how to bypass installers of distributions that don't allow installing to bcache
      (without good experience this is very difficult to do).

  2. If I understand correctly, bcache imitates the behavior of an SSHD, in that it first puts data on the SSD drive, then when either the cache fills up or drive activity pauses, the data is written to the physically spinning hard disk. Do I have that correct, and does the performance of bcache approximate that of an SSHD?

    Just wondering,

    Ernie

    • The speed of any cache system, depending on the various reading and writing situations, varies between the speed of the slowest disk and that of the fastest disk.
      It can be said that it is a matter of probability: the average of the various speeds obtained is probably about half that of the fastest disk.

      Let’s say we have a store, with a far (and slow) warehouse that can hold 1000 items, and a nearby (and fast) one that can hold 100 items.

      100 different items are requested; they are not in the nearby warehouse, so they are requested (slowly) from the far warehouse and brought to the nearby warehouse.

      From this starting point we can consider the two probabilistic extremes and the middle case.

      1000 items are requested, always of a different type from those in the nearby warehouse: every item is always slowly requested from the far warehouse.
      1000 items are requested, always of the same type as those in the nearby warehouse: every item is always quickly served from the nearby warehouse.
      A few items are requested from the nearby warehouse, some from the far one and brought to the nearby warehouse, where future requests find them.

      The average retrieval speed of items from warehouses is the average of the speeds of the various request types.
      (Depends on the likelihood of various types of requests.)

    • The difference between bcache and an SSHD is that with bcache you can choose the size and ratio of the disks, and vary the settings that affect how the disks are used.

      It would be more accurate to say that SSHDs mimic bcache.

  3. Grandpa Leslie Satenstein, Montreal,Que

    A very well done presentation. Thank you

    • Thank you for your comment.

      This article required a lot of finishing work; it was born as a mass of data and technical information that wanted to get out of my head 🤣.

      50% of the credit for the final result belongs to the reviewers, @rlengland Richard England .

  4. Neil Darlow

    Can’t the same be achieved with lvm-cache?

    I have an ASUS laptop with Intel Optane drive and I use lvm-cache to implement a similar scheme. It also has the advantage that LVM2 is in the base install so no additional packages are required.

    • I’ve never used lvm-cache, so anything I say has little weight or relevance.
      My impression is that bcache is easier to manage.

      For the rest I suppose that both bcache and lvm-cache have their merits and faults, and that the choice is a matter of personal preference.

      Both cache systems have been in the mainline Linux kernel for many years, so both require only the maintenance and management programs to be installed.

  5. Jan Slupski

    Does a single bcache (SSD) partition always serve a single slow (HDD) partition (filesystem), or can it be hooked to multiple?

    And what about full-disk encryption? Can both be in luks configuration?

    And one more – I see that your master is a RAID configuration. Can both be based on a RAID1 configuration (otherwise bcache-stored data is more prone to hardware failures)?

    Right now I’m using both RAID1 and encryption in my setup.

    • A cache disk or partition can be attached to multiple backing devices (slow disks).

      bcache is filesystem agnostic, working at the sector level of the disk.

      It shouldn't have problems with any kind of filesystem, apart from possible bugs; for example, several years ago problems with btrfs were reported (I'm using bcache with btrfs).

      If the cache disk fails, it doesn’t matter whether you have a RAID or not, the damage would be the same (see above).
      I have a RAID 1 because I don’t trust rotating disks.

    • P.S. I have no experience with cache systems and encryption

  6. Rolf Fokkens

    As the maintainer of bcache-tools I'm really happy to see this article! Unfortunately I haven't been using bcache myself for a while (the cost of SSDs has dropped over the years, so I don't have a regular HDD anymore), so my maintainership has become a really passive role. In general bcache is really stable, so not much effort is required, but testing and fixing issues will be too much effort for me. It would be good if somebody else would take over the maintainership. So if anybody is interested… please let me know.

    • Thanks for your job.
      Encouragement, best wishes and good luck in whatever you do in the future.

    • A non-cheap SSD costs 4 times as much as a rotating disk.

      It is not recommended to use SSD in RAID.

      If you need a lot of storage and backup space, using SSD can be a waste of resources and reliability.

      For example, I have 10TB of spinning disks for storage, backups, virtual machines, etc., and 1.5TB of SSD for operating systems, normal data, etc., and I use a small part of it as a cache.

      If I have to do large data maintenance, I disable the cache so as not to stress the SSD and unnecessarily reduce its life.

      However, it is true that using a caching system is a particular and rare condition and partly a matter of personal preference.

  7. Hugo Gaudin

    Is this any easier with Stratis? I wish there was an updated tutorial on it. Cockpit has no documentation on how to use the feature or any guidance. Does a Stratis cache pool work automatically if set up using Cockpit functions? Cockpit seems to make it simple, similar to the functions of TrueNAS’s cache setup, but there are a lot of questions left unanswered.

    Also, how come the Stratis docs mention that it's designed to consolidate a bunch of features of the storage stack, and there's a passing mention of RAID being a feature of the regular storage stack, yet Stratis has no RAID or redundancy feature besides some mention of error correction. Sure, you can put a Stratis pool on an MD RAID array, but there are some noteworthy problems with doing this, including boot stability issues (kind of a big issue if you ask me). What is the best practice method to get drive redundancy in a Stratis pool? Nobody has mentioned this anywhere. When you're dealing with a storage pool of many disks, it's pretty important. The only demonstration videos I see are single drive pool demonstrations in a VM.

  8. Eddie

    I’ve been running bcache for a couple of years, with the default configuration, and it’s worked fine. I’m curious about the details of the optimizations you mention though, as I didn’t see any mention of that in the article.

    What do the optimizations accomplish?
    What is the reason for each specific tweak?
    Is there a source for your rate adjustment script?

    • In the article there are gray boxes with a title or text above them regarding optimizations.

      The article is just a description of a certain configuration, and the performance before and after setting changes.

      The script is primarily meant to:
      – be launched at system startup, to activate the optimizations, adjust the writeback rate, and activate some data monitoring;
      – be launched at system shutdown, to force writeback of the dirty cache.

      It can be found by searching for ‘bcacheMonitor SourceForge’ on Google.
      (Tested on my system; there are no guarantees, everyone has their own cases and preferences.)

  9. Hugo Gaudin

    What's your take on medium-speed but cheap, large SATA SSDs, vs. cheaper slow hard drives combined with faster NVMe SSDs? Compared to hard drives, you've got considerable speed boosts all the time with SATA drives, but with smaller, more expensive NVMe drives you've got potentially even faster boosts but only some of the time… I guess it's always a question of storage budget vs. fiscal budget.

    The other side of things around SSDs (both types) is whether cache-less drives are worth the cost savings. Does Linux have a good-enough in-memory drive cache, or would you see more performance just mapping a RAM-drive and setting up caching manually, assuming you have a bullet-proof battery backup (this might've been the best usage scenario for persistent memory)?

    • Personally I think it's a matter of preference and what you have available.
      I already have two NVME SSDs and one SATA SSD available; using a small partition of the SSD disk as a cache only gives me advantages.
      And I prefer to avoid using SSDs in a RAID, or cheap SSDs as storage or backup.
      And I try to avoid spending 4 times more on big SSD disks.

      SATA SSDs can't go above about 550MB/s, while NVME SSDs often exceed 2GB/s; there is no comparison in raw performance.
      In real use, whether the difference has an impact depends on your use case.

      Using RAM as a write cache for anything but extremely short periods is very risky: the danger is not only a power failure, but also an operating system crash or any similar problem, and mitigating it is expensive.
      An SSD cache is persistent and permanent, with a vastly lower risk that can be mitigated at low cost.

      Personally I believe that one can only try, avoiding the search for a non-existent perfection or the best.

  10. FeRD (Frank Dana)

    Interesting writeup, thanks! bcache is something I haven’t had an opportunity to look into at all, until now.

    …The thing I’m wondering is, why are your initial numbers (pre-bcache) so abysmal? The improvements here are nice, and certainly any kind of speedup is welcome. But it all feels colored by the fact that the entire I/O subsystem being profiled just seems like a complete dog — in all honesty I can’t even figure out whether the gains are more, or less, impressive in context.

    The initial benchmarks looked so terrible to me, just on the face of them, that I ran the same tests on my own hardware just to have a basis for comparison.

    My 9-year-old (I like to live dangerously?) 2TB spinning-rust HDD, actively in use in my fileserver, posts 1.60 MiB/s read, 1.07 MiB/s write, on the rndrw test. Latency breakdown is 0.01 / 2.55 / 54.62, with the 95th percentile being 12.08. And that’s WITH active BitTorrent traffic hitting the same volume! (Although it probably wasn’t writing anything, during the test.)

    On my other machine, with a SATA-6Gbps SSD, I get 37.41 MiB/s read, 24.94 MiB/s write.

    I just feel like there’s something way wrong if a 4-year-old SSD on the SATA bus (a Cheap & Cheerful TEAMGroup 1TB model I grabbed from Newegg for $smallest) is lapping an NVMe SSD with direct PCIe bus access, no?

    • Yes, your argument is interesting and maybe partly right (for example I have PCIE 3).
      No, because you support it with weak or inconsistent evidence.

      First there is the fact that the results of any test (as written in the article) are highly variable, unstable, unreliable.
      There are too many things you don't know about how the tools used and the test conditions work (and I don't know how they work for you).
      Let's say it's like statistics, which measures precise and limited things: if you extend the range of what is measured you can prove anything, but without any real solidity.
      The purpose of the article, and therefore of the tests, was not to measure absolute performance (or worse, peak performance), but to measure the differences in the (probably) obtainable average performance, before and after some improvements.
      And to highlight the limits and advantages of a cache system.
      I spent several hours looking for tests and conditions that showed the difference in average use in various cases, and that were at least reliable.
      And despite this I had to repeat the tests dozens of times, sometimes even recreating the conditions from scratch.
      Had I had the huge variation you had in your tests, I would have regarded the values as unreliable, and discarded or completely revised the tests and conditions.
      A SATA connection can transfer data at up to 550MB/s; if the disks process data at less than this speed, it basically doesn't matter whether the connection is SATA, USB3, or PCIE.

      Not knowing what your test tools and conditions really do, and not being able to verify them (just as you cannot verify mine), giving weight to hypotheses other than the variability of the obtainable values is, for me, like throwing a die with random hypotheses written on its faces.

      However, trying to give you something solid on which you can draw your eventual assessments,

      hdparm -tT /dev/…

      /dev/bcache0: (bcache device)
      Timing cached reads: 28058 MB in 2.00 seconds = 14053.30 MB/sec
      Timing buffered disk reads: 456 MB in 3.02 seconds = 151.01 MB/sec

      /dev/md0: (backend spin disk)
      Timing cached reads: 28868 MB in 2.00 seconds = 14461.11 MB/sec
      Timing buffered disk reads: 442 MB in 3.00 seconds = 147.21 MB/sec

      /dev/sda: (cache sata disk)
      Timing cached reads: 28432 MB in 2.00 seconds = 14242.78 MB/sec
      Timing buffered disk reads: 1574 MB in 3.00 seconds = 524.55 MB/sec

      /dev/nvme0n1: (cache nvme disk)
      Timing cached reads: 28178 MB in 2.00 seconds = 14115.01 MB/sec
      Timing buffered disk reads: 4630 MB in 3.00 seconds = 1542.93 MB/sec

      Gnome Disks Benchmark

      /dev/bcache0:  143.1 MB/s
      /dev/md0:      139.5 MB/s
      /dev/sda:      556.0 MB/s
      /dev/nvme0n1:  2.7 GB/s
