Cyclone vs. the file cache: benchmarking a memory-mapped page cache

By Otto van der Schaaf

Every cache backend behind ModPageSpeed sits under one interface, so the optimizer above it never changes. That makes a direct question easy to ask: how much does the storage engine itself matter? We put Cyclone, the memory-mapped cache in mod_pagespeed 1.15 and ModPageSpeed 2.0, against the classic file-per-entry disk cache it replaced, and measured both across concurrency, realistic mixed traffic, latency tails, and eviction, on two very different machines.

The most useful result is a trend. Cyclone’s advantage is smallest on a laptop with a near-RAM SSD and grows to 4x-6x on the kind of storage a real server runs on. The faster the disk, the more the old design’s per-file cost stays hidden.

Two engines, one interface

Cyclone keeps one memory-mapped file per volume. A read returns a pointer into mapped memory, with no open/read/close and no heap copy. An in-process RAM tier holds the hot set, eviction runs inline as entries are written, and the admission policy resists one-hit-wonders flushing the working set.

The file-per-entry cache stores one file per entry. Every read is open, then read, then close, plus a copy into a fresh buffer; every write creates a file. Eviction is deferred to a periodic janitor that walks the whole cache directory, an O(n) scan that grows with the file count.
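To make the two read paths concrete, here is a minimal sketch in Python. It is not Cyclone's on-disk format or API: `MappedVolume`, its in-memory index, and the append-only layout are hypothetical stand-ins, chosen only to show that a mapped read is a slice of existing memory while a file-per-entry read is a full open/read/close plus a fresh buffer.

```python
import mmap
import os


def read_file_per_entry(path):
    """File-per-entry style: open, read into a fresh bytes object, close."""
    with open(path, "rb") as f:
        return f.read()


class MappedVolume:
    """Toy single-file volume (hypothetical layout, not Cyclone's format)."""

    def __init__(self, path, size):
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self._fd, size)
        self._map = mmap.mmap(self._fd, size)  # shared, writable mapping
        self._view = memoryview(self._map)
        self._index = {}                       # key -> (offset, length)
        self._tail = 0                         # next free byte in the volume

    def put(self, key, value):
        end = self._tail + len(value)
        if end > len(self._view):
            raise ValueError("volume full (this toy never evicts)")
        self._view[self._tail:end] = value
        self._index[key] = (self._tail, len(value))
        self._tail = end

    def get(self, key):
        offset, length = self._index[key]
        # Slicing a memoryview copies nothing and makes no system call.
        return self._view[offset:offset + length]

    def close(self):
        # Callers must release any views returned by get() before this.
        self._view.release()
        self._map.close()
        os.close(self._fd)
```

A caller would use `with volume.get(key) as hit:` and hand `hit` to whatever consumes the bytes. The view has to be released before the volume is closed, which is the toy version of the lifetime question any zero-copy read path has to answer.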

Everything below is the median of repeated runs on quiesced machines. Where the file cache wins, the numbers say so.

Concurrency: the file cache stops scaling

A live server answers many requests at once. We ramped concurrent readers from 1 to 64 against a warm cache. Cyclone’s mapped reads carry no per-operation lock or system call, so aggregate throughput climbs with cores. The file cache peaks near 4-8 threads, then declines as lock and system-call contention take over.

Aggregate reads per second at 64 threads, shown as the speedup followed by Cyclone vs. the file cache:

| Object size | Laptop (fast SSD) | Workstation (realistic storage) |
| --- | --- | --- |
| 100 B | 4.5x (2.72M vs 601k) | 5.6x (1.53M vs 274k) |
| 10 KB | 3.5x (2.09M vs 592k) | 4.7x (1.25M vs 269k) |
| 1 MB | 4.0x (84.7k vs 20.9k) | 4.0x (164k vs 41k) |

One case cuts the other way. At a single thread with large objects, the file cache is faster, because a warm read is a page-cache copy and nothing more. That edge disappears the moment real concurrency arrives.

Realistic traffic, under pressure

The workload is a Zipfian mix of reads, writes, and deletes over a heavy-tailed object-size distribution, with a working set about twice the cache size so eviction runs continuously, the way it does on a busy site. It runs on eight worker threads against a 1 GB cache with a 128 MB RAM tier.
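For readers who want to build a similar workload, the following is a rough sketch of a generator with the same shape. The specifics are assumptions, not the benchmark's actual parameters: the Zipf exponent, the 80/15/5 read/write/delete split, the Pareto shape, and the 1 MB size cap are all hypothetical values picked for illustration.

```python
import bisect
import itertools
import random


def zipf_cdf(n_keys, exponent=1.0):
    """Cumulative distribution over key ranks, rank 0 being the hottest key."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_keys + 1)]
    total = sum(weights)
    return list(itertools.accumulate(w / total for w in weights))


def make_workload(n_ops, n_keys, seed=0, read_frac=0.80, write_frac=0.15):
    """Yield (op, key, size_bytes) tuples; the remainder of the mix is deletes."""
    rng = random.Random(seed)
    cdf = zipf_cdf(n_keys)
    for _ in range(n_ops):
        # Inverse-CDF sampling: a few keys absorb most of the traffic.
        key = min(bisect.bisect_left(cdf, rng.random()), n_keys - 1)
        r = rng.random()
        if r < read_frac:
            op = "read"
        elif r < read_frac + write_frac:
            op = "write"
        else:
            op = "delete"
        # Heavy-tailed sizes: mostly small objects, occasionally large ones.
        size = min(int(rng.paretovariate(1.2) * 1024), 1 << 20)
        yield op, key, size
```

Choosing `n_keys` so that the total bytes across all keys is about twice the cache size would recreate the continuous eviction pressure described above.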

| Metric (workstation) | Cyclone | File cache |
| --- | --- | --- |
| Throughput | 111k ops/s | 27k ops/s |
| p99 latency | 1.0 ms | 2.1 ms |
| p999 tail | 2.1 ms | 8.4 ms |

On this storage Cyclone runs 4.1x the throughput with roughly a quarter of the p999 tail. The file cache often posts a higher hit rate, because it admits everything and lets its footprint grow to do it, but it serves what it holds far more slowly. On the laptop’s near-RAM SSD the throughput gap narrows to 1.6x, which is the point: the slower and more realistic the disk, the wider the margin.
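The two headline figures in that paragraph follow directly from the workstation numbers. A quick check of the arithmetic, using only the values reported in the table:

```python
# Workstation results as reported in the table above.
cyclone = {"throughput_ops": 111_000, "p99_ms": 1.0, "p999_ms": 2.1}
file_cache = {"throughput_ops": 27_000, "p99_ms": 2.1, "p999_ms": 8.4}

throughput_ratio = cyclone["throughput_ops"] / file_cache["throughput_ops"]
tail_fraction = cyclone["p999_ms"] / file_cache["p999_ms"]

print(f"throughput: {throughput_ratio:.1f}x")                    # 4.1x
print(f"p999 tail: {tail_fraction:.2f} of the file cache's")     # 0.25
```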

The full latency distribution, machine by machine, is in the interactive data explorer. The curves make the tail difference obvious in a way percentiles alone do not.
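As a reminder of what p99 and p999 summarize, and what they leave out, here is a small nearest-rank percentile sketch. The function and the sample latencies are illustrative only. They are not taken from the benchmark harness or its data.

```python
def percentile(samples, numerator, denominator):
    """Nearest-rank percentile: p99 is (99, 100), p999 is (999, 1000)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of n * numerator / denominator, so no float rounding.
    rank = -(-len(ordered) * numerator // denominator)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 995 fast requests and 5 slow ones.
latencies = [1.0] * 995 + [50.0] * 5

p99 = percentile(latencies, 99, 100)       # 1.0: the slow requests are invisible
p999 = percentile(latencies, 999, 1000)    # 50.0: the tail shows up here
```

With half a percent of requests stalling, p99 reports nothing wrong while p999 jumps by 50x, which is why a full distribution says more than any single percentile.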

The through-line: a slower disk amplifies the win

Here is every headline result as a ratio, Cyclone over the file cache, for both machines. The laptop is the conservative case, with cheap per-file cost, so the file cache looks its best. The workstation’s realistic storage is closer to production. In every pairing, the slower disk makes the gap larger.

| Test | Laptop (fast SSD) | Workstation (realistic) |
| --- | --- | --- |
| Concurrency, 64 threads | 4.5x | 5.6x |
| Realistic, under pressure | 1.6x | 4.1x |
| Eviction sweep, 100k files | 2.0x | 6.0x |
| Overflow write rate | 2.9x | 5.6x |

The practical lesson: benchmarks run on developer laptops with near-RAM SSDs understate what a memory-mapped cache buys in production.

Eviction: a flat line vs. a stall

Stream 100,000 files into a small cache and watch write throughput batch by batch. Cyclone evicts inline and holds steady. The file cache defers to its janitor; each time the scan fires, writers stall, and the walk gets more expensive as the file count climbs. On the workstation the file cache finished the same sweep 6x slower.

There is a real-world version of this. The file cache also writes a small file for every miss it remembers, so every not-found input URL becomes an inode the janitor later has to walk. On a busy server that churn inflates the very scan above. Cyclone’s single-file design and selective admission do not accumulate that debris.

Staying inside the cap

Write far more than a 16 MB cache holds, and the difference in discipline shows. Cyclone evicts inline and holds its footprint at 15.5 MB, just under the cap. The file cache’s deferred janitor lets the directory balloon well past the limit before it catches up: 5x over on the workstation, 9x on the laptop. Predictable disk usage is worth as much to an operator as raw speed.
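The mechanism behind the overshoot can be shown with a toy model. This is a simplified simulation, not either cache's eviction code: fixed-size entries, FIFO eviction, and a janitor that fires every N writes are all assumptions made for illustration, and the output is not meant to reproduce the measured 5x and 9x figures.

```python
from collections import deque


def peak_footprint(n_writes, entry_bytes, cap_bytes, janitor_every=None):
    """Return the peak bytes held while streaming writes into a capped cache.

    janitor_every=None models inline eviction (make room before each write).
    janitor_every=N models a deferred janitor that only trims every N writes.
    """
    if entry_bytes > cap_bytes:
        raise ValueError("entry larger than the cap")
    entries = deque()  # oldest entry on the left
    held = 0
    peak = 0
    for i in range(1, n_writes + 1):
        if janitor_every is None:
            # Inline: evict the oldest entries until the new one fits.
            while entries and held + entry_bytes > cap_bytes:
                held -= entries.popleft()
        entries.append(entry_bytes)
        held += entry_bytes
        peak = max(peak, held)
        if janitor_every is not None and i % janitor_every == 0:
            # Deferred: nothing is reclaimed until the periodic sweep runs.
            while entries and held > cap_bytes:
                held -= entries.popleft()
    return peak


CAP = 16 * 1024 * 1024   # 16 MB cap, as in the test above
ENTRY = 64 * 1024        # hypothetical 64 KB entries

inline_peak = peak_footprint(10_000, ENTRY, CAP)
deferred_peak = peak_footprint(10_000, ENTRY, CAP, janitor_every=1_000)
```

In this model the inline variant never exceeds the cap, while the deferred variant's peak is set by how much gets written between sweeps, not by the configured limit.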

The one caveat: size the cache to RAM

Memory-mapped I/O is fast while the mapped file stays resident in RAM. Push the cache far past physical memory and it pages to disk instead of reading from memory. We measured it: a 4 GB cache in a 2 GB memory budget runs about 1.8x slower than a right-sized one. Two things matter here. First, even while paging, it still beat the file cache at the same sizes. Second, the fix is just good practice, spelled out in the production deployment guide. Size the cache to fit RAM, and none of this applies.
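A deployment script could guard against this with a simple check. The sketch below is one way to do it and is not part of ModPageSpeed: the function names and the 50% headroom default are assumptions, and the right fraction depends on what else the machine runs.

```python
import os


def physical_memory_bytes():
    """Total physical RAM, via POSIX sysconf (Linux and macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_ram(cache_bytes, physical_bytes, headroom=0.5):
    """True if the cache fits within the given fraction of physical memory.

    headroom is a hypothetical safety margin, not a documented recommendation.
    """
    return cache_bytes <= physical_bytes * headroom


GIB = 1024 ** 3
oversized = fits_in_ram(4 * GIB, 2 * GIB)    # the paging case measured above
right_sized = fits_in_ram(1 * GIB, 2 * GIB)  # fits within half of the budget
```

In a container or VM the relevant number is the memory limit rather than the host's RAM, so `physical_bytes` is passed in explicitly instead of being read inside the check.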

Where this lands in the products

mod_pagespeed 1.15 already reads from Cyclone with no copy: a hit is a pointer into the mapped file, not a system call and a buffer allocation, which is exactly the read path these numbers measure. ModPageSpeed 2.0 carries the memory-mapped path further into the request, serving cached bytes toward the socket without an intermediate copy. How it does that is its own deep-dive, zero-copy serving between nginx and the worker: streaming large hits with kernel sendfile straight from the shared cache file, copying small ones because the syscall would cost more, and staying correct while the cache changes underneath a slow client. If you run a file-cache configuration under real concurrency or eviction pressure, the update is worth taking.
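The large-versus-small split can be illustrated with a short sketch. This is not the nginx or ModPageSpeed code path: `serve_hit` and the 64 KB threshold are hypothetical, and the sketch ignores the harder part mentioned above, staying correct while the cache changes under a slow client. It relies on Python's `socket.sendfile`, which uses the kernel's sendfile where available and falls back to plain sends otherwise.

```python
import socket

# Hypothetical cutoff; the real crossover depends on the platform and syscall cost.
SENDFILE_THRESHOLD = 64 * 1024


def serve_hit(sock, cache_file, offset, length, threshold=SENDFILE_THRESHOLD):
    """Send `length` bytes of a cache file starting at `offset` to a socket.

    Large hits go through sendfile so the bytes never enter user space.
    Small hits are read and sent directly, since the copy is cheap.
    """
    if length >= threshold:
        sock.sendfile(cache_file, offset=offset, count=length)
        return "sendfile"
    cache_file.seek(offset)
    sock.sendall(cache_file.read(length))
    return "copy"
```

`cache_file` must be a regular file opened in binary mode and `sock` a blocking stream socket, which are the requirements of `socket.sendfile` itself.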

A fair boundary on the claim: this is a cache benchmark, not an end-to-end page-serving benchmark. It shows the cache component is faster, not that a page renders some fixed percentage sooner. On most requests, origin fetch and optimization work dominate the clock. The cache win reaches the visitor precisely when a server is busy, with many concurrent hits, a large working set, and eviction running hot. That is also when a server most needs the help.

Frequently asked questions

Is Cyclone faster than the old file-per-entry cache? Under concurrency, eviction pressure, and on realistic server storage, yes: 4x to 6x on throughput in our tests, with a tighter latency tail. On an idle, over-provisioned cache backed by a near-RAM SSD, the two are close and the file cache can edge ahead.

Does mod_pagespeed 1.15 benefit from Cyclone? Yes. 1.15 already ships Cyclone and reads from it zero-copy: a cache hit is a pointer into the mapped file rather than a system call plus a heap copy. That read path is exactly what this benchmark measures.

Will a faster cache make my site load faster? It helps most when the cache is on the hot path: high concurrency, lots of already-optimized assets, and cache eviction under load. On a lightly loaded site the cache is rarely the bottleneck, so the end-to-end gain is smaller.

How large should I size the cache? Size it to fit comfortably in RAM. A memory-mapped cache is fast while the mapped file stays resident; push it far past physical memory and it pages to disk. Fitting RAM avoids that entirely.

The machines

Two deliberately different machines. The specs that matter for a cache benchmark are core count (concurrency), memory (the RAM tier and mapping headroom), and above all the storage stack and filesystem, which set the per-file system-call cost.

  • Laptop. Apple silicon, 10 cores, near-RAM APFS SSD. Fast cores, but few of them, and an SSD close to RAM speed. This is where per-file cost is cheapest, so it is the conservative case for every filesystem claim here.
  • Workstation. 64-core x86, 96 GB, ext4 on virtualized storage. Many threads draw the clean concurrency curve, and the virtualized storage adds the kind of filesystem-metadata overhead a production server actually pays.

Explore the numbers yourself, with per-machine toggles and the full latency distribution, in the data explorer.


In memory of Alan M. Carroll. Some of Cyclone’s design grew out of conversations with Alan, known to the Apache Traffic Server community as SolidWallOfCode and among the people who understood its architecture most deeply. His thinking about how a cache lays bytes out on disk and reclaims space without stalling helped shape how we thought about ours, and he shared it generously. We remember him with gratitude. Alan’s memorial at the Apache Software Foundation.
