Cyclone vs. the file cache: benchmarking a memory-mapped page cache
Every cache backend behind ModPageSpeed sits under one interface, so the optimizer above it never changes. That makes a direct question easy to ask: how much does the storage engine itself matter? We put Cyclone, the memory-mapped cache in mod_pagespeed 1.15 and ModPageSpeed 2.0, against the classic file-per-entry disk cache it replaced, and measured both across concurrency, realistic mixed traffic, latency tails, and eviction, on two very different machines.
The most useful result is a trend. Cyclone’s advantage is smallest on a laptop with a near-RAM SSD and grows to 4x-6x on the kind of storage a real server runs on. The faster the disk, the more the old design’s per-file cost stays hidden.
Two engines, one interface
Cyclone keeps one memory-mapped file per volume. A read returns a pointer into mapped memory, with no open/read/close and no heap copy. An in-process RAM tier holds the hot set, eviction runs inline as entries are written, and the admission policy keeps one-hit wonders from flushing the working set.
The file-per-entry cache stores one file per entry. Every read is open, then read, then close, plus a copy into a fresh buffer; every write creates a file. Eviction is deferred to a periodic janitor that walks the whole cache directory, an O(n) scan that grows with the file count.
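The two read paths described above can be sketched in a few lines of Python. This is an illustration only, not Cyclone's code: `build_volume`, `MappedVolume`, and `file_per_entry_read` are hypothetical names, and the flat offset/length index is an assumption made to keep the example short.

```python
import mmap
import os
import tempfile


def file_per_entry_read(path):
    """File-per-entry path: open, read, close, and a fresh bytes object per hit."""
    with open(path, "rb") as f:
        return f.read()


def build_volume(path, entries):
    """Concatenate entries into one file; return {key: (offset, length)}."""
    index = {}
    offset = 0
    with open(path, "wb") as f:
        for key, payload in entries.items():
            f.write(payload)
            index[key] = (offset, len(payload))
            offset += len(payload)
    return index


class MappedVolume:
    """One mapped file holding many entries; a read is a view into the mapping."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._view = memoryview(self._map)

    def read(self, offset, length):
        # Slicing a memoryview does not copy; the result references mapped pages.
        return self._view[offset:offset + length]

    def close(self):
        # Callers must release their views first, or closing the map raises BufferError.
        self._view.release()
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        vol_path = os.path.join(d, "volume")
        index = build_volume(vol_path, {"a": b"hello", "b": b"world!"})
        vol = MappedVolume(vol_path)
        hit = vol.read(*index["b"])
        print(bytes(hit))
        hit.release()
        vol.close()
```

The mapped path pays for open and map once per volume; the file-per-entry path pays for open and close on every hit, which is the per-file cost the benchmarks below measure.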
Everything below is the median of repeated runs on quiesced machines. Where the file cache wins, the numbers say so.
Concurrency: the file cache stops scaling
A live server answers many requests at once, so we ramped concurrent readers from 1 to 64 against a warm cache. Cyclone’s mapped reads carry no per-operation lock or system call, so aggregate throughput climbs with cores. The file cache peaks near 4-8 threads, then declines as lock and system-call contention take over.
Aggregate reads per second at 64 threads, shown as Cyclone’s speedup with the raw figures (Cyclone vs. file cache) in parentheses:
| Object size | Laptop (fast SSD) | Workstation (realistic storage) |
|---|---|---|
| 100 B | 4.5x (2.72M vs 601k) | 5.6x (1.53M vs 274k) |
| 10 KB | 3.5x (2.09M vs 592k) | 4.7x (1.25M vs 269k) |
| 1 MB | 4.0x (84.7k vs 20.9k) | 4.0x (164k vs 41k) |
One case cuts the other way. At a single thread with large objects, the file cache is faster, because a warm read is a page-cache copy and nothing more. That edge disappears the moment real concurrency arrives.
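The ratios in the 64-thread table are plain division of the two throughput figures in each cell. The sketch below recomputes them from the published numbers. Because those throughputs are rounded, a recomputed ratio can land a rounding step away from the printed one, so the check allows a tolerance of 0.1 (an assumption of this sketch, not something the benchmark specifies).

```python
# (Cyclone reads/s, file cache reads/s, published ratio) at 64 threads,
# copied from the table above.
TABLE = {
    ("100 B", "laptop"): (2.72e6, 601e3, 4.5),
    ("100 B", "workstation"): (1.53e6, 274e3, 5.6),
    ("10 KB", "laptop"): (2.09e6, 592e3, 3.5),
    ("10 KB", "workstation"): (1.25e6, 269e3, 4.7),
    ("1 MB", "laptop"): (84.7e3, 20.9e3, 4.0),
    ("1 MB", "workstation"): (164e3, 41e3, 4.0),
}


def speedup(cyclone, file_cache):
    """Throughput ratio, Cyclone over the file cache."""
    return cyclone / file_cache


def mismatches(table, tolerance=0.1):
    """Return the rows whose recomputed ratio differs from the published one."""
    return [
        row
        for row, (cyclone, file_cache, published) in table.items()
        if abs(speedup(cyclone, file_cache) - published) > tolerance
    ]


if __name__ == "__main__":
    for (size, machine), (cyclone, file_cache, published) in TABLE.items():
        ratio = speedup(cyclone, file_cache)
        print(f"{size:>6} {machine:<12} recomputed {ratio:.2f}x, published {published}x")
```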
Realistic traffic, under pressure
The workload is a Zipfian mix of reads, writes, and deletes over a heavy-tailed object-size distribution, with a working set about twice the cache size so eviction runs continuously, the way it does on a busy site. Eight worker threads drive a 1 GB cache with a 128 MB RAM tier.
| Metric (workstation) | Cyclone | File cache |
|---|---|---|
| Throughput | 111k ops/s | 27k ops/s |
| p99 latency | 1.0 ms | 2.1 ms |
| p999 tail | 2.1 ms | 8.4 ms |
On this storage Cyclone runs 4.1x the throughput with roughly a quarter of the p999 tail. The file cache often posts a higher hit rate, because it admits everything and lets its footprint grow to do it, but it serves what it holds far more slowly. On the laptop’s near-RAM SSD the throughput gap narrows to 1.6x, which is the point: the slower and more realistic the disk, the wider the margin.
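For readers who want to build a similar workload, here is a minimal sketch of a Zipfian operation stream with heavy-tailed object sizes. It is not the harness used for these numbers. The skew, the 80/15/5 operation mix, and the Pareto shape are all assumptions chosen for illustration; the article does not publish those parameters.

```python
import itertools
import random


def zipf_cum_weights(n_keys, skew=1.0):
    """Cumulative Zipf weights: the key at rank r gets weight 1 / r**skew."""
    return list(itertools.accumulate(1.0 / (r ** skew) for r in range(1, n_keys + 1)))


def workload(n_ops, n_keys, seed=0, skew=1.0, mix=(0.80, 0.15, 0.05)):
    """Yield (op, key, size) tuples; size is nonzero only for writes."""
    rng = random.Random(seed)
    cum = zipf_cum_weights(n_keys, skew)
    # Low-numbered keys are the hot set; high-numbered keys are the long tail.
    keys = rng.choices(range(n_keys), cum_weights=cum, k=n_ops)
    ops = rng.choices(("read", "write", "delete"), weights=mix, k=n_ops)
    for key, op in zip(keys, ops):
        # Pareto sizes: mostly small objects with occasional very large ones.
        size = int(1024 * rng.paretovariate(1.2)) if op == "write" else 0
        yield op, key, size


if __name__ == "__main__":
    for op, key, size in itertools.islice(workload(10, 1000), 10):
        print(op, key, size)
```

To reproduce the pressure described above, pick the key count so that the total bytes written come to roughly twice the cache size.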
The full latency distribution, machine by machine, is in the interactive data explorer. The curves make the tail difference obvious in a way percentiles alone do not.
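To make the p99 and p999 rows concrete, here is a small sketch of how such percentiles can be computed from raw latency samples. It uses the nearest-rank definition, which is an assumption; the benchmark may use a different estimator.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # 998 fast operations and 2 slow ones, in milliseconds (made-up values).
    latencies = [1.0] * 998 + [8.0] * 2
    print("p99 :", percentile(latencies, 99))
    print("p999:", percentile(latencies, 99.9))
```

With these made-up samples the two slow operations do not move p99 at all, but they set p999. That is why a stall that hits only a few requests shows up in the tail rows and in the full distribution rather than in the median.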
The through-line: a slower disk amplifies the win
Here is every headline result as a ratio, Cyclone over the file cache, for both machines. The laptop is the conservative case, with cheap per-file cost, so the file cache looks its best. The workstation’s realistic storage is closer to production. In every pairing, the slower disk makes the gap larger.
| Test | Laptop (fast SSD) | Workstation (realistic) |
|---|---|---|
| Concurrency, 64 threads | 4.5x | 5.6x |
| Realistic, under pressure | 1.6x | 4.1x |
| Eviction sweep, 100k files | 2.0x | 6.0x |
| Overflow write rate | 2.9x | 5.6x |
The practical lesson: benchmarks run on developer laptops with near-RAM SSDs understate what a memory-mapped cache buys in production.
Eviction: a flat line vs. a stall
Stream 100,000 files into a small cache and watch write throughput batch by batch. Cyclone evicts inline and holds steady. The file cache defers to its janitor; each time the scan fires, writers stall, and the walk gets more expensive as the file count climbs. On the workstation the file cache finished the same sweep 6x slower.
There is a real-world version of this. The file cache also writes a small file for every miss it remembers, so every not-found input URL becomes an inode the janitor later has to walk. On a busy server that churn inflates the very scan above. Cyclone’s single-file design and selective admission do not accumulate that debris.
Staying inside the cap
Write far more than a 16 MB cache holds, and the difference in discipline shows. Cyclone evicts inline and holds its footprint at 15.5 MB, just under the cap. The file cache’s deferred janitor lets the directory balloon well past the limit before it catches up: 5x over on the workstation, 9x on the laptop. Predictable disk usage is worth as much to an operator as raw speed.
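The difference between inline eviction and a deferred janitor can be shown with a toy model. The sketch below tracks only byte counts, evicts in insertion order, and runs the janitor every fixed number of writes. All three are simplifying assumptions. Neither class reflects the real engines' policies, and the class names are hypothetical.

```python
from collections import OrderedDict


class InlineEvictingCache:
    """Evicts the oldest entries during each write, so size stays within the cap."""

    def __init__(self, cap_bytes):
        self.cap = cap_bytes
        self.size = 0
        self.peak = 0
        self.entries = OrderedDict()

    def put(self, key, nbytes):
        # Assumes nbytes <= cap; a larger entry would exceed the cap by itself.
        self.size -= self.entries.pop(key, 0)
        while self.entries and self.size + nbytes > self.cap:
            _, old = self.entries.popitem(last=False)
            self.size -= old
        self.entries[key] = nbytes
        self.size += nbytes
        self.peak = max(self.peak, self.size)


class JanitorCache:
    """Admits every write; a janitor trims back to the cap every `interval` writes."""

    def __init__(self, cap_bytes, interval):
        self.cap = cap_bytes
        self.interval = interval
        self.size = 0
        self.peak = 0
        self.writes = 0
        self.scanned = 0
        self.entries = OrderedDict()

    def put(self, key, nbytes):
        self.size -= self.entries.pop(key, 0)
        self.entries[key] = nbytes
        self.size += nbytes
        self.peak = max(self.peak, self.size)
        self.writes += 1
        if self.writes % self.interval == 0:
            self.run_janitor()

    def run_janitor(self):
        # Walking every entry is the O(n) part: the cost grows with entry count.
        self.scanned += len(self.entries)
        while self.entries and self.size > self.cap:
            _, old = self.entries.popitem(last=False)
            self.size -= old


if __name__ == "__main__":
    cap = 16 * 1024 * 1024
    inline, deferred = InlineEvictingCache(cap), JanitorCache(cap, interval=500)
    for i in range(1000):
        inline.put(i, 64 * 1024)
        deferred.put(i, 64 * 1024)
    print("inline peak / cap  :", inline.peak / cap)
    print("deferred peak / cap:", deferred.peak / cap)
```

In this model the inline cache's peak never exceeds the cap, while the deferred cache's peak depends on how much is written between janitor runs, and each run scans more entries as the backlog grows.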
The one caveat: size the cache to RAM
Memory-mapped I/O is fast while the mapped file stays resident in RAM. Push the cache far past physical memory and it pages to disk instead of reading from memory. We measured it: a 4 GB cache in a 2 GB memory budget runs about 1.8x slower than a right-sized one. Two things matter here. Even while paging, it still beat the file cache at the same sizes; and the fix is just good practice, spelled out in the production deployment guide. Size the cache to fit RAM, and none of this applies.
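A sizing check along these lines can be scripted before deployment. The sketch below is an assumption-laden illustration, not the rule from the deployment guide: `cache_fits` and `reserve_fraction` are hypothetical, and the default of leaving half of RAM for everything else is arbitrary. `physical_ram_bytes` relies on `os.sysconf` names that are available on Linux and may not exist on other platforms.

```python
import os


def physical_ram_bytes():
    """Total physical memory via sysconf (Linux; availability varies elsewhere)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def cache_fits(cache_bytes, ram_bytes, reserve_fraction=0.5):
    """True when the cache fits in the share of RAM not reserved for other use."""
    return cache_bytes <= ram_bytes * (1.0 - reserve_fraction)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # The oversized case measured above: a 4 GB cache in a 2 GB memory budget.
    print(cache_fits(4 * GIB, 2 * GIB, reserve_fraction=0.0))
    print(cache_fits(1 * GIB, physical_ram_bytes()))
```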
Where this lands in the products
mod_pagespeed 1.15 already reads from Cyclone with no copy: a hit is a pointer into the mapped file, not a system call and a buffer allocation, which is exactly the read path these numbers measure. ModPageSpeed 2.0 carries the memory-mapped path further into the request, serving cached bytes toward the socket without an intermediate copy. How it does that, streaming large hits with kernel sendfile straight from the shared cache file, copying small ones because the syscall would cost more, and staying correct while the cache changes underneath a slow client, is its own deep-dive: zero-copy serving between nginx and the worker. If you run a file-cache configuration under real concurrency or eviction pressure, the update is worth taking.
A fair boundary on the claim: this is a cache benchmark, not an end-to-end page-serving benchmark. It shows the cache component is faster, not that a page renders some fixed percentage sooner. On most requests, origin fetch and optimization work dominate the clock. The cache win reaches the visitor precisely when a server is busy, with many concurrent hits, a large working set, and eviction running hot. That is also when a server most needs the help.
Frequently asked questions
Is Cyclone faster than the old file-per-entry cache? Under concurrency, eviction pressure, and on realistic server storage, yes: 4x to 6x on throughput in our tests, with a tighter latency tail. On an idle, over-provisioned cache backed by a near-RAM SSD, the two are close and the file cache can edge ahead.
Does mod_pagespeed 1.15 benefit from Cyclone? Yes. 1.15 already ships Cyclone and reads from it zero-copy: a cache hit is a pointer into the mapped file rather than a system call plus a heap copy. That read path is exactly what this benchmark measures.
Will a faster cache make my site load faster? It helps most when the cache is on the hot path: high concurrency, lots of already-optimized assets, and cache eviction under load. On a lightly loaded site the cache was rarely the bottleneck, so the end-to-end gain is smaller.
How large should I size the cache? Size it to fit comfortably in RAM. A memory-mapped cache is fast while the mapped file stays resident; push it far past physical memory and it pages to disk. Fitting RAM avoids that entirely.
The machines
Two deliberately different machines. The specs that matter for a cache benchmark are core count (concurrency), memory (the RAM tier and mapping headroom), and above all the storage stack and filesystem, which set the per-file system-call cost.
- Laptop. Apple-silicon, 10 cores, near-RAM APFS SSD. A very fast core and an SSD close to RAM speed, but few cores. This is where per-file cost is cheapest, so it is the conservative case for every filesystem claim here.
- Workstation. 64-core x86, 96 GB, ext4 on virtualized storage. Many threads draw the clean concurrency curve, and the virtualized storage adds the kind of filesystem-metadata overhead a production server actually pays.
Explore the numbers yourself, with per-machine toggles and the full latency distribution, in the data explorer.
In memory of Alan M. Carroll. Some of Cyclone’s design grew out of conversations with Alan, known to the Apache Traffic Server community as SolidWallOfCode and among the people who understood its architecture most deeply. His thinking about how a cache lays bytes out on disk and reclaims space without stalling helped shape how we thought about ours, and he shared it generously. We remember him with gratitude. Alan’s memorial at the Apache Software Foundation.