The Agentic Web Hits Your Origin, Not a Proxy
Status: the capabilities below sit at different stages. C2PA provenance preservation is available now in mod_pagespeed 1.15. Web Bot Auth verification is an experimental preview. Pay-per-crawl enforcement and agent markdown are upcoming, opt-in layers. Maturity is noted per feature.
The agentic web changed your traffic
Look at your access logs for the last quarter. A growing share of those requests does not come from browsers. They come from AI crawlers building training corpora, retrieval bots fetching pages to answer a user’s question right now, and autonomous agents acting on someone’s behalf. This is the agentic web arriving at your origin. Multiple CDNs report that automated traffic is a large and rising share of what they front. Your own numbers will vary, but the direction is the same.
This is not the old SEO crawler problem. A search bot fetched your page, indexed it, and sent you a human later. The new clients often consume your content and never send anyone. They cost you bandwidth and compute, and from the request headers alone you cannot tell which one you are talking to.
Four problems show up at the origin. Each has a partial answer somewhere in the stack today. The question is where you solve it, and the answer that runs through all four is the same: on your own servers, not on a third party’s proxy.
Four problems, one address
1. Identity: the User-Agent string is unverifiable
A request arrives with User-Agent: GPTBot. Is it GPTBot? The User-Agent string is a free-text field. Anyone can send any value. Reverse-DNS and published-IP-range checks help for the major crawlers, but they cover only the crawlers that publish that data, and they lag whenever a range changes.
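To make the reverse-DNS check concrete, here is a minimal sketch of a forward-confirmed reverse-DNS lookup in Python. The hostname suffix `.crawler.example` and the function name are hypothetical; a real check would use the suffixes a crawler operator documents. The resolvers are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, Set


def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> Set[str]:
    # A/AAAA lookup: hostname -> set of IP addresses.
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def forward_confirmed_rdns(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], Set[str]] = _forward,
) -> bool:
    """Return True only if the PTR name has an allowed suffix AND resolves back to ip."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    # Suffixes should start with "." so "evilcrawler.example" cannot match.
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The forward confirmation matters because whoever controls an IP block also controls its PTR records. A matching hostname alone proves nothing until that hostname resolves back to the same address.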
The emerging answer is cryptographic. Web Bot Auth, built on RFC 9421 HTTP Message Signatures, lets a bot sign its requests with an Ed25519 key whose public half is published. The origin verifies the signature and knows the request came from the holder of that key, not from someone copying a header. That is identity you can act on, not a string you have to trust.
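What gets signed is not the raw request but a canonical string that RFC 9421 calls the signature base. The sketch below rebuilds that string from the covered components and the `Signature-Input` member value. It is a simplified illustration, not the mod_pagespeed verifier: it ignores component parameters, and the key id, domain, and timestamp in the usage example are made up.

```python
import re
from typing import Dict


def build_signature_base(signature_params: str, component_values: Dict[str, str]) -> str:
    """Rebuild an RFC 9421 signature base (simplified).

    signature_params is the member value from Signature-Input, for example
    '("@authority" "signature-agent");created=...;keyid="..."'.
    component_values maps each covered component name to its value.
    """
    inner = re.match(r"\(([^)]*)\)", signature_params)
    if inner is None:
        raise ValueError("Signature-Input value must start with an inner list")
    # Covered component names, in the order the signer listed them.
    names = re.findall(r'"([^"]+)"', inner.group(1))
    lines = [f'"{name}": {component_values[name]}' for name in names]
    # The final line repeats the parameters exactly as they were sent.
    lines.append(f'"@signature-params": {signature_params}')
    return "\n".join(lines)


# Illustrative values only.
params = '("@authority" "signature-agent");created=1735689600;keyid="test-key";tag="web-bot-auth"'
base = build_signature_base(
    params,
    {"@authority": "example.com", "signature-agent": '"https://bot.example"'},
)
```

An Ed25519 verification call from a cryptography library would then check the decoded `Signature` header bytes against `base.encode("ascii")` using the bot's published public key. That step is left out here to keep the sketch free of third-party dependencies.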
mod_pagespeed 1.15 includes a pre-release, observe-only Web Bot Auth verifier, off by default. When you enable it, it checks the signature and classifies each request into an nginx variable you can read in your config and your logs. It does not block. It tells you which requests carry a valid signature and who signed them, so you can build policy on real data before you enforce anything.
The signature format, the key directory, and the classification variable are covered in Verify AI crawlers with Web Bot Auth and RFC 9421.
2. Economics: metering crawl access at the origin
Once you can name a bot, the next question is whether it should be here for free. A bot that pulls ten thousand pages to train a model uses your infrastructure differently than a human reading one article. “Pay per crawl” has moved from a talking point to something operators want to enforce, and the origin is where you already meter the bytes.
mod_pagespeed 1.15 ships RSL-CAP, a capability-token check that runs at the origin. It reads an Authorization: License token, validates it, and refuses access unless the token carries the right license and scope. It turns “I can see this bot” into “this bot needs the right credential to proceed.”
Be clear-eyed about its status. RSL-CAP is off by default, and it is an opt-in commercial / early-access layer for operators ready to run the token side, not a feature that lights up on a stock install. The mechanism and the failure modes are in Pay per crawl at the origin. For the exact responses a request gets with and without a valid token, read the shipped smoke test rather than trusting a status code quoted from memory.
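The general shape of a capability check can be sketched without reference to RSL-CAP internals. The Python below parses an `Authorization: License <token>` header and looks the token up in an in-memory table. Everything beyond the header scheme is an assumption made for illustration: the `Grant` type, the lookup table, and the license and scope names are hypothetical, and this is not RSL-CAP's token format or validation logic. It deliberately returns a boolean rather than a status code.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Grant:
    # Hypothetical record of what a token is allowed to do.
    license_id: str
    scopes: FrozenSet[str]


def extract_license_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from 'License <token>', or None if the header does not fit."""
    if not authorization:
        return None
    scheme, _, token = authorization.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "license" or not token:
        return None
    return token


def is_authorized(
    authorization: Optional[str],
    required_license: str,
    required_scope: str,
    grants: Dict[str, Grant],
) -> bool:
    """Allow only when the token is known and carries the required license and scope."""
    token = extract_license_token(authorization)
    if token is None:
        return False
    grant = grants.get(token)
    return (
        grant is not None
        and grant.license_id == required_license
        and required_scope in grant.scopes
    )
```

The design point is that a missing header, a wrong scheme, an unknown token, and an insufficient scope all fail closed through the same path.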
3. What to serve: agents waste tokens on your markup
An AI agent does not want your navigation or your nested <div> wrappers. It wants the content. Serve a retrieval bot a full HTML page and it spends tokens parsing markup it will throw away, and it parses HTML poorly enough that it sometimes throws away the content too.
The fix is to serve agents something they can read: clean markdown, plus an /llms.txt index that points them at it. ModPageSpeed 2.0 agent_optimize does this through content negotiation. When a client signals it wants the agent-friendly variant, the origin serves markdown synthesized from your existing pages. There is no separate content pipeline to maintain.
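Content negotiation here means reading the request's `Accept` header and choosing a representation. As a rough sketch of that decision, the Python below parses media ranges with their q-values and picks markdown only when the client explicitly lists `text/markdown` at least as high as `text/html`. The tie-breaking rule is an assumption made for this example; the rules agent_optimize actually applies are in the linked post.

```python
from typing import Dict


def parse_accept(header: str) -> Dict[str, float]:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    prefs: Dict[str, float] = {}
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        prefs[media] = q
    return prefs


def wants_markdown(header: str) -> bool:
    """True only for an explicit text/markdown preference; wildcards do not count."""
    prefs = parse_accept(header)
    markdown_q = prefs.get("text/markdown", 0.0)
    html_q = prefs.get("text/html", 0.0)
    return markdown_q > 0 and markdown_q >= html_q
```

Ignoring `*/*` keeps ordinary browsers on HTML. Whatever rule you use, a response that varies by `Accept` should also send `Vary: Accept` so shared caches do not hand the markdown variant to a browser.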
This is also off by default and a paid layer. Turn it on when your content is the product and your agent traffic is real. The generation rules, the negotiation header, and the shape of the synthesized /llms.txt are in Serve clean Markdown to AI agents.
4. Provenance: image optimization usually destroys it
C2PA Content Credentials attach signed provenance to an image: where it came from and what edited it, including whether a tool generated it with AI. Newsrooms, stock libraries, and camera makers are adopting them because the agentic web makes “is this image real?” a question buyers ask. Then the image hits your optimizer.
Most image pipelines strip everything that is not pixels. They re-encode to WebP or AVIF, drop the metadata, and hand back a smaller file with its provenance gone. You optimized the image and quietly broke the chain of custody you meant to keep.
mod_pagespeed 1.15 is built to carry C2PA / Content Credentials through image optimization by default rather than strip them. It also adds an opt-in carry-through for cases where the credential must follow the image across a transform. This is the one feature in this set that is on by default, because the safe behavior, not silently destroying provenance, should not require a flag. Confirm it on your own assets. What is preserved, and the carry-through option, are in Preserve Content Credentials and C2PA through image optimization.
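One way to confirm it on your own assets is to check whether a file still carries an embedded manifest before and after optimization. In JPEG files, C2PA manifests are stored as JUMBF boxes inside APP11 segments, so a presence check can walk the segment headers. The Python below is a heuristic sketch under that assumption: it covers JPEG only, does not validate signatures, and does not look inside WebP or AVIF output, which embed manifests differently. Use a real C2PA validator for anything beyond a smoke check.

```python
import struct

APP11 = 0xEB  # JPEG marker that carries JUMBF boxes
SOS = 0xDA    # start of scan: compressed image data follows
EOI = 0xD9    # end of image
STANDALONE = {0x01} | set(range(0xD0, 0xD8))  # markers with no length field


def jpeg_has_c2pa_hint(data: bytes) -> bool:
    """Heuristic: True if an APP11 JUMBF segment mentions the 'c2pa' label."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with SOI
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # not positioned on a marker: malformed input
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in (SOS, EOI):
            break  # metadata segments come before the scan data
        if marker in STANDALONE:
            pos += 2
            continue
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        if length < 2:
            return False
        payload = data[pos + 4 : pos + 2 + length]
        # JUMBF-in-APP11 payloads start with the identifier "JP".
        if marker == APP11 and payload[:2] == b"JP" and b"c2pa" in payload:
            return True
        pos += 2 + length
    return False
```

Run it against the original and the optimized response for the same image. If the original reports `True` and a JPEG output reports `False`, the credential did not survive that path.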
Why the origin, and why your servers
You could answer some of these at a third-party proxy. Several CDNs now offer bot verification and pay-per-crawl as managed products, and for some operators the managed path is the right call.
But notice what the proxy model asks of you. To verify a bot’s signature, it terminates your TLS and reads your requests. To gate crawl access, it sits between you and your clients and decides who proceeds. To serve agent-friendly content, it rewrites your responses on its own boxes. Every one of these moves a decision about your content and your traffic onto infrastructure you do not control, governed by a contract you renegotiate.
ModPageSpeed runs these checks at your origin, on your own servers. The nginx interceptor verifies the signature and checks the crawl token before your origin spends a byte. Your worker generates the markdown from your pages. The same optimizer that was already touching the image carries the provenance through rather than stripping it. No request leaves your perimeter to ask a vendor whether it should be allowed. The policy lives where the content lives.
It also means you keep the escape hatch. Each feature is independently configurable with its own default, so you can enable one without the others and test on staging first. The spoke posts and docs carry the exact directive and how to disable it if it does the wrong thing on your inputs. Observe-mode verification and default-on provenance preservation are the low-risk entry points. The gating and the markdown layers are opt-in for a reason.
What to do this quarter
Do not flip four switches at once. The honest sequence:
- Measure. Turn on the Web Bot Auth verifier in observe mode. Read the classification variable in your logs for a few weeks. Now you know your real agent mix instead of guessing from User-Agent strings.
- Protect provenance. If you serve images that carry Content Credentials, confirm the default-on C2PA preservation is doing what you expect on your assets.
- Decide on economics and content. Pay-per-crawl gating and agent markdown are paid, opt-in layers. Whether they are worth it depends on how much agent traffic you have and whether your content is the thing being consumed. The observe-mode data from the first step makes that an informed decision rather than a hunch.
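For the measuring step, a short script is enough to turn logged classifications into an agent mix. The sketch below assumes you have added the classification variable to a custom nginx `log_format` as a field written `verified_bot="$x_verified_bot"`. That field name and layout are an assumption for this example, and the label in the usage comment is invented, since the values the verifier emits are documented in the Web Bot Auth post.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical log field: verified_bot="<value>"
FIELD = re.compile(r'verified_bot="([^"]*)"')


def tally_verified_bots(lines: Iterable[str]) -> Counter:
    """Count requests per classification value found in access-log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = FIELD.search(line)
        value = match.group(1) if match else ""
        # nginx writes "-" for an empty variable; group it with missing fields.
        counts[value if value not in ("", "-") else "unlabeled"] += 1
    return counts


# Usage sketch (the path is a placeholder):
# with open("/var/log/nginx/access.log") as log:
#     for label, n in tally_verified_bots(log).most_common():
#         print(f"{n:8d}  {label}")
```

A few weeks of these counts, compared against what the User-Agent strings claim, shows how much of your declared bot traffic can actually prove who it is.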
The agentic web is not a future you prepare for. It is in your access logs today. The choice is whether you meet it at your own origin, with levers you control, or hand the decision to someone else.
Start with the identity layer: the verifier is pre-release, observe-only, and default-off, and the data it gives you informs everything else. See how the Web Bot Auth verifier classifies each request.
Read next
- Pay-Per-Crawl: An Origin Enforcement Gate (Early Access). mod_pagespeed 1.15 ships the RSL-CAP enforcement primitive for nginx (default-off, early-access): gate or refuse unauthorized crawl access at your origin.
- Serve markdown to AI agents from your own origin. ModPageSpeed 2.0 can serve a markdown variant of a page on Accept text/markdown and synthesize an /llms.txt index. Opt-in paid layer, off by default.
- Web Bot Auth: Verify AI Crawlers at Your Origin. mod_pagespeed 1.15 ships an nginx Web Bot Auth verifier that checks RFC 9421 signatures and labels each request in $x_verified_bot. Observe-only, default off.