# Onto-Reader, robots.txt, and the cooperative agent web · Onto Blog
> ScrapingBee bypasses blocks. We honor them. A look at why Onto declares itself, respects robots.txt, and bets on a cooperative agent web — and why that's the only durable path.

**Source:** /blog/onto-reader-and-the-cooperative-agent-web
**Extracted:** 2026-05-20T20:59:17.688Z

---
There are two ways to read the web at scale.

You can declare yourself, honor the site's preferences, and bet on a long-term equilibrium where sites and agents both have reason to participate. Or you can rotate residential proxies, bypass captchas, and treat every site as adversarial. The first path is the cooperative agent web. The second is the adversarial scrape stack that's existed for fifteen years. Both are real products. They're solving different problems.

Onto is firmly on the cooperative side. This post explains why — and why we think it's the only durable path for AI grounding specifically, even though we lose use cases as a result.

### What "cooperative" actually means

Our crawler ships under a single, declared user-agent: `Onto-Reader`. We don't rotate it. We don't spoof Googlebot or pretend to be a browser. If a site wants to inspect our traffic, they'll see consistent identification — and the UA string itself points at [our crawler policy page](/about-bot) so any admin who sees it in their logs can find out who we are.

Before any fetch, the engine checks `/robots.txt` for that host. If `Onto-Reader` or the wildcard user-agent is disallowed, the call short-circuits with the `ROBOTS_BLOCKED` error code and the outbound fetch never happens: it doesn't count against your quota, and it doesn't hit the origin. If the site has a WAF that blocks our user-agent at the edge, we return `WAF_BLOCKED` and your agent gets a clear signal to skip this URL.
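To make the pre-fetch check concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Onto's actual engine. The function names `robots_precheck` and `guarded_read`, the result dictionary shape, and the injected `fetch` callable are all assumptions made for illustration; only the `Onto-Reader` user-agent and the `ROBOTS_BLOCKED` code come from the description above.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "Onto-Reader"  # the declared user-agent named in this post


def robots_precheck(robots_txt: str, url: str,
                    user_agent: str = USER_AGENT) -> Optional[str]:
    """Return "ROBOTS_BLOCKED" if robots.txt disallows this URL, else None.

    The parser falls back to the wildcard (*) group when no group
    names the user-agent explicitly.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return "ROBOTS_BLOCKED"
    return None


def guarded_read(url: str, robots_txt: str,
                 fetch: Callable[[str], str]) -> dict:
    """Hypothetical wrapper: only call fetch() when robots.txt allows it."""
    blocked = robots_precheck(robots_txt, url)
    if blocked is not None:
        # Short-circuit: the origin is never contacted for this URL.
        return {"ok": False, "error": blocked}
    return {"ok": True, "body": fetch(url)}
```

The point of the sketch is ordering: the robots check runs first, and a disallow returns an error code without ever invoking the fetch.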

We don't try to work around either one. That's a feature, not a limitation.

### The case against bypass

For general web scraping, bypassing blocks is reasonable. The market has known rules: scrapers escalate, sites defend, both sides accept the dynamic. ScrapingBee is excellent at this job and we recommend it over us when bypass is the actual need.

For AI grounding specifically, bypass is the wrong tool. Here's why:

**1\. The legal posture is different.** An agent grounding on content it shouldn't have crawled inherits the legal risk of how it got there. We've seen this pattern with copyright suits — the question isn't just "is the model output infringing" but "was the training corpus assembled legally." If you're shipping an agent product to enterprise customers, you want clean provenance on every read.

**2\. The product economics are different.** Site owners who block AI crawlers today are often the ones who would, if the alternative were better, opt into a structured way to participate. We're building that alternative, the [Serve SDK](/serve-sdk), so site owners can serve agents pre-cleaned content with analytics, context injection, and rate-limit control. A site that installs Serve gets visibility into, and control over, how agents read it; a site that just gets scraped gets neither. The asymmetry is what makes opt-in attractive.

**3\. The relationships compound.** The agent web has "everyone's a stakeholder" dynamics. Sites, agents, model labs, and end users all interact. A vendor whose product depends on adversarial behavior accumulates enemies. A vendor whose product makes participation rewarding for every party accumulates allies. Year one of either bet looks the same. Year five does not.

### The classifier

On the Serve side, we maintain a registry of AI crawler user-agents, currently 17 declared bots (GPTBot, ClaudeBot, PerplexityBot, ChatGPT-User, Google-Extended, Bytespider, CCBot, Meta-ExternalAgent, FacebookBot, cohere-ai, Applebot-Extended, DuckAssistBot, and a handful of others). When a known agent hits a site running the Serve middleware, we route it to clean Markdown and log the event for the site owner's analytics.
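As a rough illustration of what such a registry lookup could look like, the sketch below classifies a raw `User-Agent` header by case-insensitive substring match and picks a response format. The matching strategy, the function names, and the `text/markdown` versus `text/html` routing are assumptions, not the Serve middleware's real logic. The token list contains only the twelve bots named above.

```python
from typing import Optional

# Only the bots named in this post; the real registry is described as larger.
KNOWN_AI_AGENTS = (
    "GPTBot", "ClaudeBot", "PerplexityBot", "ChatGPT-User",
    "Google-Extended", "Bytespider", "CCBot", "Meta-ExternalAgent",
    "FacebookBot", "cohere-ai", "Applebot-Extended", "DuckAssistBot",
)


def classify_agent(user_agent: str) -> Optional[str]:
    """Return the first registry token found in the header, or None."""
    ua = user_agent.lower()
    for token in KNOWN_AI_AGENTS:
        if token.lower() in ua:
            return token
    return None


def choose_content_type(user_agent: str) -> str:
    """Hypothetical routing: Markdown for known agents, HTML otherwise."""
    return "text/markdown" if classify_agent(user_agent) else "text/html"
```

Substring matching is the simplest possible choice here. A production classifier would likely need stricter rules, since any client can put a bot token in its header.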

The classifier expands over time as new AI products ship. Anything we don't classify shows up in the "Unidentified agents" panel on the analytics page — raw User-Agent strings grouped by hit count, so the site owner can see exactly what's crawling them. If a real agent shows up there, we add it to the classifier.
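The "Unidentified agents" grouping can be approximated in a few lines. This is a sketch under stated assumptions: the input is a plain list of raw `User-Agent` strings, the known-token list is abbreviated, and `unidentified_agents` is a hypothetical helper rather than anything from the Serve SDK.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Abbreviated, lowercase token list for illustration only.
KNOWN_TOKENS = ("gptbot", "claudebot", "perplexitybot")


def unidentified_agents(user_agents: Iterable[str]) -> List[Tuple[str, int]]:
    """Group unclassified User-Agent strings by hit count, highest first."""
    counts = Counter(
        ua for ua in user_agents
        if not any(token in ua.lower() for token in KNOWN_TOKENS)
    )
    return counts.most_common()
```

Each entry pairs a raw string with its hit count, which is the kind of view that lets a site owner spot a new agent worth adding to the registry.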

This is the cooperative loop made concrete. Crawlers identify themselves; sites see who's reading them; we close the feedback loop between the two. A site owner who's been wondering "is anyone reading my product docs with AI?" can find out in real time.

### What we give up

The honest tradeoff: there are use cases we can't serve. Sites that block all AI crawlers in robots.txt — we respect that, even if the content would be useful to your agent. Sites behind captcha or proxy walls — we'll return `WAF_BLOCKED` and move on. If your product fundamentally needs to read content the site has actively opted out of, we're not the right vendor.

That's an explicit choice. We'd rather lose those use cases than build the bypass machinery that comes with them.

### The bet

We're betting that the agent web settles into something closer to email than to scraping — a multi-party system with declared participants, standards (MCP, content negotiation, robots.txt extensions), and a working set of norms that nobody centrally enforces but everyone has reason to follow. We may be wrong. The adversarial side might dominate for years before the cooperative side catches up.

But the durable position is the cooperative one. Onto-Reader, the classifier, the Serve SDK, the robots.txt enforcement — all of these are designed for the world where AI agents are first-class citizens of the web, not anonymous parasites on it. That's the world we're building toward, and it's the world the rest of our product only makes sense inside.

If you're a site owner who's ready to opt in: [install the Serve SDK](/serve-sdk). If you're an AI developer who wants reads with clean provenance: [the Read API](/read-api) respects robots.txt and gives you the receipt. Either way, you're building the cooperative side.