marco@altran ~/blog $ cat crawling-100m-pages-at-kagi.md

Crawling and indexing 100M+ pages at Kagi

I joined Kagi's index team in mid-2025. Kagi is a privacy-first, paid search engine. My team's job is to crawl the web, build the index, and keep search results fresh. This post covers the architecture I helped build to crawl and index over 100 million pages per month, the constraints that shaped it, and the trade-offs I made along the way.

The constraints

Building a search index is not just a scale problem. The hard parts are all in the constraints.

Architecture overview

The system has five main components:

Crawler API + Workers (domain jobs, realtime) → Storage + Kafka (GCS, Postgres) → Indexer Workers (embeddings, parsers) → Storage + Kafka (GCS, Postgres) → Vespa (hybrid search)

Crawling the web from scratch is slow and expensive. To reach 100M+ pages per month without burning through infrastructure budget, we supplement live crawling with two external datasets: Common Crawl, which publishes monthly snapshots of billions of web pages, and Wikipedia Enterprise, which provides structured access to the full English Wikipedia. Common Crawl lets us bootstrap coverage for domains we have not crawled ourselves — pulling pages from their latest snapshot is orders of magnitude cheaper than fetching them live. Wikipedia is critical for search quality and does not need a crawler at all; it is a known, structured dataset that we can ingest directly through Wikipedia Enterprise. All three sources — live crawls, Common Crawl, and Wikipedia — feed into the same indexing pipeline. We plan to continue adding new sources over time.

The Crawler API is a FastAPI service that manages domain crawl jobs, realtime crawl requests, and queue status. The Crawler Workers handle background tasks: Common Crawl batch ingestion, Wikipedia Enterprise snapshots, link extraction, and domain-info enrichment. Frontera is the scheduled crawling engine, built on Scrapy's Frontera framework with a SQL backend and multiple Kafka topics. Indexer Workers are Kafka consumers that parse HTML, generate embeddings, and feed documents to Vespa. Vespa serves the search index with hybrid search.

Every component runs on Kubernetes as its own pod. We went all in on distributed systems from the start — there is no monolith hiding behind the diagram. But we do like to keep all the code in a monorepo. We manage all deployments with Helm charts. We tried Kustomize first and found it painful to design pod configs for a system with this many moving parts: partition-pinned pods, KEDA triggers, per-environment overrides, and sharded deployments all fighting each other in overlays. Helm charts were a breeze by comparison. It also helps that frontier LLMs are significantly better at writing Helm templates than Kustomize patches, which made the migration faster than expected.

The crawling side of the system looks like this:

The Crawler API (domain jobs) and Workers (Common Crawl, Wikipedia, link extraction) feed the SQL Backend (queue, URL states). Inside Frontera: Batchgen drains the queue → frontier-todo → Spider (Scrapy + Frontera) → frontier-done → Strategy (crawl intelligence) → frontier-score → DB Worker (incoming + scoring), which writes queue rows back to the SQL Backend for Batchgen to drain on the next cycle.

The Frontera crawling pipeline

Frontera is the heart of scheduled crawling. It is built on Scrapy's Frontera framework, which provides the partition-based architecture for distributed crawling: a SQL backend for URL state, Kafka topics for inter-pod communication, and a strategy worker pattern for crawl intelligence. The Frontera cluster setup docs cover the general architecture well. Here I will focus on how we configured and extended it for our use case.

Bootstrap from Common Crawl, not seed URLs

When a domain crawl job starts, the Crawler Worker does not begin by fetching pages from the live web. Instead, it first pulls all available pages for the domain from the latest Common Crawl snapshot, extracts href links, and seeds unvisited URLs into the Frontera SQL backend. This bootstrap-from-CC strategy means we start every job with coverage rather than starting from a single seed URL. The worker tracks progress via a progress-spider-log consumer group and marks the job complete when pages_crawled ≥ max_pages.
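The link-extraction step of that bootstrap can be sketched roughly like this — a minimal version of pulling hrefs out of a fetched page and keeping only unvisited absolute URLs. The function names and the in-memory seen set are illustrative stand-ins for the real worker code and the Frontera SQL state:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class HrefExtractor(HTMLParser):
    """Collect href attribute values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_seed_urls(base_url, html, already_seen):
    """Resolve relative links, drop fragments, and keep only unvisited URLs."""
    parser = HrefExtractor()
    parser.feed(html)
    seeds = []
    for href in parser.hrefs:
        url, _ = urldefrag(urljoin(base_url, href))
        if url.startswith("http") and url not in already_seen:
            already_seen.add(url)
            seeds.append(url)
    return seeds
```

The real pipeline does this per WARC record from the Common Crawl snapshot and writes the resulting seeds into the Frontera SQL backend instead of a Python set.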

Our production configuration

Each Frontera pod — spider, strategy worker, DB worker — is pinned to exactly one partition via FRONTERA_SPIDER_PARTITION_ID, with maxReplicaCount=1. Batchgen is the exception: we shard it across two Helm deployments (frontera-batchgen-even and frontera-batchgen-odd) to drain millions of queue rows per hour. All partition-pinned pods scale to zero via KEDA watching per-partition Kafka lag. Those are all tiny pods with minimal CPU and memory requests, so we can have lots of them. If we want to increase throughput, we simply increase the number of partitions, which automatically creates more pods.

Custom strategy: KagiDiscoveryStrategy

The strategy worker runs KagiDiscoveryStrategy, which is where we encode judgment about which pages are worth fetching. It applies per-domain caps and buffer limits, decides which discovered links to schedule, and emits scoring-log updates for the DB worker to materialize into queue rows. This is the component that makes the difference between "crawl everything" and "crawl what matters."

The pain points of scaling

Things start to break as you add more partitions, and thus more pods. The first thing we needed was PgBouncer, to avoid blowing through the Postgres connection limit. External systems such as Postgres, Redis, and Kafka get hammered harder the more partitions we run, and if one of them fails or becomes slow, the whole pipeline stalls. Postgres is the most common bottleneck: we use it as the source of truth, and multiple services depend on it, so it absorbs a high volume of queries. We had to optimize our schema and queries, add indexes, and increase the instance size to keep up with the load. We also added caching layers where possible to reduce the number of queries hitting Postgres directly. I am still learning how best to optimize Postgres. The system runs for days at full speed and everything is fine, but then a new cron job kicks in, adds extra database load, and the entire instance suffers. We have had far too many incidents, and migrating to another database is still on the table.

Deduplication with Redis Bloom filters

At this scale, you see the same URLs constantly. Without deduplication, the crawler wastes most of its budget re-fetching pages it already has. We use Redis Bloom filters for fast, memory-efficient URL deduplication.
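A minimal sketch of what the dedup check looks like against RedisBloom's BF.* commands — the key name, capacity, and error rate here are illustrative, not our production values:

```python
class UrlDeduper:
    """URL dedup on top of RedisBloom's BF.* commands. `client` is any object
    exposing execute_command (e.g. redis-py). Key name, capacity, and error
    rate are illustrative."""

    def __init__(self, client, key="crawler:seen-urls",
                 capacity=100_000_000, error_rate=0.001):
        self.client = client
        self.key = key
        try:
            # EXPANSION 2: let the filter double in size if capacity is exceeded.
            client.execute_command("BF.RESERVE", key, error_rate, capacity,
                                   "EXPANSION", 2)
        except Exception:
            pass  # filter already exists

    def is_new(self, url):
        # BF.ADD returns 1 if the item was not present, 0 if (probably) seen.
        # Bloom filters have false positives but no false negatives, so a
        # "new" verdict is always safe to act on.
        return bool(self.client.execute_command("BF.ADD", self.key, url))
```

The crawler only schedules a fetch when is_new returns True; the false-positive rate means we occasionally skip a genuinely new URL, which is an acceptable trade for never re-fetching a known one.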

Three data sources, one pipeline

Not every page comes from live crawling. We ingest from three sources, all feeding into the same indexing pipeline:

Live crawling (Frontera)

The scheduled pipeline described above. This is how we crawl domains on our own terms, following links, respecting politeness, and applying our own crawl strategy.

Common Crawl

Common Crawl publishes monthly web snapshots. We query their CDX index or Athena for specific domains, fetch WARC records, process them through the same storage and indexing pipeline, and track everything in the same crawled_content table. This lets us bootstrap coverage for domains we haven't crawled ourselves, at a fraction of the cost of live-fetching.
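Querying the CDX index for a domain is essentially building one URL and parsing newline-delimited JSON. A rough sketch — the collection id is an example, and the actual HTTP fetch is left to the caller:

```python
import json
from urllib.parse import urlencode

# Collection ids look like "CC-MAIN-2025-20"; the one used below is an example.
CDX_ENDPOINT = "https://index.commoncrawl.org/{collection}-index"

def cdx_query_url(collection, domain, limit=1000):
    """Build a CDX API query listing captures for every path under a domain."""
    params = urlencode({
        "url": f"{domain}/*",   # wildcard: all captures under the domain
        "output": "json",       # one JSON object per line
        "limit": limit,
    })
    return CDX_ENDPOINT.format(collection=collection) + "?" + params

def parse_cdx_response(body):
    """Parse NDJSON records into (url, warc filename, offset, length) tuples.
    Offset and length locate the WARC record inside the snapshot file, so we
    can range-request exactly the bytes we need."""
    records = []
    for line in body.strip().splitlines():
        rec = json.loads(line)
        records.append((rec["url"], rec["filename"],
                        int(rec["offset"]), int(rec["length"])))
    return records
```

Each (filename, offset, length) triple becomes a byte-range fetch against Common Crawl's public bucket, which is the cheap part compared to a live crawl.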

Wikipedia Enterprise

Wikipedia is critical for search quality. We ingest full Wikipedia snapshots via the Enterprise API, streaming NDJSON data through 100 concurrent download workers. Wikipedia jobs skip Frontera entirely — there is no need for link discovery or crawl scheduling when you are ingesting a known dataset.

The indexing pipeline

Every crawled page, regardless of source, ends up as a Kafka message on the indexer topic. Indexer workers consume these messages, parse the HTML, generate embeddings, and feed the resulting documents to Vespa.

The Vespa schema supports hybrid search, combining text matching and vector similarity. Documents carry quality, authority, and popularity scores alongside the text and embeddings. Multiple ranking profiles handle different query types.
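A stripped-down sketch of what such a schema can look like — the field names, embedding dimension, and ranking expression are illustrative, not our actual schema:

```
schema page {
    document page {
        field title type string {
            indexing: index | summary
        }
        field body type string {
            indexing: index | summary
        }
        field quality type float {
            indexing: attribute
        }
        field embedding type tensor<float>(x[768]) {
            indexing: attribute | index
            attribute {
                distance-metric: angular
            }
        }
    }
    fieldset default {
        fields: title, body
    }
    rank-profile hybrid inherits default {
        inputs {
            query(q) tensor<float>(x[768])
        }
        first-phase {
            expression: bm25(title) + bm25(body) + closeness(field, embedding) + attribute(quality)
        }
    }
}
```

The shape is the point: text fields get BM25, the embedding field gets an ANN index, and document-level scores like quality sit in attributes so the rank profile can blend all three signals in one expression.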

Vespa Cluster

Vespa is the serving layer for the whole index, so I designed it as a real cluster rather than a single search node. In production, the query tier runs on three container nodes, and the content tier is split across three distribution groups spread across three GCP zones. Each group has two content nodes, for six content nodes total. The cluster is configured with redundancy=3, min-redundancy=2, and searchable-copies=3, which means documents are replicated across the cluster and remain searchable even if a node or an entire group is unavailable.

That topology gives me both availability and predictable scaling. If I need more index capacity or more backend search and feed throughput, I add content nodes. More content nodes means more RAM for the index, more disk for documents, and more search work distributed across the content layer. If I need more front-door query concurrency, I add more container nodes instead. Vespa separates those concerns cleanly: content nodes scale storage and backend execution, while container nodes scale query serving.

The cluster is also tuned to stay useful under failure, not just healthy on paper. Query timeout is set to one second, minimum coverage is 90%, and dispatch uses the adaptive policy with availability prioritized. In practice, that means the system prefers returning good results quickly from the healthy parts of the cluster instead of turning a partial outage into a full outage. That trade-off matters more to users than theoretical perfection during an incident.

Operationally, the expensive part is the content layer. Each content node runs with 7 CPU, 56 GiB of RAM, and 5 TiB of premium storage. Those are the machines that actually hold the index, so they dominate both cost and warm-up time. By comparison, the query-serving nodes are much lighter. That split is intentional: most of the money goes into the content tier, because that is where storage capacity, search capacity, and indexing throughput actually live.

Scalability and throughput

The scaling primitive for Frontera is the partition. In production, partitionCount=50 means up to 50 spiders, 50 strategy workers, and 50 DB worker pods — all with scale-to-zero via KEDA. Each partition-pinned pod has replica ≤ 1, so you never have two consumers fighting over the same partition.
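The per-partition scaling setup is roughly one KEDA ScaledObject per partition-pinned deployment. This sketch uses illustrative names and thresholds, but the shape — minReplicaCount 0, maxReplicaCount 1, lag measured on a single partition — is the important part:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: frontera-spider-7        # one ScaledObject per partition (example name)
spec:
  scaleTargetRef:
    name: frontera-spider-7      # deployment pinned via FRONTERA_SPIDER_PARTITION_ID=7
  minReplicaCount: 0             # scale to zero when the partition is idle
  maxReplicaCount: 1             # never two consumers on the same partition
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092        # example address
        consumerGroup: frontera-spiders
        topic: frontier-todo
        partitionLimitation: "7"            # only this partition's lag counts
        lagThreshold: "100"                 # example threshold
```

With this shape, adding partitions is mechanical: each new partition gets its own pinned deployment and ScaledObject, and KEDA wakes each pod only when its own slice of the topic has work.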

In production, we sustain 3–5 million pages crawled per day across 50 Frontera partitions, and a similar 3–5 million pages indexed per day with up to 200 indexer pods. That puts the system comfortably above the 100M pages/month mark when running at full capacity.

The throughput mental model is simple: the partition count sets the ceiling, because each partition carries its own spider, strategy worker, and DB worker.

The batchgen shards even/odd across two pods and drains millions of queue rows per hour. The key levers are partitionCount, KEDA triggers on per-partition Kafka lag, and the batchgen even/odd sharding.

Non-Frontera components scale differently. The Search API and Crawler Workers use standard HPA (2–10 and 2–15 replicas). Indexer workers scale from 1 to 200 replicas based on CPU and memory thresholds.

All pods except the API services and Vespa run on preemptible (spot) nodes, which saves up to 90% on compute costs. Since crawl workers, indexer workers, and the Frontera pods are stateless services that tolerate restarts gracefully, this was a straightforward change — Kubernetes reschedules preempted pods within seconds, and Kafka consumers simply resume from their last committed offset.

The operational signals to watch are Kafka lag per topic and partition (frontier-todo, frontier-done, frontier-score) and the DB queue growth rate versus the batchgen drain rate.

Observability

You cannot operate a system like this without knowing what it is doing. Every component exposes Prometheus metrics on dedicated ports. The key metrics are Kafka consumer lag per topic and partition, the Postgres queue growth rate versus the batchgen drain rate, and crawl and indexing throughput per component.

What broke

A few things that bit us and shaped the architecture:

Database pressure from Frontera scoring. The scoring consumer writes batches to PostgreSQL's queue table. Early on, large batch sizes caused lock contention and slowed down the entire pipeline. The fix was adaptive batching — start small, grow when writes are fast, shrink when the database pushes back. A target of 1-second processing time per batch, with growth increments of 64 and a 0.5x shrink factor, keeps the database happy.
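The adaptive batching logic is small enough to sketch. The target time, growth increment, and shrink factor match the numbers above; the size bounds are illustrative:

```python
class AdaptiveBatcher:
    """Adjust batch size toward a target per-batch processing time.
    Target 1s, grow by 64, shrink by 0.5x, as described in the post;
    the min/max bounds are illustrative."""

    def __init__(self, initial=64, minimum=16, maximum=4096,
                 target_seconds=1.0, grow_by=64, shrink_factor=0.5):
        self.size = initial
        self.minimum = minimum
        self.maximum = maximum
        self.target_seconds = target_seconds
        self.grow_by = grow_by
        self.shrink_factor = shrink_factor

    def record(self, elapsed_seconds):
        """Call after each batch write with its wall-clock duration;
        returns the size to use for the next batch."""
        if elapsed_seconds < self.target_seconds:
            # Database is keeping up: grow additively.
            self.size = min(self.size + self.grow_by, self.maximum)
        else:
            # Database is pushing back: shrink multiplicatively.
            self.size = max(int(self.size * self.shrink_factor), self.minimum)
        return self.size
```

Additive increase with multiplicative decrease means the batch size backs off fast when Postgres slows down but only creeps back up, which is exactly the asymmetry you want when the failure mode is lock contention.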

Writing to Vespa directly from indexer pods. This was fine when the fleet was small, but it broke once production scaled past 100 indexer pods. Vespa can stall during ingestion while it flushes, sometimes for several minutes, and when every indexer pod was feeding Vespa directly that pause turned into fleet-wide backpressure: throughput dropped, timeouts climbed, Kafka lag spread across the whole consumer group, and expensive indexer workers sat idle waiting on Vespa when they could have been processing embeddings instead. The fix was to decouple ingestion — indexers now write compressed document payloads to shared storage and publish lightweight pointers to a vespa-feed topic, while a small vespa-writer deployment owns the actual feed into Vespa. That isolated Vespa slowdowns to a much smaller set of workers instead of letting them stall the whole indexer fleet.
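A rough sketch of the indexer side of that decoupling — compress the payload, store it, publish a small pointer. The bucket name, path layout, and message fields here are illustrative, not our exact production format:

```python
import gzip
import hashlib
import json

def make_feed_pointer(doc_id, payload, bucket="index-payloads"):
    """Compress a document payload for shared storage and build the
    lightweight pointer message that goes on the vespa-feed topic.
    Bucket, path layout, and field names are illustrative."""
    blob = gzip.compress(json.dumps(payload).encode("utf-8"))
    pointer = {
        "doc_id": doc_id,
        "path": f"gs://{bucket}/{doc_id}.json.gz",  # where vespa-writer fetches it
        "sha256": hashlib.sha256(blob).hexdigest(), # lets the writer detect corrupt blobs
        "bytes": len(blob),
    }
    return blob, pointer
```

The indexer uploads the blob and publishes only the pointer, so a Vespa flush stall backs up the small vespa-writer deployment rather than hundreds of embedding workers.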

Bloom filter capacity misses. When a crawl job discovered far more URLs than expected, the Bloom filter expanded repeatedly, eating Redis memory. Adding the 2x expansion factor and 30-day TTL was the fix, but we also had to add BF.INFO monitoring to catch filters that were growing out of control before they impacted other tenants.

Wikipedia ingestion timeouts. Streaming hundreds of thousands of articles from the Enterprise API is not a fast operation. Network hiccups would kill the download mid-stream. The fix was chunked downloading (10MB chunks), up to 5 retries with 30-second delays, and progress logging every 100 articles so you can see where it stalled.
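The retry loop has roughly this shape — a sketch with an injectable stream opener and sink, since the real code wraps the Enterprise API client; this simplified version restarts from scratch on failure:

```python
import time

def download_stream(open_stream, sink, chunk_size=10 * 1024 * 1024,
                    max_retries=5, retry_delay=30, log_every=100):
    """Drain a chunked stream with retries. open_stream(chunk_size) returns an
    iterator of byte chunks (e.g. a requests response's iter_content); sink(chunk)
    persists them. Restarts from the beginning on failure — a simplification of
    the real downloader. Returns the number of chunks written."""
    for attempt in range(1, max_retries + 1):
        try:
            count = 0
            for chunk in open_stream(chunk_size):
                sink(chunk)
                count += 1
                if count % log_every == 0:
                    # Progress logging so a stall is visible mid-download.
                    print(f"wrote {count} chunks")
            return count
        except OSError as exc:
            if attempt == max_retries:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {retry_delay}s")
            time.sleep(retry_delay)
```

The periodic progress line is the underrated part: when a 100GB stream dies, knowing whether it stalled at chunk 3 or chunk 9,000 tells you whether to blame the network or the API.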

These are just a few examples. Every time we scale the system further — more partitions, more crawler workers, higher document throughput — something unexpected breaks. The architecture is shaped as much by these failures as by any upfront design.

What I intentionally did not optimize

We use Google Cloud Storage for everything — crawled HTML files, raw blobs, intermediate artifacts. GCS is extremely easy to use and scales without any capacity planning. The trade-off is cost: it is significantly more expensive than local SSD storage, and the bill grows linearly with the corpus. So far the simplicity and reliability are worth the premium, but this is the kind of decision that gets revisited as the index grows.


If you are building something similar or want to discuss trade-offs in search infrastructure, I would enjoy hearing from you. I am reachable on LinkedIn.