marco@altran ~/blog $ cat crawling-100m-pages-at-kagi.md

Crawling and indexing 100M+ pages at Kagi

I joined Kagi's index team in mid-2025. Kagi is a privacy-first, paid search engine. My team's job is to crawl the web, build the index, and keep search results fresh. This post covers the architecture I helped build to crawl and index over 100 million pages per month, the constraints that shaped it, and the trade-offs I made along the way.

The constraints

Building a search index is not just a scale problem. The hard parts are all in the constraints.

Architecture overview

The system has five main components:

Crawler API + Workers (domain jobs, realtime) → Storage + Kafka (GCS, Postgres) → Indexer Workers (embeddings, parsers) → Storage + Kafka (GCS, Postgres) → Vespa (hybrid search)

Crawling the web from scratch is slow and expensive. To reach 100M+ pages per month without burning through infrastructure budget, we supplement live crawling with two external datasets: Common Crawl, which publishes monthly snapshots of billions of web pages, and Wikipedia Enterprise, which provides structured access to the full English Wikipedia. Common Crawl lets us bootstrap coverage for domains we have not crawled ourselves — pulling pages from their latest snapshot is orders of magnitude cheaper than fetching them live. Wikipedia is critical for search quality and does not need a crawler at all; it is a known, structured dataset that we can ingest directly through Wikipedia Enterprise. All three sources — live crawls, Common Crawl, and Wikipedia — feed into the same indexing pipeline. We plan to continue adding new sources over time.

The Crawler API is a FastAPI service that manages domain crawl jobs, realtime crawl requests, and queue status. The Crawler Workers handle background tasks: Common Crawl batch ingestion, Wikipedia Enterprise snapshots, link extraction, and domain-info enrichment. Frontera is the scheduled crawling engine, built on Scrapy's Frontera framework with a SQL backend and multiple Kafka topics. Indexer Workers are Kafka consumers that parse HTML, generate embeddings, and feed documents to Vespa. Vespa serves the search index with hybrid search.

Every component runs on Kubernetes as its own pod. We went all in on distributed systems from the start — there is no monolith hiding behind the diagram. But we do like to keep all the code in a monorepo. We manage all deployments with Helm charts. We tried Kustomize first and found it painful to design pod configs for a system with this many moving parts: partition-pinned pods, KEDA triggers, per-environment overrides, and sharded deployments all fighting each other in overlays. Helm charts were a breeze by comparison. It also helps that frontier LLMs are significantly better at writing Helm templates than Kustomize patches, which made the migration faster than expected.

The crawling side of the system looks like this:

The Crawler API (domain jobs) and Workers (Common Crawl, Wikipedia, link extraction) feed the SQL Backend (queue, URL states). Inside Frontera: Batchgen drains the queue → frontier-todo → Spider (Scrapy + Frontera) → frontier-done → Strategy (crawl intelligence) → frontier-score → DB Worker (incoming + scoring), which writes queue rows back to the SQL Backend for Batchgen to drain on the next cycle.

The Frontera crawling pipeline

Frontera is the heart of scheduled crawling. It is built on Scrapy's Frontera framework, which provides the partition-based architecture for distributed crawling: a SQL backend for URL state, Kafka topics for inter-pod communication, and a strategy worker pattern for crawl intelligence. The Frontera cluster setup docs cover the general architecture well. Here I will focus on how we configured and extended it for our use case.

Bootstrap from Common Crawl, not seed URLs

When a domain crawl job starts, the Crawler Worker does not begin by fetching pages from the live web. Instead, it first pulls all available pages for the domain from the latest Common Crawl snapshot, extracts href links, and seeds unvisited URLs into the Frontera SQL backend. This bootstrap-from-CC strategy means we start every job with coverage rather than starting from a single seed URL. The worker tracks progress via a progress-spider-log consumer group and marks the job complete when pages_crawled ≥ max_pages.
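The link-extraction step of that bootstrap can be sketched roughly like this — a minimal version of pulling hrefs out of a fetched page and keeping only unvisited absolute URLs. The function names and the in-memory seen set are illustrative stand-ins for the real worker code and the Frontera SQL state:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class HrefExtractor(HTMLParser):
    """Collect href attribute values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_seed_urls(base_url, html, already_seen):
    """Resolve relative links, drop fragments, and keep only unvisited URLs."""
    parser = HrefExtractor()
    parser.feed(html)
    seeds = []
    for href in parser.hrefs:
        url, _ = urldefrag(urljoin(base_url, href))
        if url.startswith("http") and url not in already_seen:
            already_seen.add(url)
            seeds.append(url)
    return seeds
```

The real pipeline does this per WARC record from the Common Crawl snapshot and writes the resulting seeds into the Frontera SQL backend instead of a Python set.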

Our production configuration

Each Frontera pod — spider, strategy worker, DB worker — is pinned to exactly one partition via FRONTERA_SPIDER_PARTITION_ID, with maxReplicaCount=1. Batchgen is the exception: we shard it across two Helm deployments (frontera-batchgen-even and frontera-batchgen-odd) to drain millions of queue rows per hour. All partition-pinned pods scale to zero via KEDA watching per-partition Kafka lag. Those are all tiny pods with minimal CPU and memory requests, so we can have lots of them. If we want to increase throughput, we simply increase the number of partitions, which automatically creates more pods.

Custom strategy: KagiDiscoveryStrategy

The strategy worker runs KagiDiscoveryStrategy, which is where we encode judgment about which pages are worth fetching. It applies per-domain caps and buffer limits, decides which discovered links to schedule, and emits scoring-log updates for the DB worker to materialize into queue rows. This is the component that makes the difference between "crawl everything" and "crawl what matters."

The pain points of scaling

Things start to break as you add more partitions, and thus more pods. The first thing we needed was PgBouncer, to avoid blowing through the Postgres connection limit. External systems such as Postgres, Redis, and Kafka get hammered harder the more partitions we run, and if one of them fails or becomes slow, the whole pipeline stalls. Postgres is the most common bottleneck: we use it as the source of truth, and multiple services depend on it, so it absorbs a high volume of queries. We had to optimize our schema and queries, add indexes, and increase the instance size to keep up with the load. We also added caching layers where possible to reduce the number of queries hitting Postgres directly. I am still learning how best to optimize Postgres. The system runs for days at full speed and everything is fine, but then a new cron job kicks in, adds extra database load, and the entire instance suffers. We have had far too many incidents, and migrating to another database is still on the table.

Deduplication with Redis Bloom filters

At this scale, you see the same URLs constantly. Without deduplication, the crawler wastes most of its budget re-fetching pages it already has. We use Redis Bloom filters for fast, memory-efficient URL deduplication.
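A minimal sketch of what the dedup check looks like against RedisBloom's BF.* commands — the key name, capacity, and error rate here are illustrative, not our production values:

```python
class UrlDeduper:
    """URL dedup on top of RedisBloom's BF.* commands. `client` is any object
    exposing execute_command (e.g. redis-py). Key name, capacity, and error
    rate are illustrative."""

    def __init__(self, client, key="crawler:seen-urls",
                 capacity=100_000_000, error_rate=0.001):
        self.client = client
        self.key = key
        try:
            # EXPANSION 2: let the filter double in size if capacity is exceeded.
            client.execute_command("BF.RESERVE", key, error_rate, capacity,
                                   "EXPANSION", 2)
        except Exception:
            pass  # filter already exists

    def is_new(self, url):
        # BF.ADD returns 1 if the item was not present, 0 if (probably) seen.
        # Bloom filters have false positives but no false negatives, so a
        # "new" verdict is always safe to act on.
        return bool(self.client.execute_command("BF.ADD", self.key, url))
```

The crawler only schedules a fetch when is_new returns True; the false-positive rate means we occasionally skip a genuinely new URL, which is an acceptable trade for never re-fetching a known one.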

Three data sources, one pipeline

Not every page comes from live crawling. We ingest from three sources, all feeding into the same indexing pipeline:

Live crawling (Frontera)

The scheduled pipeline described above. This is how we crawl domains on our own terms, following links, respecting politeness, and applying our own crawl strategy.

Common Crawl

Common Crawl publishes monthly web snapshots. We query their CDX index or Athena for specific domains, fetch WARC records, process them through the same storage and indexing pipeline, and track everything in the same crawled_content table. This lets us bootstrap coverage for domains we haven't crawled ourselves, at a fraction of the cost of live-fetching.
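Querying the CDX index for a domain is essentially building one URL and parsing newline-delimited JSON. A rough sketch — the collection id is an example, and the actual HTTP fetch is left to the caller:

```python
import json
from urllib.parse import urlencode

# Collection ids look like "CC-MAIN-2025-20"; the one used below is an example.
CDX_ENDPOINT = "https://index.commoncrawl.org/{collection}-index"

def cdx_query_url(collection, domain, limit=1000):
    """Build a CDX API query listing captures for every path under a domain."""
    params = urlencode({
        "url": f"{domain}/*",   # wildcard: all captures under the domain
        "output": "json",       # one JSON object per line
        "limit": limit,
    })
    return CDX_ENDPOINT.format(collection=collection) + "?" + params

def parse_cdx_response(body):
    """Parse NDJSON records into (url, warc filename, offset, length) tuples.
    Offset and length locate the WARC record inside the snapshot file, so we
    can range-request exactly the bytes we need."""
    records = []
    for line in body.strip().splitlines():
        rec = json.loads(line)
        records.append((rec["url"], rec["filename"],
                        int(rec["offset"]), int(rec["length"])))
    return records
```

Each (filename, offset, length) triple becomes a byte-range fetch against Common Crawl's public bucket, which is the cheap part compared to a live crawl.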

Wikipedia Enterprise

Wikipedia is critical for search quality. We ingest full Wikipedia snapshots via the Enterprise API, streaming NDJSON data through 100 concurrent download workers. Wikipedia jobs skip Frontera entirely — there is no need for link discovery or crawl scheduling when you are ingesting a known dataset.

The indexing pipeline

Every crawled page, regardless of source, ends up as a Kafka message on the indexer topic. Indexer workers consume these messages, parse the HTML, generate embeddings, and feed the resulting documents to Vespa.

The Vespa schema supports hybrid search, combining text matching and vector similarity. Documents carry quality, authority, and popularity scores alongside the text and embeddings. Multiple ranking profiles handle different query types.
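A stripped-down sketch of what such a schema can look like — the field names, embedding dimension, and ranking expression are illustrative, not our actual schema:

```
schema page {
    document page {
        field title type string {
            indexing: index | summary
        }
        field body type string {
            indexing: index | summary
        }
        field quality type float {
            indexing: attribute
        }
        field embedding type tensor<float>(x[768]) {
            indexing: attribute | index
            attribute {
                distance-metric: angular
            }
        }
    }
    fieldset default {
        fields: title, body
    }
    rank-profile hybrid inherits default {
        inputs {
            query(q) tensor<float>(x[768])
        }
        first-phase {
            expression: bm25(title) + bm25(body) + closeness(field, embedding) + attribute(quality)
        }
    }
}
```

The shape is the point: text fields get BM25, the embedding field gets an ANN index, and document-level scores like quality sit in attributes so the rank profile can blend all three signals in one expression.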

Vespa Cluster

Vespa is the serving layer for the whole index, so I designed it as a real cluster rather than a single search node. In production, the query tier runs on three container nodes, and the content tier is split across three distribution groups spread across three GCP zones. Each group has two content nodes, for six content nodes total. The cluster is configured with redundancy=3, min-redundancy=2, and searchable-copies=3, which means documents are replicated across the cluster and remain searchable even if a node or an entire group is unavailable.

That topology gives me both availability and predictable scaling. If I need more index capacity or more backend search and feed throughput, I add content nodes. More content nodes means more RAM for the index, more disk for documents, and more search work distributed across the content layer. If I need more front-door query concurrency, I add more container nodes instead. Vespa separates those concerns cleanly: content nodes scale storage and backend execution, while container nodes scale query serving.

The cluster is also tuned to stay useful under failure, not just healthy on paper. Query timeout is set to one second, minimum coverage is 90%, and dispatch uses the adaptive policy with availability prioritized. In practice, that means the system prefers returning good results quickly from the healthy parts of the cluster instead of turning a partial outage into a full outage. That trade-off matters more to users than theoretical perfection during an incident.

Operationally, the expensive part is the content layer. Each content node runs with 7 CPU, 56 GiB of RAM, and 5 TiB of premium storage. Those are the machines that actually hold the index, so they dominate both cost and warm-up time. By comparison, the query-serving nodes are much lighter. That split is intentional: most of the money goes into the content tier, because that is where storage capacity, search capacity, and indexing throughput actually live.

Scalability and throughput

The scaling primitive for Frontera is the partition. In production, partitionCount=50 means up to 50 spiders, 50 strategy workers, and 50 DB worker pods — all with scale-to-zero via KEDA. Each partition-pinned pod has replica ≤ 1, so you never have two consumers fighting over the same partition.
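The per-partition scaling setup is roughly one KEDA ScaledObject per partition-pinned deployment. This sketch uses illustrative names and thresholds, but the shape — minReplicaCount 0, maxReplicaCount 1, lag measured on a single partition — is the important part:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: frontera-spider-7        # one ScaledObject per partition (example name)
spec:
  scaleTargetRef:
    name: frontera-spider-7      # deployment pinned via FRONTERA_SPIDER_PARTITION_ID=7
  minReplicaCount: 0             # scale to zero when the partition is idle
  maxReplicaCount: 1             # never two consumers on the same partition
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092        # example address
        consumerGroup: frontera-spiders
        topic: frontier-todo
        partitionLimitation: "7"            # only this partition's lag counts
        lagThreshold: "100"                 # example threshold
```

With this shape, adding partitions is mechanical: each new partition gets its own pinned deployment and ScaledObject, and KEDA wakes each pod only when its own slice of the topic has work.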

In production, we sustain 3–5 million pages crawled per day across 50 Frontera partitions, and a similar 3–5 million pages indexed per day with up to 200 indexer pods. That puts the system comfortably above the 100M pages/month mark when running at full capacity.

The throughput mental model is simple: the partition count sets the ceiling, because each partition carries its own spider, strategy worker, and DB worker.

The batchgen shards even/odd across two pods and drains millions of queue rows per hour. The key levers are partitionCount, KEDA triggers on per-partition Kafka lag, and the batchgen even/odd sharding.

Non-Frontera components scale differently. The Search API and Crawler Workers use standard HPA (2–10 and 2–15 replicas). Indexer workers scale from 1 to 200 replicas based on CPU and memory thresholds.

All pods except the API services and Vespa run on preemptible (spot) nodes, which saves up to 90% on compute costs. Since crawl workers, indexer workers, and the Frontera pods are stateless services that tolerate restarts gracefully, this was a straightforward change — Kubernetes reschedules preempted pods within seconds, and Kafka consumers simply resume from their last committed offset.

The operational signals to watch are Kafka lag per topic and partition (frontier-todo, frontier-done, frontier-score) and the DB queue growth rate versus the batchgen drain rate.

Observability

You cannot operate a system like this without knowing what it is doing. Every component exposes Prometheus metrics on dedicated ports. The key metrics are Kafka consumer lag per topic and partition, the Postgres queue growth rate versus the batchgen drain rate, and crawl and indexing throughput per component.

What broke

A few things that bit us and shaped the architecture:

Database pressure from Frontera scoring. The scoring consumer writes batches to PostgreSQL's queue table. Early on, large batch sizes caused lock contention and slowed down the entire pipeline. The fix was adaptive batching — start small, grow when writes are fast, shrink when the database pushes back. A target of 1-second processing time per batch, with growth increments of 64 and a 0.5x shrink factor, keeps the database happy.
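The adaptive batching logic is small enough to sketch. The target time, growth increment, and shrink factor match the numbers above; the size bounds are illustrative:

```python
class AdaptiveBatcher:
    """Adjust batch size toward a target per-batch processing time.
    Target 1s, grow by 64, shrink by 0.5x, as described in the post;
    the min/max bounds are illustrative."""

    def __init__(self, initial=64, minimum=16, maximum=4096,
                 target_seconds=1.0, grow_by=64, shrink_factor=0.5):
        self.size = initial
        self.minimum = minimum
        self.maximum = maximum
        self.target_seconds = target_seconds
        self.grow_by = grow_by
        self.shrink_factor = shrink_factor

    def record(self, elapsed_seconds):
        """Call after each batch write with its wall-clock duration;
        returns the size to use for the next batch."""
        if elapsed_seconds < self.target_seconds:
            # Database is keeping up: grow additively.
            self.size = min(self.size + self.grow_by, self.maximum)
        else:
            # Database is pushing back: shrink multiplicatively.
            self.size = max(int(self.size * self.shrink_factor), self.minimum)
        return self.size
```

Additive increase with multiplicative decrease means the batch size backs off fast when Postgres slows down but only creeps back up, which is exactly the asymmetry you want when the failure mode is lock contention.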

Writing to Vespa directly from indexer pods. This was fine when the fleet was small, but it broke once production scaled past 100 indexer pods. Vespa can stall during ingestion while it flushes, sometimes for several minutes, and when every indexer pod was feeding Vespa directly that pause turned into fleet-wide backpressure: throughput dropped, timeouts climbed, Kafka lag spread across the whole consumer group, and expensive indexer workers sat idle waiting on Vespa when they could have been processing embeddings instead. The fix was to decouple ingestion — indexers now write compressed document payloads to shared storage and publish lightweight pointers to a vespa-feed topic, while a small vespa-writer deployment owns the actual feed into Vespa. That isolated Vespa slowdowns to a much smaller set of workers instead of letting them stall the whole indexer fleet.
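A rough sketch of the indexer side of that decoupling — compress the payload, store it, publish a small pointer. The bucket name, path layout, and message fields here are illustrative, not our exact production format:

```python
import gzip
import hashlib
import json

def make_feed_pointer(doc_id, payload, bucket="index-payloads"):
    """Compress a document payload for shared storage and build the
    lightweight pointer message that goes on the vespa-feed topic.
    Bucket, path layout, and field names are illustrative."""
    blob = gzip.compress(json.dumps(payload).encode("utf-8"))
    pointer = {
        "doc_id": doc_id,
        "path": f"gs://{bucket}/{doc_id}.json.gz",  # where vespa-writer fetches it
        "sha256": hashlib.sha256(blob).hexdigest(), # lets the writer detect corrupt blobs
        "bytes": len(blob),
    }
    return blob, pointer
```

The indexer uploads the blob and publishes only the pointer, so a Vespa flush stall backs up the small vespa-writer deployment rather than hundreds of embedding workers.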

Bloom filter capacity misses. When a crawl job discovered far more URLs than expected, the Bloom filter expanded repeatedly, eating Redis memory. Adding the 2x expansion factor and 30-day TTL was the fix, but we also had to add BF.INFO monitoring to catch filters that were growing out of control before they impacted other tenants.

Wikipedia ingestion timeouts. Streaming hundreds of thousands of articles from the Enterprise API is not a fast operation. Network hiccups would kill the download mid-stream. The fix was chunked downloading (10MB chunks), up to 5 retries with 30-second delays, and progress logging every 100 articles so you can see where it stalled.
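The retry loop has roughly this shape — a sketch with an injectable stream opener and sink, since the real code wraps the Enterprise API client; this simplified version restarts from scratch on failure:

```python
import time

def download_stream(open_stream, sink, chunk_size=10 * 1024 * 1024,
                    max_retries=5, retry_delay=30, log_every=100):
    """Drain a chunked stream with retries. open_stream(chunk_size) returns an
    iterator of byte chunks (e.g. a requests response's iter_content); sink(chunk)
    persists them. Restarts from the beginning on failure — a simplification of
    the real downloader. Returns the number of chunks written."""
    for attempt in range(1, max_retries + 1):
        try:
            count = 0
            for chunk in open_stream(chunk_size):
                sink(chunk)
                count += 1
                if count % log_every == 0:
                    # Progress logging so a stall is visible mid-download.
                    print(f"wrote {count} chunks")
            return count
        except OSError as exc:
            if attempt == max_retries:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {retry_delay}s")
            time.sleep(retry_delay)
```

The periodic progress line is the underrated part: when a 100GB stream dies, knowing whether it stalled at chunk 3 or chunk 9,000 tells you whether to blame the network or the API.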

These are just a few examples. Every time we scale the system further — more partitions, more crawler workers, higher document throughput — something unexpected breaks. The architecture is shaped as much by these failures as by any upfront design.

What I intentionally did not optimize

We use Google Cloud Storage for everything — crawled HTML files, raw blobs, intermediate artifacts. GCS is extremely easy to use and scales without any capacity planning. The trade-off is cost: it is significantly more expensive than local SSD storage, and the bill grows linearly with the corpus. So far the simplicity and reliability are worth the premium, but this is the kind of decision that gets revisited as the index grows.


If you are building something similar or want to discuss trade-offs in search infrastructure, I would enjoy hearing from you. I am reachable on LinkedIn.