Perplexity tunes its Search API for span-level precision and speed

Perplexity has turned the search infrastructure behind its own answer engine into a developer product: a Search API aimed at RAG apps, agents, and AI-native products rather than old-school “10 blue links” search. It’s not just exposing URLs; it’s effectively selling a real-time, AI-optimized retrieval layer designed to drop clean, ranked evidence straight into large language models.

At the heart of this launch is a fairly opinionated view of what “search for AI” should look like. Instead of returning entire documents and forcing developers to handle messy parsing, Perplexity’s infrastructure breaks the web down into fine-grained spans—sections and snippets inside pages—that are individually scored against each query. The system leans on a hybrid retrieval stack that combines classical lexical matching with dense embeddings and multi-stage ranking, so by the time results hit your application, you’re getting compact, high-signal chunks that are already ordered by relevance. For anyone who has ever watched their context window evaporate because a retriever dumped in half a PDF, this is the pain point Perplexity is going after.
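To make the hybrid idea concrete, here is a toy sketch of span-level scoring that blends a lexical signal with a vector-similarity signal. It is not Perplexity's pipeline: the character-trigram "embedding" is a crude stand-in for a learned dense model, and the function names and the alpha weight are assumptions made for illustration.

```python
import math
from collections import Counter


def lexical_score(query: str, span: str) -> float:
    """Fraction of distinct query terms that appear in the span."""
    q_terms = set(query.lower().split())
    s_terms = set(span.lower().split())
    return len(q_terms & s_terms) / len(q_terms) if q_terms else 0.0


def toy_embedding(text: str) -> Counter:
    """Character-trigram counts, a crude stand-in for a learned dense embedding."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_spans(query: str, spans: list[str], alpha: float = 0.5) -> list[tuple[float, str]]:
    """Score each span with a weighted blend of both signals, best first."""
    q_vec = toy_embedding(query)
    scored = [
        (alpha * lexical_score(query, s) + (1 - alpha) * cosine(q_vec, toy_embedding(s)), s)
        for s in spans
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


spans = [
    "The search API returns ranked spans for each query.",
    "Our cafeteria menu changes every Tuesday.",
    "Ranked spans keep the context window small.",
]
ranked = rank_spans("ranked spans search API", spans)
for score, span in ranked:
    print(f"{score:.2f}  {span}")
```

A production stack would typically use each signal to pull candidates first and then pass them through further ranking stages; the single blended score here only shows why a span can rank well on either exact wording or overall similarity.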

Internally, the company has been running this infrastructure at web scale for some time. In a technical overview of its AI-first Search API, Perplexity describes an engine that processes around 200 million queries per day (roughly 2,300 per second on average), backed by a web index of more than 200 billion unique URLs and a distributed system designed to keep latency in the sub-300-millisecond range. That stack relies on “tens of thousands” of CPU cores and large in-memory shards to keep span-level data close to ranking models, which is overkill for a typical SaaS app but exactly what you want if you’re trying to be a default retrieval layer for AI workloads. The company’s pitch is that instead of bolting a generic search engine onto a model, you plug directly into a system that was purpose-built to feed models only what they actually need.

The March update to the Search API makes that philosophy even more obvious because it focuses almost entirely on snippet quality. Perplexity built a new span-level labeling and evaluation pipeline that annotates parts of a document as “vital,” “irrelevant,” or duplicative in the context of a specific query, then uses those labels to measure how much of the right content, and how little of the wrong content, lands in the returned snippet. In practice, that let the company aggressively shrink snippet size without hurting answer quality. According to Perplexity, internal tests showed that smaller snippets, once pruned and scored correctly, actually improved downstream accuracy while cutting token usage and response payloads. For developers, that translates into less context bloat, lower OpenAI/Anthropic bills, and more predictable behavior when you wire this into an LLM chain.
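One way to picture that kind of measurement is a small report over labeled spans. The sketch below is an illustration only: the LabeledSpan type, the metric names, and the token counts are all made up for the example and are not taken from Perplexity's pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSpan:
    span_id: str
    label: str   # "vital", "irrelevant", or "duplicate" for one specific query
    tokens: int  # approximate token count of the span


def snippet_report(doc_spans: list[LabeledSpan], snippet_ids: set[str]) -> dict[str, float]:
    """Measure how much vital content a snippet keeps and how many tokens it wastes."""
    vital = [s for s in doc_spans if s.label == "vital"]
    included = [s for s in doc_spans if s.span_id in snippet_ids]
    vital_hit = [s for s in included if s.label == "vital"]
    total_tokens = sum(s.tokens for s in included)
    wasted_tokens = sum(s.tokens for s in included if s.label != "vital")
    return {
        "vital_recall": len(vital_hit) / len(vital) if vital else 1.0,
        "wasted_token_share": wasted_tokens / total_tokens if total_tokens else 0.0,
        "snippet_tokens": float(total_tokens),
    }


# Hypothetical labels for one document against one query.
doc = [
    LabeledSpan("s1", "vital", 40),
    LabeledSpan("s2", "irrelevant", 120),
    LabeledSpan("s3", "duplicate", 40),
    LabeledSpan("s4", "vital", 60),
]
wide = snippet_report(doc, {"s1", "s2", "s3", "s4"})  # snippet keeps everything
tight = snippet_report(doc, {"s1", "s4"})             # snippet keeps only vital spans
print(wide)
print(tight)
```

In this made-up case both snippets keep every vital span, but the tight one carries 100 tokens instead of 260, which is the trade the article describes: a smaller payload with nothing important lost.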

Perplexity is also using benchmarks to make the case that this isn’t just marketing. One of the most interesting pieces is SEAL, a time-sensitive retrieval benchmark that checks whether a search system can consistently surface the current correct answer when that answer changes over time—think live sports stats, market caps, or policy changes. When Perplexity ran its open-source search_evals framework on the February SEAL release with Anthropic’s Claude Sonnet 4.5 as the downstream model, its own Search API scores climbed while competing providers’ performance on the harder SEAL variant actually fell. Benchmarks are always nuanced, but the message is clear: they want to be the go-to option when you care about freshness and real-time indexing, not just static corpora.

Beyond quality, the feature set is starting to look like something you can actually architect a product around. The API now accepts up to five queries in a single request, returning grouped results in order, which is particularly useful for agentic systems that break a complex question into multiple sub-queries or for apps that want to fan out a batch of related searches to keep latency in check. Filtering is more granular than the typical “time and site” controls: Perplexity supports allowlists and denylists for up to 20 domains, recency windows, ISO 639-1 language filters, and regional search by ISO country code. These filters can be combined to narrow scope, for example to English-language content from German domains in the last week. This kind of control matters if you’re building, for example, a region-specific financial assistant, a multilingual research tool, or a news product that must respect compliance boundaries.
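The limits described above can be enforced on the client side before a request is ever sent. The sketch below builds a plain dictionary and checks the five-query and 20-domain caps. Every field name in it is hypothetical rather than taken from the official API reference, and applying the domain cap to both lists combined is a conservative assumption.

```python
from typing import Optional

MAX_QUERIES = 5   # per-request query cap stated in the article
MAX_DOMAINS = 20  # domain filter cap stated in the article


def build_search_request(
    queries: list[str],
    allow_domains: Optional[list[str]] = None,
    deny_domains: Optional[list[str]] = None,
    recency: Optional[str] = None,
    languages: Optional[list[str]] = None,
    country: Optional[str] = None,
) -> dict:
    """Validate inputs and return a plain dict; all field names are hypothetical."""
    allow_domains = allow_domains or []
    deny_domains = deny_domains or []
    languages = languages or []
    if not 1 <= len(queries) <= MAX_QUERIES:
        raise ValueError(f"expected 1 to {MAX_QUERIES} queries, got {len(queries)}")
    # Conservative assumption: the cap applies to both lists combined.
    if len(allow_domains) + len(deny_domains) > MAX_DOMAINS:
        raise ValueError(f"at most {MAX_DOMAINS} domains may be filtered")
    for code in languages:
        # ISO 639-1 codes are two ASCII letters, e.g. "en" or "de".
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not an ISO 639-1 code: {code!r}")
    request: dict = {"queries": list(queries)}
    if allow_domains:
        request["allow_domains"] = allow_domains
    if deny_domains:
        request["deny_domains"] = deny_domains
    if recency:
        request["recency"] = recency
    if languages:
        request["languages"] = [code.lower() for code in languages]
    if country:
        request["country"] = country.upper()
    return request


# English-language results, scoped to Germany, from the last week.
request = build_search_request(
    queries=["ECB rate decision", "Bundesbank inflation outlook"],
    recency="week",
    languages=["en"],
    country="de",
)
print(request)
```

Failing fast on an oversized batch or a malformed language code is cheaper than discovering the problem from a rejected request, especially when an agent is generating the sub-queries on its own.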

On the developer experience side, Perplexity is trying to make integration feel familiar if you’ve used any modern AI API. There’s a Python SDK that exposes Search right alongside the existing Agent API and Sonar API, with a simple client.search.create-style interface. The official quickstart docs show how to configure parameters like max_results, language and domain filters, and recency windows with a few lines of code, which should reduce the friction for teams already experimenting with Perplexity’s other APIs. The company’s broader API platform page makes it clear that Search is meant to be one of four pillars, alongside Agent and model APIs, rather than a bolt-on extra, positioning it as part of a more unified AI stack.
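As a rough picture of that call shape, the sketch below wraps a client.search.create-style method in a small helper. It does not import the real SDK: the stand-in client simply echoes its arguments so the example runs offline, and the query keyword is an assumption (only max_results is named above), so check the official quickstart before relying on any parameter name.

```python
from types import SimpleNamespace


def run_search(client, query, max_results=5, **filters):
    """Forward a query to any client that exposes a `search.create` method."""
    return client.search.create(query=query, max_results=max_results, **filters)


def _echo_create(**kwargs):
    """Stand-in for a network call: return the arguments it was given."""
    return kwargs


# Fake client so the sketch runs without credentials or network access.
fake_client = SimpleNamespace(search=SimpleNamespace(create=_echo_create))

sent = run_search(fake_client, "latest EU AI Act guidance", max_results=3)
print(sent)
```

Keeping the call behind a thin helper like this also makes it easy to swap in a stub during tests, so retrieval logic can be exercised without spending API credits.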

Taken together, the launch reads less like “yet another search API” and more like a strategic pivot. Perplexity is essentially productizing the same retrieval backbone that underpins its consumer answer engine, betting that developers want direct access to that infrastructure instead of stitching together their own cocktail of web search, scraping, chunking, and heuristic ranking. In a landscape where RAG quality often lives or dies on the retrieval layer, a purpose-built, span-aware, benchmarked search API is a strong move—and one that could quietly become the default choice for teams that care more about grounded, real-time answers than about where the links came from.