Document Search, Built End-to-End: A Pare Studio Case Study

Most demonstrations of document search hand-wave the difficult parts. We wanted to know what happens when you build the whole thing. So we did, against open property data, because property listings have the same shape as the document problems we see in regulated firms: long descriptions, structured facts, and questions that mix the two.

Worked example, not a client delivery.

Built end-to-end against a public dataset. Shows the architecture in motion, with no client behind it.

1.8 hours/day: average time knowledge workers spend searching for information (McKinsey)

Why property listings

The choice of dataset matters less than it looks. We picked property listings because the open data exists, the structure is rich, and the questions buyers ask resemble the questions a paralegal asks a contract library or a compliance officer asks a regulations binder. A buyer doesn't think in search filters. They ask for a "3-bed semi under 450k near a good primary school" in plain language, and the system has to handle the structured constraints and the soft preferences at the same time.

That is the same problem a lawyer has when they ask "show me clauses that cap liability at twelve months in our last twenty SaaS contracts." Structured filter, semantic search, cited answer. The pipeline does not care that one is bedrooms and the other is liability caps.

You probably have this problem if…

Your team searches a shared drive by filename and opens several files to find one answer.
The same questions get asked repeatedly. Different people, different clients, same underlying point.
Senior staff get interrupted to answer questions that already exist in writing.
You're not always confident the version on the shared drive is the current one.
Onboarding new hires is slow because there's no fast way to look things up.

Any two of those is a strong signal. All of them and you're losing meaningful billable capacity to search.

How the build works

Imagine someone in the team who has read every document you own, remembers what each passage means, and is on call to answer any question. When you ask something, they recall the few passages most relevant and read them back to you with the source attached. That's the system.

Four things have to work for it to be useful.

Figure: From a corpus of documents to a cited answer.

1. Normalise the documents

The raw data is a public Zoopla export from CrawlFeeds. Each listing is messy in its own way, with price as text in some rows and as a number in others, addresses split across three fields, and descriptions concatenated with marketing copy. The first job is to flatten that into a single clean document per listing, with the structured facts (price, bedrooms, tenure, location) sitting alongside the prose (description, features). For regulated work the same pattern applies. A contract has a numeric value, an effective date, a counterparty, and several thousand words of clauses. Both the facts and the prose have to be searchable.
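A minimal sketch of that flattening step, to make the idea concrete. The raw field names (`address_1`, `address_2`, `postcode`) and the accepted price formats are assumptions for illustration, not the real export schema or the adapter used in the build.

```typescript
// Hypothetical raw row; the field names are assumptions, not the real export schema.
interface RawListing {
  id: string;
  price: string | number;
  address_1?: string;
  address_2?: string;
  postcode?: string;
  bedrooms?: string | number;
  description?: string;
}

// One clean document per listing: structured facts alongside the prose.
interface NormalisedListing {
  id: string;
  price: number | null; // pounds; null when the raw value cannot be parsed
  bedrooms: number | null;
  address: string;
  description: string;
}

// Accepts 450000, "450000", "£450,000", "450k" or "1.2m"; anything else becomes null.
function parsePrice(value: string | number): number | null {
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  const match = value.replace(/[£,\s]/g, "").match(/^(\d+(?:\.\d+)?)([km])?$/i);
  if (!match) return null;
  const [, digits = "", suffix = ""] = match;
  const lower = suffix.toLowerCase();
  const multiplier = lower === "k" ? 1_000 : lower === "m" ? 1_000_000 : 1;
  return Math.round(parseFloat(digits) * multiplier);
}

// Whole, non-negative counts only; blank or malformed values become null.
function parseCount(value: string | number | undefined): number | null {
  if (value === undefined) return null;
  if (typeof value === "string" && value.trim() === "") return null;
  const n = Number(value);
  return Number.isInteger(n) && n >= 0 ? n : null;
}

function normaliseListing(raw: RawListing): NormalisedListing {
  return {
    id: raw.id,
    price: parsePrice(raw.price),
    bedrooms: parseCount(raw.bedrooms),
    // Join the split address fields, skipping any that are empty.
    address: [raw.address_1, raw.address_2, raw.postcode]
      .map((part) => (part ?? "").trim())
      .filter((part) => part.length > 0)
      .join(", "),
    // Collapse runs of whitespace left over from the export.
    description: (raw.description ?? "").replace(/\s+/g, " ").trim(),
  };
}
```

Returning `null` for an unparseable price, rather than guessing, keeps bad rows visible so they can be excluded from price filters instead of silently matching them.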

2. Build fingerprints from meaning, not words

Figure: Each listing yields several typed snippets. The buyer's question lands closest to the description cluster of a specific listing.

Each listing is broken into typed snippets: a summary, a description, a features block, a location string. For every snippet, the system creates a numerical fingerprint that captures what the snippet means, rather than which words it uses. A question about "somewhere I could grow vegetables" will find listings that mention a south-facing garden, even though the question and the listing share no words. That happens because the two fingerprints sit close together in vector space.

The technical name for the fingerprint is a vector embedding. The build runs in two modes: a deterministic local mode that needs no API keys and is fine for development, and a hosted mode using OpenAI's text-embedding-3-small when you want production-grade retrieval quality. For a real client deployment we would benchmark domain-specialised embeddings, such as those from Voyage AI, which are reported to improve recall on regulatory and legal text.
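To show what "close in mathematical space" means in practice, here is a small sketch of a deterministic embedder and a cosine similarity function. How the build's local mode actually works is not described here, so the hashed bag-of-words approach, the `DIMENSIONS` value, and the function names are all assumptions. A hashed vector only captures word overlap, not meaning. In hosted mode a learned embedding model would take the place of `embedLocally`, while the similarity calculation stays the same.

```typescript
const DIMENSIONS = 64; // assumed vector size for this sketch

// FNV-1a hash of a token, used to pick a vector slot deterministically.
function hashToken(token: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Stand-in for a local embedding mode: counts tokens into hashed slots.
// The same text always produces the same vector, with no API key needed.
function embedLocally(text: string): number[] {
  const vector = new Array<number>(DIMENSIONS).fill(0);
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    const slot = hashToken(token) % DIMENSIONS;
    vector[slot] = (vector[slot] ?? 0) + 1;
  }
  return vector;
}

// Cosine similarity: 1 for vectors pointing the same way, 0 for orthogonal
// vectors, and 0 by convention when either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}
```

Ranking then amounts to embedding the question once, scoring it against every stored snippet vector with `cosineSimilarity`, and sorting by score.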

3. Pull back the snippets closest to the question

This is where the worked example earns its place. A naive system embeds the question, compares to every snippet, and returns the top few. That works for "tell me about gardens." It falls over the moment you ask "three-bedroom flat in Putney under £600k with a garden."

So the system does two things at once. It extracts the structured constraints from the question (price ceiling, bedroom count, location, property type) and uses them as filters before similarity ranking. Then it ranks the survivors by semantic similarity to the question, with a small boost for listings whose structured facts directly match. If too few listings clear the bar, the system relaxes the least-important constraint and tries again, flagging which constraint was relaxed so the buyer knows.
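As an illustration of the extraction half, the sketch below pulls a price ceiling, bedroom count, property type, and location out of a plain-language question using regular expressions. This is an assumed approach: the build may well use a different method (a language model, for instance), and the word lists and the `KNOWN_LOCATIONS` gazetteer are hypothetical.

```typescript
interface Constraints {
  maxPrice?: number;
  bedrooms?: number;
  propertyType?: string;
  location?: string;
}

// Hypothetical vocabularies; a real build would load these from the data.
const NUMBER_WORDS: Record<string, number> = { one: 1, two: 2, three: 3, four: 4, five: 5 };
const PROPERTY_TYPES = ["flat", "house", "bungalow", "maisonette"];
const KNOWN_LOCATIONS = ["putney", "clapham", "richmond"];

function extractConstraints(question: string): Constraints {
  const text = question.toLowerCase();
  const constraints: Constraints = {};

  // Price ceiling: "under £600k", "below 450,000", "up to 1.2m".
  const price = text.match(/(?:under|below|up to)\s*£?\s*(\d[\d,]*(?:\.\d+)?)\s*([km])?\b/);
  if (price) {
    const amount = parseFloat((price[1] ?? "").replace(/,/g, ""));
    const multiplier = price[2] === "k" ? 1_000 : price[2] === "m" ? 1_000_000 : 1;
    constraints.maxPrice = Math.round(amount * multiplier);
  }

  // Bedroom count: "3-bed", "3 bedroom", "three-bedroom".
  const beds = text.match(/\b(\d+|one|two|three|four|five)[\s-]*bed/);
  if (beds) {
    const token = beds[1] ?? "";
    constraints.bedrooms = NUMBER_WORDS[token] ?? Number(token);
  }

  // Property type and location: first vocabulary entry found in the question.
  const propertyType = PROPERTY_TYPES.find((type) => new RegExp(`\\b${type}s?\\b`).test(text));
  if (propertyType) constraints.propertyType = propertyType;

  const location = KNOWN_LOCATIONS.find((place) => text.includes(place));
  if (location) constraints.location = location;

  return constraints;
}
```

Anything the extractor does not recognise, such as "with a garden", is simply left for the similarity ranking to handle as a soft preference.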

That hybrid (filter, then rank, then relax) is what makes the system feel like it understands the question. In a legal corpus the equivalent might be "show me contracts signed after 2024 with US counterparties where the liability cap is below twelve months." Same shape, different facts.
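The filter, rank, relax loop can be sketched in a few dozen lines. The relaxation order, the size of the boost, the `minResults` threshold, and every type and function name below are assumptions chosen for illustration, not values taken from the build.

```typescript
interface Listing {
  id: string;
  price: number;
  bedrooms: number;
  location: string;
  embedding: number[];
}

interface Constraints {
  maxPrice?: number;
  bedrooms?: number;
  location?: string;
}

type ConstraintKey = keyof Constraints;

interface SearchResult {
  matches: Array<{ id: string; score: number }>;
  relaxed: ConstraintKey[]; // reported back so the buyer knows what was loosened
}

const RELAX_ORDER: ConstraintKey[] = ["location", "bedrooms", "maxPrice"]; // least important first (assumed)
const MATCH_BOOST = 0.05; // assumed boost per original constraint a listing satisfies

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0, y = b[i] ?? 0;
    dot += x * y; normA += x * x; normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// True when the listing meets one constraint (or the constraint is absent).
function meets(listing: Listing, key: ConstraintKey, c: Constraints): boolean {
  switch (key) {
    case "maxPrice": return c.maxPrice === undefined || listing.price <= c.maxPrice;
    case "bedrooms": return c.bedrooms === undefined || listing.bedrooms === c.bedrooms;
    case "location": return c.location === undefined || listing.location === c.location;
  }
}

function search(query: number[], listings: Listing[], constraints: Constraints, minResults = 2, topK = 5): SearchResult {
  const active = RELAX_ORDER.filter((key) => constraints[key] !== undefined);
  const relaxed: ConstraintKey[] = [];
  for (;;) {
    // 1. Filter: keep listings that meet every constraint still enforced.
    const enforced = active.filter((key) => !relaxed.includes(key));
    const survivors = listings.filter((l) => enforced.every((key) => meets(l, key, constraints)));
    const next = enforced[0];
    if (survivors.length >= minResults || next === undefined) {
      // 2. Rank: semantic similarity plus a small boost per original constraint met.
      const matches = survivors
        .map((l) => ({
          id: l.id,
          score: cosine(query, l.embedding) + MATCH_BOOST * active.filter((key) => meets(l, key, constraints)).length,
        }))
        .sort((a, b) => b.score - a.score)
        .slice(0, topK);
      return { matches, relaxed };
    }
    // 3. Relax: drop the least important constraint still enforced and retry.
    relaxed.push(next);
  }
}
```

Because the boost is computed against the original constraints rather than the relaxed set, a listing that still satisfies a dropped constraint ranks above an otherwise similar one that does not.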

4. Write the answer, prove every citation

The shortlisted snippets are handed to a language model with the original question. The model returns a structured answer, with a recommendation per listing, a match reason, the trade-offs, and the IDs of the snippets it cited. The system then validates that response before the user sees it. Every snippet ID has to resolve to a real snippet the retriever returned. Every citation has to belong to the listing it's attached to. If any of that fails, the response is rejected and a deterministic fallback (assembled from the retriever's own output) is shown instead.

That validation step is the difference between "an LLM said this" and "this answer is grounded in known sources." In regulated work, the latter is the only acceptable answer.
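The two citation checks and the fallback described above can be sketched as follows. The `Answer` and `Snippet` shapes, the function names, and the fallback wording are hypothetical. Only the rules themselves (every cited ID must be in the retrieved set, and must belong to the listing it is attached to) come from the description.

```typescript
interface Snippet {
  id: string;
  listingId: string;
  text: string;
}

interface Recommendation {
  listingId: string;
  matchReason: string;
  citedSnippetIds: string[];
}

interface Answer {
  recommendations: Recommendation[];
}

// Returns one message per broken citation; an empty array means the answer is grounded.
function citationErrors(answer: Answer, retrieved: Snippet[]): string[] {
  const byId = new Map<string, Snippet>();
  for (const snippet of retrieved) byId.set(snippet.id, snippet);

  const errors: string[] = [];
  for (const rec of answer.recommendations) {
    for (const snippetId of rec.citedSnippetIds) {
      const snippet = byId.get(snippetId);
      if (!snippet) {
        // Check 1: the ID must resolve to a snippet the retriever returned.
        errors.push(`unknown snippet ${snippetId}`);
      } else if (snippet.listingId !== rec.listingId) {
        // Check 2: the snippet must belong to the listing it is cited for.
        errors.push(`snippet ${snippetId} belongs to ${snippet.listingId}, not ${rec.listingId}`);
      }
    }
  }
  return errors;
}

// Deterministic fallback built only from the retriever's output, in retrieval order.
function fallbackAnswer(retrieved: Snippet[]): Answer {
  const grouped = new Map<string, string[]>();
  for (const snippet of retrieved) {
    grouped.set(snippet.listingId, [...(grouped.get(snippet.listingId) ?? []), snippet.id]);
  }
  return {
    recommendations: Array.from(grouped.entries()).map(([listingId, citedSnippetIds]) => ({
      listingId,
      matchReason: "Matched by retrieval; model commentary unavailable.",
      citedSnippetIds,
    })),
  };
}

// The model's answer is shown only if every citation checks out.
function groundedAnswer(modelAnswer: Answer, retrieved: Snippet[]): Answer {
  return citationErrors(modelAnswer, retrieved).length === 0 ? modelAnswer : fallbackAnswer(retrieved);
}
```

Collecting error messages rather than returning a bare boolean makes rejected responses easy to log, which helps when tuning the prompt.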

What the user sees

The interface has three panels. A chat panel where the buyer asks questions and refines them. A shortlist of property cards on the right, each with the headline facts and a button to expand the cited evidence. An evidence drawer that opens to show the exact snippets behind each recommendation, with a link to the original listing. There is also a lead-capture form for booking a viewing, but in the demo that just writes to a local file. No CRM, no email.

The interaction pattern transfers cleanly. Swap "shortlist of properties" for "shortlist of relevant clauses" or "shortlist of audit findings" and the screen looks the same.

Where this same approach applies

We chose property listings because the data is public and the structure is rich. The pipeline itself is domain-agnostic. The same architecture (normalise documents, embed by meaning, filter on structured constraints, retrieve, ground the answer, validate citations) is what we build for clients in domains like these.

Legal and compliance. Contract libraries, regulatory guidance, internal policies. The structured facts are dates, counterparties, jurisdictions, monetary values. The prose is the clauses.
Audit and assurance. Working papers, prior-year files, regulator publications. The questions look like "how did we handle X last year" and "what did the regulator say about Y."
Technical documentation. Engineering manuals, installation procedures, troubleshooting guides. The structured facts are part numbers, models, dates of last revision.
Internal knowledge. SOPs, onboarding documents, accumulated email threads, post-incident write-ups. The structured facts are teams, dates, owners.

If your team's institutional knowledge sits in documents and the questions you ask mix soft semantics with hard constraints, this pattern earns its keep.

Why this is safe for regulated work

Four properties matter.

Can run entirely on your network. For teams handling confidential client data, sensitive IP, or anything bound by data residency rules, the whole system can be deployed on-premise with open-weight models. Embedding, retrieval, and answer generation all stay inside the perimeter. Documents never leave your network. Quality is typically very close to hosted models on retrieval-grounded Q&A, because the model's job is to summarise text it has been given, not to draw on outside knowledge.

Cited sources, always. Every answer carries inline citations, with a link to the source section. Verification takes seconds. In a regulated context, an AI answer without a traceable source is worse than no answer at all.

Answers only from your documents. The model is instructed to answer only from text retrieved from your own documents, not from its general training data. If the answer isn't in the index, the system returns a "no match" response rather than guessing. In our build that's enforced by the citation validator, not just the prompt.

Version-aware. New documents are re-ingested through a simple upload. The system can flag answers drawn from documents that haven't been updated in over a year, prompting a review of currency.
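A staleness flag of that kind needs very little code. The sketch below assumes each source document carries a `lastUpdated` date and uses a 365-day threshold. Both the field name and the exact cutoff are assumptions.

```typescript
interface SourceDocument {
  id: string;
  lastUpdated: Date;
}

const STALE_AFTER_DAYS = 365; // assumed threshold for "over a year"
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the IDs of cited documents last updated before the cutoff.
function staleSources(cited: SourceDocument[], now: Date = new Date()): string[] {
  const cutoff = now.getTime() - STALE_AFTER_DAYS * MS_PER_DAY;
  return cited.filter((doc) => doc.lastUpdated.getTime() < cutoff).map((doc) => doc.id);
}
```

An answer whose citations include any ID returned by `staleSources` can then be shown with a prompt to review whether the source is still current.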

The stack we used

Document normalisation: TypeScript adapters for Zoopla and Trulia CSV/JSON exports

Snippet construction: Typed snippets per listing (summary, description, features, location)

Embedding: Deterministic local mode or OpenAI text-embedding-3-small

Vector storage: Local JSON for the demo, Postgres with pgvector for production

Retrieval: Constraint extraction, filtered cosine ranking, relaxed-fallback

Answer generation: Structured JSON response, schema-validated, citations checked

Interface: React, Vite, TypeScript, with chat, shortlist, and evidence drawer

Two engineering decisions matter most.

Structured constraints before semantic search. Embeddings alone don't reliably honour "under £600k." Extracting the constraint and using it as a hard filter is the difference between a system that feels precise and one that feels approximate. For regulated retrieval the same point holds for dates, jurisdictions, and monetary thresholds.

Validated citations, not just prompted ones. Asking a model to "cite your sources" is not the same as enforcing that it did. Our agent validates that every cited snippet ID actually exists in the retrieved set and belongs to the listing it's attached to. If the model invents a citation, the response is rejected before it reaches the user.

When this isn't the right fit

The pattern is powerful, but it's the wrong tool for some problems.

Small document sets. Below roughly fifty documents, full-text search and a tidy filing structure usually outperform the cost and complexity of an embedding pipeline.

Highly visual content. Diagrams, schematics, scanned drawings, anything where the meaning lives in layout rather than text needs a different approach. That usually means multimodal models, OCR, or a structured database with proper metadata.

Real-time or transactional data. "What's the current stock level" or "what did this customer order yesterday" isn't a documents problem. It's a data problem, and a query against the right system will be faster and more accurate.

Highly contested interpretation. If a regulation's meaning is actively disputed and the firm's value lies in the human judgement of which reading is right, an AI summary is a liability rather than an asset. Use the system for retrieval. Keep the interpretation human.

What to expect

Implementation time: 3–6 weeks for a typical first build, depending on document volume and complexity.

Deployment options: Cloud-hosted by default. Can run entirely on your network with open-weight models for confidential data.

Infrastructure cost: Roughly £50–200 per month for cloud deployments, covering embeddings, storage, and query usage combined. On-premise deployments shift the cost to hardware and ops.

Typical query time: Under a minute end-to-end, compared with 15–30 minutes of unstructured searching.

Time recovered: McKinsey benchmarks suggest knowledge workers spend around 20% of their time searching. Well-deployed Q&A systems typically recover most of that.

Secondary benefits: Consistency across the team, faster junior onboarding, reduced reliance on tribal knowledge.

If this pattern fits your team

A Pare Audit is the way to find out whether it does, and what a delivery would look like in your specific situation. We spend a focused few days with you, look at the real documents and the real questions your team asks, and come back with a written recommendation, a scoped build, and a costed plan.

Find out for sure with a Pare Audit.