PoliticalPassport Case Study

Screenshot of a PoliticalPassport conversation, in which the agent asks a nuanced question about generational tax fairness after the user says 'the courts' — The conversational agent in action. Rather than asking flat multiple-choice questions, it presents concrete tensions and follows up on what the user actually said.

Most political profiling tools give you a dot on a chart. Left versus right. Authoritarian versus libertarian. You answer twenty multiple-choice questions and walk away with a label. PoliticalPassport was built because that's not what political identity actually looks like. Fixing it turned out to be an engineering problem.

The core issue is that multiple-choice questions impose the question designer's frame. They capture what someone thinks, not why. Two people can both "agree" that the NHS needs reform, one from a fiscal perspective, one because they've experienced poor care firsthand. The quiz treats them identically. Their actual political profiles are very different.

The solution was a conversational AI that explores views through natural dialogue. But that introduces a harder problem. How do you build an AI that holds a genuine conversation and produces structured, comparable data at the end of it?

18 Policy dimensions

3 Conversation stages

4 Specialised agents

8 UK parties compared

The agent problem

A conversational AI doing political profiling has to maintain dialogue, track which of 18 policy categories have been covered, extract scorable signal from free-form answers, and know when to move on. A single prompt can't carry all four jobs without compromising at least two of them.

This is fundamentally an agent problem: it requires planning, state tracking, and multi-step reasoning that a simple prompt-and-response exchange can't deliver. But the harder we looked at it, the clearer it became that a single monolithic agent doing all of this at once would be pulled in conflicting directions. A good conversationalist is curious and open. A good scorer is evaluative and precise. Those dispositions compete, and one prompt can't serve both well.

The answer was to separate concerns entirely across four specialised agents, each invoked at a different point in the user journey.

Four agents, four jobs

Each agent has its own prompt, output format, and model choice.

The conversational agent

This is the agent the user actually talks to. Its sole job is to have a good conversation, asking thoughtful, open-ended questions that draw out genuine views across the stage's policy categories. The prompt invests heavily in question quality. Rather than asking "what do you think about immigration?", it presents the real policy tension ("some argue we need more immigration to fill skills gaps; others say pressure on housing is already too high. Where do you land?"). This approach consistently produces richer signal than anything a quiz can capture.

To avoid a separate state-tracking API call, the agent wraps its output in two XML-style tags. The <response> tag contains the text shown to the user; the <covered> tag contains a JSON array of the categories the agent believes have sufficient signal. Both travel in a single call.

<response>Your conversational reply here.</response>
<covered>["economic_policy", "criminal_justice"]</covered>

The system prompt is rebuilt dynamically for every message, injecting the current stage, uncovered categories, and prior scores. The agent's behaviour shifts naturally as the conversation progresses. Early on it explores breadth; later it plugs gaps.
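A minimal sketch of what rebuilding the prompt per message could look like. The function name, the stage-to-category mapping, and the prompt wording are all assumptions made for illustration; the article does not list which categories belong to which stage.

```python
from typing import Dict, List

# Hypothetical stage layout, for illustration only.
STAGE_CATEGORIES: Dict[int, List[str]] = {
    1: ["economic_policy", "criminal_justice", "healthcare_social_care"],
}


def build_system_prompt(
    stage: int, covered: List[str], prior_scores: Dict[str, int]
) -> str:
    """Rebuild the system prompt from the current conversation state."""
    # Categories in this stage that still lack sufficient signal.
    uncovered = [c for c in STAGE_CATEGORIES[stage] if c not in covered]
    lines = [
        f"You are on conversation stage {stage}.",
        "Categories still needing signal: " + (", ".join(uncovered) or "none"),
    ]
    if prior_scores:
        scored = ", ".join(f"{k}={v}" for k, v in sorted(prior_scores.items()))
        lines.append("Scores from earlier stages: " + scored)
    lines.append(
        "Reply using <response>...</response> and <covered>[...]</covered> tags."
    )
    return "\n".join(lines)
```

Because the uncovered list shrinks as the conversation goes on, the same template moves from broad exploration to gap-filling without any extra logic in the agent itself.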

The scoring agent

When the user ends a stage, the full transcript is sent to a separate scoring agent. It reads the conversation as a historical document and produces a structured score for each category: a 0–100 position on a labelled spectrum, a confidence level, and an evidence summary grounded in the user's actual words.
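One way to hold that per-category output is a small validated record. This is a sketch under assumptions: the field names, the JSON shape, and the three confidence labels are hypothetical, since the article names the three components but not their exact format.

```python
import json
from dataclasses import dataclass
from typing import List

# Assumed confidence labels; the article does not specify the scale.
CONFIDENCE_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class CategoryScore:
    category: str
    position: int    # 0-100 position on the category's labelled spectrum
    confidence: str
    evidence: str    # summary grounded in the user's own words

    def __post_init__(self) -> None:
        # Reject out-of-range or unknown values before they reach storage.
        if not 0 <= self.position <= 100:
            raise ValueError(f"position out of range: {self.position}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.confidence}")


def parse_scores(raw: str) -> List[CategoryScore]:
    """Parse a JSON array of score objects, validating each one."""
    return [CategoryScore(**item) for item in json.loads(raw)]
```

Validating at the boundary means a malformed score fails loudly when the stage ends rather than surfacing later in the narrative or the party comparison.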

An earlier design had the conversational agent score in real time as it chatted. We abandoned this because it created an irreconcilable tension: the conversational persona wants to be non-judgmental, while the scorer needs to be precise and evaluative. Separating them produced both better conversations and more accurate scores.

The scoring rubrics are the most carefully engineered part of the system. Each of the 18 categories has 5–6 score bands with qualitative anchors and explicit "pushes higher / pushes lower" guidance. Early versions produced clustered scores (everything landing between 40 and 60) because the guidance wasn't specific enough. The band-level detail is what gives the model the resolution it needs to differentiate positions.

Excerpt from the Economic Policy rubric
Score	Position	Example signal
0–15	Strongly state-led	Widespread nationalisation, heavy redistribution, workers' ownership
16–35	Leans state-led	Selective nationalisation, progressive taxation, expanded public services
36–55	Mixed economy	Accepts both state and market roles, moderate on taxation
56–75	Leans free market	Trusts markets, lower taxation, targeted intervention only
76–100	Strongly free market	Minimal state, significant tax cuts, deregulation, privatisation

70% reduction in token usage, by including only the current stage's rubrics in the scoring prompt
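As a rough sketch, the rubric bands and the per-stage filtering could be represented like this. The Economic Policy bands are taken from the excerpt above; the second rubric, the stage mapping, and the function names are placeholders invented for illustration.

```python
from typing import Dict, List, Tuple

Band = Tuple[int, int, str]  # (lowest score, highest score, position label)

RUBRICS: Dict[str, List[Band]] = {
    "economic_policy": [
        (0, 15, "Strongly state-led"),
        (16, 35, "Leans state-led"),
        (36, 55, "Mixed economy"),
        (56, 75, "Leans free market"),
        (76, 100, "Strongly free market"),
    ],
    # Placeholder rubric so the stage filter has something to exclude.
    "criminal_justice": [(0, 100, "Placeholder band")],
}

# Hypothetical mapping of stages to categories.
STAGE_CATEGORIES: Dict[int, List[str]] = {
    1: ["economic_policy"],
    2: ["criminal_justice"],
}


def band_label(category: str, score: int) -> str:
    """Return the label of the band that contains the score."""
    for low, high, label in RUBRICS[category]:
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside every band for {category}")


def rubric_prompt_section(stage: int) -> str:
    """Render only the current stage's rubrics for the scoring prompt."""
    lines: List[str] = []
    for category in STAGE_CATEGORIES[stage]:
        lines.append(f"## {category}")
        for low, high, label in RUBRICS[category]:
            lines.append(f"{low}-{high}: {label}")
    return "\n".join(lines)
```

For example, `band_label("economic_policy", 78)` returns "Strongly free market", and `rubric_prompt_section(1)` leaves the stage 2 rubric out of the prompt entirely.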

The narrative agent

After scoring, a third agent synthesises the user's scores into a readable political profile, identifying patterns, tensions, and notable positions across all completed categories. The narrative is cached against a hash of the underlying scores, so it's generated at most once per score change. A user can view their passport a hundred times; the LLM is only called again when something actually changes.
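A minimal sketch of hash-keyed caching, assuming an in-memory dictionary and a hypothetical `generate` callable standing in for the LLM call. The production system presumably stores the hash and narrative alongside the user's record; that detail is not described in the article.

```python
import hashlib
import json
from typing import Callable, Dict


def scores_hash(scores: Dict[str, int]) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class NarrativeCache:
    def __init__(self, generate: Callable[[Dict[str, int]], str]) -> None:
        self._generate = generate          # stand-in for the narrative agent
        self._store: Dict[str, str] = {}   # hash of scores -> narrative

    def get(self, scores: Dict[str, int]) -> str:
        key = scores_hash(scores)
        if key not in self._store:
            # Only reached when this exact set of scores has not been seen.
            self._store[key] = self._generate(scores)
        return self._store[key]
```

Sorting the keys before hashing matters here: without it, two dictionaries holding identical scores in a different order could produce different hashes and trigger a needless regeneration.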

The party comparison agent

The fourth agent matches the user's scores against pre-defined profiles for 8 UK political parties, generating a ranked alignment with narrative explanations. Rather than making 8 separate API calls, it receives all party data in a single batch prompt and returns all 8 comparisons in one response. This agent also uses a smaller, cheaper model. Party comparison is less nuanced than the main scoring and doesn't require the same reasoning quality. At scale, this reduces the per-user cost of the comparison feature by approximately 90%.
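The batching idea can be sketched as a single prompt that carries every party profile at once. The function name, the instruction wording, and the payload shape are assumptions, and the party names used in the usage example are placeholders rather than the real profiles.

```python
import json
from typing import Dict


def build_comparison_prompt(
    user_scores: Dict[str, int],
    party_profiles: Dict[str, Dict[str, int]],
) -> str:
    """Build one prompt covering all parties instead of one prompt per party."""
    payload = {"user": user_scores, "parties": party_profiles}
    instruction = (
        "Compare the user's scores with every party below and return one "
        "JSON object per party, ranked by alignment."
    )
    # Instruction on the first line, the full data payload on the second.
    return instruction + "\n" + json.dumps(payload, sort_keys=True)
```

Called with eight entries in `party_profiles`, this yields one request where a per-party loop would make eight, and the shared instructions and user scores are sent once rather than repeated in every call.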

Streaming and real-time UX

LLM API calls take 2–10 seconds. In a traditional request-response model, the user stares at a spinner. We used Server-Sent Events (SSE) to stream the agent's reply to the frontend token-by-token, and more importantly to stream metadata alongside the text.

The stream carries three event types: the agent's message text, a covered event that updates the progress indicator in real time ("5 of 7 categories explored"), and a done event that re-enables the input. The progress ring therefore updates as the conversation happens, giving users a visible sense of forward momentum. That detail matters a lot for keeping people engaged through a multi-stage process.
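The three event types map naturally onto the SSE wire format, where each event is an `event:` line, a `data:` line, and a blank line. The sketch below is framework-independent; the `message` event name and the payload fields are assumptions, while `covered` and `done` follow the description above.

```python
import json
from typing import Iterable, Iterator, List


def sse_event(event: str, data: object) -> str:
    """Format one Server-Sent Event; the blank line terminates the event."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_reply(
    tokens: Iterable[str], covered: List[str], total: int
) -> Iterator[str]:
    """Yield message chunks, then the coverage update, then the done signal."""
    for token in tokens:
        yield sse_event("message", {"text": token})
    yield sse_event("covered", {"covered": covered, "total": total})
    yield sse_event("done", {})
```

A web framework's streaming response can iterate over `stream_reply(...)` directly, and the frontend can attach one listener per event name to update the text, the progress ring, and the input state separately.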

PoliticalPassport dashboard showing an 18/18 progress ring and buttons to view the passport, compare against parties, or invite friends — The dashboard. A progress ring updates in real time as the agent tracks category coverage across all three conversation stages.

Production reliability

LLMs are non-deterministic. The same prompt can return well-formed XML one time and bare JSON the next. Building for production means accepting this and designing around it. The output parser uses a three-layer strategy: primary extraction from XML tags, a fallback that searches for bare JSON blocks, and graceful degradation that treats the full output as the message if both fail. The conversation never crashes; it logs the anomaly and continues.
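The three layers could be sketched as follows. The regular expressions, the function name, and the assumed shape of the bare JSON fallback (an object with `response` and `covered` keys) are illustrative guesses, not the production parser.

```python
import json
import logging
import re
from typing import List, Tuple

log = logging.getLogger("agent_output_parser")

RESPONSE_RE = re.compile(r"<response>(.*?)</response>", re.DOTALL)
COVERED_RE = re.compile(r"<covered>(.*?)</covered>", re.DOTALL)
JSON_OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL)


def parse_agent_output(raw: str) -> Tuple[str, List[str]]:
    """Return (message, covered categories) without ever raising."""
    # Layer 1: the expected XML-style tags.
    response = RESPONSE_RE.search(raw)
    if response:
        covered: List[str] = []
        covered_match = COVERED_RE.search(raw)
        if covered_match:
            try:
                parsed = json.loads(covered_match.group(1))
                covered = parsed if isinstance(parsed, list) else []
            except json.JSONDecodeError:
                log.warning("covered tag did not contain valid JSON")
        return response.group(1).strip(), covered

    # Layer 2: a bare JSON object somewhere in the output.
    candidate = JSON_OBJECT_RE.search(raw)
    if candidate:
        try:
            obj = json.loads(candidate.group(0))
            if isinstance(obj, dict) and "response" in obj:
                log.warning("agent returned bare JSON instead of tags")
                return str(obj["response"]), list(obj.get("covered", []))
        except json.JSONDecodeError:
            pass

    # Layer 3: degrade gracefully and show the raw text as the message.
    log.warning("unparseable agent output; using raw text as the message")
    return raw.strip(), []
```

In the worst case the user sees the raw reply and the progress indicator simply does not advance on that turn, which is a much milder failure than a broken conversation.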

Conversation state is persisted in the database rather than held in memory. A user can close the browser, come back three days later, and resume exactly where they left off, with the correct stage, the correct covered categories, and the correct prior scores all restored and injected into the next prompt automatically.
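A minimal sketch of database-backed state, using SQLite from the Python standard library purely for illustration. The table name, columns, and function names are assumptions; the article does not say which database or schema the system uses.

```python
import json
import sqlite3
from typing import Dict, List, Optional, Tuple

State = Tuple[int, List[str], Dict[str, int]]  # (stage, covered, prior scores)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the state table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conversation_state ("
        "user_id TEXT PRIMARY KEY, stage INTEGER NOT NULL, "
        "covered TEXT NOT NULL, prior_scores TEXT NOT NULL)"
    )
    return conn


def save_state(
    conn: sqlite3.Connection,
    user_id: str,
    stage: int,
    covered: List[str],
    prior_scores: Dict[str, int],
) -> None:
    """Insert or overwrite the user's state after each message."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO conversation_state VALUES (?, ?, ?, ?)",
            (user_id, stage, json.dumps(covered), json.dumps(prior_scores)),
        )


def load_state(conn: sqlite3.Connection, user_id: str) -> Optional[State]:
    """Return the saved state, or None for a user with no history."""
    row = conn.execute(
        "SELECT stage, covered, prior_scores FROM conversation_state "
        "WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1]), json.loads(row[2])
```

Whatever `load_state` returns can be passed straight into the prompt-building step, so resuming after three days follows the same code path as sending the next message in a live session.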

The output

The agent pipeline produces a multi-layered political profile that goes far beyond what a traditional quiz could offer. Each of the 18 categories gets a position on a labelled spectrum, a confidence level, and an evidence summary quoting the user's own reasoning. The scores feed into a radar overlay against 8 UK parties, showing alignment at a glance.

The party comparison view. The user's profile is overlaid on 18 axes, with party positions shown as coloured dots.

Two scored category cards (Economic Policy at 78/100 and Healthcare & Social Care at 65/100), each with a labelled spectrum, confidence level, and evidence summary — Per-category results. Every score comes with a confidence level and an evidence summary explaining the placement.

What we learned

Most of the work wasn't prompt-writing. It was deciding how agents hand off, where state lives, which model handles which job, and how to keep costs sensible across multiple LLM calls per session.

Separating concerns let us improve the conversational agent's questions and the scoring agent's rubrics independently. Changes to one didn't ripple into the other, which is what made it possible to keep tuning the system after launch without regressions.

Policy dimensions mapped: 2 (left-right axis) → 18

Scoring token usage: full prompt every time → 70% reduction

Party comparison cost: 8 separate LLM calls → ~90% reduction

Narrative generation: per page view → per score change
