A large-state Senate office is not a communications operation — it is a signal-processing organization that also does communications. Tens of thousands of constituent contacts per week. Dozens of bills per legislative day. Press inquiries, casework, social mentions, meeting requests arriving faster than any 60-person staff can absorb. Most of this data is read once, then disappears.
This demo shows what changes when the office runs its own AI infrastructure — on its own hardware, inside the Senate enclave, with constituent PII that never leaves the building.
Two data sources, two different stories:
The first source is legislative data: an initial corpus ingested directly from the Congress.gov public data API, covering bill text, sponsors, cosponsors, action history, policy areas, and committee assignments. Each bill summary is chunked, embedded with the bge-m3 model (1024-dimensional vectors), and stored in pgvector behind an HNSW index for semantic search.
Daily auto-refresh is planned; for now the corpus covers the 117th and 118th Congresses as of the initial ingest.
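A minimal sketch of that ingest path, assuming psycopg plus the pgvector and sentence-transformers Python packages; the bill_chunks schema, chunk sizes, and bill ID are illustrative, not the demo's actual code:

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # 1024-dim dense embeddings

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS bill_chunks (
    id        bigserial PRIMARY KEY,
    bill_id   text NOT NULL,
    chunk     text NOT NULL,
    embedding vector(1024)
);
CREATE INDEX IF NOT EXISTS bill_chunks_hnsw
    ON bill_chunks USING hnsw (embedding vector_cosine_ops);
"""

def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Fixed-width character chunking with overlap (illustrative numbers)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

def ingest(conn: psycopg.Connection, bill_id: str, summary: str) -> None:
    """Chunk one bill summary, embed each chunk, and store the vectors."""
    pieces = chunk(summary)
    vectors = model.encode(pieces, normalize_embeddings=True)
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO bill_chunks (bill_id, chunk, embedding)"
            " VALUES (%s, %s, %s)",
            [(bill_id, p, v) for p, v in zip(pieces, vectors)],
        )

with psycopg.connect("dbname=legis") as conn:
    conn.execute(DDL)
    register_vector(conn)  # adapts numpy vectors to the pgvector type
    ingest(conn, "hr-1234-118", "Bill summary text from the API goes here.")
```

HNSW trades slower index builds for fast approximate nearest-neighbor lookups, which suits a read-heavy corpus that only grows on ingest.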
The second source is synthetic constituent casework, generated to reflect the realistic volume, geographic spread, and casework complexity of a large Maryland office. Every name, address, phone number, and case detail is fictitious. Each generated document (outcome letter, incoming letter, email, phone screen, walk-in form, web-form submission) is rendered as a real PDF and stored on disk, the same way scanned office records would exist in a real deployment.
Generated 2026-05-10 and 2026-05-11, and watermarked "MOCK DATA — FOR TESTING ONLY" on every PDF.
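A sketch of how one of those PDFs could be rendered, assuming reportlab as the renderer (the text above doesn't name one); the output path and letter body are placeholders, while the watermark string is the real one:

```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def render_mock_letter(path: str, body: str) -> None:
    """Render a fictitious casework document as a watermarked PDF."""
    c = canvas.Canvas(path, pagesize=letter)
    width, height = letter

    # Diagonal gray watermark drawn underneath the document text.
    c.saveState()
    c.setFont("Helvetica-Bold", 40)
    c.setFillGray(0.85)
    c.translate(width / 2, height / 2)
    c.rotate(45)
    c.drawCentredString(0, 0, "MOCK DATA — FOR TESTING ONLY")
    c.restoreState()

    # Document body, one line at a time.
    c.setFont("Helvetica", 11)
    y = height - 72
    for line in body.splitlines():
        c.drawString(72, y, line)
        y -= 14
    c.showPage()
    c.save()

render_mock_letter(
    "mock/outcome_letter_0001.pdf",
    "Dear Ms. Example,\nYour casework inquiry has been resolved.\nSincerely,",
)
```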
The hardware is a commodity Linux file server (PostgreSQL, pgvector, FastAPI, the ingest pipelines, the React UI) plus an NVIDIA GB10 Superchip with 128GB of unified memory for AI inference. Three local models: bge-m3 for embeddings, Qwen 2.5 14B for classification and NL→SQL, and Llama 3.3 70B (planned) for drafting and long-form briefings. The total inference stack is ~50GB, leaving 78GB of headroom.
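The memory budget is easy to sanity-check. In the sketch below, the per-model figures are illustrative assumptions (quantized weights plus a KV-cache allowance); only the ~50GB total and the 78GB headroom come from the paragraph above:

```python
UNIFIED_MEMORY_GB = 128  # NVIDIA GB10 unified memory

# Assumed per-model footprints; the demo only states the ~50GB total.
stack_gb = {
    "bge-m3 (embeddings)": 3,
    "Qwen 2.5 14B (classification, NL->SQL)": 12,
    "Llama 3.3 70B (planned, drafting)": 35,
}

total = sum(stack_gb.values())
print(f"stack ~{total}GB, headroom ~{UNIFIED_MEMORY_GB - total}GB")
# -> stack ~50GB, headroom ~78GB
```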
Open-source throughout. No cloud, no vendor lock-in, no per-seat licensing.
Finally, a separate microservice watches public, free data sources every two hours (Congress.gov bill actions and floor votes, govinfo committee hearings, news headlines, Bluesky social posts) and turns each new observation into candidate question-and-SQL training pairs. Validated pairs flow automatically into the search index, so the demo gets measurably smarter the longer it runs without anyone in the loop.
Five data sources are active, polled autonomously every two hours via a systemd timer. Pairs are auto-validated against live data and added only if the SQL returns sensible results.
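A minimal sketch of that validation gate, under stated assumptions: candidate SQL runs read-only against the live database with a statement timeout, and a pair is indexed only if its query executes and returns a non-empty, bounded result. The SELECT-only check, 5-second timeout, and row bound are invented for illustration:

```python
import psycopg

MAX_ROWS = 10_000  # assumed sanity bound on result size

def validate_sql(dsn: str, sql: str) -> bool:
    """Return True if a generated question->SQL pair should be indexed."""
    if not sql.lstrip().lower().startswith("select"):
        return False  # only read-only queries are eligible
    try:
        with psycopg.connect(dsn, options="-c statement_timeout=5000") as conn:
            conn.read_only = True  # belt and suspenders against writes
            rows = conn.execute(sql).fetchall()
    except psycopg.Error:
        return False  # malformed or failing SQL never reaches the index
    return 0 < len(rows) <= MAX_ROWS
```

Pairs that fail are simply dropped, so a bad generation costs nothing but the attempt.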