# Agentic Voice Assistant

> A voice-controlled assistant that listens to what you say, figures out the best way to answer it, and speaks back — routing your query through RAG, live web search, or a structured data store depending on what you asked. Built with semantic routing (no LLM needed for routing decisions), three independent MCP servers, and two-layer agent memory that actually remembers what you said earlier in the conversation.

Enterprise support desks, internal knowledge bases, and customer-facing
voice agents all share the same problem: a single LLM call can't
reliably answer every query type — knowledge questions need grounded
retrieval, current events need live web access, and structured lookups
need deterministic data. This assistant solves that by routing each
spoken query to the right tool automatically, at roughly 2.4s warm
end-to-end latency with the routing decision itself under 30ms.
**The target use case is any team that needs a voice interface over
mixed data sources without building separate systems for each one.**

---
## What it looks like
### All three agents working in one session

*Turn 1 asked about the Golden Visa policy → routed to RAG, answered from CRAG documents*
*Turn 2 asked about AI news today → routed to Web, DuckDuckGo returned live results*
*Turn 3 asked about the tech stack → routed to Data, answered from the structured store*
### Cross-turn memory working

*Turn 4 said "tell me more about the first thing I asked" — the agent correctly recalled Turn 1 without the user repeating anything. That's FAISS semantic memory doing its job.*

---
## Architecture
```
BROWSER (microphone)
│ raw PCM audio — 16kHz, 16-bit, mono — over WebSocket
▼
FASTAPI WebSocket Server (/ws/voice)
│
▼
DEEPGRAM Nova-2 ────────────────────────────── [asr_ms]
│ streaming transcription, sub-300ms
▼
QUERY REWRITER (Haiku, max 80 tokens) ──────── [rewrite_ms]
│ breaks vague queries into 2-3 specific sub-queries
│ knows current date so sub-queries are time-aware
▼
SEMANTIC ROUTER (FAISS cosine similarity) ───── [route_ms < 30ms]
│ no LLM call — pure local vector math
│ domain embeddings pre-computed at startup
▼
LANGGRAPH ORCHESTRATOR
│ ◄── LangGraph MemorySaver (session state)
│ ◄── FAISS semantic memory (cross-turn recall)
│
├── KNOWLEDGE ──► RAG MCP Server (port 8001)
│ wraps CRAG /ask endpoint
│ Tavily fallback lives inside CRAG only
│
├── REALTIME ──► Web Search MCP Server (port 8002)
│ DuckDuckGo — no API key, independent from Tavily
│
└── STRUCTURED ──► Data Lookup MCP Server (port 8003)
local JSON store — no external deps
│
▼
SYNTHESIZE NODE (Haiku streaming) ──────────── [llm_ttft_ms]
│ tool result → 2-3 spoken sentences
▼
ELEVENLABS eleven_turbo_v2_5 (streaming TTS) ─ [tts_ttfb_ms]
│ base64 MP3 chunks streamed as they arrive
▼
BROWSER (plays audio + renders latency dashboard)
Total = asr_ms + rewrite_ms + route_ms + llm_ttft_ms + tts_ttfb_ms
```
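The bracketed markers map to the five timing fields the browser dashboard reports. A minimal sketch of that accounting, using the field names from the diagram with illustrative (not measured) values:

```python
# Sum the five dashboard timing fields into the end-to-end figure.
# Field names come from the diagram; the values are illustrative.

def total_latency_ms(timings: dict[str, float]) -> float:
    components = ["asr_ms", "rewrite_ms", "route_ms", "llm_ttft_ms", "tts_ttfb_ms"]
    return sum(timings[c] for c in components)

timings = {
    "asr_ms": 300.0,
    "rewrite_ms": 600.0,  # warm Haiku call
    "route_ms": 25.0,
    "llm_ttft_ms": 600.0,
    "tts_ttfb_ms": 300.0,
}
print(total_latency_ms(timings))  # 1825.0
```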
---
## Routing Benchmark
Ran 30 hand-labelled queries (10 per domain) through the semantic router. No LLM involved in routing — just FAISS cosine similarity between query embeddings and domain description embeddings.

| Domain | Correct | Total | Accuracy |
|--------|---------|-------|----------|
| RAG | 7 | 10 | 70.0% |
| Web | 10 | 10 | 100.0% |
| Data | 10 | 10 | 100.0% |
| **Overall** | **27** | **30** | **90.0%** |
The 3 misrouted RAG queries were genuinely ambiguous — "technical guidelines for API integration" and "data retention policy" both look like structured data questions to a vector classifier. That's an honest limitation, not a bug. Full report with confusion matrix: [evaluation/results/routing_report.md](evaluation/results/routing_report.md)
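
For reference, the tally behind the table is straightforward. A sketch with the reported counts hard-coded as (expected, predicted) pairs; the real labelled queries live under `evaluation/`:

```python
# Tally per-domain routing accuracy from (expected, predicted) pairs.
# The pairs below reproduce the reported counts, not the real queries.
from collections import Counter

results = ([("rag", "rag")] * 7 + [("rag", "data")] * 3
           + [("web", "web")] * 10 + [("data", "data")] * 10)

correct = Counter(exp for exp, pred in results if exp == pred)
total = Counter(exp for exp, _ in results)
for domain in ("rag", "web", "data"):
    pct = 100 * correct[domain] / total[domain]
    print(f"{domain}: {correct[domain]}/{total[domain]} = {pct:.1f}%")
```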

---
## Latency Breakdown
| Component | Typical | Notes |
|-----------|---------|-------|
| ASR — Deepgram Nova-2 | ~300ms | Streaming WebSocket, measured from speech end |
| Query rewriting — Haiku | ~1000ms | Cold start is slower; warm calls ~600ms |
| Semantic routing — FAISS | **<30ms** | Zero API cost, local vector math |
| Tool call via MCP | 200–500ms | RAG can be 30-40s on CRAG cold start |
| LLM synthesis TTFT — Haiku | ~600ms | Streaming, first token |
| TTS TTFB — ElevenLabs turbo | ~300ms | Free tier |
| **End-to-end** | **~2400ms** | Warm system, non-RAG query |
The routing decision alone is under 30ms. That's the key number — routing is a classification problem and FAISS solves it faster and cheaper than any LLM call would.

---
## What this project covers
| Gap | What I built |
|-----|-------------|
| Multi-agent coordination | LangGraph StateGraph with 3 specialist agents, each an independent MCP server |
| MCP integration | 3 MCP servers — rag, websearch, data — independently startable, fault-isolated with separate circuit breakers |
| Semantic routing | Replaced the LLM routing call with FAISS cosine similarity — under 30ms instead of an API round-trip, zero cost per turn |
| Agent memory | Two layers: LangGraph MemorySaver for session state + FAISS semantic memory for cross-turn recall |
| Real-time streaming | 5-component latency dashboard: asr_ms, rewrite_ms, route_ms, llm_ttft_ms, tts_ttfb_ms |
| Production engineering | Circuit breaker per service, graceful degradation, replay mode, /health endpoint, CI/CD |
---
## Design decisions
**Semantic routing instead of an LLM routing call**
The original plan used a Haiku call with max 10 tokens to classify queries. That adds ~100ms and costs money on every single turn. Routing is a classification problem — you're asking "which bucket does this belong to?" FAISS cosine similarity between the query embedding and pre-computed domain description embeddings answers that in under 30ms at zero API cost. I also added a query rewriting step before routing so ambiguous inputs get decomposed into specific sub-queries first, which meaningfully improves accuracy on edge cases.
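
The idea in miniature, assuming toy vectors in place of real model embeddings (the actual router holds normalized domain-description embeddings in a FAISS index, where inner product equals cosine similarity):

```python
# Route a query by cosine similarity against pre-computed domain vectors.
# Toy vectors stand in for real embedding-model output.
import math

DOMAINS = ["knowledge", "realtime", "structured"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def route(query_vec, domain_vecs):
    scores = [cosine(query_vec, d) for d in domain_vecs]
    return DOMAINS[scores.index(max(scores))]

# Toy domain embeddings: in the real system these are embeddings of the
# domain descriptions, computed once at startup.
domain_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(route([0.1, 0.9, 0.2], domain_vecs))  # realtime
```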
**DuckDuckGo in the web server, Tavily only inside CRAG**
If both the web search server and the CRAG system used Tavily, a single Tavily outage would break both paths simultaneously — that's a hidden shared dependency that defeats the whole point of having separate MCP servers. DuckDuckGo in the web server means each server has a genuinely independent external dependency. It also requires zero API key. The tradeoff is raw results vs Tavily's LLM-optimized summaries, which is fine for this use case.
**Three separate MCP servers instead of one**
Fault isolation. If the RAG server is down, web and data queries still work. The circuit breaker tracks failures per service independently. After 3 consecutive RAG failures the RAG circuit opens and queries fall back gracefully — web and data are completely unaffected. In production each server would run as a persistent HTTP service on separate infrastructure.
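
A minimal sketch of the per-service breaker logic described above; the class and service names are illustrative, not the project's actual API:

```python
# One breaker per MCP service: after `threshold` consecutive failures
# the circuit opens and calls to that service are skipped, so a broken
# RAG server cannot affect web or data queries.

class CircuitBreaker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        # Any success resets the count; failures must be consecutive.
        self.failures = 0 if success else self.failures + 1

breakers = {name: CircuitBreaker() for name in ("rag", "websearch", "data")}

for _ in range(3):  # three consecutive RAG failures
    breakers["rag"].record(success=False)

print(breakers["rag"].open, breakers["websearch"].open)  # True False
```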
**Two memory layers**
Short-term: LangGraph MemorySaver checkpointer persists the full AgentState between turns for the same session. Long-term: FAISS semantic memory stores embeddings of past turns and retrieves the most semantically relevant ones before synthesis. This means the agent handles "tell me more about what you said earlier" correctly even when that was several turns back.
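
A toy sketch of the two layers, with a bag-of-words similarity standing in for the real FAISS embeddings and a plain list standing in for MemorySaver's checkpoint:

```python
# Layer 1: full turn history per session (stand-in for MemorySaver).
# Layer 2: similarity-ranked recall of past turns (stand-in for FAISS).
# embed() is a toy bag-of-words set, not a real embedding model.

def embed(text: str) -> set[str]:
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0  # Jaccard stand-in

session_state: list[str] = []
semantic_store: list[tuple[set[str], str]] = []

def remember(turn: str) -> None:
    session_state.append(turn)
    semantic_store.append((embed(turn), turn))

def recall(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(semantic_store, key=lambda e: similarity(q, e[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

remember("What is the Golden Visa policy?")
remember("Any AI news today?")
print(recall("tell me more about the Golden Visa"))
```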

---
## Known limitations
**RAG cold start is slow** — CRAG loads the sentence-transformers model on first request which takes 30-40 seconds. Subsequent requests are under 5 seconds. Warm up by hitting the CRAG `/health` endpoint before your first query.
**TTS free tier is unreliable** — ElevenLabs free tier has character limits and occasional IP-level blocks. For a reliable demo buy the $5/month Starter plan, record the demo, then cancel.

**TTS TTFB shows 0ms**
[truncated…]