# Agentic Voice Assistant

> A voice-controlled assistant that listens to what you say, figures out the best way to answer it, and speaks back — routing your query through RAG, live web search, or a structured data store depending on what you asked. Built with semantic routing (no LLM needed for routing decisions), three independent MCP servers, and two-layer agent memory that actually remembers what you said earlier in the conversation.

Enterprise support desks, internal knowledge bases, and customer-facing
voice agents all share the same problem: a single LLM call can't
reliably answer every query type — knowledge questions need grounded
retrieval, current events need live web access, and structured lookups
need deterministic data. This assistant solves that by routing each
spoken query to the right tool automatically, at roughly 2.4s warm
end-to-end latency with the routing decision itself under 30ms.
**The target use case is any team that needs a voice interface over
mixed data sources without building separate systems for each one.**

---
## What it looks like
### All three agents working in one session

*Turn 1 asked about the Golden Visa policy → routed to RAG, answered from CRAG documents*
*Turn 2 asked about AI news today → routed to Web, DuckDuckGo returned live results*
*Turn 3 asked about the tech stack → routed to Data, answered from the structured store*
### Cross-turn memory working

*Turn 4 said "tell me more about the first thing I asked" — the agent correctly recalled Turn 1 without the user repeating anything. That's FAISS semantic memory doing its job.*

---
## Architecture
```
BROWSER (microphone)
│ raw PCM audio — 16kHz, 16-bit, mono — over WebSocket
▼
FASTAPI WebSocket Server (/ws/voice)
│
▼
DEEPGRAM Nova-2 ────────────────────────────── [asr_ms]
│ streaming transcription, sub-300ms
▼
QUERY REWRITER (Haiku, max 80 tokens) ──────── [rewrite_ms]
│ breaks vague queries into 2-3 specific sub-queries
│ knows current date so sub-queries are time-aware
▼
SEMANTIC ROUTER (FAISS cosine similarity) ───── [route_ms < 30ms]
│ no LLM call — pure local vector math
│ domain embeddings pre-computed at startup
▼
LANGGRAPH ORCHESTRATOR
│ ◄── LangGraph MemorySaver (session state)
│ ◄── FAISS semantic memory (cross-turn recall)
│
├── KNOWLEDGE ──► RAG MCP Server (port 8001)
│ wraps CRAG /ask endpoint
│ Tavily fallback lives inside CRAG only
│
├── REALTIME ──► Web Search MCP Server (port 8002)
│ DuckDuckGo — no API key, independent from Tavily
│
└── STRUCTURED ──► Data Lookup MCP Server (port 8003)
local JSON store — no external deps
│
▼
SYNTHESIZE NODE (Haiku streaming) ──────────── [llm_ttft_ms]
│ tool result → 2-3 spoken sentences
▼
ELEVENLABS eleven_turbo_v2_5 (streaming TTS) ─ [tts_ttfb_ms]
│ base64 MP3 chunks streamed as they arrive
▼
BROWSER (plays audio + renders latency dashboard)
Total = asr_ms + rewrite_ms + route_ms + llm_ttft_ms + tts_ttfb_ms
```
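The bracketed markers map to the five timing fields the browser dashboard reports. A minimal sketch of that accounting, using the field names from the diagram with illustrative (not measured) values:

```python
# Sum the five dashboard timing fields into the end-to-end figure.
# Field names come from the diagram; the values are illustrative.

def total_latency_ms(timings: dict[str, float]) -> float:
    components = ["asr_ms", "rewrite_ms", "route_ms", "llm_ttft_ms", "tts_ttfb_ms"]
    return sum(timings[c] for c in components)

timings = {
    "asr_ms": 300.0,
    "rewrite_ms": 600.0,  # warm Haiku call
    "route_ms": 25.0,
    "llm_ttft_ms": 600.0,
    "tts_ttfb_ms": 300.0,
}
print(total_latency_ms(timings))  # 1825.0
```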
---
## Routing Benchmark
Ran 30 hand-labelled queries (10 per domain) through the semantic router. No LLM involved in routing — just FAISS cosine similarity between query embeddings and domain description embeddings.

| Domain | Correct | Total | Accuracy |
|--------|---------|-------|----------|
| RAG | 7 | 10 | 70.0% |
| Web | 10 | 10 | 100.0% |
| Data | 10 | 10 | 100.0% |
| **Overall** | **27** | **30** | **90.0%** |
The 3 misrouted RAG queries were genuinely ambiguous — "technical guidelines for API integration" and "data retention policy" both look like structured data questions to a vector classifier. That's an honest limitation, not a bug. Full report with confusion matrix: [evaluation/results/routing_report.md](evaluation/results/routing_report.md)
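
For reference, the tally behind the table is straightforward. A sketch with the reported counts hard-coded as (expected, predicted) pairs; the real labelled queries live under `evaluation/`:

```python
# Tally per-domain routing accuracy from (expected, predicted) pairs.
# The pairs below reproduce the reported counts, not the real queries.
from collections import Counter

results = ([("rag", "rag")] * 7 + [("rag", "data")] * 3
           + [("web", "web")] * 10 + [("data", "data")] * 10)

correct = Counter(exp for exp, pred in results if exp == pred)
total = Counter(exp for exp, _ in results)
for domain in ("rag", "web", "data"):
    pct = 100 * correct[domain] / total[domain]
    print(f"{domain}: {correct[domain]}/{total[domain]} = {pct:.1f}%")
```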

---
## Latency Breakdown
| Component | Typical | Notes |
|-----------|---------|-------|
| ASR — Deepgram Nova-2 | ~300ms | Streaming WebSocket, measured from speech end |
| Query rewriting — Haiku | ~1000ms | Cold start is slower; warm calls ~600ms |
| Semantic routing — FAISS | **<30ms** | Zero API cost, local vector math |
| Tool call via MCP | 200–500ms | RAG can be 30-40s on CRAG cold start |
| LLM synthesis TTFT — Haiku | ~600ms | Streaming, first token |
| TTS TTFB — ElevenLabs turbo | ~300ms | Free tier |
| **End-to-end** | **~2400ms** | Warm system, non-RAG query |
The routing decision alone is under 30ms. That's the key number — routing is a classification problem and FAISS solves it faster and cheaper than any LLM call would.

---
## What this project covers
| Gap | What I built |
|-----|-------------|
| Multi-agent coordination | LangGraph StateGraph with 3 specialist agents, each an independent MCP server |
| MCP integration | 3 MCP servers — rag, websearch, data — independently startable, fault-isolated with separate circuit breakers |
| Semantic routing | Replaced the LLM routing call with FAISS cosine similarity — under 30ms instead of an API round-trip, zero cost per turn |
| Agent memory | Two layers: LangGraph MemorySaver for session state + FAISS semantic memory for cross-turn recall |
| Real-time streaming | 5-component latency dashboard: asr_ms, rewrite_ms, route_ms, llm_ttft_ms, tts_ttfb_ms |
| Production engineering | Circuit breaker per service, graceful degradation, replay mode, /health endpoint, CI/CD |
---
## Design decisions
**Semantic routing instead of an LLM routing call**
The original plan used a Haiku call with max 10 tokens to classify queries. That adds ~100ms and costs money on every single turn. Routing is a classification problem — you're asking "which bucket does this belong to?" FAISS cosine similarity between the query embedding and pre-computed domain description embeddings answers that in under 30ms at zero API cost. I also added a query rewriting step before routing so ambiguous inputs get decomposed into specific sub-queries first, which meaningfully improves accuracy on edge cases.
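
The idea in miniature, assuming toy vectors in place of real model embeddings (the actual router holds normalized domain-description embeddings in a FAISS index, where inner product equals cosine similarity):

```python
# Route a query by cosine similarity against pre-computed domain vectors.
# Toy vectors stand in for real embedding-model output.
import math

DOMAINS = ["knowledge", "realtime", "structured"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def route(query_vec, domain_vecs):
    scores = [cosine(query_vec, d) for d in domain_vecs]
    return DOMAINS[scores.index(max(scores))]

# Toy domain embeddings: in the real system these are embeddings of the
# domain descriptions, computed once at startup.
domain_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(route([0.1, 0.9, 0.2], domain_vecs))  # realtime
```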
**DuckDuckGo in the web server, Tavily only inside CRAG**
If both the web search server and the CRAG system used Tavily, a single Tavily outage would break both paths simultaneously — that's a hidden shared dependency that defeats the whole point of having separate MCP servers. DuckDuckGo in the web server means each server has a genuinely independent external dependency. It also requires zero API key. The tradeoff is raw results vs Tavily's LLM-optimized summaries, which is fine for this use case.
**Three separate MCP servers instead of one**
Fault isolation. If the RAG server is down, web and data queries still work. The circuit breaker tracks failures per service independently. After 3 consecutive RAG failures the RAG circuit opens and queries fall back gracefully — web and data are completely unaffected. In production each server would run as a persistent HTTP service on separate infrastructure.
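
A minimal sketch of the per-service breaker logic described above; the class and service names are illustrative, not the project's actual API:

```python
# One breaker per MCP service: after `threshold` consecutive failures
# the circuit opens and calls to that service are skipped, so a broken
# RAG server cannot affect web or data queries.

class CircuitBreaker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        # Any success resets the count; failures must be consecutive.
        self.failures = 0 if success else self.failures + 1

breakers = {name: CircuitBreaker() for name in ("rag", "websearch", "data")}

for _ in range(3):  # three consecutive RAG failures
    breakers["rag"].record(success=False)

print(breakers["rag"].open, breakers["websearch"].open)  # True False
```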
**Two memory layers**
Short-term: LangGraph MemorySaver checkpointer persists the full AgentState between turns for the same session. Long-term: FAISS semantic memory stores embeddings of past turns and retrieves the most semantically relevant ones before synthesis. This means the agent handles "tell me more about what you said earlier" correctly even when that was several turns back.
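
A toy sketch of the two layers, with a bag-of-words similarity standing in for the real FAISS embeddings and a plain list standing in for MemorySaver's checkpoint:

```python
# Layer 1: full turn history per session (stand-in for MemorySaver).
# Layer 2: similarity-ranked recall of past turns (stand-in for FAISS).
# embed() is a toy bag-of-words set, not a real embedding model.

def embed(text: str) -> set[str]:
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0  # Jaccard stand-in

session_state: list[str] = []
semantic_store: list[tuple[set[str], str]] = []

def remember(turn: str) -> None:
    session_state.append(turn)
    semantic_store.append((embed(turn), turn))

def recall(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(semantic_store, key=lambda e: similarity(q, e[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

remember("What is the Golden Visa policy?")
remember("Any AI news today?")
print(recall("tell me more about the Golden Visa"))
```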

---
## Known limitations
**RAG cold start is slow** — CRAG loads the sentence-transformers model on first request which takes 30-40 seconds. Subsequent requests are under 5 seconds. Warm up by hitting the CRAG `/health` endpoint before your first query.
**TTS free tier is unreliable** — ElevenLabs free tier has character limits and occasional IP-level blocks. For a reliable demo buy the $5/month Starter plan, record the demo, then cancel.

**TTS TTFB shows 0ms**
[truncated…]