Let’s be honest: evaluating voice agents is a dumpster fire. You train a RAG pipeline on Whisper + Llama-3 + custom voice synthesis, wire it up to a WebRTC client, and then… what? You hope it sounds natural? You run 30 manual test calls and log “seemed okay”? If you’ve ever tried to quantify voice agent performance beyond “it didn’t crash,” you’ve felt the pain. Enter EVA — not the anime robot, but ServiceNow’s open-source EVA (Evaluation of Voice Agents), a Python framework that actually measures voice agent quality end-to-end: speech → intent → response → speech synthesis → perceived fluency. And yes — it’s self-hostable, lightweight, and already at 94 GitHub stars (as of May 2024) with active commits — no vaporware.
What Is EVA and Why Does It Solve a Real Problem?
EVA isn’t another LLM benchmark. It’s a pipeline-aware evaluator designed for production voice agents — think customer support bots, voice-controlled dashboards, or IVR replacements. Unlike generic LLM eval tools (e.g., lm-eval or ragas), EVA injects real audio (WAV/MP3), transcribes it, routes it through your voice agent’s full stack (ASR → NLU → LLM → TTS), records the output audio, then scores it across four dimensions:
- ASR Accuracy (WER on ground-truth transcriptions)
- Intent Classification F1 (matches expected intent labels)
- Response Relevance & Coherence (LLM-as-a-judge + BLEU/ROUGE)
- TTS Naturalness (MOS estimation via lightweight neural MOS predictor — not requiring human raters)
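These four scores lean on standard metrics. WER, for instance, is just word-level edit distance (substitutions + deletions + insertions) divided by reference word count. A minimal sketch of the math, not EVA's internals:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N, via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("i want a refund", "i want refunded"))  # 2 edits / 4 words = 0.5
```

Note that a WER above 1.0 is possible (lots of insertions), which is why WER regressions are usually reported as deltas, not capped percentages.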
Here’s the kicker: EVA doesn’t require you to expose your voice agent’s internals. You define it as an HTTP endpoint (e.g., POST /chat), and EVA handles the rest — including audio I/O, timing, and parallel test execution.
That said — it’s not a voice agent builder. It won’t replace Whisper or Piper. It’s pure evaluation infrastructure. Think of it like pytest for voice: boring until your bot starts mishearing “refund” as “refunded” at 3 a.m.
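For context, "HTTP-only" just means the evaluator needs a request/response contract. The article confirms the POST /chat path; everything else below (JSON field names, base64-encoded audio) is my assumed shape of such a contract, sketched with stdlib urllib:

```python
# Sketch of the HTTP contract an evaluator like EVA needs from your agent.
# Only the POST /chat path comes from the article; the JSON field names
# (audio_b64, sample_rate, transcript, reply_text, reply_audio_b64) are
# assumptions to illustrate the idea.
import base64
import json
import urllib.request

def build_chat_payload(wav_bytes: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw WAV bytes in a (hypothetical) JSON body for POST /chat."""
    return json.dumps({
        "audio_b64": base64.b64encode(wav_bytes).decode("ascii"),
        "sample_rate": sample_rate,
    }).encode("utf-8")

def call_agent(endpoint: str, wav_bytes: bytes, timeout: float = 30.0) -> dict:
    """POST one utterance to the agent and return its parsed JSON reply."""
    req = urllib.request.Request(
        endpoint,
        data=build_chat_payload(wav_bytes),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Usage would be something like `call_agent("http://voice-agent:8000/chat", open("tests/hello.wav", "rb").read())`. The point: if your agent can answer a request like this, EVA can test it without touching your code.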
Installing EVA Locally (No Docker Required)
EVA is pure Python (3.10+), and installing it is stupid simple — no CUDA, no heavy ML deps by default. I ran this on a 2021 M1 MacBook Air (8GB RAM) and it hummed.
```bash
# Create a clean venv — highly recommended
python3.10 -m venv ~/venvs/eva-env
source ~/venvs/eva-env/bin/activate

# Install core + optional audio deps
pip install eva-eval==0.2.1   # latest stable as of May 2024
pip install pydub librosa     # for audio preprocessing
```
That’s it. No `make build`, no `npm install`. The core package weighs in at ~14MB on disk — tiny compared to transformers-based eval suites.
Want TTS MOS scoring? Then add torch and torchaudio (but only if you want the neural MOS estimator):
```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
```
⚠️ Note: The MOS model (`eva.mos.MOSPredictor`) runs fine on CPU, but expect ~200ms per 5s audio clip. On my M1, it’s fine. On a low-end ARM SBC? Skip MOS and rely on WER + intent F1.
Running EVA with Docker Compose (Self-Hosted CI Flow)
I self-host EVA in a docker-compose.yml alongside my voice agent under test (a FastAPI + Whisper.cpp + Ollama setup). Here’s my production-ready compose file — stripped down but battle-tested:
```yaml
# docker-compose.eva.yml
version: '3.8'

services:
  eva-runner:
    image: python:3.10-slim
    volumes:
      - ./eva-tests:/workspace/tests
      - ./eva-config.yaml:/workspace/eva-config.yaml
    working_dir: /workspace
    command: >
      bash -c "
        pip install eva-eval==0.2.1 pydub librosa &&
        eva run --config eva-config.yaml --tests tests/
      "
    depends_on:
      - voice-agent
    networks:
      - voice-net

  voice-agent:
    image: my-voice-agent:latest
    ports:
      - "8000:8000"
    networks:
      - voice-net
    # your agent's env, volumes, etc.

networks:
  voice-net:
    driver: bridge
```
Key things this does:
- Mounts test audio files (`tests/`) and config outside the container — lets me iterate on test cases without rebuilding.
- Uses the `eva run` CLI (not the web UI — more on that in a sec).
- Runs EVA after `voice-agent` is healthy (you’ll want a healthcheck in prod).
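On that healthcheck point: `depends_on` alone only orders container startup, it doesn't wait for the agent to actually answer requests. A sketch of the extra stanzas, assuming your agent image ships curl and exposes a /health route (both are assumptions about your image; also note `condition: service_healthy` needs Compose v2):

```yaml
# Sketch: gate eva-runner on agent readiness, not just container start.
# Assumes your agent image has curl and a /health route; adjust both.
voice-agent:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 5s
    timeout: 3s
    retries: 12

eva-runner:
  depends_on:
    voice-agent:
      condition: service_healthy
```

Without this, the first few test clips can fail simply because Whisper or Ollama was still loading weights.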
Your `eva-config.yaml` looks like this:

```yaml
# eva-config.yaml
agent_endpoint: "http://voice-agent:8000/chat"
timeout: 30
concurrency: 4  # max parallel requests — tune based on your agent's capacity

asr:
  ground_truth_file: "tests/ground-truth.csv"  # format: audio_path,transcript,intent
  asr_model: "whisper-tiny"  # or "whisper-base", "faster-whisper"

tts:
  enable_mos: true
  mos_model: "mos_predictor_v1"

scoring:
  intent_labels: ["balance_inquiry", "refund_request", "tech_support"]
  llm_judge_model: "gpt-3.5-turbo-1106"  # or "llama3:8b" if you run Ollama locally
```
⚠️ Pro tip: `eva run` outputs CSV/JSON logs by default. I pipe them into Grafana via Loki — but even `cat results/eva-results-20240512.json | jq '.summary'` gives you instant pass/fail insight.
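Those JSON summaries are easy to turn into a hard CI gate. The summary keys below (`wer`, `intent_f1`, `mean_mos`) are my guess at the schema, not documented field names; inspect your own results file and rename accordingly:

```python
# CI gate sketch: fail the pipeline when the eval summary misses thresholds.
# The summary keys (wer, intent_f1, mean_mos) are assumed, not documented;
# adjust them to match your actual results JSON.
import json
import sys

THRESHOLDS = {"wer": 0.12, "intent_f1": 0.90, "mean_mos": 3.5}

def failures(summary: dict) -> list[str]:
    """Return human-readable threshold violations (empty list = pass)."""
    errs = []
    if summary["wer"] > THRESHOLDS["wer"]:              # lower is better
        errs.append(f"WER {summary['wer']:.3f} > {THRESHOLDS['wer']}")
    if summary["intent_f1"] < THRESHOLDS["intent_f1"]:  # higher is better
        errs.append(f"intent F1 {summary['intent_f1']:.3f} < {THRESHOLDS['intent_f1']}")
    if summary["mean_mos"] < THRESHOLDS["mean_mos"]:    # higher is better
        errs.append(f"MOS {summary['mean_mos']:.2f} < {THRESHOLDS['mean_mos']}")
    return errs

if __name__ == "__main__" and len(sys.argv) > 1:
    summary = json.load(open(sys.argv[1]))["summary"]
    problems = failures(summary)
    for p in problems:
        print("FAIL:", p)
    sys.exit(1 if problems else 0)
```

Run it as `python gate.py results/eva-results-20240512.json` in your CI step; a non-zero exit fails the build.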
EVA vs. the Alternatives: Why Not Just Use Ragas or Custom Scripts?
If you’re currently evaluating voice agents with homegrown scripts, you’re probably:
- Manually converting audio → text → intent → response → recording output → listening to 20 clips and scribbling “sounded robotic” in a Notion table.
Or worse — using generic LLM eval tools like ragas or trulens. Here’s how EVA beats them:
| Tool | Measures ASR WER? | Scores TTS Naturalness? | Requires Agent Code Access? | CLI + Config-Driven? |
|---|---|---|---|---|
| EVA | ✅ Yes, with Whisper/faster-whisper | ✅ MOS estimator (CPU-friendly) | ❌ No — HTTP-only | ✅ Yes, `eva run --config` |
| `ragas` | ❌ No audio support | ❌ No TTS/audio | ✅ Yes (needs LLM + retriever objects) | ❌ Python API only |
| `trulens` | ❌ No audio | ❌ No TTS | ✅ Yes (instrumentation required) | ❌ Notebook-first |
| Custom Bash + Whisper + curl | ⚠️ Possible, but brittle | ❌ Hand-rolled (if at all) | ✅ Yes (but you wrote it) | ⚠️ Ad-hoc, no config |
The real win? Reproducibility. With EVA, your QA engineer can run `eva run --config prod-eval.yaml` exactly the same way your CI pipeline does — no Python imports, no Jupyter kernel restarts.
That said: EVA won’t replace human listening tests for tone or empathy. It’s not trying to. It’s your automated smoke test — “does it understand and respond correctly at all?” before you waste 45 minutes on subjective listening.
Who Is This For? (Spoiler: Not Just Voice Engineers)
EVA is not for hobbyists building their first “Hey Jarvis” Alexa clone — unless you’re serious about testing. It’s for teams who:
- Run voice agents in production (e.g., telco IVRs, bank call centers, smart home hubs)
- Ship voice agent updates weekly and need regression guardrails
- Integrate with CI/CD (GitHub Actions, GitLab CI — I have a working `.github/workflows/eva.yml` if you ask)
- Self-host everything (no SaaS eval APIs, no sending audio to third parties)
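To give a flavor of the CI/CD point, here is a hypothetical minimal job (my sketch, not the author's workflow file), reusing the Compose setup from earlier:

```yaml
# Hypothetical minimal CI job (not the author's .github/workflows/eva.yml):
# spin up the agent, run EVA against it, fail the job on non-zero exit.
name: eva
on: [pull_request]
jobs:
  voice-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start agent and run EVA
        run: docker compose -f docker-compose.eva.yml up --exit-code-from eva-runner
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eva-results
          path: results/
```

`--exit-code-from` makes the job's pass/fail track the eva-runner container's exit code, which is exactly the regression guardrail behavior you want on every PR.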
It’s also quietly great for AI researchers comparing Whisper variants or TTS models — just point EVA at different endpoints and compare WER + MOS side-by-side.
Hardware-wise? You don’t need a GPU. My test rig:
- CPU: Intel i5-1135G7 (4c/8t)
- RAM: 16GB (EVA process maxes at ~650MB during 4-concurrent eval)
- Storage: <100MB (audio files excluded — those live in `./tests/`)
- Network: Just needs HTTP access to your agent. No internet required if you run offline models (e.g., `faster-whisper` + `piper`).
If your voice agent is Dockerized, EVA fits in the same docker network and adds ~300MB RAM overhead — negligible.
The Rough Edges: What EVA Doesn’t Do (Yet)
Let’s keep it real — EVA is promising, but it’s still early. As of v0.2.1 (May 2024), here’s what I’ve hit:
- No built-in web dashboard — `eva serve` exists but serves static HTML reports, not live metrics. It’s `index.html`, not Grafana. You can generate reports (`eva report --format html`), but it’s a one-off snapshot.
- No audio diffing UI — You get MOS scores, but no side-by-side audio player to A/B your old vs. new TTS. I hacked this with a simple Flask route serving `/audio/{test_id}/before.wav` and `/after.wav`.
- Limited TTS backends — Only Piper and eSpeak are documented. I got Coqui TTS working by wrapping it in a FastAPI proxy — but it’s not plug-and-play.
- Intent labeling friction — You must provide ground-truth intent labels. No unsupervised clustering or weak supervision (yet). If your agent handles 50+ intents, that CSV gets long.
- No multi-turn evaluation — EVA treats every audio clip as an isolated utterance. No session state, no conversation history injection. That’s fine for most IVR use cases, but limiting for assistant-style bots.
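One way to tame the ground-truth CSV friction: generate it. The `audio_path,transcript,intent` column format comes from the config above; the per-intent folder layout and sidecar `.txt` transcripts below are my own convention, not something EVA prescribes:

```python
# Generate tests/ground-truth.csv (audio_path,transcript,intent columns)
# from a directory tree. The layout is my convention, not EVA's:
#   tests/clips/<intent>/<clip>.wav  with a sidecar <clip>.txt transcript.
import csv
from pathlib import Path

def build_ground_truth(clips_root: Path, out_csv: Path) -> int:
    """Write one CSV row per (wav, txt) pair; return the number of rows."""
    rows = 0
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "transcript", "intent"])
        for wav in sorted(clips_root.glob("*/*.wav")):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                continue  # skip clips with no transcript yet
            intent = wav.parent.name  # folder name doubles as the intent label
            writer.writerow([str(wav), txt.read_text().strip(), intent])
            rows += 1
    return rows
```

With 50+ intents this keeps labels consistent by construction: a clip can't carry a typo'd intent because the label is the directory it lives in.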
Also — the docs are functional, not narrative. You’ll spend 10 minutes reading `eva --help` and the GitHub README. There’s no “Getting Started with Voice Agent Testing” tutorial. You’re expected to know your ASR/TTS stack.
So — Should You Deploy EVA Right Now?
Yes — if you’re shipping voice agents and tired of flying blind.
I’ve been running EVA in CI for 3 weeks against our RAG-powered support bot. It caught two regressions:
- A Whisper.cpp quantization that spiked WER from 8.2% → 14.7% on accented English audio
- A TTS voice switch that dropped MOS from 3.8 → 2.1 (confirmed by 3 internal listeners)
That paid for itself in 1 day of avoided support escalations.
Is it production-ready? Yes, for automated evaluation. Is it the final word in voice QA? No. You’ll still need human raters for tone, cultural nuance, and emotional appropriateness. But EVA handles the boring, repeatable, numbers-driven part — and it does it well.
Bottom line: If you’re using `curl` + `whisper.cpp` + a spreadsheet to evaluate voice agents — stop. `pip install eva-eval`, write a config, and run `eva run`. It takes <30 minutes. You’ll get WER, intent F1, and MOS scores — all self-hosted, all auditable, all reproducible.
And if ServiceNow keeps this up? I’ll be watching those GitHub stars. 94 is just the beginning.