The most complete open-source LLM evaluation suite.
Measure accuracy, latency, cost, hallucination, and reasoning quality across any LLM, side by side.
Table of Contents
- Why This Framework?
- Key Features
- Architecture
- Quick Start
- Installation
- API Keys Setup
- CLI Reference
- Python API
- REST API Reference
- Streamlit Dashboard
- Evaluation Metrics
- Supported Benchmarks
- Supported Models & Pricing
- Database & Storage
- PDF Report Generation
- Docker Deployment
- Testing
- HuggingFace Dataset
- Project Structure
- Configuration Reference
- Contributing
- License
- Star History
Why This Framework?
"You can't improve what you can't measure." (Peter Drucker)
The LLM landscape is evolving at breakneck speed. New models appear every week, each claiming to be state-of-the-art. But how do you actually know which model is best for your use case?
Most existing benchmarking tools:
- Evaluate only a single model at a time
- Ignore latency and real-world cost
- Don't detect hallucinations
- Require complex setup
- Lack a usable dashboard
This framework addresses each of these gaps. It evaluates GPT-4, Claude, Gemini, Mistral, and Llama side by side on the same prompts, and reports 5 production-relevant metrics through a Streamlit dashboard, a REST API, and a CLI.
Key Features
- Evaluation Metrics
- Interfaces
- Benchmarks
- Infrastructure
Architecture
+------------------------------------------------------------------------+
|                        LLM EVALUATION FRAMEWORK                        |
|                                                                        |
|  +----------+  +--------------+  +---------------+  +---------------+  |
|  |  Click   |  |   FastAPI    |  |   Streamlit   |  |   ReportLab   |  |
|  |   CLI    |  |   REST API   |  |   Dashboard   |  | PDF Generator |  |
|  |  7 cmds  |  | 12 endpoints |  |    5 pages    |  |               |  |
|  +----+-----+  +------+-------+  +-------+-------+  +-------+-------+  |
|       +---------------+-----------+------+------------------+          |
|                                   |                                    |
|                        +----------v-----------+                        |
|                        |    Core Evaluator    |                        |
|                        |    (Async Engine)    |                        |
|                        |  asyncio.Semaphore   |                        |
|                        | configurable timeout |                        |
|                        |  progress callbacks  |                        |
|                        +----------+-----------+                        |
|                                   |                                    |
|        +-----------------+--------+--------+-----------------+         |
|        |                 |                 |                 |         |
|  +-----v------+  +-------v-------+  +------v-------+  +------v------+  |
|  |  Metrics   |  |  Benchmarks   |  |   Database   |  |   LiteLLM   |  |
|  |            |  |               |  |   (SQLite)   |  |             |  |
|  | accuracy   |  | MMLU (14K)    |  |              |  | OpenAI      |  |
|  | hallucin.  |  | TruthfulQA    |  | save_result  |  | Anthropic   |  |
|  | latency    |  | Custom CSV    |  | list_results |  | Google      |  |
|  | cost       |  | Custom JSON   |  | export_csv   |  | Mistral     |  |
|  | reasoning  |  | HF Hub cache  |  | export_json  |  | Together    |  |
|  +------------+  +---------------+  +--------------+  +-------------+  |
+------------------------------------------------------------------------+
Data Flow
User Request → CLI/API/Dashboard
↓
EvaluationConfig (model, benchmark, num_samples, temperature, concurrency)
↓
LLMEvaluator.evaluate(config, samples)
↓
asyncio.Semaphore(concurrency) ← controls parallelism
↓
For each sample (parallel):
    litellm.acompletion() → response
    AccuracyMetric.score(response, expected)
    HallucinationMetric.score(prompt, response)
    HallucinationMetric.reasoning_quality(response)
    LatencyMetric → wall-clock time
    CostMetric.calculate(model, input_tokens, output_tokens)
↓
_aggregate(samples) → EvaluationResult (accuracy, latency stats, cost, etc.)
↓
Database.save_result() → SQLite
↓
Return EvaluationResult to caller
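The flow above hinges on a semaphore that caps how many model calls are in flight at once. The following is a minimal sketch of that pattern, not the framework's actual evaluator: `fake_model_call`, `evaluate_samples`, and `aggregate` are hypothetical names, and the model call is a local stand-in so the example runs without API keys.

```python
import asyncio
import time


async def fake_model_call(prompt: str) -> str:
    """Stand-in for an async LLM call (hypothetical; no network access)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def evaluate_samples(samples: list[dict], concurrency: int = 5) -> list[dict]:
    """Score every sample while keeping at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(sample: dict) -> dict:
        async with semaphore:  # waits while `concurrency` calls are already running
            start = time.perf_counter()
            response = await fake_model_call(sample["prompt"])
            latency_ms = (time.perf_counter() - start) * 1000
        return {"correct": response == sample["expected"], "latency_ms": latency_ms}

    # gather preserves input order, so results line up with samples
    return await asyncio.gather(*(run_one(s) for s in samples))


def aggregate(results: list[dict]) -> dict:
    """Collapse per-sample records into run-level statistics."""
    n = len(results)
    return {
        "num_samples": n,
        "accuracy": sum(r["correct"] for r in results) / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in results) / n,
    }


samples = [
    {"prompt": "a", "expected": "A"},
    {"prompt": "b", "expected": "B"},
    {"prompt": "c", "expected": "x"},  # deliberately wrong expected answer
    {"prompt": "d", "expected": "D"},
]
summary = aggregate(asyncio.run(evaluate_samples(samples, concurrency=2)))
print(summary)
```

Raising `concurrency` shortens wall-clock time until the provider's rate limit becomes the bottleneck, which is presumably why the framework exposes it as a per-run setting.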
Quick Start
3-Minute Setup
# 1. Install
pip install llm-evaluation-framework
# 2. Set your API key
export OPENAI_API_KEY="sk-..."
# 3. Run your first evaluation
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 50
Expected output:
+--------------------------------------+
| Evaluation: gpt-4o-mini              |
+------------------+-------------------+
| Accuracy         | 78.00%            |
| Avg Latency      | 432 ms            |
| P95 Latency      | 1240 ms           |
| Total Cost       | $0.0012           |
| Cost / 1K Tokens | $0.0015           |
| Hallucination    | 2.40%             |
| Reasoning Score  | 7.2 / 10          |
| Samples          | 50                |
| Run ID           | a3f92c1b          |
+------------------+-------------------+
Installation
Option 1: pip (Recommended)
pip install llm-evaluation-framework
Option 2: With Extras
# Dashboard (Streamlit + Plotly + Pandas)
pip install "llm-evaluation-framework[dashboard]"
# PDF Reports (ReportLab)
pip install "llm-evaluation-framework[reports]"
# Development (pytest, mypy, ruff)
pip install "llm-evaluation-framework[dev]"
# Everything
pip install "llm-evaluation-framework[dashboard,reports,dev]"
Option 3: From Source
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
# Install with all extras
pip install -e ".[dashboard,reports,dev]"
Option 4: Docker
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
cp .env.example .env # fill in your API keys
docker-compose up -d
# API: http://localhost:8000/docs
# Dashboard: http://localhost:8501
Requirements
| Dependency | Version | Purpose |
|---|---|---|
| Python | ≥ 3.10 | Core runtime |
| litellm | 1.52.x | Unified LLM API |
| fastapi | 0.115.x | REST API |
| uvicorn | 0.32.x | ASGI server |
| streamlit | 1.40.x | Dashboard |
| plotly | 5.24.x | Charts |
| pandas | 2.2.x | Data handling |
| click | 8.1.x | CLI |
| rich | 13.9.x | Terminal UI |
| datasets | 3.2.x | HF Hub loader |
| reportlab | 4.2.x | PDF reports |
| pydantic | 2.10.x | Data validation |
API Keys Setup
Copy .env.example to .env and fill in your keys:
cp .env.example .env
# .env
# -- OpenAI --
OPENAI_API_KEY=sk-...
# -- Anthropic (Claude) --
ANTHROPIC_API_KEY=sk-ant-...
# -- Google (Gemini) --
GEMINI_API_KEY=AI...
GOOGLE_API_KEY=AI...
# -- Mistral --
MISTRAL_API_KEY=...
# -- Together AI (Llama) --
TOGETHERAI_API_KEY=...
# -- HuggingFace (for dataset loading) --
HUGGINGFACE_TOKEN=hf_...
# -- App Settings --
LLM_EVAL_DB_PATH=llm_eval.db
LLM_EVAL_CACHE_DIR=~/.cache/llm_eval
PORT=8000
DASHBOARD_PORT=8501
Note: You only need the keys for the models you want to evaluate. The framework works with any subset of providers.
CLI Reference
The CLI is installed as llm-eval and provides 7 subcommands:
llm-eval run: Evaluate a single model
llm-eval run [OPTIONS]
Options:
-m, --model TEXT LiteLLM model name [required]
-b, --benchmark TEXT mmlu | truthfulqa | custom [default: mmlu]
-n, --samples INTEGER Number of samples [default: 20]
--temperature FLOAT Sampling temperature [default: 0.0]
--max-tokens INTEGER Max output tokens [default: 512]
--concurrency INTEGER Parallel API calls [default: 5]
-o, --output PATH Save JSON result to file
-v, --verbose Enable debug logging
Examples:
# Quick 20-sample smoke test
llm-eval run --model gpt-4o-mini --benchmark mmlu
# Full 100-sample MMLU evaluation
llm-eval run --model gpt-4o --benchmark mmlu --samples 100 --concurrency 10
# TruthfulQA evaluation
llm-eval run --model claude-3-5-haiku-20241022 --benchmark truthfulqa --samples 50
# Save output to JSON
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 50 -o result.json
# Use a custom temperature
llm-eval run --model gpt-3.5-turbo --benchmark mmlu --temperature 0.3
llm-eval compare: Compare multiple models
llm-eval compare [OPTIONS]
Options:
-m, --models TEXT Model names (repeat flag) [required, min 2]
-b, --benchmark TEXT Benchmark name [default: mmlu]
-n, --samples INTEGER Samples per model [default: 20]
-o, --output PATH Save JSON results to file
Examples:
# Compare 3 providers
llm-eval compare \
--models gpt-4o-mini \
--models claude-3-haiku-20240307 \
--models gemini/gemini-1.5-flash \
--benchmark mmlu --samples 50
# Compare on TruthfulQA
llm-eval compare \
--models gpt-4o \
--models claude-3-5-sonnet-20241022 \
--benchmark truthfulqa --samples 100
# Save comparison results
llm-eval compare --models gpt-4o-mini --models claude-3-haiku-20240307 \
--benchmark mmlu --output comparison.json
llm-eval results: View stored results
llm-eval results [OPTIONS]
Options:
--model TEXT Filter by model name
--benchmark TEXT Filter by benchmark
--limit INTEGER Max results to show [default: 20]
Examples:
# Show all results
llm-eval results
# Filter by model
llm-eval results --model gpt-4o-mini
# Filter by benchmark
llm-eval results --benchmark mmlu --limit 50
llm-eval export: Export results
llm-eval export --format [csv|json] --output OUTPUT_PATH
# Export to CSV
llm-eval export --format csv --output results.csv
# Export to JSON
llm-eval export --format json --output results.json
llm-eval report: Generate PDF report
llm-eval report --run-ids RUN_ID [--run-ids RUN_ID ...] --output OUTPUT_DIR
# Single run
llm-eval report --run-ids a3f92c1b --output ./reports/
# Multiple runs in one report
llm-eval report --run-ids a3f92c1b --run-ids b4d03e2c --output ./reports/
llm-eval serve: Start FastAPI server
llm-eval serve [--host HOST] [--port PORT] [--reload]
llm-eval serve --port 8000 --reload
# → http://localhost:8000/docs
llm-eval dashboard: Launch Streamlit dashboard
llm-eval dashboard [--port PORT]
llm-eval dashboard --port 8501
# → http://localhost:8501
Python API
Basic Evaluation
import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark
async def main():
    evaluator = LLMEvaluator()  # uses default db path
    samples = MMLUBenchmark().load(num_samples=100)  # loads from HF Hub or cache
    config = EvaluationConfig(
        model="gpt-4o-mini",
        benchmark="mmlu",
        num_samples=100,
        temperature=0.0,
        max_tokens=512,
        concurrency=10,  # parallel API calls
        timeout=30.0,  # seconds per request
    )
    result = await evaluator.evaluate(config, samples)
    print(f"Accuracy: {result.accuracy:.2%}")
    print(f"Avg Latency: {result.avg_latency_ms:.0f} ms")
    print(f"P95 Latency: {result.p95_latency_ms:.0f} ms")
    print(f"P99 Latency: {result.p99_latency_ms:.0f} ms")
    print(f"Total Cost: ${result.total_cost_usd:.4f}")
    print(f"Cost per 1K: ${result.cost_per_1k_tokens:.4f}")
    print(f"Hallucination: {result.hallucination_rate:.2%}")
    print(f"Reasoning Score: {result.avg_reasoning_score:.1f} / 10")
    print(f"Run ID: {result.run_id}")
asyncio.run(main())
Side-by-Side Comparison
import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark
async def compare():
    evaluator = LLMEvaluator()
    # Load samples ONCE so that all models see the same prompts
    samples = MMLUBenchmark().load(num_samples=50)
    configs = [
        EvaluationConfig(model="gpt-4o", benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="gpt-4o-mini", benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="claude-3-5-sonnet-20241022", benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="claude-3-haiku-20240307", benchmark="mmlu", num_samples=50),
        EvaluationConfig(model="gemini/gemini-1.5-flash", benchmark="mmlu", num_samples=50),
    ]
    # All 5 evaluations run in parallel
    results = await evaluator.evaluate_multiple(configs, samples)
    # Print ranked leaderboard
    header = f"{'Rank':<5} {'Model':<40} {'Acc':>8} {'Lat':>8} {'Cost/1K':>10} {'Score':>8}"
    print(header)
    print("─" * len(header))
    for i, r in enumerate(sorted(results, key=lambda x: x.accuracy, reverse=True), 1):
        print(
            f"{i:<5} {r.model:<40} "
            f"{r.accuracy:>7.1%} "
            f"{r.avg_latency_ms:>6.0f}ms "
            f"${r.cost_per_1k_tokens:>8.4f} "
            f"{r.avg_reasoning_score:>6.1f}/10"
        )
asyncio.run(compare())
Custom Benchmark
import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.custom import CustomBenchmark
async def custom_eval():
    # From a CSV file
    bench = CustomBenchmark.from_file("my_benchmark.csv")
    # Or from a string (CSV rows stay unindented so they are parsed verbatim)
    csv_content = """prompt,expected
"What is the capital of Python packaging?","PyPI"
"What does ACID stand for in databases?","Atomicity Consistency Isolation Durability"
"""
    bench = CustomBenchmark.from_string(csv_content, format="csv")
    evaluator = LLMEvaluator()
    config = EvaluationConfig(model="gpt-4o-mini", benchmark="custom", num_samples=50)
    result = await evaluator.evaluate(config, bench.load(50))
    print(f"Custom eval accuracy: {result.accuracy:.2%}")
asyncio.run(custom_eval())
Progress Tracking
import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark
async def with_progress():
    evaluator = LLMEvaluator()
    samples = MMLUBenchmark().load(100)
    config = EvaluationConfig(model="gpt-4o-mini", benchmark="mmlu", num_samples=100)
    async def on_progress(done: int, total: int):
        pct = done / total * 100
        filled = int(pct / 5)  # 20-cell bar, one cell per 5%
        bar = "█" * filled + "░" * (20 - filled)
        print(f"\r[{bar}] {done}/{total} ({pct:.0f}%)", end="", flush=True)
    result = await evaluator.evaluate(config, samples, progress_callback=on_progress)
    print(f"\nDone! Accuracy: {result.accuracy:.1%}")
asyncio.run(with_progress())
Database Queries
from llm_eval.database.models import Database
db = Database() # or Database("custom_path.db")
# List all results
records = db.list_results(limit=50)
# Filter by model
records = db.list_results(model="gpt-4o-mini", limit=20)
# Filter by benchmark
records = db.list_results(benchmark="mmlu", limit=100)
# Get a specific run
record = db.get_result("a3f92c1b")
print(record.accuracy, record.avg_latency_ms)
# Compare models on a benchmark (latest run per model)
comparison = db.get_model_comparison("mmlu")
for row in comparison:
    print(row["model"], row["accuracy"])
# Export
db.export_csv("all_results.csv")
db.export_json("all_results.json")
# Delete a run
db.delete_result("a3f92c1b")
REST API Reference
Start the API server:
uvicorn llm_eval.api.main:app --reload --port 8000
# Interactive docs: http://localhost:8000/docs
# ReDoc: http://localhost:8000/redoc
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /evaluate | Evaluate a model on a benchmark |
| POST | /compare | Compare multiple models side-by-side |
| POST | /evaluate/custom | Upload CSV/JSON for custom evaluation |
| GET | /results | List stored results (filterable) |
| GET | /results/{run_id} | Get a specific run result |
| DELETE | /results/{run_id} | Delete a stored result |
| GET | /export/csv | Download all results as CSV |
| GET | /export/json | Download all results as JSON |
| POST | /report | Generate and download PDF report |
| GET | /models | List all models and pricing |
| GET | /benchmarks | List available benchmarks |
| GET | /health | Health check ({"status":"ok"}) |
Request/Response Examples
POST /evaluate
// Request
{
"model": "gpt-4o-mini",
"benchmark": "mmlu",
"num_samples": 50,
"temperature": 0.0,
"max_tokens": 512,
"concurrency": 10
}
// Response
{
"run_id": "a3f92c1b",
"model": "gpt-4o-mini",
"benchmark": "mmlu",
"num_samples": 50,
"accuracy": 0.78,
"avg_latency_ms": 432.1,
"p50_latency_ms": 380.0,
"p95_latency_ms": 1100.0,
"p99_latency_ms": 1840.0,
"total_cost_usd": 0.0012,
"cost_per_1k_tokens": 0.0015,
"hallucination_rate": 0.024,
"avg_reasoning_score": 7.2,
"created_at": "2025-01-20T14:32:01"
}
POST /compare
// Request
{
"models": ["gpt-4o-mini", "claude-3-haiku-20240307", "gemini/gemini-1.5-flash"],
"benchmark": "mmlu",
"num_samples": 30
}
// Response
{
"results": [
{"model": "gpt-4o-mini", "accuracy": 0.78, ...},
{"model": "claude-3-haiku-20240307", "accuracy": 0.74, ...},
{"model": "gemini/gemini-1.5-flash", "accuracy": 0.76, ...}
]
}
POST /evaluate/custom (file upload)
curl -X POST http://localhost:8000/evaluate/custom \
-F "file=@my_benchmark.csv" \
-F "model=gpt-4o-mini" \
-F "num_samples=50"
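The JSON endpoints can also be called from Python. Below is a minimal client sketch for POST /evaluate using only the standard library. It assumes the server is reachable at http://localhost:8000; `build_evaluate_request` and `evaluate` are hypothetical helper names, not part of the package.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: API started locally on the default port


def build_evaluate_request(
    model: str, benchmark: str = "mmlu", num_samples: int = 50
) -> urllib.request.Request:
    """Build (but do not send) a POST /evaluate request with a JSON body."""
    payload = {"model": model, "benchmark": benchmark, "num_samples": num_samples}
    return urllib.request.Request(
        f"{BASE_URL}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(model: str, **kwargs) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_evaluate_request(model, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)


request = build_evaluate_request("gpt-4o-mini")
print(request.get_method(), request.full_url)
```

Calling `evaluate("gpt-4o-mini")` against a live server should return a dictionary shaped like the response example above; the long timeout reflects that an evaluation run may take minutes.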
Streamlit Dashboard
Launch the dashboard:
streamlit run llm_eval/dashboard/app.py
# → http://localhost:8501
# Or via CLI
llm-eval dashboard --port 8501
Dashboard Pages
| Page | Description |
|---|---|
| Dashboard | Overview: total runs, unique models, best accuracy, total spend. Radar chart, cost vs quality scatter, latency histogram. |
| Run Evaluation | Configure and launch a new evaluation with live progress bar. |
| Compare Models | Select multiple models, run parallel comparison, see ranked table + charts. |
| Results | Browse all stored results with filters. Download CSV or JSON. |
| Reports | Select runs, generate PDF report, download instantly. |
| About | Framework info, links, quick start guide. |
Dashboard Charts
- Radar Chart: 5-axis model comparison (accuracy, speed, cost efficiency, truthfulness, reasoning)
- Latency Histogram: distribution of response times per model
- Cost vs Quality Scatter: bubble chart (bubble size = sample count)
- Accuracy Bar Chart: ranked model comparison
- 3D Scatter: cost vs quality vs latency
Evaluation Metrics
Accuracy Metric
Uses a cascade of matching strategies, applied in order:
- Exact match: after lowercasing and stripping punctuation
- Prefix normalization: removes "The answer is", "Answer:", etc.
- Multiple-choice detection: extracts the first A/B/C/D letter from the prediction
- Fuzzy match: Levenshtein ratio ≥ 0.85 (configurable)
from llm_eval.metrics.accuracy import AccuracyMetric
metric = AccuracyMetric(fuzzy_threshold=0.85)
# Multiple choice
metric.score("The answer is A", "A") # True
metric.score("I think B is correct here", "B") # True
# Free-form
metric.score("mitochondria", "mitochondrion") # True (fuzzy)
metric.score("Paris", "Paris") # True (exact)
# Batch scoring
results, accuracy = metric.batch_score(
    ["A", "B", "A", "D"],
    ["A", "B", "C", "D"],
)
# results = [True, True, False, True], accuracy = 0.75
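To make the cascade concrete, here is a rough standalone sketch of the four strategies. It is not the code behind AccuracyMetric: `normalize` and `is_correct` are hypothetical names, and the fuzzy step substitutes `difflib.SequenceMatcher` from the standard library for a Levenshtein ratio, so borderline scores may differ.

```python
import re
import string
from difflib import SequenceMatcher

# Leading phrases to strip before comparing (assumed list, for illustration)
_PREFIX = re.compile(r"^(the answer is|answer is|answer:)\s*", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def is_correct(prediction: str, expected: str, fuzzy_threshold: float = 0.85) -> bool:
    # 1. Exact match after normalization
    if normalize(prediction) == normalize(expected):
        return True
    # 2. Strip a leading "The answer is" / "Answer:" and retry
    stripped = _PREFIX.sub("", prediction.strip())
    if normalize(stripped) == normalize(expected):
        return True
    # 3. Multiple choice: compare the first standalone A-D letter
    target = expected.strip().upper()
    if target in {"A", "B", "C", "D"}:
        match = re.search(r"\b([A-D])\b", prediction)
        return bool(match) and match.group(1) == target
    # 4. Fuzzy match on the normalized strings
    ratio = SequenceMatcher(None, normalize(stripped), normalize(expected)).ratio()
    return ratio >= fuzzy_threshold


print(is_correct("The answer is A", "A"))
print(is_correct("mitochondria", "mitochondrion"))
print(is_correct("London", "Paris"))
```

One known weakness of step 3 in this sketch: a response that opens with the article "A" (as in "A better choice is C") would be read as choice A, so a production implementation would need a stricter extraction rule.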
Latency Metric
from llm_eval.metrics.latency import LatencyMetric
metric = LatencyMetric()
latencies = [120, 200, 350, 450, 800, 1200, 2000, 5500] # ms
stats = metric.compute_stats(latencies)
print(f"Mean: {stats.mean_ms:.0f} ms")
print(f"P50: {stats.p50_ms:.0f} ms")
print(f"P95: {stats.p95_ms:.0f} ms")
print(f"P99: {stats.p99_ms:.0f} ms")
# SLA violations
rate = metric.sla_violation_rate(latencies, threshold_ms=5000.0)
print(f"SLA violations: {rate:.1%}") # 12.5%
# Classification
print(metric.classify(200)) # "excellent"
print(metric.classify(1000)) # "good"
print(metric.classify(4000)) # "slow"
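For readers who want to see what sits behind such statistics, this is a small sketch of percentile and SLA calculations in plain Python. It assumes linear interpolation between closest ranks; the framework may use a different percentile definition, and both function names here are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def sla_violation_rate(values: list[float], threshold_ms: float) -> float:
    """Fraction of requests slower than the threshold."""
    return sum(v > threshold_ms for v in values) / len(values)


latencies = [120, 200, 350, 450, 800, 1200, 2000, 5500]  # ms
print(percentile(latencies, 50))           # 625.0
print(sla_violation_rate(latencies, 5000))  # 0.125
```

With only 8 samples, the upper percentiles are dominated by the single slowest request, which is a good reason to run more samples before comparing P95 or P99 across models.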
Cost Metric
from llm_eval.metrics.cost import CostMetric
metric = CostMetric()
# Calculate exact cost
cost = metric.calculate("gpt-4o-mini", input_tokens=150, output_tokens=200)
print(f"Per-sample cost: ${cost:.6f}")
# Pre-run estimate
estimate = metric.estimate_run_cost("gpt-4o-mini", num_samples=100)
print(f"Estimated run cost: ${estimate:.4f}")
# All pricing
for name, pricing in metric.get_all_pricing().items():
    print(f"{name}: ${pricing.input_per_1m}/M in, ${pricing.output_per_1m}/M out")
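The arithmetic behind a per-sample cost is simple enough to check by hand. The sketch below uses the per-million-token prices listed in this README for two models. The `PRICING_PER_1M` table and both helper functions are illustrative, and the cost-per-1K-tokens formula (total cost divided by total tokens) is an assumption about how that figure is derived.

```python
# (input_price, output_price) in USD per 1M tokens, taken from the pricing tables
PRICING_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-haiku-20240307": (0.25, 1.25),
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, pricing input and output tokens separately."""
    input_price, output_price = PRICING_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_per_1k_tokens(total_cost: float, total_tokens: int) -> float:
    """Blended cost per 1,000 tokens (assumed definition)."""
    return total_cost / total_tokens * 1000 if total_tokens else 0.0


cost = calculate_cost("gpt-4o-mini", input_tokens=150, output_tokens=200)
print(f"Per-sample cost: ${cost:.6f}")
print(f"Per 1K tokens:   ${cost_per_1k_tokens(cost, 350):.6f}")
```

Because output tokens cost several times more than input tokens for most models, a lower `max_tokens` setting is usually the most direct lever on run cost.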
Hallucination + Reasoning Metric
from llm_eval.metrics.hallucination import HallucinationMetric
metric = HallucinationMetric()
# Hallucination score (0 = grounded, 1 = likely hallucinating)
score = metric.score(
    prompt="What is the capital of France?",
    response="I believe it was supposedly Paris, though I could be wrong.",
)
print(f"Hallucination score: {score:.2f}") # ~0.3 (hedging signals)
# Reasoning quality (1-10)
quality = metric.reasoning_quality(
    "First, we analyze the data. Based on the evidence specifically, "
    "therefore we can conclude that X. For example, this means..."
)
print(f"Reasoning quality: {quality:.1f}/10") # ~7.0
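The "hedging signals" comment suggests a heuristic based on uncertain phrasing. As a purely illustrative sketch of that idea, and not the scoring used by HallucinationMetric, the snippet below counts hedging phrases and caps the result at 1. The phrase list, the `per_hit` weight, and the function name are all assumptions.

```python
import re

# Hypothetical list of phrases that signal an uncertain answer
HEDGING_TERMS = (
    "i believe",
    "supposedly",
    "i could be wrong",
    "probably",
    "not sure",
)


def hedging_score(response: str, per_hit: float = 0.1) -> float:
    """Return a 0..1 score that grows with the number of hedging phrases."""
    text = response.lower()
    hits = sum(len(re.findall(re.escape(term), text)) for term in HEDGING_TERMS)
    return min(1.0, hits * per_hit)


example = "I believe it was supposedly Paris, though I could be wrong."
print(f"{hedging_score(example):.2f}")
print(f"{hedging_score('Paris.'):.2f}")
```

A surface heuristic like this flags uncertainty rather than factual error; a confident but wrong answer scores 0, which is why an NLI-based check appears among the suggested improvements later in this README.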
Supported Benchmarks
MMLU (Massive Multitask Language Understanding)
- Size: ~14,000 test questions
- Format: 4-choice multiple choice
- Subjects: 57 academic subjects including:
  - STEM: abstract algebra, anatomy, astronomy, biology, chemistry, computer science, physics, mathematics
  - Humanities: history, law, philosophy, world religions
  - Social sciences: economics, geography, political science, psychology, sociology
  - Professional: medical, legal, accounting, marketing, nursing
from llm_eval.benchmarks.mmlu import MMLUBenchmark
bench = MMLUBenchmark(subject="all") # or a specific subject
samples = bench.load(num_samples=100, seed=42)
# [{"prompt": "Question?\nA) ...\nB) ...\nAnswer:", "expected": "A"}, ...]
TruthfulQA
- Size: 817 questions
- Format: 4-choice multiple choice (MC1 format)
- Purpose: Tests whether models give truthful answers; questions are designed to elicit common misconceptions
from llm_eval.benchmarks.truthfulqa import TruthfulQABenchmark
bench = TruthfulQABenchmark()
samples = bench.load(num_samples=100, seed=42)
Custom Benchmark
- Format: CSV or JSON
- Required columns: prompt, expected
- Sources: File upload, string, or programmatic
from llm_eval.benchmarks.custom import CustomBenchmark
# From file
bench = CustomBenchmark.from_file("my_data.csv")
# From string
bench = CustomBenchmark.from_string("""
prompt,expected
"What is 2+2?",4
"Capital of Germany?",Berlin
""", format="csv")
print(f"Dataset size: {len(bench)} samples")
samples = bench.load(num_samples=50)
Supported Models & Pricing
OpenAI
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
| gpt-4o | $5.00 | $15.00 | Best accuracy |
| gpt-4o-mini | $0.15 | $0.60 | Best value |
| gpt-4-turbo | $10.00 | $30.00 | Legacy |
| gpt-3.5-turbo | $0.50 | $1.50 | Fast, cheap |
| o1 | $15.00 | $60.00 | Reasoning |
| o1-mini | $3.00 | $12.00 | Reasoning, fast |
Anthropic
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
| claude-3-5-sonnet-20241022 | $3.00 | $15.00 | Flagship |
| claude-3-5-haiku-20241022 | $0.80 | $4.00 | Fast + cheap |
| claude-3-opus-20240229 | $15.00 | $75.00 | Best reasoning |
| claude-3-haiku-20240307 | $0.25 | $1.25 | Cheapest Claude |
Google
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
| gemini/gemini-1.5-pro | $3.50 | $10.50 | 2M context |
| gemini/gemini-1.5-flash | $0.075 | $0.30 | Ultra-cheap |
| gemini/gemini-2.0-flash-exp | $0.00 | $0.00 | Free preview |
Mistral
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
| mistral/mistral-large-latest | $4.00 | $12.00 | Flagship |
| mistral/mistral-small-latest | $1.00 | $3.00 | Balanced |
| mistral/open-mistral-7b | $0.25 | $0.25 | Open weights |
Meta Llama (via Together AI)
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
| together_ai/meta-llama/Llama-3-70b-chat-hf | $0.90 | $0.90 | Powerful open |
| together_ai/meta-llama/Llama-3-8b-chat-hf | $0.20 | $0.20 | Fast open |
Custom / Local Models
Any LiteLLM-compatible provider works:
# Ollama (local)
llm-eval run --model ollama/llama3 --benchmark mmlu --samples 20
# vLLM
llm-eval run --model hosted_vllm/meta-llama/Llama-3-8b --benchmark mmlu
# HuggingFace TGI
llm-eval run --model huggingface/mistralai/Mistral-7B-Instruct-v0.2 --benchmark mmlu
Database & Storage
All evaluation results are automatically saved to SQLite (llm_eval.db by default):
from llm_eval.database.models import Database
db = Database("llm_eval.db") # default path
# or
db = Database("/data/eval_results.db")
# Query
records = db.list_results(model="gpt-4o-mini", benchmark="mmlu", limit=100)
record = db.get_result("a3f92c1b")
# Compare latest results per model for a benchmark
comparison = db.get_model_comparison("mmlu")
# Export
db.export_csv("results.csv")
db.export_json("results.json")
# Delete
db.delete_result("a3f92c1b")
Database Schema
CREATE TABLE evaluations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL UNIQUE,
model TEXT NOT NULL,
benchmark TEXT NOT NULL,
num_samples INTEGER NOT NULL,
accuracy REAL NOT NULL,
avg_latency_ms REAL NOT NULL,
p95_latency_ms REAL NOT NULL,
total_cost_usd REAL NOT NULL,
cost_per_1k_tokens REAL NOT NULL,
hallucination_rate REAL NOT NULL,
avg_reasoning_score REAL NOT NULL,
created_at TEXT NOT NULL,
metadata TEXT NOT NULL DEFAULT '{}'
);
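Because results live in a plain SQLite file, they can also be queried directly. The sketch below builds a trimmed copy of the table in memory and selects the latest run per model for one benchmark, which is roughly what get_model_comparison is described as returning. The query is an illustration under that assumption, not the library's own SQL, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed version of the schema above: only the columns this query needs
conn.execute(
    """
    CREATE TABLE evaluations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        run_id TEXT NOT NULL UNIQUE,
        model TEXT NOT NULL,
        benchmark TEXT NOT NULL,
        accuracy REAL NOT NULL,
        created_at TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO evaluations (run_id, model, benchmark, accuracy, created_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "gpt-4o-mini", "mmlu", 0.74, "2025-01-19T10:00:00"),
        ("r2", "gpt-4o-mini", "mmlu", 0.78, "2025-01-20T14:32:01"),
        ("r3", "claude-3-haiku-20240307", "mmlu", 0.74, "2025-01-20T15:00:00"),
        ("r4", "gpt-4o-mini", "truthfulqa", 0.61, "2025-01-21T09:00:00"),
    ],
)

# ISO-8601 timestamps sort lexicographically, so MAX(created_at) is the newest run
LATEST_PER_MODEL = """
    SELECT e.model, e.accuracy, e.run_id
    FROM evaluations AS e
    JOIN (
        SELECT model, MAX(created_at) AS latest
        FROM evaluations
        WHERE benchmark = ?
        GROUP BY model
    ) AS last ON last.model = e.model AND last.latest = e.created_at
    WHERE e.benchmark = ?
    ORDER BY e.accuracy DESC
"""
latest = conn.execute(LATEST_PER_MODEL, ("mmlu", "mmlu")).fetchall()
print(latest)
```

Storing `created_at` as ISO-8601 text is what makes the `MAX` comparison safe here; a locale-formatted date string would not sort correctly.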
PDF Report Generation
from llm_eval.reports.generator import ReportGenerator
from llm_eval.database.models import Database
db = Database()
records = db.list_results(benchmark="mmlu", limit=5)
gen = ReportGenerator()
pdf_path = gen.generate(records, output_dir="./reports/")
print(f"Report saved to: {pdf_path}")
The generated PDF includes:
- Cover page: title, timestamp, model count
- Executive summary table: all models side by side
- Per-model detail page: all metrics, run config
Or via CLI (repeat --run-ids once per run):
llm-eval report --run-ids a3f92c1b --run-ids b4d03e2c --output ./reports/
Docker Deployment
docker-compose.yml deploys both services:
# Start everything
docker-compose up -d
# View logs
docker-compose logs -f
# Stop
docker-compose down
# Rebuild after code changes
docker-compose up -d --build
Build individual images:
# API only
docker build --target api -t llm-eval-api:latest .
docker run -p 8000:8000 --env-file .env llm-eval-api:latest
# Dashboard only
docker build --target dashboard -t llm-eval-dashboard:latest .
docker run -p 8501:8501 --env-file .env llm-eval-dashboard:latest
Environment variables in Docker:
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-... \
  -e ANTHROPIC_API_KEY=sk-ant-... \
  -v "$(pwd)/data:/data" \
  -e LLM_EVAL_DB_PATH=/data/llm_eval.db \
  llm-eval-api:latest
Testing
Run the full test suite
# All tests
pytest tests/ -v
# With coverage report
pytest tests/ -v --cov=llm_eval --cov-report=html --cov-report=term-missing
# Run a specific test class
pytest tests/test_evaluator.py::TestAccuracyMetric -v
# Run a specific test
pytest tests/test_evaluator.py::TestAccuracyMetric::test_mc_correct -v
No API keys required! All tests use mocked LiteLLM responses.
Test Coverage
| Module | Tests | Coverage |
|---|---|---|
| core/evaluator.py | 7 tests | 96% |
| metrics/accuracy.py | 9 tests | 100% |
| metrics/hallucination.py | 6 tests | 95% |
| metrics/latency.py | 5 tests | 100% |
| metrics/cost.py | 6 tests | 100% |
| benchmarks/custom.py | 7 tests | 98% |
| database/models.py | 8 tests | 97% |
Linting & Type Checking
# Ruff linter
ruff check llm_eval/ tests/
# Type checking
mypy llm_eval/ --ignore-missing-imports
# Format check
ruff format --check llm_eval/
HuggingFace Dataset
The evaluation benchmark dataset is published on HuggingFace:
from datasets import load_dataset
# Load all splits
ds = load_dataset("vigneshwar234/llm-eval-benchmark")
print(ds)
# DatasetDict({
# train: Dataset({features: ['id', 'prompt', 'expected', 'subject', 'difficulty', 'source', 'choices'], num_rows: 500}),
# validation: Dataset({features: ..., num_rows: 200}),
# test: Dataset({features: ..., num_rows: 500})
# })
# Use as a custom benchmark
import pandas as pd
from llm_eval.benchmarks.custom import CustomBenchmark
df = pd.DataFrame(ds["test"])
samples = df[["prompt", "expected"]].to_dict("records")
Dataset Statistics
| Split | Samples | Subjects | Sources |
|---|---|---|---|
| train | 500 | 15+ | MMLU + TruthfulQA |
| validation | 200 | 15+ | MMLU + TruthfulQA |
| test | 500 | 15+ | MMLU + TruthfulQA |
| total | 1,200 | 15+ | Mixed |
Re-generate and push:
python huggingface/create_dataset.py --push
Project Structure
LLM-Evaluation-Framework/
│
├── llm_eval/                      ← Main package
│   ├── __init__.py                ← Version: 1.0.0
│   │
│   ├── core/
│   │   ├── __init__.py
│   │   └── evaluator.py           ← LLMEvaluator, EvaluationConfig, EvaluationResult
│   │
│   ├── metrics/
│   │   ├── __init__.py
│   │   ├── accuracy.py            ← AccuracyMetric (exact/normalized/MC/fuzzy)
│   │   ├── hallucination.py       ← HallucinationMetric + reasoning_quality
│   │   ├── latency.py             ← LatencyMetric (percentile stats, SLA)
│   │   └── cost.py                ← CostMetric (15+ provider pricing)
│   │
│   ├── benchmarks/
│   │   ├── __init__.py
│   │   ├── mmlu.py                ← MMLUBenchmark (HF Hub + local cache + builtin)
│   │   ├── truthfulqa.py          ← TruthfulQABenchmark
│   │   └── custom.py              ← CustomBenchmark (CSV/JSON)
│   │
│   ├── dashboard/
│   │   ├── __init__.py
│   │   └── app.py                 ← 5-page Streamlit dashboard
│   │
│   ├── api/
│   │   ├── __init__.py
│   │   └── main.py                ← FastAPI (12 endpoints)
│   │
│   ├── cli/
│   │   ├── __init__.py
│   │   └── main.py                ← Click CLI (7 subcommands)
│   │
│   ├── reports/
│   │   ├── __init__.py
│   │   └── generator.py           ← ReportLab PDF generator
│   │
│   └── database/
│       ├── __init__.py
│       └── models.py              ← SQLite persistence layer
│
├── tests/
│   ├── __init__.py
│   └── test_evaluator.py          ← 40+ pytest tests (no API keys needed)
│
├── .github/
│   └── workflows/
│       └── ci.yml                 ← CI: test × 3 Python versions + lint + Docker + Pages
│
├── docs/
│   └── index.html                 ← GitHub Pages (green/yellow theme)
│
├── huggingface/
│   ├── README.md                  ← HuggingFace dataset card
│   ├── create_dataset.py          ← Dataset builder + HF Hub uploader
│   └── data/                      ← Generated: train/validation/test JSON
│
├── requirements.txt               ← All pinned dependencies
├── .env.example                   ← All env var templates
├── Dockerfile                     ← Multi-stage (api + dashboard targets)
├── docker-compose.yml             ← API + Dashboard stack
├── setup.py                       ← Package setup
├── pyproject.toml                 ← Ruff, mypy, pytest config
├── LICENSE                        ← MIT
└── README.md                      ← This file
Configuration Reference
EvaluationConfig
import uuid
from dataclasses import dataclass, field
@dataclass
class EvaluationConfig:
    model: str                   # LiteLLM model string (required)
    benchmark: str               # "mmlu" | "truthfulqa" | "custom" (required)
    num_samples: int = 100       # Number of samples to evaluate
    temperature: float = 0.0     # Sampling temperature (0.0 = deterministic)
    max_tokens: int = 512        # Max response tokens
    timeout: float = 30.0        # Per-request timeout in seconds
    concurrency: int = 5         # Max parallel API calls
    # Auto-generated 8-char hex ID (uuid4 is an assumed generator, shown for illustration)
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    # Custom metadata tags (default_factory avoids a shared mutable default)
    tags: dict = field(default_factory=dict)
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (none) | OpenAI API key |
| ANTHROPIC_API_KEY | (none) | Anthropic API key |
| GEMINI_API_KEY | (none) | Google Gemini API key |
| MISTRAL_API_KEY | (none) | Mistral API key |
| TOGETHERAI_API_KEY | (none) | Together AI key (for Llama) |
| HUGGINGFACE_TOKEN | (none) | HF token for dataset loading |
| LLM_EVAL_DB_PATH | llm_eval.db | SQLite database path |
| LLM_EVAL_CACHE_DIR | ~/.cache/llm_eval | Benchmark cache directory |
| PORT | 8000 | FastAPI port |
| DASHBOARD_PORT | 8501 | Streamlit port |
| LITELLM_VERBOSE | false | LiteLLM debug logging |
Contributing
We welcome contributions! Here's how to get started:
Development Setup
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dashboard,reports,dev]"
pre-commit install # optional but recommended
Contribution Guide
- Fork the repository
- Create a feature branch: git checkout -b feature/my-awesome-feature
- Make your changes with tests
- Verify: pytest tests/ -v && ruff check llm_eval/
- Commit: git commit -m "feat: add my awesome feature"
- Push and open a Pull Request
Good First Issues
- Add a new benchmark loader (e.g., HellaSwag, ARC)
- Add a new metric (e.g., BLEU, ROUGE, BERTScore)
- Improve hallucination detection with an NLI model
- Add a new chart to the dashboard
- Add support for a new provider
License
This project is licensed under the MIT License; see the LICENSE file for details.
MIT License. Copyright (c) 2025 vignesh2027
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
Star History
If this project helped you, please star it; it helps others find it!
Made by vignesh2027