OpenVals is an open-source framework for evaluating and benchmarking large language models, covering API-based models from OpenAI, Anthropic's Claude, and Google's Gemini as well as local models served through Ollama. Hosted on GitHub at vishwanathakuthota/openvals with 16 stars, this Python project addresses the gap between model demos and production reliability. Many AI evaluation tools stop at basic metrics; OpenVals structures assessments to quantify trust, risk, and performance. It helps organizations align evaluations with business goals, compare models under identical conditions, and make deployment decisions backed by normalized scores.

The framework targets unpredictability in LLMs and generative AI, particularly in enterprise systems, regulated industries, and security-sensitive deployments. Models may perform well in isolated tests but falter under real workloads because of factors such as latency or safety. OpenVals introduces weighted scoring, Trust Score = Σ (wᵢ × mᵢ), where each metric mᵢ is multiplied by a weight wᵢ reflecting its importance, so accuracy, cost, or speed can be prioritized to match the use case.
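
As a quick worked example of that formula (the metric names, values, and weights below are invented for illustration, not OpenVals defaults):

# Trust Score = sum(w_i * m_i) over normalized metric values.
# Weights and metric values here are made up for the example.
weights = {"accuracy": 0.5, "semantic": 0.3, "latency": 0.2}
metrics = {"accuracy": 0.90, "semantic": 0.80, "latency": 0.70}

trust_score = sum(weights[k] * metrics[k] for k in weights)
print(round(trust_score, 2))  # 0.5*0.90 + 0.3*0.80 + 0.2*0.70 = 0.83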

Core capabilities

OpenVals breaks evaluation into focused components:

  • Model evaluation tests outputs against datasets using accuracy (exact or relaxed matching), semantic similarity, and latency.
  • Multi-model benchmarking runs side-by-side comparisons, normalizes scores, ranks models, and generates performance insights.
  • Scoring engine applies custom weights to metrics, balancing priorities like speed against reliability.
  • Extensible architecture includes plug-and-play adapters for models, support for custom metrics, and scalable pipelines.
  • Recommendation engine (in development) will suggest optimal models based on datasets and tradeoffs like speed versus accuracy.

These features enable normalized comparisons across local AI (like Ollama models) and public APIs.
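
As a rough sketch of what that normalization involves, generic min-max scaling maps raw scores from different models onto a common 0 to 1 range so they can be ranked together; OpenVals' exact normalization method may differ:

# Generic min-max normalization of raw model scores onto a 0-1 range.
# Model names and raw scores are placeholders for the example.
raw_scores = {"llama2": 62.0, "llama3": 71.5, "mistral": 78.0}

low, high = min(raw_scores.values()), max(raw_scores.values())
normalized = {name: (s - low) / (high - low) for name, s in raw_scores.items()}

# Print models ranked by normalized score, highest first
for name, score in sorted(normalized.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")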

Getting it running

Installation requires Python, as indicated by the project's PyPI badge. Run this command:

pip install openvals

For a single-model evaluation, load a dataset and instantiate an evaluator:

from openvals.core.evaluator import Evaluator
from openvals.datasets.loader import load_dataset
from openvals.models.ollama_model import OllamaModel

# Load the sample dataset and wrap a locally served Llama 3 model
dataset = load_dataset("examples/sample_eval.json")
model = OllamaModel("llama3")

# Run the evaluation and print the aggregate score
evaluator = Evaluator(model, dataset)
result = evaluator.run()
print(result["overall_score"])

This processes the sample JSON dataset against Llama 3 via Ollama, outputting an overall score.
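
To build your own dataset, a plausible shape is a list of prompt/expected pairs; the field names below are hypothetical, so check examples/sample_eval.json for the actual schema before relying on them:

import json

# Hypothetical dataset records: prompt/expected pairs.
# The real schema used by OpenVals may differ from this sketch.
records = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

with open("my_eval.json", "w") as f:
    json.dump(records, f, indent=2)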

Multi-model benchmarking expands on this by comparing several models:

from openvals.benchmarking.runner import BenchmarkRunner
from openvals.models.ollama_model import OllamaModel
from openvals.datasets.loader import load_dataset

# Evaluate three Ollama-served models against the same dataset
dataset = load_dataset("examples/sample_eval.json")
models = {
    "llama2": OllamaModel("llama2"),
    "llama3": OllamaModel("llama3"),
    "mistral": OllamaModel("mistral")
}

# Run the benchmark and print the ranked results
runner = BenchmarkRunner(models, dataset)
results = runner.run()
print(results)

The output ranks models by composite score:

=== FINAL RANKING ===
1. mistral   (0.91)
2. llama3    (0.87)
3. llama2    (0.84)

Users supply datasets in JSON format (like examples/sample_eval.json), which the loader handles automatically. OllamaModel serves as an example adapter; others exist for API-based providers. Ensure Ollama runs locally for these examples, with models pulled beforehand.
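
If the models used above are not yet available locally, pull them first:

ollama pull llama2
ollama pull llama3
ollama pull mistral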

Metrics explained

OpenVals evaluates models along three core metrics:

Metric     Meaning
Accuracy   Exact / relaxed match scoring
Semantic   Meaning similarity
Latency    Response speed

Accuracy checks literal or flexible output matches. Semantic similarity measures meaning via embeddings or comparators. Latency tracks inference time. The hybrid scoring combines these, with weights adjustable per run to reflect priorities—higher weight on latency for real-time apps, for instance.
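
To make the semantic metric concrete, one common comparator is cosine similarity between embedding vectors, sketched below with plain NumPy; this is a generic illustration, not necessarily the comparator OpenVals ships:

import numpy as np

# Cosine similarity between two embedding vectors, a common basis for
# semantic similarity scoring. The vectors here are stand-ins; in practice
# they would come from an embedding model applied to the two texts.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

expected = np.array([0.12, 0.88, 0.35, 0.41])
generated = np.array([0.10, 0.80, 0.40, 0.38])
print(f"semantic similarity: {cosine_similarity(expected, generated):.3f}")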

Who this is for

The project suits AI engineers, ML teams, SaaS companies integrating LLMs, enterprises validating models, and AI governance teams. In regulated sectors, it quantifies safety and reliability before deployment. Teams benchmarking local models (Ollama) against cloud options (Claude, Gemini) benefit from its normalization, which avoids the bias of polished demos. If your workflow involves custom datasets and multi-model tests aligned to business KPIs, OpenVals fits; dataset preparation remains user-driven, which keeps the tool itself focused on evaluation.

How it stands out

Most evaluation tools compute isolated metrics like perplexity or BLEU scores, stopping short of production trust signals. OpenVals extends this with business-aligned weighting, multi-model ranking, and latency integration—features absent in basic benchmarks. Compared to heavier frameworks like Hugging Face's Evaluate library, it emphasizes Ollama-local runs and quick API comparisons without deep ML expertise. Its lightweight repo size (per badge) keeps overhead low versus enterprise suites like Weights & Biases, though it lacks built-in visualization.

For self-hosters, the Ollama focus pairs well with local setups, sidestepping API costs during dev. Drawbacks include early-stage status—no Docker image or Helm chart yet—and reliance on sample datasets; production users may need to build extensive JSON eval sets.

OpenVals works best for teams needing quick, extensible LLM comparisons in Python environments. Solo developers or those satisfied with single-metric tools might skip it. Check the GitHub repo or website for updates.