Let’s be honest: if you’re running AI engineering teams—or just trying to ship LLM-powered features without drowning in console.log() spaghetti—you’re probably flying blind. You track latency in Datadog, count tokens in a spreadsheet, and pray your RAG pipeline doesn’t silently hallucinate its way into production. There’s no unified, self-hostable dashboard that answers: “Which model version spiked error rates last Tuesday?”, “How many prompts actually hit the vector DB vs. falling back to keyword search?”, or “Who deployed that fine-tuned LoRA that’s now chewing 40% more GPU mem?” That’s where Pellametric quietly shows up—not with hype, but with TypeScript, Prometheus metrics, and a GitHub repo that’s small enough to audit but sharp enough to use today.

What Is Pellametric? Real-World AI Engineering Observability

Pellametric is a self-hostable AI engineering analytics platform, built by the folks at Pella Labs (who also run the bematist.dev blog). As of June 2024, it sits at 42 stars on GitHub, is written in TypeScript, and targets one very specific pain point: instrumenting AI pipelines—not infrastructure, not ML training, but the messy, middleware-heavy, LLM-orchestration layer.

Unlike generic APMs (Datadog, Grafana Tempo) or ML monitoring tools (Arize, WhyLabs), Pellametric assumes you’re already using LangChain, LlamaIndex, or even raw openai/anthropic SDKs—and gives you a lightweight agent + backend to capture prompt → LLM call → tool usage → output → feedback in structured, queryable form.

It’s not trying to replace your tracing stack. It complements it. You don’t need to re-architect your app to get value—just wrap your LLM calls with its @pellametric/agent:

import { createPellaAgent } from '@pellametric/agent';

const pella = createPellaAgent({
  endpoint: 'http://localhost:3001/ingest',
  apiKey: 'dev-key-123',
});

const response = await pella.llm({
  model: 'gpt-4o',
  prompt: 'Explain quantum entanglement like I’m 5',
  temperature: 0.3,
  metadata: { feature: 'faq-bot', version: 'v2.1.0' },
});

That one call ships:

  • Prompt + completion (with token counts)
  • Latency, model provider, region, retries
  • Structured feedback (e.g., user_rating: 4, is_hallucinated: true)
  • Optional tool invocations (e.g., SearchTool, DatabaseQueryTool)

All of it lands in Pellametric’s Postgres-backed backend and shows up in a clean, filterable UI—no SaaS signups, no vendor lock-in.
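Under the hood, that means one structured event per call hitting the backend's /ingest endpoint. For a feel of what the agent is doing on your behalf, here is roughly what such an event could look like on the wire. The field names and the auth header format below are my guesses, reconstructed from the bullet list above; the schema isn't documented, so treat this as a sketch, not a spec:

```shell
# Hypothetical ingest payload -- field names are assumptions, not a documented schema.
curl -X POST http://localhost:3001/ingest \
  -H "Authorization: Bearer dev-key-123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "prompt": "Explain quantum entanglement like I am 5",
    "completion": "...",
    "prompt_tokens": 14,
    "completion_tokens": 212,
    "latency_ms": 1840,
    "retries": 0,
    "metadata": { "feature": "faq-bot", "version": "v2.1.0" },
    "feedback": { "user_rating": 4, "is_hallucinated": false }
  }'
```

If you can't use the agent (say, from a non-Node service), posting directly like this is plausible—but verify the actual field names against the agent source first.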

Installation & Docker Deployment (It Actually Works)

I spun this up on a DigitalOcean $12/mo droplet (2 vCPU, 4GB RAM, 80GB SSD)—no Kubernetes, no Helm, just docker-compose. The docs say “just run docker-compose up”, but there are two gotchas you won’t find in the README (I filed an issue after hitting both):

  1. The PROMETHEUS_ENABLED=true env var must be set even if you’re not scraping metrics—otherwise the /metrics endpoint 500s, breaking the frontend health check.
  2. The UI expects API_BASE_URL to be set at build time, not runtime—so you can’t just override it in docker-compose.yml with an env var. You must rebuild the frontend image with --build-arg API_BASE_URL=http://host.docker.internal:3001.
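Gotcha 2 in practice means a rebuild step. A minimal sketch, assuming you're building from the repo root (Dockerfile.frontend is the file name from the repo; the `:local` tag is my own choice):

```shell
# Bake the API base URL into the frontend bundle at build time (gotcha 2).
# The image tag "pellametric-frontend:local" is arbitrary -- pick your own.
docker build \
  --build-arg API_BASE_URL=http://host.docker.internal:3001 \
  -f Dockerfile.frontend \
  -t pellametric-frontend:local \
  .
```

Then point the `pellametric-frontend` service at `pellametric-frontend:local` instead of the ghcr.io image.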

Here’s the working docker-compose.yml I use (tested with v0.4.2, the latest tagged release as of June 2024):

version: '3.8'

services:
  pellametric-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: pellametric
      POSTGRES_USER: pella
      POSTGRES_PASSWORD: pella_dev
    volumes:
      - ./pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U pella -d pellametric"]
      interval: 30s
      timeout: 10s
      retries: 5

  pellametric-backend:
    image: ghcr.io/pella-labs/pellametric-backend:v0.4.2
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://pella:pella_dev@pellametric-db:5432/pellametric
      PORT: 3001
      API_KEY: dev-key-123
      PROMETHEUS_ENABLED: "true"
      NODE_ENV: production
    depends_on:
      pellametric-db:
        condition: service_healthy
    # The frontend below gates on this service via `condition: service_healthy`,
    # which never resolves unless the backend defines its own healthcheck.
    # /metrics only responds because PROMETHEUS_ENABLED is true (gotcha 1);
    # swap wget for curl if your image lacks busybox.
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:3001/metrics > /dev/null"]
      interval: 15s
      timeout: 5s
      retries: 5
    ports:
      - "3001:3001"

  pellametric-frontend:
    image: ghcr.io/pella-labs/pellametric-frontend:v0.4.2
    restart: unless-stopped
    environment:
      API_BASE_URL: http://localhost:3001
    ports:
      - "3000:3000"
    depends_on:
      pellametric-backend:
        condition: service_healthy

Run it:

docker-compose up -d --build
docker-compose logs -f pellametric-backend

Wait ~30 seconds, then hit http://localhost:3000. Sign in with [email protected] / password. Done.

Note: The frontend image does not support runtime API_BASE_URL injection—this is a real limitation if you’re reverse-proxying behind Nginx or Cloudflare. I patched it locally with a 3-line change to Dockerfile.frontend, but upstream hasn’t merged that yet.
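For reference, the shape of my patch is the classic placeholder-substitution entrypoint. This is a sketch of the pattern only, not the upstream code: the `/app/dist` path and the `__API_BASE_URL__` placeholder are my assumptions, and it presumes you first build with `--build-arg API_BASE_URL=__API_BASE_URL__`:

```shell
#!/bin/sh
# docker-entrypoint.sh -- swap in the real API base URL at container start.
# Assumes the bundle was built with a literal __API_BASE_URL__ placeholder;
# the /app/dist path is a guess -- check where the image puts compiled assets.
find /app/dist -name '*.js' -exec \
  sed -i "s|__API_BASE_URL__|${API_BASE_URL}|g" {} +
exec "$@"
```

With that in place, `API_BASE_URL` becomes a genuine runtime env var, and reverse-proxying behind Nginx or Cloudflare stops requiring a rebuild per environment.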

Pellametric vs. The Alternatives: Where It Fits (and Where It Doesn’t)

If you’ve tried Langfuse, Phoenix, or Arize, here’s the honest comparison:

| Tool | Self-hostable | LLM-Specific UI | Real-time Filtering | Token Cost Tracking | Learning Curve | Docker-Ready |
|---|---|---|---|---|---|---|
| Pellametric | ✅ Yes (Postgres + 2 services) | ✅ Clean, focused on prompts/feedback | ✅ Full Lucene-style search on metadata | ✅ Yes (via LLM provider integration) | ⭐ Low (5 min to first trace) | ✅ Yes (but see gotchas above) |
| Langfuse | ✅ Yes | ✅ Very polished, but overloaded with tracing | ✅ Yes, but needs manual index tuning | ✅ Yes, but requires provider config | ⚠️ Medium (UI is dense) | ✅ Yes |
| Phoenix | ✅ Yes | ❌ Minimal UI (debug-first) | ❌ Search is CLI-only or requires OpenInference export | ❌ No built-in cost calc | ⚠️ High (Python + Pydantic + OpenInference spec) | ⚠️ Partial (needs manual DB setup) |
| Arize | ❌ Cloud-only (free tier only) | ✅ Excellent, but SaaS-only | ✅ Yes | ✅ Yes | ❌ Vendor lock-in | N/A |

The kicker? Pellametric’s token cost tracking works out of the box. Set OPENAI_API_KEY in your app, and it resolves current per-model pricing at runtime, then multiplies prompt_tokens and completion_tokens by the matching rates. No manual CSV uploads. No cost_per_million_tokens config.
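The cost math itself is simple enough to sanity-check by hand. Here's a minimal TypeScript sketch of the per-call calculation; the split into input/output rates and the dollar figures are illustrative assumptions, not values Pellametric ships with:

```typescript
// Per-call LLM cost from token usage. Rates are USD per 1M tokens and are
// ILLUSTRATIVE -- in Pellametric the real values are resolved at runtime.
type Usage = { prompt_tokens: number; completion_tokens: number };
type Rate = { inputPerM: number; outputPerM: number };

function callCostUSD(usage: Usage, rate: Rate): number {
  const input = (usage.prompt_tokens / 1_000_000) * rate.inputPerM;
  const output = (usage.completion_tokens / 1_000_000) * rate.outputPerM;
  return input + output;
}

// 1,200 prompt + 400 completion tokens at $5 in / $15 out per 1M tokens:
const cost = callCostUSD(
  { prompt_tokens: 1200, completion_tokens: 400 },
  { inputPerM: 5, outputPerM: 15 },
);
// cost ~= 0.012 USD
```

Multiply that by a few hundred thousand calls a month and you see why having the number in a dashboard, per feature and per version, beats the spreadsheet.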

That said: it doesn’t do drift detection. No concept of “model performance over time” like Phoenix. No embedding similarity dashboards. It’s observability, not ML monitoring.

Why Self-Host Pellametric? Who’s This Actually For?

Let’s cut the fluff: you should self-host Pellametric if you meet at least two of these:

  • You’re sending PII, PHI, or proprietary prompts through LLMs (e.g., internal docs, customer support logs, legal contracts)
  • You run >3 LLM endpoints in production and can’t remember which one uses gpt-3.5-turbo-1106 vs. gpt-3.5-turbo-0125
  • You’ve built your own LangChain wrapper and want to prove that caching llm.invoke() reduced latency by 320ms—not just guess
  • Your team uses console.error("LLM failed") as observability

It’s not for:

  • Teams who want auto-instrumentation (no OpenTelemetry auto-injector yet)
  • Data scientists needing embedding clustering or UMAP plots
  • Companies requiring SOC 2 compliance out of the box (no audit logs, no RBAC—just admin/viewer roles)

Hardware-wise? I’ve run it for 3 weeks on that $12 droplet:

  • Avg. RAM: ~1.1 GB (Postgres + backend + frontend)
  • CPU: < 15% idle, spikes to 40% during bulk ingestion (e.g., backfilling 10k traces)
  • Disk: ~280 MB after 12k traces (with prompt/completion text compressed in Postgres text columns)

No GPU needed. No Node.js version hell—backend runs on Node 20.12, frontend on Vite 5.2.

The Rough Edges: What You’ll Actually Encounter

Here’s what I ran into—and what you’ll face too:

  1. No RBAC (yet)
    The UI has admin and viewer roles, but both can delete all traces. There’s no per-project isolation. If your team has 5 squads sharing one instance, you’re trusting everyone not to DELETE FROM traces;. The GitHub issue has been open since March 2024—no ETA.

  2. Feedback is “opt-in but not optional”
    To see user_rating or is_hallucinated in filters, you must send those fields in every pella.llm() call. There’s no “default to null” UI toggle. If your frontend forgets to attach metadata: { user_rating: 5 }, that feedback vanishes—not hidden, gone.

  3. No alerting
    You can’t set “alert if error_rate > 5% for 5 minutes”. You get metrics (pellametric_llm_errors_total), but no built-in alert rules or email/SMS hooks. You’ll need Prometheus + Alertmanager—fine if you have it, painful if you don’t.

  4. Frontend build quirk (again)
    Want to deploy behind https://ai-obs.yourco.com? You must rebuild the frontend image with --build-arg API_BASE_URL=https://ai-obs.yourco.com/api. No nginx.conf rewrite trick works—the frontend makes hardcoded /api/ requests that 404. I spent 90 minutes on this. Don’t repeat my mistake.

  5. Docs lag behind code
    The @pellametric/agent package on npm is v0.4.2, but the GitHub README still shows pella.track()—a deprecated method. The actual working method is pella.llm(). Check the agent source if in doubt.
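On the alerting gap (point 3): if you already run Prometheus and Alertmanager, a standard rule closes it. One caveat: `pellametric_llm_requests_total` is my assumed name for the request counter—only the error counter is mentioned above—so check your /metrics output for the actual names before deploying:

```yaml
groups:
  - name: pellametric
    rules:
      - alert: PellametricHighErrorRate
        # Fires when error rate exceeds 5% sustained over 5 minutes.
        # pellametric_llm_requests_total is an ASSUMED metric name --
        # verify against the backend's /metrics endpoint.
        expr: |
          sum(rate(pellametric_llm_errors_total[5m]))
            / sum(rate(pellametric_llm_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pellametric LLM error rate above 5% for 5 minutes"
```

Point Prometheus at the backend's /metrics endpoint (that's what PROMETHEUS_ENABLED=true exposes), load this rule file, and route the alert through your existing Alertmanager receivers.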

Final Verdict: Is Pellametric Worth Deploying Right Now?

Yes—but with guardrails.

I’ve run it for 17 days across 3 internal LLM services (a RAG chatbot, a Slack summarizer, and a document classifier). It caught two real issues:

  • A misconfigured temperature=1.0 in prod that spiked hallucination flags by 70%
  • A SearchTool timeout bug that only surfaced when filtering by duration_ms > 5000

That’s value. Real, immediate, actionable value.

Is it polished? No. It’s a sharp, lightweight knife—not a Swiss Army one. You’ll tweak configs, rebuild images, and read source. But if you’re tired of stitching together console.log() + Datadog + Excel to answer “why did that LLM call fail?”, Pellametric gets you 80% there in <30 minutes.

The 42 stars? They’re not from hype. They’re from people like you—sysadmins who’d rather grep logs than click through SaaS dashboards.

So go ahead. Clone it. Run docker-compose up. Then go break something—and finally see it break.

Because in AI engineering, visibility isn’t optional. It’s the first line of defense.

git clone https://github.com/pella-labs/pellametric.git
cd pellametric
docker-compose up -d
# Then: http://localhost:3000 → login → start instrumenting