Let’s cut the fluff: if you’re running 3+ MCP (Model Context Protocol) servers — whether it’s llama.cpp, ollama, text-generation-inference, or your own custom LLM endpoint — and you’re still curl-ing /health endpoints, parsing Prometheus metrics manually, or losing track of which model is choking on context length during inference bursts… you’re flying blind. Observal isn’t just another dashboard. It’s the first real observability layer built for the MCP ecosystem — not retrofitted, not generic, not trying to be Grafana + Prometheus + Loki all at once. And yes, it’s only got 113 GitHub stars (as of May 2024), written in Python, and barely six months old — but that’s exactly why it’s worth your 20 minutes right now.

What Is Observal — and Why Does MCP Need Its Own Observability Tool?

Observal (GitHub: BlazeUp-AI/Observal) is a lightweight, Python-based observability platform designed exclusively for MCP-compliant LLM servers. That specificity is its superpower — and its current limitation.

MCP is not just another API spec. It defines standardized /health, /ready, /metrics, and /models endpoints — plus structured request/response telemetry (e.g., input_tokens, output_tokens, inference_time_ms, queue_wait_ms). Most existing tools — Prometheus, Grafana, or even Langfuse — treat LLM endpoints as black-box HTTP services. They might scrape /metrics, but they don’t understand what model_loading_duration_seconds means in the context of a 13B GGUF on a 24GB 3090, or how to correlate a spike in queue_wait_ms with a runaway transformers pipeline on CPU.
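To make that concrete, here's roughly what one MCP telemetry record looks like — the field names are exactly the ones listed above, but the envelope is my illustration, not a normative example from the spec:

```json
{
  "model_id": "phi-3-mini-4k-instruct.Q4_K_M.gguf",
  "input_tokens": 512,
  "output_tokens": 128,
  "inference_time_ms": 1840,
  "queue_wait_ms": 35
}
```

That structure — not raw HTTP timings — is what lets a tool reason about *model* behavior instead of endpoint behavior.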

Observal does. It connects to your MCP servers, auto-discovers models, pulls structured telemetry, renders latency histograms per model, tracks token throughput over time, and flags unhealthy endpoints before your frontend times out. It also includes a built-in “marketplace” — not for selling models, but for discovering, testing, and benchmarking other MCP servers on your network (think: curl http://observal:8000/marketplace → see your llama.cpp, vLLM, and llm-server-mcp instances side-by-side with uptime, avg. latency, and active contexts).

That said: this is not Datadog for LLMs. It doesn’t do distributed tracing (yet), has no alerting engine (you’ll need to wire it to Alertmanager), and no persistent storage layer — metrics live in memory or get dumped to CSV/JSON for export.

How to Install Observal: From docker pull to Live Dashboard

Observal runs as a single Python process (no complex orchestration needed), but Docker is the cleanest path — especially if you’re already running MCP servers in containers.

First, verify your MCP servers are actually MCP-compliant. Test one:

curl -s http://localhost:8080/health | jq
# Should return { "status": "ok", "version": "0.2.1", "mcp_version": "0.1.0" }

If you get 404 or garbage JSON, Observal won’t talk to it.
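If you have more than a couple of servers, script that compliance check instead of curl-ing each one. A minimal sketch — the required keys come from the /health response shown above; `is_mcp_health` and `check_endpoint` are my helper names, not part of Observal:

```python
import json
import urllib.request

# Keys we expect in an MCP-compliant /health response (per the example above).
REQUIRED_KEYS = {"status", "version", "mcp_version"}


def is_mcp_health(payload: dict) -> bool:
    """True if a /health payload looks MCP-compliant: all required
    keys present and status is 'ok'."""
    return REQUIRED_KEYS <= payload.keys() and payload.get("status") == "ok"


def check_endpoint(base_url: str) -> bool:
    """Fetch <base_url>/health and validate it. Returns False on any
    HTTP error or non-JSON body instead of raising."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return is_mcp_health(json.load(resp))
    except Exception:
        return False
```

Run it against every URL you plan to put in OBSERVAL_SERVERS before starting the container — it turns "Observal silently skipped my server" into a two-second pre-flight check.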

Then, pull the latest Observal image (v0.3.2 as of this writing — not on PyPI yet, so Docker is mandatory for production):

docker pull ghcr.io/blazeup-ai/observal:0.3.2

Here’s a minimal docker-compose.yml that watches three MCP services (llama.cpp, ollama, and a local text-generation-inference instance):

version: '3.8'
services:
  observal:
    image: ghcr.io/blazeup-ai/observal:0.3.2
    ports:
      - "8000:8000"
    environment:
      - OBSERVAL_LOG_LEVEL=INFO
      - OBSERVAL_SERVERS=http://llama-cpp:8080,http://ollama:11434,http://tgi:8080
      - OBSERVAL_POLL_INTERVAL=15
    depends_on:
      - llama-cpp
      - ollama
      - tgi
    restart: unless-stopped

  llama-cpp:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    command: ["--model", "/models/phi-3-mini-4k-instruct.Q4_K_M.gguf", "--port", "8080", "--host", "0.0.0.0"]
    volumes:
      - ./models:/models
    ports:
      - "8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434"
    volumes:
      - ./ollama:/root/.ollama

  tgi:
    image: ghcr.io/huggingface/text-generation-inference:2.0.4
    command: ["--model-id", "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "--port", "8080"]
    ports:
      - "8080"
    deploy:
      resources:
        reservations:
          memory: 8G
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

⚠️ Critical config note: Observal expects all MCP servers to expose /health, /metrics, and /models on the same port — and it talks to them as containers on the same Docker network. So http://llama-cpp:8080 works inside the compose network, but http://localhost:8080 will fail. Don’t waste 45 minutes debugging that.

Once up:

docker compose up -d
docker compose logs -f observal

You’ll see logs like:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     MCP server discovered: http://llama-cpp:8080 → phi-3-mini-4k-instruct (loaded)
INFO:     MCP server discovered: http://ollama:11434 → tinyllama:1.1b (ready)
INFO:     Polling metrics every 15s...

Then hit http://localhost:8000 — you’ll get a clean, responsive dashboard showing uptime graphs, per-model latency heatmaps, token throughput, and real-time request logs.

Observal vs. The Alternatives: Prometheus, Langfuse, and Why “Good Enough” Isn’t Enough

Let’s be honest: you could slap Prometheus exporters on your LLM servers and call it a day. And if you’re running one vLLM cluster behind a single ingress — sure, that’s fine.

But here’s where the cracks show:

  • Prometheus + Grafana: You’ll spend hours writing custom metric relabeling rules just to split http_request_duration_seconds{model="phi-3"} from http_request_duration_seconds{model="tinyllama"} — and even then, you won’t know if phi-3 is actually loaded, or just sitting idle in a pool. Observal auto-discovers loaded models and only graphs metrics for active ones. It also natively renders inference_time_ms vs queue_wait_ms — Prometheus just gives you raw quantiles.

  • Langfuse: Langfuse is brilliant for application-level LLM tracing (RAG pipelines, LangChain steps, evals). But it’s not infrastructure observability. It doesn’t tell you that your llama.cpp instance just OOM’d and restarted, or that tgi is stuck at 98% GPU memory for 12 minutes. Langfuse sees requests, Observal sees servers.

  • Grafana Agent + Loki: Great for logs. Terrible for correlating model_loading_duration_seconds = 8.2s with GPU memory usage = 23.1/24GB — because those live in separate systems. Observal keeps them in one context, with one refresh.

The TL;DR: Use Langfuse alongside Observal — not instead of it. Langfuse answers “Why did this RAG chain fail?”, Observal answers “Why is /chat/completions timing out before it even hits my LangChain app?”

Why Self-Host Observal? Who Actually Needs This?

Observal is only worth self-hosting if you fit one (or more) of these profiles:

  • You run >2 heterogeneous LLM backends (e.g., llama.cpp for GGUF, vLLM for Hugging Face, ollama for dev iteration) and want a single pane of glass.
  • You’re building an internal LLM platform for your engineering team — and need to prove uptime SLAs or debug latency regressions across model versions.
  • You’re benchmarking quantized models and need to compare real-world output_tokens_per_second across hardware (e.g., 3090 vs. 4090 vs. M3 Ultra) — not just synthetic perf output.
  • You’re experimenting with MCP itself — and want to validate your own MCP server implementation against real telemetry.
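For the benchmarking case, the arithmetic is simple enough to script directly against the telemetry — a sketch using the output_tokens and inference_time_ms fields; the function names and the per-hardware grouping are mine:

```python
def tokens_per_second(output_tokens: int, inference_time_ms: float) -> float:
    """Real-world decode throughput from the output_tokens and
    inference_time_ms telemetry fields."""
    if inference_time_ms <= 0:
        raise ValueError("inference_time_ms must be positive")
    return output_tokens / (inference_time_ms / 1000.0)


def compare(runs: dict[str, list[tuple[int, float]]]) -> dict[str, float]:
    """Average throughput per hardware label, given per-run
    (output_tokens, inference_time_ms) samples."""
    return {
        label: sum(tokens_per_second(tok, ms) for tok, ms in samples) / len(samples)
        for label, samples in runs.items()
    }
```

For example, 128 output tokens in 1,600 ms is 80 tokens/s — collect a handful of runs per box and `compare` gives you the real-world number, not the synthetic one.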

It’s not for you if:

  • You run one ollama serve on your laptop and curl it occasionally.
  • You need enterprise RBAC, SSO, or audit logs (Observal has none of that — yet).
  • You expect built-in alerting or email/SMS notifications (you’ll need to script something against its /api/v1/metrics endpoint).
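Since there's no built-in alerting, that /api/v1/metrics script is the obvious first thing to write. A sketch of one — note the response shape I'm assuming here (a list of per-server dicts with `endpoint`, `status`, and `avg_latency_ms` keys) is a guess; check your Observal version's actual JSON before wiring this up:

```python
import json
import urllib.request

# Threshold is illustrative; tune it to your fleet.
MAX_AVG_LATENCY_MS = 500.0


def unhealthy(servers: list[dict], max_latency_ms: float = MAX_AVG_LATENCY_MS) -> list[str]:
    """Return endpoints that are down or over the latency threshold.
    Assumes each entry carries 'endpoint', 'status', 'avg_latency_ms' --
    verify against the real /api/v1/metrics output."""
    bad = []
    for s in servers:
        if s.get("status") != "ok" or s.get("avg_latency_ms", 0.0) > max_latency_ms:
            bad.append(s.get("endpoint", "<unknown>"))
    return bad


def poll(observal_url: str = "http://localhost:8000") -> list[str]:
    """One polling pass: fetch the metrics snapshot and filter it."""
    with urllib.request.urlopen(f"{observal_url}/api/v1/metrics", timeout=10) as resp:
        return unhealthy(json.load(resp))
```

Drop `poll()` into a cron job and pipe any non-empty result to a Slack webhook or Alertmanager, and you've covered the notification gap for now.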

Resource-wise, Observal is ridiculously light. I’ve run it on a $5/month Hetzner CX11 (2 vCPU, 2GB RAM) alongside two llama.cpp instances — and it hovers at 38MB RAM, <5% CPU, even polling 5 servers every 10 seconds. It does zero heavy lifting: no ML, no indexing, no vector DB. It’s just a smart HTTP client + FastAPI + lightweight in-memory aggregation.
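The "no heavy lifting" claim is easy to picture: bounded, in-memory aggregation along these lines — my sketch of the pattern, not Observal's actual code:

```python
from collections import deque
from statistics import mean


class RollingLatency:
    """Keep the last N latency samples per model in memory. Memory is
    bounded by window * number of models, which is why a tool built
    this way can idle at tens of MB."""

    def __init__(self, window: int = 1000):
        self.window = window
        self.samples: dict[str, deque] = {}

    def record(self, model_id: str, latency_ms: float) -> None:
        # deque(maxlen=...) silently evicts the oldest sample.
        self.samples.setdefault(model_id, deque(maxlen=self.window)).append(latency_ms)

    def avg(self, model_id: str) -> float:
        q = self.samples.get(model_id)
        return mean(q) if q else 0.0
```

The tradeoff is exactly the one noted above: bounded memory means no history survives a restart.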

The Rough Edges: What’s Missing, and What I Ran Into

I’ve run Observal in prod for 17 days across 4 MCP servers (2 llama.cpp, 1 tgi, 1 custom MCP wrapper around transformers). Here’s my unfiltered take:

What works brilliantly:

  • Auto-discovery is rock-solid. Add a new MCP endpoint to OBSERVAL_SERVERS, restart, and it appears in <30 seconds.
  • The /marketplace view is exactly what I wished existed for internal dev tooling. Seeing tinyllama at 12ms avg vs phi-3 at 87ms on the same GPU? Incredibly actionable.
  • The /api/v1/metrics endpoint returns clean JSON — I built a simple Grafana dashboard on top of it in 20 minutes.

What’s rough right now:

  • No persistent metrics storage. All data vanishes on restart. There’s an open PR (#42) for SQLite support, but it’s not merged. If you care about historical trends beyond 24h, you must export to CSV or pipe to Prometheus.
  • Model tagging is manual. Observal grabs model_id from /models, but doesn’t let you add aliases like "prod-phi3-quantized" or "dev-tinyllama-v2". You get phi-3-mini-4k-instruct.Q4_K_M.gguf — not exactly human-friendly.
  • No authentication. OBSERVAL_USERNAME/OBSERVAL_PASSWORD env vars are planned, but not implemented. I’m blocking / with nginx basic auth — not ideal, but necessary on shared infra.
  • MCP version skew. Observal v0.3.2 expects MCP v0.1.0. If your text-generation-inference instance returns mcp_version: "0.0.1", it’ll skip it silently. Check your server’s /health output before assuming it’s compatible.
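Until persistent storage lands, a tiny sidecar that snapshots metrics to CSV is the workaround I'd reach for. A sketch — the column names are illustrative and need to be matched to whatever your /api/v1/metrics actually returns:

```python
import csv
import time
from pathlib import Path


def append_snapshot(rows: list[dict], path: str) -> int:
    """Append one timestamped row per server to a CSV file, writing
    the header only when the file is new. Returns rows written.
    Column names here are my assumption, not Observal's schema."""
    fields = ["ts", "endpoint", "model_id", "avg_latency_ms", "tokens_per_s"]
    p = Path(path)
    new_file = not p.exists()
    ts = int(time.time())
    with p.open("a", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        if new_file:
            w.writeheader()
        for row in rows:
            w.writerow({"ts": ts, **row})
    return len(rows)
```

Run it from cron against a fetch of /api/v1/metrics and you get crude-but-durable history that survives restarts.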

Also worth noting: the GitHub repo has 113 stars, 3 contributors, and 22 open issues — most labeled good first issue. This isn’t a “set and forget” enterprise tool. It’s a sharp, opinionated, developer-first tool — and it shows. You’ll need to read the source (it’s only ~1,200 lines of Python) if you want to tweak polling behavior or add custom metrics.

Final Verdict: Should You Deploy It?

Yes — but not as your final observability solution. Deploy it as your MCP-first telemetry foundation.

If you’re knee-deep in LLM ops and tired of stitching together docker stats, curl one-liners, and htop, Observal will save you 5–10 hours/week just in debugging time. The fact that it understands queue_wait_ms as a first-class metric — and renders it next to inference_time_ms — is worth the 5-minute docker compose setup alone.

Is it production-ready for a 200-node LLM cluster? No. Does it replace Prometheus? Absolutely not. But for teams running 2–10 MCP servers — especially those building internal LLM tooling — it’s the missing piece. It’s early, scrappy, and refreshingly focused.

I’m keeping it running. And I’ve already submitted a PR to add support for model_loading_failure_reason — because when your 70B model fails to load on GPU, you want that error in the dashboard. Not buried in docker logs.

Give it 20 minutes. Run the compose file. Hit http://localhost:8000/marketplace. Then tell me you don’t immediately think, “Why didn’t this exist last month?”