Let’s be honest: most AI memory layers feel like duct-taped notebooks—full of isolated facts, missing context, and zero sense of narrative. You feed an LLM a query, it recalls “Paris is the capital of France” (✅), but forgets you asked about travel routes last Tuesday, you hate baguettes, and you’re planning a trip with your sister who’s allergic to gluten. That’s not memory—that’s index cards in a wind tunnel. Enter Vektori: a local, zero-config, single-binary memory layer that models sentences, episodes, and facts in one graph—and actually remembers the story. I’ve been running it alongside my Ollama-backed agent stack for 11 days. It’s not perfect—but it’s the first memory system I’ve used where I catch myself saying “Wait, it knew that?”

What Is Vektori? A Narrative-Aware Memory Graph for Local AI Agents

Vektori isn’t another vector DB pretending to be smart. It’s a sentence-level memory model built on a three-layer graph:

  • Raw Sentences: verbatim input chunks (e.g., “My sister’s gluten allergy ruined breakfast in Montmartre.”)
  • Episodes: temporally grouped, semantically coherent clusters (e.g., Trip to Paris — May 2024)
  • Facts: distilled, timeless assertions (“User’s sister has celiac disease”, “User dislikes traditional French breakfast pastries”)
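
To make the three-layer shape concrete, here's a minimal sketch as Python dataclasses. The field names (`sentence_ids`, `supporting_sentence_ids`, and so on) are my own invention for illustration; Vektori's actual schema isn't documented here.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three layers; field names are mine, not
# Vektori's actual schema.
@dataclass
class Sentence:
    id: int
    text: str          # verbatim input chunk
    timestamp: float   # ingestion time, used for episode grouping

@dataclass
class Episode:
    id: int
    title: str                          # e.g. "Trip to Paris (May 2024)"
    sentence_ids: list = field(default_factory=list)

@dataclass
class Fact:
    id: int
    assertion: str                      # distilled, timeless statement
    supporting_sentence_ids: list = field(default_factory=list)

# One sentence feeding both an episode and a fact:
s = Sentence(id=7, text="My sister's gluten allergy ruined breakfast.", timestamp=0.0)
ep = Episode(id=1, title="Trip to Paris", sentence_ids=[s.id])
fact = Fact(id=1, assertion="User's sister has celiac disease",
            supporting_sentence_ids=[s.id])
```

The point of the sketch: a sentence is never orphaned. It belongs to an episode and can back any number of facts, which is what makes the trace queries later in this post possible.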

All three layers live in one SQLite file. No Redis. No Postgres. No embedding model download—Vektori uses all-MiniLM-L6-v2 built-in, quantized, and cached locally. It’s Python (3.10+), MIT-licensed, and as of May 2024, sits at 56 GitHub stars, with active commits every 2–3 days.

That “zero config” claim? It’s real. No config.yaml, no .env required to start. Run it, point your agent at its /v1/memory endpoint, and it begins building narrative structure automatically. No manual tagging. No episode IDs to manage. It infers episode boundaries from temporal proximity, co-reference, and semantic density.
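
For intuition, one plausible version of that boundary heuristic fits in a few lines. This is my own guess at the mechanism, not Vektori's published algorithm; the `sim` callback stands in for embedding cosine similarity.

```python
# Hedged sketch of a boundary heuristic: start a new episode only when the
# time gap is large AND the new text looks semantically unrelated to the
# previous one. Vektori's real logic also uses co-reference, per the docs.
def split_episodes(items, max_gap_s=3600, min_sim=0.35, sim=lambda a, b: 1.0):
    """items: list of (timestamp, text) pairs in arrival order."""
    episodes = []
    for ts, text in items:
        if episodes:
            last_ts, last_text = episodes[-1][-1]
            if ts - last_ts > max_gap_s and sim(last_text, text) < min_sim:
                episodes.append([])  # boundary detected: open a new episode
        else:
            episodes.append([])      # first item opens the first episode
        episodes[-1].append((ts, text))
    return episodes
```

With a real similarity function plugged in, two messages an hour apart about the same bug stay in one episode, while an hour gap plus a topic change opens a new one.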

Here’s the kicker: Vektori doesn’t just store—it reconstructs. Ask it “What happened during my Paris trip?”, and it returns a ranked list of episodes, not just nearest-neighbor sentences. Ask “Why did I skip the bakery?”, and it traces back through facts → episodes → supporting sentences. That’s not retrieval. That’s reasoning over memory.
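
A rough picture of what that facts → episodes → sentences walk could look like. The response shapes here are my assumptions for illustration; the real API may differ.

```python
# Illustrative only: dict shapes ("assertion", "episode_ids", "title",
# "sentence_ids") are assumed, not Vektori's documented response format.
def trace_fact(fact, episodes, sentences):
    """Walk fact -> episodes -> supporting sentences, returning a readable
    explanation chain."""
    chain = [f"fact: {fact['assertion']}"]
    for ep_id in fact["episode_ids"]:
        ep = episodes[ep_id]
        chain.append(f"episode: {ep['title']}")
        chain.extend(f"  sentence: {sentences[sid]}" for sid in ep["sentence_ids"])
    return chain
```

Answering "Why did I skip the bakery?" is then just: find the matching fact, and walk its chain down to the verbatim sentences that justify it.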

How Vektori Compares to Alternatives

If you’ve tried other local memory solutions, you’ve probably hit one of these walls:

  • ChromaDB + LangChain memory: Great for facts, terrible for narrative. You get nearest-neighbor sentences, but zero episode grouping. You’re responsible for chunking, summarizing, and stitching context yourself. Vektori does that automatically—and stores the relationships. The trade-off is footprint: Chroma idles around 32MB RAM, while Vektori sits at ~140MB RSS (measured via ps aux --sort=-%mem | head -5).

  • MemGPT’s in-memory recall: Fast, but volatile. Restart the process? Your agent forgets everything. Vektori persists to SQLite—full ACID, crash-safe, and VACUUM-friendly. I killed its process mid-ingest twice—zero corruption.

  • LlamaIndex + SimpleNodeParser: Requires manual node linking and summary generation. Vektori’s episode layer emerges on its own. I fed it 1,200 lines of chat logs (raw llama.cpp + llm-vm traces), and it auto-grouped them into 18 episodes—e.g., “Debugging Ollama CUDA errors”, “Tuning Qwen3-4B quantization”. I didn’t label a single one.

  • Weaviate or Qdrant (self-hosted): Overkill for personal agents. They need Docker orchestration, schema definitions, and a GPU that’s optional but recommended for decent speed. Vektori? pip install vektori && vektori serve does it. No GPU. No Docker. Just Python and 2GB of free RAM.

That said—Vektori isn’t a replacement for production-grade retrieval. It doesn’t support filtering by metadata keys (source: discord, type: code). It doesn’t do hybrid search (keyword + vector). If you need enterprise audit logs or multi-tenant isolation, look elsewhere. But if you’re a solo dev, researcher, or privacy-obsessed tinkerer building narrative agents, Vektori hits a sweet spot no other OSS tool touches.
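
If you need metadata filtering today, one workaround is a tagging convention layered on top: prefix ingested text with key:value markers and filter results client-side. The [source:discord] convention below is mine, not a Vektori feature.

```python
# Client-side metadata workaround: embed tags in the ingested text itself
# and filter on them after retrieval. The bracket convention is my own.
def tag(text, **meta):
    """Prefix text with sorted [key:value] markers before ingestion."""
    prefix = " ".join(f"[{k}:{v}]" for k, v in sorted(meta.items()))
    return f"{prefix} {text}"

def filter_by_tag(results, key, value):
    """Keep only results whose text carries the given tag.
    Assumes each result dict has a 'text' field."""
    needle = f"[{key}:{value}]"
    return [r for r in results if needle in r["text"]]
```

It's crude (tags pollute the embedding a little), but it covers the source/type filtering gap until the project grows real metadata support.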

Installation & Getting Started: Zero Config, Real Results

Vektori ships as a pip-installable package (v0.3.2 as of May 2024). No build steps, no Rust toolchain. Here’s what works right now:

Via pip (fastest for testing)

pip install vektori==0.3.2
vektori serve --port 8000

That’s it. It auto-creates ./vektori.db and starts a FastAPI server at http://localhost:8000.

Via Docker (recommended for persistence)

Use this docker-compose.yml:

version: '3.8'
services:
  vektori:
    image: ghcr.io/vektori-ai/vektori:0.3.2
    ports:
      - "8000:8000"
    volumes:
      - ./vektori-data:/app/data
    restart: unless-stopped
    command: ["serve", "--port", "8000"]

Then:

docker compose up -d
curl -X POST http://localhost:8000/v1/memory \
  -H "Content-Type: application/json" \
  -d '{"text": "I booked a flight to Lisbon for June 12. My passport expires in August."}'

Adding memory with Python (your agent’s POV)

import requests

def remember(text: str, endpoint="http://localhost:8000/v1/memory"):
    resp = requests.post(endpoint, json={"text": text}, timeout=10)
    resp.raise_for_status()  # surface HTTP errors instead of silently parsing an error body
    return resp.json()  # returns episode_id, fact_ids, sentence_id

remember("Lisbon flight confirmed. Gate C12.")
remember("Passport renewal started via gov.uk portal.")

Querying is just as light:

# Get top 3 episodes about "Lisbon"
requests.get("http://localhost:8000/v1/episodes?query=Lisbon&limit=3", timeout=10).json()

# Get all facts containing "passport"
requests.get("http://localhost:8000/v1/facts?query=passport", timeout=10).json()

No API keys. No auth layer. That’s intentional—and yes, you should add reverse-proxy auth (Caddy/Nginx) in production. But for local dev? Bliss.
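
For the reverse-proxy route, a minimal Caddyfile sketch looks like this. The hostname and username are placeholders; generate the password hash with caddy hash-password.

```
# Hypothetical Caddyfile: HTTP Basic auth in front of Vektori on the LAN.
vektori.lan {
    basicauth {
        me <bcrypt-hash>   # output of `caddy hash-password`
    }
    reverse_proxy localhost:8000
}
```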

Why Self-Host Vektori? Who Actually Needs This?

Let’s cut the fluff: Vektori isn’t for everyone. You don’t need it if:

  • You’re building a static FAQ bot
  • Your “memory” is just a 10-line context.txt
  • You rely on cloud LLMs with built-in session history (e.g., Claude’s conversation_id)

You do need it if:

  • You run local LLM agents (Ollama, LM Studio, text-generation-webui) and want them to retain user history across reboots
  • You’re experimenting with autonomous research agents, and need them to distinguish “paper X says Y” (fact) from “I read paper X on Tuesday while debugging Z” (episode)
  • You care about data sovereignty: your memory lives in one encrypted SQLite file you can rsync to backup drives or wipe with shred -u vektori.db
  • You’re sick of writing summarize_last_5_messages() logic and want the system to infer narrative structure for you
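
Concretely, the summarize_last_5_messages() pattern collapses into a query plus a formatter. The /v1/episodes endpoint matches the query examples above, but the title key in the response is my assumption.

```python
# Sketch: instead of hand-rolling summarization, ask the memory layer for
# episodes and format them into the prompt. Response shape is assumed.
def episodes_to_prompt(episodes):
    """Turn a list of episode dicts (with a 'title' key, as assumed here)
    into a context block for the system prompt."""
    if not episodes:
        return "No relevant memory."
    return "Relevant memory:\n" + "\n".join(f"- {e['title']}" for e in episodes)

def fetch_context(query, base="http://localhost:8000", limit=3):
    import requests  # deferred so the formatter stays dependency-free
    eps = requests.get(f"{base}/v1/episodes",
                       params={"query": query, "limit": limit},
                       timeout=10).json()
    return episodes_to_prompt(eps)
```

The agent's per-turn logic shrinks to one fetch_context() call instead of a pile of bespoke summarization code.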

I use it with a llama.cpp-hosted phi-4 agent that handles my daily dev journal. Before Vektori, I’d get:

“You asked about Rust async last week.”
Now it says:
“In your ‘Rust Debug Sprint’ episode (May 3–5), you traced tokio::select! hangs, referenced PR #442, and concluded Pin<Box<dyn Future>> was the culprit.”

That level of contextual recall changes how you interact with your agent—not just what it knows, but how it understands your intent.

Resource Usage, Hardware, and Real-World Constraints

Vektori is lean—but not magic. Here’s what I measured on my NUC11 (16GB RAM, i5-1135G7, no dGPU):

  • Cold start: vektori serve takes ~1.8s (loads all-MiniLM-L6-v2 quantized .gguf from disk)
  • Idle memory: 142MB RSS, ~9% CPU (background SQLite WAL writes)
  • Ingest throughput: ~8–12 sentences/sec on CPU (tested with 500-line journal batch)
  • Disk usage: 1.2MB per 1,000 sentences (SQLite bloat is minimal—VACUUM shrinks it ~15%)
  • Max recommended scale: ~50,000 sentences on <8GB RAM systems. Beyond that, you’ll want to prune old episodes (not yet exposed in the API—but trivial to do via raw SQLite: DELETE FROM episodes WHERE created_at < '2024-01-01').
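
Until pruning lands in the API, a small script over the raw database works. The table and column names below come from the SQL snippet above; verify them against your own vektori.db schema before running anything destructive.

```python
import sqlite3

# Hedged sketch: "episodes" / "created_at" are taken from the raw-SQL
# example above, not from a documented schema. Back up the file first.
def prune_before(db_path, cutoff_iso):
    """Delete episodes older than cutoff_iso and reclaim disk space.
    Returns the number of rows deleted."""
    con = sqlite3.connect(db_path)
    with con:  # commit the DELETE atomically
        cur = con.execute("DELETE FROM episodes WHERE created_at < ?",
                          (cutoff_iso,))
    con.execute("VACUUM")  # shrink the file after the transaction commits
    n = cur.rowcount
    con.close()
    return n
```

Stop the vektori serve process (or at least quiesce writes) before pruning; SQLite handles concurrent readers fine, but a schema you don't own deserves caution.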

No GPU acceleration. No CUDA. It’s CPU-bound, but the quantized embedding model keeps latency sane: POST /v1/memory averages 320–440ms on my hardware. Not “real-time chat” fast—but perfectly fine for agent-to-agent or agent-to-user async loops.
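
Given the ~300–450ms write latency, a simple pattern is to keep memory writes off the agent's hot path with a background thread. This is generic plumbing, not a Vektori feature.

```python
import threading
import queue

# Fire-and-forget memory writes: the agent enqueues text and moves on,
# while a daemon thread drains the queue through write_fn (e.g. the
# remember() helper shown earlier).
def background_writer(write_fn):
    q = queue.Queue()

    def worker():
        while True:
            text = q.get()
            if text is None:       # sentinel: shut the worker down
                break
            write_fn(text)
            q.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t
```

The agent's response latency then no longer includes the embedding step; only recall queries stay synchronous.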

Note: The current embedding model (all-MiniLM-L6-v2) is solid for English, but weak on code and heavy domain jargon. There’s no model-swap API yet—but GitHub issue #23 is open, and the maintainer commented “planned for v0.4”. So keep an eye on that.

The Verdict: Is Vektori Worth Deploying Now?

Yes—but with caveats.

Why I’m keeping it running:

  • It solved a real problem I’d hacked around for months: agent amnesia. My dev journal agent now recalls why I abandoned a branch, not just that I did.
  • The SQLite backend is stupidly reliable. I’ve force-restarted it 17 times. Zero data loss.
  • The episode layer just works. No tuning. No thresholds. It grouped my chaotic git log --oneline + journalctl snippets into coherent “Debugging Sessions” without me lifting a finger.

Rough edges I hit:

  • No auth layer. Not even HTTP Basic. If you expose this to your LAN, slap Nginx in front now.
  • No CLI for querying. You need curl or Python to talk to it—no vektori episodes --query="cuda" yet.
  • No embedding model hot-swap. If you want nomic-embed-text-v1.5, you’re waiting for v0.4.
  • Fact extraction is rule-based, not LLM-driven. It catches “X is Y”, “X has Z”, but misses implied facts like “I paid $30 → I have $30 less”. That’s fine for v0.3, but expect richer inference in later releases.
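
To picture what rule-based extraction means in practice, here's a toy version of the "X is Y" / "X has Z" matching. Vektori's actual rules are surely richer; this just shows why implied facts slip through.

```python
import re

# Toy sketch of rule-based fact extraction. Each pattern captures a
# (subject, object) pair; anything that doesn't fit a pattern is lost,
# which is exactly the "implied facts" gap described above.
PATTERNS = [
    re.compile(r"^(?P<subj>.+?) is (?P<obj>.+?)\.?$", re.I),
    re.compile(r"^(?P<subj>.+?) has (?P<obj>.+?)\.?$", re.I),
]

def extract_facts(sentence):
    facts = []
    for pat in PATTERNS:
        m = pat.match(sentence.strip())
        if m:
            facts.append((m.group("subj"), m.group("obj")))
    return facts
```

"I paid $30" matches neither pattern, so no fact is stored; inferring "I have $30 less" needs actual reasoning, which is why that's deferred to later releases.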

The honest TL;DR: Vektori is the most human-aligned local memory layer I’ve used. It doesn’t just remember what you said—it remembers when, why, and in what context. It’s not production-ready for teams, but for solo hackers, researchers, or anyone building narrative-aware agents? It’s the missing piece.

I’m running it 24/7. I’ve added curl calls to a git post-commit hook to auto-log every commit. I’ve wired it into my systemd journal parser. It’s not perfect—but it’s the first memory system that made me feel like my agent gets me.

And in self-hosting? That’s worth more than stars on GitHub.

Update: As of May 22, 2024, Vektori is at v0.3.2 with 56 stars. The maintainer merged episode pruning support yesterday. Watch that repo. This thing’s moving.