My AI agent couldn't remember anything.
Every session started the same way. "What's the NAS IP?" "Which host is Grafana on?" "How do I SSH into the ownCloud container?" I'd answer, the agent would do the work, and the next time I opened Slack it would ask again. Sixty-two completed tasks and it still had no memory across conversations. Every Slack thread was an island.
I got tired of onboarding my own agent to my own infrastructure. So I plugged in Hindsight — and then spent two weeks putting out fires it started.
The Built-In Memory Tool Wasn't Enough
Hermes Agent ships with a memory tool. You call memory(action="add", content="Grafana is on the monitoring host at port 3000") and that fact gets injected into every future turn. It works for maybe 15-20 facts before you hit the 2,200-character budget. And there's no semantic search — exact string matching only. If you phrase the question differently than the stored fact, no match.
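A toy model makes both limits concrete. This is a sketch, not Hermes code: the `FlatMemory` class and its method names are hypothetical, and it assumes "exact string matching" behaves like a case-insensitive substring check.

```python
# Toy model of a flat memory store: a hard character budget plus
# exact substring lookup. Hypothetical names, not the Hermes implementation.
BUDGET_CHARS = 2200  # the budget quoted above


class FlatMemory:
    def __init__(self, budget: int = BUDGET_CHARS) -> None:
        self.budget = budget
        self.facts: list[str] = []

    def used(self) -> int:
        """Characters consumed by all stored facts."""
        return sum(len(fact) for fact in self.facts)

    def add(self, content: str) -> bool:
        """Store a fact only if it still fits in the budget."""
        if self.used() + len(content) > self.budget:
            return False
        self.facts.append(content)
        return True

    def find(self, needle: str) -> list[str]:
        """Case-insensitive substring match; no notion of meaning."""
        needle = needle.lower()
        return [fact for fact in self.facts if needle in fact.lower()]


mem = FlatMemory()
mem.add("Grafana is on the monitoring host at port 3000")
hits_exact = mem.find("Grafana")                        # shares a substring
hits_paraphrase = mem.find("where are my dashboards")   # shares nothing
```

The paraphrased query comes back empty even though the stored fact answers it, which is the gap semantic search is meant to close.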
Hindsight is the upgrade path. It's an open-source memory plugin from Vectorize-io that ships with Hermes. It runs as a Python daemon with embedded PostgreSQL 18.1 and pgvector 0.4.2 underneath — no separate database server to manage, no cloud service to pay for. You drop it on an existing host and it handles semantic search, entity resolution, and knowledge graph traversal automatically.
I deployed it on the bridge host where the LLM gateway already lived. One LXC, no new infrastructure.
Phase 1: pgvector From Source (Not a One-Liner)
The plugin ships with an embedded PostgreSQL, but pgvector had to be built from source against PG 18.1 and GLIBC 2.36. The CREATE EXTENSION vector one-liner I'd imagined wasn't available — the packaged version targeted older Postgres releases.
Worse: the initdb step failed when run as root. PostgreSQL 18.1 refuses to initialize a data directory under the root user. I had to create a dedicated hindsight system user and rerun initdb under that account. The data landed at /home/hindsight/.pg0/ on the bridge host.
That was the first sign this wasn't going to be a clean install. I should have paid more attention.
Phase 2: Hindsight Actually Works
Once it was running — a uvicorn daemon on port 9177 — the API turned out to be straightforward. Three tools:
hindsight_retain(content, context, tags): stores a fact. Hindsight automatically extracts entities, builds graph connections, and generates embeddings. You don't pass entity IDs or relationship types. You just pass text and tags, and it figures out the structure.
hindsight_recall(query): searches across all stored memories using semantic similarity, keyword matching, entity graph traversal, and cross-encoder reranking.
hindsight_reflect(query): synthesizes a reasoned answer by running an LLM across the most relevant memories. Not just retrieval but actual reasoning.
The embedding model is BAAI/bge-small-en-v1.5 at 384 dimensions — small enough to run on CPU, good enough for infrastructure facts. The reranker is cross-encoder/ms-marco-MiniLM-L-6-v2, which runs as a second pass over recall results to reorder them by relevance.
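The two-pass shape (cheap candidate retrieval, then a costlier rerank over the shortlist) can be sketched without any models. Everything here is a stand-in and an assumption: bag-of-words cosine replaces the embedding model, a token-overlap ratio replaces the cross-encoder, and the function names are hypothetical. It shows the control flow only, not how Hindsight scores anything.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def first_pass(query: str, memories: list[str], k: int) -> list[str]:
    """Cheap retrieval over the whole store (stand-in for embedding search)."""
    q = Counter(tokens(query))
    ranked = sorted(
        memories, key=lambda m: cosine(q, Counter(tokens(m))), reverse=True
    )
    return ranked[:k]


def rerank(query: str, candidates: list[str]) -> list[str]:
    """Second pass over the shortlist only (stand-in for a cross-encoder)."""
    q = set(tokens(query))

    def score(memory: str) -> float:
        return len(q & set(tokens(memory))) / len(q) if q else 0.0

    return sorted(candidates, key=score, reverse=True)


store = [
    "The wiki container was moved last month",
    "The NAS is reachable from the bridge host",
    "Grafana runs on the monitoring host at port 3000",
]
query = "Which host is Grafana on?"
results = rerank(query, first_pass(query, store, k=2))
```

The point of the split is cost: the expensive scorer only ever sees `k` candidates, so the reranker's price stays flat as the store grows.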
The entity graph is the part that surprised me. Tag a memory with ["infrastructure", "grafana"] and Hindsight connects it to other infrastructure facts automatically. Ask "what monitoring do I have running?" and it traverses the graph from the monitoring tag to every related memory — host, port, service name, dependencies. I didn't build any of this. It just works.
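One way to picture that traversal is a breadth-first walk over a graph where tags and memories are linked whenever a memory carries a tag. This is a simplified sketch under stated assumptions: the memory IDs, the tag assignments, and the `related` function are hypothetical, and Hindsight's real entity extraction and graph are not modeled here.

```python
from collections import defaultdict, deque

# Hypothetical sample data: memory id -> (text, tags).
MEMORIES = {
    "m1": ("Grafana is on the monitoring host at port 3000", {"grafana", "monitoring"}),
    "m2": ("The monitoring host is a Proxmox guest", {"monitoring", "proxmox"}),
    "m3": ("The wiki container was moved last month", {"wiki", "proxmox"}),
    "m4": ("The NAS holds the backups", {"storage"}),
}


def related(start_tag: str, memories: dict, max_hops: int = 2) -> list[str]:
    """Breadth-first walk: tag -> memories -> their other tags -> more memories."""
    by_tag = defaultdict(set)
    for mem_id, (_, tags) in memories.items():
        for tag in tags:
            by_tag[tag].add(mem_id)

    seen_tags = {start_tag}
    found: list[str] = []
    queue = deque([(start_tag, 0)])
    while queue:
        tag, hops = queue.popleft()
        for mem_id in sorted(by_tag.get(tag, ())):
            if mem_id not in found:
                found.append(mem_id)
            # Only expand to neighbouring tags while under the hop limit.
            if hops + 1 < max_hops:
                for next_tag in sorted(memories[mem_id][1]):
                    if next_tag not in seen_tags:
                        seen_tags.add(next_tag)
                        queue.append((next_tag, hops + 1))
    return found


direct = related("monitoring", MEMORIES, max_hops=1)
two_hops = related("monitoring", MEMORIES, max_hops=2)
```

With one hop the walk returns only memories tagged `monitoring`; with two it also reaches the wiki memory through the shared `proxmox` tag, while the unrelated storage memory is never visited.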
Phase 3: The Dual-Write Problem
Now there were two memory systems: the built-in memory tool (flat, 2,200-char budget, guaranteed context injection) and Hindsight (unlimited, semantic, graph-aware). Both needed to stay in sync.
The convention I landed on: every memory(action="add", ...) call pairs with a hindsight_retain() for the same fact. The built-in memory gets the short version for immediate context injection. Hindsight gets the full version with tags for long-term search and synthesis.
This is not an elegant pattern. I'm calling it what it is: dual-write with no transaction boundary. If one write succeeds and the other fails, the systems diverge. There's no reconciliation. I just haven't hit a case where the divergence mattered yet.
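The pattern, and its failure mode, fit in a few lines. This is a sketch with hypothetical names: the two callables stand in for the built-in memory tool and hindsight_retain(), and the `diverged` list is an illustrative addition, since the setup described above keeps no such record.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DualWriter:
    write_short: Callable[[str], None]            # stand-in for the built-in memory tool
    write_full: Callable[[str, list[str]], None]  # stand-in for hindsight_retain()
    diverged: list[str] = field(default_factory=list)

    @staticmethod
    def _try(call: Callable[[], None]) -> bool:
        # Swallow failures, mirroring a backend that is treated as non-critical.
        try:
            call()
            return True
        except Exception:
            return False

    def save(self, short: str, full: str, tags: list[str]) -> bool:
        """Two independent writes with no transaction around them."""
        ok_short = self._try(lambda: self.write_short(short))
        ok_full = self._try(lambda: self.write_full(full, tags))
        if ok_short != ok_full:
            # Exactly one side succeeded: the stores now disagree.
            self.diverged.append(short)
        return ok_short and ok_full


short_store: list[str] = []


def broken_retain(content: str, tags: list[str]) -> None:
    raise ConnectionError("memory daemon unreachable")


writer = DualWriter(write_short=short_store.append, write_full=broken_retain)
ok = writer.save(
    "Grafana: monitoring host, port 3000",
    "Grafana is on the monitoring host at port 3000",
    ["infrastructure", "grafana"],
)
```

After this call the short store holds the fact and the long-term store does not. Recording the mismatch does not fix it, but it turns silent divergence into something that could be replayed later.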
And dual-write had a worse problem.
The Spawn Storm and the CPU Meltdown
On cold boot, Hindsight spawned four daemon instances. Same port, same config, four PIDs fighting over the embedded PostgreSQL. Combined RSS hit 5.2GB. This happened silently — nothing in the agent's output indicated the memory backend was melting down.
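A common guard against duplicate daemons is an exclusive lock on a well-known file, taken before binding the port. The sketch below is an assumption about how one could prevent this, not something Hindsight or Hermes is known to do. The function name and lock path are hypothetical, and `fcntl` is Unix-only.

```python
import fcntl
import os


def acquire_single_instance_lock(path: str):
    """Return an open file holding an exclusive lock, or None if it is already held.

    The caller must keep the returned file object alive for the lifetime of
    the process; closing it (or exiting) releases the lock.
    """
    handle = open(path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # Record the owner's PID for whoever inspects the file later.
    handle.seek(0)
    handle.truncate()
    handle.write(str(os.getpid()))
    handle.flush()
    return handle
```

A daemon entry point would call this first and exit if it gets `None`, so a second, third, and fourth launch stop before touching the database.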
Then there were the eval runs. When I ran batch skill evaluations at --parallel 4, every concurrent session was calling hindsight_retain() 6-8 times per turn. That's 24-32 embedding inference calls hitting a single daemon simultaneously. The bridge host's CPU spiked past 2000% and stayed there. The daemon didn't crash — it just saturated, queuing requests and burning cycles while the agent appeared to work normally.
Twice the OOM killer stepped in. 1.8GB RSS the first time, 2GB the second. The hindsight-api process was silently killed mid-operation, and the agent continued running as if nothing had happened — because Hermes treats Hindsight as non-critical. Memory writes fail, the agent shrugs, you don't notice until you search for a fact that was never stored.
Retention Triage: What's Worth Remembering
The fix wasn't to reduce parallelism or add more CPU. It was to stop storing garbage.
Before the triage, the agent was retaining everything. Every IP address lookup. Every transient error. Every tool output from every eval iteration. "Gateway returned 503 at 14:22 UTC" — stored. The same host IP — stored four times with slightly different wording. The store filled with noise, and the noise degraded recall quality. Search returned irrelevant memories because the embedding space was cluttered with ephemera.
I added a retention triage rule to every profile's SOUL.md. Before storing, the agent asks: will I need this in a week? Stable infrastructure facts (host locations, service ports, dependency chains) always get stored. Session-specific context (error messages, timing data, one-off commands) almost never gets stored.
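The rule itself is a natural-language instruction that the agent applies with judgment, so no code is involved. Still, a crude filter shows the kind of decision being made. The patterns and function names below are hypothetical, and the duplicate check only catches differences in case and punctuation, not genuine rewording.

```python
import re

# Hypothetical markers of session-specific noise.
TRANSIENT_PATTERNS = [
    re.compile(r"\b\d{1,2}:\d{2}\b"),                           # clock times
    re.compile(r"\breturned [45]\d{2}\b", re.IGNORECASE),       # HTTP error codes
    re.compile(r"\b(timed out|traceback|retrying)\b", re.IGNORECASE),
]


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial variants compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def should_retain(fact: str, already_stored: set[str]) -> bool:
    """Rough stand-in for the question: will I need this in a week?"""
    if any(pattern.search(fact) for pattern in TRANSIENT_PATTERNS):
        return False
    key = normalize(fact)
    if key in already_stored:
        return False  # near-verbatim duplicate
    already_stored.add(key)
    return True


stored: set[str] = set()
keep_fact = should_retain("Grafana is on the monitoring host at port 3000", stored)
keep_dupe = should_retain("grafana is on the monitoring host at port 3000.", stored)
keep_error = should_retain("Gateway returned 503 at 14:22 UTC", stored)
```

A filter like this would miss paraphrased duplicates and would wrongly drop a stable fact that happens to mention a time, which is why the real rule is left to the agent rather than to regexes.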
The result: ~75-80% fewer retention calls during batch workloads. The CPU spikes dropped from "melting" to "warm." Recall quality improved because the embedding space wasn't polluted with transient noise. The trade-off is that some genuinely useful facts might get dropped — but a clean store beats a complete one.
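The arithmetic behind that drop is worth spelling out. The sketch assumes the 75-80% reduction applies evenly to the per-turn figures from the eval runs (4 parallel sessions, 6-8 retain calls each), which is a simplification. The function names are illustrative.

```python
def calls_per_turn(sessions: int, per_session: tuple[int, int]) -> tuple[int, int]:
    """Low and high estimate of concurrent retain calls across all sessions."""
    low, high = per_session
    return sessions * low, sessions * high


def after_reduction(calls: int, percent_fewer: int) -> float:
    """Calls remaining once a given percentage has been triaged away."""
    return calls * (100 - percent_fewer) / 100


before = calls_per_turn(4, (6, 8))  # 24 to 32 calls
after = (
    after_reduction(before[0], 80),  # best case: low load, 80% fewer
    after_reduction(before[1], 75),  # worst case: high load, 75% fewer
)
# Roughly 5 to 8 concurrent embedding calls instead of 24 to 32.
```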
What Actually Changed
Before Hindsight, every session was a fresh start. The agent didn't remember which services ran where, which hosts were on which Proxmox node, or which SSH key unlocked which container. I answered the same infrastructure questions over and over.
After Hindsight, those questions stopped.
When I say "check the Grafana dashboard," the agent recalls Grafana is on the monitoring host at port 3000, that the monitoring host is a Proxmox guest, and that it's reachable from the bridge. It follows the entity chain without asking me to fill gaps.
Cross-session context was the bigger win. Before, if we debugged a gateway issue on Monday and I came back Wednesday, the agent had no memory of Monday's diagnosis. Now "remember we moved the wiki container last month?" actually works. The agent retrieves the migration memory, checks the entity graph for the current location, and picks up where we left off.
Sixty-two completed tasks and the agent finally has context across conversations. Not because I built a sophisticated memory architecture — because I plugged in an open-source plugin, fought through the source-build and the spawn storms, and learned that the hard part of AI memory isn't the technology. It's deciding what's worth remembering.
What's Next
The biggest gap right now is visibility. I have no dashboard showing what's stored, how often facts are accessed, or whether recall is returning stale data. The OOM kills went unnoticed for days. The CPU saturation was invisible until I SSH'd in and ran htop. A simple Grafana panel — memory count, access frequency, search hit rates — would catch problems before the agent's behavior degrades.
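Even before a dashboard exists, a liveness probe would have flagged the OOM kills. The sketch below only checks that something accepts TCP connections on the daemon's port. It assumes nothing about Hindsight's HTTP API, and the function name is hypothetical.

```python
import socket


def daemon_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example (not executed here): probe the daemon on its port.
# daemon_is_listening("127.0.0.1", 9177)
```

An open port does not prove the daemon is healthy, since a saturated process still accepts connections. It would only catch the case where the process is gone, which is the case that went unnoticed for days.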
Longer term, I want to add memory decay. Facts that are never accessed should be deprioritized over time. The NetAlertX hostname quirk is worth keeping forever. A gateway 503 from three weeks ago is not. The current store treats them equally.
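One simple form this could take is an exponential half-life on time since last access, with a pin for facts that should never fade. This is a sketch of the idea, not a planned design. The function name, the 30-day half-life, and the `pinned` flag are all hypothetical.

```python
def decayed_score(
    relevance: float,
    days_since_access: float,
    half_life_days: float = 30.0,  # hypothetical default
    pinned: bool = False,
) -> float:
    """Halve a memory's ranking weight for every half-life it goes unaccessed."""
    if pinned:
        return relevance  # pinned facts never fade
    return relevance * 0.5 ** (days_since_access / half_life_days)


quirk = decayed_score(1.0, days_since_access=365, pinned=True)  # stays at full weight
old_503 = decayed_score(1.0, days_since_access=21)              # about 0.62
one_half_life = decayed_score(1.0, days_since_access=30)        # exactly half
```

Decay applied at ranking time leaves the stored data untouched, so a fact that becomes relevant again recovers its weight as soon as it is accessed and its timestamp resets.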
But those are problems for another week. For now, my agent remembers which services run where and how to reach them — without being told every time. It only took an embedded PostgreSQL, a source-built pgvector, a root-user initdb workaround, and two OOM kills to get there.