AI Debugging With Persistent Memory: Stop Investigating the Same Bug Twice
How the MemNexus team diagnosed a recurring CI failure pattern across 5 incidents in 10 days — and why the sixth incident took 2 minutes instead of 2 hours.
MemNexus Team
Engineering
Here's a debugging scenario that will feel familiar.
You're working with a coding agent on a production issue. Your AI is helpful — it generates hypotheses, walks through possible causes, suggests things to check. But it doesn't know what you've already ruled out. It doesn't know you had a similar issue three weeks ago that turned out to be a lockfile problem. It doesn't know your team investigated this exact component last month.
Every investigation starts from scratch.
Multiply that across a year of work, across a team of five developers, and you're looking at hundreds of hours spent re-discovering things you've already discovered.
Persistent memory changes this. Here's how it played out for us.
The recurring lockfile problem
We spent 10 days debugging a pattern that kept showing up in different forms. Our monorepo uses pnpm, but one service (mcp-server) had been set up with npm — and our CI pipeline kept failing in new and interesting ways.
Five separate incidents, each diagnosed fresh:
- Feb 3: `pnpm-lock.yaml` mismatch in `customer-portal`. CI blocked for two days.
- Feb 3: WSL2 filesystem blocking `pnpm install` in worktrees. Root cause of multiple lockfile issues.
- Feb 7: `pnpm-lock.yaml` at repo root didn't trigger the path filter. Required manual workflow rerun.
- Feb 10: Husky prevented `pnpm install` in worktrees. Fixed with a `[ -d .git ]` guard.
- Feb 12: npm lockfile contaminated by parent pnpm store. Three CI iterations to resolve.
Five incidents. Each one investigated from scratch. Each one taking hours to diagnose. Each one fixed in isolation, without recognizing the pattern.
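The Feb 10 fix, the `[ -d .git ]` guard, is worth sketching because it shows how small these fixes were in isolation. A minimal version, with the function name and echo messages as illustrative stand-ins (the real prepare script would invoke `husky` where the comment indicates):

```shell
# Hedged sketch of the Feb 10 guard. In a git worktree, .git is a file
# pointing back at the main repository rather than a directory, so
# [ -d .git ] is true only in the primary checkout.
install_hooks() {
  if [ -d .git ]; then
    echo "installing git hooks"   # the real prepare script would run `husky` here
  else
    echo "skipping hooks: worktree or CI checkout"
  fi
}
```

Wired up as the package.json `prepare` script, this makes `pnpm install` in a worktree a no-op for hooks instead of a hard failure.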
Why it took us six incidents to see the pattern
This is the part that's worth being honest about. We're a small team. We build memory infrastructure. And it still took us six incidents to connect the dots.
The reason is simple: the incidents looked different on the surface. One was a CI path filter issue. One was a WSL2 filesystem problem. One was Husky. One was Docker build context. Each had different symptoms, different error messages, and different immediate fixes. The common thread — npm running inside a pnpm monorepo — wasn't obvious when you were heads-down diagnosing each one individually.
Different team members hit different incidents. The person who fixed the Husky issue on Feb 10 didn't know about the WSL2 fix from Feb 3. The person who fixed the Docker contamination on Feb 12 didn't know about the path filter workaround from Feb 7. Each fix was local and reasonable. Nobody had the full picture because the full picture was spread across five separate debugging sessions in five separate agent conversations — all of which evaporated when the sessions ended.
That's the core problem. It's not that we weren't paying attention. It's that the knowledge was ephemeral. Each session produced a fix and then disappeared. There was nowhere for the pattern to accumulate.
We started saving debugging sessions to MemNexus after the fourth incident. When the sixth one appeared, we searched:
mx memories search --query "lockfile CI failure package manager" --timeline
The timeline search returned all five previous incidents, chronologically sorted, each one with root cause, fix, and warning signs. The AI read across those results and synthesized:
"Based on five previous incidents, this is npm/pnpm cross-contamination in the monorepo. The consistent pattern: npm running inside a pnpm monorepo tree walks up to the parent node_modules/.pnpm/ store. The lockfile gets ../node_modules/.pnpm/ references that don't exist in Docker or CI. The consistent fix is converting services to pnpm, or running npm install in an isolated temp directory."
The diagnosis took two minutes. The fix — converting mcp-server from npm to pnpm — took an hour. The sixth incident that could have been another multi-hour investigation was closed before lunch.
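Once you know the signature, the contamination itself is cheap to detect mechanically. A hedged sketch; the sample lockfile entry and `/tmp` paths are illustrative, not a real lockfile:

```shell
# A lockfile generated by npm inside a pnpm monorepo can record paths into
# the parent node_modules/.pnpm/ store, which won't exist in Docker or CI.
lockfile_is_contaminated() {
  grep -q '\.\./node_modules/\.pnpm/' "$1"
}

# Illustrative contaminated entry (shape only):
printf '{"resolved": "file:../node_modules/.pnpm/left-pad@1.3.0"}\n' > /tmp/sample-lock.json
if lockfile_is_contaminated /tmp/sample-lock.json; then
  echo "contaminated: regenerate the lockfile in an isolated temp dir"
fi
```

Dropped into CI as a guard step, a check like this fails the build at lockfile-generation time instead of at image-pull time.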
That synthesis was only possible because all five previous incidents were in the memory store by the time the sixth appeared.
What made it work
We captured each incident with enough detail to be useful later:
mx memories create \
--conversation-id "conv_incident_5" \
--content "CI lockfile failure: npm install inside the monorepo walked up to
the parent node_modules/.pnpm/ store and added references that don't exist
in the Docker build context. Fix: run npm install in an isolated temp dir
when generating the lockfile. Affected: mcp-server Dockerfile. Third time
we've hit this class of issue — root cause is always npm/pnpm cross-
contamination in the monorepo." \
--topics "ci,docker,lockfile,gotcha"
The key elements:
- What happened (the specific failure mode)
- Why it happened (the root cause, not just "lockfile mismatch")
- What was affected (which service, which workflow)
- Pattern recognition ("third time we've hit this class of issue")
That last line — noting the pattern explicitly — made it much easier for future search to connect the incidents.
The two-minute search that replaces the two-hour investigation
This pattern appears across different types of bugs:
Timing-sensitive failures: "This test is failing intermittently" becomes "search for previous flaky-test investigations in this service, look for timing-related root causes."
Authentication issues: "Something's wrong with token validation" becomes "search for previous auth service debugging sessions, find the key rotation incident from last month."
Performance regressions: "The API is slow" becomes "search for previous performance investigations, find the connection pool tuning decision, find the query that caused problems before."
In each case, the AI can synthesize across past investigations and either find the answer or at least show you what you've already ruled out.
Without persistent memory, your AI can help you investigate. With persistent memory, your AI can help you remember.
The habit that makes it work
The synthesis happens automatically. The habit that makes it possible is saving root causes when you find them.
The moment you identify the root cause of a hard bug is the highest-value moment to save a memory. Your understanding is fresh, you have all the context, and you know exactly what future-you would need to know if this appeared again.
The saving takes 60 seconds. The payoff is that every future investigation of similar symptoms starts knowing what you know now.
# Save the root cause while it's fresh
mx memories create \
--conversation-id "conv_incident" \
--content "Root cause: [specific cause]. Symptoms: [what was observable].
Ruled out: [what you checked]. Fix: [what resolved it, with commit/PR].
Warning sign for next time: [what to look for if this class appears again]." \
--topics "gotcha,completed"
The `--topics "gotcha"` tag creates a searchable collection. Before touching any complex component, `mx memories search --query "component-name" --topics "gotcha"` returns the hard-won lessons: things worth knowing before you step on the same problem.
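If you run this pre-flight check often, a tiny wrapper keeps the flags consistent. This assumes the `mx` CLI shown throughout this post is installed and on your PATH; the function name is just a convention you might adopt:

```shell
# Hypothetical convenience wrapper around the pre-flight gotcha search.
# Assumes the mx CLI from this post is installed and authenticated.
gotchas() {
  mx memories search --query "$1" --topics "gotcha"
}

# Usage (not executed here):
#   gotchas "mcp-server"
```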
Pattern recognition at scale
The lockfile example is our five incidents over ten days. Scale that to a year of work, a team of ten developers, a system with twenty services.
Hundreds of debugging sessions. Each one contributing to an accumulating knowledge base. Patterns that took one person five incidents to recognize become visible on the second incident to the next developer, because the first five are in the memory store.
This is what "institutional knowledge" actually means — not what's in the documentation, but what's in people's heads. With persistent memory, it stops living only in people's heads.
What this looks like in practice
Here's how this actually plays out on our team. These are real examples from our memory store.
A deploy keeps failing with ImagePullBackOff. You've checked the Helm values three times. You've stripped the config down to the minimum. Nothing works. Before going deeper, you search:
mx memories search --query "ImagePullBackOff deploy failing" --topics "gotcha"
Result: a memory from three weeks ago — "Production AKS cluster provisioned with AMD64 nodes but all CI builds target ARM64. This caused ImagePullBackOff on every prod deploy. Fix: change prod VM size to ARM64 equivalent. Gotcha: Azure VM SKU naming — 'p' suffix means ARM64/Ampere."
That's the answer. You were debugging config when the problem was architecture mismatch. The search took five seconds.
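That class of gotcha is also cheap to check for before you deploy. A hedged sketch: the two lookups in the comments are the illustrative part (image name, kubectl context are assumptions), and the comparison is the reusable part:

```shell
# Hedged pre-deploy sanity check for image/node architecture mismatch.
# In practice the inputs would come from something like:
#   img_arch=$(docker image inspect "$IMAGE" --format '{{.Architecture}}')
#   node_arch=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}')
check_arch() {
  if [ "$1" != "$2" ]; then
    echo "arch mismatch: image=$1 nodes=$2 (likely ImagePullBackOff)"
    return 1
  fi
  echo "arch ok: $1"
}
```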
CI is blocked and nobody knows why. The error is `pnpm install --frozen-lockfile` failing. You search:
mx memories search --query "pnpm lockfile CI blocked" --timeline
Result: four previous incidents, each with the root cause chain. One memory spells it out: "Root cause: `worktree.toml` runs `sudo npm install` as post-create hook. `sudo` creates `.husky/_/` owned by root. WSL2 9p filesystem doesn't support `utimensat` syscall. husky prepare script fails with `EPERM`."
You're not debugging from scratch. You're starting from a diagnosis someone already completed.
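If you suspect you're on the same WSL2 setup, the filesystem type is checkable directly. A hedged sketch assuming GNU coreutils `stat` on Linux; `v9fs` is the name coreutils prints for a 9p mount:

```shell
# Hedged check for the WSL2 9p filesystem behind the EPERM above.
# `stat -f -c %T` prints the filesystem type name (GNU coreutils, Linux).
fstype=$(stat -f -c %T .)
if [ "$fstype" = "v9fs" ]; then
  echo "9p mount: expect utimensat/EPERM failures from npm and husky here"
else
  echo "filesystem is $fstype: 9p is not the problem in this directory"
fi
```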
When you find a root cause, save it while it's fresh:
mx memories create \
--content "CI lockfile failure: npm install inside the monorepo walked up to
the parent node_modules/.pnpm/ store and added references that don't exist
in the Docker build context. Fix: run npm install in an isolated temp dir.
Affected: mcp-server Dockerfile. Third time we've hit this class of issue —
root cause is always npm/pnpm cross-contamination in the monorepo." \
--topics "ci,docker,lockfile,gotcha"
That memory is now searchable by anyone on the team. The next person who sees a lockfile error in CI finds it before they start investigating.
The sixth lockfile incident took us two minutes. Your next recurring bug can too. Sign up at memnexus.ai and get started in under five minutes.