11 min read

What an MCP Memory Server Actually Does (And How MemNexus Implements One)

A technical deep-dive into MCP memory servers: what they are, how they differ from other MCP servers, and how MemNexus implements the full extraction-graph-retrieval pipeline.

MemNexus Team

Engineering

MCP · Persistent Memory · Model Context Protocol · Developer Tools · AI Agents

MCP — the Model Context Protocol — was designed to let AI tools call out to external capabilities: read a file, query a database, run a search. Most MCP servers you'll encounter follow this pattern. A filesystem server reads and writes files. A database server executes queries. A web search server fetches results. Each one exposes well-typed tools and returns results synchronously. Stateless by design.

A memory MCP server is different. It doesn't just respond to calls; it accumulates. Every session adds to it. Every retrieval draws on the full history of what's been saved. The protocol is the same, but what the server does with it is fundamentally different.

This post is a technical walkthrough of what that means: how a memory server differs from other MCP servers architecturally, what MemNexus exposes as an MCP server and why each tool exists, how the extraction and retrieval pipeline works, and how to connect it to your MCP client.

How a Memory MCP Server Differs

Most MCP servers wrap an existing capability. A filesystem server wraps the OS. A database server wraps a query engine. The server itself has no internal state — it routes calls to the underlying system and returns what comes back.

A memory MCP server wraps a purpose-built persistence layer. What makes it distinct:

It has a write-read contract that spans sessions. When your agent calls create_memory, the content is processed and stored. When a different agent in a different session calls search_memories, it can retrieve what was stored weeks ago. The server's value accumulates over time.

Retrieval is semantic, not exact. A database server returns rows matching a query predicate. A memory server retrieves content by meaning — you search for "auth service caching decision" and get back memories about token storage, session management, and a Redis trade-off, even if none of them contain those exact words. That requires a vector search layer, not just a key-value store.

Structure is extracted, not imposed. You don't write structured data to a memory server. You write natural language — a description of what happened, what you decided, what you learned. The server's job is to extract structure from that text automatically: entities, facts, topics, relationships. What goes in as prose comes back as searchable, graph-linked knowledge.

This is why memory servers are non-trivial to build and why they warrant a dedicated implementation rather than a thin wrapper over a database.

The MemNexus MCP Tools

MemNexus exposes five core MCP tools. They're organized around two workflows: accumulating knowledge and retrieving it.

Writing to memory

create_memory is the primary write tool. It takes a content string and optional parameters: a conversation ID to group related memories, and topics for explicit categorization (though most topics are extracted automatically). You call this when something is worth preserving — a decision, a debugging finding, a constraint, a convention. The content goes through the extraction pipeline described below before it's stored.

Tool: create_memory
Input:
  content: "Switched from cookie-based to header-based tokens in the auth service.
            Reason: mobile clients can't reliably set cookies across redirects.
            All existing sessions invalidated during migration — coordinated with
            the mobile team before deploy."
  conversationId: "conv_auth_refactor_2026"

The call returns a memory ID and a list of extracted entities and facts — useful for verifying that the extraction pipeline understood the content correctly.
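On the wire, an MCP client sends tool calls as JSON-RPC 2.0 requests using the tools/call method. Here is a minimal sketch of how the create_memory call above might look as a request. The argument names content and conversationId come from the example; the build_tool_call helper and the request id are illustrative, not part of MemNexus.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Hypothetical request id; the arguments mirror the example above (content shortened).
request = build_tool_call(
    1,
    "create_memory",
    {
        "content": "Switched from cookie-based to header-based tokens in the auth service.",
        "conversationId": "conv_auth_refactor_2026",
    },
)
print(json.dumps(request, indent=2))
```

In practice your MCP client builds this message for you; the sketch only shows what the agent's tool call reduces to.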

Reading from memory

search_memories is the core retrieval tool. It takes a query string and returns the most semantically relevant memories from your store. A typical session-start workflow calls this with a description of what you're about to work on. The results are injected into the agent's working context before the first message.

Tool: search_memories
Input:
  query: "token authentication mobile"
  limit: 5

The results include memory content, a relevance score, extracted facts, and linked entities. A score above 0.7 indicates strong semantic alignment.

get_memory fetches a specific memory by ID or by name. You reach for this after a search surfaces the right result and you want the full content, or when you know the exact memory you need — like a named memory that stores your project conventions.

recall synthesizes an answer across multiple memories rather than returning a list. You give it a question — "what did we decide about the token storage layer?" — and it retrieves relevant memories, then constructs a coherent narrative response. This is more useful than raw search when you want understanding, not just retrieval. Under the hood it's the same vector search, followed by a synthesis step that produces a single structured response from the top results.

Tool: recall
Input:
  query: "what authentication decisions have we made and why"

build_context is the session-start tool. One call returns a structured briefing: active work items, key facts relevant to your topic, gotchas detected from recurring patterns across memories, and recent activity. Rather than having the agent run four separate searches and assemble the results itself, build_context runs those queries in parallel and returns one structured document from a single call.

Tool: build_context
Input:
  context: "auth service token refresh work"
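The "run the queries in parallel" idea can be sketched with asyncio. This is an illustration of the fan-out pattern, not the MemNexus implementation: run_query is a stand-in that simulates I/O, and the four section names are assumptions based on the briefing contents described above.

```python
import asyncio


async def run_query(section: str, context: str) -> tuple[str, list[str]]:
    """Stand-in for one retrieval query; a real server would hit its stores here."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return section, [f"{section} result for '{context}'"]


async def build_context(context: str) -> dict[str, list[str]]:
    # Hypothetical section names, one per part of the briefing.
    sections = ["active_work", "key_facts", "gotchas", "recent_activity"]
    # Launch all four queries concurrently instead of one after another.
    results = await asyncio.gather(*(run_query(s, context) for s in sections))
    return dict(results)


briefing = asyncio.run(build_context("auth service token refresh work"))
for section, items in briefing.items():
    print(section, items)
```

Because the queries are independent, total latency is roughly that of the slowest one rather than the sum of all four.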

For more on how build_context works and the gotcha detection mechanism, see the build-context deep-dive.

Under the Hood: Extraction, Graph, Search

When you call create_memory, the content doesn't just get stored as a text blob. It goes through a three-stage pipeline.

Stage 1: Extraction

The extraction stage runs on the raw content using an LLM call. It identifies:

  • Entities — named things: services, libraries, technologies, people, concepts
  • Facts — statements about entities: decisions made, constraints discovered, behaviors observed
  • Topics — thematic categories: authentication, caching, database, deployment

The extraction produces a structured representation of the content. A memory about switching from cookies to header-based tokens produces entities like auth-service, cookie-based-tokens, header-based-tokens, mobile-clients; facts like mobile clients cannot reliably set cookies across redirects and existing sessions were invalidated during migration; and topics like authentication, token-management, mobile.

This extraction is what makes retrieval rich. When you later search for "mobile auth," the query matches both the raw content and the extracted facts and entities — increasing recall without requiring exact keyword overlap.
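One way to picture the extraction output is as a small record that keeps the raw content next to the extracted structure. The ExtractedMemory class and its field names below are hypothetical; the entity, fact, and topic values are the ones from the cookie-to-header example.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedMemory:
    """Hypothetical shape of one memory after extraction."""
    content: str
    entities: list[str] = field(default_factory=list)
    facts: list[str] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)

    def searchable_text(self) -> list[str]:
        """Everything a query can match: raw content plus extracted structure."""
        return [self.content, *self.facts, *self.entities, *self.topics]


memory = ExtractedMemory(
    content="Switched from cookie-based to header-based tokens in the auth service.",
    entities=["auth-service", "cookie-based-tokens", "header-based-tokens", "mobile-clients"],
    facts=[
        "mobile clients cannot reliably set cookies across redirects",
        "existing sessions were invalidated during migration",
    ],
    topics=["authentication", "token-management", "mobile"],
)
print(len(memory.searchable_text()), "searchable strings")
```

The point of the sketch is the last method: a query has more surfaces to match against than the original prose alone.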

Stage 2: Knowledge Graph

Extracted entities and facts are linked in a knowledge graph. Entities become nodes. Facts become edges or node properties. Relationships between entities are inferred: auth-service connects to mobile-clients via the fact about cookie handling; mobile-clients connects to header-based-tokens via the decision to switch.

This graph structure enables traversal retrieval — when you search for auth-service, the graph surfaces connected entities and their relationships, not just direct matches. A memory about the auth service database connection pool can surface alongside a memory about token storage because both are connected through the auth-service entity node.
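A minimal sketch of that expansion, assuming the simplest possible graph: a mapping from each entity to the memories that mention it. The memory IDs and the link and expand helpers are made up for illustration; a production graph would also carry typed relationships between entities.

```python
from collections import defaultdict

# entity -> ids of memories that mention it (hypothetical data)
entity_to_memories: dict[str, set[str]] = defaultdict(set)


def link(memory_id: str, entities: list[str]) -> None:
    """Record that a memory mentions each of the given entities."""
    for entity in entities:
        entity_to_memories[entity].add(memory_id)


def expand(seed_memory: str) -> set[str]:
    """Return memories that share at least one entity with the seed memory."""
    related: set[str] = set()
    for memories in entity_to_memories.values():
        if seed_memory in memories:
            related |= memories
    related.discard(seed_memory)
    return related


link("mem_token_storage", ["auth-service", "header-based-tokens"])
link("mem_pool_sizing", ["auth-service", "postgres"])
link("mem_ci_cache", ["build-pipeline"])

print(expand("mem_token_storage"))
```

Here the connection-pool memory surfaces from the token-storage memory purely through the shared auth-service node, while the unrelated CI memory stays out.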

Stage 3: Semantic Search Index

The raw content and extracted facts are embedded into a vector index (embedding model running server-side — you don't manage this). At retrieval time, your query is embedded using the same model, and cosine similarity determines relevance ranking. This is what enables the "search by meaning, not keywords" behavior.
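Cosine-similarity ranking is easy to sketch once the embeddings exist. The three-dimensional vectors below are toy values standing in for real embeddings, and the memory IDs are hypothetical; only the ranking logic is the point.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec: list[float], index: dict[str, list[float]], limit: int = 5):
    """Score every stored vector against the query and return the top matches."""
    scored = [(memory_id, cosine(query_vec, vec)) for memory_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy embeddings; a real index would hold model-generated vectors.
index = {
    "mem_token_storage": [0.9, 0.1, 0.0],
    "mem_pool_sizing": [0.4, 0.8, 0.1],
    "mem_ci_cache": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

for memory_id, score in search(query_vec, index):
    print(f"{memory_id}: {score:.3f}")
```

A real vector index would use an approximate nearest-neighbor structure instead of scoring every entry, but the relevance measure is the same.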

The graph and the vector index complement each other. The vector index finds semantically similar content. The graph expands results to include structurally related memories that might not score highly on pure semantic similarity. Both signals combine in the final ranking.
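How the two signals combine is not specified here, so the following is only one plausible scheme: take the vector scores, add graph-connected memories as extra candidates, and give anything the graph reaches a fixed boost. The combine function, the graph_boost value, and the memory IDs are all assumptions made for illustration.

```python
def combine(
    vector_scores: dict[str, float],
    graph_neighbors: set[str],
    graph_boost: float = 0.2,  # assumed weight, not a MemNexus parameter
) -> list[tuple[str, float]]:
    """Merge semantic scores with graph connectivity into one ranking."""
    candidates = set(vector_scores) | graph_neighbors
    ranked = []
    for memory_id in candidates:
        score = vector_scores.get(memory_id, 0.0)
        if memory_id in graph_neighbors:
            score += graph_boost  # structurally related memories get a lift
        ranked.append((memory_id, score))
    # Highest score first; ties broken by id for stable output.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


vector_scores = {"mem_token_storage": 0.82, "mem_ci_cache": 0.15}
graph_neighbors = {"mem_pool_sizing"}  # reached via a shared entity node

for memory_id, score in combine(vector_scores, graph_neighbors):
    print(f"{memory_id}: {score:.2f}")
```

In this toy run, the connection-pool memory has no semantic score at all, yet it outranks a weakly similar memory because the graph links it to the query's subject.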

A Real Workflow: Start, Work, Save, Repeat

Here's what the full session lifecycle looks like with MemNexus connected as an MCP server.

Session start. The agent calls build_context with a description of the current task. It gets back a briefing: the active conversation thread from the last session, key facts about the relevant system, any gotchas that have appeared in multiple prior memories, and what changed in the last 24 hours. The agent is oriented before you type a word.

Tool: build_context
Input:
  context: "continuing work on token refresh race condition"

Response:
  Active Work: "Auth service token refresh investigation"
    (conv_auth_2026, open, last activity 2 days ago)
  Key Facts:
    - JWT middleware validates tokens in-process, not via Redis
    - Refresh tokens stored in Postgres with expiry index
    - Race condition root cause: parallel refresh requests
      caused duplicate invalidation
  Gotchas (appeared in 3+ memories):
    - DB advisory lock required on refresh endpoint — without it,
      concurrent requests produce duplicate tokens
  Recent Activity:
    - Token refresh fix merged to staging (2 days ago)
    - Integration tests updated to cover parallel refresh scenario
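The "appeared in 3+ memories" rule behind the Gotchas section can be approximated with a simple recurrence count. This sketch assumes facts are already normalized to identical strings, which a real system would need fuzzier matching to achieve; detect_gotchas and the sample data are hypothetical.

```python
from collections import Counter


def detect_gotchas(memory_facts: dict[str, list[str]], min_memories: int = 3) -> list[str]:
    """Return facts that recur in at least `min_memories` distinct memories."""
    counts: Counter[str] = Counter()
    for facts in memory_facts.values():
        counts.update(set(facts))  # count each fact once per memory
    return sorted(fact for fact, n in counts.items() if n >= min_memories)


memory_facts = {
    "mem_1": ["advisory lock required on refresh endpoint", "JWT validated in-process"],
    "mem_2": ["advisory lock required on refresh endpoint"],
    "mem_3": ["advisory lock required on refresh endpoint", "refresh tokens stored in Postgres"],
    "mem_4": ["refresh tokens stored in Postgres"],
}
print(detect_gotchas(memory_facts))
```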

During the session. The agent has the background. You work. When you make a decision worth preserving — you chose a particular lock strategy, you discovered a library edge case, you confirmed a constraint — the agent calls create_memory:

Tool: create_memory
Input:
  content: "Confirmed: DB-level advisory lock on session_id resolves
            duplicate token issue in parallel refresh.
            pg_try_advisory_lock returns false immediately if the lock is
            held, so the second request retries and uses the token
            issued by the first. Tested with 50 concurrent refresh
            requests — zero duplicates."
  conversationId: "conv_auth_2026"

Next session. Two days later, you're back. New session, blank context window. The agent calls build_context again. The memory from the previous session is there: the advisory lock finding, the test result, the conversation thread. The agent knows where the last session ended. When it needs the details, a targeted search_memories call pulls up the specific finding:

Tool: search_memories
Input:
  query: "token refresh advisory lock"

The memory surfaces immediately, with the full content of what was confirmed and how it was tested. The agent doesn't ask what a pg advisory lock is or why you're using it. It already knows.

This is the compounding effect in practice. Not magic — structured retrieval across a persistent store that grows with every session.

Connecting MemNexus to Your MCP Client

MemNexus supports any MCP-compatible client. Configuration requires adding one entry to your client's MCP config.

Install the CLI and authenticate:

npm install -g @memnexus-ai/mx-agent-cli
mx auth login

Run setup to auto-configure all installed clients:

mx setup

mx setup detects which MCP clients you have installed — Claude Code, Cursor, Windsurf, Cline, GitHub Copilot in VS Code — and writes the correct configuration entry to each. For Claude Code, that's your CLAUDE.md or settings file. For Cursor, it's ~/.cursor/mcp.json. For clients not auto-detected, the server command is mx mcp serve and the transport is stdio.
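For a client that isn't auto-detected, the manual entry is small. The sketch below generates one in the mcpServers layout that Cursor's mcp.json uses; the command mx mcp serve comes from the text above, while the server key "memnexus" is an arbitrary label chosen here. Other clients may expect a different top-level key, so check your client's documentation before copying the output.

```python
import json


def mcp_server_entry(name: str, command: str, args: list[str]) -> dict:
    """Build a stdio server entry in the `mcpServers` config layout."""
    return {"mcpServers": {name: {"command": command, "args": args}}}


# "memnexus" is an assumed label; the command and args are `mx mcp serve`.
config = mcp_server_entry("memnexus", "mx", ["mcp", "serve"])
print(json.dumps(config, indent=2))
```

If your config file already lists other servers, merge the new entry under the existing mcpServers object rather than overwriting the file.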

After restarting your client, the MemNexus tools appear in the tool list alongside any other MCP servers you have configured. No additional setup. The tools self-describe — their names, parameter schemas, and descriptions are part of the MCP handshake, so the client model sees them at session start.
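Self-description works through the tools/list exchange, where each tool is advertised with a name, a description, and a JSON Schema under inputSchema. The descriptor below is a guess at what search_memories might advertise, based only on the query and limit parameters shown earlier; the description text and the missing_arguments helper are illustrative.

```python
# Hypothetical descriptor in the shape MCP uses for entries in a tools/list result.
search_memories_tool = {
    "name": "search_memories",
    "description": "Return the memories most semantically relevant to a query.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["query"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """List required parameters that a proposed call leaves out."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


print(missing_arguments(search_memories_tool, {"limit": 5}))
print(missing_arguments(search_memories_tool, {"query": "token refresh advisory lock"}))
```

The schema is what lets the client model form valid calls without any MemNexus-specific prompting.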

Why a Dedicated Memory MCP Server

The reference MCP memory implementation, the official @modelcontextprotocol/server-memory package, works by storing a flat knowledge graph in a local JSON file. It's a good starting point for understanding the pattern. As your memory store grows past a few hundred entries, though, flat-file retrieval and basic keyword matching limit what you can build on top of it.

MemNexus improves on this in three ways. First, extraction: rather than requiring you to manually structure what you save, the extraction pipeline produces entities, facts, and topics automatically from natural language. Second, graph-backed retrieval: linked entities expand search results to structurally related memories, not just semantically similar text. Third, synthesis tools: recall and build_context move beyond "return a list of relevant results" to "produce a coherent answer from everything you know."

At a few dozen memories, the reference implementation works fine. At hundreds of memories across months of sessions, flat-file retrieval misses the connections that graph traversal finds — and keyword matching misses what semantic search catches. The difference shows up when you need the right three facts at the start of a session, not a list of 20 partially relevant ones.

What's Next

MemNexus is in gated preview. The extraction pipeline, knowledge graph, and all five MCP tools described here are live for preview users. We're actively expanding the retrieval capabilities — richer graph traversal, more precise gotcha detection, and startup hooks that deliver the build_context briefing automatically without an explicit tool call.

If you're building on MCP and want persistent memory that goes beyond session notes, the quickstart docs cover the full setup. To get access, join the waitlist.

