Agent CLI: Memory System Technical Specification
This document serves as the authoritative technical reference for the agent-cli memory subsystem. It details the component architecture, data structures, internal algorithms, and control flows implemented in the codebase.
High-Level Overview
For sharing with friends who want the gist without the technical deep-dive.
The Problem
LLMs are stateless. Every conversation starts fresh. They don't remember you told them your wife's name is Anne, or that you hate mushrooms, or that you're working on a Python project.
How It Works
- After each message, an LLM extracts atomic facts ("User's wife is named Anne")
- These facts get stored and made searchable
- Before your next message, relevant memories get pulled in automatically
- The LLM now "remembers" things about you
What Makes It Different
- Reconciliation, not just accumulation: If you say "I love pizza" today and "I hate pizza" next month, most systems would just store both. This one uses an LLM to detect the contradiction and update the old fact. It actively manages memory, not just appends to it.
- Recency-aware: Recent memories score higher than old ones. What you said yesterday matters more than what you said six months ago for most queries.
- Diversity selection: If you've mentioned five different times that you like coffee, it won't waste your context window by injecting all five variations. It picks the most relevant one and moves on.
- Human-readable persistence: Every memory is a markdown file on disk. You can read them, edit them, delete them. Optional git integration means you have full version history of everything the system remembers.
- Summarization: For long conversations, it maintains a rolling summary so you don't hit token limits.
In One Sentence
A local-first system that gives LLMs persistent memory across conversations, with the twist that everything is stored as human-readable files on disk and retrieval uses smarter scoring (recency + diversity + relevance) instead of just embedding similarity.
Try It Now
Get an LLM that remembers you, using Ollama. Two options:
Option A: With Open WebUI (web interface)
# 1. Pull the required models (one-time setup)
ollama pull embeddinggemma:300m # for memory embeddings
ollama pull qwen3:4b # for chat
# 2. Start Open WebUI (runs in background)
# On Linux, add: --add-host=host.docker.internal:host-gateway
docker run -d -p 3000:8080 \
-v open-webui:/app/backend/data \
-e WEBUI_AUTH=false \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8100/v1 \
-e OPENAI_API_KEY=dummy \
ghcr.io/open-webui/open-webui:main
# 3. Start the memory proxy (runs in foreground so you can watch the logs)
uvx -p 3.13 --from "agent-cli[memory]" agent-cli memory proxy \
--memory-path ./my-memories \
--openai-base-url http://localhost:11434/v1 \
--embedding-model embeddinggemma:300m
# 4. Open http://localhost:3000, select "qwen3:4b" as model, and start chatting — it remembers you
Option B: With the openai CLI (terminal, no config needed)
# Terminal 1: Pull models and start proxy
ollama pull embeddinggemma:300m && ollama pull qwen3:4b
uvx -p 3.13 --from "agent-cli[memory]" agent-cli memory proxy \
--memory-path ./my-memories \
--openai-base-url http://localhost:11434/v1 \
--embedding-model embeddinggemma:300m
# Terminal 2: Chat from the terminal (env vars point to proxy)
export OPENAI_BASE_URL=http://localhost:8100/v1 OPENAI_API_KEY=dummy
# Tell it something about yourself
uvx openai api chat.completions.create -m qwen3:4b -g user "My name is Alice and I love hiking"
# Later, ask if it remembers
uvx openai api chat.completions.create -m qwen3:4b -g user "What's my name?"
1. Architectural Components
The memory system is composed of layered Python modules, separating the API surface from the core logic and storage engines.
1.1 Runtime Layer
- agent_cli.memory.api (FastAPI):
  - Exposes POST /v1/chat/completions as an OpenAI-compatible endpoint.
  - Handles request validation (ChatRequest pydantic model).
  - Manages lifecycle: starts/stops the MemoryClient file watcher on app startup/shutdown.
- agent_cli.memory.client.MemoryClient:
  - The primary entry point. Orchestrates the interaction between the Logic Engine, File Store, and Vector Index.
  - State: holds references to the chromadb.Collection, MemoryIndex (in-memory file map), and reranker_model (ONNX session).
  - Methods: chat() (end-to-end), search() (retrieval only), add() (injection only).
1.2 Logic Engine (agent_cli.memory.engine)
- agent_cli.memory.engine: The high-level orchestrator.
  - process_chat_request: Main entry point. Handles synchronous vs. asynchronous (streaming) execution paths and coordinates the pipeline.
- agent_cli.memory._retrieval: The "Read" path logic.
  - augment_chat_request: Executes retrieval, reranking, recency weighting, MMR selection, and prompt injection.
- agent_cli.memory._ingest: The "Write" path logic.
  - extract_and_store_facts_and_summaries: Runs fact extraction, reconciliation, summarization, and triggers persistence.
1.3 Storage Layer
- agent_cli.memory._persistence:
  - Handles the coordination of writing to disk and updating the vector DB (persist_entries, evict_if_needed).
- agent_cli.memory._files (File Store):
  - Source of Truth. Manages reading/writing Markdown files with YAML front matter.
  - Handles path resolution: <memory_path>/entries/<conversation_id>/{facts,turns,summaries}/.
- agent_cli.memory._store (Vector Store):
  - Wraps chromadb.
  - Handles embedding generation (via text-embedding-3-small or local models).
  - Implements query_memories with dense retrieval parameters (n_results, filtering).
- agent_cli.memory._indexer (Index Sync):
  - Maintains memory_index.json (file hash snapshot) to keep ChromaDB in sync with the filesystem.
  - Watcher: uses watchfiles to detect OS-level file events (Create/Modify/Delete) and trigger incremental vector updates (see the sketch below).
- agent_cli.memory._git (Versioning):
  - Provides asynchronous Git integration for the memory store.
  - Initializes the repo on startup and commits changes after memory updates.
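The watcher loop below is a minimal sketch of how such file-event syncing can be wired up with watchfiles; the sync_file and remove_file helpers are placeholders, not the actual agent_cli internals.

```python
import asyncio
from pathlib import Path

from watchfiles import Change, awatch


def sync_file(path: Path) -> None:
    # Placeholder: in agent_cli this would re-embed the file and upsert it into Chroma.
    print(f"upsert {path}")


def remove_file(path: Path) -> None:
    # Placeholder: in agent_cli this would drop the entry's vector from Chroma.
    print(f"delete {path}")


async def watch_entries(entries_dir: Path) -> None:
    """Mirror on-disk Markdown edits into the vector index."""
    async for changes in awatch(entries_dir):
        for change, path in changes:
            if not path.endswith(".md"):
                continue
            if change in (Change.added, Change.modified):
                sync_file(Path(path))
            else:  # Change.deleted
                remove_file(Path(path))


if __name__ == "__main__":
    asyncio.run(watch_entries(Path("./memory_db/entries")))
```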
2. Data Structures & Schema
2.1 File System Layout
Memories are stored as Markdown files with YAML front matter in <memory_path>/entries/<conversation_id>/.
Active Directory Structure:
entries/
  <conversation_id>/
    facts/
      <timestamp>__<uuid>.md        # Extracted atomic facts
    turns/
      user/
        <timestamp>__<uuid>.md      # Raw user messages
      assistant/
        <timestamp>__<uuid>.md      # Raw assistant responses
    summaries/
      summary.md                    # The single rolling summary of the conversation
Deleted Directory Structure (Soft Deletes): Deleted files are moved (not destroyed) to preserve audit trails.
entries/
  <conversation_id>/
    deleted/
      facts/
        <timestamp>__<uuid>.md
      summaries/
        summary.md                  # Tombstoned summary
2.2 File Format
Front Matter (YAML):
* id: UUIDv4 (summaries use a deterministic ID suffix).
* role: memory (fact), user, assistant, or summary.
* conversation_id: Scope key.
* created_at: ISO 8601 timestamp (used for recency scoring).
* summary_kind: Present only on summaries.
* replaced_by: Present only on tombstones when an update replaces a fact.
Body: The semantic content (e.g., "User lives in San Francisco").
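For illustration, a fact file under facts/ might look like the following; all values are invented, and the optional summary_kind / replaced_by keys are omitted because they only appear on summaries and tombstones:

```markdown
---
id: 3f2b9c4e-8d1a-4c5b-9e7f-1a2b3c4d5e6f
role: memory
conversation_id: conv-123
created_at: 2025-01-15T09:30:00+00:00
---
User lives in San Francisco.
```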
2.3 Vector Schema (ChromaDB)
All entries share a single collection but are partitioned by metadata.
* ID: Matches file UUID.
* Embedding: 1536d (OpenAI) or 384d (MiniLM).
* Metadata: Mirrors front matter (role, conversation_id, created_at, summary_kind, replaced_by) for pre-filtering and maintenance.
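A minimal sketch of this layout through the chromadb API. The collection name, the cosine-space setting, and the placeholder vector are assumptions for illustration; the real code computes embeddings via the configured embedding model.

```python
import chromadb

client = chromadb.PersistentClient(path="./memory_db/chroma")
# Single shared collection, partitioned by metadata (name and distance metric assumed).
collection = client.get_or_create_collection(name="memories", metadata={"hnsw:space": "cosine"})

embedding = [0.1] * 384  # placeholder 384-d vector; real code uses OpenAI (1536-d) or MiniLM (384-d)

# Upsert one fact: the ID matches the file's UUID and the metadata mirrors the YAML front matter.
collection.upsert(
    ids=["3f2b9c4e-8d1a-4c5b-9e7f-1a2b3c4d5e6f"],
    documents=["User lives in San Francisco."],
    embeddings=[embedding],
    metadatas=[{
        "role": "memory",
        "conversation_id": "conv-123",
        "created_at": "2025-01-15T09:30:00+00:00",
    }],
)
```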
2.4 Versioning (Git)
When enable_git_versioning is true, the memory system maintains a local Git repository at memory_path.
* Initialization: Creates a repo and .gitignore (ignoring chroma/, memory_index.json, etc.) if missing.
* Commits: Automatically stages and commits all changes (adds, modifications, soft deletes) after every turn or fact update.
* Execution: Uses asynchronous subprocess calls (asyncio.create_subprocess_exec) to prevent blocking the main event loop during git operations.
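A minimal sketch of that non-blocking commit pattern using asyncio.create_subprocess_exec; the helper names and commit message are illustrative, not the actual agent_cli functions.

```python
import asyncio
from pathlib import Path


async def _git(repo: Path, *args: str) -> None:
    """Run a git command in the memory repo without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        "git", "-C", str(repo), *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    await proc.communicate()  # a no-op commit simply exits non-zero and is ignored here


async def commit_memory_changes(repo: Path, message: str = "memory: update entries") -> None:
    # Stage everything (adds, modifications, soft deletes) and snapshot the state.
    await _git(repo, "add", "--all")
    await _git(repo, "commit", "-m", message)


if __name__ == "__main__":
    asyncio.run(commit_memory_changes(Path("./memory_db")))
```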
3. The Read Path (Retrieval Pipeline)
Executed synchronously during augment_chat_request.
Step 1: Scope & Retrieval
- Scopes: Queries the conversation_id bucket + the global bucket.
- Density: Calls collection.query() retrieving top_k * 3 candidates per scope using cosine similarity (see the sketch below).
- Filtering: Excludes role="summary" (summaries are handled separately).
Step 2: Cross-Encoder Reranking
- Model: Xenova/ms-marco-MiniLM-L-6-v2 (ONNX, quantized).
- Input: Pairs of (user_query, memory_content).
- Output: Raw logit score.
- Normalization: relevance = sigmoid(raw_score).
- Thresholding: Candidates with relevance < score_threshold (default 0.35) are hard-pruned (see the sketch below).
Step 3: Recency Scoring
Combines semantic relevance with temporal proximity.
* Formula: recency_score = exp(-age_in_days / 30.0)
* Final Score: (1 - w) * relevance + w * recency_score
* w (recency_weight) defaults to 0.2.
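The blend in code form, a direct transcription of the two formulas above; the function name is illustrative.

```python
import math
from datetime import datetime, timezone


def blended_score(relevance: float, created_at: str, recency_weight: float = 0.2) -> float:
    """Combine cross-encoder relevance with an exponential recency decay (30-day time constant)."""
    age_days = (datetime.now(timezone.utc) - datetime.fromisoformat(created_at)).total_seconds() / 86400
    recency_score = math.exp(-age_days / 30.0)
    return (1 - recency_weight) * relevance + recency_weight * recency_score
```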
Step 4: Diversity (MMR)
Applies Maximal Marginal Relevance to select the final top_k from the ranked pool.
* Goal: Prevent redundant memories (e.g., 5 variations of "User likes apples").
* Formula: mmr_score = λ * relevance - (1 - λ) * max_sim(candidate, selected)
* λ (mmr_lambda) defaults to 0.7.
* max_sim uses the cosine similarity of embeddings provided by Chroma.
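A sketch of greedy MMR selection over the ranked pool. Each candidate is assumed to carry its blended score and the embedding returned by Chroma; the field names are illustrative.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_select(candidates: list[dict], top_k: int = 5, mmr_lambda: float = 0.7) -> list[dict]:
    """Greedy MMR: balance a candidate's relevance against its similarity to already-selected items."""
    selected: list[dict] = []
    pool = list(candidates)
    while pool and len(selected) < top_k:
        def mmr_score(c: dict) -> float:
            max_sim = max((cosine(c["embedding"], s["embedding"]) for s in selected), default=0.0)
            return mmr_lambda * c["score"] - (1 - mmr_lambda) * max_sim

        best_idx = max(range(len(pool)), key=lambda i: mmr_score(pool[i]))
        selected.append(pool.pop(best_idx))
    return selected
```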
Step 5: Injection
- Structure (assembled as in the sketch below):
  - Summary: Injected if available for the conversation (role="summary").
  - Memories: The top-k retrieved facts/turns, formatted as [<role>] <content> (scores are not shown).
  - Current Turn: The user's actual message.
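A sketch of how the augmented message list could be assembled. The [<role>] <content> line format follows the description above; placing the context in a system message and the header wording are assumptions for illustration.

```python
def build_augmented_messages(user_message: str, summary: str | None, memories: list[dict]) -> list[dict]:
    """Prepend the rolling summary and selected memories ahead of the user's actual turn."""
    context_parts = []
    if summary:
        context_parts.append(f"Conversation summary:\n{summary}")
    if memories:
        memory_lines = "\n".join(f"[{m['role']}] {m['content']}" for m in memories)  # scores omitted
        context_parts.append(f"Relevant memories:\n{memory_lines}")
    messages = []
    if context_parts:
        messages.append({"role": "system", "content": "\n\n".join(context_parts)})
    messages.append({"role": "user", "content": user_message})
    return messages
```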
4. The Write Path (Ingestion Pipeline)
Executed via _postprocess_after_turn (background task).
4.1 Streaming vs. Non-Streaming
- Streaming:
  - User turn persisted immediately (stored under turns/user).
  - Assistant tokens accumulated in an in-memory buffer.
  - Full assistant text persisted on stream completion (stored under turns/assistant), then background post-processing runs.
- Non-Streaming:
  - User and assistant turns persisted sequentially after the full response is received, followed by post-processing.
4.2 Fact Extraction
- Input: Latest user message only (assistant/system text is ignored).
- Prompt: FACT_SYSTEM_PROMPT. Extracts 0–3 atomic, standalone statements via PydanticAI.
- Output: JSON list of strings. Failures fall back to [].
4.3 Reconciliation (Memory Management)
Resolves contradictions using a "Search-Decide-Update" loop.
1. Local Search: For each new fact, retrieve a small neighborhood of existing role="memory" entries for the conversation.
2. LLM Decision: Uses UPDATE_MEMORY_PROMPT (examples + strict JSON schema) to compare new_facts vs existing_memories.
* Decisions: ADD, UPDATE, DELETE, NONE.
* If no existing memories are found, all new facts are added directly.
* On LLM/network failure, defaults to adding all new facts.
* Safeguard: if the model returns only deletes/empties, the new facts are still added to avoid data loss.
3. Execution:
* Adds: Creates new fact files and upserts to Chroma.
* Updates: Implemented as delete + add with a fresh ID; tombstones record replaced_by.
* Deletes: Soft-deletes files (moved under deleted/) and removes from Chroma.
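A sketch of how the decision list could be applied. The decision dict shape (event/id/text) and the toy in-memory store are assumptions for illustration; the real code writes files and updates Chroma.

```python
import uuid

# Toy stand-ins for the file store and tombstone records, just to show the control flow.
memory_store: dict[str, str] = {}        # id -> fact text
tombstones: dict[str, str | None] = {}   # deleted id -> replacement id (replaced_by)


def add_fact(text: str) -> str:
    new_id = str(uuid.uuid4())
    memory_store[new_id] = text
    return new_id


def apply_decisions(decisions: list[dict], new_facts: list[str]) -> None:
    """Apply ADD/UPDATE/DELETE/NONE decisions from the reconciliation step."""
    # Safeguard: if the model returned only deletes/empties, add all new facts instead.
    if not any(d["event"] in ("ADD", "UPDATE") for d in decisions):
        for fact in new_facts:
            add_fact(fact)
        return

    for d in decisions:
        if d["event"] == "ADD":
            add_fact(d["text"])
        elif d["event"] == "UPDATE":
            # Implemented as delete + add with a fresh ID; the tombstone records replaced_by.
            memory_store.pop(d["id"], None)
            tombstones[d["id"]] = add_fact(d["text"])
        elif d["event"] == "DELETE":
            memory_store.pop(d["id"], None)
            tombstones[d["id"]] = None   # soft delete, no replacement
        # "NONE" decisions are ignored.
```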
4.4 Summarization
- Input: Previous summary (if any) + newly extracted facts.
- Prompt: SUMMARY_PROMPT (updates the running summary).
- Persistence: Writes a single summaries/summary.md per conversation (deterministic doc ID).
4.5 Eviction
- Trigger: If total entries in the conversation > max_entries (default 500).
- Strategy: Sorts by created_at (ascending) and deletes the oldest facts or turns until the count is within the limit. Summaries are exempt. (See the sketch below.)
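A minimal sketch of this policy, with entries modeled as dicts carrying role and created_at; the function name and return convention are illustrative.

```python
def evict_if_needed(entries: list[dict], max_entries: int = 500) -> list[dict]:
    """Return the oldest facts/turns to soft-delete so the conversation fits under max_entries."""
    if len(entries) <= max_entries:
        return []

    evictable = sorted(
        (e for e in entries if e["role"] != "summary"),  # summaries are exempt
        key=lambda e: e["created_at"],                   # oldest first (ISO 8601 strings sort lexically)
    )
    overflow = len(entries) - max_entries
    return evictable[:overflow]  # caller soft-deletes these (moves them under deleted/)
```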
4.6 Versioning
If enable_git_versioning is enabled, an asynchronous git commit is triggered at the end of the post-processing pipeline to snapshot the latest state of the memory store.
5. Prompt Logic Specifications
To replicate the system behavior, the following prompt strategies are required.
5.1 Fact Extraction (FACT_SYSTEM_PROMPT)
- Goal: Extract 1-3 concise, atomic facts from the user message.
- Constraints: Ignore assistant text. No acknowledgements. Output JSON list of strings.
- Example: "My wife is Anne" -> ["The user's wife is named Anne"].
5.2 Reconciliation (UPDATE_MEMORY_PROMPT)
- Goal: Compare new_facts against existing_memories (id + text) and output structured decisions.
- Operations:
  - ADD: New information (generates a new ID).
  - UPDATE: Refines existing information (uses the provided short ID).
  - DELETE: Contradicts existing information (e.g., "I hate pizza" vs "I love pizza"). If deleting because of a replacement, the new fact must also be returned (ADD or UPDATE).
  - NONE: Fact already exists or is irrelevant.
- Output constraints: JSON list only; no prose/code fences; IDs for UPDATE/DELETE/NONE must come from the provided list.
5.3 Summarization (SUMMARY_PROMPT)
- Goal: Maintain a concise running summary.
- Constraints: Aggregate related facts. Drop transient chit-chat. Focus on durable info.
6. Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| memory_path | ./memory_db | Root directory for file storage. |
| embedding_model | text-embedding-3-small | ID for embedding generation (OpenAI or local path). |
| default_top_k | 5 | Target number of memories to retrieve. |
| max_entries | 500 | Hard cap on memories per conversation. |
| mmr_lambda | 0.7 | Diversity weighting (1.0 = pure relevance). |
| recency_weight | 0.2 | Score weight for temporal proximity. |
| score_threshold | 0.35 | Minimum semantic relevance to consider. |
| enable_summarization | True | Toggle for summary generation loop. |
| openai_base_url | required | Base URL for LLM calls (can point to OpenAI-compatible proxies). |
| enable_git_versioning | True | Toggle to enable/disable Git versioning of the memory store. |
| start_watcher | False | Start a file watcher to keep Chroma in sync with on-disk edits. |