rag-proxy

A RAG (Retrieval-Augmented Generation) proxy server that lets you chat with your documents.

Usage

agent-cli rag-proxy [OPTIONS]

Description

Enables "Chat with your Data" by running a local proxy server:

  1. Start the server, pointing to your documents folder and LLM
  2. The server watches the folder and indexes documents into a ChromaDB vector store
  3. Point any OpenAI-compatible client to this server's URL
  4. When you ask a question, the server retrieves relevant chunks and adds them to the prompt
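
To make step 4 concrete, here is a minimal sketch of the retrieve-and-augment idea using the chromadb Python client directly. The collection name, prompt template, and query text are illustrative assumptions, not the proxy's actual internals:

# Sketch only: retrieve the top matching chunks and prepend them to the question.
# The collection name ("docs") and prompt format are assumptions for illustration.
import chromadb

chroma = chromadb.PersistentClient(path="./rag_db")            # --chroma-path
collection = chroma.get_or_create_collection("docs")           # hypothetical collection name

question = "What do my notes say about X?"
hits = collection.query(query_texts=[question], n_results=3)   # --limit

context = "\n\n".join(hits["documents"][0])
augmented = f"Context from your documents:\n{context}\n\nQuestion: {question}"
# The proxy then sends the augmented prompt to the configured LLM backend.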

Installation

Requires the rag extra:

pip install "agent-cli[rag]"
# or from repo
uv sync --extra rag

Examples

# With local LLM (Ollama)
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-base-url http://localhost:11434/v1 \
  --port 8000

# With OpenAI
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-api-key sk-... \
  --port 8000

# Use with agent-cli chat
agent-cli chat --openai-base-url http://localhost:8000/v1 --llm-provider openai

Options

RAG Configuration

Option | Default | Description
--docs-folder | ./rag_docs | Folder to watch for documents. Files are auto-indexed on startup and when changed. Must not overlap with --chroma-path.
--chroma-path | ./rag_db | ChromaDB storage directory for vector embeddings. Must be separate from --docs-folder to avoid indexing database files.
--limit | 3 | Number of document chunks to retrieve per query. Higher values provide more context but use more tokens. Can be overridden per request via rag_top_k in the JSON body.
--rag-tools / --no-rag-tools | true | Enable the read_full_document() tool so the LLM can request full document content when retrieved snippets are insufficient. Can be overridden per request via rag_enable_tools in the JSON body.
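
The per-request overrides are just extra fields in the request JSON; with the openai Python library they can be passed via extra_body (the values below are arbitrary):

# Override --limit and --rag-tools for a single request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "What do my notes say about X?"}],
    extra_body={"rag_top_k": 5, "rag_enable_tools": False},
)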

LLM: OpenAI-compatible

Option | Default | Description
--openai-base-url | - | Custom base URL for OpenAI-compatible API (e.g., for llama-server: http://localhost:8080/v1).
--openai-api-key | - | Your OpenAI API key. Can also be set with the OPENAI_API_KEY environment variable.

LLM Configuration

Option | Default | Description
--embedding-base-url | - | Base URL for the embedding API. Falls back to --openai-base-url if not set. Useful when using different providers for chat vs. embeddings.
--embedding-model | text-embedding-3-small | Embedding model to use for vectorization.
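
As a rough illustration of what these options control: document chunks are vectorized through an OpenAI-compatible embeddings endpoint. A standalone call of that kind looks like this (the input text is a placeholder; this is not the proxy's code):

# The kind of embedding call used for vectorization during indexing.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY; pass base_url=... to mirror --embedding-base-url
emb = client.embeddings.create(
    model="text-embedding-3-small",               # --embedding-model
    input=["a chunk of one of your documents"],   # placeholder text
)
print(len(emb.data[0].embedding))                 # vector dimensionality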

Server Configuration

Option | Default | Description
--host | 0.0.0.0 | Host/IP to bind API servers to.
--port | 8000 | Port for the RAG proxy API (e.g., http://localhost:8000/v1/chat/completions).

General Options

Option | Default | Description
--log-level | info | Set logging level.
--config | - | Path to a TOML configuration file.
--print-args | false | Print the command line arguments, including variables taken from the configuration file.

Supported Document Types

Text files (loaded directly):

  • .txt, .md, .json, .py, .js, .ts, .yaml, .yml, .rs, .go
  • .c, .cpp, .h, .sh, .toml, .rst, .ini, .cfg

Rich documents (converted via MarkItDown):

  • .pdf, .docx, .pptx, .xlsx, .html, .htm, .csv, .xml
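
For rich documents, a conversion step of roughly this shape runs before indexing; the snippet shows MarkItDown's basic API on its own (the file path is a placeholder, and the proxy's actual pipeline may differ):

# Convert a rich document to text with MarkItDown before chunking and indexing.
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.pdf")     # placeholder path
print(result.text_content[:500])      # the text that would be chunked and embedded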

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Your Client   │────▶│   RAG Proxy     │────▶│   LLM Backend   │
│  (chat, curl)   │◀────│  :8000          │◀────│ (Ollama/OpenAI) │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                        ┌────────▼────────┐
                        │    ChromaDB     │
                        │  (Vector Store) │
                        └────────┬────────┘
                        ┌────────▼────────┐
                        │   docs-folder   │
                        │  (Your Files)   │
                        └─────────────────┘
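
Conceptually, the proxy's outbound hop is just another chat-completions call against the backend, made with the augmented prompt; a sketch under that assumption (not the actual implementation):

# Sketch of the proxy forwarding an augmented prompt to the LLM backend.
from openai import OpenAI

backend = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")  # --openai-base-url
augmented = "Context from your documents:\n<retrieved chunks>\n\nQuestion: What do my notes say about X?"
reply = backend.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": augmented}],
)
print(reply.choices[0].message.content)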

Usage with Other Clients

Any OpenAI-compatible client can use the RAG proxy:

# curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<your-model>", "messages": [{"role": "user", "content": "What do my notes say about X?"}]}'

# Python (openai library)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "Summarize my project notes"}],
)
print(response.choices[0].message.content)

Tips

  • The server automatically re-indexes when files change
  • Use --limit to control how many document chunks are retrieved
  • Enable --rag-tools for the agent to request full documents when snippets aren't enough