rag-proxy

A RAG (Retrieval-Augmented Generation) proxy server that lets you chat with your documents.

Usage

agent-cli rag-proxy [OPTIONS]

Description

Enables "Chat with your Data" by running a local proxy server:

  1. Start the server, pointing to your documents folder and LLM
  2. The server watches the folder and indexes documents into a ChromaDB vector store
  3. Point any OpenAI-compatible client to this server's URL
  4. When you ask a question, the server retrieves relevant chunks and adds them to the prompt
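
To make step 4 concrete, here is a minimal sketch of the retrieve-and-augment idea using the chromadb Python client directly. The collection name, prompt template, and query text are illustrative assumptions, not the proxy's actual internals:

# Sketch only: retrieve the top matching chunks and prepend them to the question.
# The collection name ("docs") and prompt format are assumptions for illustration.
import chromadb

chroma = chromadb.PersistentClient(path="./rag_db")            # --chroma-path
collection = chroma.get_or_create_collection("docs")           # hypothetical collection name

question = "What do my notes say about X?"
hits = collection.query(query_texts=[question], n_results=3)   # --limit

context = "\n\n".join(hits["documents"][0])
augmented = f"Context from your documents:\n{context}\n\nQuestion: {question}"
# The proxy then sends the augmented prompt to the configured LLM backend.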

Installation

Requires the rag extra:

pip install "agent-cli[rag]"
# or from repo
uv sync --extra rag

Examples

# With local LLM (Ollama)
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-base-url http://localhost:11434/v1 \
  --port 8000

# With OpenAI
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-api-key sk-... \
  --port 8000

# Use with agent-cli chat
agent-cli chat --openai-base-url http://localhost:8000/v1 --llm-provider openai

Options

RAG Configuration

Option | Default | Description
--docs-folder | ./rag_docs | Folder to watch for documents. Files are auto-indexed on startup and when changed. Must not overlap with --chroma-path.
--chroma-path | ./rag_db | ChromaDB storage directory for vector embeddings. Must be separate from --docs-folder to avoid indexing database files.
--limit | 3 | Number of document chunks to retrieve per query. Higher values provide more context but use more tokens. Can be overridden per request via rag_top_k in the JSON body.
--rag-tools / --no-rag-tools | true | Enable the read_full_document() tool so the LLM can request full document content when retrieved snippets are insufficient. Can be overridden per request via rag_enable_tools in the JSON body.
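
The per-request overrides are just extra fields in the request JSON; with the openai Python library they can be passed via extra_body (the values below are arbitrary):

# Override --limit and --rag-tools for a single request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "What do my notes say about X?"}],
    extra_body={"rag_top_k": 5, "rag_enable_tools": False},
)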

LLM: OpenAI-compatible

Option | Default | Description
--openai-base-url | - | Custom base URL for OpenAI-compatible API (e.g., for llama-server: http://localhost:8080/v1).
--openai-api-key | - | Your OpenAI API key. Can also be set with the OPENAI_API_KEY environment variable.

LLM Configuration

Option | Default | Description
--embedding-base-url | - | Base URL for the embedding API. Falls back to --openai-base-url if not set. Useful when using different providers for chat vs. embeddings.
--embedding-model | text-embedding-3-small | Embedding model to use for vectorization.
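
As a rough illustration of what these options control: document chunks are vectorized through an OpenAI-compatible embeddings endpoint. A standalone call of that kind looks like this (the input text is a placeholder; this is not the proxy's code):

# The kind of embedding call used for vectorization during indexing.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY; pass base_url=... to mirror --embedding-base-url
emb = client.embeddings.create(
    model="text-embedding-3-small",               # --embedding-model
    input=["a chunk of one of your documents"],   # placeholder text
)
print(len(emb.data[0].embedding))                 # vector dimensionality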

Server Configuration

Option | Default | Description
--host | 0.0.0.0 | Host/IP to bind API servers to.
--port | 8000 | Port for the RAG proxy API (e.g., http://localhost:8000/v1/chat/completions).

General Options

Option | Default | Description
--log-level | info | Set logging level.
--config | - | Path to a TOML configuration file.
--print-args | false | Print the command line arguments, including variables taken from the configuration file.

Supported Document Types

Text files (loaded directly):

  • .txt, .md, .json, .py, .js, .ts, .yaml, .yml, .rs, .go
  • .c, .cpp, .h, .sh, .toml, .rst, .ini, .cfg

Rich documents (converted via MarkItDown):

  • .pdf, .docx, .pptx, .xlsx, .html, .htm, .csv, .xml
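
For rich documents, a conversion step of roughly this shape runs before indexing; the snippet shows MarkItDown's basic API on its own (the file path is a placeholder, and the proxy's actual pipeline may differ):

# Convert a rich document to text with MarkItDown before chunking and indexing.
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.pdf")     # placeholder path
print(result.text_content[:500])      # the text that would be chunked and embedded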

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Your Client   │────▶│   RAG Proxy     │────▶│   LLM Backend   │
│  (chat, curl)   │◀────│  :8000          │◀────│ (Ollama/OpenAI) │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                        ┌────────▼────────┐
                        │    ChromaDB     │
                        │  (Vector Store) │
                        └────────┬────────┘
                        ┌────────▼────────┐
                        │   docs-folder   │
                        │  (Your Files)   │
                        └─────────────────┘
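
Conceptually, the proxy's outbound hop is just another chat-completions call against the backend, made with the augmented prompt; a sketch under that assumption (not the actual implementation):

# Sketch of the proxy forwarding an augmented prompt to the LLM backend.
from openai import OpenAI

backend = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")  # --openai-base-url
augmented = "Context from your documents:\n<retrieved chunks>\n\nQuestion: What do my notes say about X?"
reply = backend.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": augmented}],
)
print(reply.choices[0].message.content)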

Usage with Other Clients

Any OpenAI-compatible client can use the RAG proxy:

# curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<your-model>", "messages": [{"role": "user", "content": "What do my notes say about X?"}]}'

# Python (openai library)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "Summarize my project notes"}],
)
print(response.choices[0].message.content)

Tips

  • The server automatically re-indexes when files change
  • Use --limit to control how many document chunks are retrieved
  • Enable --rag-tools for the agent to request full documents when snippets aren't enough