# rag-proxy

A RAG (Retrieval-Augmented Generation) proxy server that lets you chat with your documents.

## Usage
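
The general invocation (`[OPTIONS]` is the conventional placeholder; all flags are documented under Options below):

```bash
agent-cli rag-proxy [OPTIONS]
```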

## Description

Enables "Chat with your Data" by running a local proxy server:

- Start the server, pointing to your documents folder and LLM
- The server watches the folder and indexes documents into a ChromaDB vector store
- Point any OpenAI-compatible client to this server's URL
- When you ask a question, the server retrieves relevant chunks and adds them to the prompt

## Installation

Requires the `rag` extra.
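
For example, with pip (assuming the package is published on PyPI under the same name as the CLI):

```bash
pip install "agent-cli[rag]"
```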

## Examples

```bash
# With local LLM (Ollama)
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-base-url http://localhost:11434/v1 \
  --port 8000

# With OpenAI
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-api-key sk-... \
  --port 8000

# Use with agent-cli chat
agent-cli chat --openai-base-url http://localhost:8000/v1 --llm-provider openai
```

## Options

### RAG Configuration

| Option | Default | Description |
|---|---|---|
| `--docs-folder` | `./rag_docs` | Folder to watch for documents. Files are auto-indexed on startup and when changed. Must not overlap with `--chroma-path`. |
| `--chroma-path` | `./rag_db` | ChromaDB storage directory for vector embeddings. Must be separate from `--docs-folder` to avoid indexing database files. |
| `--limit` | `3` | Number of document chunks to retrieve per query. Higher values provide more context but use more tokens. Can be overridden per-request via `rag_top_k` in the JSON body. |
| `--rag-tools/--no-rag-tools` | `true` | Enable the `read_full_document()` tool so the LLM can request full document content when retrieved snippets are insufficient. Can be overridden per-request via `rag_enable_tools` in the JSON body. |
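
Both per-request overrides are plain top-level keys in the JSON body. With the `openai` Python client you can pass them through `extra_body`, which merges extra keys into the request payload:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Retrieve 5 chunks for this request and disable the read_full_document tool.
response = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "What do my notes say about X?"}],
    extra_body={"rag_top_k": 5, "rag_enable_tools": False},
)
```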

### LLM: OpenAI-compatible

| Option | Default | Description |
|---|---|---|
| `--openai-base-url` | - | Custom base URL for an OpenAI-compatible API (e.g., for llama-server: `http://localhost:8080/v1`). |
| `--openai-api-key` | - | Your OpenAI API key. Can also be set with the `OPENAI_API_KEY` environment variable. |

### LLM Configuration

| Option | Default | Description |
|---|---|---|
| `--embedding-base-url` | - | Base URL for the embedding API. Falls back to `--openai-base-url` if not set. Useful when using different providers for chat vs embeddings. |
| `--embedding-model` | `text-embedding-3-small` | Embedding model to use for vectorization. |
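
For example, you could serve chat through a local Ollama instance while fetching embeddings from OpenAI (a sketch combining the flags above):

```bash
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-base-url http://localhost:11434/v1 \
  --openai-api-key sk-... \
  --embedding-base-url https://api.openai.com/v1 \
  --embedding-model text-embedding-3-small
```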

### Server Configuration

| Option | Default | Description |
|---|---|---|
| `--host` | `0.0.0.0` | Host/IP to bind API servers to. |
| `--port` | `8000` | Port for the RAG proxy API (e.g., `http://localhost:8000/v1/chat/completions`). |
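
For example, to expose the proxy only on the loopback interface rather than all interfaces:

```bash
agent-cli rag-proxy --docs-folder ~/Documents/Notes --host 127.0.0.1 --port 8000
```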

### General Options

| Option | Default | Description |
|---|---|---|
| `--log-level` | `info` | Set the logging level. |
| `--config` | - | Path to a TOML configuration file. |
| `--print-args` | `false` | Print the command-line arguments, including values taken from the configuration file. |
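
As a sketch, a config file might mirror the CLI flag names; the exact keys and section names are documented on the Configuration page linked under Related:

```toml
# Hypothetical keys mirroring the CLI flags; see the Configuration docs for
# the authoritative names.
[rag-proxy]
docs-folder = "~/Documents/Notes"
chroma-path = "~/rag_db"
limit = 3
port = 8000
```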

## Supported Document Types

**Text files** (loaded directly):
`.txt`, `.md`, `.json`, `.py`, `.js`, `.ts`, `.yaml`, `.yml`, `.rs`, `.go`, `.c`, `.cpp`, `.h`, `.sh`, `.toml`, `.rst`, `.ini`, `.cfg`

**Rich documents** (converted via MarkItDown):
`.pdf`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.htm`, `.csv`, `.xml`

## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Your Client   │────▶│    RAG Proxy    │────▶│   LLM Backend   │
│  (chat, curl)   │◀────│      :8000      │◀────│ (Ollama/OpenAI) │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                                 │
                        ┌────────▼────────┐
                        │    ChromaDB     │
                        │  (Vector Store) │
                        └────────┬────────┘
                                 │
                        ┌────────▼────────┐
                        │   docs-folder   │
                        │  (Your Files)   │
                        └─────────────────┘
```

## Related
- RAG System Architecture - Detailed design and data flow
- memory - Long-term memory proxy (different retrieval and storage model)
- Memory System Architecture - How memory storage works
- Configuration - Config file keys and defaults

## Usage with Other Clients

Any OpenAI-compatible client can use the RAG proxy:

```bash
# curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<your-model>", "messages": [{"role": "user", "content": "What do my notes say about X?"}]}'
```

```python
# Python (openai library)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "Summarize my project notes"}],
)
print(response.choices[0].message.content)
```

## Tips

- The server automatically re-indexes when files change
- Use `--limit` to control how many document chunks are retrieved
- Enable `--rag-tools` for the agent to request full documents when snippets aren't enough