Building RAGTime: A Complete RAG System That Runs Entirely in Your Browser
How I built a privacy-first RAG pipeline using WebLLM, Transformers.js, and IndexedDB.
What if you could chat with your PDF documents using AI, without sending a single byte of data to external servers? That's exactly what I built with RAGTime. It's a complete Retrieval-Augmented Generation (RAG) pipeline that runs entirely in your browser.
RAGTime is part of The Big Idea, an offline-first productivity PWA that combines task management, note-taking, and AI-powered document chat. In this post, I'll walk through how I built the RAG pipeline from first principles, the architectural decisions I made, and where I'm heading next.
What RAGTime Does
From a user's perspective, RAGTime is simple: upload a PDF, select an AI model, and start asking questions. The AI answers based on the document's content, citing specific pages as sources. It feels like magic, but under the hood, there's a sophisticated pipeline making it all work.
The key difference from services like ChatGPT or Claude is that everything happens locally. Your documents never leave your device. The AI model runs in your browser using WebGPU acceleration. Even the vector embeddings for semantic search are generated on your machine.
The Local RAG Pipeline
RAGTime's pipeline consists of five main stages:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Document │───▶│ Text │───▶│ Embedding │───▶│ Vector │───▶│ Local LLM │
│ Upload │ │ Chunking │ │ Generation │ │ Search │ │ Inference │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
(Per stage: PDF.js → 1000-character chunks with 200-character overlap → Transformers.js with all-MiniLM-L6-v2 → IndexedDB + cosine similarity → WebLLM / Transformers.js)
Let's dive into each component.
Stage 1: Document Processing with PDF.js
When a user uploads a PDF, I need to extract its text content. I use Mozilla's PDF.js library (the same engine powering Firefox's built-in PDF viewer).
const extractTextFromPDF = async (
pdfData: ArrayBuffer,
updateProgress: (progress: number, message: string) => void
): Promise<{ text: string; pageNumber: number }[]> => {
// Create a fresh copy to prevent ArrayBuffer detachment
const pdfDataCopy = new Uint8Array(pdfData.slice(0)).buffer;
const loadingTask = pdfjsLib.getDocument({ data: pdfDataCopy });
const pdf = await loadingTask.promise;
const numPages = pdf.numPages;
const pages = [];
for (let i = 1; i <= numPages; i++) {
const page = await pdf.getPage(i);
const content = await page.getTextContent();
// Reconstruct text with proper line breaks
let lastY: number | null = null;
let text = '';
for (const item of content.items) {
if ('str' in item) {
// Add line break when Y position changes significantly
if (lastY !== null && Math.abs(item.transform[5] - lastY) > 5) {
text += '\n';
}
text += item.str;
lastY = item.transform[5];
}
}
pages.push({ text: text.trim(), pageNumber: i });
updateProgress(10 + Math.round((i / numPages) * 40),
`Extracting text from page ${i}/${numPages}`);
}
return pages;
};
Key insight: PDF text items don't come with natural line breaks. I detect paragraph boundaries by tracking the Y-coordinate transform. When it jumps significantly, I insert a newline.
Important gotcha: ArrayBuffers can become "detached" when passed between contexts in IndexedDB. I always create a fresh copy with slice(0) before processing. (I learned this one the hard way!)
Intelligent Chunking
Once I have the raw text, I split it into chunks that are small enough to embed but large enough to preserve context:
const pdfProcessingOptions = {
chunkSize: 1000, // Target characters per chunk
chunkOverlap: 200, // Overlap between chunks
includePageNumbers: true
};
The 200-character overlap is crucial. It ensures that if relevant information spans a chunk boundary, both chunks will contain enough context. Each chunk is prefixed with its page number for citation tracking.
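The sliding-window logic can be sketched in a few lines (a hypothetical `chunkText` helper, not the exact production code):

```typescript
// Sliding-window chunker: each window starts (chunkSize - chunkOverlap)
// characters after the previous one, so neighbouring chunks share
// `chunkOverlap` characters of context.
const chunkText = (
  text: string,
  chunkSize = 1000,
  chunkOverlap = 200
): string[] => {
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final window reached the end
  }
  return chunks;
};
```

With the defaults above, a 2,500-character page becomes three chunks, and any passage shorter than the overlap that straddles a boundary appears whole in at least one of them.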
Stage 2: Embedding Generation with Transformers.js
Here's where things get interesting! I need to convert each text chunk into a 384-dimensional vector that captures its semantic meaning. Traditional approaches send text to an API like OpenAI's embeddings endpoint, but I wanted to keep everything local.
Enter Transformers.js. It's a JavaScript port of Hugging Face's Transformers library that runs models directly in the browser using WebAssembly and WebGL.
import { pipeline, env } from '@xenova/transformers';
// Configure for browser usage
env.allowLocalModels = false;
env.useBrowserCache = true;
env.cacheDir = '/cache/transformers.js';
class EmbeddingService {
private embeddingPipeline: any = null;
async loadModel(): Promise<void> {
if (this.embeddingPipeline) return; // already loaded, reuse the pipeline
this.embeddingPipeline = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2',
{
revision: 'main',
quantized: false
}
);
}
async generateEmbedding(text: string): Promise<number[]> {
await this.loadModel();
const result = await this.embeddingPipeline(text, {
pooling: 'mean', // Average all token embeddings
normalize: true // L2 normalize for cosine similarity
});
return Array.from(result.data);
}
}
I chose all-MiniLM-L6-v2 for embeddings because it's small (about 23MB), fast, and produces high-quality semantic representations. The mean pooling strategy averages all token embeddings, and normalization ensures vectors have unit length. (This is essential for cosine similarity comparisons.)
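One nice consequence of normalization: for unit-length vectors the cosine similarity is just the dot product, since both magnitude terms are 1. A quick illustration (hypothetical helpers, not the app's code):

```typescript
// cos(a, b) = (a · b) / (|a||b|); with L2-normalized vectors |a| = |b| = 1,
// so the similarity reduces to a plain dot product.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, ai, i) => sum + ai * b[i], 0);

const l2Normalize = (v: number[]): number[] => {
  const norm = Math.sqrt(dot(v, v));
  return v.map(x => x / norm);
};

const a = l2Normalize([1, 2, 2]); // |[1, 2, 2]| = 3
const b = l2Normalize([2, 2, 1]); // |[2, 2, 1]| = 3
const similarity = dot(a, b);     // (2 + 4 + 2) / 9 = 8/9 ≈ 0.889
```

This is why the full cosine formula in the search code is technically redundant for these embeddings, though it does no harm as a defensive measure.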
Web Worker Optimization
Embedding generation is CPU-intensive, and processing hundreds of chunks on the main thread would freeze the UI. So I offload this work to a Web Worker:
const generateEmbeddings = async (
chunks: DocumentChunk[],
updateProgress: (progress: number, message: string) => void,
abortSignal?: AbortSignal
): Promise<DocumentChunk[]> => {
try {
// Try Web Worker for better performance
const results = await embeddingWorkerService.processChunks(
chunks,
(current, total, message) => {
updateProgress(70 + Math.round((current / total) * 30), message);
},
abortSignal
);
return results;
} catch (workerError) {
// Fallback to main thread if worker fails
console.warn('Worker failed, falling back to main thread');
// ... sequential processing
}
};
The abort signal enables graceful cancellation when users close the tab or cancel operations mid-processing.
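The fallback path deserves a sketch of its own. Assuming a hypothetical `embedOne` function, the sequential version with cooperative cancellation looks roughly like this:

```typescript
// Sequential embedding with cooperative cancellation: check the AbortSignal
// between chunks so a cancelled operation stops promptly instead of
// grinding through the remaining work.
const embedAll = async (
  texts: string[],
  embedOne: (text: string) => Promise<number[]>, // hypothetical embedder
  abortSignal?: AbortSignal
): Promise<number[][]> => {
  const results: number[][] = [];
  for (const text of texts) {
    if (abortSignal?.aborted) {
      throw new DOMException('Embedding cancelled', 'AbortError');
    }
    results.push(await embedOne(text));
  }
  return results;
};
```

Throwing an `AbortError` (rather than returning partial results) lets callers distinguish cancellation from success in a single `catch`.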
Stage 3: Vector Storage with IndexedDB
With embeddings generated, I need persistent storage. LocalStorage has a 5MB limit and can't store binary data efficiently, while IndexedDB handles large objects and structured data comfortably.
const DB_NAME = 'the_big_idea_db';
const STORES = {
DOCUMENTS: 'documents', // PDFs and metadata
DOCUMENT_CHUNKS: 'document_chunks', // Text chunks with embeddings
CHATS: 'chats', // Conversation history
MODELS: 'models' // Downloaded model metadata
};
// Document chunk structure:
interface DocumentChunk {
id: string;
documentId: string;
text: string;
pageNumber: number;
startPosition: number;
endPosition: number;
embedding: number[]; // 384-dimensional vector
}
export const saveDocumentChunks = async (
documentId: string,
chunks: DocumentChunk[]
): Promise<DocumentChunk[]> => {
const db = await getDB();
const transaction = db.transaction(STORES.DOCUMENT_CHUNKS, 'readwrite');
const store = transaction.objectStore(STORES.DOCUMENT_CHUNKS);
// Clear existing chunks for this document
const index = store.index('documentId');
await new Promise<void>((resolve, reject) => {
const cursorRequest = index.openCursor(IDBKeyRange.only(documentId));
cursorRequest.onsuccess = (event) => {
const cursor = (event.target as IDBRequest<IDBCursorWithValue>).result;
if (cursor) {
cursor.delete();
cursor.continue();
} else {
// Deletion finished; now it's safe to write the replacement chunks
resolve();
}
};
cursorRequest.onerror = () => reject(cursorRequest.error);
});
// Save new chunks with embeddings
return Promise.all(chunks.map(chunk =>
saveToStore(STORES.DOCUMENT_CHUNKS, chunk)
));
};
This gives me about 50% of device storage (typically several GB) for documents and embeddings. Far more than localStorage's 5MB limit!
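You can check the actual quota at runtime via the Storage API. A small sketch (the estimator is injectable here purely so the logic is testable; in the browser you'd pass `() => navigator.storage.estimate()`):

```typescript
// navigator.storage.estimate() reports this origin's quota and usage in bytes;
// useful for warning users before a multi-gigabyte model download.
type QuotaEstimate = { usage?: number; quota?: number };

const hasHeadroomGB = async (
  neededGB: number,
  estimate: () => Promise<QuotaEstimate>
): Promise<boolean> => {
  const { usage = 0, quota = 0 } = await estimate();
  return (quota - usage) / 1024 ** 3 >= neededGB;
};

// In the browser:
//   await hasHeadroomGB(5, () => navigator.storage.estimate())
```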
Stage 4: Hybrid Similarity Search
When a user asks a question, I generate an embedding for their query and find the most similar chunks. My search combines semantic vectors with keyword matching:
const findSimilarChunks = async (
query: string,
documentIds: string[],
maxResults = 15
) => {
// Preprocess query
const processedQuery = query.toLowerCase().trim();
// Generate query embedding
const queryEmbedding = await embeddingService.generateEmbedding(processedQuery);
// Resolve the requested documents (documents here is the in-memory
// document list), then gather their chunks
const relevantDocuments = documents.filter(doc => documentIds.includes(doc.id));
const allChunks: DocumentChunk[] = [];
for (const doc of relevantDocuments) {
if (doc.chunks.length === 0) {
// Load from IndexedDB if not in memory
const loadedChunks = await IndexedDBStorage.getDocumentChunks(doc.id);
allChunks.push(...loadedChunks);
} else {
allChunks.push(...doc.chunks);
}
}
// Calculate cosine similarity for each chunk
const chunksWithScores = allChunks
.filter(chunk => chunk.embedding?.length > 0)
.map(chunk => {
// Cosine similarity calculation
let dotProduct = 0;
let queryMagnitude = 0;
let embeddingMagnitude = 0;
for (let i = 0; i < queryEmbedding.length; i++) {
dotProduct += queryEmbedding[i] * chunk.embedding[i];
queryMagnitude += queryEmbedding[i] ** 2;
embeddingMagnitude += chunk.embedding[i] ** 2;
}
const similarity = dotProduct /
(Math.sqrt(queryMagnitude) * Math.sqrt(embeddingMagnitude));
// Keyword matching as secondary signal
const queryWords = processedQuery.split(/\s+/).filter(w => w.length > 3);
const textRelevance = queryWords.filter(w =>
chunk.text.toLowerCase().includes(w)
).length / queryWords.length;
// 80% semantic, 20% keyword
return {
chunk,
score: similarity,
textRelevance,
combinedScore: (similarity * 0.8) + (textRelevance * 0.2)
};
});
// Filter by threshold and return top results
const SIMILARITY_THRESHOLD = 0.5;
return chunksWithScores
.filter(item => item.combinedScore > SIMILARITY_THRESHOLD)
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, maxResults);
};
The hybrid approach handles edge cases where the embedding model might miss exact keyword matches that are actually relevant (like proper nouns or technical terms).
Stage 5: Local LLM Inference
The final piece is running an actual language model in the browser. I built a Global Model Manager (GMM) that abstracts over two inference backends:
| Backend | Technology | Best For |
|---|---|---|
| WebLLM | WebGPU | Chrome with modern GPU |
| Transformers.js | WebGL2/CPU | Safari, older browsers |
export class GlobalModelManager extends EventEmitter {
private capabilities: BrowserCapabilities | null = null;
private webllmInstance: any = null;
private transformersInstance: any = null;
static getStackRecommendation(caps: BrowserCapabilities) {
// Chrome with WebGPU is ideal for WebLLM
if (caps.hasWebGPU && caps.isChrome && !caps.isMobile) {
return {
recommended: 'webllm',
reason: 'WebGPU available in Chrome desktop',
confidence: 90
};
}
// Safari needs Transformers.js for compatibility
if (caps.isSafari) {
return {
recommended: 'transformers',
reason: 'Safari compatibility via Transformers.js',
confidence: 85
};
}
// Default fallback
return {
recommended: 'transformers',
reason: 'Better cross-browser support',
confidence: 80
};
}
}
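Capability detection itself is straightforward. A minimal probe (the navigator object is passed in so the sketch is testable; the field names are illustrative, not RAGTime's exact `BrowserCapabilities`):

```typescript
// WebGPU support is confirmed by actually requesting an adapter:
// navigator.gpu may exist while requestAdapter() still resolves to null.
interface NavLike {
  userAgent: string;
  gpu?: { requestAdapter(): Promise<unknown | null> };
}

const detectCaps = async (nav: NavLike) => {
  const adapter = nav.gpu ? await nav.gpu.requestAdapter() : null;
  return {
    hasWebGPU: adapter !== null,
    // Safari's UA contains "Safari" but not "Chrome"/"Chromium"/"Edg"
    isSafari: /Safari/.test(nav.userAgent) &&
              !/Chrome|Chromium|Edg/.test(nav.userAgent),
  };
};
```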
Current Model Catalog
Note: The model offerings are actively being updated. The current catalog includes:
| Model | Size | Download | Stack | Notes |
|---|---|---|---|---|
| Qwen3 0.6B | 0.6B | ~400MB | WebLLM | Fast, good for mobile |
| Llama 3.2 1B | 1B | ~800MB | WebLLM | Balanced performance |
| Phi-4 Mini | 3.8B | ~1.2GB | WebLLM | Strong reasoning |
| Llama 3.2 3B | 3B | ~2GB | WebLLM | High quality |
| Llama 3.1 8B | 8B | ~4.5GB | WebLLM | Best quality (needs good GPU) |
| SmolLM2 135M | 135M | ~150MB | Transformers.js | Safari-compatible, ultra-fast |
| SmolLM2 360M | 360M | ~250MB | Transformers.js | Safari-compatible |
| SmolLM2 1.7B | 1.7B | ~1.2GB | Transformers.js | Good quality on Safari |
I'm working on adding newer models, better quantization options, and improved Safari support. The GMM architecture makes adding models straightforward. Just add entries with the appropriate mlcId (WebLLM) or huggingFaceId (Transformers.js).
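A catalog entry might look something like this (field names and the model id are illustrative, based on the mlcId/huggingFaceId split described above, not the exact GMM schema):

```typescript
// One entry per model; exactly one of mlcId / huggingFaceId is set,
// matching the backend the model runs on.
interface ModelCatalogEntry {
  name: string;
  stack: 'webllm' | 'transformers';
  mlcId?: string;          // WebLLM (MLC) model identifier
  huggingFaceId?: string;  // Transformers.js model identifier
  downloadSizeMB: number;
}

const entry: ModelCatalogEntry = {
  name: 'SmolLM2 360M',
  stack: 'transformers',
  huggingFaceId: 'HuggingFaceTB/SmolLM2-360M-Instruct', // illustrative id
  downloadSizeMB: 250,
};
```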
The RAG Prompt
When a user asks a question, I build a context-aware prompt:
const systemPrompt = `You are an expert document analyst assistant.
INSTRUCTIONS:
1. Provide DETAILED and THOROUGH answers using the document context below
2. Include specific facts, figures, examples, and quotes when available
3. Structure longer answers with clear sections or bullet points
4. If the context doesn't contain enough information, clearly state what is and isn't covered
5. Never fabricate information - only use what's in the provided context
DOCUMENT CONTEXT:
${context}
Based on the above context, provide a comprehensive answer to the user's question.`;
The explicit instruction to only use provided context is crucial for factual accuracy. It prevents the model from hallucinating information not in the document.
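The `${context}` interpolation is assembled from the retrieved chunks, with page tags preserved for citations. A sketch of that assembly (hypothetical helper, with an assumed character budget):

```typescript
// Build the DOCUMENT CONTEXT block: highest-scoring chunks first, each
// tagged with its page number, stopping when the character budget is hit.
interface RetrievedChunk { text: string; pageNumber: number; combinedScore: number }

const buildContext = (chunks: RetrievedChunk[], maxChars = 6000): string => {
  let context = '';
  for (const c of [...chunks].sort((a, b) => b.combinedScore - a.combinedScore)) {
    const entry = `[Page ${c.pageNumber}]\n${c.text}\n\n`;
    if (context.length + entry.length > maxChars) break; // stay within budget
    context += entry;
  }
  return context.trim();
};
```

Capping the context length matters because local models have small context windows, and every extra token slows browser inference.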
Streaming Responses
Nobody wants to wait 30 seconds staring at a blank screen! I implemented true streaming for WebLLM:
await modelManager.generateCompletionStream(
content,
systemPrompt,
chatHistory,
(chunk) => {
fullResponse += chunk;
// Update the message in React state
setChats(prev => prev.map(c => {
if (c.id === chatId) {
return {
...c,
messages: c.messages.map(msg => {
if (msg === c.messages[c.messages.length - 1]) {
return { ...msg, content: fullResponse };
}
return msg;
})
};
}
return c;
}));
}
);
WebLLM supports native streaming via async iterators. Transformers.js simulates it by splitting the complete response word-by-word with delays.
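The Transformers.js simulation is simple enough to sketch in full (a hypothetical `simulateStream` helper):

```typescript
// Simulated streaming: split the complete response and emit it word-by-word
// with a small delay, matching the shape of WebLLM's incremental callback.
const simulateStream = async (
  fullText: string,
  onChunk: (chunk: string) => void,
  delayMs = 20
): Promise<void> => {
  const words = fullText.split(/(\s+)/); // keep whitespace so text rejoins exactly
  for (const word of words) {
    if (word) onChunk(word);
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
};
```

Because both backends deliver text through the same chunk callback, the React state update code above works unchanged either way.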
Progressive Web App Integration
RAGTime is part of a larger PWA, so I needed special handling for offline support. AI model files can be huge. The 8B model is 4.5GB! My service worker uses aggressive caching:
// From vite.config.ts
runtimeCaching: [
{
urlPattern: /^https:\/\/huggingface\.co\/.*\.(safetensors|json|wasm|bin)$/,
handler: 'CacheFirst',
options: {
cacheName: 'model-cache',
expiration: {
maxEntries: 50,
maxAgeSeconds: 30 * 24 * 60 * 60 // 30 days
},
cacheableResponse: { statuses: [0, 200] }
}
}
]
Once a model is downloaded, it's cached locally. Subsequent loads are instant, and the app works completely offline.
Lessons Learned
Building RAGTime taught me several things:
Browser APIs are powerful enough. Between IndexedDB, WebGPU, Web Workers, and WebAssembly, modern browsers can run surprisingly sophisticated AI workloads.
Memory management matters. I hit issues with detached ArrayBuffers when passing PDF data between contexts. The solution was defensive copying:
const pdfDataCopy = new Uint8Array(pdfData.slice(0)).buffer;
Graceful degradation is essential. Not everyone has a WebGPU-capable browser. The fallback to Transformers.js ensures RAGTime works everywhere (even if it's slower).
Progress feedback is crucial. Processing a large PDF with hundreds of chunks takes time. Without progress indicators, users would think the app crashed.
Roadmap: Areas for Improvement in The Big Idea
RAGTime is functional, but there's significant room to grow. Here's my improvement roadmap:
1. Smarter Chunking Strategies
My current chunking is character-based with fixed sizes. Planned improvements:
- Semantic chunking: Split at natural boundaries (paragraphs/sections/sentences)
- Recursive chunking: Hierarchical chunks for multi-level retrieval
- Metadata extraction: Detect headers, tables, lists, and structured content
- Overlap optimization: Dynamic overlap based on content type
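As a taste of where semantic chunking might go, here's one possible approach: split on blank lines and pack whole paragraphs into each chunk, so no chunk ever cuts mid-sentence (a rough sketch of future work, not implemented in RAGTime yet):

```typescript
// Paragraph-packing chunker: split on blank lines, then greedily fill each
// chunk with whole paragraphs up to a size budget.
// (A single paragraph longer than maxChars still becomes one oversized chunk.)
const semanticChunk = (text: string, maxChars = 1000): string[] => {
  const paragraphs = text.split(/\n{2,}/).map(p => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = '';
  for (const p of paragraphs) {
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? `${current}\n\n${p}` : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
};
```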
2. Enhanced Hybrid Search
I already blend vector similarity with keyword matching, but I can go further:
- BM25 integration: Full-text search for exact keyword matching
- Re-ranking: Use a cross-encoder model to reorder retrieved chunks
- Query expansion: Generate related queries for better recall
- Contextual reranking: Consider document structure in ranking
3. Multi-Document Reasoning
Currently, RAGTime handles one document at a time well. Future work:
- Cross-document search: Query across all uploaded documents simultaneously
- Citation tracking: Show which document each answer comes from
- Comparison mode: "Compare what Document A and B say about X"
- Document collections: Group related documents for scoped search
4. Model Management Improvements
The Global Model Manager has room to grow:
- Memory monitoring: Real-time GPU/CPU memory usage tracking
- Model switching: Hot-swap models without full page reload
- Quantization picker: Let users choose quality vs. speed tradeoffs
- Model recommendations: Suggest models based on device capabilities and document size
- Download resume: Pause and resume large model downloads
- Model pruning: Automatically clean up unused models to free storage
5. Performance Optimizations
Browser-based AI is compute-constrained. I'm exploring:
- WebGPU shaders: Custom kernels for similarity search
- Streaming embeddings: Generate embeddings as PDFs parse (pipeline)
- Lazy chunk loading: Only load chunks into memory when needed
- Embedding compression: Quantize embeddings for 50%+ storage savings
- Batch inference: Process multiple queries efficiently
6. User Experience Polish
- Cancel operations: Graceful abort for uploads/embedding/inference
- Export/import: Backup and restore document libraries
- Collaboration: Optional P2P sync between devices (using existing Talkie infrastructure)
- Document previews: Show PDF preview alongside chat
- Citation highlighting: Click a citation to jump to that page in the PDF
- Conversation branching: Fork conversations to explore different questions
7. Advanced RAG Techniques
- Agentic RAG: Let the model decide when it needs more context
- Iterative retrieval: Multiple rounds of search for complex questions
- Self-consistency: Generate multiple answers and synthesize
- Fact verification: Cross-check answers against retrieved chunks
Try It Yourself
RAGTime proves that privacy-first AI isn't just possible. It's practical! The entire codebase uses React, TypeScript, and standard web APIs. No servers, no API keys, no data leaving your device.
Check out The Big Idea to try RAGTime, or explore the source code on GitHub.
The future of AI might just be local-first.
Have questions about browser-based AI or want to discuss the implementation? Open an issue on the repository or reach out directly.