#ai #rag #webllm #transformers.js #browser #privacy #local-first

Building RAGTime: A Complete RAG System That Runs Entirely in Your Browser

How I built a privacy-first RAG pipeline using WebLLM, Transformers.js, and IndexedDB.


[Illustration: Dakota and a robot talking about documents]

What if you could chat with your PDF documents using AI, without sending a single byte of data to external servers? That's exactly what I built with RAGTime. It's a complete Retrieval-Augmented Generation (RAG) pipeline that runs entirely in your browser.

RAGTime is part of The Big Idea, an offline-first productivity PWA that combines task management, note-taking, and AI-powered document chat. In this post, I'll walk through how I built the RAG pipeline from first principles, the architectural decisions I made, and where I'm heading next.

What RAGTime Does

From a user's perspective, RAGTime is simple: upload a PDF, select an AI model, and start asking questions. The AI answers based on the document's content, citing specific pages as sources. It feels like magic, but under the hood, there's a sophisticated pipeline making it all work.

The key difference from services like ChatGPT or Claude is that everything happens locally. Your documents never leave your device. The AI model runs in your browser using WebGPU acceleration. Even the vector embeddings for semantic search are generated on your machine.

The Local RAG Pipeline

RAGTime's pipeline consists of five main stages:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Document   │───▶│    Text     │───▶│  Embedding  │───▶│   Vector    │───▶│  Local LLM  │
│  Upload     │    │  Chunking   │    │ Generation  │    │   Search    │    │  Inference  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     PDF.js           1000 chars        Transformers.js     IndexedDB +        WebLLM /
                      200 overlap       all-MiniLM-L6-v2    Cosine Sim      Transformers.js

Let's dive into each component.

Stage 1: Document Processing with PDF.js

When a user uploads a PDF, I need to extract its text content. I use Mozilla's PDF.js library (the same engine powering Firefox's built-in PDF viewer).

TYPESCRIPT
const extractTextFromPDF = async (
  pdfData: ArrayBuffer,
  updateProgress: (progress: number, message: string) => void
): Promise<{ text: string; pageNumber: number }[]> => {
  // Create a fresh copy to prevent ArrayBuffer detachment
  const pdfDataCopy = new Uint8Array(pdfData.slice(0)).buffer;

  const loadingTask = pdfjsLib.getDocument({ data: pdfDataCopy });
  const pdf = await loadingTask.promise;
  const numPages = pdf.numPages;

  const pages = [];
  for (let i = 1; i <= numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();

    // Reconstruct text with proper line breaks
    let lastY: number | null = null;
    let text = '';

    for (const item of content.items) {
      if ('str' in item) {
        // Add line break when Y position changes significantly
        if (lastY !== null && Math.abs(item.transform[5] - lastY) > 5) {
          text += '\n';
        }
        text += item.str;
        lastY = item.transform[5];
      }
    }

    pages.push({ text: text.trim(), pageNumber: i });
    updateProgress(10 + Math.round((i / numPages) * 40),
      `Extracting text from page ${i}/${numPages}`);
  }

  return pages;
};

Key insight: PDF text items don't come with natural line breaks. I detect paragraph boundaries by tracking the Y-coordinate transform. When it jumps significantly, I insert a newline.

Important gotcha: ArrayBuffers can become "detached" when passed between contexts in IndexedDB. I always create a fresh copy with slice(0) before processing. (I learned this one the hard way!)

Intelligent Chunking

Once I have the raw text, I split it into chunks that are small enough to embed but large enough to preserve context:

TYPESCRIPT
const pdfProcessingOptions = {
  chunkSize: 1000,      // Target characters per chunk
  chunkOverlap: 200,    // Overlap between chunks
  includePageNumbers: true
};

The 200-character overlap is crucial. It ensures that if relevant information spans a chunk boundary, both chunks will contain enough context. Each chunk is prefixed with its page number for citation tracking.
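A minimal sketch of what this fixed-size chunker looks like (illustrative only; the helper name and exact slicing are mine, not necessarily the shipped code):

```typescript
// Illustrative fixed-size chunking with overlap. Each step advances by
// chunkSize minus chunkOverlap, so adjacent chunks share a 200-char window.
interface TextChunk {
  text: string;
  startPosition: number;
  endPosition: number;
}

const chunkText = (
  text: string,
  chunkSize = 1000,
  chunkOverlap = 200
): TextChunk[] => {
  const chunks: TextChunk[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      text: text.slice(start, end),
      startPosition: start,
      endPosition: end
    });
    if (end === text.length) break;
    start += chunkSize - chunkOverlap;  // Step forward, keeping the overlap
  }
  return chunks;
};
```

With the defaults, a 2,500-character page produces three chunks starting at positions 0, 800, and 1,600, each sharing 200 characters with its neighbor.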

Stage 2: Embedding Generation with Transformers.js

Here's where things get interesting! I need to convert each text chunk into a 384-dimensional vector that captures its semantic meaning. Traditional approaches send text to an API like OpenAI's embeddings endpoint, but I wanted to keep everything local.

Enter Transformers.js. It's a JavaScript port of Hugging Face's Transformers library that runs models directly in the browser using WebAssembly and WebGL.

TYPESCRIPT
import { pipeline, env } from '@xenova/transformers';

// Configure for browser usage
env.allowLocalModels = false;
env.useBrowserCache = true;
env.cacheDir = '/cache/transformers.js';

class EmbeddingService {
  private embeddingPipeline: any = null;

  async loadModel(): Promise<void> {
    if (this.embeddingPipeline) return;  // Load once, then reuse
    this.embeddingPipeline = await pipeline(
      'feature-extraction',
      'Xenova/all-MiniLM-L6-v2',
      {
        revision: 'main',
        quantized: false
      }
    );
  }

  async generateEmbedding(text: string): Promise<number[]> {
    await this.loadModel();

    const result = await this.embeddingPipeline(text, {
      pooling: 'mean',    // Average all token embeddings
      normalize: true     // L2 normalize for cosine similarity
    });

    return Array.from(result.data);
  }
}

I chose all-MiniLM-L6-v2 for embeddings because it's small (about 23MB), fast, and produces high-quality semantic representations. The mean pooling strategy averages all token embeddings, and normalization ensures vectors have unit length. (This is essential for cosine similarity comparisons.)
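Since the vectors come back L2-normalized, cosine similarity collapses to a plain dot product: both magnitudes in the denominator are 1. This standalone snippet (not RAGTime code) shows why that works:

```typescript
// For unit-length vectors, cosine similarity equals the dot product.
const normalize = (v: number[]): number[] => {
  const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map(x => x / mag);
};

const dot = (a: number[], b: number[]): number =>
  a.reduce((s, x, i) => s + x * b[i], 0);

const a = normalize([1, 2, 3]);
const b = normalize([2, 4, 6]);  // Same direction as a
// dot(a, b) ≈ 1 (identical direction); orthogonal vectors score ≈ 0
```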

Web Worker Optimization

Embedding generation is CPU-intensive, and processing hundreds of chunks on the main thread would freeze the UI. So I offload this work to a Web Worker:

TYPESCRIPT
const generateEmbeddings = async (
  chunks: DocumentChunk[],
  updateProgress: (progress: number, message: string) => void,
  abortSignal?: AbortSignal
): Promise<DocumentChunk[]> => {
  try {
    // Try Web Worker for better performance
    const results = await embeddingWorkerService.processChunks(
      chunks,
      (current, total, message) => {
        updateProgress(70 + Math.round((current / total) * 30), message);
      },
      abortSignal
    );
    return results;
  } catch (workerError) {
    // Fallback to main thread if worker fails
    console.warn('Worker failed, falling back to main thread');
    // ... sequential processing
  }
};

The abort signal enables graceful cancellation when users close the tab or cancel operations mid-processing.
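The sequential fallback can honor the same signal by checking it between items, so cancellation takes effect at the next chunk boundary. A hedged sketch (names are mine, not the actual fallback code):

```typescript
// Illustrative abort-aware sequential processing: check the signal
// before each item so a cancel lands at the next chunk boundary.
const processSequentially = async <T, R>(
  items: T[],
  processOne: (item: T) => Promise<R>,
  abortSignal?: AbortSignal
): Promise<R[]> => {
  const results: R[] = [];
  for (const item of items) {
    if (abortSignal?.aborted) {
      throw new DOMException('Processing cancelled', 'AbortError');
    }
    results.push(await processOne(item));
  }
  return results;
};
```

A cancel button handler just calls `abort()` on the AbortController whose signal was passed in.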

Stage 3: Vector Storage with IndexedDB

With embeddings generated, I need persistent storage. LocalStorage has a 5MB limit and can't store binary data efficiently. IndexedDB, however, can handle large objects and supports structured data.

TYPESCRIPT
const DB_NAME = 'the_big_idea_db';
const STORES = {
  DOCUMENTS: 'documents',         // PDFs and metadata
  DOCUMENT_CHUNKS: 'document_chunks',  // Text chunks with embeddings
  CHATS: 'chats',                 // Conversation history
  MODELS: 'models'                // Downloaded model metadata
};

// Document chunk structure:
interface DocumentChunk {
  id: string;
  documentId: string;
  text: string;
  pageNumber: number;
  startPosition: number;
  endPosition: number;
  embedding: number[];  // 384-dimensional vector
}

export const saveDocumentChunks = async (
  documentId: string,
  chunks: DocumentChunk[]
): Promise<DocumentChunk[]> => {
  const db = await getDB();
  const transaction = db.transaction(STORES.DOCUMENT_CHUNKS, 'readwrite');
  const store = transaction.objectStore(STORES.DOCUMENT_CHUNKS);

  // Clear existing chunks for this document, and wait for the cursor
  // to finish before writing (cursor callbacks are asynchronous)
  await new Promise<void>((resolve, reject) => {
    const index = store.index('documentId');
    const cursorRequest = index.openCursor(IDBKeyRange.only(documentId));
    cursorRequest.onsuccess = (event) => {
      const cursor = (event.target as IDBRequest<IDBCursorWithValue | null>).result;
      if (cursor) {
        cursor.delete();
        cursor.continue();
      } else {
        resolve();  // No more matching records
      }
    };
    cursorRequest.onerror = () => reject(cursorRequest.error);
  });

  // Save new chunks with embeddings
  return Promise.all(chunks.map(chunk =>
    saveToStore(STORES.DOCUMENT_CHUNKS, chunk)
  ));
};

This gives me about 50% of device storage (typically several GB) for documents and embeddings. Far more than localStorage's 5MB limit!

Stage 4: Hybrid Similarity Search

When a user asks a question, I generate an embedding for their query and find the most similar chunks. My search combines semantic vectors with keyword matching:

TYPESCRIPT
const findSimilarChunks = async (
  query: string,
  documentIds: string[],
  maxResults = 15
) => {
  // Preprocess query
  const processedQuery = query.toLowerCase().trim();

  // Generate query embedding
  const queryEmbedding = await embeddingService.generateEmbedding(processedQuery);

  // Get all chunks from the relevant documents
  // (relevantDocuments: document records matching documentIds, resolved earlier)
  const allChunks: DocumentChunk[] = [];
  for (const doc of relevantDocuments) {
    if (doc.chunks.length === 0) {
      // Load from IndexedDB if not in memory
      const loadedChunks = await IndexedDBStorage.getDocumentChunks(doc.id);
      allChunks.push(...loadedChunks);
    } else {
      allChunks.push(...doc.chunks);
    }
  }

  // Calculate cosine similarity for each chunk
  const chunksWithScores = allChunks
    .filter(chunk => chunk.embedding?.length > 0)
    .map(chunk => {
      // Cosine similarity calculation
      let dotProduct = 0;
      let queryMagnitude = 0;
      let embeddingMagnitude = 0;

      for (let i = 0; i < queryEmbedding.length; i++) {
        dotProduct += queryEmbedding[i] * chunk.embedding[i];
        queryMagnitude += queryEmbedding[i] ** 2;
        embeddingMagnitude += chunk.embedding[i] ** 2;
      }

      const similarity = dotProduct /
        (Math.sqrt(queryMagnitude) * Math.sqrt(embeddingMagnitude));

      // Keyword matching as secondary signal (guard against empty word list)
      const queryWords = processedQuery.split(/\s+/).filter(w => w.length > 3);
      const textRelevance = queryWords.length === 0 ? 0 :
        queryWords.filter(w =>
          chunk.text.toLowerCase().includes(w)
        ).length / queryWords.length;

      // 80% semantic, 20% keyword
      return {
        chunk,
        score: similarity,
        textRelevance,
        combinedScore: (similarity * 0.8) + (textRelevance * 0.2)
      };
    });

  // Filter by threshold and return top results
  const SIMILARITY_THRESHOLD = 0.5;
  return chunksWithScores
    .filter(item => item.combinedScore > SIMILARITY_THRESHOLD)
    .sort((a, b) => b.combinedScore - a.combinedScore)
    .slice(0, maxResults);
};

The hybrid approach handles edge cases where the embedding model might miss exact keyword matches that are actually relevant (like proper nouns or technical terms).

Stage 5: Local LLM Inference

The final piece is running an actual language model in the browser. I built a Global Model Manager (GMM) that abstracts over two inference backends:

Backend          Technology    Best For
WebLLM           WebGPU        Chrome with modern GPU
Transformers.js  WebGL2/CPU    Safari, older browsers

TYPESCRIPT
export class GlobalModelManager extends EventEmitter {
  private capabilities: BrowserCapabilities | null = null;
  private webllmInstance: any = null;
  private transformersInstance: any = null;

  static getStackRecommendation(caps: BrowserCapabilities) {
    // Chrome with WebGPU is ideal for WebLLM
    if (caps.hasWebGPU && caps.isChrome && !caps.isMobile) {
      return {
        recommended: 'webllm',
        reason: 'WebGPU available in Chrome desktop',
        confidence: 90
      };
    }
    // Safari needs Transformers.js for compatibility
    if (caps.isSafari) {
      return {
        recommended: 'transformers',
        reason: 'Safari compatibility via Transformers.js',
        confidence: 85
      };
    }
    // Default fallback
    return {
      recommended: 'transformers',
      reason: 'Better cross-browser support',
      confidence: 80
    };
  }
}

Current Model Catalog (Needs Update)

Note: The model offerings are actively being updated. The current catalog includes:

Model          Size   Download  Stack            Notes
Qwen3 0.6B     0.6B   ~400MB    WebLLM           Fast, good for mobile
Llama 3.2 1B   1B     ~800MB    WebLLM           Balanced performance
Phi-4 Mini     3.8B   ~1.2GB    WebLLM           Strong reasoning
Llama 3.2 3B   3B     ~2GB      WebLLM           High quality
Llama 3.1 8B   8B     ~4.5GB    WebLLM           Best quality (needs good GPU)
SmolLM2 135M   135M   ~150MB    Transformers.js  Safari-compatible, ultra-fast
SmolLM2 360M   360M   ~250MB    Transformers.js  Safari-compatible
SmolLM2 1.7B   1.7B   ~1.2GB    Transformers.js  Good quality on Safari

I'm working on adding newer models, better quantization options, and improved Safari support. The GMM architecture makes adding models straightforward. Just add entries with the appropriate mlcId (WebLLM) or huggingFaceId (Transformers.js).

The RAG Prompt

When a user asks a question, I build a context-aware prompt:

TYPESCRIPT
const systemPrompt = `You are an expert document analyst assistant.

INSTRUCTIONS:
1. Provide DETAILED and THOROUGH answers using the document context below
2. Include specific facts, figures, examples, and quotes when available
3. Structure longer answers with clear sections or bullet points
4. If the context doesn't contain enough information, clearly state what is and isn't covered
5. Never fabricate information - only use what's in the provided context

DOCUMENT CONTEXT:
${context}

Based on the above context, provide a comprehensive answer to the user's question.`;

The explicit instruction to only use provided context is crucial for factual accuracy. It prevents the model from hallucinating information not in the document.
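The ${context} string itself can be assembled from the retrieved chunks, labeling each with its page so the model can cite sources. A hypothetical helper (naming and format are mine, not necessarily RAGTime's):

```typescript
interface RetrievedChunk {
  text: string;
  pageNumber: number;
  combinedScore: number;
}

// Hypothetical sketch: join retrieved chunks into the prompt context,
// best match first, each tagged with its page for citation tracking.
const buildContext = (chunks: RetrievedChunk[]): string =>
  [...chunks]  // Copy before sorting to avoid mutating the caller's array
    .sort((a, b) => b.combinedScore - a.combinedScore)
    .map(c => `[Page ${c.pageNumber}] ${c.text}`)
    .join('\n\n');
```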

Streaming Responses

Nobody wants to wait 30 seconds staring at a blank screen! I implemented true streaming for WebLLM:

TYPESCRIPT
await modelManager.generateCompletionStream(
  content,
  systemPrompt,
  chatHistory,
  (chunk) => {
    fullResponse += chunk;

    // Update the message in React state
    setChats(prev => prev.map(c => {
      if (c.id === chatId) {
        return {
          ...c,
          messages: c.messages.map(msg => {
            if (msg === c.messages[c.messages.length - 1]) {
              return { ...msg, content: fullResponse };
            }
            return msg;
          })
        };
      }
      return c;
    }));
  }
);

WebLLM supports native streaming via async iterators. Transformers.js simulates it by splitting the complete response word-by-word with delays.
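The simulated path might look something like this (a sketch under my own naming, not the actual Transformers.js integration):

```typescript
// Sketch: fake token streaming by emitting a finished response
// word-by-word with a small delay, mimicking WebLLM's native stream.
const simulateStream = async (
  fullText: string,
  onChunk: (chunk: string) => void,
  delayMs = 30
): Promise<void> => {
  // Capturing groups keep the whitespace so the text reassembles exactly
  const words = fullText.split(/(\s+)/);
  for (const word of words) {
    if (word.length === 0) continue;
    onChunk(word);
    await new Promise(r => setTimeout(r, delayMs));
  }
};
```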

Progressive Web App Integration

RAGTime is part of a larger PWA, so I needed special handling for offline support. AI model files can be huge. The 8B model is 4.5GB! My service worker uses aggressive caching:

TYPESCRIPT
// From vite.config.ts
runtimeCaching: [
  {
    urlPattern: /^https:\/\/huggingface\.co\/.*\.(safetensors|json|wasm|bin)$/,
    handler: 'CacheFirst',
    options: {
      cacheName: 'model-cache',
      expiration: {
        maxEntries: 50,
        maxAgeSeconds: 30 * 24 * 60 * 60 // 30 days
      },
      cacheableResponse: { statuses: [0, 200] }
    }
  }
]

Once a model is downloaded, it's cached locally. Subsequent loads are instant, and the app works completely offline.

Lessons Learned

Building RAGTime taught me several things:

Browser APIs are powerful enough. Between IndexedDB, WebGPU, Web Workers, and WebAssembly, modern browsers can run surprisingly sophisticated AI workloads.

Memory management matters. I hit issues with detached ArrayBuffers when passing PDF data between contexts. The solution was defensive copying:

TYPESCRIPT
const pdfDataCopy = new Uint8Array(pdfData.slice(0)).buffer;

Graceful degradation is essential. Not everyone has a WebGPU-capable browser. The fallback to Transformers.js ensures RAGTime works everywhere (even if it's slower).

Progress feedback is crucial. Processing a large PDF with hundreds of chunks takes time. Without progress indicators, users would think the app crashed.


Roadmap: Areas for Improvement in The Big Idea

RAGTime is functional, but there's significant room to grow. Here's my improvement roadmap:

1. Smarter Chunking Strategies

My current chunking is character-based with fixed sizes. Planned improvements:

  • Semantic chunking: Split at natural boundaries (paragraphs/sections/sentences)
  • Recursive chunking: Hierarchical chunks for multi-level retrieval
  • Metadata extraction: Detect headers, tables, lists, and structured content
  • Overlap optimization: Dynamic overlap based on content type
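As a taste of where semantic chunking could go, here's a rough sketch that splits on paragraph boundaries and greedily packs paragraphs up to a size budget (purely illustrative, not shipped code):

```typescript
// Illustrative paragraph-boundary chunking: split on blank lines, then
// pack whole paragraphs into chunks that stay under the size budget.
const semanticChunk = (text: string, maxChars = 1000): string[] => {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = '';
  for (const p of paragraphs) {
    // +2 accounts for the blank line re-inserted between paragraphs
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? `${current}\n\n${p}` : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
};
```

Unlike the character-based approach, no paragraph is ever cut mid-sentence; the tradeoff is variable chunk sizes.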

2. Enhanced Hybrid Search

I already blend vector similarity with keyword matching, but I can go further:

  • BM25 integration: Full-text search for exact keyword matching
  • Re-ranking: Use a cross-encoder model to reorder retrieved chunks
  • Query expansion: Generate related queries for better recall
  • Contextual reranking: Consider document structure in ranking

3. Multi-Document Reasoning

Currently, RAGTime handles one document at a time well. Future work:

  • Cross-document search: Query across all uploaded documents simultaneously
  • Citation tracking: Show which document each answer comes from
  • Comparison mode: "Compare what Document A and B say about X"
  • Document collections: Group related documents for scoped search

4. Model Management Improvements

The Global Model Manager has room to grow:

  • Memory monitoring: Real-time GPU/CPU memory usage tracking
  • Model switching: Hot-swap models without full page reload
  • Quantization picker: Let users choose quality vs. speed tradeoffs
  • Model recommendations: Suggest models based on device capabilities and document size
  • Download resume: Pause and resume large model downloads
  • Model pruning: Automatically clean up unused models to free storage

5. Performance Optimizations

Browser-based AI is compute-constrained. I'm exploring:

  • WebGPU shaders: Custom kernels for similarity search
  • Streaming embeddings: Generate embeddings as PDFs parse (pipeline)
  • Lazy chunk loading: Only load chunks into memory when needed
  • Embedding compression: Quantize embeddings for 50%+ storage savings
  • Batch inference: Process multiple queries efficiently

6. User Experience Polish

  • Cancel operations: Graceful abort for uploads/embedding/inference
  • Export/import: Backup and restore document libraries
  • Collaboration: Optional P2P sync between devices (using existing Talkie infrastructure)
  • Document previews: Show PDF preview alongside chat
  • Citation highlighting: Click a citation to jump to that page in the PDF
  • Conversation branching: Fork conversations to explore different questions

7. Advanced RAG Techniques

  • Agentic RAG: Let the model decide when it needs more context
  • Iterative retrieval: Multiple rounds of search for complex questions
  • Self-consistency: Generate multiple answers and synthesize
  • Fact verification: Cross-check answers against retrieved chunks

Try It Yourself

RAGTime proves that privacy-first AI isn't just possible. It's practical! The entire codebase uses React, TypeScript, and standard web APIs. No servers, no API keys, no data leaving your device.

Check out The Big Idea to try RAGTime, or explore the source code on GitHub.

The future of AI might just be local-first.


Have questions about browser-based AI or want to discuss the implementation? Open an issue on the repository or reach out directly.