The model downloads once, then runs offline.
First time you load the tool, your browser downloads the Llama 3.2 3B model from Hugging Face. After that, the weights live in your browser cache. Disconnect from the internet and the tool still works.
Drop a PDF, ask questions, get answers with citations. The model runs in your browser. Nothing about your document touches a server.
Drop a PDF here or click below. It stays on this device, period.
PDF · multi-page · up to 200 MB in browser memory
PDFs are read with pdfjs-dist in the page. Embeddings are computed locally with all-MiniLM-L6-v2 via transformers.js. Inference runs on your GPU through WebLLM. Open DevTools and watch the Network tab: after the model loads, traffic stays at zero.
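The retrieval step in the middle of that pipeline can be sketched as plain cosine similarity over chunk embeddings. This is a minimal illustration, not the tool's actual code: the vectors below stand in for the 384-dimensional output that all-MiniLM-L6-v2 produces via transformers.js.

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosineSim(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank PDF chunks against the question embedding and return the
// indices of the k most similar chunks (the context sent to the model).
function topK(queryVec, chunkVecs, k) {
  return chunkVecs
    .map((vec, idx) => ({ idx, score: cosineSim(queryVec, vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ idx }) => idx);
}
```

Because both the embeddings and this ranking happen in the page, the question and the document text never leave the browser.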
The page's Content Security Policy allows scripts and styles only from this origin. The only external destinations permitted are the Hugging Face hosts (huggingface.co, hf.co) for the model weights and raw.githubusercontent.com for the compiled WebGPU shaders; both are used only for the one-time model download. Any other destination, now or in a future change, would require updating the CSP, and that change would be visible in this page's source.
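As an illustration only, not this page's exact policy, a header along these lines would pin the allowed destinations to the hosts named above:

```
Content-Security-Policy:
  default-src 'none';
  script-src 'self';
  style-src 'self';
  connect-src 'self' https://huggingface.co https://hf.co https://raw.githubusercontent.com;
```

With `default-src 'none'`, any request to a host not listed under `connect-src` is blocked by the browser before it leaves the page.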
One index.html plus the vendored libraries (pdfjs, transformers.js, MiniLM, WebLLM). The model itself still downloads on first use unless you save it from your browser cache.
Drop the directory on any static host. The README has the nginx CSP block and the Dockerfile to run it on Cloud Run, Fly, or your own box.
Read the self-host guide →
MIT licensed. Check the CSP, the prompt template, the embedding pipeline. Submit issues for documents we read poorly.
github.com/xjmani/ask →