// internal use · adversary systems //

Home Brain

Your laptop is already the server. Offline local inference, zero cost. Setup time: ~20 minutes.

System: Offline · 72B Q8 capable · Metal accelerated
Min specs: Apple Silicon Mac · 16GB unified memory (runs 7B models). 64GB+ recommended for 70B+ class models. 128GB for full Q8 quality, no compromise.
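The size figures below follow a simple rule: weight memory ≈ parameter count × bytes per weight (about 1.0 at Q8, about 0.56 at Q4_K_M), plus a few GB for KV cache and the OS, which is why 72B at Q8 lands near 75GB. A quick sketch:

```shell
# weight-memory rule of thumb: params in billions × bytes per weight
# (~1.0 byte/weight at Q8, ~0.56 byte/weight at Q4_K_M)
python3 -c 'print(f"{72 * 1.0:.0f}GB at Q8, {72 * 0.56:.0f}GB at Q4_K_M")'
# → 72GB at Q8, 40GB at Q4_K_M
```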
01
Ollama — The Engine
Runs models locally. Metal acceleration on Apple Silicon is automatic. Everything depends on this.
Install via Homebrew
Required
Terminal
brew install ollama
OLLAMA_FLASH_ATTENTION="1" OLLAMA_KV_CACHE_TYPE="q8_0" ollama serve
Metal GPU acceleration is automatic on Apple Silicon. Flash attention and Q8 KV cache cut memory pressure during long context — set these every time you start the server.
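To avoid typing the prefix on every launch, the same tuning can live in your shell profile. A sketch for zsh (the macOS default); after this, a plain `ollama serve` inherits both settings:

```shell
# add to ~/.zshrc: every future `ollama serve` inherits these
export OLLAMA_FLASH_ATTENTION="1"
export OLLAMA_KV_CACHE_TYPE="q8_0"
```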
Verify the server is running
Required
New terminal tab
curl http://localhost:11434
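That root endpoint just answers "Ollama is running"; the endpoint clients actually use is POST /api/generate. A minimal request sketch (the curl line is commented out because it needs the server up and the model from step 02 already pulled):

```shell
# build and locally validate an /api/generate request body
PAYLOAD='{"model":"qwen2.5:72b-instruct-q8_0","prompt":"Say hello in five words.","stream":false}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# with the server running:
# curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
```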
02
Models to Pull
64GB+ unified memory runs 72B models. 128GB runs them at full Q8 quality.
Qwen 2.5 72B Primary
~75GB · Q8 · GPT-4 class
Best general-purpose model for your setup. Reasoning, instruction following, long-form. The workhorse.
ollama pull qwen2.5:72b-instruct-q8_0
Qwen 2.5 VL 72B
~40GB · Q4 · Vision
Image and frame analysis. Runs alongside text model. Local visual intelligence, fully offline.
ollama pull qwen2.5vl:72b-q4_K_M
DeepSeek R1 70B
~40GB · Q4 · Reasoning
Strong analytical reasoning. Multi-step problems, research decomposition, structured analysis.
ollama pull deepseek-r1:70b
Phi-4 Mini
~2GB · Tiny · Fast
Classification and pre-filtering. First-pass triage. Zero latency — runs before the big model sees anything.
ollama pull phi4-mini
Memory: Qwen 72B Q8 uses ~75GB. Don't run both 72B models simultaneously — swap as needed.
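One way to make the swap deterministic: Ollama's generate endpoint accepts a keep_alive field, and 0 tells the server to unload the model as soon as the request finishes. A request-body sketch for POST /api/generate (`ollama ps` shows what is currently resident; recent releases also have `ollama stop <model>` for the same unload from the CLI):

```json
{
  "model": "qwen2.5:72b-instruct-q8_0",
  "prompt": "",
  "keep_alive": 0
}
```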
Context window
Optional
Ollama 0.20+ auto-calculates context from available VRAM — on 96GB you get ~262K by default. No Modelfile needed. If you want to pin a specific context size (e.g. for a lighter footprint), create a custom model:
Terminal
echo "FROM qwen2.5:72b-instruct-q8_0
PARAMETER num_ctx 65536" > Modelfile
ollama create qwen-brain -f Modelfile
ollama run qwen-brain
03
Open WebUI
Local interface with RAG, document upload, knowledge bases. Fully offline at localhost:8080.
Install and launch
Required
Terminal
DATA_DIR=~/.open-webui uvx --python 3.11 open-webui@latest serve
uvx fetches and runs the package in an isolated Python 3.11 environment, so no separate pip install is needed (brew install uv first if you don't have it).
First launch prompts you to create an admin account. Auto-detects Ollama at localhost:11434.
Configure document RAG
Optional
Settings → Documents → set extraction engine to Docling (multi-column PDF aware), chunk size to 1500. Create a Knowledge Base in the sidebar. Link it to a conversation — model retrieves only relevant chunks, not the whole document.
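For intuition on that 1500 figure: at the usual rough ratio of ~4 characters per token, one chunk is about 6000 characters, so a document's chunk count is easy to estimate. A sketch (report.txt is a stand-in file):

```shell
# estimate how many ~1500-token chunks a file becomes, assuming ~4 chars/token
doc="report.txt"
printf 'lorem %.0s' $(seq 1 2000) > "$doc"    # 12,000-char stand-in document
chars=$(wc -c < "$doc")
echo "$(( (chars / 4 + 1499) / 1500 )) chunks"   # ceiling division → 2 chunks
```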
04
Gemma 3
Google's open model. Strong structured output. Good second opinion on anything.
Pull Gemma 3
Free
Terminal
ollama pull gemma3:27b
ollama pull gemma3:12b
ollama run gemma3:27b "What can you do?"
Listed as gemma3 in Ollama's registry. Use Qwen for general reasoning, Gemma when you want a different perspective or need structured output.
05
Gemini Flash — Video Analysis
Only tool that processes video natively including audio. YouTube URLs work directly.
Get a Gemini API key
~Free
Create a key at aistudio.google.com. Gemini Flash is ~$0.10/M tokens. At normal volumes this rounds to zero.
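The "rounds to zero" claim survives arithmetic. A hypothetical heavy month (100 videos, ~50K tokens each) at ~$0.10 per million tokens:

```shell
# cost sketch: 100 videos × 50K tokens × $0.10 per 1M tokens
python3 -c 'print(f"${100 * 50_000 * 0.10 / 1_000_000:.2f} for the month")'
# → $0.50 for the month
```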
Local video pipeline — fully offline
Optional
Private footage that doesn't leave your machine — FFmpeg frames + Whisper transcription + Qwen VL, all local.
Terminal
brew install ffmpeg whisper-cpp
mkdir -p frames
ffmpeg -i video.mp4 -vf fps=1 frames/frame_%04d.jpg
ffmpeg -i video.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
whisper-cpp -m ggml-base.en.bin -f audio.wav -otxt -of transcript
whisper-cpp reads 16kHz mono WAV, not MP4, and needs a ggml model file (download one, e.g. ggml-base.en.bin, from the whisper.cpp repo). -otxt -of transcript writes transcript.txt.
06
NotebookLM
Corpus querying. "What did we decide about X across 20 sessions." Not a Claude replacement.
The Cowork → NotebookLM pipeline
Recommended
NotebookLM treats every source equally. Pre-process with Cowork first — extract key claims, strip filler, structure the output — then import clean summaries as sources. Better in, better out.
Primary use case: Fathom call transcripts + session logs → Cowork cleans and structures → NotebookLM sources → query across your entire project history.
Access NotebookLM
Free
Google account required. Accepts PDFs, Docs, text, URLs, YouTube links. Audio overview feature is worth using on long documents.
07
The Workflow
Claude's context is for reasoning. Not ingestion. Route everything else locally first.
// document & research processing
Raw material (PDFs · transcripts) → Qwen 72B (local · offline) → Clean brief (structured output) → Claude (reasoning only)
// video analysis
Video / URL (any source) → Gemini Flash (audio + visual) → Structured text (timestamps) → Claude (synthesis)
// corpus research
Raw sources (sessions · docs) → Cowork (extract + structure) → NotebookLM (query corpus) → Claude (action)
Principle: Every raw document that hits Claude directly is wasted context. Pre-filter, pre-process, pre-structure. Bring Claude a brief — not a dump.
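What a brief can look like in practice; one possible shape, not a prescribed format:

```
// brief template (adapt freely)
Topic:          one line
Sources:        what was processed locally, not attached
Key claims:     3-7 bullets, each traceable to a source
Open questions: what the local pass could not resolve
Ask:            the single decision or synthesis Claude should produce
```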
Quick Reference
Full stack at a glance.
Component | Tool | Command / URL | Cost
LLM engine | Ollama | localhost:11434 | Free
Local UI + RAG | Open WebUI | localhost:8080 | Free
Workhorse | Qwen 2.5 72B Q8 | ollama run qwen-brain | Free
Google open model | Gemma 3 | ollama run gemma3:27b | Free
Vision model | Qwen 2.5 VL 72B | ollama pull qwen2.5vl:72b-q4_K_M | Free
Video analysis | Gemini Flash | aistudio.google.com | ~$0.10/M tok
Local transcription | Whisper.cpp | brew install whisper-cpp | Free
Frame extraction | FFmpeg | brew install ffmpeg | Free
Corpus querying | NotebookLM | notebooklm.google.com | Free