// internal use · adversary systems //

Home Brain

Your laptop is already the server. Offline local inference, zero cost. Setup time: ~20 minutes.

System: Offline · 72B Q8 capable · Metal accelerated
Min specs: Apple Silicon Mac · 16GB unified memory (runs 7B models). 64GB+ recommended for 70B+ class models. 128GB for full Q8 quality, no compromise.
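The size figures below follow a simple rule: weight memory ≈ parameter count × bytes per weight (about 1.0 at Q8, about 0.56 at Q4_K_M), plus a few GB for KV cache and the OS, which is why 72B at Q8 lands near 75GB. A quick sketch:

```shell
# weight-memory rule of thumb: params in billions × bytes per weight
# (~1.0 byte/weight at Q8, ~0.56 byte/weight at Q4_K_M)
python3 -c 'print(f"{72 * 1.0:.0f}GB at Q8, {72 * 0.56:.0f}GB at Q4_K_M")'
# → 72GB at Q8, 40GB at Q4_K_M
```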
01
Ollama — The Engine
Runs models locally. Metal acceleration on Apple Silicon is automatic. Everything depends on this.
Install via Homebrew
Required
Terminal
brew install ollama
OLLAMA_FLASH_ATTENTION="1" OLLAMA_KV_CACHE_TYPE="q8_0" ollama serve
Metal GPU acceleration is automatic on Apple Silicon. Flash attention and Q8 KV cache cut memory pressure during long context — set these every time you start the server.
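To avoid typing the prefix on every launch, the same tuning can live in your shell profile. A sketch for zsh (the macOS default); after this, a plain `ollama serve` inherits both settings:

```shell
# add to ~/.zshrc: every future `ollama serve` inherits these
export OLLAMA_FLASH_ATTENTION="1"
export OLLAMA_KV_CACHE_TYPE="q8_0"
```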
Verify the server is running
Required
New terminal tab
curl http://localhost:11434
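That root endpoint just answers "Ollama is running"; the endpoint clients actually use is POST /api/generate. A minimal request sketch (the curl line is commented out because it needs the server up and the model from step 02 already pulled):

```shell
# build and locally validate an /api/generate request body
PAYLOAD='{"model":"qwen2.5:72b-instruct-q8_0","prompt":"Say hello in five words.","stream":false}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# with the server running:
# curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
```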
02
Models to Pull
64GB+ unified memory runs 72B models. 128GB runs them at full Q8 quality.
Qwen 2.5 72B Primary
~75GB · Q8 · GPT-4 class
Best general-purpose model for your setup. Reasoning, instruction following, long-form. The workhorse.
ollama pull qwen2.5:72b-instruct-q8_0
Qwen 2.5 VL 72B
~40GB · Q4 · Vision
Image and frame analysis. Runs alongside text model. Local visual intelligence, fully offline.
ollama pull qwen2.5vl:72b-q4_K_M
DeepSeek R1 70B
~40GB · Q4 · Reasoning
Strong analytical reasoning. Multi-step problems, research decomposition, structured analysis.
ollama pull deepseek-r1:70b
Phi-4 Mini
~2GB · Tiny · Fast
Classification and pre-filtering. First-pass triage. Zero latency — runs before the big model sees anything.
ollama pull phi4-mini
Memory: Qwen 72B Q8 uses ~75GB. Don't run both 72B models simultaneously — swap as needed.
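One way to make the swap deterministic: Ollama's generate endpoint accepts a keep_alive field, and 0 tells the server to unload the model as soon as the request finishes. A request-body sketch for POST /api/generate (`ollama ps` shows what is currently resident; recent releases also have `ollama stop <model>` for the same unload from the CLI):

```json
{
  "model": "qwen2.5:72b-instruct-q8_0",
  "prompt": "",
  "keep_alive": 0
}
```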
Context window
Optional
Ollama 0.20+ auto-calculates context from available VRAM — on 96GB you get ~262K by default. No Modelfile needed. If you want to pin a specific context size (e.g. for a lighter footprint), create a custom model:
Terminal
echo "FROM qwen2.5:72b-instruct-q8_0
PARAMETER num_ctx 65536" > Modelfile
ollama create qwen-brain -f Modelfile
ollama run qwen-brain
03
Open WebUI
Local interface with RAG, document upload, knowledge bases. Fully offline at localhost:8080.
Install and launch
Required
Terminal
DATA_DIR=~/.open-webui uvx --python 3.11 open-webui@latest serve
uvx fetches and runs the package in an isolated Python 3.11 environment, so no separate pip install is needed (brew install uv first if you don't have it).
First launch prompts you to create an admin account. Auto-detects Ollama at localhost:11434.
Configure document RAG
Optional
Settings → Documents → set extraction engine to Docling (multi-column PDF aware), chunk size to 1500. Create a Knowledge Base in the sidebar. Link it to a conversation — model retrieves only relevant chunks, not the whole document.
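For intuition on that 1500 figure: at the usual rough ratio of ~4 characters per token, one chunk is about 6000 characters, so a document's chunk count is easy to estimate. A sketch (report.txt is a stand-in file):

```shell
# estimate how many ~1500-token chunks a file becomes, assuming ~4 chars/token
doc="report.txt"
printf 'lorem %.0s' $(seq 1 2000) > "$doc"    # 12,000-char stand-in document
chars=$(wc -c < "$doc")
echo "$(( (chars / 4 + 1499) / 1500 )) chunks"   # ceiling division → 2 chunks
```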
04
Gemma 3
Google's open model. Strong structured output. Good second opinion on anything.
Pull Gemma 3
Free
Terminal
ollama pull gemma3:27b
ollama pull gemma3:12b
ollama run gemma3:27b "What can you do?"
Listed as gemma3 in Ollama's registry. Use Qwen for general reasoning, Gemma when you want a different perspective or need structured output.
05
Gemini Flash — Video Analysis
Only tool that processes video natively including audio. YouTube URLs work directly.
Get a Gemini API key
~Free
Create a key at aistudio.google.com. Gemini Flash is ~$0.10/M tokens. At normal volumes this rounds to zero.
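The "rounds to zero" claim survives arithmetic. A hypothetical heavy month (100 videos, ~50K tokens each) at ~$0.10 per million tokens:

```shell
# cost sketch: 100 videos × 50K tokens × $0.10 per 1M tokens
python3 -c 'print(f"${100 * 50_000 * 0.10 / 1_000_000:.2f} for the month")'
# → $0.50 for the month
```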
Local video pipeline — fully offline
Optional
Private footage that doesn't leave your machine — FFmpeg frames + Whisper transcription + Qwen VL, all local.
Terminal
brew install ffmpeg whisper-cpp
mkdir -p frames
ffmpeg -i video.mp4 -vf fps=1 frames/frame_%04d.jpg
ffmpeg -i video.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
whisper-cpp -m ggml-base.en.bin -f audio.wav -otxt -of transcript
whisper-cpp reads 16kHz mono WAV, not MP4, and needs a ggml model file (download one, e.g. ggml-base.en.bin, from the whisper.cpp repo). -otxt -of transcript writes transcript.txt.
06
NotebookLM
Corpus querying. "What did we decide about X across 20 sessions." Not a Claude replacement.
The Cowork → NotebookLM pipeline
Recommended
NotebookLM treats every source equally. Pre-process with Cowork first — extract key claims, strip filler, structure the output — then import clean summaries as sources. Better in, better out.
Primary use case: Fathom call transcripts + session logs → Cowork cleans and structures → NotebookLM sources → query across your entire project history.
Access NotebookLM
Free
Google account required. Accepts PDFs, Docs, text, URLs, YouTube links. Audio overview feature is worth using on long documents.
07
The Workflow
Claude's context is for reasoning. Not ingestion. Route everything else locally first.
// document & research processing
Raw material (PDFs · transcripts) → Qwen 72B (local · offline) → Clean brief (structured output) → Claude (reasoning only)
// video analysis
Video / URL (any source) → Gemini Flash (audio + visual) → Structured text (timestamps) → Claude (synthesis)
// corpus research
Raw sources (sessions · docs) → Cowork (extract + structure) → NotebookLM (query corpus) → Claude (action)
Principle: Every raw document that hits Claude directly is wasted context. Pre-filter, pre-process, pre-structure. Bring Claude a brief — not a dump.
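What a brief can look like in practice; one possible shape, not a prescribed format:

```
// brief template (adapt freely)
Topic:          one line
Sources:        what was processed locally, not attached
Key claims:     3-7 bullets, each traceable to a source
Open questions: what the local pass could not resolve
Ask:            the single decision or synthesis Claude should produce
```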
Quick Reference
Full stack at a glance.
Component | Tool | Command / URL | Cost
LLM engine | Ollama | localhost:11434 | Free
Local UI + RAG | Open WebUI | localhost:8080 | Free
Workhorse | Qwen 2.5 72B Q8 | ollama run qwen-brain | Free
Google open model | Gemma 3 | ollama run gemma3:27b | Free
Vision model | Qwen 2.5 VL 72B | ollama pull qwen2.5vl:72b-q4_K_M | Free
Video analysis | Gemini Flash | aistudio.google.com | ~$0.10/M tok
Local transcription | Whisper.cpp | brew install whisper-cpp | Free
Frame extraction | FFmpeg | brew install ffmpeg | Free
Corpus querying | NotebookLM | notebooklm.google.com | Free