How to explain LLMs, tools, MCP, skills, and agents without pretending any of this is magic.
Think of modern AI as a very smart intern: fast, capable, and surprisingly useful, but only if you give it the right tools, clear instructions, and a system to work inside. This guide walks through that story one layer at a time, from raw LLMs to full agent workflows.
Grab your AI tool of choice. Claude, ChatGPT, Copilot, whatever you have. Open a fresh conversation and leave it running. You'll come back to it at every step.
LLM Basics
Start with the core picture: you hired the smartest person in the world, then shut them in a room with a desk and a mail slot. They can only work with what fits on the desk, and they can only respond with what they send back through the slot. That is the basic LLM model.
The intern has one desk. Everything they need must fit on this desk at once:
| Model | "Desk Size" | Real-world equivalent |
|---|---|---|
| Claude Opus 4.6 | 200K tokens (1M beta) | ~500–3,000 pages |
| GPT-5.4 | 1M tokens | ~3,000 pages |
Once the desk is full, something has to be removed. Older notes get pushed off. The intern seems to "forget" earlier conversations. The desk is physically full.
The context window is working memory, like RAM. Not storage. Every API call, the desk gets cleared and rebuilt from scratch.
Not an arbitrary marketing number. The model uses self-attention: every token computes a relationship score against every other token. Memory and compute grow quadratically: double the context → 4x the cost. A 200K context = 40 billion pairwise relationships per layer, and models have 80-120 layers. Context windows don't grow to infinity because each doubling is a massive engineering investment (FlashAttention, ring attention, sliding window) to make the quadratic cost survivable.
📖 Go deeper: Attention Is All You Need (original paper, 2017) — FlashAttention-3 (current state-of-the-art, 2024) — Transformer Explainer (interactive visual)
The intern doesn't read whole words. They process token-fragments, like Scrabble tiles.
Why this matters:
Before text reaches the neural network, a tokenizer (Byte Pair Encoding) splits it into subword chunks from a fixed vocabulary (~100K entries). "Hello world" → [15496, 995], two integer IDs. Each ID maps to a row in a learned embedding matrix, a point in 8,192-dimensional space. The model never sees text. It sees high-dimensional vectors where similar meanings cluster together ("Hello" is near "Hi" and "Hey").
Why non-English costs more: The vocabulary was built mostly from English. "the" = 1 token, but "Антропик" = 4-6 tokens because Cyrillic gets split into byte-level fragments. Same meaning, more tokens, higher cost. Code is expensive too: every {, }, and indentation space burns a token.
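Want to see the Scrabble tiles yourself? Here's a minimal sketch using tiktoken, OpenAI's open-source BPE tokenizer library. The encoding name is an assumption, and exact token IDs differ per model family — every model ships its own vocabulary.

```python
# Minimal tokenization demo with tiktoken (IDs and counts vary by vocabulary).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # assumed encoding; pick one for your model

for text in ["Hello world", "Антропик", "function add(a, b) { return a + b; }"]:
    ids = enc.encode(text)
    print(f"{text!r:45} -> {len(ids)} tokens: {ids}")
    # English prose is compact; Cyrillic and code burn noticeably more tokens.
```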
📖 Go deeper: OpenAI Tokenizer (try it live) — BPE algorithm explained (Hugging Face LLM course)
There's a dial on the intern's desk: temperature. At 0, the intern always picks the most likely next word; turned up, they take more chances.
Most coding tools keep the dial at 0.
The model doesn't pick one word. It produces a probability distribution over ~100K tokens. For "The capital of France is": Paris gets 92%, "the" 3%, "a" 2%, etc. Temperature divides the raw scores before converting to probabilities: softmax(logits / temperature).
There's also top-p: instead of reshaping the distribution, it cuts off the tail. top_p=0.9 = "only sample from tokens whose cumulative probability reaches 90%." Creative within plausible bounds.
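Here's a toy sketch of what the dial does to the numbers. The logit values are invented for illustration; a real model scores every entry in its ~100K-token vocabulary, and temperature 0 is implemented as plain argmax rather than a division.

```python
import numpy as np

def sample_distribution(logits, temperature=1.0, top_p=1.0):
    # Temperature divides the raw scores before softmax (T=0 is handled as argmax in practice).
    probs = np.exp(np.array(logits) / temperature)
    probs /= probs.sum()
    # Top-p: keep only the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

# Toy logits for ["Paris", "the", "a"] completing "The capital of France is"
logits = [5.0, 1.6, 1.2]
print(sample_distribution(logits, temperature=1.0))  # peaked at "Paris"
print(sample_distribution(logits, temperature=2.0))  # flatter — more adventurous
print(sample_distribution(logits, temperature=0.2))  # almost deterministic
```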
📖 Go deeper: Generation Strategies (Hugging Face docs, kept current)
Every time you slide a new note in, the intern has amnesia. They don't remember the last conversation. If you want them to remember what you discussed before, you reprint the entire previous conversation and slide it in again.
Chat UIs send the full conversation history with every message. Long conversations get expensive (more tokens each turn). The desk fills up and the oldest messages get dropped. The model seems to "forget" things from early in the conversation.
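A minimal sketch of why long chats get expensive: the client keeps the whole transcript and resends it on every turn. Shown with the Anthropic Python SDK; the model name is a placeholder, not a real identifier.

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
history = []                     # the desk is rebuilt from this list on every call

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-opus-4-6",   # placeholder model name
        max_tokens=1024,
        messages=history,          # the ENTIRE conversation goes out again
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Summarize our deployment process.")
ask("Now turn that into a checklist.")   # only works because turn 1 was resent
```

Every turn pays for every earlier turn again; that's the amnesia tax.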
The intern doesn't read left-to-right. They look at ALL words simultaneously and figure out which ones relate to which.
"The server crashed because it ran out of memory." The intern understands that "it" refers to "server" (not "memory"), and makes this connection across the entire desk, from page 1 to page 500.
This is the foundation of prompt engineering: put the right information on the desk → the intern attends to it → better output.
Everything comes from one paper: "Attention Is All You Need" (Google, 2017). The architecture:
Token IDs → Embeddings (8192-dim vectors) → [Attention + Feed-Forward] × 100 layers → Probabilities
Self-Attention is the core mechanism. For each token, the model computes three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), and Value ("what information do I carry?"). The dot product Q × K gives an attention score between every pair of tokens — "how relevant is token A to token B?" These scores become weights for combining Values into a new representation.
This happens 128 times in parallel per layer ("multi-head attention"), each head learning different relationship types — one learns syntax, another learns coreference, another learns code bracket matching. A 100-layer model with 128 heads = 12,800 different relationship analyses for every token. That's why LLMs "get" code structure and conversational context.
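A stripped-down, single-head sketch of scaled dot-product attention in NumPy. The real mechanism adds learned projection matrices, 128 heads, masking, and positional information — this just shows the Q·K scoring and the weighted blend of Values.

```python
import numpy as np

def attention(Q, K, V):
    # Relevance score between every pair of tokens ("how much does A matter for B?")
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Softmax turns scores into weights that sum to 1 for each token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's new representation is a weighted blend of all Values
    return weights @ V, weights

tokens, d_model = 6, 8               # e.g. "The server crashed because it ran"
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(tokens, d_model)) for _ in range(3))

out, weights = attention(Q, K, V)
print(weights.round(2))              # row i: how token i spreads attention over the others
```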
Scale: Claude Opus 4.6 has ~200B+ parameters, GPT-5.4 ~300B+ (with MoE routing). Each parameter is a learned number. There's no database inside — just billions of numbers that collectively encode language, code, and reasoning.
📖 Go deeper: The Illustrated Transformer (Jay Alammar, updated with modern additions) — 3Blue1Brown: Attention in transformers (video, 2024) — Transformer Explainer (interactive, runs in browser)
The instruction card pinned to the wall. Always visible. Shapes everything the intern does.
"You are a senior TypeScript developer. Follow SOLID principles. Never use any type. Respond in Ukrainian when asked about documentation."
The intern reads this before every task. The same model can be a code reviewer, a writing assistant, and a customer support agent. Different wall cards, same brain.
Before LLMs, we didn't have an intern. We had a vending machine. Press button A3 → always get the same candy bar. getUser(42) → always { name: "Alex" }. Deterministic. Predictable. Testable.
Now we have the intern. Slide in the same question twice, get two different answers. Both correct, just phrased differently.
We went from programming vending machines to writing instructions for a brilliant but unpredictable intern. Different engineering practices apply.
Three stages, each giving the model different capabilities:
1. Pre-training — Read the internet. Trillions of tokens. For each token, predict the next one. Wrong? Adjust weights. Repeat billions of times on 10,000+ GPUs for months. Cost: $50M-$500M+. This is "read every book."
2. Fine-tuning — Pre-trained models complete any text, including toxic or wrong text. Fine-tuning on curated (prompt, response) pairs teaches the model to be a helpful assistant instead of a text-completion engine.
3. Alignment — Anthropic uses Constitutional AI — Claude has a published constitution (updated January 2026) that defines its values, with a 4-tier priority hierarchy: safety → ethics → compliance → helpfulness. The model generates synthetic training data from this constitution and self-critiques against it. OpenAI uses RLHF (humans rank outputs, model learns preferences) guided by their Model Spec. Both approaches teach "helpful without being harmful," but the methods keep evolving — newer techniques like RLTHF (Targeted Human Feedback) and Direct Preference Optimization (DPO) are supplementing classic RLHF.
Critical: After training, the weights are frozen. The model never learns from your conversations. When you "teach" it in a chat, you're putting info on the desk. Close the conversation → desk clears → model is exactly as before. It can't permanently learn your codebase — only temporarily, per session.
📖 Go deeper: Claude's Constitution (Anthropic, 2026) — Constitutional AI paper (foundational, 2022) — OpenAI Model Spec (current alignment spec) — OpenAI's original RLHF approach (2022)
🔬 Under the Hood: What Happens When You Press Send (Inference)
Your message → Tokenize → Assemble full context (system + history + tools + message)
→ Forward pass through all 100+ layers → First token generated (this is the initial delay)
→ Append token, run again with KV-cache → Next token → Repeat until done
Why there's a delay, then fast streaming: The first token requires processing the ENTIRE context through all layers. After that, the KV-cache stores previous computations — each subsequent token only needs a lightweight forward pass. This is why responses start slow and speed up.
Why output costs 5x more than input: Input tokens are processed in parallel (one pass for all of them). Output tokens are generated one-at-a-time (one pass PER token). That's why Claude charges $5/MTok input but $25/MTok output. Every word the model writes is a separate computation.
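Back-of-the-envelope math with the prices quoted above ($5/MTok in, $25/MTok out) — treat these as example numbers and check your provider's current price sheet.

```python
# Rough cost of one turn at the example prices above (verify current pricing).
PRICE_IN_PER_MTOK = 5.00
PRICE_OUT_PER_MTOK = 25.00

def turn_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * PRICE_IN_PER_MTOK \
         + (output_tokens / 1_000_000) * PRICE_OUT_PER_MTOK

# A code-review turn: big context in, modest review out
print(f"${turn_cost(input_tokens=9_000, output_tokens=800):.4f}")   # ≈ $0.065
```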
📖 Go deeper: LLM inference explained (Databricks) — KV-cache visualized
Paste this code into your conversation and ask "Review this for problems":
var user = await _repo.GetById(userId);
return new UserProfile {
Name = user.Name,
Email = user.Email,
LastLogin = user.Profile
.LastLoginDate
.ToString("yyyy-MM-dd")
};
There are null reference bugs hiding in there. See if the intern spots them.
Now send the exact same message a second time. The answer will be slightly different. That's temperature at work: the intern rolls dice on every word.
Tools & Function Calling
One day, a phone appears on the wall. The intern can't leave the room, but now they can call out for data, calculations, or actions.
tool_use → Host executes tool → Result slides back · Key: LLM NEVER executes. It only writes JSON. The host runs the tool. · Strict Mode forces valid JSON matching the tool's schema
Our intern is smart but isolated. Without a phone, they can't check today's weather, query your database, read a file on your machine, or touch any system outside the room.
They know everything that was in their training data, but nothing about right now.
You tape a list of phone numbers to the wall. Each entry describes a service the intern can call:
Name: "get_weather"
Description: "Get current weather for a city"
Parameters: { location: string (required) }
The intern reads the descriptions, decides if a call is relevant, and writes a request:
{ "tool": "get_weather", "arguments": { "location": "Kyiv" } }
They slide this request out through the slot. YOUR code makes the actual call, gets the weather data, and slides the result back in. The intern then uses that result to write their answer.
The intern NEVER leaves the room. They NEVER make the call themselves. They write "please call this number for me" on a slip. Your code does the actual work. Same on both Anthropic and OpenAI.
Tool calling is NOT a separate system — it's the same next-token prediction. The model generates structured JSON character by character, because it was fine-tuned to emit tool calls when appropriate. Tool definitions are injected into the context as text (eating desk space — 10 complex tools can cost 2,000-5,000 tokens before your question even arrives).
Strict mode is the real magic. Both platforms offer strict: true, which uses constrained decoding — at each token, a grammar mask sets the probability of all schema-violating tokens to zero. The model literally cannot produce invalid JSON. Your schema is compiled into a context-free grammar and enforced at every step. Cost: slightly more latency, but 100% valid output.
Stop reasons tell your code what happened: "end_turn" (done talking), "tool_use" (wants to call a tool), or "max_tokens" (ran out of space). Your code checks this, executes the tool, and sends the result back.
Tool calling is a "structured hallucination that happens to be useful." The model doesn't know it's calling a function. It generates tokens that happen to form valid JSON. Reliability is 99.9%+ with strict mode, but it's the same next-token prediction as writing a poem.
📖 Go deeper: Anthropic Tool Use docs — OpenAI Function Calling — Structured Outputs explained
Step 1: You → [tools + prompt] → Intern
"Here's a phone book and a question about weather"
Step 2: Intern → "I want to call get_weather('Kyiv')" → You
The intern writes a structured call request
Step 3: You → [execute, get result: "12°C, cloudy"] → Intern
YOUR code makes the call, slides the result back
Step 4: Intern → "It's 12°C and cloudy in Kyiv!" → You
The intern writes the final answer using the result
Anthropic calls it "Tool Use." OpenAI calls it "Function Calling." Same concept, same JSON Schema format for the phone book entries.
| Aspect | Anthropic (Claude) | OpenAI (GPT) |
|---|---|---|
| They call it | "Tool Use" | "Function Calling" |
| Call request | tool_use content block | function_call in message |
| Return format | tool_result message | function role message |
| Schema format | JSON Schema (strict mode) | JSON Schema (strict mode) |
| Built-in phones | Web search, bash, text editor, computer use, code execution | Web search, file search, code interpreter, shell |
| Key API | Messages API | Responses API |
Anthropic splits them into client tools (your code executes them — bash, text editor, computer use) and server tools (Anthropic executes them — web search, code execution).
OpenAI groups its built-in tools into five categories: web search, file search, shell, computer use, and code interpreter.
With a phone, the intern CAN check live data, run real calculations, query your systems, and trigger actions in the outside world.
The intern goes from "brain in a jar" to "brain with hands." Each hand needs to be wired up individually, which brings us to the next problem.
A single tool call requires a minimum of 2 HTTP requests:
REQUEST 1: You send [tools + "What's the weather in Kyiv?"]
RESPONSE 1: Claude returns stop_reason: "tool_use"
+ {name: "get_weather", input: {location: "Kyiv"}}
REQUEST 2: You send [ENTIRE conversation + tool_result: "12°C, cloudy"]
RESPONSE 2: Claude returns "It's 12°C and cloudy in Kyiv!"
Notice: Request 2 includes the FULL conversation history (amnesia — the desk clears between requests). The tool_use_id links each result to its call, supporting multiple parallel tool calls per turn.
Why agentic workflows are slow: 10 tool calls = 11 HTTP round-trips, each a full inference pass. A complex code review with file reads, PR diff, and ADO lookup might take 15-30 seconds across 4-5 round-trips. This is the main bottleneck in agent performance.
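Here's what those round-trips look like from the host side — a minimal sketch of the loop using the Anthropic Python SDK. The weather function is a stand-in, and the model name is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]

def get_weather(location: str) -> str:
    return "12°C, cloudy"            # stand-in for a real weather API call

messages = [{"role": "user", "content": "What's the weather in Kyiv?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",      # placeholder model name
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        print(response.content[0].text)            # final answer
        break

    # The model only WROTE a request; this code actually executes it.
    messages.append({"role": "assistant", "content": response.content})
    for block in response.content:
        if block.type == "tool_use":
            result = get_weather(**block.input)
            messages.append({"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,            # links the result to its call
                "content": result,
            }]})
```

Two HTTP requests minimum, with the full history resent on the second — exactly the pattern shown above.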
📖 Go deeper: Anthropic Messages API reference — OpenAI Responses API
Ask your AI:
What's the weather in Kyiv right now?
Watch what happens. If you see "Searching...", the intern just picked up the phone, dialed a weather service, and read back the result. That's the 4-step protocol running live.
If you get "I can't check real-time data," your intern has no phone on the wall. Same brain, zero tools.
MCP — The Universal Adapter
Instead of individual phone numbers, the building installs a standardized intercom. The intern writes requests and slides them under the door. The building routes them.
Remember when every phone brand had a different charger? Nokia one connector, Samsung another, Motorola a third. You needed a drawer full of cables.
That's where AI tools are without MCP: 5 AI tools × 10 services = 50 custom integrations, each built and maintained separately.
Instead of taping individual phone numbers to the wall, we install a standardized intercom system. Any service can plug into the intercom using the same standard connector.
With MCP: 5 AI tools + 10 services = 15 integrations. One protocol, one standard.
MCP stands for Model Context Protocol, an open standard created by Anthropic, now community-driven and adopted by OpenAI, Google, Microsoft, and hundreds of tool providers.
THE BUILDING (MCP Host: Claude Desktop, VS Code, your app)
└── INTERCOM PANEL ON THE WALL (MCP Client: protocol handler)
├── LINE → GitHub Server
├── LINE → Azure DevOps Server
├── LINE → PostgreSQL Server
└── LINE → Your Custom API Server
└── THE ROOM (LLM — the intern inside)
| Button | What it does | Analogy |
|---|---|---|
| Tool | "Do something" | Menu items you can order |
| Resource | "Read something" | The wine list / specials board |
| Prompt | "Give me a template" | "Chef's recommendation" combo |
Button 1 — Do Something (Tool Call):
The intern writes a note: "Search issues for 'bug login'." They slide it under the door. The building routes it through the intercom to the right server, which executes and returns results back through the slot.
Button 2 — Read Something (Resource):
The intern writes: "Read file config.yaml." Same process — note goes out, data comes back. Read-only.
Button 3 — Give Me a Template (Prompt):
The intern writes: "Give me the code review template for TypeScript." The server sends back a structured prompt ready to fill in.
Two ways the intercom works: a local line (stdio — the host launches the server as a child process on the same machine) and a remote line (HTTP+SSE — the client reaches a server over the network).
Both speak the same MCP "language" — JSON-RPC 2.0.
MCP is JSON-RPC 2.0 with a defined lifecycle:
CLIENT SERVER
│── initialize ──────────────▶│ (exchange capabilities)
│◀── result ──────────────────│ "I have tools + resources"
│── tools/list ──────────────▶│ (discover available tools)
│◀── result ──────────────────│ [{name: "search_issues", inputSchema: {...}}]
│── tools/call ──────────────▶│ (LLM wants to call a tool)
│◀── result ──────────────────│ {content: [{type: "text", text: "Found 3 issues..."}]}
Capability negotiation happens at initialize — both sides declare what they support. This makes MCP forward-compatible: old clients gracefully ignore new server capabilities.
stdio transport: Claude Desktop spawns a child process (npx @modelcontextprotocol/server-github), sends JSON-RPC on stdin, reads responses from stdout. Fast (no network), but local-only. Process dies when the host closes.
HTTP+SSE transport: Client POSTs requests to a URL. Server responds inline or opens an SSE stream for long operations. Supports session management via Mcp-Session-Id header. This enables enterprise setups — a central MCP server with auth, rate limiting, and audit logging shared by the whole team.
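To make the intercom concrete, here's a minimal MCP server sketch using the official Python SDK's FastMCP helper. Treat the import path, decorators, and resource URI as assumptions — check the SDK docs for the current API.

```python
# Minimal MCP server sketch (assumes the official `mcp` Python SDK and its FastMCP helper).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("issue-tracker")

@mcp.tool()
def search_issues(query: str) -> str:
    """Search issues by keyword."""
    return f"Found 3 issues matching {query!r}"   # stand-in for a real lookup

@mcp.resource("config://app")
def read_config() -> str:
    """Read-only application config."""
    return "retries: 3\ntimeout: 30s"

if __name__ == "__main__":
    mcp.run()   # speaks JSON-RPC over stdio by default; the host spawns this process
```

The same server, unchanged, can sit behind Claude Desktop, VS Code, or a custom host — that's the point of the standard connector.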
📖 Go deeper: MCP specification — Build your first MCP server (tutorial) — MCP TypeScript SDK
Way 1: Build the intercom yourself (full control)
# Connect to MCP server, get tools, convert to Claude format
mcp_tools = await mcp_session.list_tools()
claude_tools = [{"name": t.name, "description": t.description,
"input_schema": t.inputSchema} for t in mcp_tools.tools]
# Pass to Claude API
response = client.messages.create(tools=claude_tools, ...)
Way 2: Use the MCP Connector (Anthropic's "easy button")
# Just point Claude at the MCP server URL — no client code needed
response = client.messages.create(
mcp_servers=[{"type": "url", "url": "https://mcp.example.com/sse"}],
...
)
OpenAI supports MCP natively in both the Agents SDK and the Responses API.
| Server | What it does | Relevance |
|---|---|---|
| GitHub MCP | Repos, issues, PRs, code search | Code review, PR automation |
| Azure DevOps MCP | Work items, sprints, pipelines | Sprint management from AI |
| PostgreSQL MCP | Query databases | Data exploration |
| Filesystem MCP | Read/write local files | Document processing |
| Slack MCP | Send/read messages | Team notifications |
| Google Drive MCP | Read/search docs | Knowledge base access |
| Use MCP when... | Use direct tools when... |
|---|---|
| One integration, multiple AI platforms | Building for a single platform |
| Exposing your API to AI clients you don't control | Maximum control over tool behavior |
| Tool server and AI client are separate concerns | Tool is tightly coupled to your app |
| You want to share/reuse tool servers | One-off internal function |
| You need the ecosystem (registry, discovery) | Simplicity over portability |
MCP is REST for AI tools. Just as REST standardized web APIs, MCP standardizes AI-tool connections. Build once, works everywhere.
The MCP client is a translator between two protocols:
MCP Server → tools/list → {name, description, inputSchema}
↓ (MCP client translates)
Claude API → tools: [{name, description, input_schema}] ← tokens on desk
↓ (LLM generates tool_use)
MCP Client → tools/call → {name, arguments} ← translated back to MCP
The server doesn't know or care if the host is Claude, GPT, or a custom app — it only speaks MCP. The client handles translation both ways. This is why MCP is platform-agnostic.
Anthropic's MCP Connector moves the client to their servers: you pass mcp_servers: [{url: "..."}] in the API call and Anthropic handles the MCP connection, tool discovery, and result injection. Zero MCP code on your side. Tradeoff: less control.
📖 Go deeper: MCP Connector (Anthropic docs) — MCP server registry · MCP Registry
Count your team's tools. Here's an example:
AI tools: Claude, ChatGPT, Copilot = 3
Services: GitHub, Azure DevOps, Slack, DB,
Jira, email, CI/CD = 7
Without MCP: 3 × 7 = 21 integrations
With MCP: 3 + 7 = 10 connectors
Plug in your real numbers. The gap between "times" and "plus" is how much wiring MCP saves.
Skills
Skills are instruction manuals loaded on demand. Hand the intern the right manual when they need it. Keeps the desk clean, the intern focused.
You gave the intern a phone (tools). They can call the weather API, search GitHub, query your database. But knowing how to use a drill doesn't mean you know how to build a kitchen.
A tool is a capability. A skill is knowledge of when and how to use capabilities, with all the domain context that makes the result professional.
A Skill is a directory on the intern's desk containing an instruction manual:
my-skill/
├── SKILL.md ← Required: instructions + metadata
├── scripts/ ← Optional: executable code
│ └── validate.py
├── templates/ ← Optional: reference templates
│ └── report.docx
└── assets/ ← Optional: images, data files
└── logo.png
The SKILL.md file has two parts:
---
name: quarterly-report
description: >
Generate quarterly business reports following company
template. Use when user asks for quarterly reports.
---
# Quarterly Report Generation
## Steps
1. Read the template from templates/report.docx
2. Extract financial data from provided spreadsheet
3. Generate executive summary
4. Fill in each section following the template structure
5. Validate formatting with scripts/validate.py
## Rules
- Always use company color scheme (#2E75B6 primary)
- Financial figures must include YoY comparison
- Executive summary must be under 200 words
The intern doesn't load all manuals upfront. That would fill the desk. Instead:
STAGE 1: System prompt loads ONLY metadata
(name + description for each installed skill)
≈ 50 tokens per skill
STAGE 2: User message triggers a skill match
The intern reads just THAT SKILL.md
≈ hundreds of tokens
STAGE 3: Skill loads reference files, scripts,
templates AS NEEDED
≈ thousands of tokens, loaded incrementally
Like a table of contents in a manual — the intern first sees chapter titles. When they need Chapter 7, they read just that chapter. They never load the whole manual upfront.
Why this matters for desk economics: twenty installed skills cost only ~1,000 tokens of always-on metadata. The full manuals — hundreds of tokens of instructions, thousands of tokens of templates and scripts — hit the desk only when a task actually needs them.
No magic. Context window engineering:
1. System prompt includes only skill names + descriptions (~50 tokens each). The actual SKILL.md files are NOT loaded yet.
2. Matching is the model reading those descriptions. When you say "Review PR #847," the model decides code-review is relevant. No separate matching engine, no vector search, just next-token prediction.
3. Loading is a tool call: Read("skills/code-review/SKILL.md"). Now the full instructions are on the desk.
4. Assets load on-demand: if SKILL.md says "read the template," the model makes another tool call.
No special infrastructure. Skills = files. Loading = tool calls. Matching = the model reading text. Prompt engineering + file I/O.
The tradeoff: 1-3 extra tool calls before actual work starts (1-6 seconds of latency). Noticeable for quick questions, negligible for a 10-minute code review.
When skills conflict: If two match, both SKILL.md files land on the desk. The model follows both, interleaving steps. The model resolves contradictions by judgment, which is why writing clear, non-overlapping descriptions matters.
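A sketch of the "no magic" claim: Stage 1 is just reading each skill's front matter and pasting name + description into the system prompt. The file layout and field names follow the SKILL.md example above; the YAML parsing is one assumed way a host might implement it.

```python
# Stage 1: load ONLY name + description from each SKILL.md's front matter.
from pathlib import Path
import yaml   # assumes PyYAML is installed

def skill_metadata(skill_dir: Path) -> dict:
    text = (skill_dir / "SKILL.md").read_text()
    front_matter = text.split("---")[1]        # between the first pair of --- fences
    return yaml.safe_load(front_matter)

def build_skill_index(skills_root: Path) -> str:
    lines = ["Available skills (read the full SKILL.md only when one matches):"]
    for skill_dir in sorted(p for p in skills_root.iterdir() if p.is_dir()):
        meta = skill_metadata(skill_dir)
        lines.append(f"- {meta['name']}: {meta['description'].strip()}")
    return "\n".join(lines)                    # ~50 tokens per skill in the system prompt

print(build_skill_index(Path("skills")))
# Stages 2-3 are ordinary tool calls: Read("skills/<name>/SKILL.md"), then its templates.
```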
📖 Go deeper: Agent Skills docs (Anthropic) — Skills open standard — Skills engineering blog post
Skills were created by Anthropic in October 2025 and published as an open standard in December 2025. By late 2025, OpenAI had adopted them — in Codex, a skill can be invoked explicitly with a $skill-name trigger.
Both platforms now support Skills + MCP + Tools. The ecosystem is standardizing.
Ask the intern to create a PowerPoint WITHOUT skills → generic white slides, bullet points, boring layout.
Ask the same thing WITH the pptx skill loaded → professional design, branded colors, proper typography, icon usage, visual QA verification.
The SKILL.md that made the difference is a markdown file with instructions, built from hundreds of trial-and-error iterations.
| Step | What happens | Desk size |
|------|-------------|-----------|
| 1 | Context assembled: system prompt + skill metadata + MCP tool definitions + your message | ~2,260 tokens |
| 2 | Model reads skill descriptions, decides code-review matches. Calls Read("skills/code-review/SKILL.md") | +800 → ~3,060 |
| 3 | Skill says "Read the PR diff." Model calls GitHub MCP: get_pull_request_files(847) | +5,000 → ~8,060 |
| 4 | Skill says "Read the ADO work item." Model calls ADO MCP: get_work_item(4521) | +500 → ~8,560 |
| 5 | Skill says "Check each file against the rules." Model generates structured review. | Output: ~800 tokens |
| Total | 4 forward passes, ~9,360 tokens processed, 15-30 seconds, ~$0.05-0.10 | |
You can watch each tool call in Claude Desktop. No black box. The skill improved the review by adding procedure: "check THIS list, in THIS order, format the output THIS way."
Start a new conversation and paste this code. Just say "Review this code":
public async Task<PagedResult<Product>> GetProducts(
string cursor, int limit)
{
var products = await _db.Products
.OrderBy(p => p.Id)
.Where(p => p.Id > cursor)
.Take(limit)
.ToListAsync();
var nextCursor = products.Last().Id;
return new PagedResult<Product>
{
Items = products,
NextCursor = nextCursor,
Limit = limit
};
}
You'll get a decent review. Generic, though. No team standards, no severity levels, no checklist. Save this result. You'll redo this exercise later with a skill, and the difference will be obvious.
All Five Layers
Every layer builds on the one below: LLM → Tools → MCP → Skills → Agent. Together, an AI that gets work done.
while(!done) { think → act → observe }
| Concept | Intern Analogy | What it does |
|---|---|---|
| LLM | The intern's brain | Thinks, reasons, predicts |
| Context Window | The intern's desk | Working space for current task |
| Token | Paper slips through the slot | Units of communication |
| System Prompt | Wall card with instructions | Persistent behavior rules |
| Tool | A phone number | Single capability: "call this service" |
| MCP | The intercom system | Universal way to connect any tool to any room |
| Skill | An instruction manual | "When doing X: use these tools in this order, follow these rules" |
| Agent | The intern with building keys | Uses brain + phone + manuals to complete tasks autonomously |
| AGENTS.md / CLAUDE.md | House rules in the lobby | "Before any work in this building, read these rules" |
These are layers, not alternatives:
Agent (orchestration)
└── uses Skills (procedural knowledge)
└── which reference Tools (capabilities)
└── connected via MCP (standard protocol)
└── feeding the LLM (intelligence)
└── within Context Window (working memory)
Level 1: Paper Slips (Chat mode)
Intern in the room. Notes in, notes out. No tools.
"Format this." "Explain this error." "Write a regex."
Everyone does this already.
Level 2: Intercom Access (Professional + MCP)
Intern has the intercom. Calls services when you ask.
"Analyze Raygun errors." "Create ADO tasks from this spec."
Where most teams are today.
Level 3: Keys to the Building (Agent mode)
Intern leaves the room. Picks tasks from the board, uses the intercom, handles errors, reports back.
"Monitor deploys, rollback if tests fail." "Auto-review incoming PRs."
The frontier.
All three levels stay relevant — Level 1 for quick questions, Level 2 for daily work, Level 3 for automation.
Time to build your first skill. Open any text editor, create a file called code-review-skill.md, and paste this:
# Code Review Skill
## Description
Review code for common issues.
## Steps
1. Check for null safety
2. Check error handling
3. Check naming conventions
4. Check validation & tests
## Checklist
- Null safety
- Error handling
- Naming conventions
- Input validation
- Tests coverage
- No hardcoded values
- Logging
- Security
## Output Format
- MUST FIX: critical issues
- SHOULD FIX: improvements
- CONSIDER: suggestions
That took about 3 minutes. You just gave your intern a manual it can follow every time.
Deterministic vs Hybrid
A vending machine always gives the same thing. The intern might surprise you. Combine both: the sandwich pattern.
The old world was a vending machine (deterministic). The new world is an intern (non-deterministic). You don't have to choose one. The best systems combine both.
DETERMINISTIC HYBRID NON-DETERMINISTIC
(vending machine) (intern + checklist) (intern freestyle)
│ │ │
if/else, regex, LLM decides WHAT to do, "Figure it out, here's
lookup tables, code enforces HOW it's done the goal, good luck"
SQL queries
│ │ │
Predictable Best of both worlds Creative, flexible
Testable Controlled creativity Unpredictable
Brittle Robust Hard to test
| Situation | Approach | Why |
|---|---|---|
| Always the same logic, never changes | Deterministic | Don't use an intern to press a button. Use code. |
| Structured input → structured output, but logic is complex | Hybrid | Let the intern reason, but constrain the output with schemas and validation. |
| Ambiguous input, creative output, requires judgment | Non-deterministic | This is what the intern is built for. |
| Safety-critical, auditable, regulated | Deterministic (or hybrid with heavy guardrails) | You can't explain to an auditor that "the intern thought it was fine." |
| Evolving requirements, new edge cases constantly | Hybrid / Non-deterministic | The intern adapts without code changes. Vending machines need reprogramming. |
Scenario 1: Parsing ADO work item into subtasks
❌ Pure non-deterministic: "Read this work item and create subtasks." The intern creates random subtask structures every time — different formats, inconsistent estimation scales, sometimes forgets fields.
❌ Pure deterministic: Regex-parse the work item description, extract keywords, create predefined subtask templates. Breaks the moment someone writes the description differently.
✅ Hybrid:
DETERMINISTIC: ADO API fetches work item fields (title, description, acceptance criteria)
NON-DETERMINISTIC: Intern reads the content, reasons about subtask decomposition
DETERMINISTIC: Output schema enforces {title, description, estimate, type} per subtask
DETERMINISTIC: Validation rejects estimates outside [0.5h, 16h] range
DETERMINISTIC: ADO API creates the subtasks
The intern does the thinking. The code does the structure enforcement. You get creative decomposition with predictable output format.
Scenario 2: Code review
❌ Pure deterministic: Linters, static analysis. Catches syntax issues and known patterns. Misses logic bugs, architectural problems, and "this doesn't match how we do things."
❌ Pure non-deterministic: "Review this code." Different results every time. Sometimes catches 10 issues, sometimes 3. No consistency in severity labeling.
✅ Hybrid:
DETERMINISTIC: Linter runs first (ESLint, StyleCop) → catches all mechanical issues
NON-DETERMINISTIC: Intern reviews for logic, architecture, patterns (with a skill!)
DETERMINISTIC: Skill enforces output format (MUST FIX / SHOULD FIX / CONSIDER)
DETERMINISTIC: CI gate — if "MUST FIX" count > 0, block merge
Linter handles what it's good at (100% reliable, instant). The intern handles what requires judgment. The skill constrains the output. The CI gate makes the decision deterministic.
Scenario 3: Test case generation
❌ Pure deterministic: Template-based test generation from API spec. Covers happy path and basic validation. Misses implementation-specific edge cases entirely.
❌ Pure non-deterministic: "Write tests." Inconsistent coverage, sometimes over-tests trivial things, sometimes misses critical paths.
✅ Hybrid:
DETERMINISTIC: Parse OpenAPI spec → generate all endpoint/method/status combinations
NON-DETERMINISTIC: Intern reads the actual code → adds edge cases from implementation
DETERMINISTIC: Merge both lists, deduplicate, validate against test schema
DETERMINISTIC: Convert to Playwright/Jest tests using deterministic templates
NON-DETERMINISTIC: Intern reviews final tests for completeness
Scenario 4: Incident response
❌ Pure deterministic: Alert → PagerDuty → human investigates. Works, but slow. The human is the bottleneck at 3am.
❌ Pure non-deterministic: "Investigate and fix this." Absolutely not for production. The intern might decide a rollback is fine when it isn't.
✅ Hybrid:
DETERMINISTIC: Alert triggers, error data collected automatically
NON-DETERMINISTIC: Intern analyzes logs, identifies probable cause, drafts fix
DETERMINISTIC: Fix goes through existing PR review + CI pipeline
DETERMINISTIC: Deploy requires human approval (the intern proposes, human disposes)
The intern accelerates investigation from 45 minutes to 5 minutes. But the human makes the final call on what ships to production.
Use the intern for reasoning. Use code for enforcement.
┌─────────────────┐
User input ───▶ │ DETERMINISTIC │ Validate input, fetch data
│ (your code) │
└────────┬────────┘
│
┌────────▼────────┐
│ NON-DETERMINISTIC│ Reason, analyze, decide, generate
│ (the intern) │
└────────┬────────┘
│
┌────────▼────────┐
│ DETERMINISTIC │ Validate output, enforce schema,
│ (your code) │ apply business rules, execute action
└─────────────────┘
This is the "sandwich pattern": deterministic bread on both sides, non-deterministic filling in the middle. Your code ensures clean input and safe output. The intern does the thinking in between. Production AI systems work this way, and your Skills should be designed this way.
The strict: true parameter in tool definitions and structured outputs makes the hybrid approach practical. Without it, the intern might return JSON with missing fields, wrong types, or extra properties, and your deterministic validation layer has to handle infinite edge cases. With strict mode, the intern's output is guaranteed to match your schema. The grammar-constrained decoding ensures every token is valid. "Hopefully valid JSON" becomes "provably valid JSON."
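A compressed sketch of the sandwich from Scenario 1: deterministic code on both sides of a single model call, with the schema enforced by forcing a tool call and a hard range check applied afterwards. The tool name, schema fields, and surrounding ADO plumbing are illustrative, not a real SDK.

```python
import anthropic

client = anthropic.Anthropic()

SUBTASK = {"type": "object",
           "properties": {"title": {"type": "string"},
                          "description": {"type": "string"},
                          "estimate_hours": {"type": "number"},
                          "type": {"type": "string"}},
           "required": ["title", "description", "estimate_hours", "type"]}

def decompose(work_item: dict) -> list[dict]:
    # BREAD 1 (deterministic): `work_item` was fetched and validated by your ADO client.
    response = client.messages.create(
        model="claude-opus-4-6",                       # placeholder model name
        max_tokens=2048,
        tools=[{"name": "emit_subtasks",
                "description": "Emit the proposed subtasks",
                "input_schema": {"type": "object",
                                 "properties": {"subtasks": {"type": "array", "items": SUBTASK}},
                                 "required": ["subtasks"]}}],
        tool_choice={"type": "tool", "name": "emit_subtasks"},   # force structured output
        messages=[{"role": "user",
                   "content": f"Decompose this work item into subtasks:\n{work_item}"}],
    )
    subtasks = next(b.input["subtasks"] for b in response.content if b.type == "tool_use")

    # BREAD 2 (deterministic): business rules enforced by code, not by the model.
    for task in subtasks:
        if not 0.5 <= task["estimate_hours"] <= 16:
            raise ValueError(f"Estimate out of range: {task}")
    return subtasks
```

The filling reasons; the bread validates. Swap the final return for your ADO create-subtask calls and the whole flow stays auditable.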
📖 Go deeper: Anthropic Structured Outputs — OpenAI Structured Outputs
Open a new conversation. Paste any failing test output you have lying around, then add this at the end:
Your response MUST be valid JSON:
{
"test_name": "...",
"root_cause": "...",
"fix": "...",
"confidence": "HIGH | MEDIUM | LOW"
}
Respond with ONLY JSON.
The intern thinks creatively about the bug, but the output lands in a strict schema you can parse with code. Bread on both sides, creative filling in the middle.
Codex vs Claude Code
Codex is the delegating boss: hand off a task, come back for the result. Claude Code is pair programming: talk it through together.
Both Codex (OpenAI) and Claude Code (Anthropic) are coding agents. They give the intern tools and let them write code. They represent two different management philosophies.
Philosophy: "Here's the task. I'll check in later."
You write a task on a card: "Add OAuth login with password reset." You slide it under the intern's door and walk away. The intern works in a soundproof office (cloud sandbox), writes code, runs tests, and when done, puts a finished PR on your desk for review.
┌──────────────────────────────────────────────────┐
│ CODEX PLATFORM │
├──────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ Codex CLI │ │ IDE Plugin │ │ Codex │ │
│ │ (terminal)│ │ (VS Code) │ │ Cloud │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬─────┘ │
│ └───────────┬───┘───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────┐ │
│ │ RESPONSES API (unified API) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ GPT-5.4 MODEL │ │ │
│ │ │ Dynamic reasoning: auto-routes │ │ │
│ │ │ none → low → med → high → xhigh │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ Built-in tools: │ │
│ │ Web search, file search, shell, │ │
│ │ computer use (native!), code interpreter │ │
│ │ │ │
│ │ External: MCP servers, Skills, AGENTS.md │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Conversation state / Compaction / Prompt cache │
└──────────────────────────────────────────────────┘
Key architectural decisions:
GPT-5.4 is a convergence model. OpenAI merged separate model families (reasoning models like o3/o4-mini and coding models like Codex) into ONE model that dynamically adjusts how much it "thinks." Simple questions use the fast path; complex debugging uses deep reasoning. Under the hood, GPT-5.4 contains effectively three models in one (main, mini, nano tiers) with a built-in router.
The Codex platform offers three surfaces — CLI, IDE, Cloud — all hitting the same API. Codex Cloud is the most differentiated: you delegate entire features to a cloud sandbox. The agent clones your repo, creates a branch, makes changes, runs tests, opens a PR — all without you watching.
Philosophy: "Walk me through your thinking."
You sit next to the intern's desk. They think out loud, show you their reasoning at each step, ask "should I go left or right here?" You steer in real-time.
┌──────────────────────────────────────────────────┐
│ CLAUDE CODE PLATFORM │
├──────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ Claude Code│ │ VS Code / │ │ Claude.ai │ │
│ │ CLI │ │ JetBrains │ │ (Cowork) │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬─────┘ │
│ └───────────┬───┘───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────┐ │
│ │ MESSAGES API (unified API) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ CLAUDE OPUS 4.6 MODEL │ │ │
│ │ │ Extended Thinking (visible!) │ │ │
│ │ │ Interleaved thinking (think between │ │ │
│ │ │ tool calls — Claude 4+ feature) │ │ │
│ │ │ Adaptive: low → medium → high │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ Built-in tools (5 categories): │ │
│ │ Read, Write, Execute, Browse, Orchestrate │ │
│ │ │ │
│ │ External: MCP servers, Skills, CLAUDE.md │ │
│ │ Hooks (pre/post tool execution) │ │
│ │ Agent Teams (parallel instances) │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Compaction / Context editing / Prompt caching │
└──────────────────────────────────────────────────┘
Key architectural decisions:
Claude Code's philosophy is an "agentic harness." The model is the brain. Claude Code is the body. Each task cycles through three phases — gather context → take action → verify results — repeated in a loop.
Interleaved thinking is the differentiator. Claude 4+ models think between tool calls. After getting a tool result, the model reasons about it BEFORE deciding what to do next. Earlier models had to finish all thinking before tool use.
Agent Teams (Feb 2026): Spawn 4-16 Claude instances working in parallel on a shared codebase, each with a different role. A lead agent coordinates. The C compiler experiment: 16 agents, 2,000 sessions, 2 billion input tokens, $20,000 → 100,000 lines of production Rust code.
Codex's Loop:
You: "Fix the auth bug in the login module"
│
▼
1. LOAD CONTEXT
System prompt + AGENTS.md + Skills matched
Reasoning level auto-selected (detects "hard" → high)
│
▼
2. PLAN (reasoning tokens — internal)
"I need to find the login module, read the auth flow,
identify the bug, write a fix, run tests."
[GPT-5.4 auto-decides depth per turn. Simple fix?
Nano-tier. Race condition? Full reasoning.]
│
▼
3. EXECUTE (tool calls)
→ shell: find . -name "*.ts" | grep "login"
→ file_read: src/auth/login.ts
→ file_edit: src/auth/login.ts (applies patch)
→ shell: npm test -- --grep "login"
[If context fills → COMPACTION → continues seamlessly]
│
▼
4. VERIFY
Tests pass? → Done. Open PR.
Tests fail? → Loop back to step 2.
[Can iterate 7-24 hours autonomously]
Claude Code's Loop:
You: "Fix the auth bug in the login module"
│
▼
1. GATHER CONTEXT
System prompt + CLAUDE.md + Skills matched
Git state, project structure scanned
│
▼
2. THINK (extended thinking — VISIBLE to you)
<thinking>
The user wants me to fix an auth bug.
Let me search for login-related files...
</thinking>
│
▼
3. ACT + THINK (interleaved!)
→ grep: "login" across project
← results
<thinking>Found 3 files. session.ts looks relevant...</thinking>
→ view: src/auth/session.ts
← file content
<thinking>I see the bug — token not being refreshed...</thinking>
→ edit: src/auth/session.ts
→ bash: npm test
← 2 tests fail
<thinking>Need to fix the mock setup...</thinking>
→ edit: test/auth.test.ts
This THINK-ACT-THINK-ACT pattern is unique to Claude 4+.
│
▼
4. VERIFY
Tests pass? → Show diff for approval.
Tests fail? → Adjusts approach.
You can interrupt ANY time to steer.
GPT-5.4: The Auto-Router
GPT-5.4 contains a built-in intelligence router:
User query arrives
│
┌────┴──────────┬────────────┐
▼ ▼ ▼
┌──────┐ ┌──────────┐ ┌────────┐
│ NANO │ │ MINI │ │ FULL │
│ tier │ │ tier │ │ reason │
│ │ │ │ │ │
│93.7% │ │ Moderate │ │ 2x more│
│fewer │ │ tasks │ │thinking│
│tokens│ │ │ │tokens │
└──────┘ └──────────┘ └────────┘
Bottom 10% of turns use 93.7% fewer tokens. Top 10% think 2x longer. You can override with reasoning.effort: none/low/medium/high/xhigh.
Claude Opus 4.6: Adaptive Thinking
User query arrives
│
▼
┌─────────────────────────┐
│ ADAPTIVE THINKING │
│ (model reads context │
│ to decide budget) │
└───────┬─────────────────┘
│
▼
┌─────────────────────────┐
│ EXTENDED THINKING BLOCK │
│ <thinking> │
│ Visible to developers │
│ Interleaved with tools│
│ </thinking> │
│ │
│ Previous thinking blocks│
│ STRIPPED from future │
│ turns (saves desk space)│
│ EXCEPT during tool-use │
│ cycles │
└─────────────────────────┘
Control with /effort: low/medium/high. At medium effort, Opus 4.6 matches Sonnet 4.5's SWE-bench with 76% fewer output tokens.
The practical difference:
| What you see | Codex | Claude Code |
|---|---|---|
| While it "thinks" | Spinner, result appears when ready | Thinking content visible in real-time |
| After a tool call | Next action appears | Thinking block shows WHY it chose that action |
| Debugging AI behavior | Tracing dashboard | Read the thinking blocks directly |
| Steering mid-task | Interrupt with new instructions | Interrupt + see what it was thinking |
Codex is a senior engineer who goes away, works on the problem, and comes back with a solution. Claude Code is a pair programmer who narrates their thinking out loud. Both produce excellent code. Claude's approach is more transparent during execution.
Both platforms independently solved the same problem: how to work on tasks that exceed the context window.
A complex refactor might use 115,000+ tokens of context, growing with every turn. Without compaction, the agent hits the desk limit and stops.
The solution (both platforms):
Turns 1-15: Normal operation, desk fills up
│
▼
Turn 16: COMPACTION TRIGGERED
The intern summarizes their own work:
"I'm refactoring the auth module.
Done: fixed session.ts, updated 3 tests.
Next: fix remaining test mock."
│
▼
Turn 17: FRESH DESK
[system + summary + current files]
Intern continues where they left off.
Can repeat multiple times.
The intern writes commit messages for their own work. When the desk gets cluttered, they write a summary, clear the desk, and keep going. The AI equivalent of going home, sleeping, and picking up from your notes in the morning.
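A rough sketch of how a host might implement this. The token threshold, the summarization prompt, and the model name are all assumptions — real agents (and the platforms' own server-side compaction) are more careful about what they keep.

```python
import anthropic

client = anthropic.Anthropic()
COMPACT_AT_TOKENS = 150_000        # assumed threshold, safely below the desk limit

def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(str(m["content"])) // 4 for m in messages)   # crude ~4 chars/token

def maybe_compact(messages: list[dict]) -> list[dict]:
    if estimate_tokens(messages) < COMPACT_AT_TOKENS:
        return messages
    summary = client.messages.create(
        model="claude-opus-4-6",   # placeholder model name
        max_tokens=1024,
        messages=messages + [{
            "role": "user",
            "content": "Summarize this session: the goal, what is already done, what remains.",
        }],
    ).content[0].text
    # Fresh desk: the summary replaces the whole transcript.
    return [{"role": "user", "content": f"Resume from this progress note:\n{summary}"}]
```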
| Aspect | Codex | Claude Code |
|---|---|---|
| Training | Natively trained for compaction since 5.1-Codex-Max | Server-side compaction |
| Max session | 24+ hours continuous | 2,000 sessions / 2 weeks (C compiler) |
| Trigger | Automatic | Automatic + manual |
Codex: Experimental multi-agent. Launch parallel tasks, each runs in Codex Cloud sandbox. Results merge via PR workflow.
Claude Code Agent Teams (more mature):
Lead Intern (you interact with this one)
│
├── Intern 1: "Analyze auth module"
│ └── Own desk, own tools, own thinking
│
├── Intern 2: "Analyze database layer"
│ └── Works in parallel with Intern 1
│
├── Intern 3: "Analyze API endpoints"
│ └── Independent desk and tools
│
└── Intern 4: "Cross-reference findings"
└── Reads output from 1-3, synthesizes
All share the same codebase (building).
Each has their own desk (context window).
Lead intern coordinates and synthesizes.
The C Compiler benchmark: 16 interns, 2 weeks, $20,000, 100,000 lines of Rust that compiles the Linux kernel. One intern "cheated" by calling GCC when it couldn't solve a problem. Very intern behavior.
| Dimension | GPT-5.4 (Codex) | Claude Opus 4.6 |
|---|---|---|
| Context window | 1M tokens | 200K (1M beta) |
| Reasoning control | 5 levels + auto-router | 3 levels + adaptive |
| Thinking visibility | Becoming transparent | Fully visible |
| Computer use | Native in 5.4 | Tool-based |
| Pricing (per MTok) | $2.50 in / $15 out | $5 in / $25 out |
| Max sustained session | 24+ hours | 2,000 sessions / 2 weeks |
| Platform Feature | Codex | Claude Code |
|---|---|---|
| CLI | Yes (open source) | Yes (open source) |
| IDE | VS Code | VS Code, JetBrains |
| Cloud/async | Codex Cloud | Cowork |
| Multi-agent | Experimental | Agent Teams (mature) |
| MCP support | Native | Native + MCP Connector |
| Skills | Adopted the standard | Created the standard |
| Project config | AGENTS.md | CLAUDE.md |
| Approval modes | Suggest / auto / full auto | Ask / auto / YOLO |
| GitHub integration | Deep native | Via MCP |
OPENAI (7 models in 6 months):
──────────────────────────────────
Sep 2025: GPT-5-Codex (first dedicated)
Oct 2025: GPT-5-Codex-Mini (cost-efficient)
Nov 2025: GPT-5.1-Codex (improved)
Nov 2025: GPT-5.1-Codex-Max (first compaction)
Dec 2025: GPT-5.2-Codex (long-horizon)
Feb 2026: GPT-5.3-Codex (most capable coding)
Mar 2026: GPT-5.4 (converged model)
ANTHROPIC (fewer releases, bigger jumps):
──────────────────────────────────
May 2025: Claude 4 (interleaved thinking)
Sep 2025: Claude Sonnet 4.5 (top SWE-bench)
Oct 2025: Agent Skills launched
Feb 2026: Claude Opus 4.6 (Agent Teams, 1M context)
| Your Situation | Recommendation | Why |
|---|---|---|
| Quick interactive coding | Either | Comparable experience |
| Async/background features | Codex Cloud | Built for delegation |
| Need to see AI reasoning | Claude Code | Visible thinking blocks |
| Large parallel refactoring | Claude Code Agent Teams | Mature multi-agent |
| Tight budget, high volume | Codex (GPT-5.4) | Lower per-token pricing |
| Windows development | Codex | Specific Windows support |
| GitHub-native workflow | Codex | Deep GitHub/Slack integration |
| Building MCP toolchains | Claude Code | Created MCP, MCP Connector |
| Already in Claude ecosystem | Claude Code | Seamless with Cowork, Skills |
| Already in OpenAI ecosystem | Codex | Seamless with AgentKit, Evals |
Try both management styles on the same piece of code. First, the delegator:
"Fix the null safety issues in this code.
Return only the corrected code."
You get a result. Clean, fast, done. Now try the pair-programmer:
"Let's review this code together.
What potential issues do you see?
Walk me through your thinking."
You get a conversation. The intern explains its reasoning, catches things you might question, and you learn along the way. Which fits how you work?
OpenClaw
What if you could text your intern on WhatsApp? OpenClaw puts AI on messaging platforms. Quick and accessible, but think about security.
Someone connected the intercom system (MCP), the instruction manuals (Skills), and the intern's brain (LLM) to a WhatsApp/Telegram/Slack interface. Now you can text the intern from your phone.
"Hey, what broke in production?" The intern uses the intercom to check Raygun, reads K8s logs, searches GitHub, and texts you back. 24/7, from your phone, while you're at breakfast.
| Analogy | OpenClaw | Anthropic | OpenAI |
|---|---|---|---|
| The intern | LLM | Claude | GPT |
| The intercom | Gateway | MCP Client | Codex runtime |
| Instruction manuals | Skills (SKILL.md) | Agent Skills | Custom functions |
| Manual library | ClawHub | MCP marketplace | — |
| Intern's notes | Markdown on disk | Conversation history | Conversation history |
| Your phone | WhatsApp/TG/Slack | Claude Desktop | ChatGPT app |
The analogy gets darker here: the intern now has the keys to the building. Make sure the building has alarms.
Imagine your team wants to connect AI to Slack. Before you say yes, score each risk from 1 to 5:
[ ] API keys exposed in transit
[ ] Message history stored by 3rd party
[ ] No role-based access control
[ ] AI sees sensitive client data
[ ] No audit log of AI actions
Anything you scored 4 or higher is a blocker. What guardrails would you need before approving this?
Prompt Engineering
Vague notes get vague results. Specific, structured prompts with examples and constraints get precise, useful output.
The intern is only as good as the notes you slide through the slot. A good note spells out five things — context, codebase, task, constraints, and expected output:
CONTEXT: [What you're working on, link to ADO item]
CODEBASE: [Which repo/folder to look at]
TASK: [Exactly what you want done]
CONSTRAINTS: [What NOT to do, patterns to follow]
OUTPUT: [What you expect back — code, explanation, test cases, etc.]
Example:
CONTEXT: ADO #4521 — adding cursor-based pagination to products API
CODEBASE: Check ProductsController.cs and OrdersController.cs (pagination reference)
TASK: Implement pagination on GET /api/products matching our existing pattern
CONSTRAINTS: Don't change the response schema for existing fields.
Use cursor-based, not offset-based. Max page size = 100.
OUTPUT: Implementation code + unit tests + PR description
Open a new conversation and try this structured prompt with the same code from earlier:
CONTEXT: C# backend, SOLID, async/await
TASK: Review for bugs, null safety, naming
CONSTRAINTS: Only real issues,
categorize by severity
CODE:
[paste your code here]
OUTPUT: Numbered list with
severity + suggested fix
Compare this to the generic review you got back in Step 4. Same intern, same code, better note. The difference is all in how you asked.
For Daily Work
Your intern works best with a clean desk. Start fresh conversations. Keep context focused. More papers = more diluted attention.
You know what the intern can do. Now: how to work with them day-to-day for great results instead of hit-or-miss guesses.
Everything (your question, conversation history, system prompt, tool definitions, attached files) must fit on the desk. A bigger desk doesn't mean better work.
The problem with a cluttered desk (and why self-attention is the reason):
The intern doesn't skim top-to-bottom like a human. They use self-attention — for every single token on the desk, they compute a relevance score against every other token. Think of it as the intern drawing an invisible line between every pair of words and asking: "How much does word A matter for understanding word B?"
With 5,000 tokens on the desk, that's 25 million lines to draw. At 100,000 tokens, it's 10 billion lines. This happens across 128 parallel "attention heads" (each specializing in different relationships: syntax, logic, code structure, coreference) across 100+ layers. The math works, but the signal gets diluted. When the intern draws 10 billion relationship lines, the important ones ("this variable is null because that constructor wasn't updated") compete with millions of irrelevant connections. The attention "budget" (which sums to 1.0 via softmax) gets spread across more tokens, meaning each individual token gets a smaller slice of focus.
Example: you paste a 500-line file and ask the intern to find a bug on line 47. With just that file on the desk (~2K tokens), the intern's attention on line 47 is concentrated. It can relate line 47 to every other relevant line. Now paste 20 more files alongside it (~40K tokens). The intern still sees line 47, but allocates attention across 40K tokens. Line 47's share shrinks from ~0.05% to ~0.0025% of the attention budget. The intern is more likely to miss the relationship between line 47 and the DI registration on line 312 of a different file.
Smaller, focused context = sharper attention = better results.
Research shows:
| Context Used | Quality Behavior |
|---|---|
| < 10K tokens (~25 pages) | Peak performance. High attention density. Good for focused tasks: single file reviews, specific questions, targeted generation. |
| 10K–50K tokens (~25–125 pages) | Still excellent. Good for multi-file tasks, PR reviews with context, feature implementation with spec + existing code. |
| 50K–100K tokens (~125–250 pages) | Noticeable degradation on details buried in the middle. The "lost in the middle" effect: info at the start and end gets more attention than info in the middle. |
| 100K+ tokens (~250+ pages) | Use with care. Great for search/retrieval tasks ("find X in this codebase"), but for reasoning tasks the intern starts missing details. Compaction helps here. |
Don't dump your entire codebase into the context "just in case." Give the intern what they need. Five relevant files beat fifty irrelevant ones.
How to right-size the desk: share only the files the task actually touches, summarize or link the rest, and let the intern pull details on demand via MCP instead of front-loading everything.
A 2023 Stanford study (Liu et al.) showed that LLMs recall information at the start and end of the context much better than the middle. At 100K+ tokens, accuracy on middle-positioned facts dropped by 20-30%. Attention scores decay with positional distance, and both the beginning (system prompt area) and end (recent conversation) get natural attention boosts. Practical fix: put your most important context near the top or the end of the prompt. MCP helps because the model requests information on demand, placing it fresh at the end of context when needed.
📖 Go deeper: Lost in the Middle (Stanford) — Context Window Best Practices (Anthropic)
One of the most common mistakes: having a 2-hour conversation, getting progressively worse results, and blaming the model.
Start a new conversation when you switch to a different task, when the topic has drifted several times, or when answers are getting noticeably worse.
Continue the conversation when the next question builds directly on what's already on the desk — a follow-up, a refinement, the next step of the same task.
Rule of thumb: Treat conversations like browser tabs. Open a fresh one for each distinct task. Don't let a single tab become a 200-page novel.
Bad: "Review my PR, also check if there are any related bugs in ADO, and while you're at it write release notes and update the wiki."
Good: "Review PR #387. Focus on: error handling, null safety, and test coverage."
The intern can do many things, but attention is finite. Four tasks in one prompt means the intern allocates ~25% of reasoning capacity to each. One task gets 100%. The quality difference is stark, especially for deep-thinking tasks like code review.
Exception: Simple factual multi-part questions ("status of items #101, #102, and #103?") are fine batched. The intern isn't reasoning, just fetching.
The intern learns patterns faster than rules. Compare:
Instruction-based: "Write commit messages using conventional commit format with a type prefix, scope in parentheses, and a concise description."
Example-based:
Write commit messages like these:
feat(auth): add OAuth2 support for GitHub login
fix(api): handle null response from payment gateway
refactor(db): extract connection pooling into shared module
Now write a commit message for my changes.
The second approach works better because the intern's core skill is pattern recognition, trained on trillions of examples. Showing 2–3 examples activates the right pattern more reliably than a paragraph of rules. This is few-shot prompting, and it works for code style, PR descriptions, test formats, documentation templates.
Build a team skill (SKILL.md) that includes your examples. Every team member gets the same patterns without retyping them.
Most tools pick sensible defaults, but knowing when to adjust helps:
| Task | Temperature | Why |
|---|---|---|
| Code generation | 0 – 0.2 | You want correct, deterministic code. Creativity = bugs. |
| Code review | 0 – 0.3 | You want consistent, reliable analysis. |
| Test generation | 0.3 – 0.5 | A little creativity helps find edge cases. |
| Brainstorming | 0.7 – 1.0 | You want varied ideas, not the first obvious one. |
| Writing (docs, emails) | 0.5 – 0.7 | Natural variation, not robotic. |
In Claude Desktop and Codex: You don't control temperature directly. The tools choose based on the task. Knowing this helps you understand why the intern sometimes gives different answers to the same question (temperature > 0), and why code generation is rock-solid (temperature ≈ 0).
The intern is smart but not infallible. Ask it to EXPLAIN before executing. Generated an API endpoint? Test it against your contract tests. The intern drafts, you approve.
Every token costs real money. Smart habits save budget:
| Action | Token Impact |
|---|---|
| Paste 1,000-line file every turn | ~4K input tokens × every message |
| Let intern read via MCP | ~4K tokens once, plus ~200 for the tool call |
| Ask for "code only, no explanation" | Saves ~500–2,000 output tokens per response |
| Start fresh instead of continuing a 50-message thread | Saves ~30K+ context tokens per turn |
Skills are instruction manuals (Chapter 4). To use them as a team, keep a /skills folder in your project repo — everyone on the team gets them automatically.
| Time | Action | Why |
|---|---|---|
| Start of day | Open Claude Desktop / Codex. Fresh conversation. | Clean desk. Full attention. |
| Pick up a task | "Read ADO #[item]. Summarize what needs to be done." | Intern reads the full context so you start informed. |
| Implementation | Share relevant files (or let the intern read via MCP). One task per conversation. | Focused desk = better code. |
| Before committing | "Review the changes I made. Focus on bugs and edge cases." | Fresh eyes. The intern hasn't seen this code before, so they catch what you've gone blind to. |
| PR description | "Write a PR description linking ADO #[item]. Include what changed, why, and how to test." | Saves 5 minutes every PR. Consistent format. |
| End of day | Close conversations. Don't save stale ones for tomorrow. | Tomorrow you'll want a clean desk, not yesterday's clutter. |
Each conversation starts with a prefill phase where the entire context (system prompt + conversation history) is re-processed through all model layers. At 200K tokens, this takes 5-15 seconds and costs significant compute. Shorter contexts mean faster first-token latency, lower cost, and higher quality. The KV-cache helps within a single turn, but between turns the cost of a long history compounds. Prompt caching helps for the static parts (system prompt, skills), but the conversation portion is always reprocessed in full.
📖 Go deeper: Prompt Caching (Anthropic) — Context Window Management
Your conversation has been running for a while now. Go back to it and ask the intern to recall your skill checklist from earlier. See how fuzzy the answer gets?
Now open a fresh conversation. Paste only the skill file and the code. Nothing else. The response will be noticeably sharper. That's what a clean desk does.
Your Monday Starts Now
Take the code-review-skill.md you built and drop it in a shared spot: a /skills folder in your repo, a shared drive, wherever your team can grab it.
Then think about what comes next. What other repeated tasks on your team could become a skill file? Who writes each one? How do you keep them current as your standards evolve?