How to explain LLMs, tools, MCP, skills, and agents without pretending any of this is magic.
Think of modern AI as a very smart intern: fast, capable, and surprisingly useful, but only if you give it the right tools, clear instructions, and a system to work inside. This guide walks through that story one layer at a time, from raw LLMs to full agent workflows.
Grab your AI tool of choice. Claude, ChatGPT, Copilot, whatever you have. Open a fresh conversation and leave it running. You'll come back to it at every step.
LLM Basics
Start with the core picture: you hired the smartest person in the world, then shut them in a room with a desk and a mail slot. They can only work with what fits on the desk, and they can only respond with what they send back through the slot. That is the basic LLM model.
The intern has one desk. Everything they need must fit on this desk at once:
| Model | "Desk Size" | Real-world equivalent |
|---|---|---|
| Claude Opus 4.6 | 200K tokens (1M beta) | ~500–3,000 pages |
| GPT-5.4 | 1M tokens | ~3,000 pages |
Once the desk is full, something has to be removed. Older notes get pushed off. The intern seems to "forget" earlier conversations. The desk is physically full.
The context window is working memory, like RAM. Not storage. Every API call, the desk gets cleared and rebuilt from scratch.
Not an arbitrary marketing number. The model uses self-attention: every token computes a relationship score against every other token. Memory and compute grow quadratically: double the context → 4x the cost. A 200K context = 40 billion pairwise relationships per layer, and models have 80-120 layers. Context windows don't grow to infinity because each doubling is a massive engineering investment (FlashAttention, ring attention, sliding window) to make the quadratic cost survivable.
📖 Go deeper: Attention Is All You Need (original paper, 2017) — FlashAttention-3 (current state-of-the-art, 2024) — Transformer Explainer (interactive visual)
The intern doesn't read whole words. They process token-fragments, like Scrabble tiles.
Why this matters:
Before text reaches the neural network, a tokenizer (Byte Pair Encoding) splits it into subword chunks from a fixed vocabulary (~100K entries). "Hello world" → [15496, 995], two integer IDs. Each ID maps to a row in a learned embedding matrix, a point in 8,192-dimensional space. The model never sees text. It sees high-dimensional vectors where similar meanings cluster together ("Hello" is near "Hi" and "Hey").
Why non-English costs more: The vocabulary was built mostly from English. "the" = 1 token, but "Антропик" = 4-6 tokens because Cyrillic gets split into byte-level fragments. Same meaning, more tokens, higher cost. Code is expensive too: every {, }, and indentation space burns a token.
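Want to see the Scrabble tiles yourself? Here's a minimal sketch using tiktoken, OpenAI's open-source BPE tokenizer library. The encoding name is an assumption, and exact token IDs differ per model family — every model ships its own vocabulary.

```python
# Minimal tokenization demo with tiktoken (IDs and counts vary by vocabulary).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # assumed encoding; pick one for your model

for text in ["Hello world", "Антропик", "function add(a, b) { return a + b; }"]:
    ids = enc.encode(text)
    print(f"{text!r:45} -> {len(ids)} tokens: {ids}")
    # English prose is compact; Cyrillic and code burn noticeably more tokens.
```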
📖 Go deeper: OpenAI Tokenizer (try it live) — BPE algorithm explained (Hugging Face LLM course)
There's a dial on the intern's desk: temperature. At 0, the intern always picks the most likely next word; turned up, they take more chances.
Most coding tools keep the dial at 0.
The model doesn't pick one word. It produces a probability distribution over ~100K tokens. For "The capital of France is": Paris gets 92%, "the" 3%, "a" 2%, etc. Temperature divides the raw scores before converting to probabilities: softmax(logits / temperature).
There's also top-p: instead of reshaping the distribution, it cuts off the tail. top_p=0.9 = "only sample from tokens whose cumulative probability reaches 90%." Creative within plausible bounds.
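Here's a toy sketch of what the dial does to the numbers. The logit values are invented for illustration; a real model scores every entry in its ~100K-token vocabulary, and temperature 0 is implemented as plain argmax rather than a division.

```python
import numpy as np

def sample_distribution(logits, temperature=1.0, top_p=1.0):
    # Temperature divides the raw scores before softmax (T=0 is handled as argmax in practice).
    probs = np.exp(np.array(logits) / temperature)
    probs /= probs.sum()
    # Top-p: keep only the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

# Toy logits for ["Paris", "the", "a"] completing "The capital of France is"
logits = [5.0, 1.6, 1.2]
print(sample_distribution(logits, temperature=1.0))  # peaked at "Paris"
print(sample_distribution(logits, temperature=2.0))  # flatter — more adventurous
print(sample_distribution(logits, temperature=0.2))  # almost deterministic
```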
📖 Go deeper: Generation Strategies (Hugging Face docs, kept current)
Every time you slide a new note in, the intern has amnesia. They don't remember the last conversation. If you want them to remember what you discussed before, you reprint the entire previous conversation and slide it in again.
Chat UIs send the full conversation history with every message. Long conversations get expensive (more tokens each turn). The desk fills up and the oldest messages get dropped. The model seems to "forget" things from early in the conversation.
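A minimal sketch of why long chats get expensive: the client keeps the whole transcript and resends it on every turn. Shown with the Anthropic Python SDK; the model name is a placeholder, not a real identifier.

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
history = []                     # the desk is rebuilt from this list on every call

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-opus-4-6",   # placeholder model name
        max_tokens=1024,
        messages=history,          # the ENTIRE conversation goes out again
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Summarize our deployment process.")
ask("Now turn that into a checklist.")   # only works because turn 1 was resent
```

Every turn pays for every earlier turn again; that's the amnesia tax.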
The intern doesn't read left-to-right. They look at ALL words simultaneously and figure out which ones relate to which.
"The server crashed because it ran out of memory." The intern understands that "it" refers to "server" (not "memory"), and makes this connection across the entire desk, from page 1 to page 500.
This is the foundation of prompt engineering: put the right information on the desk → the intern attends to it → better output.
Everything comes from one paper: "Attention Is All You Need" (Google, 2017). The architecture:
Token IDs → Embeddings (8192-dim vectors) → [Attention + Feed-Forward] × 100 layers → Probabilities
Self-Attention is the core mechanism. For each token, the model computes three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), and Value ("what information do I carry?"). The dot product Q × K gives an attention score between every pair of tokens — "how relevant is token A to token B?" These scores become weights for combining Values into a new representation.
This happens 128 times in parallel per layer ("multi-head attention"), each head learning different relationship types — one learns syntax, another learns coreference, another learns code bracket matching. A 100-layer model with 128 heads = 12,800 different relationship analyses for every token. That's why LLMs "get" code structure and conversational context.
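A stripped-down, single-head sketch of scaled dot-product attention in NumPy. The real mechanism adds learned projection matrices, 128 heads, masking, and positional information — this just shows the Q·K scoring and the weighted blend of Values.

```python
import numpy as np

def attention(Q, K, V):
    # Relevance score between every pair of tokens ("how much does A matter for B?")
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Softmax turns scores into weights that sum to 1 for each token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's new representation is a weighted blend of all Values
    return weights @ V, weights

tokens, d_model = 6, 8               # e.g. "The server crashed because it ran"
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(tokens, d_model)) for _ in range(3))

out, weights = attention(Q, K, V)
print(weights.round(2))              # row i: how token i spreads attention over the others
```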
Scale: Claude Opus 4.6 has ~200B+ parameters, GPT-5.4 ~300B+ (with MoE routing). Each parameter is a learned number. There's no database inside — just billions of numbers that collectively encode language, code, and reasoning.
📖 Go deeper: The Illustrated Transformer (Jay Alammar, updated with modern additions) — 3Blue1Brown: Attention in transformers (video, 2024) — Transformer Explainer (interactive, runs in browser)
The instruction card pinned to the wall. Always visible. Shapes everything the intern does.
"You are a senior TypeScript developer. Follow SOLID principles. Never use any type. Respond in Ukrainian when asked about documentation."
The intern reads this before every task. The same model can be a code reviewer, a writing assistant, and a customer support agent. Different wall cards, same brain.
Before LLMs, we didn't have an intern. We had a vending machine. Press button A3 → always get the same candy bar. getUser(42) → always { name: "Alex" }. Deterministic. Predictable. Testable.
Now we have the intern. Slide in the same question twice, get two different answers. Both correct, just phrased differently.
We went from programming vending machines to writing instructions for a brilliant but unpredictable intern. Different engineering practices apply.
Three stages, each giving the model different capabilities:
1. Pre-training — Read the internet. Trillions of tokens. For each token, predict the next one. Wrong? Adjust weights. Repeat billions of times on 10,000+ GPUs for months. Cost: $50M-$500M+. This is "read every book."
2. Fine-tuning — Pre-trained models complete any text, including toxic or wrong text. Fine-tuning on curated (prompt, response) pairs teaches the model to be a helpful assistant instead of a text-completion engine.
3. Alignment — Anthropic uses Constitutional AI — Claude has a published constitution (updated January 2026) that defines its values, with a 4-tier priority hierarchy: safety → ethics → compliance → helpfulness. The model generates synthetic training data from this constitution and self-critiques against it. OpenAI uses RLHF (humans rank outputs, model learns preferences) guided by their Model Spec. Both approaches teach "helpful without being harmful," but the methods keep evolving — newer techniques like RLTHF (Targeted Human Feedback) and Direct Preference Optimization (DPO) are supplementing classic RLHF.
Critical: After training, the weights are frozen. The model never learns from your conversations. When you "teach" it in a chat, you're putting info on the desk. Close the conversation → desk clears → model is exactly as before. It can't permanently learn your codebase — only temporarily, per session.
📖 Go deeper: Claude's Constitution (Anthropic, 2026) — Constitutional AI paper (foundational, 2022) — OpenAI Model Spec (current alignment spec) — OpenAI's original RLHF approach (2022)
🔬 Under the Hood: What Happens When You Press Send (Inference)
Your message → Tokenize → Assemble full context (system + history + tools + message)
→ Forward pass through all 100+ layers → First token generated (this is the initial delay)
→ Append token, run again with KV-cache → Next token → Repeat until done
Why there's a delay, then fast streaming: The first token requires processing the ENTIRE context through all layers. After that, the KV-cache stores previous computations — each subsequent token only needs a lightweight forward pass. This is why responses start slow and speed up.
Why output costs 5x more than input: Input tokens are processed in parallel (one pass for all of them). Output tokens are generated one-at-a-time (one pass PER token). That's why Claude charges $5/MTok input but $25/MTok output. Every word the model writes is a separate computation.
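Back-of-the-envelope math with the prices quoted above ($5/MTok in, $25/MTok out) — treat these as example numbers and check your provider's current price sheet.

```python
# Rough cost of one turn at the example prices above (verify current pricing).
PRICE_IN_PER_MTOK = 5.00
PRICE_OUT_PER_MTOK = 25.00

def turn_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * PRICE_IN_PER_MTOK \
         + (output_tokens / 1_000_000) * PRICE_OUT_PER_MTOK

# A code-review turn: big context in, modest review out
print(f"${turn_cost(input_tokens=9_000, output_tokens=800):.4f}")   # ≈ $0.065
```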
📖 Go deeper: LLM inference explained (Databricks) — KV-cache visualized
Paste this code into your conversation and ask "Review this for problems":
var user = await _repo.GetById(userId);
return new UserProfile {
Name = user.Name,
Email = user.Email,
LastLogin = user.Profile
.LastLoginDate
.ToString("yyyy-MM-dd")
};
There are null reference bugs hiding in there. See if the intern spots them.
Now send the exact same message a second time. The answer will be slightly different. That's temperature at work: the intern rolls dice on every word.
Tools & Function Calling
One day, a phone appears on the wall. The intern can't leave the room, but now they can call out for data, calculations, or actions.
tool_use → Host executes tool → Result slides back · Key: LLM NEVER executes. It only writes JSON. The host runs the tool. · Strict Mode forces valid JSON matching the tool's schema
Our intern is smart but isolated. Without a phone, they can't check today's weather, query your database, read a file on your machine, or touch any system outside the room.
They know everything that was in their training data, but nothing about right now.
You tape a list of phone numbers to the wall. Each entry describes a service the intern can call:
Name: "get_weather"
Description: "Get current weather for a city"
Parameters: { location: string (required) }
The intern reads the descriptions, decides if a call is relevant, and writes a request:
{ "tool": "get_weather", "arguments": { "location": "Kyiv" } }
They slide this request out through the slot. YOUR code makes the actual call, gets the weather data, and slides the result back in. The intern then uses that result to write their answer.
The intern NEVER leaves the room. They NEVER make the call themselves. They write "please call this number for me" on a slip. Your code does the actual work. Same on both Anthropic and OpenAI.
Tool calling is NOT a separate system — it's the same next-token prediction. The model generates structured JSON character by character, because it was fine-tuned to emit tool calls when appropriate. Tool definitions are injected into the context as text (eating desk space — 10 complex tools can cost 2,000-5,000 tokens before your question even arrives).
Strict mode is the real magic. Both platforms offer strict: true, which uses constrained decoding — at each token, a grammar mask sets the probability of all schema-violating tokens to zero. The model literally cannot produce invalid JSON. Your schema is compiled into a context-free grammar and enforced at every step. Cost: slightly more latency, but 100% valid output.
Stop reasons tell your code what happened: "end_turn" (done talking), "tool_use" (wants to call a tool), or "max_tokens" (ran out of space). Your code checks this, executes the tool, and sends the result back.
Tool calling is a "structured hallucination that happens to be useful." The model doesn't know it's calling a function. It generates tokens that happen to form valid JSON. Reliability is 99.9%+ with strict mode, but it's the same next-token prediction as writing a poem.
📖 Go deeper: Anthropic Tool Use docs — OpenAI Function Calling — Structured Outputs explained
Step 1: You → [tools + prompt] → Intern
"Here's a phone book and a question about weather"
Step 2: Intern → "I want to call get_weather('Kyiv')" → You
The intern writes a structured call request
Step 3: You → [execute, get result: "12°C, cloudy"] → Intern
YOUR code makes the call, slides the result back
Step 4: Intern → "It's 12°C and cloudy in Kyiv!" → You
The intern writes the final answer using the result
Anthropic calls it "Tool Use." OpenAI calls it "Function Calling." Same concept, same JSON Schema format for the phone book entries.
| Aspect | Anthropic (Claude) | OpenAI (GPT) |
|---|---|---|
| They call it | "Tool Use" | "Function Calling" |
| Call request | tool_use content block | function_call in message |
| Return format | tool_result message | function role message |
| Schema format | JSON Schema (strict mode) | JSON Schema (strict mode) |
| Built-in phones | Web search, bash, text editor, computer use, code execution | Web search, file search, code interpreter, shell |
| Key API | Messages API | Responses API |
Anthropic splits them into client tools (your code executes them — bash, text editor, computer use) and server tools (Anthropic executes them — web search, code execution).
OpenAI groups its built-in tools into five categories: web search, file search, shell, computer use, and code interpreter.
With a phone, the intern CAN check live data, run real calculations, query your systems, and trigger actions in the outside world.
The intern goes from "brain in a jar" to "brain with hands." Each hand needs to be wired up individually, which brings us to the next problem.
A single tool call requires a minimum of 2 HTTP requests:
REQUEST 1: You send [tools + "What's the weather in Kyiv?"]
RESPONSE 1: Claude returns stop_reason: "tool_use"
+ {name: "get_weather", input: {location: "Kyiv"}}
REQUEST 2: You send [ENTIRE conversation + tool_result: "12°C, cloudy"]
RESPONSE 2: Claude returns "It's 12°C and cloudy in Kyiv!"
Notice: Request 2 includes the FULL conversation history (amnesia — the desk clears between requests). The tool_use_id links each result to its call, supporting multiple parallel tool calls per turn.
Why agentic workflows are slow: 10 tool calls = 11 HTTP round-trips, each a full inference pass. A complex code review with file reads, PR diff, and ADO lookup might take 15-30 seconds across 4-5 round-trips. This is the main bottleneck in agent performance.
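Here's what those round-trips look like from the host side — a minimal sketch of the loop using the Anthropic Python SDK. The weather function is a stand-in, and the model name is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]

def get_weather(location: str) -> str:
    return "12°C, cloudy"            # stand-in for a real weather API call

messages = [{"role": "user", "content": "What's the weather in Kyiv?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",      # placeholder model name
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        print(response.content[0].text)            # final answer
        break

    # The model only WROTE a request; this code actually executes it.
    messages.append({"role": "assistant", "content": response.content})
    for block in response.content:
        if block.type == "tool_use":
            result = get_weather(**block.input)
            messages.append({"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,            # links the result to its call
                "content": result,
            }]})
```

Two HTTP requests minimum, with the full history resent on the second — exactly the pattern shown above.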
📖 Go deeper: Anthropic Messages API reference — OpenAI Responses API
Ask your AI:
What's the weather in Kyiv right now?
Watch what happens. If you see "Searching...", the intern just picked up the phone, dialed a weather service, and read back the result. That's the 4-step protocol running live.
If you get "I can't check real-time data," your intern has no phone on the wall. Same brain, zero tools.
MCP — The Universal Adapter
Instead of individual phone numbers, the building installs a standardized intercom. The intern writes requests and slides them under the door. The building routes them.
Remember when every phone brand had a different charger? Nokia one connector, Samsung another, Motorola a third. You needed a drawer full of cables.
That's where AI tools are without MCP: 5 AI tools × 10 services = 50 custom integrations, each built and maintained separately.
Instead of taping individual phone numbers to the wall, we install a standardized intercom system. Any service can plug into the intercom using the same standard connector.
With MCP: 5 AI tools + 10 services = 15 integrations. One protocol, one standard.
MCP stands for Model Context Protocol, an open standard created by Anthropic, now community-driven and adopted by OpenAI, Google, Microsoft, and hundreds of tool providers.
THE BUILDING (MCP Host: Claude Desktop, VS Code, your app)
└── INTERCOM PANEL ON THE WALL (MCP Client: protocol handler)
├── LINE → GitHub Server
├── LINE → Azure DevOps Server
├── LINE → PostgreSQL Server
└── LINE → Your Custom API Server
└── THE ROOM (LLM — the intern inside)
| Button | What it does | Analogy |
|---|---|---|
| Tool | "Do something" | Menu items you can order |
| Resource | "Read something" | The wine list / specials board |
| Prompt | "Give me a template" | "Chef's recommendation" combo |
Button 1 — Do Something (Tool Call):
The intern writes a note: "Search issues for 'bug login'." They slide it under the door. The building routes it through the intercom to the right server, which executes and returns results back through the slot.
Button 2 — Read Something (Resource):
The intern writes: "Read file config.yaml." Same process — note goes out, data comes back. Read-only.
Button 3 — Give Me a Template (Prompt):
The intern writes: "Give me the code review template for TypeScript." The server sends back a structured prompt ready to fill in.
Two ways the intercom works: a local line (stdio — the host launches the server as a child process on the same machine) and a remote line (HTTP+SSE — the client reaches a server over the network).
Both speak the same MCP "language" — JSON-RPC 2.0.
MCP is JSON-RPC 2.0 with a defined lifecycle:
CLIENT SERVER
│── initialize ──────────────▶│ (exchange capabilities)
│◀── result ──────────────────│ "I have tools + resources"
│── tools/list ──────────────▶│ (discover available tools)
│◀── result ──────────────────│ [{name: "search_issues", inputSchema: {...}}]
│── tools/call ──────────────▶│ (LLM wants to call a tool)
│◀── result ──────────────────│ {content: [{type: "text", text: "Found 3 issues..."}]}
Capability negotiation happens at initialize — both sides declare what they support. This makes MCP forward-compatible: old clients gracefully ignore new server capabilities.
stdio transport: Claude Desktop spawns a child process (npx @modelcontextprotocol/server-github), sends JSON-RPC on stdin, reads responses from stdout. Fast (no network), but local-only. Process dies when the host closes.
HTTP+SSE transport: Client POSTs requests to a URL. Server responds inline or opens an SSE stream for long operations. Supports session management via Mcp-Session-Id header. This enables enterprise setups — a central MCP server with auth, rate limiting, and audit logging shared by the whole team.
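To make the intercom concrete, here's a minimal MCP server sketch using the official Python SDK's FastMCP helper. Treat the import path, decorators, and resource URI as assumptions — check the SDK docs for the current API.

```python
# Minimal MCP server sketch (assumes the official `mcp` Python SDK and its FastMCP helper).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("issue-tracker")

@mcp.tool()
def search_issues(query: str) -> str:
    """Search issues by keyword."""
    return f"Found 3 issues matching {query!r}"   # stand-in for a real lookup

@mcp.resource("config://app")
def read_config() -> str:
    """Read-only application config."""
    return "retries: 3\ntimeout: 30s"

if __name__ == "__main__":
    mcp.run()   # speaks JSON-RPC over stdio by default; the host spawns this process
```

The same server, unchanged, can sit behind Claude Desktop, VS Code, or a custom host — that's the point of the standard connector.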
📖 Go deeper: MCP specification — Build your first MCP server (tutorial) — MCP TypeScript SDK
Way 1: Build the intercom yourself (full control)
# Connect to MCP server, get tools, convert to Claude format
mcp_tools = await mcp_session.list_tools()
claude_tools = [{"name": t.name, "description": t.description,
"input_schema": t.inputSchema} for t in mcp_tools.tools]
# Pass to Claude API
response = client.messages.create(tools=claude_tools, ...)
Way 2: Use the MCP Connector (Anthropic's "easy button")
# Just point Claude at the MCP server URL — no client code needed
response = client.messages.create(
mcp_servers=[{"type": "url", "url": "https://mcp.example.com/sse"}],
...
)
OpenAI supports MCP natively in both the Agents SDK and the Responses API.
| Server | What it does | Relevance |
|---|---|---|
| GitHub MCP | Repos, issues, PRs, code search | Code review, PR automation |
| Azure DevOps MCP | Work items, sprints, pipelines | Sprint management from AI |
| PostgreSQL MCP | Query databases | Data exploration |
| Filesystem MCP | Read/write local files | Document processing |
| Slack MCP | Send/read messages | Team notifications |
| Google Drive MCP | Read/search docs | Knowledge base access |
| Use MCP when... | Use direct tools when... |
|---|---|
| One integration, multiple AI platforms | Building for a single platform |
| Exposing your API to AI clients you don't control | Maximum control over tool behavior |
| Tool server and AI client are separate concerns | Tool is tightly coupled to your app |
| You want to share/reuse tool servers | One-off internal function |
| You need the ecosystem (registry, discovery) | Simplicity over portability |
MCP is REST for AI tools. Just as REST standardized web APIs, MCP standardizes AI-tool connections. Build once, works everywhere.
The MCP client is a translator between two protocols:
MCP Server → tools/list → {name, description, inputSchema}
↓ (MCP client translates)
Claude API → tools: [{name, description, input_schema}] ← tokens on desk
↓ (LLM generates tool_use)
MCP Client → tools/call → {name, arguments} ← translated back to MCP
The server doesn't know or care if the host is Claude, GPT, or a custom app — it only speaks MCP. The client handles translation both ways. This is why MCP is platform-agnostic.
Anthropic's MCP Connector moves the client to their servers: you pass mcp_servers: [{url: "..."}] in the API call and Anthropic handles the MCP connection, tool discovery, and result injection. Zero MCP code on your side. Tradeoff: less control.
📖 Go deeper: MCP Connector (Anthropic docs) — MCP server registry · MCP Registry
Count your team's tools. Here's an example:
AI tools: Claude, ChatGPT, Copilot = 3
Services: GitHub, Azure DevOps, Slack, DB,
Jira, email, CI/CD = 7
Without MCP: 3 × 7 = 21 integrations
With MCP: 3 + 7 = 10 connectors
Plug in your real numbers. The gap between "times" and "plus" is how much wiring MCP saves.
Skills
Skills are instruction manuals loaded on demand. Hand the intern the right manual when they need it. Keeps the desk clean, the intern focused.
You gave the intern a phone (tools). They can call the weather API, search GitHub, query your database. But knowing how to use a drill doesn't mean you know how to build a kitchen.
A tool is a capability. A skill is knowledge of when and how to use capabilities, with all the domain context that makes the result professional.
A Skill is a directory on the intern's desk containing an instruction manual:
my-skill/
├── SKILL.md ← Required: instructions + metadata
├── scripts/ ← Optional: executable code
│ └── validate.py
├── templates/ ← Optional: reference templates
│ └── report.docx
└── assets/ ← Optional: images, data files
└── logo.png
The SKILL.md file has two parts:
---
name: quarterly-report
description: >
Generate quarterly business reports following company
template. Use when user asks for quarterly reports.
---
# Quarterly Report Generation
## Steps
1. Read the template from templates/report.docx
2. Extract financial data from provided spreadsheet
3. Generate executive summary
4. Fill in each section following the template structure
5. Validate formatting with scripts/validate.py
## Rules
- Always use company color scheme (#2E75B6 primary)
- Financial figures must include YoY comparison
- Executive summary must be under 200 words
The intern doesn't load all manuals upfront. That would fill the desk. Instead:
STAGE 1: System prompt loads ONLY metadata
(name + description for each installed skill)
≈ 50 tokens per skill
STAGE 2: User message triggers a skill match
The intern reads just THAT SKILL.md
≈ hundreds of tokens
STAGE 3: Skill loads reference files, scripts,
templates AS NEEDED
≈ thousands of tokens, loaded incrementally
Like a table of contents in a manual — the intern first sees chapter titles. When they need Chapter 7, they read just that chapter. They never load the whole manual upfront.
Why this matters for desk economics: twenty installed skills cost only ~1,000 tokens of always-on metadata. The full manuals — hundreds of tokens of instructions, thousands of tokens of templates and scripts — hit the desk only when a task actually needs them.
No magic. Context window engineering:
1. System prompt includes only skill names + descriptions (~50 tokens each). The actual SKILL.md files are NOT loaded yet.
2. Matching is the model reading those descriptions. When you say "Review PR #847," the model decides code-review is relevant. No separate matching engine, no vector search, just next-token prediction.
3. Loading is a tool call: Read("skills/code-review/SKILL.md"). Now the full instructions are on the desk.
4. Assets load on-demand: if SKILL.md says "read the template," the model makes another tool call.
No special infrastructure. Skills = files. Loading = tool calls. Matching = the model reading text. Prompt engineering + file I/O.
The tradeoff: 1-3 extra tool calls before actual work starts (1-6 seconds of latency). Noticeable for quick questions, negligible for a 10-minute code review.
When skills conflict: If two match, both SKILL.md files land on the desk. The model follows both, interleaving steps. The model resolves contradictions by judgment, which is why writing clear, non-overlapping descriptions matters.
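A sketch of the "no magic" claim: Stage 1 is just reading each skill's front matter and pasting name + description into the system prompt. The file layout and field names follow the SKILL.md example above; the YAML parsing is one assumed way a host might implement it.

```python
# Stage 1: load ONLY name + description from each SKILL.md's front matter.
from pathlib import Path
import yaml   # assumes PyYAML is installed

def skill_metadata(skill_dir: Path) -> dict:
    text = (skill_dir / "SKILL.md").read_text()
    front_matter = text.split("---")[1]        # between the first pair of --- fences
    return yaml.safe_load(front_matter)

def build_skill_index(skills_root: Path) -> str:
    lines = ["Available skills (read the full SKILL.md only when one matches):"]
    for skill_dir in sorted(p for p in skills_root.iterdir() if p.is_dir()):
        meta = skill_metadata(skill_dir)
        lines.append(f"- {meta['name']}: {meta['description'].strip()}")
    return "\n".join(lines)                    # ~50 tokens per skill in the system prompt

print(build_skill_index(Path("skills")))
# Stages 2-3 are ordinary tool calls: Read("skills/<name>/SKILL.md"), then its templates.
```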
📖 Go deeper: Agent Skills docs (Anthropic) — Skills open standard — Skills engineering blog post
Skills were created by Anthropic in October 2025 and published as an open standard in December 2025. By late 2025, OpenAI had adopted them — in Codex, a skill can be invoked explicitly with a $skill-name trigger.
Both platforms now support Skills + MCP + Tools. The ecosystem is standardizing.
Ask the intern to create a PowerPoint WITHOUT skills → generic white slides, bullet points, boring layout.
Ask the same thing WITH the pptx skill loaded → professional design, branded colors, proper typography, icon usage, visual QA verification.
The SKILL.md that made the difference is a markdown file with instructions, built from hundreds of trial-and-error iterations.
| Step | What happens | Desk size |
|------|-------------|-----------|
| 1 | Context assembled: system prompt + skill metadata + MCP tool definitions + your message | ~2,260 tokens |
| 2 | Model reads skill descriptions, decides code-review matches. Calls Read("skills/code-review/SKILL.md") | +800 → ~3,060 |
| 3 | Skill says "Read the PR diff." Model calls GitHub MCP: get_pull_request_files(847) | +5,000 → ~8,060 |
| 4 | Skill says "Read the ADO work item." Model calls ADO MCP: get_work_item(4521) | +500 → ~8,560 |
| 5 | Skill says "Check each file against the rules." Model generates structured review. | Output: ~800 tokens |
| Total | 4 forward passes, ~9,360 tokens processed, 15-30 seconds, ~$0.05-0.10 | |
You can watch each tool call in Claude Desktop. No black box. The skill improved the review by adding procedure: "check THIS list, in THIS order, format the output THIS way."
Start a new conversation and paste this code. Just say "Review this code":
public async Task<PagedResult<Product>> GetProducts(
string cursor, int limit)
{
var products = await _db.Products
.OrderBy(p => p.Id)
.Where(p => p.Id > cursor)
.Take(limit)
.ToListAsync();
var nextCursor = products.Last().Id;
return new PagedResult<Product>
{
Items = products,
NextCursor = nextCursor,
Limit = limit
};
}
You'll get a decent review. Generic, though. No team standards, no severity levels, no checklist. Save this result. You'll redo this exercise later with a skill, and the difference will be obvious.
All Five Layers
Every layer builds on the one below: LLM → Tools → MCP → Skills → Agent. Together, an AI that gets work done.
while(!done) { think → act → observe }
| Concept | Intern Analogy | What it does |
|---|---|---|
| LLM | The intern's brain | Thinks, reasons, predicts |
| Context Window | The intern's desk | Working space for current task |
| Token | Paper slips through the slot | Units of communication |
| System Prompt | Wall card with instructions | Persistent behavior rules |
| Tool | A phone number | Single capability: "call this service" |
| MCP | The intercom system | Universal way to connect any tool to any room |
| Skill | An instruction manual | "When doing X: use these tools in this order, follow these rules" |
| Agent | The intern with building keys | Uses brain + phone + manuals to complete tasks autonomously |
| AGENTS.md / CLAUDE.md | House rules in the lobby | "Before any work in this building, read these rules" |
These are layers, not alternatives:
Agent (orchestration)
└── uses Skills (procedural knowledge)
└── which reference Tools (capabilities)
└── connected via MCP (standard protocol)
└── feeding the LLM (intelligence)
└── within Context Window (working memory)
Level 1: Paper Slips (Chat mode)
Intern in the room. Notes in, notes out. No tools.
"Format this." "Explain this error." "Write a regex."
Everyone does this already.
Level 2: Intercom Access (Professional + MCP)
Intern has the intercom. Calls services when you ask.
"Analyze Raygun errors." "Create ADO tasks from this spec."
Where most teams are today.
Level 3: Keys to the Building (Agent mode)
Intern leaves the room. Picks tasks from the board, uses the intercom, handles errors, reports back.
"Monitor deploys, rollback if tests fail." "Auto-review incoming PRs."
The frontier.
All three levels stay relevant — Level 1 for quick questions, Level 2 for daily work, Level 3 for automation.
Time to build your first skill. Open any text editor, create a file called code-review-skill.md, and paste this:
# Code Review Skill
## Description
Review code for common issues.
## Steps
1. Check for null safety
2. Check error handling
3. Check naming conventions
4. Check validation & tests
## Checklist
- Null safety
- Error handling
- Naming conventions
- Input validation
- Tests coverage
- No hardcoded values
- Logging
- Security
## Output Format
- MUST FIX: critical issues
- SHOULD FIX: improvements
- CONSIDER: suggestions
That took about 3 minutes. You just gave your intern a manual it can follow every time.
Deterministic vs Hybrid
A vending machine always gives the same thing. The intern might surprise you. Combine both: the sandwich pattern.
The old world was a vending machine (deterministic). The new world is an intern (non-deterministic). You don't have to choose one. The best systems combine both.
DETERMINISTIC HYBRID NON-DETERMINISTIC
(vending machine) (intern + checklist) (intern freestyle)
│ │ │
if/else, regex, LLM decides WHAT to do, "Figure it out, here's
lookup tables, code enforces HOW it's done the goal, good luck"
SQL queries
│ │ │
Predictable Best of both worlds Creative, flexible
Testable Controlled creativity Unpredictable
Brittle Robust Hard to test
| Situation | Approach | Why |
|---|---|---|
| Always the same logic, never changes | Deterministic | Don't use an intern to press a button. Use code. |
| Structured input → structured output, but logic is complex | Hybrid | Let the intern reason, but constrain the output with schemas and validation. |
| Ambiguous input, creative output, requires judgment | Non-deterministic | This is what the intern is built for. |
| Safety-critical, auditable, regulated | Deterministic (or hybrid with heavy guardrails) | You can't explain to an auditor that "the intern thought it was fine." |
| Evolving requirements, new edge cases constantly | Hybrid / Non-deterministic | The intern adapts without code changes. Vending machines need reprogramming. |
Scenario 1: Parsing ADO work item into subtasks
❌ Pure non-deterministic: "Read this work item and create subtasks." The intern creates random subtask structures every time — different formats, inconsistent estimation scales, sometimes forgets fields.
❌ Pure deterministic: Regex-parse the work item description, extract keywords, create predefined subtask templates. Breaks the moment someone writes the description differently.
✅ Hybrid:
DETERMINISTIC: ADO API fetches work item fields (title, description, acceptance criteria)
NON-DETERMINISTIC: Intern reads the content, reasons about subtask decomposition
DETERMINISTIC: Output schema enforces {title, description, estimate, type} per subtask
DETERMINISTIC: Validation rejects estimates outside [0.5h, 16h] range
DETERMINISTIC: ADO API creates the subtasks
The intern does the thinking. The code does the structure enforcement. You get creative decomposition with predictable output format.
Scenario 2: Code review
❌ Pure deterministic: Linters, static analysis. Catches syntax issues and known patterns. Misses logic bugs, architectural problems, and "this doesn't match how we do things."
❌ Pure non-deterministic: "Review this code." Different results every time. Sometimes catches 10 issues, sometimes 3. No consistency in severity labeling.
✅ Hybrid:
DETERMINISTIC: Linter runs first (ESLint, StyleCop) → catches all mechanical issues
NON-DETERMINISTIC: Intern reviews for logic, architecture, patterns (with a skill!)
DETERMINISTIC: Skill enforces output format (MUST FIX / SHOULD FIX / CONSIDER)
DETERMINISTIC: CI gate — if "MUST FIX" count > 0, block merge
Linter handles what it's good at (100% reliable, instant). The intern handles what requires judgment. The skill constrains the output. The CI gate makes the decision deterministic.
Scenario 3: Test case generation
❌ Pure deterministic: Template-based test generation from API spec. Covers happy path and basic validation. Misses implementation-specific edge cases entirely.
❌ Pure non-deterministic: "Write tests." Inconsistent coverage, sometimes over-tests trivial things, sometimes misses critical paths.
✅ Hybrid:
DETERMINISTIC: Parse OpenAPI spec → generate all endpoint/method/status combinations
NON-DETERMINISTIC: Intern reads the actual code → adds edge cases from implementation
DETERMINISTIC: Merge both lists, deduplicate, validate against test schema
DETERMINISTIC: Convert to Playwright/Jest tests using deterministic templates
NON-DETERMINISTIC: Intern reviews final tests for completeness
Scenario 4: Incident response
❌ Pure deterministic: Alert → PagerDuty → human investigates. Works, but slow. The human is the bottleneck at 3am.
❌ Pure non-deterministic: "Investigate and fix this." Absolutely not for production. The intern might decide a rollback is fine when it isn't.
✅ Hybrid:
DETERMINISTIC: Alert triggers, error data collected automatically
NON-DETERMINISTIC: Intern analyzes logs, identifies probable cause, drafts fix
DETERMINISTIC: Fix goes through existing PR review + CI pipeline
DETERMINISTIC: Deploy requires human approval (the intern proposes, human disposes)
The intern accelerates investigation from 45 minutes to 5 minutes. But the human makes the final call on what ships to production.
Use the intern for reasoning. Use code for enforcement.
┌─────────────────┐
User input ───▶ │ DETERMINISTIC │ Validate input, fetch data
│ (your code) │
└────────┬────────┘
│
┌────────▼────────┐
│ NON-DETERMINISTIC│ Reason, analyze, decide, generate
│ (the intern) │
└────────┬────────┘
│
┌────────▼────────┐
│ DETERMINISTIC │ Validate output, enforce schema,
│ (your code) │ apply business rules, execute action
└─────────────────┘
This is the "sandwich pattern": deterministic bread on both sides, non-deterministic filling in the middle. Your code ensures clean input and safe output. The intern does the thinking in between. Production AI systems work this way, and your Skills should be designed this way.
The strict: true parameter in tool definitions and structured outputs makes the hybrid approach practical. Without it, the intern might return JSON with missing fields, wrong types, or extra properties, and your deterministic validation layer has to handle infinite edge cases. With strict mode, the intern's output is guaranteed to match your schema. The grammar-constrained decoding ensures every token is valid. "Hopefully valid JSON" becomes "provably valid JSON."
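A compressed sketch of the sandwich from Scenario 1: deterministic code on both sides of a single model call, with the schema enforced by forcing a tool call and a hard range check applied afterwards. The tool name, schema fields, and surrounding ADO plumbing are illustrative, not a real SDK.

```python
import anthropic

client = anthropic.Anthropic()

SUBTASK = {"type": "object",
           "properties": {"title": {"type": "string"},
                          "description": {"type": "string"},
                          "estimate_hours": {"type": "number"},
                          "type": {"type": "string"}},
           "required": ["title", "description", "estimate_hours", "type"]}

def decompose(work_item: dict) -> list[dict]:
    # BREAD 1 (deterministic): `work_item` was fetched and validated by your ADO client.
    response = client.messages.create(
        model="claude-opus-4-6",                       # placeholder model name
        max_tokens=2048,
        tools=[{"name": "emit_subtasks",
                "description": "Emit the proposed subtasks",
                "input_schema": {"type": "object",
                                 "properties": {"subtasks": {"type": "array", "items": SUBTASK}},
                                 "required": ["subtasks"]}}],
        tool_choice={"type": "tool", "name": "emit_subtasks"},   # force structured output
        messages=[{"role": "user",
                   "content": f"Decompose this work item into subtasks:\n{work_item}"}],
    )
    subtasks = next(b.input["subtasks"] for b in response.content if b.type == "tool_use")

    # BREAD 2 (deterministic): business rules enforced by code, not by the model.
    for task in subtasks:
        if not 0.5 <= task["estimate_hours"] <= 16:
            raise ValueError(f"Estimate out of range: {task}")
    return subtasks
```

The filling reasons; the bread validates. Swap the final return for your ADO create-subtask calls and the whole flow stays auditable.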
📖 Go deeper: Anthropic Structured Outputs — OpenAI Structured Outputs
Open a new conversation. Paste any failing test output you have lying around, then add this at the end:
Your response MUST be valid JSON:
{
"test_name": "...",
"root_cause": "...",
"fix": "...",
"confidence": "HIGH | MEDIUM | LOW"
}
Respond with ONLY JSON.
The intern thinks creatively about the bug, but the output lands in a strict schema you can parse with code. Bread on both sides, creative filling in the middle.
Codex vs Claude Code
Codex is the delegating boss: hand off a task, come back for the result. Claude Code is pair programming: talk it through together.
Both Codex (OpenAI) and Claude Code (Anthropic) are coding agents. They give the intern tools and let them write code. They represent two different management philosophies.
Philosophy: "Here's the task. I'll check in later."
You write a task on a card: "Add OAuth login with password reset." You slide it under the intern's door and walk away. The intern works in a soundproof office (cloud sandbox), writes code, runs tests, and when done, puts a finished PR on your desk for review.
┌──────────────────────────────────────────────────┐
│ CODEX PLATFORM │
├──────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ Codex CLI │ │ IDE Plugin │ │ Codex │ │
│ │ (terminal)│ │ (VS Code) │ │ Cloud │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬─────┘ │
│ └───────────┬───┘───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────┐ │
│ │ RESPONSES API (unified API) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ GPT-5.4 MODEL │ │ │
│ │ │ Dynamic reasoning: auto-routes │ │ │
│ │ │ none → low → med → high → xhigh │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ Built-in tools: │ │
│ │ Web search, file search, shell, │ │
│ │ computer use (native!), code interpreter │ │
│ │ │ │
│ │ External: MCP servers, Skills, AGENTS.md │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Conversation state / Compaction / Prompt cache │
└──────────────────────────────────────────────────┘
Key architectural decisions:
GPT-5.4 is a convergence model. OpenAI merged separate model families (reasoning models like o3/o4-mini and coding models like Codex) into ONE model that dynamically adjusts how much it "thinks." Simple questions use the fast path; complex debugging uses deep reasoning. Under the hood, GPT-5.4 contains effectively three models in one (main, mini, nano tiers) with a built-in router.
The Codex platform offers three surfaces — CLI, IDE, Cloud — all hitting the same API. Codex Cloud is the most differentiated: you delegate entire features to a cloud sandbox. The agent clones your repo, creates a branch, makes changes, runs tests, opens a PR — all without you watching.
Philosophy: "Walk me through your thinking."
You sit next to the intern's desk. They think out loud, show you their reasoning at each step, ask "should I go left or right here?" You steer in real-time.
┌──────────────────────────────────────────────────┐
│ CLAUDE CODE PLATFORM │
├──────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ Claude Code│ │ VS Code / │ │ Claude.ai │ │
│ │ CLI │ │ JetBrains │ │ (Cowork) │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬─────┘ │
│ └───────────┬───┘───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────────┐ │
│ │ MESSAGES API (unified API) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ CLAUDE OPUS 4.6 MODEL │ │ │
│ │ │ Extended Thinking (visible!) │ │ │
│ │ │ Interleaved thinking (think between │ │ │
│ │ │ tool calls — Claude 4+ feature) │ │ │
│ │ │ Adaptive: low → medium → high │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ Built-in tools (5 categories): │ │
│ │ Read, Write, Execute, Browse, Orchestrate │ │
│ │ │ │
│ │ External: MCP servers, Skills, CLAUDE.md │ │
│ │ Hooks (pre/post tool execution) │ │
│ │ Agent Teams (parallel instances) │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Compaction / Context editing / Prompt caching │
└──────────────────────────────────────────────────┘
Key architectural decisions:
Claude Code's philosophy is an "agentic harness." The model is the brain. Claude Code is the body. Each task cycles through three phases — gather context → take action → verify results — repeated in a loop.
Interleaved thinking is the differentiator. Claude 4+ models think between tool calls. After getting a tool result, the model reasons about it BEFORE deciding what to do next. Earlier models had to finish all thinking before tool use.
Agent Teams (Feb 2026): Spawn 4-16 Claude instances working in parallel on a shared codebase, each with a different role. A lead agent coordinates. The C compiler experiment: 16 agents, 2,000 sessions, 2 billion input tokens, $20,000 → 100,000 lines of production Rust code.
Codex's Loop:
You: "Fix the auth bug in the login module"
│
▼
1. LOAD CONTEXT
System prompt + AGENTS.md + Skills matched
Reasoning level auto-selected (detects "hard" → high)
│
▼
2. PLAN (reasoning tokens — internal)
"I need to find the login module, read the auth flow,
identify the bug, write a fix, run tests."
[GPT-5.4 auto-decides depth per turn. Simple fix?
Nano-tier. Race condition? Full reasoning.]
│
▼
3. EXECUTE (tool calls)
→ shell: find . -name "*.ts" | grep "login"
→ file_read: src/auth/login.ts
→ file_edit: src/auth/login.ts (applies patch)
→ shell: npm test -- --grep "login"
[If context fills → COMPACTION → continues seamlessly]
│
▼
4. VERIFY
Tests pass? → Done. Open PR.
Tests fail? → Loop back to step 2.
[Can iterate 7-24 hours autonomously]
Claude Code's Loop:
You: "Fix the auth bug in the login module"
│
▼
1. GATHER CONTEXT
System prompt + CLAUDE.md + Skills matched
Git state, project structure scanned
│
▼
2. THINK (extended thinking — VISIBLE to you)
<thinking>
The user wants me to fix an auth bug.
Let me search for login-related files...
</thinking>
│
▼
3. ACT + THINK (interleaved!)
→ grep: "login" across project
← results
<thinking>Found 3 files. session.ts looks relevant...</thinking>
→ view: src/auth/session.ts
← file content
<thinking>I see the bug — token not being refreshed...</thinking>
→ edit: src/auth/session.ts
→ bash: npm test
← 2 tests fail
<thinking>Need to fix the mock setup...</thinking>
→ edit: test/auth.test.ts
This THINK-ACT-THINK-ACT pattern is unique to Claude 4+.
│
▼
4. VERIFY
Tests pass? → Show diff for approval.
Tests fail? → Adjusts approach.
You can interrupt ANY time to steer.
GPT-5.4: The Auto-Router
GPT-5.4 contains a built-in intelligence router:
User query arrives
│
┌────┴──────────┬────────────┐
▼ ▼ ▼
┌──────┐ ┌──────────┐ ┌────────┐
│ NANO │ │ MINI │ │ FULL │
│ tier │ │ tier │ │ reason │
│ │ │ │ │ │
│93.7% │ │ Moderate │ │ 2x more│
│fewer │ │ tasks │ │thinking│
│tokens│ │ │ │tokens │
└──────┘ └──────────┘ └────────┘
Bottom 10% of turns use 93.7% fewer tokens. Top 10% think 2x longer. You can override with reasoning.effort: none/low/medium/high/xhigh.
Claude Opus 4.6: Adaptive Thinking
User query arrives
│
▼
┌─────────────────────────┐
│ ADAPTIVE THINKING │
│ (model reads context │
│ to decide budget) │
└───────┬─────────────────┘
│
▼
┌─────────────────────────┐
│ EXTENDED THINKING BLOCK │
│ <thinking> │
│ Visible to developers │
│ Interleaved with tools│
│ </thinking> │
│ │
│ Previous thinking blocks│
│ STRIPPED from future │
│ turns (saves desk space)│
│ EXCEPT during tool-use │
│ cycles │
└─────────────────────────┘
Control with /effort: low/medium/high. At medium effort, Opus 4.6 matches Sonnet 4.5's SWE-bench with 76% fewer output tokens.
The practical difference:
| What you see | Codex | Claude Code |
|---|---|---|
| While it "thinks" | Spinner, result appears when ready | Thinking content visible in real-time |
| After a tool call | Next action appears | Thinking block shows WHY it chose that action |
| Debugging AI behavior | Tracing dashboard | Read the thinking blocks directly |
| Steering mid-task | Interrupt with new instructions | Interrupt + see what it was thinking |
Codex is a senior engineer who goes away, works on the problem, and comes back with a solution. Claude Code is a pair programmer who narrates their thinking out loud. Both produce excellent code. Claude's approach is more transparent during execution.
Both platforms independently solved the same problem: how to work on tasks that exceed the context window.
A complex refactor might use 115,000+ tokens of context, growing with every turn. Without compaction, the agent hits the desk limit and stops.
The solution (both platforms):
Turns 1-15: Normal operation, desk fills up
│
▼
Turn 16: COMPACTION TRIGGERED
The intern summarizes their own work:
"I'm refactoring the auth module.
Done: fixed session.ts, updated 3 tests.
Next: fix remaining test mock."
│
▼
Turn 17: FRESH DESK
[system + summary + current files]
Intern continues where they left off.
Can repeat multiple times.
The intern writes commit messages for their own work. When the desk gets cluttered, they write a summary, clear the desk, and keep going. The AI equivalent of going home, sleeping, and picking up from your notes in the morning.
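A rough sketch of how a host might implement this. The token threshold, the summarization prompt, and the model name are all assumptions — real agents (and the platforms' own server-side compaction) are more careful about what they keep.

```python
import anthropic

client = anthropic.Anthropic()
COMPACT_AT_TOKENS = 150_000        # assumed threshold, safely below the desk limit

def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(str(m["content"])) // 4 for m in messages)   # crude ~4 chars/token

def maybe_compact(messages: list[dict]) -> list[dict]:
    if estimate_tokens(messages) < COMPACT_AT_TOKENS:
        return messages
    summary = client.messages.create(
        model="claude-opus-4-6",   # placeholder model name
        max_tokens=1024,
        messages=messages + [{
            "role": "user",
            "content": "Summarize this session: the goal, what is already done, what remains.",
        }],
    ).content[0].text
    # Fresh desk: the summary replaces the whole transcript.
    return [{"role": "user", "content": f"Resume from this progress note:\n{summary}"}]
```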
| Aspect | Codex | Claude Code |
|---|---|---|
| Training | Natively trained for compaction since 5.1-Codex-Max | Server-side compaction |
| Max session | 24+ hours continuous | 2,000 sessions / 2 weeks (C compiler) |
| Trigger | Automatic | Automatic + manual |
Codex: Experimental multi-agent. Launch parallel tasks, each runs in Codex Cloud sandbox. Results merge via PR workflow.
Claude Code Agent Teams (more mature):
Lead Intern (you interact with this one)
│
├── Intern 1: "Analyze auth module"
│ └── Own desk, own tools, own thinking
│
├── Intern 2: "Analyze database layer"
│ └── Works in parallel with Intern 1
│
├── Intern 3: "Analyze API endpoints"
│ └── Independent desk and tools
│
└── Intern 4: "Cross-reference findings"
└── Reads output from 1-3, synthesizes
All share the same codebase (building).
Each has their own desk (context window).
Lead intern coordinates and synthesizes.
The C Compiler benchmark: 16 interns, 2 weeks, $20,000, 100,000 lines of Rust that compiles the Linux kernel. One intern "cheated" by calling GCC when it couldn't solve a problem. Very intern behavior.
| Dimension | GPT-5.4 (Codex) | Claude Opus 4.6 |
|---|---|---|
| Context window | 1M tokens | 200K (1M beta) |
| Reasoning control | 5 levels + auto-router | 3 levels + adaptive |
| Thinking visibility | Becoming transparent | Fully visible |
| Computer use | Native in 5.4 | Tool-based |
| Pricing (per MTok) | $2.50 in / $15 out | $5 in / $25 out |
| Max sustained session | 24+ hours | 2,000 sessions / 2 weeks |
| Platform Feature | Codex | Claude Code |
|---|---|---|
| CLI | Yes (open source) | Yes (open source) |
| IDE | VS Code | VS Code, JetBrains |
| Cloud/async | Codex Cloud | Cowork |
| Multi-agent | Experimental | Agent Teams (mature) |
| MCP support | Native | Native + MCP Connector |
| Skills | Adopted the standard | Created the standard |
| Project config | AGENTS.md | CLAUDE.md |
| Approval modes | Suggest / auto / full auto | Ask / auto / YOLO |
| GitHub integration | Deep native | Via MCP |
OPENAI (7 models in 6 months):
──────────────────────────────────
Sep 2025: GPT-5-Codex (first dedicated)
Oct 2025: GPT-5-Codex-Mini (cost-efficient)
Nov 2025: GPT-5.1-Codex (improved)
Nov 2025: GPT-5.1-Codex-Max (first compaction)
Dec 2025: GPT-5.2-Codex (long-horizon)
Feb 2026: GPT-5.3-Codex (most capable coding)
Mar 2026: GPT-5.4 (converged model)
ANTHROPIC (fewer releases, bigger jumps):
──────────────────────────────────
May 2025: Claude 4 (interleaved thinking)
Sep 2025: Claude Sonnet 4.5 (top SWE-bench)
Oct 2025: Agent Skills launched
Feb 2026: Claude Opus 4.6 (Agent Teams, 1M context)
| Your Situation | Recommendation | Why |
|---|---|---|
| Quick interactive coding | Either | Comparable experience |
| Async/background features | Codex Cloud | Built for delegation |
| Need to see AI reasoning | Claude Code | Visible thinking blocks |
| Large parallel refactoring | Claude Code Agent Teams | Mature multi-agent |
| Tight budget, high volume | Codex (GPT-5.4) | Lower per-token pricing |
| Windows development | Codex | Specific Windows support |
| GitHub-native workflow | Codex | Deep GitHub/Slack integration |
| Building MCP toolchains | Claude Code | Created MCP, MCP Connector |
| Already in Claude ecosystem | Claude Code | Seamless with Cowork, Skills |
| Already in OpenAI ecosystem | Codex | Seamless with AgentKit, Evals |
Try both management styles on the same piece of code. First, the delegator:
"Fix the null safety issues in this code.
Return only the corrected code."
You get a result. Clean, fast, done. Now try the pair-programmer:
"Let's review this code together.
What potential issues do you see?
Walk me through your thinking."
You get a conversation. The intern explains its reasoning, catches things you might question, and you learn along the way. Which fits how you work?
OpenClaw
What if you could text your intern on WhatsApp? OpenClaw puts AI on messaging platforms. Quick and accessible, but think about security.
Someone connected the intercom system (MCP), the instruction manuals (Skills), and the intern's brain (LLM) to a WhatsApp/Telegram/Slack interface. Now you can text the intern from your phone.
"Hey, what broke in production?" The intern uses the intercom to check Raygun, reads K8s logs, searches GitHub, and texts you back. 24/7, from your phone, while you're at breakfast.
| Analogy | OpenClaw | Anthropic | OpenAI |
|---|---|---|---|
| The intern | LLM | Claude | GPT |
| The intercom | Gateway | MCP Client | Codex runtime |
| Instruction manuals | Skills (SKILL.md) | Agent Skills | Custom functions |
| Manual library | ClawHub | MCP marketplace | — |
| Intern's notes | Markdown on disk | Conversation history | Conversation history |
| Your phone | WhatsApp/TG/Slack | Claude Desktop | ChatGPT app |
The analogy gets darker here: the intern now has the keys to the building. Make sure the building has alarms.
Imagine your team wants to connect AI to Slack. Before you say yes, score each risk from 1 to 5:
[ ] API keys exposed in transit
[ ] Message history stored by 3rd party
[ ] No role-based access control
[ ] AI sees sensitive client data
[ ] No audit log of AI actions
Anything you scored 4 or higher is a blocker. What guardrails would you need before approving this?
Prompt Engineering
Vague notes get vague results. Specific, structured prompts with examples and constraints get precise, useful output.
The intern is only as good as the notes you slide through the slot. A good note spells out five things — context, codebase, task, constraints, and expected output:
CONTEXT: [What you're working on, link to ADO item]
CODEBASE: [Which repo/folder to look at]
TASK: [Exactly what you want done]
CONSTRAINTS: [What NOT to do, patterns to follow]
OUTPUT: [What you expect back — code, explanation, test cases, etc.]
Example:
CONTEXT: ADO #4521 — adding cursor-based pagination to products API
CODEBASE: Check ProductsController.cs and OrdersController.cs (pagination reference)
TASK: Implement pagination on GET /api/products matching our existing pattern
CONSTRAINTS: Don't change the response schema for existing fields.
Use cursor-based, not offset-based. Max page size = 100.
OUTPUT: Implementation code + unit tests + PR description
Open a new conversation and try this structured prompt with the same code from earlier:
CONTEXT: C# backend, SOLID, async/await
TASK: Review for bugs, null safety, naming
CONSTRAINTS: Only real issues,
categorize by severity
CODE:
[paste your code here]
OUTPUT: Numbered list with
severity + suggested fix
Compare this to the generic review you got back in Step 4. Same intern, same code, better note. The difference is all in how you asked.
For Daily Work
Your intern works best with a clean desk. Start fresh conversations. Keep context focused. More papers = more diluted attention.
You know what the intern can do. Now: how to work with them day-to-day for great results instead of hit-or-miss guesses.
Everything (your question, conversation history, system prompt, tool definitions, attached files) must fit on the desk. A bigger desk doesn't mean better work.
The problem with a cluttered desk (and why self-attention is the reason):
The intern doesn't skim top-to-bottom like a human. They use self-attention — for every single token on the desk, they compute a relevance score against every other token. Think of it as the intern drawing an invisible line between every pair of words and asking: "How much does word A matter for understanding word B?"
With 5,000 tokens on the desk, that's 25 million lines to draw. At 100,000 tokens, it's 10 billion lines. This happens across 128 parallel "attention heads" (each specializing in different relationships: syntax, logic, code structure, coreference) across 100+ layers. The math works, but the signal gets diluted. When the intern draws 10 billion relationship lines, the important ones ("this variable is null because that constructor wasn't updated") compete with millions of irrelevant connections. The attention "budget" (which sums to 1.0 via softmax) gets spread across more tokens, meaning each individual token gets a smaller slice of focus.
Example: you paste a 500-line file and ask the intern to find a bug on line 47. With just that file on the desk (~2K tokens), the intern's attention on line 47 is concentrated. It can relate line 47 to every other relevant line. Now paste 20 more files alongside it (~40K tokens). The intern still sees line 47, but allocates attention across 40K tokens. Line 47's share shrinks from ~0.05% to ~0.0025% of the attention budget. The intern is more likely to miss the relationship between line 47 and the DI registration on line 312 of a different file.
Smaller, focused context = sharper attention = better results.
Research shows:
| Context Used | Quality Behavior |
|---|---|
| < 10K tokens (~25 pages) | Peak performance. High attention density. Good for focused tasks: single file reviews, specific questions, targeted generation. |
| 10K–50K tokens (~25–125 pages) | Still excellent. Good for multi-file tasks, PR reviews with context, feature implementation with spec + existing code. |
| 50K–100K tokens (~125–250 pages) | Noticeable degradation on details buried in the middle. The "lost in the middle" effect: info at the start and end gets more attention than info in the middle. |
| 100K+ tokens (~250+ pages) | Use with care. Great for search/retrieval tasks ("find X in this codebase"), but for reasoning tasks the intern starts missing details. Compaction helps here. |
Don't dump your entire codebase into the context "just in case." Give the intern what they need. Five relevant files beat fifty irrelevant ones.
How to right-size the desk: share only the files the task actually touches, summarize or link the rest, and let the intern pull details on demand via MCP instead of front-loading everything.
A 2023 Stanford study (Liu et al.) showed that LLMs recall information at the start and end of the context much better than the middle. At 100K+ tokens, accuracy on middle-positioned facts dropped by 20-30%. Attention scores decay with positional distance, and both the beginning (system prompt area) and end (recent conversation) get natural attention boosts. Practical fix: put your most important context near the top or the end of the prompt. MCP helps because the model requests information on demand, placing it fresh at the end of context when needed.
📖 Go deeper: Lost in the Middle (Stanford) — Context Window Best Practices (Anthropic)
One of the most common mistakes: having a 2-hour conversation, getting progressively worse results, and blaming the model.
Start a new conversation when you switch to a different task, when the topic has drifted several times, or when answers are getting noticeably worse.
Continue the conversation when the next question builds directly on what's already on the desk — a follow-up, a refinement, the next step of the same task.
Rule of thumb: Treat conversations like browser tabs. Open a fresh one for each distinct task. Don't let a single tab become a 200-page novel.
Bad: "Review my PR, also check if there are any related bugs in ADO, and while you're at it write release notes and update the wiki."
Good: "Review PR #387. Focus on: error handling, null safety, and test coverage."
The intern can do many things, but attention is finite. Four tasks in one prompt means the intern allocates ~25% of reasoning capacity to each. One task gets 100%. The quality difference is stark, especially for deep-thinking tasks like code review.
Exception: Simple factual multi-part questions ("status of items #101, #102, and #103?") are fine batched. The intern isn't reasoning, just fetching.
The intern learns patterns faster than rules. Compare:
Instruction-based: "Write commit messages using conventional commit format with a type prefix, scope in parentheses, and a concise description."
Example-based:
Write commit messages like these:
feat(auth): add OAuth2 support for GitHub login
fix(api): handle null response from payment gateway
refactor(db): extract connection pooling into shared module
Now write a commit message for my changes.
The second approach works better because the intern's core skill is pattern recognition, trained on trillions of examples. Showing 2–3 examples activates the right pattern more reliably than a paragraph of rules. This is few-shot prompting, and it works for code style, PR descriptions, test formats, documentation templates.
Build a team skill (SKILL.md) that includes your examples. Every team member gets the same patterns without retyping them.
Most tools pick sensible defaults, but knowing when to adjust helps:
| Task | Temperature | Why |
|---|---|---|
| Code generation | 0 – 0.2 | You want correct, deterministic code. Creativity = bugs. |
| Code review | 0 – 0.3 | You want consistent, reliable analysis. |
| Test generation | 0.3 – 0.5 | A little creativity helps find edge cases. |
| Brainstorming | 0.7 – 1.0 | You want varied ideas, not the first obvious one. |
| Writing (docs, emails) | 0.5 – 0.7 | Natural variation, not robotic. |
In Claude Desktop and Codex: You don't control temperature directly. The tools choose based on the task. Knowing this helps you understand why the intern sometimes gives different answers to the same question (temperature > 0), and why code generation is rock-solid (temperature ≈ 0).
The intern is smart but not infallible. Ask it to EXPLAIN before executing. Generated an API endpoint? Test it against your contract tests. The intern drafts, you approve.
Every token costs real money. Smart habits save budget:
| Action | Token Impact |
|---|---|
| Paste 1,000-line file every turn | ~4K input tokens × every message |
| Let intern read via MCP | ~4K tokens once, plus ~200 for the tool call |
| Ask for "code only, no explanation" | Saves ~500–2,000 output tokens per response |
| Start fresh instead of continuing a 50-message thread | Saves ~30K+ context tokens per turn |
Skills are instruction manuals (Chapter 4). To use them as a team, keep a /skills folder in your project repo — everyone on the team gets them automatically.
| Time | Action | Why |
|---|---|---|
| Start of day | Open Claude Desktop / Codex. Fresh conversation. | Clean desk. Full attention. |
| Pick up a task | "Read ADO #[item]. Summarize what needs to be done." | Intern reads the full context so you start informed. |
| Implementation | Share relevant files (or let the intern read via MCP). One task per conversation. | Focused desk = better code. |
| Before committing | "Review the changes I made. Focus on bugs and edge cases." | Fresh eyes. The intern hasn't seen this code before, so they catch what you've gone blind to. |
| PR description | "Write a PR description linking ADO #[item]. Include what changed, why, and how to test." | Saves 5 minutes every PR. Consistent format. |
| End of day | Close conversations. Don't save stale ones for tomorrow. | Tomorrow you'll want a clean desk, not yesterday's clutter. |
Each conversation starts with a prefill phase where the entire context (system prompt + conversation history) is re-processed through all model layers. At 200K tokens, this takes 5-15 seconds and costs significant compute. Shorter contexts mean faster first-token latency, lower cost, and higher quality. The KV-cache helps within a single turn, but between turns the cost of a long history compounds. Prompt caching helps for the static parts (system prompt, skills), but the conversation portion is always reprocessed in full.
📖 Go deeper: Prompt Caching (Anthropic) — Context Window Management
Your conversation has been running for a while now. Go back to it and ask the intern to recall your skill checklist from earlier. See how fuzzy the answer gets?
Now open a fresh conversation. Paste only the skill file and the code. Nothing else. The response will be noticeably sharper. That's what a clean desk does.
Your Monday Starts Now
Take the code-review-skill.md you built and drop it in a shared spot: a /skills folder in your repo, a shared drive, wherever your team can grab it.
Then think about what comes next. What other repeated tasks on your team could become a skill file? Who writes each one? How do you keep them current as your standards evolve?