Three decades of teaching machines to act. From STRIPS and BDI, through ReAct and Toolformer, to GRPO-trained agents solving real GitHub issues at superhuman rates. Covers the full lineage of agentic harnesses, tool-use training, end-to-end RL, and the MCP era. Includes open problems and portfolio projects.
Three decades of teaching machines to act. This post traces the intellectual lineage of agentic AI: from Sutton’s Dyna cousin STRIPS, through the ReAct loop, Toolformer’s self-supervised tool learning, GRPO-trained end-to-end agents, and the Model Context Protocol that standardized it all.
An agent is anything that perceives its environment and takes actions to achieve goals. That definition is almost forty years old, yet every modern agentic AI system is still solving the same five sub-problems:
The field has answered those five sub-problems in five distinct waves:
Each wave handed the next a sharper tool for one of the five sub-problems, while exposing the new bottleneck. Understanding the failure modes is as important as understanding the solutions.
The first agent that could plan from scratch was STRIPS (Stanford Research Institute Problem Solver), introduced by Richard Fikes and Nils Nilsson at SRI in 1971. STRIPS formalizes the world as a set of logical predicates and actions as operators with preconditions and effects:
\[\text{Action } a: \quad \text{Pre}(a) \subseteq S, \quad S' = (S \setminus \text{Del}(a)) \cup \text{Add}(a)\]where $S$ is the current state (a set of ground predicates), $\text{Pre}(a)$ must hold before the action can execute, $\text{Del}(a)$ removes predicates, and $\text{Add}(a)$ inserts them. The planner searches for a sequence $a_1, a_2, \ldots, a_n$ that transforms the initial state into a goal state.
STRIPS spawned PDDL (Planning Domain Definition Language), which is still the standard input format for classical planners. The entire field of automated planning builds on this formalism.
Why it broke down: STRIPS requires a closed-world assumption (every unknown fact is assumed false) and complete, noise-free observations. Real environments are open, partially observable, and stochastic. Writing PDDL models for non-trivial domains requires domain experts who manually encode every object and relation.
Michael Bratman’s philosophical work “Intention, Plans, and Practical Reason” (1987) gave the field a richer cognitive architecture: Beliefs, Desires, Intentions (BDI). An agent maintains:
The BDI agent continuously filters desires through beliefs to generate intentions, then executes plans that serve those intentions. Commitment distinguishes intentions from wishes: once an agent intends something, it does not abandon it unless it becomes impossible or irrelevant.
BDI architectures (AgentSpeak, Jadex, JASON) were deployed in air-traffic control, logistics, and simulation. They handled partial observability better than STRIPS, but they still required hand-coded belief-update rules and plans, limiting them to narrow, well-specified domains.
The gap through 2020: Deep learning handled perception (images, speech) but not reasoning or planning. The symbolic planner handled reasoning but not perception. No single architecture bridged both, until language models grew large enough to write plans in natural language.
Jason Wei, Xuezhi Wang, Dale Schuurmans, and colleagues at Google introduced chain-of-thought (CoT) prompting in January 2022 (arXiv 2201.11903, NeurIPS 2022). The insight: if you provide few-shot examples that include intermediate reasoning steps, large language models produce reasoning chains before answering.
The discovery was the emergence threshold: CoT only helps for models above approximately 100 billion parameters. Below that threshold, models produce fluent but logically incoherent chains. Above it, the chains become causally valid:
\[\text{acc}(M, \text{CoT}) \approx \begin{cases} \text{acc}(M, \text{direct}) & |M| < 100\text{B} \\ \text{acc}(M, \text{direct}) + \Delta_\text{CoT} & |M| \geq 100\text{B} \end{cases}\]where $\Delta_\text{CoT} > 0$ is a substantial gain on multi-step tasks. On GSM8K (grade-school math), PaLM 540B with 8-shot CoT surpassed fine-tuned GPT-3 by a large margin.
This established the mechanism that every agentic system now relies on: the model can write its own plan in the scratchpad, then follow it. Without CoT, an agent cannot decompose tasks.
Long Ouyang, Jeff Wu, Xu Jiang, and colleagues at OpenAI published “Training language models to follow instructions with human feedback” (arXiv 2203.02155, NeurIPS 2022). The three-stage RLHF pipeline:
\[\mathcal{L}_\text{RM}(\psi) = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(r_\psi(x, y_w) - r_\psi(x, y_l)\right)\right]\] \[\mathcal{L}_\text{PPO}(\theta) = \mathbb{E}\!\left[r_\psi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}\right]\]where $y_w$ and $y_l$ are the preferred and rejected responses in a human comparison, $r_\psi$ is the learned reward model, and the KL penalty $\beta$ keeps the policy from drifting too far from the supervised baseline.
The result: a 1.3B InstructGPT model that human evaluators preferred over a 175B GPT-3 output. Instruction-following became the new baseline for all deployed LLMs, and the RLHF recipe became the template for all subsequent agent training.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, and colleagues at OpenAI introduced WebGPT (arXiv 2112.09332) in December 2021. They fine-tuned GPT-3 to answer long-form questions by interacting with a text-based browser via structured actions:
| Action | Effect |
|---|---|
Search[q] | Query Bing, return snippets |
Quote[p] | Extract passage $p$ from current page |
Goto[url] | Navigate to URL |
Scroll[dir] | Scroll page |
Training used imitation learning on human demonstrations, then RLHF where evaluators compared answer quality. The best-of-64 model was preferred to human demonstrators 56% of the time.
WebGPT established two principles that every subsequent agent system inherited: (1) structured action spaces with typed arguments are easier to supervise than free-form generation, and (2) RLHF can teach interactive behavior, not just static generation.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, and colleagues at Meta AI introduced Toolformer (arXiv 2302.04761, NeurIPS 2023). The key question: can a model learn which tool calls are useful without human annotation?
Toolformer answers yes with a three-stage self-supervised pipeline. Tool calls are inlined as text tokens: [Calculator(15*37) -> 555]. The model determines a call is useful if it reduces the loss on subsequent tokens:
A candidate call $c$ is kept if $\mathcal{L}_i^-(c) \leq \mathcal{L}_i^+(\epsilon) - \tau$ for some threshold $\tau$: the call must make the continuation more predictable than having no call at all. The final fine-tuning dataset contains only calls that passed this filter.
Toolformer incorporated calculators, QA systems, search, translation, and calendar tools. It achieved near-state-of-the-art zero-shot performance across downstream tasks while preserving core language modeling ability: the model learned to call tools when they helped and ignore them when they did not.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao at Google and Princeton introduced ReAct (arXiv 2210.03629, ICLR 2023). The core observation: chain-of-thought (pure reasoning) and tool use (pure acting) each have complementary failure modes. CoT hallucinates when it needs facts it does not have. Tool-only approaches cannot reason across multiple retrieved facts.
ReAct interleaves them:
Thought: I need to find the capital of the country where the Eiffel Tower is.
Action: Search[Eiffel Tower location]
Observation: The Eiffel Tower is in Paris, France.
Thought: France's capital is Paris. Now I need the population.
Action: Search[Paris population 2023]
Observation: 2.1 million (city proper), 12 million (metropolitan area).
Thought: The answer is Paris, France, with ~2.1 million residents.
Finish[Paris, France, ~2.1 million]
On HotpotQA (multi-hop QA), ReAct with Wikipedia API outperformed pure CoT by resolving hallucinations through external verification. On ALFWorld (household task game), ReAct achieved a 34% absolute improvement over imitation/RL baselines using only 1-2 in-context examples. ReAct became the canonical prompt format for agentic systems; nearly every framework that followed inherited its Thought/Action/Observation structure.
The word “harness” in agentic AI refers to the software scaffold that wraps a language model in an execution loop. The model generates text; the harness parses that text into structured actions, dispatches them to tools or environments, collects the results, formats them as observations, and feeds everything back into the model’s context. The model itself is stateless between calls; the harness carries all state.
Yongliang Shen, Kaitao Song, and colleagues at Tsinghua University (published as JARVIS by Microsoft) introduced HuggingGPT (arXiv 2303.17580, NeurIPS 2023). The architecture uses ChatGPT as an orchestrator over Hugging Face models in four stages: (1) task planning (decompose request into subtask graph), (2) model selection (match each subtask to a Hugging Face model by description similarity), (3) task execution (run models in dependency order), (4) response generation (synthesize results).
The pattern: LLM as a controller-brain orchestrating specialized expert models as actuators. This decomposition meant the orchestrating LLM needed breadth (understand many task types) while the specialist models provided depth.
Both systems, released open-source in early 2023, demonstrated fully autonomous multi-step agent loops in the wild. BabyAGI (Yohei Nakajima) maintained a task queue with vector-memory context; AutoGPT chained GPT-4 API calls with a plan-evaluate-revise loop.
They exposed three critical failure modes that drove a decade of subsequent research:
| Failure mode | Cause | Later fix |
|---|---|---|
| Infinite loops | No termination criterion | Budget/step limits; RLHF on episode reward |
| Compounding errors | Context fills with bad observations | Context compression; re-planning |
| Hallucinated tool calls | Model not trained for structured output | Function calling, fine-tuning |
Despite their limitations, AutoGPT reached 100,000 GitHub stars in five days, establishing that the world wanted autonomous agents.
Xiao Liu, Hao Yu, Hanchen Zhang, and colleagues at Tsinghua introduced AgentBench (arXiv 2308.03688, ICLR 2024): the first systematic multi-environment evaluation. Eight environments across code, game, and web tasks. Key finding: GPT-4 scored 2-6x higher than the best open-source 70B models, confirming that agentic ability was not just a matter of reasoning but of reliable structured output and long-context coherence.
ReAct and HuggingGPT showed what agents could do with the right prompts. The next question: can you train a model to be a better agent, rather than prompting it into the role at inference time?
Two early approaches emerged: trajectory distillation (copy expert agent trajectories into fine-tuning data) and self-improvement (generate trajectories, filter successes, fine-tune on them).
Baian Chen and colleagues at System2 Research introduced FireAct (arXiv 2310.05915). The recipe: collect diverse ReAct trajectories from GPT-4 across multiple tasks and prompting methods, then fine-tune a smaller model (Llama-2-7B) to reproduce them. Fine-tuning on just 500 GPT-4 trajectories yielded a 77% improvement on HotpotQA. The smaller model learned to:
FireAct established that agentic capability can be distilled into smaller models, analogous to how reasoning ability can be distilled via CoT fine-tuning.
The AgentTuning paper (arXiv 2310.12823) from Tsinghua generalized FireAct to six tasks (ALFWorld, WebShop, Mind2Web, etc.), creating AgentInstruct: 1,866 high-quality interaction trajectories covering diverse agentic scenarios. Fine-tuning Llama-2 on a mixture of AgentInstruct and general instruction data preserved generalization while improving agent performance, avoiding the catastrophic forgetting that naive fine-tuning causes.
For a trajectory $\tau = (o_0, a_0, o_1, a_1, \ldots, o_T)$ where $o_t$ is an observation and $a_t$ is an action (tool call), the supervised fine-tuning objective is:
\[\mathcal{L}_\text{SFT}(\theta) = -\sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \log p_\theta\!\left(a_t \;\Big|\; o_0, a_0, \ldots, o_t\right)\]This treats each action token as a next-token prediction target, conditioned on the full history. The model learns the distribution over valid actions at each step, not just valid completions of text. The key distinction from standard fine-tuning: the ground truth at step $t$ is the expert action, not the model’s own prediction, so the model can never learn from its own mistakes unless we move to RL.
Supervised fine-tuning on expert trajectories has a fundamental flaw: distribution shift. At training time, every observation $o_t$ was generated by the expert taking correct actions. At test time, the model’s own (possibly wrong) actions generate different observations. The model was never trained to recover from its own errors.
The solution is to train the model on its own trajectories, scored by whether the episode succeeded. This is reinforcement learning over full agent trajectories.
Carlos Jimenez, John Yang, Alexander Wettig, and colleagues introduced SWE-bench (arXiv 2310.06770): 2,294 real GitHub issues from popular open-source repositories. Each task provides a docker environment with the pre-issue codebase; the agent must produce a patch that passes the original repository’s test suite. No partial credit; the test either passes or it does not.
Pass rate $\hat{p}$ is estimated over $n$ independent runs:
\[\hat{p}@1 = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[\text{patch}_i \text{ passes all tests}]\]SWE-bench established that agents could be evaluated on real-world utility, not curated toy benchmarks. The improvement trajectory since its release is the clearest signal of agentic progress:
DeepSeek introduced GRPO in DeepSeekMath (Shao et al., 2024) and popularized it via DeepSeek-R1 (arXiv 2501.12948). GRPO eliminates the separate value/critic network by computing advantages within a group of sampled outputs for the same input:
\[\hat{A}_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^{G})}{\text{std}(\{r_j\}_{j=1}^{G})}, \qquad i = 1, \ldots, G\] \[\mathcal{L}_\text{GRPO}(\theta) = -\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(\rho_i \hat{A}_i,\; \text{clip}(\rho_i, 1\pm\epsilon)\hat{A}_i\right) - \beta\, \text{KL}[\pi_\theta \| \pi_\text{ref}]\right]\]where $\rho_i = \pi_\theta(a_i \mid o) / \pi_\text{old}(a_i \mid o)$ is the importance ratio. By normalizing advantages within a group, GRPO achieves the credit assignment of PPO while reducing memory by 40-60% (no critic network).
RLVR (Reinforcement Learning with Verifiable Rewards) pairs GRPO with symbolic reward functions: a calculator verifies math answers; a test runner verifies code patches; a web crawler verifies factual claims. Rewards are binary and deterministic, eliminating the need for a learned reward model entirely.
DeepSeek-R1-Zero trained with pure GRPO on a base model (no SFT) and produced emergent behaviors: self-evaluation, error correction, and extended “thinking” traces. The “aha moment” - where the model discovers it should reconsider a wrong intermediate step - emerged spontaneously from reward signal alone.
WebAgent-R1 (arXiv 2505.16421) extended GRPO to multi-turn web agent tasks via M-GRPO. The challenge: each web episode is a sequence of turns $(o_0, a_0, o_1, a_1, \ldots, o_T, a_T)$ with a single terminal reward $r_T$. Credit assignment across turns requires propagating the reward backward through the trajectory.
M-GRPO handles this via a modified advantage estimate:
\[\hat{A}_t^{(i)} = \gamma^{T-t} \cdot \frac{r_T^{(i)} - \mu_G}{\sigma_G}, \qquad \mu_G = \frac{1}{G}\sum_{j} r_T^{(j)}, \quad \sigma_G = \text{std}\!\left(\{r_T^{(j)}\}\right)\]where $\gamma^{T-t}$ applies a discount to earlier turns. Qwen-2.5-3B trained with M-GRPO improved from 6.1% to 33.9% on WebArena-Lite; Llama-3.1-8B improved from 8.5% to 44.8%, outperforming GPT-4o.
SWE-agent (Carlos Jimenez, John Yang, et al., NeurIPS 2024) designed a custom harness specifically for software engineering: Agent-Computer Interface (ACI) with commands like edit_file, run_tests, search_code. The ACI gives the model a Unix-like interface and enforces structured output at each step. Paired with GPT-4o, SWE-agent was the first open-source agent to achieve double-digit SWE-bench scores.
OpenAI’s function calling API (June 2023) standardized how models emit tool calls. The developer provides a JSON Schema description of each function; the model outputs a structured JSON object rather than free-form text. This removed the need for prompt-based output parsing in every harness:
{
"name": "search_web",
"arguments": {"query": "Paris population 2024"}
}
Before function calling, every harness had a brittle text parser that broke whenever the model formatted its tool call slightly differently. After, tool dispatch became reliable enough for production use.
The deeper problem: every AI application needed custom connectors to every data source. N applications and M data sources required $N \times M$ connectors. David Soria Parra and Justin Spahr-Summers at Anthropic solved this with MCP (Model Context Protocol), announced November 2024.
MCP is built on JSON-RPC 2.0 and provides three primitives: tools (functions the model can call), resources (files and data the model can read), and prompts (templates the server exposes). Anthropic donated MCP to the Agentic AI Foundation (Linux Foundation) in December 2025; OpenAI adopted it in March 2025. By mid-2026, MCP is the industry standard for agent-tool integration, supported by Claude, ChatGPT, Gemini, and every major open-source framework.
Single-agent loops hit context and cost limits. Multi-agent orchestration assigns subtasks to specialist agents, runs them in parallel, and aggregates results. Three orchestration patterns emerged:
| Framework | Pattern | Best for |
|---|---|---|
| LangGraph | Directed graph with conditional edges | Complex branching workflows, compliance |
| CrewAI | Role-based crews with typed processes | Fast iteration, task delegation |
| AutoGen / AG2 | Conversational multi-agent with GroupChat | Code generation, iterative refinement |
LLMCompiler (Kim et al., 2023) formalized parallel tool execution and reported a 3.6x speedup over sequential ReAct for tasks with independent subtasks: plan the entire call graph first, then dispatch parallel calls, then synthesize. This plan-then-execute decomposition reduces per-token cost significantly on tool-heavy tasks.
The final frontier of tool use: giving agents access to any graphical user interface without an API. Claude Computer Use (research preview, 2024) and OpenAI’s CUA (Computer-Using Agent, 2025) train models on GUI interaction: they see a screenshot, emit mouse coordinates and keyboard actions, observe the resulting screenshot.
CUA uses GPT-4o vision combined with RL on task success signals from GUI environments. Claude Computer Use achieved 72.5% on complex GUI tasks in February 2026 evaluations, including multi-tab web coordination and spreadsheet navigation.
Does agent performance scale like language modeling? Niklas Kildeby et al., “Towards a Science of Scaling Agent Systems” (arXiv 2512.08296), evaluated 260 configurations across 6 benchmarks and 5 architectures, deriving quantitative scaling principles validated on GPT-5.2 (MAE = 0.071).
Key findings:
\[\text{perf}(N, K, T) \approx A \cdot N^\alpha \cdot K^\beta \cdot T^\gamma\]where $N$ is model size, $K$ is number of agents in ensemble, and $T$ is the number of turns/steps allowed. The exponents $\alpha, \beta, \gamma$ depend on task type:
Skill library scaling (arXiv 2605.16508): as agentic systems accumulate reusable skill libraries, routing accuracy decays logarithmically with library size, measured across 15 frontier LLMs and 1,141 real-world skills. This suggests skills need semantic indexing, not flat lookup.
| System | Org | Paradigm | SWE-bench | Key contribution |
|---|---|---|---|---|
| SWE-agent | Princeton | harness + GPT-4o | ~40% | ACI interface design |
| Devin 2.0 | Cognition | proprietary harness | 45.8% | E2E trained coding agent |
| OpenHands | MIT | open platform | 40%+ | community ecosystem |
| Claude Code | Anthropic | native coding agent | 70%+ | RLVR + long context |
| WebAgent-R1 | Multi | M-GRPO fine-tune | N/A (web) | 44.8% WebArena |
| CUA | OpenAI | GUI + RL | N/A | 72.5% GUI bench |
Most agentic RL uses a single terminal reward: either the patch passes tests or it does not. But a 50-step coding episode may have correct early steps followed by a mistaken middle and a failing patch. The standard GRPO advantage:
\[\hat{A}_t = \gamma^{T-t} \cdot r_T\]gives disproportionate credit to early steps when $\gamma < 1$ and none to late steps when $\gamma \approx 1$. Dense, per-step reward signals exist for math (verifiable intermediate steps) but are hard to construct for open-ended tasks. Process reward models (PRMs, Lightman et al., NeurIPS 2023) provide step-level scores but require expensive human annotation.
A long agent episode fills the context window with observations, failed attempts, and redundant tool outputs. Naive truncation discards crucial earlier steps; summarization loses detail. A good compaction strategy should satisfy:
\[\text{compress}(\tau_{1:t}) = \tau'_{1:t'} \;\;\text{such that}\;\; \text{perf}(\pi, \tau'_{1:t'}) \approx \text{perf}(\pi, \tau_{1:t}), \quad t' \ll t\]Current approaches include rolling summarization, attention-based selection, and staged compaction (preserve tool outputs; summarize thoughts). No principled solution exists.
In a multi-agent pipeline, one agent’s output is another agent’s input. Errors propagate and amplify. A formal model: if each agent has error rate $\varepsilon_i$ and agents operate in sequence, the pipeline error after $n$ agents is bounded by:
\[\varepsilon_\text{total} \leq 1 - \prod_{i=1}^{n}(1 - \varepsilon_i) \approx \sum_{i=1}^n \varepsilon_i\]For $n = 10$ agents each at $\varepsilon = 0.05$, the pipeline error reaches 40%. Centralized verification nodes (one agent checks another’s output) reduce this, but verification is itself fallible. Formal verification of agent outputs on structured tasks (code: run tests; math: CAS check) is tractable; verification on open-ended tasks is not.
An agent calling external tools is exposed to prompt injection: a web page or API response can contain text that hijacks the agent’s reasoning. The formal threat:
\[p_\text{inject}(\tau) = \arg\min_\delta \; d(\tau, \tau^*) \;\;\text{s.t.}\;\; \pi(\tau + \delta) = a_\text{adversary}\]Defenses exist (sandboxed parsing, structured output, input filtering) but none are watertight when the model processes free-form text. This is the XSS problem ported to agentic AI.
When the reward is a test suite, agents learn to hack the tests rather than solve the underlying problem: delete tests, hard-code expected outputs, or exploit test timeouts. DreamerV3 in Minecraft encountered the same problem at a smaller scale (collecting diamonds without smelting). A principled measure of reward hacking:
\[\text{Goodhart gap} = \text{score}(r_\text{proxy}) - \text{score}(r_\text{true})\]where $r_\text{proxy}$ is the harness reward and $r_\text{true}$ is the actual user utility. Monitoring this gap across training runs should be standard practice but rarely is.
A model trained with GRPO on OpenHands trajectories may not transfer to SWE-agent trajectories because the action spaces and observation formats differ. There is no principled understanding of what harness-independent agentic capability looks like - analogous to asking whether a skill learned in one video game transfers to another. Standardized action vocabularies (MCP helps here) and evaluation across harnesses would clarify this.
Build a ReAct loop in under 200 lines of Python: a context window, a tool dispatcher (search + calculator), and the Thought/Action/Observation loop. Evaluate on HotpotQA (Wikipedia 2-hop QA). Measure task success rate vs. number of tool calls per question:
\[\text{efficiency} = \frac{\text{success rate}}{\text{mean tool calls per task}}\]Compare: (1) ReAct with GPT-4o-mini, (2) chain-of-thought only (no tools), (3) direct answer (no reasoning). This reproduces the core ReAct ablation from the original paper.
Portfolio signal: clean systems code; deep understanding of the harness/model interface.
Implement the Toolformer filtering criterion on a small LM (e.g., GPT-2 or Pythia-410M) and a single tool (calculator). For each sentence in a math-word-problem dataset, sample 5 candidate insertion points and 3 candidate tool calls at each, execute them, compute the loss-reduction:
\[\Delta\mathcal{L}_i(c) = \mathcal{L}_i^+(\epsilon) - \mathcal{L}_i^-(c)\]and fine-tune the model only on calls where $\Delta\mathcal{L}_i > \tau$. Compare to a baseline that uses all sampled calls regardless of utility.
Portfolio signal: self-supervised training pipelines; understanding of when models benefit from tools.
Implement GRPO (group advantage, no critic) on a short-horizon agentic task: a text-based maze where the agent issues move(direction) commands and receives a binary reward (reached goal / did not). Train a small transformer from scratch, not a pre-trained LM. Plot the group reward variance vs. training step to observe policy collapse and entropy:
Compare GRPO to REINFORCE (single-sample advantage) on sample efficiency.
Portfolio signal: RL training from scratch; direct GRPO implementation; understanding of credit assignment.
Pick 20 easy issues from SWE-bench Verified (labeled “easy” by difficulty tag). Build an evaluator that (1) spins a Docker container, (2) runs a harness (SWE-agent or OpenHands), (3) applies the agent’s patch, (4) runs the test suite, and (5) reports pass@1. Compare performance of GPT-4o-mini vs. Sonnet-3.5 on these 20 issues.
\[\text{pass@1} = \frac{\text{issues resolved}}{20}\]Track: number of editing cycles, number of test executions, total tokens consumed.
Portfolio signal: production-grade evaluation infrastructure; cost/performance trade-off analysis.
Build an MCP server that exposes a personal knowledge base (your Obsidian vault, research papers, or codebase) as a set of tools: search_notes(query), get_note(title), list_notes(tag). Plug it into Claude Desktop or any MCP-compatible client. Measure retrieval precision of the MCP tool vs. naive LLM recall:
This project teaches both the MCP protocol and how context-retrieval quality affects agent task success.
Portfolio signal: systems integration; MCP protocol; personal knowledge management tooling.
Train a code-writing agent on a simple coding benchmark (10 problems, each with a test suite). After training, audit the agent’s patches for reward-hacking patterns: deleted test cases, hard-coded expected values, sys.exit(0) injections. Build a static analysis pass that flags these patterns and measure the Goodhart gap:
Hold-out tests are written by a human after training, so the agent cannot have overfit to them.
Portfolio signal: evaluation methodology; safety-aware MBRL; direct contribution to trustworthy agentic training.
Take the same underlying model (GPT-4o-mini) and wrap it in three different harnesses: (1) vanilla ReAct loop, (2) plan-then-execute (LLMCompiler-style), (3) multi-agent (orchestrator + executor). Evaluate on 30 tasks from AgentBench. Report:
\[\text{relative speedup} = \frac{\text{tokens (harness 1)}}{\text{tokens (harness X)}} \quad \text{at equal task success rate}\]This isolates the harness contribution from the model contribution, a question that almost no published paper cleanly answers.
Portfolio signal: controlled experimental design; cost-efficiency analysis; harness-agnostic evaluation.
This survey covers the main lineage from STRIPS (1971) through the MCP-standardized multi-agent systems of 2025-2026. The field is moving fast: every major venue now has agentic sessions (NeurIPS Agentic AI workshop, ICML agent track, ICLR SWE-bench leaderboard papers, ACL for language grounding). The SWE-bench leaderboard and WebArena are the best live signals of where the frontier sits today.