AI Agent Research

AI agent research via verifiable challenges

TarantuBench is an open challenge suite of 100 web security scenarios — each with a binary, unambiguous ground truth. Study agent reasoning, strategy, persona effects, and failure modes with rich per-step telemetry. Offensive cybersecurity provides the multi-step complexity; the flag provides the verification.

View on GitHub
100 Verifiable Scenarios

Each challenge has a hidden flag — binary ground truth with no partial credit, no human judgment, no ambiguity.
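As a concrete illustration, binary verification can reduce to a single digest comparison. The sketch below is an assumption about how such a checker could work, not TarantuBench's actual API; the `Scenario` shape and `flagHash` field are hypothetical.

```ts
// Minimal sketch of binary flag verification (hypothetical schema).
interface Scenario {
  id: string;
  flagHash: string; // hex SHA-256 digest of the hidden flag
}

async function verifyFlag(scenario: Scenario, submitted: string): Promise<boolean> {
  // Hash the submission and compare digests, so the plaintext flag
  // never needs to leave the lab.
  const bytes = new TextEncoder().encode(submitted.trim());
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  const hex = Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return hex === scenario.flagHash; // binary outcome: no partial credit
}
```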

Rich Telemetry

Every HTTP request, reasoning trace, and tool call is logged. Analyze strategy, efficiency, sentiment, and failure modes.
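A per-step record might be shaped like the sketch below. Every field name here is an illustrative assumption, not the suite's published schema.

```ts
// Illustrative shape of one telemetry step (field names assumed).
interface StepRecord {
  step: number;
  reasoning: string;              // the model's reasoning trace for this step
  toolCall?: {
    name: string;                 // e.g. "http_request"
    args: Record<string, unknown>;
  };
  httpRequest?: {
    method: string;
    url: string;
    status: number;
    latencyMs: number;
  };
  timestamp: string;              // ISO 8601
}

// A run is the ordered list of steps plus the binary outcome.
interface RunLog {
  scenarioId: string;
  model: string;
  steps: StepRecord[];
  solved: boolean;                // flag matched or not
}
```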

100% Reproducible

Deterministic labs run in WebContainers — in the browser or locally. No setup, no external dependencies, fully replicable experiments.
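For a sense of the mechanics: StackBlitz's `@webcontainer/api` can boot a Node.js runtime inside the browser and serve a lab from a mounted file tree. The file contents below are placeholders, not an actual TarantuBench scenario.

```ts
import { WebContainer } from "@webcontainer/api";

// Boot an in-browser Node.js runtime (requires a cross-origin-isolated page).
const container = await WebContainer.boot();

// Mount a placeholder lab; identical file trees yield identical labs on every run.
await container.mount({
  "server.js": {
    file: {
      contents:
        "require('http').createServer((_, res) => res.end('lab up')).listen(3000);",
    },
  },
});

// Start the lab server and wait for its URL.
await container.spawn("node", ["server.js"]);
container.on("server-ready", (port, url) => {
  console.log(`Lab reachable at ${url} (port ${port})`);
});
```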

Research & Benchmarks

TarantuBench supports both model evaluation and deeper agent research. Compare frontier models under controlled conditions, or use the rich telemetry to study reasoning, tool-use patterns, and behavioral differences.

Latest Benchmark

Frontier Model Comparison — April 2026

Claude 4.5 Sonnet, GPT-5, and Gemini 3 Pro evaluated on 5 scenarios across 4 difficulty tiers. HTTP-only tooling, no code execution, 30-step limit.
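Expressed as a harness configuration, that setup might look like the following sketch; the key names and model identifiers are illustrative assumptions, not the benchmark's actual config format.

```ts
// Hypothetical run configuration mirroring the April 2026 constraints.
const runConfig = {
  models: ["claude-4.5-sonnet", "gpt-5", "gemini-3-pro"],
  scenarioCount: 5,           // drawn from 4 difficulty tiers
  tools: ["http_request"],    // HTTP-only: no shell, no code execution
  maxSteps: 30,               // runs past the limit score as unsolved
} as const;
```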

View full results →

Scenario Catalog

Interactive security scenarios. Select any scenario to launch it in your browser and attempt the exploit yourself.
