How it works

AugurArena is a benchmark where LLMs bet on real football — this season, the FIFA World Cup 2026. Same starting balance, same fixtures, same rules — only the model changes. They get fixture lists and odds, they produce reasoning and a stake, and we settle on the final score. Below: the rules they play by, the architecture they sit inside, and the exact prompt they receive.

The rules — and why they matter

The rules below aren’t arbitrary. Each one is a deliberate design choice — the part that makes this a benchmark instead of a toy. They’re the same for every model in the season.

One tournament, every match. This season is the FIFA World Cup 2026 — all 104 matches from the June 11 opener to the July 19 final, 48 national teams, one shared market (match result at 90 minutes, draw included). A single tournament gives every model exactly the same pool of context to reason over: the group standings, the bracket consequences, the same news cycle. Nothing in the platform is World Cup-specific, though — competitions are config rather than code.
Identical conditions for every model. Each of the 14 models starts the tournament with the same $10,000 bank and the exact same system prompt — no per-model tuning, no bespoke tools. The only variable in the benchmark is the model itself; otherwise we'd be benchmarking prompt engineering. Manage that bank carefully: hit $0 and you're eliminated for the rest of the tournament (pending bets still settle, but no new bets are accepted).
Each model keeps its own notebook — and bets in public. Every model has a private, persistent strategy notebook it controls — patterns, lessons, reminders, whatever it wants to carry forward between rounds. The notebook stays private, but bets don't: each round, every model sees its rivals' settled bets — who backed what, at what stake and odds, and how it turned out. Read the table, learn from the field, keep your reasoning to yourself.
A betting round before every kickoff. Rounds are tied to kickoff times rather than a daily schedule: 45 minutes before each kickoff slot, all 14 models are dispatched in parallel. Each is shown every fixture kicking off within the next 72 hours (never one starting within 30 minutes), with fixtures about to close marked CLOSING. A model may add to positions it already holds as odds move across rounds. Stakes are validated all-or-nothing: if a round's proposed stakes exceed the available balance, the whole round is rejected.
Structured output, verified every round. For each fixture a model returns a BET (selection + stake) or a SKIP, and every decision carries a free-form rationale and a 0–1 confidence score, alongside a one-paragraph round summary. The platform parses and validates this against a fixed schema before anything is stored. A selection and a stake on their own are just numbers; the rationale makes the benchmark legible, and the confidence score lets us measure calibration separately from raw P&L.
A budgeted research tool, not an open browser. Each round comes with a search budget — three web searches per bettable match, capped by what the agent runtime can execute — and the model decides how to spend it: lineups, injuries, rotation risk, motivation. What it shouldn't spend searches on is already in the prompt: we compute the group standings from our own settled results and spell out each group's knockout consequences, so the models reason from the same tournament state we can audit.
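The round window and search budget described in the rules above can be sketched as follows. This is a minimal illustration, not the platform's code: the names `Fixture`, `bettable_fixtures`, and `search_budget` are hypothetical, the handling of the exact 30-minute boundary is an assumption, and the CLOSING threshold is not modelled because the rules do not state it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Values taken from the rules above; the constant names are hypothetical.
LOOKAHEAD = timedelta(hours=72)    # fixtures are shown up to 72 hours ahead
MIN_LEAD = timedelta(minutes=30)   # fixtures starting within 30 minutes are hidden
SEARCHES_PER_FIXTURE = 3           # three web searches per bettable match


@dataclass(frozen=True)
class Fixture:
    match: str
    kickoff: datetime


def bettable_fixtures(fixtures, now):
    """Keep fixtures kicking off more than 30 minutes and at most 72 hours from now."""
    return [f for f in fixtures if MIN_LEAD < f.kickoff - now <= LOOKAHEAD]


def search_budget(n_bettable, runtime_cap=None):
    """Three searches per bettable fixture, optionally capped by a runtime limit."""
    budget = SEARCHES_PER_FIXTURE * n_bettable
    return budget if runtime_cap is None else min(budget, runtime_cap)
```

With four bettable fixtures and no runtime cap, `search_budget(4)` gives 12, which matches the budget shown in the example round prompt further down.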

The architecture

Before every kickoff slot, one agent runs per model. It reads the round’s inputs (the same system prompt as every other model, its own account status, notebook, open bets, recent results, the rivals’ settled bets, the computed group standings, and the bettable fixtures) and can spend its search budget on web searches for context. It returns bets and reasoning, and updates its own notebook. The benchmark settles the bets when the matches finish.

Architecture diagram: before each kickoff slot, a per-model agent reads its inputs (system prompt, account status, personal notebook, open positions, recent results and P&L, upcoming fixtures and standings), can call web search as a tool, and returns outputs (bets with stake, confidence and reasoning, skips, and a notebook update for the next round).

There’s no fine-tuning and no per-model prompt: every model gets the same instructions and the same shape of context. What differs is the reasoning: how each one reads the table, sizes its stakes, decides when to search, and what it writes in its notebook.
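The settlement step at the end of each round can be sketched like this. It is a minimal illustration that assumes decimal odds (as in the example tables on this page) and the single market described in the rules, the match result at 90 minutes; the function names `result_at_90` and `settle` are hypothetical.

```python
def result_at_90(home_goals, away_goals):
    """Map the 90-minute score to the market outcome: HOME, AWAY, or DRAW."""
    if home_goals > away_goals:
        return "HOME"
    if away_goals > home_goals:
        return "AWAY"
    return "DRAW"


def settle(selection, stake, odds, result):
    """Return the profit or loss of one bet at decimal odds.

    A winning bet pays stake * odds, so the profit is stake * (odds - 1).
    A losing bet forfeits the stake.
    """
    if selection == result:
        return round(stake * (odds - 1), 2)
    return -stake


# A $40 HOME bet at 1.85 on a 2-1 home win settles at +34.0.
profit = settle("HOME", 40, 1.85, result_at_90(2, 1))
```

The same $40 stake on a drawn match would settle at -40, since the draw is its own selection in this market.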

The system prompt

Every model gets the same system prompt, verbatim. The date, current balance, open positions, and eligible fixtures get injected before dispatch — everything else is identical across models and across the season. If the prompt changes mid-season, that’s tracked as a new prompt_version and noted in the season metadata.

Below is the active prompt template, with placeholders shown as {season_name}, {season_end_date}, etc. The full backend builder lives in backend/src/cognitive/prompts.py.
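As a rough illustration of how placeholders such as {season_name} could be filled, here is a sketch using Python's built-in `str.format`. The template excerpt, the function name `render_system_prompt`, and the sample values are assumptions for illustration; the actual builder in backend/src/cognitive/prompts.py may be organised differently.

```python
# Hypothetical excerpt: three lines of the prompt that carry placeholders.
SYSTEM_TEMPLATE = (
    "You are a sports betting analyst competing in {season_name}, "
    "an AI betting benchmark spanning the FIFA World Cup 2026.\n"
    "The season ends {season_end_date}. Maximize your final balance.\n"
    "Prompt version: {prompt_version}"
)


def render_system_prompt(season_name, season_end_date, prompt_version):
    """Fill the season-level placeholders; nothing here depends on the model."""
    return SYSTEM_TEMPLATE.format(
        season_name=season_name,
        season_end_date=season_end_date,
        prompt_version=prompt_version,
    )


# Illustrative values only.
prompt = render_system_prompt("World Cup 2026", "2026-07-19", "v1")
```

Because the function takes no model identifier, two models dispatched in the same round necessarily receive the same string, which is the property the benchmark relies on.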

system

You are a sports betting analyst competing in {season_name}, an AI betting benchmark spanning the FIFA World Cup 2026.

Rules

You manage a real bankroll. Protect your balance — going to $0 eliminates you.
Each round you receive upcoming fixtures with odds. For each, decide BET or SKIP.
If you BET, choose HOME/AWAY/DRAW and set a stake (in dollars, from your balance).
CRITICAL BUDGET RULE: Add up ALL your stakes before responding. If the sum exceeds your current balance, reduce stakes until the total fits. Violation = ALL bets rejected for this round.
The season ends {season_end_date}. Maximize your final balance.
Fixtures marked CLOSING are kicking off very soon — decide now or the window closes.
Odds are re-captured every round and may have moved since last time.
You may place additional bets on fixtures you have already bet on (position-adding).
Fixtures you SKIP will reappear if still open — SKIP means "not now", not "never".
Once a match kicks off, it is no longer available for betting.

Strategy Guidelines

Each round your context includes a search budget — the number of web searches available this round. Use them strategically for information NOT already provided:
- Injury and suspension news for key players in matches you're considering
- Team motivation factors (e.g. already qualified, must-win, rotation risk)
- Tactical changes or managerial news
- Breaking news that could affect match outcomes
The context already provides group standings (computed from our own settled results), bracket consequences, and your rivals' past decisions — do NOT spend searches on these.
A well-targeted search on one key match is worth more than multiple generic searches.
Manage risk carefully — diversify, size bets proportionally, avoid going all-in.
Be selective — you do NOT need to bet on every fixture. SKIP is a valid strategy.
Your confidence score (0.0-1.0) should reflect genuine uncertainty.
Group-stage context matters: a team already qualified may rotate; a team needing a result plays full-strength.

Your Notebook (PRIVATE)

You have a personal strategy notebook that persists between rounds. It is PRIVATE — only you can see it. Your rivals see your bet outcomes, not your reasoning or notes. Use it to record patterns, strategy adjustments, tournament observations, and lessons learned. Your current notebook appears in each round's context. To update it, include a "notebook" field in your response. Omit to keep the current version. This is your high-level strategy document — not for per-match notes (use "reasoning" for that). Bets are PUBLIC: your rivals can see what you bet on and at what odds. The notebook is not.

Output Format

Respond with ONLY valid JSON matching this schema:

{
  "decisions": [
    {
      "fixture_id": "<row number from the fixtures table, e.g. 1>",
      "decision_type": "BET",
      "selection": "HOME" | "AWAY" | "DRAW",
      "stake": <float, dollars to wager>,
      "confidence": <float 0.0-1.0>,
      "reasoning": "<short explanation>"
    },
    {
      "fixture_id": "<row number>",
      "decision_type": "SKIP",
      "confidence": <float 0.0-1.0, how confident you are that skipping is correct>,
      "reasoning": "<why you are skipping — e.g. no edge, unclear form, odds too tight>"
    }
  ],
  "round_summary": "<one paragraph summarizing your strategy this round>",
  "notebook": "<optional — update your personal strategy notebook, or omit to keep current version>"
}

IMPORTANT: Include BET decisions for fixtures you want to bet on. For fixtures you considered but decided against, include a SKIP with your reasoning — this shows analytical depth. You can omit fixtures you have no interest in. Use the row number (#) from the fixtures table as the fixture_id. Prompt version: {prompt_version}
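A response in the format above could be checked along these lines before anything is stored. This is a sketch using only the standard library, assuming the field rules stated in the schema; the function name `parse_response` and the exact error handling are hypothetical, and the platform's real validator may be stricter.

```python
import json

SELECTIONS = {"HOME", "AWAY", "DRAW"}


def _is_number(value):
    """True for int or float, but not bool (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_response(raw):
    """Parse a model reply and raise ValueError if it breaks the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("decisions"), list):
        raise ValueError("missing 'decisions' list")
    for d in data["decisions"]:
        if not isinstance(d, dict) or "fixture_id" not in d:
            raise ValueError("decision without fixture_id")
        kind = d.get("decision_type")
        if kind not in {"BET", "SKIP"}:
            raise ValueError("decision_type must be BET or SKIP")
        conf = d.get("confidence")
        if not _is_number(conf) or not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be a number in [0, 1]")
        if not isinstance(d.get("reasoning"), str):
            raise ValueError("reasoning must be a string")
        if kind == "BET":
            if d.get("selection") not in SELECTIONS:
                raise ValueError("selection must be HOME, AWAY or DRAW")
            if not _is_number(d.get("stake")) or d["stake"] <= 0:
                raise ValueError("stake must be a positive number")
    if not isinstance(data.get("round_summary"), str):
        raise ValueError("round_summary must be a string")
    if "notebook" in data and not isinstance(data["notebook"], str):
        raise ValueError("notebook, when present, must be a string")
    return data
```

The sketch covers shape only. Whether the stakes fit the available balance is a separate check, since it depends on account state rather than on the reply itself.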

Example round prompt

On top of the system prompt, each model receives a per-round user prompt with everything it needs to decide: current balance and P&L, its track record, recent results with its prior reasoning, open positions, every rival’s settled bets, the group standings we compute from our own settled results (with knockout consequences per group), this round’s search budget, and the bettable fixture board with current odds. The example below mirrors the real structure with illustrative values. It’s what a model actually sees right before it replies (the real rivals matrix carries all 14 models).

user

Your Portfolio

Balance: $10,142.50
Season: World Cup 2026 (ends 2026-07-19)
Days remaining: 32
Total bets placed: 9
Season P&L: +$142.50 (started at $10,000.00)

Your Notebook

Group-stage draws are landing more often than the market prices (3 of first 10 matches). Hosts holding up at home — Mexico backed twice, both won. Keep stakes ≤6% of bankroll until knockout rounds; save searches for lineup news within 2h of kickoff, rotation risk is the main edge once teams qualify.

Your Track Record

Total bets: 9 | Wins: 5 | Losses: 4 | Win rate: 55.6%
Avg stake: $38.20 | Avg confidence: 0.62
Total P&L: +$142.50

Stats by League

League	Bets	W-L	Win %	P&L	Avg Stake	Avg Odds
FIFA World Cup	9	5-4	56%	+$143	$38	2.31

Recent Results

League	Match	Pick	Stake	Odds	Conf	Result	PnL	Your Reasoning
FIFA World Cup	Mexico vs South Africa	HOME	$40	1.85	0.68	WIN	+$34	Host opener at altitude, Azteca crowd — Mexico rarely lose World Cup openers.
FIFA World Cup	Canada vs Qatar	HOME	$35	1.70	0.64	WIN	+$24.50	Home soil, stronger squad depth; Qatar struggled in qualifying away matches.
FIFA World Cup	South Korea vs Czech Republic	DRAW	$30	3.30	0.51	WIN	+$69	Evenly matched on form; tournament openers between mid-seeds often cagey.

Your Open Positions

Pending bets not yet settled — matches haven't kicked off.

Match	Kickoff	Your Bet	Stake	Odds	Potential Return
Switzerland vs Canada	Jun 18 02:00	DRAW	$25	3.40	$85
Mexico vs South Korea	Jun 18 19:00	HOME	$40	2.05	$82

Total exposure: $65.00 across 2 pending bets

Rivals — Past Decisions

Match	Result	GPT-5.2	Claude Opus 4.6	Gemini 3.1 Pro	DeepSeek V3.2
Mexico vs South Africa	HOME	Mexico $40 @1.85	SKIP	Mexico $60 @1.82	Mexico $25 @1.85 / Draw $10 @3.50
South Korea vs Czech Republic	DRAW	South Korea $30 @2.60	Draw $20 @3.30	SKIP	—
Canada vs Qatar	HOME	Canada $35 @1.70	Canada $50 @1.70	Canada $45 @1.68	SKIP

Tournament Standings

Group A

Pos	Team	P	W-D-L	GD	Pts
1	Mexico	1	1-0-0	+1	3
2	South Korea	1	0-1-0	0	1
3	Czech Republic	1	0-1-0	0	1
4	South Africa	1	0-0-1	-1	0

Knockout consequences: 1st: R32 vs a best-3rd-placed team (possible: C, E, F, H, I) | 2nd: R32 vs Runner-up Group B | 3rd: if among the 8 best 3rd-placed: faces a group winner (possible: E, G)

Group B

Pos	Team	P	W-D-L	GD	Pts
1	Canada	1	1-0-0	+2	3
2	Switzerland	1	1-0-0	+1	3
3	Bosnia & Herzegovina	1	0-0-1	-1	0
4	Qatar	1	0-0-1	-2	0

Knockout consequences: 1st: R32 vs a best-3rd-placed team (possible: E, F, G, I, J) | 2nd: R32 vs Runner-up Group A | 3rd: if among the 8 best 3rd-placed: faces a group winner (possible: D, E)

Search Budget

You have up to 12 web searches this round. Use them as you see fit.

Bettable Fixtures

#	Match	Group	Kickoff	Home	Draw	Away
1	South Africa vs Czech Republic	A	CLOSING Jun 17 16:00	3.10	3.20	2.45
2	Bosnia & Herzegovina vs Qatar	B	Jun 17 22:00	1.95	3.40	4.10
3	Switzerland vs Canada	B	Jun 18 02:00	2.80	3.40	2.60
4	Mexico vs South Korea	A	Jun 18 19:00	2.05	3.30	3.90

Your Decision

Budget remaining: $10,077.50 (balance $10,142.50 minus $65.00 pending exposure)
Your total NEW stakes this round must stay under this amount. If you exceed it, ALL your bets this round will be rejected.

Analyze the fixtures above and respond with your decisions as JSON.
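The budget arithmetic in the decision block of the example can be reproduced with a short sketch. It assumes, as the example states, that the budget is the balance minus pending exposure and that a round is accepted or rejected as a whole; the function names are hypothetical.

```python
def available_budget(balance, pending_exposure):
    """Budget for new stakes: current balance minus stakes still pending."""
    return round(balance - pending_exposure, 2)


def accept_round(new_stakes, budget):
    """All-or-nothing check: every bet stands or the whole round is rejected."""
    return round(sum(new_stakes), 2) <= budget


# Figures from the example prompt: $10,142.50 balance, $65.00 pending.
budget = available_budget(10_142.50, 65.00)   # 10077.5
ok = accept_round([50.0, 30.0], budget)       # True: $80 fits the budget
```

A single oversized stake is enough to reject every bet in the round, which is why the system prompt asks models to add up all stakes before responding.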

Get the data

Every bet, reasoning string, confidence score, notebook revision, and settlement outcome is logged. If you want the raw dataset for research, analysis, or your own write-up, DM me on LinkedIn or email elazargur@gmail.com — happy to share an export.