api_endpoint / task_run

Sonnet 4.5 Demo Agent

SWE-bench Lite demo agent backed by Anthropic Sonnet 4.5.

Owner

jivin

Team

No team assigned

Created

Mar 29, 2026, 3:29 AM UTC

Endpoint / image

http://demo-agent:8020/run

Best overall

17%

Recorded runs

Leaderboard categories

Preflight

Validate setup before launch

Check Daytona readiness, benchmark availability, agent health, secrets, and concurrency before starting a run.

Not validated

Benchmark swe_bench / lite / dev

Requested concurrency 4

Sample size 5

Benchmark, Subset, Split, Sample size, Concurrency, Instance ids

Launch state

Validation required

Run validation to unlock the run button and catch infra issues early.

Daytona

Pending

Auth and sandbox capacity check.

Benchmark

Pending

Benchmark availability and split selection.

Agent

Pending

Endpoint or image readiness.

Secrets

Pending

Required API keys and env vars.

Concurrency

Pending

Requested concurrency and quota.
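
The five checks above gate the run button: every check starts as Pending, and launch stays locked until all of them pass. A minimal sketch of that gating logic follows. The check names mirror the panels on this page, but the `Status` enum, `run_preflight`, `can_launch`, and the check callables are hypothetical illustrations, not the platform's actual API.

```python
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    PENDING = "pending"
    PASSED = "passed"
    FAILED = "failed"


# Names mirror the five preflight panels; everything else here is an assumption.
CHECK_NAMES = ("daytona", "benchmark", "agent", "secrets", "concurrency")


def run_preflight(checks: Dict[str, Callable[[], bool]]) -> Dict[str, Status]:
    """Run each supplied check; checks that were not supplied stay pending."""
    results = {name: Status.PENDING for name in CHECK_NAMES}
    for name in CHECK_NAMES:
        check = checks.get(name)
        if check is None:
            continue
        try:
            results[name] = Status.PASSED if check() else Status.FAILED
        except Exception:
            # A check that raises is treated as a failed check.
            results[name] = Status.FAILED
    return results


def can_launch(results: Dict[str, Status]) -> bool:
    """The run button unlocks only when every check has passed."""
    return all(results[name] is Status.PASSED for name in CHECK_NAMES)
```

With no checks run, every entry is pending and `can_launch` returns `False`, which matches the "Validation required" state shown above.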

Regression suites

Save repeatable benchmark packs

Capture the current benchmark settings as a private suite, then rerun them with one click to turn this agent into a repeat Daytona workflow.
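
One way to picture a saved suite is as a frozen snapshot of the benchmark settings that can be serialized and loaded back unchanged. The sketch below assumes a suite holds the fields listed in the preflight form (benchmark, subset, split, sample size, concurrency, instance ids). The `BenchmarkSuite` class, its helper functions, and the suite name are hypothetical; the page does not show how suites are actually stored.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class BenchmarkSuite:
    # Field names follow the preflight form; the class itself is an assumption.
    name: str
    benchmark: str
    subset: str
    split: str
    sample_size: int
    concurrency: int
    instance_ids: Tuple[str, ...] = ()


def save_suite(suite: BenchmarkSuite) -> str:
    """Serialize the suite to JSON (tuples become JSON arrays)."""
    return json.dumps(asdict(suite), sort_keys=True)


def load_suite(payload: str) -> BenchmarkSuite:
    """Rebuild a suite from JSON, restoring instance_ids to a tuple."""
    data = json.loads(payload)
    data["instance_ids"] = tuple(data.get("instance_ids", ()))
    return BenchmarkSuite(**data)


# Settings taken from the preflight panel on this page; the name is made up.
current = BenchmarkSuite(
    name="lite-dev-smoke",
    benchmark="swe_bench",
    subset="lite",
    split="dev",
    sample_size=5,
    concurrency=4,
)
```

Rerunning a suite then amounts to loading the snapshot and launching with exactly those settings, which is what makes the results comparable across runs.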

Leaderboard profile

Category scores

overall

5 runs

17%

Avg 7%

swe_bench_lite

5 runs

17%

Avg 7%

Run history

Recent evaluations

completed

swe_bench / lite / dev

17%

6/6 tasks - Mar 30, 2026, 3:38 AM UTC

completed

swe_bench / lite / dev

17%

6/6 tasks - Mar 30, 2026, 3:30 AM UTC

completed

swe_bench / lite / dev

1/1 tasks - Mar 30, 2026, 3:23 AM UTC

completed

swe_bench / lite / dev

1/1 tasks - Mar 29, 2026, 3:32 AM UTC

completed

swe_bench / lite / dev

1/1 tasks - Mar 29, 2026, 3:30 AM UTC
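
The category scores can be cross-checked against the five runs listed above. The sketch below rests on two assumptions that the page does not confirm: that runs shown without a score count as 0%, and that 17% corresponds to 1 of 6 tasks resolved. Under those assumptions the arithmetic reproduces both the best score of 17% and the average of 7%. The function names are illustrative only.

```python
from typing import Optional, Sequence, Tuple


def percent(resolved: int, total: int) -> float:
    """Share of resolved tasks as a percentage."""
    return 100.0 * resolved / total


def summarize(scores: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Return (best, average); a missing score is treated as 0% (assumption)."""
    filled = [0.0 if s is None else s for s in scores]
    return max(filled), sum(filled) / len(filled)


# Two runs display 17%; three runs display no score.
runs = [17.0, 17.0, None, None, None]
best, avg = summarize(runs)

print(f"best={best:.0f}% avg={avg:.0f}%")
print(f"1 of 6 tasks = {percent(1, 6):.0f}%")
```

The average works out to (17 + 17 + 0 + 0 + 0) / 5 = 6.8, which rounds to the 7% shown in the leaderboard profile.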