Proof

Operator-native benchmarks.

Not MMLU. Not synthetic. Real OpenClaw workflows against real agent needs.

Test coverage: 6 metrics
Workflow types: Real OpenClaw
Reproducible: 100%
| Metric | Toolklaw | Generic A | Generic B | Winner |
| --- | --- | --- | --- | --- |
| Skill selection accuracy | 96% | 88% | 84% | Toolklaw |
| Schema-valid outputs | 98% | 91% | 93% | Toolklaw |
| Tool retry recovery | 93% | 79% | 82% | Toolklaw |
| Cost per 100 agent turns | $0.07 | $0.41 | $0.53 | Toolklaw |
| Budget-safe routing | Yes | Limited | Limited | Toolklaw |
| Long-session stability | 96% | 81% | 79% | Toolklaw |

Methodology

Every metric is repeatable.

How we design benchmarks, what workloads we test, and how we measure success.

01

OpenClaw workflow library

Test against real agent patterns: skill selection, tool loops, structured output, long-session memory.

02

Operator-native metrics

Measure what matters to automation: schema validity, retry recovery, budget containment.

03

Cost normalization

Compare routing cost vs always-premium, including BYO escalation thresholds.

04

Reproducible methodology

Published test suite, repeatable results, transparent cost calculation.

Details

Metric deep dives

What we measure, why it matters, how we test it.

Skill Selection Accuracy

Correct skill chosen from available set

Why it matters

OpenClaw agents make skill selection decisions 10+ times per session. Wrong skill = wasted turn.

Test method

500 real FAQ questions. 20 skills available. Measure correct skill selection on first try.
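A first-try scoring harness for this metric can be sketched in a few lines. Everything below is illustrative: `select_skill`, the `SKILLS` list, and the keyword router are hypothetical stand-ins, not the published Toolklaw suite.

```python
# Sketch: score first-try skill selection over (question, expected_skill) cases.
SKILLS = ["faq_lookup", "ticket_create", "refund_flow"]  # illustrative skill set

def accuracy(cases, select_skill):
    """Fraction of cases where the router picks the expected skill on the first try."""
    hits = sum(1 for q, expected in cases if select_skill(q, SKILLS) == expected)
    return hits / len(cases)

# Toy router standing in for the model under test.
def keyword_router(question, skills):
    if "refund" in question:
        return "refund_flow"
    if "ticket" in question:
        return "ticket_create"
    return "faq_lookup"

cases = [
    ("How do I request a refund?", "refund_flow"),
    ("Open a support ticket", "ticket_create"),
    ("What are your hours?", "faq_lookup"),
]
print(accuracy(cases, keyword_router))  # → 1.0
```

The real suite would swap the toy router for a model call and the three cases for the 500 FAQ questions, but the scoring logic stays this simple.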

Results

Toolklaw: 96% | Generic A: 88% | Generic B: 84%

Schema-Valid Outputs

JSON output parses without repair

Why it matters

Structured output failures require retry. Each failure costs tokens.

Test method

1,000 structured output requests. Measure first-pass JSON validity.
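First-pass validity has an unambiguous definition: the raw output string parses as JSON with no repair step. A minimal sketch of that check, with placeholder sample outputs:

```python
import json

def first_pass_validity(outputs):
    """Fraction of raw model outputs that parse as JSON without any repair."""
    valid = 0
    for raw in outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass  # a repair-and-retry would cost extra tokens; count as a miss
    return valid / len(outputs)

samples = ['{"ok": true}', '{"count": 3}', '{broken']  # illustrative outputs
print(first_pass_validity(samples))
```

Note that this deliberately does not attempt bracket-fixing or markdown-fence stripping: any output needing repair counts against the score.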

Results

Toolklaw: 98% | Generic A: 91% | Generic B: 93%

Tool Retry Recovery

Recovery after failed tool call

Why it matters

Network and API errors are common. Recovery without human intervention saves cost.

Test method

Inject random tool failures. Measure recovery success without escalation.

Results

Toolklaw: 93% | Generic A: 79% | Generic B: 82%

Cost Per 100 Turns

Average cost across 100 agent decisions

Why it matters

Long-running agents accumulate cost. Cheaper defaults = lower total cost.

Test method

Run 100-turn sessions. Measure total cost of routing, skills, and retries.
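The cost accounting can be sketched with a toy per-turn model. The tier names, prices, and turn mix below are illustrative placeholders, not Toolklaw's actual rates:

```python
# Toy cost model: each turn is billed by the tier that handled it,
# and every retry re-bills that tier. Prices are illustrative only.
TIER_COST = {"cheap": 0.0005, "premium": 0.004}

def session_cost(turns):
    """turns: list of (tier, retries). Total dollar cost of a session."""
    return sum(TIER_COST[tier] * (1 + retries) for tier, retries in turns)

# Hypothetical 100-turn session: 90 turns routed cheap (2 with one retry),
# 10 escalated to the premium tier.
turns = [("cheap", 0)] * 88 + [("cheap", 1)] * 2 + [("premium", 0)] * 10
print(round(session_cost(turns), 4))  # → 0.086
```

The point of the metric is that routing, skills, and retries are all folded into one total, so a cheap default with occasional escalation can be compared fairly against an always-premium baseline.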

Results

Toolklaw: $0.07 | Generic A: $0.41 | Generic B: $0.53

Budget-Safe Routing

Respects spend caps without degradation

Why it matters

Spend caps only matter if the agent still works when limits are reached.

Test method

Set budget cap, run agent until budget exhausted. Measure success rate.
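The "fallback works" behavior can be sketched as a router that degrades to a cheaper tier instead of overspending. Tier names and micro-dollar prices are hypothetical:

```python
# Costs in micro-dollars (integers) to avoid float drift; values illustrative.
PREMIUM, CHEAP = 4000, 500

def route(budget_left):
    """Pick the strongest tier the remaining budget allows; never overspend."""
    if budget_left >= PREMIUM:
        return "premium", PREMIUM
    if budget_left >= CHEAP:
        return "cheap", CHEAP
    return None, 0  # budget exhausted: halt rather than blow the cap

budget = 10_000  # $0.01 cap for the session
served = []
while True:
    tier, cost = route(budget)
    if tier is None:
        break
    budget -= cost
    served.append(tier)
print(served, budget)  # → ['premium', 'premium', 'cheap', 'cheap', 'cheap', 'cheap'] 0
```

The benchmark question is exactly what this loop exposes: when the cap bites, does the agent keep serving turns on the cheap tier, or does it simply stop working.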

Results

Toolklaw: Yes (fallback works) | Generic A: Limited | Generic B: Limited

Long-Session Stability

Consistency over 50+ turn sessions

Why it matters

Agent sessions can run for hours. Quality drift = cascade failures.

Test method

50-turn sessions. Measure success rate, cost consistency, output quality.
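One way to surface drift is to compare early-turn and late-turn success rates within a session. This is a sketch under assumed data, not the published measurement code:

```python
def stability(per_turn_success):
    """Split a session in half and compare early vs late success rates."""
    n = len(per_turn_success)
    early = sum(per_turn_success[: n // 2]) / (n // 2)
    late = sum(per_turn_success[n // 2:]) / (n - n // 2)
    return early, late, late - early  # negative drift = quality decaying

# Hypothetical 50-turn session: 1 = turn succeeded, 0 = turn failed.
session = [1] * 24 + [0] + [1] * 22 + [0, 1, 1]
early, late, drift = stability(session)
print(early, late, drift)  # → 0.96 0.96 0.0
```

A stable agent holds the late-half rate level with the early half; cascade failures show up as a sharply negative drift term.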

Results

Toolklaw: 96% | Generic A: 81% | Generic B: 79%

Transparency

All benchmark data is open source

Test suites, results, methodology — available for peer review and independent verification.

Test Suite

github.com/toolklaw/benchmarks

Complete Python test suite with all 500+ prompts and workflows.

View on GitHub →
Dataset

OpenClaw Workflow Library

Real agent patterns from 50+ production deployments, anonymized.

View Dataset →
Reports

Monthly benchmark reports

Latest results published first Tuesday of every month. Audited by independent parties.

Read Latest →

Important

What these benchmarks measure

And what they don't.

✓ Real

Production agent workloads

Test data from actual OpenClaw deployments, not synthetic toy problems.

✓ Repeatable

Published methodology

Complete code and prompts open source. Anyone can reproduce results.

⚠ Operator-specific

Not general LLM benchmarks

Measure agent-relevant metrics, not MMLU or academic benchmarks.

⚠ Point-in-time

Models improve constantly

These results are current as of publication date. Check monthly reports.

⚠ Workload-dependent

Your mileage may vary

Your agents may have different characteristics. Run your own tests.

✓ Cost normalized

Fair price comparison

All models tested on same compute, normalized for caching and batching.

Convinced? Let's run your agents cheap.

See for yourself how tk_ compares on your own workloads. Free plan included.