OpenClaw workflow library
Test against real agent patterns: skill selection, tool loops, structured output, long-session memory.
Proof
Not MMLU. Not synthetic. Real OpenClaw workflows against real agent needs.
Methodology
How we design benchmarks, what workloads we test, and how we measure success.
Measure what matters to automation: schema validity, retry recovery, budget containment.
Compare routing cost vs always-premium, including BYO escalation thresholds.
Published test suite, repeatable results, transparent cost calculation.
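The routing comparison above turns on an escalation threshold: stay on the cheap default, escalate to premium only when confidence drops below a cutoff you bring yourself. A minimal sketch of that decision; the `route` function, threshold value, and model names here are illustrative, not part of the published suite:

```python
def route(confidence, threshold=0.8):
    """Pick the cheap default unless the model's self-reported
    confidence falls below the BYO escalation threshold."""
    return "cheap" if confidence >= threshold else "premium"

# Three example decisions: confident, unsure, borderline-confident.
choices = [route(c) for c in (0.95, 0.60, 0.85)]
```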
Details
What we measure, why it matters, how we test it.
Correct skill chosen from available set
OpenClaw agents make skill selection decisions 10+ times per session. Wrong skill = wasted turn.
500 real FAQ questions. 20 skills available. Measure correct skill selection on first try.
Toolklaw: 96% | Generic A: 88% | Generic B: 84%
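First-try accuracy on a test like this reduces to a simple scoring harness. The sketch below is a simplified stand-in for the published suite: the keyword-matching selector and the two-skill cases are invented for illustration, whereas the real tests query the model under test across 20 available skills:

```python
def skill_accuracy(select, cases):
    """Fraction of cases where the right skill is chosen on the first try."""
    hits = sum(1 for question, skills, expected in cases
               if select(question, skills) == expected)
    return hits / len(cases)

# Stand-in selector: naive keyword match (a real harness calls the model).
def keyword_select(question, skills):
    for skill in skills:
        if skill in question.lower():
            return skill
    return skills[0]

cases = [
    ("how do I update my billing details?", ["billing", "shipping"], "billing"),
    ("my shipping address is wrong", ["billing", "shipping"], "shipping"),
]
accuracy = skill_accuracy(keyword_select, cases)
```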
JSON output parses without repair
Structured output failures require retry. Each failure costs tokens.
1,000 structured output requests. Measure first-pass JSON validity.
Toolklaw: 98% | Generic A: 91% | Generic B: 93%
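"Parses without repair" means exactly one `json.loads` attempt per raw output, no fix-up pass. A minimal sketch of that measurement, with invented sample outputs (one of which has an unquoted value that would need repair):

```python
import json

def first_pass_validity(outputs):
    """Fraction of raw outputs that parse as JSON with no repair step."""
    valid = 0
    for raw in outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)

samples = ['{"status": "ok"}', '{"status": ok}', '[1, 2, 3]']
rate = first_pass_validity(samples)
```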
Recovery after failed tool call
Network and API errors are common. Recovery without human intervention saves cost.
Inject random tool failures. Measure recovery success without escalation.
Toolklaw: 93% | Generic A: 79% | Generic B: 82%
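Fault injection plus retry is the core of this test. A self-contained sketch, assuming a simple retry loop and a flaky tool wrapper; the names and retry policy are illustrative, not the suite's actual implementation:

```python
def call_with_recovery(tool, args, max_retries=2):
    """Retry a failed tool call. Returns (result, recovered), where
    recovered is True if a retry (not the first attempt) succeeded.
    Exhausting retries raises, which would escalate to a human."""
    for attempt in range(max_retries + 1):
        try:
            return tool(args), attempt > 0
        except ConnectionError:
            continue
    raise RuntimeError("escalation required")

class FlakyTool:
    """Fault injection: the first call fails, later calls succeed."""
    def __init__(self):
        self.calls = 0
    def __call__(self, args):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("injected fault")
        return {"echo": args}

result, recovered = call_with_recovery(FlakyTool(), "ping")
```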
Average cost across 100 agent decisions
Long-running agents accumulate cost. Cheaper defaults = lower total cost.
Run 100-turn sessions. Measure total cost of routing, skills, and retries.
Toolklaw: $0.07 | Generic A: $0.41 | Generic B: $0.53
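Accumulating cost over a 100-turn session is straightforward to sketch. The prices and the every-10th-turn escalation policy below are invented for illustration; they are not real pricing or the published routing logic, but they show why a cheap default dominates total session cost:

```python
def session_cost(turns, route):
    """Total cost of a session, given a per-turn router returning (model, cost)."""
    return sum(route(turn)[1] for turn in range(turns))

# Illustrative per-decision prices in dollars; not real pricing.
PRICES = {"cheap": 0.0004, "premium": 0.004}

def routed(turn):
    # Escalate every 10th turn; stay on the cheap default otherwise.
    model = "premium" if turn % 10 == 9 else "cheap"
    return model, PRICES[model]

def always_premium(turn):
    return "premium", PRICES["premium"]

routed_total = session_cost(100, routed)          # 90 cheap + 10 premium turns
premium_total = session_cost(100, always_premium)
```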
Respects spend caps without degradation
Spend caps only matter if the agent still works when limits are reached.
Set budget cap, run agent until budget exhausted. Measure success rate.
Toolklaw: Yes (fallback works) | Generic A: Limited | Generic B: Limited
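"Fallback works" means the agent keeps completing turns on a cheaper model once the cap would otherwise be blown, rather than failing outright. A minimal sketch of that behavior, with invented cap and prices (costs are integers, tenths of a cent, to keep the arithmetic exact):

```python
def run_under_cap(turns, cap, primary_cost, fallback_cost):
    """Hard spend cap with graceful degradation: once the primary model
    would exceed the cap, fall back to a cheaper model instead of
    failing the turn."""
    spent, completed, fallbacks = 0, 0, 0
    for _ in range(turns):
        if spent + primary_cost <= cap:
            spent += primary_cost
        elif spent + fallback_cost <= cap:
            spent += fallback_cost
            fallbacks += 1
        else:
            break  # cap genuinely exhausted
        completed += 1
    return completed, fallbacks, spent

# 50-turn session, $1.01 cap, $0.04 primary vs $0.004 fallback per turn.
completed, fallbacks, spent = run_under_cap(50, cap=1010,
                                            primary_cost=40, fallback_cost=4)
```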
Consistency over 50+ turn sessions
Agent sessions can run for hours. Quality drift = cascade failures.
50-turn sessions. Measure success rate, cost consistency, output quality.
Toolklaw: 96% | Generic A: 81% | Generic B: 79%
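One way to quantify drift in a long session is to compare early-turn and late-turn success rates. A sketch under that assumption; the window size and the toy sessions are illustrative, not the suite's actual metric:

```python
from statistics import mean

def drift(success_by_turn, window=10):
    """Early-vs-late success gap over a long session; a positive value
    means quality degraded as the session went on."""
    return mean(success_by_turn[:window]) - mean(success_by_turn[-window:])

# Toy 50-turn sessions: one steady, one that starts failing after turn 30.
steady = [1] * 50
degrading = [1] * 30 + [1, 0] * 10

steady_drift = drift(steady)
degrading_drift = drift(degrading)
```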
Transparency
Test suites, results, methodology — available for peer review and independent verification.
Complete Python test suite with all 500+ prompts and workflows.
View on GitHub →
Real agent patterns from 50+ production deployments, anonymized.
View Dataset →
Latest results published first Tuesday of every month. Audited by independent parties.
Read Latest →
Important
What these benchmarks show, and what they don't.
Test data from actual OpenClaw deployments, not synthetic toy problems.
Complete code and prompts open source. Anyone can reproduce results.
Measure agent-relevant metrics, not MMLU or academic benchmarks.
These results are current as of publication date. Check monthly reports.
Your agents may have different characteristics. Run your own tests.
All models tested on same compute, normalized for caching and batching.
See for yourself how tk_ compares on your own workloads. Free plan included.