AIAX Benchmark

Benchmark numbers are populated automatically when the Kognai harness runs. The placeholder values below will be replaced on first harness publish.

What is ATRR?

ATRR (Agent Task Resolution Rate) measures the fraction of standardised agent tasks that complete successfully end-to-end using only AIAX-described tool calls — no human intervention, no out-of-band documentation. A score of 1.0 means every task resolved on the first attempt. The publishable floor for Invoica v0.1 is 0.80.

Results

Metric	Value
Median ATRR	pending first harness run
Mean ATRR	pending
P10 ATRR	pending
P90 ATRR	pending
Total calls measured	pending

Task coverage

The harness runs 18 LLM calls across the following task classes:

Task class	Tools exercised
Invoice creation	`invoica_create_invoice`
Settlement query	`invoica_settle_invoice`
Mandate verification	`invoica_query_mandate`
Dispute initiation	`invoica_dispute_open`
Pricing discovery	`invoica_get_pricing`
Trust signal check	`invoica_get_trust_signals`

Reproducibility

Raw logs and per-call traces are published to S3 alongside each benchmark run. The reproducibility.raw_logs_url field in /aiax/benchmark.json links directly to the latest archive.

curl https://api.invoica.ai/aiax/benchmark.json | jq .reproducibility

Updating results

The Kognai benchmark harness posts results via:

curl -X POST https://api.invoica.ai/aiax/benchmark/publish \
  -H "x-admin-key: $ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d @benchmark.json

This fires an aiax.benchmark.published webhook event to all registered subscribers.

Getting Started

Core Concepts

AIAX

Guides

Case Studies

AIAX Benchmark

AIAX Benchmark

What is ATRR?

Results

Task coverage

Reproducibility

Updating results

See Also

​AIAX Benchmark

​What is ATRR?

​Results

​Task coverage

​Reproducibility

​Updating results

​See Also

AIAX Benchmark

What is ATRR?

Results

Task coverage

Reproducibility

Updating results

See Also