AIAX Benchmark

Benchmark numbers are populated automatically when the Kognai harness runs. The placeholder values below will be replaced on first harness publish.

What is ATRR?

ATRR (Agent Task Resolution Rate) measures the fraction of standardised agent tasks that complete successfully end-to-end using only AIAX-described tool calls — no human intervention, no out-of-band documentation. A score of 1.0 means every task resolved on the first attempt. The publishable floor for Invoica v0.1 is 0.80.
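Read as a ratio, the definition above is resolved tasks divided by total tasks. The following is a minimal sketch of that arithmetic and of a check against the 0.80 floor. The function names `atrr` and `meets_floor` are hypothetical and are not part of the Kognai harness or the Invoica API, and the task counts are illustrative, not benchmark results.

```shell
#!/bin/sh
# Hypothetical helper (not part of the harness): ATRR = resolved / total.
# $1 = tasks resolved end-to-end, $2 = total tasks attempted.
atrr() {
  awk -v r="$1" -v t="$2" 'BEGIN {
    if (t <= 0) exit 1            # a rate is undefined without tasks
    printf "%.4f\n", r / t
  }'
}

# Hypothetical gate: exit status 0 when the score is at or above the
# 0.80 publishable floor stated for Invoica v0.1.
meets_floor() {
  awk -v s="$1" 'BEGIN { exit !(s + 0 >= 0.80) }'
}

# Illustrative numbers only: 17 of 20 tasks resolved.
score=$(atrr 17 20)
if meets_floor "$score"; then
  echo "ATRR $score meets the 0.80 floor"
else
  echo "ATRR $score is below the 0.80 floor"
fi
```

With 17 of 20 tasks resolved the sketch prints `ATRR 0.8500 meets the 0.80 floor`; with 15 of 20 the score is 0.7500 and the gate fails.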

Results

| Metric | Value |
| --- | --- |
| Median ATRR | pending first harness run |
| Mean ATRR | pending |
| P10 ATRR | pending |
| P90 ATRR | pending |
| Total calls measured | pending |
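Once scores exist, summary statistics like those in the table could be derived from a list of ATRR values. The sketch below is one possible approach, not the harness's method. It assumes one score per line on standard input and uses the nearest-rank percentile definition (so the median of an even-sized list is the lower middle value). The function name `atrr_summary` and the sample scores are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarise ATRR scores read from stdin, one per line.
# Percentiles use the nearest-rank method; the harness may use another.
atrr_summary() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1; sum += $1 }       # input arrives sorted ascending
    function pct(p,   k) {
      k = int(p * NR / 100)         # nearest rank = ceil(p/100 * N)
      if (k < p * NR / 100) k++
      if (k < 1) k = 1
      return v[k]
    }
    END {
      if (NR == 0) exit 1           # nothing to summarise
      printf "mean=%.4f median=%.4f p10=%.4f p90=%.4f\n",
             sum / NR, pct(50), pct(10), pct(90)
    }'
}

# Illustrative scores only, not benchmark results.
printf '%s\n' 0.90 0.70 1.00 0.85 0.80 | atrr_summary
```

For the five sample scores this prints `mean=0.8500 median=0.8500 p10=0.7000 p90=1.0000`.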

Task coverage

The harness runs 18 LLM calls across the following task classes:

| Task class | Tools exercised |
| --- | --- |
| Invoice creation | invoica_create_invoice |
| Settlement query | invoica_settle_invoice |
| Mandate verification | invoica_query_mandate |
| Dispute initiation | invoica_dispute_open |
| Pricing discovery | invoica_get_pricing |
| Trust signal check | invoica_get_trust_signals |

Reproducibility

Raw logs and per-call traces are published to S3 alongside each benchmark run. The reproducibility.raw_logs_url field in /aiax/benchmark.json links directly to the latest archive.
curl https://api.invoica.ai/aiax/benchmark.json | jq .reproducibility

Updating results

The Kognai benchmark harness posts results via:
curl -X POST https://api.invoica.ai/aiax/benchmark/publish \
  -H "x-admin-key: $ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d @benchmark.json
This fires an aiax.benchmark.published webhook event to all registered subscribers.
