ClawBio · Benchmark

GenomeQA v0.1

A benchmark for provenance-grounded genomic agents.

v0.1 in development

The problem

Most genomics evaluation rewards one thing: the correct answer. That is not enough for a clinical-grade agent. A genome question can be answered confidently and still be wrong, unsupported, or valid only for one ancestry. GenomeQA scores what a trustworthy genomic agent actually has to get right.

What GenomeQA scores

Correct answer: the fact, when there is one

+ Correct abstention: knowing when not to call

+ Provenance completeness: can the answer be reconstructed from sources?

+ Population robustness: does it hold across ancestries, not just European?

The scoring is the contribution. Confident hallucination and one-population validity are failures, not near-misses.
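As a rough illustration of what four-axis scoring could look like, here is a minimal Python sketch. It is not the GenomeQA scoring code: the `Response` fields, the `score` and `is_failure` functions, the ratio-based provenance and robustness measures, and the rule that "one-population validity" means at most one of several tested populations are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One agent response to one question. All field names are hypothetical."""
    answerable: bool         # the question has a supported fact to report
    abstained: bool          # the agent declined to make a call
    answer_correct: bool     # ignored when the agent abstained
    sources_required: int    # sources needed to reconstruct the answer
    sources_cited: int       # how many of those the agent actually cited
    populations_tested: int  # ancestry groups the answer was checked in
    populations_valid: int   # ancestry groups where the answer held


def score(r: Response) -> dict:
    """Report the four axes separately instead of collapsing them (assumed design)."""
    provenance = r.sources_cited / r.sources_required if r.sources_required else 1.0
    robustness = r.populations_valid / r.populations_tested if r.populations_tested else 1.0
    return {
        "correct_answer": float(r.answerable and not r.abstained and r.answer_correct),
        "correct_abstention": float(not r.answerable and r.abstained),
        "provenance_completeness": provenance,
        "population_robustness": robustness,
    }


def is_failure(r: Response) -> bool:
    """Treat confident hallucination and one-population validity as outright failures."""
    # Assumed definition: the agent answered, and was wrong or had nothing to answer.
    hallucinated = not r.abstained and (not r.answerable or not r.answer_correct)
    # Assumed definition: several populations tested, at most one where the answer holds.
    one_population = r.populations_tested > 1 and r.populations_valid <= 1
    return hallucinated or one_population


# A correct, mostly sourced answer that holds in only one of five populations.
example = Response(True, False, True, sources_required=4, sources_cited=3,
                   populations_tested=5, populations_valid=1)
print(score(example))
print(is_failure(example))
```

In this sketch the example response earns full marks on the answer axis yet is still flagged as a failure, because the answer holds in only one population.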

Sub-benchmarks

GenomeQA-PGx · GenomeQA-Clinical · GenomeQA-Equity · GenomeQA-Family · GenomeQA-Agentic · GenomeQA-1000G · GenomeQA-SGDP · GenomeQA-UKB

The roadmap

→ v0: 100 expert-validated questions across PGx, clinical, carrier, ancestry, equity, provenance and abstention.

→ Baselines: frontier LLMs (Claude, GPT, Gemini) compared against an executed-skill, provenance-grounded agent.

→ Preprint introducing GenomeQA and the agent.