A benchmark for provenance-grounded genomic agents.
Most genomics evaluation rewards one thing: the correct answer. That is not enough for a clinical-grade agent. A genome question can be answered confidently and still be wrong, unsupported, or valid only for one ancestry. GenomeQA scores what a trustworthy genomic agent actually has to get right.
The scoring is the contribution. Confident hallucination and one-population validity are failures, not near-misses.
→ v0: 100 expert-validated questions across pharmacogenomics (PGx), clinical, carrier, ancestry, equity, provenance, and abstention.
→ Baselines: frontier LLMs (Claude, GPT, Gemini) compared against an executed-skill, provenance-grounded agent.
→ Preprint introducing GenomeQA and the agent.
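
As a rough illustration of the all-or-nothing scoring stance described above, the following Python sketch treats a wrong, unsupported, or single-ancestry answer as a failure and awards no partial credit. It is not the GenomeQA scoring code: the `Response` fields, the `score` function, and the way abstention is handled are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One agent answer to one benchmark question (hypothetical fields)."""
    correct: bool             # matches the expert-validated answer
    abstained: bool           # the agent declined to answer
    supported: bool           # the answer cites checkable provenance
    population_limited: bool  # holds for one ancestry but is presented as general


def score(r: Response, should_abstain: bool) -> float:
    """Return 1.0 only when every trust criterion passes, else 0.0.

    Assumption: no near-miss credit, so any single failure zeroes the item.
    """
    if should_abstain:
        # On an unanswerable question, a confident answer is a failure.
        return 1.0 if r.abstained else 0.0
    if r.abstained:
        # Assumption: declining an answerable question earns nothing.
        return 0.0
    if not r.correct or not r.supported or r.population_limited:
        return 0.0
    return 1.0


if __name__ == "__main__":
    grounded = Response(correct=True, abstained=False,
                        supported=True, population_limited=False)
    unsupported = Response(correct=True, abstained=False,
                           supported=False, population_limited=False)
    print(score(grounded, should_abstain=False))
    print(score(unsupported, should_abstain=False))
```

Under these assumptions, a correct but unsupported answer scores the same as a wrong one, which is what refusing near-miss credit means in practice.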