
Verifiable Benchmarking of Long-Horizon Spatial Biology

Introducing SpatialBench-Long, a benchmark for long-horizon spatial biology. Agents must recover biological claims from raw data and realistic experimental context without prescribed methods.

The 24 evaluations span primary tumors, organoids, xenograft models, lineage-tracing systems, and aging/intervention biology. The best-performing agents pass 11.1% of attempts.

Thanks for reading! Subscribe for free to receive new posts and support my work.

Read the manuscript.
Interact with the leaderboard.


Benchmark Construction

Evaluations mix the assay types scientists use in practice.

A single task may depend on spatial transcriptomics, histology, single-cell references, and lineage-recording data. Solving such tasks requires cross-assay reasoning, experimental-design awareness, and command of spatial workflows like tissue segmentation, niche analysis, and spatial differential expression.

This tests the transition from running data analysis to doing science.

Constructing ground truth in science

Ground truth is hard to establish in long-horizon biology benchmarks. The same data can support multiple valid conclusions, and some published claims do not reproduce cleanly under unbiased reanalysis.

Candidate tasks are hardened through independent reproduction, randomized expert review, and trajectories from multiple model families.

Pairing deterministic grading with diagnostic rubrics

Grading uses deterministic functions over structured final answers. We grade recovery of scientific conclusions expressed through controlled biological vocabularies instead of numbers from individual statistical operations.
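
As a rough illustration of this kind of grading, here is a minimal sketch of a deterministic grader over a structured final answer. The field names, the vocabularies, and the functions `normalize` and `grade` are all hypothetical; the post does not describe the benchmark's actual schema or grading code.

```python
from typing import Mapping, Optional, Set


def normalize(term: str, vocabulary: Set[str]) -> Optional[str]:
    """Map a raw term onto the controlled vocabulary, or None if it is not a member."""
    cleaned = term.strip().lower()
    return cleaned if cleaned in vocabulary else None


def grade(
    answer: Mapping[str, str],
    expected: Mapping[str, str],
    vocabularies: Mapping[str, Set[str]],
) -> bool:
    """Pass only if every expected field holds the expected vocabulary term."""
    for field, want in expected.items():
        got = normalize(str(answer.get(field, "")), vocabularies[field])
        if got is None or got != want:
            return False
    return True


# Hypothetical vocabularies and expected conclusion, for illustration only.
VOCABULARIES = {
    "cell_type": {"macrophage", "fibroblast", "t cell"},
    "direction": {"increased", "decreased", "unchanged"},
}
EXPECTED = {"cell_type": "macrophage", "direction": "increased"}

submission = {"cell_type": " Macrophage ", "direction": "increased"}
print(grade(submission, EXPECTED, VOCABULARIES))
```

Because the answer is compared as vocabulary terms and not as numbers, two agents that reach the same conclusion through different statistical routes receive the same grade, and the same submission always grades the same way.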

Results

Across 15 model-harness pairs and 1,080 trajectories, Gemini 3.5 Flash / Pi, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex each passed 8/72 attempts (11.11%), with Claude Opus 4.6 / Claude Code close behind at 7/72 attempts (9.72%).
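
The percentages above follow directly from the reported counts, and the short check below reproduces them. The figure of three attempts per evaluation is an inference that assumes attempts were spread evenly across the 24 evaluations; the post does not state it directly.

```python
TOTAL_TRAJECTORIES = 1080
MODEL_HARNESS_PAIRS = 15
EVALUATIONS = 24

# 1,080 trajectories over 15 pairs gives the 72 attempts per pair quoted above.
attempts_per_pair = TOTAL_TRAJECTORIES // MODEL_HARNESS_PAIRS

# Assumption: attempts are spread evenly across evaluations.
attempts_per_evaluation = attempts_per_pair // EVALUATIONS


def pass_rate(passed: int, attempts: int) -> float:
    """Pass rate as a percentage, rounded to two decimals."""
    return round(100 * passed / attempts, 2)


print(attempts_per_pair, attempts_per_evaluation)
print(pass_rate(8, attempts_per_pair), pass_rate(7, attempts_per_pair))
```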

Judge rubrics over analysis ‘chokepoints’ provide interpretive tools

Final-answer grading alone provides sparse diagnostic signal for long-horizon tasks. As a companion diagnostic, judges grade chokepoint rubrics: analysis decisions expected to remain stable across plausible solution paths.

But rubric judging is probably not a replacement for verifiable grading

We explored whether rubric grading could serve as an intermediate reward, asking whether its denser signal aligns, at least partially, with endpoint quality. We conclude that rubric scores are promising auxiliary tools, not substitutes for verifiable endpoint grading.
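
One simple way to quantify such alignment is the point-biserial correlation between a binary endpoint outcome (pass or fail) and a continuous rubric score. The sketch below is an assumption about how this could be measured, not the analysis used in the manuscript; the function name `point_biserial` and the toy inputs are made up for illustration.

```python
import math
from typing import Sequence


def point_biserial(passed: Sequence[bool], rubric_scores: Sequence[float]) -> float:
    """Pearson correlation between a pass/fail outcome and a rubric score."""
    if len(passed) != len(rubric_scores) or len(passed) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    xs = [1.0 if p else 0.0 for p in passed]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(rubric_scores) / len(rubric_scores)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rubric_scores))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in rubric_scores)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when either variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy data, not benchmark results: two failed and two passed trajectories.
toy_passed = [False, False, True, True]
toy_rubric = [0.2, 0.4, 0.6, 0.8]
print(round(point_biserial(toy_passed, toy_rubric), 3))
```

With pass rates near 11%, most trajectories share the same endpoint label, so any correlation of this kind rests on few positive examples and should be read with caution.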

Rubric judging is reasonably stable over judge and agent model variation

We also stratified rubric scores by the agent model that produced each judged trajectory, to check whether rubric patterns are consistent across judges. All four judges preserved the same broad ordering of agent models, with some variability.
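
A stratification of this kind can be sketched as follows: average the rubric score for each (judge, agent) cell, then compare the order in which each judge ranks the agents. The record layout, the function names, and the toy scores are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, float]  # (judge, agent, rubric score)


def mean_scores(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Average rubric score for each (judge, agent) cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for judge, agent, score in records:
        cell = sums[(judge, agent)]
        cell[0] += score
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def agent_order_by_judge(records: Iterable[Record]) -> Dict[str, List[str]]:
    """For each judge, list agents from highest to lowest mean rubric score."""
    per_judge: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (judge, agent), mean in mean_scores(records).items():
        per_judge[judge][agent] = mean
    return {
        judge: sorted(scores, key=scores.get, reverse=True)
        for judge, scores in per_judge.items()
    }


# Toy records, not benchmark data.
toy_records = [
    ("judge_1", "agent_a", 0.8), ("judge_1", "agent_a", 0.6),
    ("judge_1", "agent_b", 0.3),
    ("judge_2", "agent_a", 0.9), ("judge_2", "agent_b", 0.5),
]
orders = agent_order_by_judge(toy_records)
same_order = len({tuple(order) for order in orders.values()}) == 1
print(orders, same_order)
```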

Actually looking at your trajectories will continue to be important

Pairing manual trajectory review with rubric and verifiable scores provides more tools to interpret model failures.

In practice, manual trajectory inspection is a first-class tool for understanding this data. Eval authors maintain reproduction notes as a record for future benchmark updates, especially since stronger models may solve tasks through unanticipated but valid analysis paths that challenge current grading assumptions.

There is a path to agents that behave a lot like scientists

Results suggest that compounding local analysis errors prevent reliable long-horizon scientific reasoning.

Before models can reliably reason about disease mechanisms, drug response, or other deep results in biology, they must become procedurally competent in local steps.
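
To see why local errors compound, consider a deliberately simplified model that is not taken from the manuscript: a task consists of a number of analysis steps, each step succeeds independently with the same probability, and the task succeeds only if every step does. The function name `chain_success` and the step counts are illustrative assumptions.

```python
def chain_success(step_reliability: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_reliability ** steps


# Illustrative values only: 20 steps at 90% versus 99% per-step reliability.
for reliability in (0.90, 0.99):
    print(reliability, round(chain_success(reliability, 20), 3))
```

Under these assumptions, a 20-step analysis at 90% per-step reliability finishes cleanly only about 12% of the time, while 99% per-step reliability raises that to roughly 82%. Real trajectories are not independent chains of equal steps, so this is an intuition aid, not a fit to the benchmark results.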

Still, the few task completions we observed were impressive, and they suggest a realistic path to agents that think and behave much like scientists.



John Doe
