LABBench2 Analysis Report

Repository: https://github.com/EdisonScientific/labbench2
Dataset: https://huggingface.co/datasets/EdisonScientific/labbench2
License: CC BY-SA 4.0
Local path: /home/bianc/projects/benchmarks_agentic_rl/labbench2/

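1. Project Overview

Positioning: The successor to LAB-Bench (arXiv 2407.10362), a benchmark of biology research tasks released by Edison Scientific (the former FutureHouse author team).
Scale: About 1,900 tasks covering literature QA, data retrieval, experimental design, cloning, sequence analysis, and other areas.
Difficulty: The paper reports that accuracy drops by 26% to 46% relative to LAB-Bench on comparable subtasks; the benchmark is meant to measure the real research capability of frontier models.
Repository role: Provides the evaluation harness plus programmatic validation tools for cloning and seqqa2; the dataset is hosted separately on HuggingFace.
Authors: Jon M. Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L. Steenwyk, Blake Lash, Andrew D. White, Samuel G. Rodriques (2026).
Paper PDF: https://drive.google.com/file/d/1BV5UtmBRdpbQoz9jC1AuUF8WUTRQMqK_/view
Tech stack: Python 3.11–3.13, Go 1.21+ (required for cloning validation), uv for dependency management, pydantic-ai / pydantic-evals / datasets (HuggingFace) / anthropic / openai / google-genai, among others.
Data cache: Attachments (PDF, FASTA, images, etc.) are downloaded on demand from Google Cloud Storage into ~/.cache/labbench2.
Report output: Defaults to assets/reports/{tag}/{mode}/{model}.json; the paper's reproduction results live in assets/reports_paper/.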

2. The 14 Task Tags

Filter with --tag. Each tag forms a sub-benchmark with its own evaluation strategy.

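Tag	Capability tested	Attachment type	Evaluation strategy
`cloning`	Molecular cloning protocols (Gibson / Golden Gate / restriction digestion / chained PCR)	`.fa` / `.gb` / `.txt` templates and primers	Programmatic reward function `cloning_reward`
`seqqa2`	23 kinds of DNA/protein sequence analysis (GC content, primers, codon optimization, etc.)	FASTA / protein sequence files	23 sub-validators (`VALIDATORS`)
`dbqa2`	Database lookup / structured-field extraction (Ensembl, UniProt style)	May include JSON / none	Recall-based LLM judge (≥0.95)
`litqa3`	Reading comprehension over research papers	PDF papers	General LLM judge
`patentqa`	Biomedical patent QA	PDF	General LLM judge
`protocolqa2`	Experimental protocol troubleshooting / QA	Protocol text	General LLM judge
`trialqa`	Clinical trial information QA	ClinicalTrials-style text	General LLM judge
`sourcequality`	Source credibility assessment (rebuilt with 150 questions on 2026-03-13)	Literature / citation text	General LLM judge
`suppqa2`	QA over paper supplementary materials	PDF / supplementary files	Exact-match LLM judge (1e-6 tolerance)
`figqa2` / `figqa2-img` / `figqa2-pdf`	Reading figures from papers (delivered as an image or as a PDF)	Images / PDF	Exact-match LLM judge
`tableqa2` / `tableqa2-img` / `tableqa2-pdf`	Reading tables from papers	Images / PDF / text	Exact-match LLM judge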

Evaluation routing (HybridEvaluator, see evals/evaluators.py:178):

tag ∈ {cloning, seqqa2}         → programmatic reward function
tag == dbqa2                    → LLM judge, correct when recall ≥ 0.95
tag starts with figqa2, tableqa2, or suppqa2 → exact-match judge (numbers/strings)
otherwise                       → general semantic LLM judge
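The routing rules above can be restated as a small function. This is only a sketch of the decision order described in this report, not the code in evals/evaluators.py; the function name `route` and the returned labels are hypothetical.

```python
def route(tag: str) -> str:
    """Return a label for the evaluation path a tag would take (illustrative only)."""
    # Deterministic, programmatic rewards.
    if tag in {"cloning", "seqqa2"}:
        return "reward"
    # Recall-based LLM judge.
    if tag == "dbqa2":
        return "recall_judge"
    # Prefix match also covers variants such as figqa2-img and tableqa2-pdf.
    if tag.startswith(("figqa2", "tableqa2", "suppqa2")):
        return "exact_judge"
    # Everything else falls through to the general semantic judge.
    return "semantic_judge"


for example in ("cloning", "dbqa2", "figqa2-pdf", "litqa3"):
    print(example, "->", route(example))
```

Using a prefix match rather than a set lookup is what lets the `-img` and `-pdf` variants share the exact-match judge.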

Default judge model: anthropic:claude-sonnet-4-5, temperature=0, timeout 120 s.

3. Data Model and Loading Flow

3.1 `LabBenchQuestion` (evals/models.py)

Each sample contains:

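Field	Meaning
`id`	Globally unique ID
`tag`	Subset name (one of the 14 tags)
`version`	Dataset version
`type`	Fine-grained question type (seqqa2 only: `amplicon_gc`, `gibson_primers`, etc.; empty otherwise)
`question`	Question text
`ideal`	Reference answer
`files`	GCS relative prefix (attachment path)
`sources`	Provenance URLs / citations
`prompt_suffix`	Extra hint appended to the end of the prompt
`validator_params`	JSON string of static parameters passed to the programmatic validator (for example reference enzymes, tolerances)
`answer_regex`	Used as `<answer>{regex}</answer>` to extract a structured answer from the model output
`mode`	Boolean flags recording which of the three modes (inject/file/retrieve) the question supports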

3.2 Loading (evals/loader.py)

create_dataset(tag, ids, limit, mode, native): calls HuggingFace load_dataset("EdisonScientific/labbench2", config) with config = tag or "all".
create_case: converts each question into a pydantic_evals.Case.
The mode determines how attachments are injected.
Attachments are downloaded from GCS and cached via download_question_files(GCS_BUCKET, question.files).
The metadata carries id / tag / type / sources / validator_params / answer_regex / files_path for the evaluators to use later.

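3.3 Three File-Handling Modes (Mode)

Mode	Behavior	Typical use
`file`	Default. Uploads files through the API; PDFs/images go straight into context, and text files are mounted on a virtual file system if the runner supports it	Questions with attachments
`inject`	Concatenates the contents of readable text files into the prompt (`## filename\n\n...`)	Small FASTA / TXT files
`retrieve`	Tells the agent to retrieve the sequence data for the listed file names on its own	RAG / external retrieval scenarios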
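To make the `inject` mode concrete, here is a minimal sketch of how text attachments could be appended to a prompt using the `## filename` layout from the table. The function name `build_inject_prompt` and the `TEXT_SUFFIXES` set are hypothetical stand-ins; the loader in the repository may decide readability differently.

```python
from pathlib import Path

# Assumption: which suffixes count as "readable text" is chosen here for illustration.
TEXT_SUFFIXES = {".fa", ".fasta", ".gb", ".gbk", ".txt"}


def build_inject_prompt(question: str, files: list[Path]) -> str:
    """Append each readable text file to the question as a '## name' section."""
    sections = [question]
    for path in files:
        if path.suffix.lower() not in TEXT_SUFFIXES:
            continue  # binary attachments (images, PDFs) are skipped in this sketch
        text = path.read_text(encoding="utf-8")
        sections.append(f"## {path.name}\n\n{text}")
    return "\n\n".join(sections)
```

Because whole files are pasted into the prompt, this layout only makes sense for small attachments, which matches the "small FASTA / TXT" use case in the table.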

4. How Agents Are Run

--agent accepts three formats, all dispatched by the evals/run_evals.py entry point:

4.1 Pydantic-AI Models: `provider:model[@flags]`

Wrapped uniformly through pydantic-ai; each agent simply calls agent.run(question).

@tools → enable all built-in tools of that provider (WebSearch + Code Execution + WebFetch)
@search → WebSearch only
@code → Code Execution only
@high / @medium / @low → reasoning effort
Flags can be combined: @tools,high

4.2 Native SDK Runners: `native:provider:model[@flags]`

Bypasses pydantic-ai and calls each vendor's SDK directly for better file and tool support.

Runner	Location	Key capabilities
`AnthropicAgentRunner`	`evals/runners/anthropic.py`	Built-in `code_execution_20250825`, `web_search_20250305`, `web_fetch_20250910`; betas: `files-api-2025-04-14`/`code-execution-2025-08-25`/`web-fetch-2025-09-10`; output cap of 8192 for Haiku and 64000 for Sonnet/Opus; default extended-thinking budget of 32000; handles the `pause_turn` loop (up to 20 rounds, roughly 200 tool calls); smart file routing (PDFs/images → context, text files → sandbox file system when code execution is enabled)
`OpenAIResponsesRunner`	`evals/runners/openai.py`	`openai-responses:` Responses API
`OpenAICompletionsRunner`	`evals/runners/openai_completions.py`	`openai-completions:` Chat Completions API
`GoogleVertexRunner`	`evals/runners/google.py`	`google-vertex:` (OAuth via `setup_google_vertex_env`)

File system support by runner:

Runner	File system (attachments other than images/PDFs)
Anthropic native (`@tools`/`@code`)	✅
OpenAI native (`@tools`/`@code`)	✅
Google native	❌ (context only)
Pydantic-AI	❌ (context only)

4.3 Custom Runners: `external:path/to/file.py:ClassName`

Implement the AgentRunner protocol (evals/runners/base.py:24). Core methods:

async def upload_files(files: list[Path], gcs_prefix: str | None) -> dict[str, str]
async def execute(question: str, file_refs: dict[str, str] | None) -> AgentResponse
def extract_answer(response: AgentResponse) -> str
async def download_outputs(dest_dir: Path) -> Path | None   # output files (primers/FASTA)
async def cleanup() -> None

The repository ships a reference implementation: external_runners/edison_analysis_runner.py.

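4.4 Task Wrapping (`create_agent_runner_task`)

When mode == "file", it first uploads the files under files_path in bulk and obtains file_refs.
It calls runner.execute and records usage in UsageStats.
It calls runner.download_outputs(tmp_dir) to fetch files generated by the agent (such as an assembled plasmid FASTA) and writes the location back to inputs["files_path"] so the cloning reward can use it later.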

5. Evaluator Architecture

evals/evaluators.py defines three main classes:

5.1 `LLMJudgeEvaluator`

Defaults: anthropic:claude-sonnet-4-5, temperature=0, timeout=120.
Structured output: EvaluationResult{rationale, result ∈ {correct, incorrect, unsure}}.
correct → 1.0; every other result scores 0.0.

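5.2 Three Prompt Templates (`evals/prompts.py`)

Template	Purpose	Key rules
`STRUCTURED_EVALUATION_PROMPT`	General semantic judgment	Accepts semantic equivalence and reasonable approximations
`STRUCTURED_EVALUATION_PROMPT_EXACT_MATCH`	Exact numeric match	Correct only if the absolute or relative error is < 1e-6
`STRUCTURED_EVALUATION_PROMPT_DATA_ACCESS_BENCH_RECALL`	Data retrieval recall	Parse atomic claims → match against each expected field → correct if recall ≥ 0.95; numeric tolerance ±5%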
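In the benchmark these rules are applied by an LLM judge following the prompt, not by code. Purely to pin down the arithmetic, the two thresholds can be restated deterministically as below; the function names are hypothetical, and the handling of a zero expected value or an empty field set is an assumption of this sketch.

```python
def numbers_match(expected: float, actual: float, tol: float = 1e-6) -> bool:
    """True if the absolute OR the relative error is below tol."""
    abs_err = abs(expected - actual)
    if abs_err < tol:
        return True
    # Relative error is undefined for expected == 0; treat that as a mismatch.
    return expected != 0 and abs_err / abs(expected) < tol


def recall_passes(expected: set[str], matched: set[str], threshold: float = 0.95) -> bool:
    """True if the share of expected fields that were matched reaches the threshold."""
    if not expected:
        return False  # assumption: nothing to recall counts as a failure
    recall = len(expected & matched) / len(expected)
    return recall >= threshold


print(numbers_match(1e9, 1e9 + 100))   # relative error 1e-7
print(recall_passes({f"f{i}" for i in range(20)}, {f"f{i}" for i in range(19)}))
```

With a 0.95 threshold, a question with 20 expected fields tolerates exactly one miss, while any question with fewer than 20 fields tolerates none.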

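5.3 `RewardFunctionEvaluator`

cloning: calls cloning_reward(answer, base_dir=files_path, reference_path={id}_assembled.fa, validator_params=...), which returns (score, reason).
seqqa2:
Extracts named capture groups from the <answer>...</answer> block of the LLM output using answer_regex.
Looks up the sub-validator in VALIDATORS by question.type; if the parameter name the validator expects is not "answer", the extracted value is renamed automatically (for example to oligo or optimized_dna).
Merges the static validator_params, the extracted answer, and any required file paths (parameters ending in _path are resolved to files on disk through resolve_file_path).
Calls validator.func(**kwargs), which yields 1.0/0.0 plus a reason.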
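The seqqa2 steps above can be sketched end to end with a toy validator. Everything named here is hypothetical: `TOY_VALIDATORS`, `gc_content_check`, and `score_seqqa2` are illustrations, the toy validator's parameters (`sequence`, `tolerance`) are invented for the example, and file-path resolution is omitted.

```python
import json
import re


def gc_content_check(answer: str, sequence: str, tolerance: float) -> float:
    """Toy validator: compare the claimed GC percentage with the computed one."""
    gc = 100 * sum(base in "GC" for base in sequence.upper()) / len(sequence)
    return 1.0 if abs(float(answer) - gc) <= tolerance else 0.0


# question type -> (validator function, name of the parameter that receives the answer)
TOY_VALIDATORS = {"gc_content": (gc_content_check, "answer")}


def score_seqqa2(output: str, question_type: str, answer_regex: str,
                 validator_params: str) -> tuple[float, str]:
    func, answer_param = TOY_VALIDATORS[question_type]
    # 1. Extract named groups from the <answer>...</answer> block.
    match = re.search(f"<answer>{answer_regex}</answer>", output, re.DOTALL)
    if match is None:
        return 0.0, "no answer matching answer_regex"
    extracted = match.groupdict()
    # 2. Rename the captured value if the validator expects another parameter name.
    if answer_param != "answer" and "answer" in extracted:
        extracted[answer_param] = extracted.pop("answer")
    # 3. Merge static parameters with the extracted answer, then 4. call the validator.
    kwargs = {**json.loads(validator_params), **extracted}
    return func(**kwargs), "validated"


print(score_seqqa2(
    "The GC content is <answer>50.0</answer>",
    "gc_content",
    r"(?P<answer>[0-9.]+)",
    '{"sequence": "ATGC", "tolerance": 0.5}',
))
```

Keeping the question-specific values in `validator_params` means one validator function can serve every question of its type.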

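5.4 `HybridEvaluator`

The single entry point for running evaluation. It routes by tag to the four evaluation paths described above (programmatic reward / recall judge / exact-match judge / general judge).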

6. The 23 Programmatic Validators of seqqa2

The VALIDATORS dictionary in src/labbench2/seqqa2/registry.py:

Key	Reward function	Extracted answer parameter	Validation target
`gc_content`	`gc_content_reward`	`answer`	GC content of the full sequence
`amplicon_gc`	`amplicon_gc_reward`	`answer`	GC content of the amplicon
`primer_design`	`restriction_cloning_reward`	`answer`	Primer design with restriction sites
`amplicon_length`	`cds_primers_reward`	`answer`	PCR amplicon length
`gibson_primers`	`gibson_primers_reward`	`answer`	Overlapping primers for Gibson assembly
`cds_oligo`	`cds_oligo_reward`	`oligo`	CDS oligonucleotide synthesis
`cds_primers`	`cds_primers_reward`	`answer`	Primers amplifying the CDS region
`mutation_restriction`	`mutation_restriction_reward`	`answer`	Mutation that introduces a restriction site
`mutation_synonymous`	`mutation_synonymous_reward`	`answer`	Synonymous mutation
`oligo_design`	`cds_oligo_reward`	`oligo`	General oligonucleotide design
`orf_amino_acid`	`orf_amino_acid_reward`	`answer`	Amino acids of the translated ORF
`molecular_weight`	`molecular_weight_reward`	`answer`	DNA/protein molecular weight
`protein_hydrophobicity`	`protein_hydrophobicity_reward`	`answer`	Protein hydrophobicity metrics
`enzyme_kinetics`	`enzyme_kinetics_reward`	`answer`	Kinetic parameters such as Michaelis–Menten
`msa_scoring`	`msa_scoring_reward`	`answer`	Multiple sequence alignment scoring
`pairwise_distances`	`pairwise_distances_reward`	`answer`	Pairwise distance matrix
`primer_interactions`	`primer_interactions_reward`	`answer`	Dimer/hairpin scoring
`restriction_counts`	`restriction_counts_reward`	`answer`	Restriction site counts
`restriction_digest`	`restriction_digest_reward`	`answer`	List of restriction fragments
`restriction_cloning`	`restriction_cloning_reward`	`answer`	Restriction cloning design
`sequence_complexity`	`sequence_complexity_reward`	`answer`	Low-complexity/entropy metrics
`tm_calculations`	`tm_calculations_reward`	`answer`	Primer/duplex melting temperature
`codon_optimization`	`codon_optimization_reward`	`optimized_dna`	Codon-optimized DNA

Each validate_*.py is a standalone script that takes the answer extracted from the LLM output together with the question-specific validator_params and returns 0/1.

7. Cloning Module (`src/labbench2/cloning/`)

A standalone, reusable toolkit that parses, executes, and scores a molecular cloning DSL.

7.1 DSL Syntax

<protocol>
gibson(
    pcr(backbone.gb, fwd_primer.txt, rev_primer.txt),
    pcr(insert.gb, "ATGCATGC", "GCATGCAT")
)
</protocol>

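Operation	Syntax	Meaning
PCR	`pcr(template, fwd_primer, rev_primer)`	Calls the Go binary to simulate amplification
Gibson Assembly	`gibson(seq1, seq2, ...)`	Seamless assembly through overlapping regions
Golden Gate	`goldengate(seq1, seq2, ..., enzymes="BsaI,BsmBI")`	Assembly with Type IIS enzymes
Restriction assembly	`restriction_assemble(frag1, frag2)`	Ligation by sticky ends
Digest	`enzyme_cut(sequence, "EnzymeName")`	Single-enzyme digest

Inputs can be file references (.gb / .gbk / .fa / .fasta / .txt), string literals ("ATGC"), or nested DSL expressions.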
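To show how a nested expression of this shape can be read, here is a small recursive-descent parser for the syntax described above. It is a simplified sketch, not the Tokenizer/Parser in cloning_protocol.py: the function names and the tuple-based tree are hypothetical, and details such as escape sequences or file names starting with a digit are not handled.

```python
import re

# One token: a quoted string, a bare name (operation or file reference), or punctuation.
TOKEN = re.compile(r'\s*(?:(?P<str>"[^"]*")|(?P<name>[A-Za-z_][\w.\-]*)|(?P<punct>[(),=]))')


def tokenize(text: str) -> list[tuple[str, str]]:
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_expr(tokens, i):
    """Parse one expression starting at tokens[i]; return (node, next_index)."""
    kind, value = tokens[i]
    if kind == "str":
        return ("literal", value.strip('"')), i + 1
    if kind != "name":
        raise ValueError(f"unexpected token {value!r}")
    if i + 1 < len(tokens) and tokens[i + 1] == ("punct", "("):
        args, kwargs, i = [], {}, i + 2
        while tokens[i] != ("punct", ")"):
            if tokens[i][0] == "name" and tokens[i + 1] == ("punct", "="):
                key = tokens[i][1]               # keyword argument, e.g. enzymes="BsaI"
                kwargs[key], i = parse_expr(tokens, i + 2)
            else:
                arg, i = parse_expr(tokens, i)   # positional argument, possibly nested
                args.append(arg)
            if tokens[i] == ("punct", ","):
                i += 1
        return ("call", value, args, kwargs), i + 1
    return ("file", value), i + 1                # bare name without parentheses


def parse_protocol(text: str):
    block = re.search(r"<protocol>(.*?)</protocol>", text, re.DOTALL)
    if block is None:
        raise ValueError("missing <protocol> block")
    tokens = tokenize(block.group(1))
    try:
        tree, end = parse_expr(tokens, 0)
    except IndexError:
        raise ValueError("unexpected end of protocol") from None
    if end != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


print(parse_protocol('<protocol>gibson(pcr(backbone.gb, "ATGC", "GCAT"), insert.fa)</protocol>'))
```

A tree like this separates syntax checking (does it parse?) from execution (does it produce a sequence?), which is the same split the format and execution rewards make.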

7.2 Python API and Modules

src/labbench2/cloning/:
- cloning_protocol.py: CloningProtocol, Tokenizer, Parser, PROTOCOL_TAG_OPEN/CLOSE
- gibson.py: Gibson assembly
- goldengate.py: Golden Gate assembly
- restriction_enzyme.py: digestion logic
- enzyme_cut.py: the public enzyme_cut function
- simulate_pcr.py: calls the Go binary in _go/ to simulate PCR
- sequence_alignment.py: sequence_similarity, compare_sequences
- sequence_models.py: BioSequence (a unified DNA/protein data structure)
- rewards.py: 4 reward functions (see below)
- utils.py: extract_between_tags and other helpers
- protocol_examples/, sys_prompt.txt: examples that can be dropped straight into an LLM system prompt
- _go/: Go 1.21+ subproject, compiled automatically on first call

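7.3 Four Reward Functions (`rewards.py`)

Function	Condition for 1.0	Condition for 0.0
`cloning_format_reward(text)`	The DSL parses (and, optionally, every entry in `required_files` appears)	`<protocol>` missing / syntax error / file reference missing
`cloning_execution_reward(text, base_dir)`	Parsing succeeds → execution yields ≥ 1 sequence	Runtime exception or empty output
`cloning_similarity_reward(text, reference, base_dir, threshold=0.95)`	Alignment similarity between the first output sequence and the reference is ≥ threshold	Execution fails or similarity is too low
`cloning_digest_reward(text, reference, base_dir, enzymes, threshold)`	After digesting both output and reference with `enzymes`, the fragment counts agree and, sorted by length, each pair has similarity ≥ threshold	Any condition is not met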
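The digest comparison in the last row can be illustrated with fragment lists that have already been produced by a digest. This sketch makes two substitutions that should be read as assumptions: `difflib.SequenceMatcher` stands in for the repository's alignment-based `sequence_similarity`, and `digests_match` is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; the repository uses sequence alignment."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def digests_match(produced: list[str], reference: list[str], threshold: float = 0.95) -> bool:
    """Same number of fragments, and length-sorted pairs are all similar enough."""
    if len(produced) != len(reference):
        return False
    pairs = zip(sorted(produced, key=len), sorted(reference, key=len))
    return all(similarity(p, r) >= threshold for p, r in pairs)


print(digests_match(["AAAATTTT", "CC"], ["CC", "AAAATTTT"]))
```

Sorting by length before pairing makes the check independent of the order in which fragments are reported.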

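The combined entry point cloning_reward(answer, base_dir, reference_path, threshold, validator_params) returns (score, reason):
1. Check the format first; on failure, return 0.0 immediately.
2. Execute the DSL.
3. If a reference is provided, run the similarity check.
4. If validator_params contains enzyme_1, enzyme_2, ..., also run the digest-consistency check; edit_distance_threshold can override the default threshold.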
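The staged structure of the combined reward can be sketched as a short-circuiting chain of checks. This is a generic illustration under assumptions: the report only states that a format failure returns 0.0 immediately, so stopping at the first failure for the later stages is assumed here, and `staged_reward` with its lambda stages is hypothetical rather than the function in rewards.py.

```python
from typing import Callable

Check = Callable[[], bool]


def staged_reward(stages: list[tuple[str, Check]]) -> tuple[float, str]:
    """Run checks in order; return 0.0 with the failing stage name, else 1.0."""
    for name, check in stages:
        if not check():
            return 0.0, f"failed at stage: {name}"
    return 1.0, "all stages passed"


# Optional stages (similarity, digest) are simply left out when not applicable.
has_reference, has_enzymes = True, False
stages: list[tuple[str, Check]] = [
    ("format", lambda: True),
    ("execution", lambda: True),
]
if has_reference:
    stages.append(("similarity", lambda: True))
if has_enzymes:
    stages.append(("digest", lambda: False))

print(staged_reward(stages))
```

Returning the failing stage name alongside the score gives the `reason` half of the `(score, reason)` pair something useful to carry.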

The top-level package file labbench2/__init__.py (under src/) also exports aliases that keep the old names (format_reward, execution_reward, accuracy_reward, similarity_reward).

8. CLI Usage

8.1 Quick Start

export HF_TOKEN=your-huggingface-token
export ANTHROPIC_API_KEY=your-key
uv run python -m evals.run_evals \
    --agent anthropic:claude-opus-4-5 \
    --tag seqqa2 \
    --limit 5

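8.2 Common Arguments (`evals/run_evals.py:213`)

Argument	Meaning
`--agent`	Agent spec (three formats)
`--tag`	One of the 14 tags; omit to run everything
`--mode`	`file` / `inject` / `retrieve`
`--limit N`	Run only the first N questions
`--parallel N`	Number of parallel workers, default 30
`--ids` / `--ids-file`	Re-run specific questions
`--report-path`	Custom report path
`--retry-from FILE`	Extract `failures` from an earlier report, re-run only the failed questions, and write to `*_retry.json`

Retry on failure: Tenacity, 5 attempts, exponential backoff with jitter (initial=1, max=60).
At the end of a run it prints "Completed-only accuracy" and "Overall accuracy" (which counts failures), plus average latency and token usage.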

8.3 Examples

# Anthropic + all tools + high reasoning effort
uv run python -m evals.run_evals --agent anthropic:claude-opus-4-5@tools,high --tag seqqa2

# OpenAI Responses API + tools
uv run python -m evals.run_evals --agent openai-responses:gpt-5.2@tools --tag seqqa2

# Google Vertex Gemini + search
uv run python -m evals.run_evals --agent google-vertex:gemini-3-pro-preview@search --tag seqqa2

# Native SDK + figqa2 (paper figures)
uv run python -m evals.run_evals --agent native:anthropic:claude-opus-4-5 --tag figqa2

# Custom runner
uv run python -m evals.run_evals \
    --agent external:./external_runners/edison_analysis_runner.py:EdisonAnalysisRunner \
    --tag seqqa2

8.4 Re-running the Paper Matrix in One Command

./run_evals.sh native:anthropic:claude-opus-4-5
./run_evals.sh native:openai-responses:gpt-5.2 --limit 1
./run_evals.sh 'external:./my_runner.py:MyAgent' -j 4 -w 10

The script enumerates the tag × mode combinations used in the paper.

8.5 Report Summaries

uv run python evals/summarize_report.py assets/reports_paper/seqqa2/file/claude-opus-4-5.json
# Merge several reports (later ones override earlier ones), typically to fold in a retry:
uv run python evals/summarize_report.py original.json original_retry.json
# Show deduplicated error messages for failed tasks:
uv run python evals/summarize_report.py <report> --show-failed-outputs
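The "later ones override earlier ones" merge rule can be pictured as a dictionary update keyed by question ID. The report layout used here (a list of per-case dictionaries with an `id` key) is an assumption made for illustration; the actual JSON schema written by the harness is not described in this report, and `merge_results` is a hypothetical name.

```python
def merge_results(*reports: list[dict]) -> list[dict]:
    """Merge case lists by 'id'; a case from a later report replaces an earlier one."""
    merged: dict[str, dict] = {}
    for report in reports:
        for case in report:
            merged[case["id"]] = case
    return list(merged.values())


original = [{"id": "q1", "score": 1.0}, {"id": "q2", "error": "timeout"}]
retry = [{"id": "q2", "score": 0.0}]
print(merge_results(original, retry))
```

Because a retry report only contains the previously failed questions, merging it over the original replaces just those entries and leaves completed ones untouched.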

9. Directory Structure at a Glance

labbench2/
├── README.md                       # top-level readme
├── run_evals.sh                    # script that re-runs the paper matrix
├── pyproject.toml                  # uv/pip configuration
├── assets/
│   ├── reports/                    # user run outputs (layered by tag/mode/model)
│   ├── reports_paper/              # official reports from the paper
│   └── coverage.svg, overview.png
├── evals/
│   ├── run_evals.py                # CLI entry point
│   ├── loader.py                   # HF → Case conversion + attachment download
│   ├── evaluators.py               # HybridEvaluator / RewardFunctionEvaluator / LLMJudge
│   ├── prompts.py                  # 3 LLM judge prompt templates
│   ├── models.py                   # LabBenchQuestion / Mode / QuestionMode
│   ├── report.py                   # report writing + UsageStats
│   ├── llm_configs.py              # per-provider defaults / built-in tool lists
│   ├── utils.py                    # GCS, file type checks, Vertex auth
│   └── runners/                    # agent runners (base protocol + four native runners)
│       ├── base.py                 # AgentRunner protocol + create_agent_runner_task
│       ├── anthropic.py
│       ├── openai.py               # Responses API
│       ├── openai_completions.py   # Chat Completions API
│       └── google.py               # Google Vertex
├── external_runners/
│   └── edison_analysis_runner.py   # official example of a custom runner
├── src/labbench2/
│   ├── __init__.py
│   ├── cloning/                    # cloning DSL + 4 rewards + Go PCR simulation
│   │   ├── cloning_protocol.py
│   │   ├── gibson.py
│   │   ├── goldengate.py
│   │   ├── restriction_enzyme.py
│   │   ├── enzyme_cut.py
│   │   ├── simulate_pcr.py
│   │   ├── sequence_alignment.py
│   │   ├── sequence_models.py
│   │   ├── rewards.py
│   │   ├── utils.py
│   │   ├── sys_prompt.txt
│   │   ├── protocol_examples/
│   │   └── _go/                    # Go subproject compiled at runtime
│   └── seqqa2/
│       ├── registry.py             # registry of the 23 validators
│       ├── utils.py
│       └── validate_*.py           # one standalone validation module per question type
└── tests/                          # unit / e2e / cloning / seqqa2 / fixtures

10. Environment Variables / Credentials

Variable	Purpose
`HF_TOKEN`	Download the HuggingFace dataset (private/gated access)
`ANTHROPIC_API_KEY`	`anthropic:` / `native:anthropic:` + the default LLM judge
`OPENAI_API_KEY`	`openai-*` runners
`GOOGLE_CLOUD_PROJECT` + `GOOGLE_CLOUD_LOCATION` + `gcloud auth application-default login`	`google-vertex:` runner

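11. Recent Updates (Changelog Excerpt)

2026-03-23: Fixed the native Anthropic runner so that it handles pause_turn correctly (continuing the conversation when server-side tools hit their iteration limit); @high/@medium/@low now enable extended thinking (adaptive thinking for 4.6+, budget-based for 4.5). The official reports were updated accordingly.
2026-03-13: Fixed data issues in sourcequality by replacing the whole set with 150 new tasks, with matching code changes for compatibility. The official reports were updated.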

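12. Summary

Key engineering observations about LABBench2:

Evaluation splits into programmatic and LLM-judged paths. Only cloning / seqqa2 use deterministic rewards; the other 12 tags rely on Claude Sonnet 4.5 as the judge (with 3 prompt templates chosen by task type). The benchmark therefore has both "hard" and "soft" scores.
seqqa2 combines 23 sub-skills, each with a dedicated validator and question-specific validator_params, which makes it suitable for fine-grained training and ablation.
cloning is a rare end-to-end evaluation that combines an LLM, a programming language, a Go simulator, and real restriction-digest geometry; a DSL lets the model output an executable wet-lab protocol directly.
The three file-handling modes let the same question be adapted to three agent paradigms (native file upload, text injection, and external retrieval), which makes it easier to compare architectures side by side.
The custom runner protocol is clean (five methods, four of them async); it supports thin wrappers around a REST API as well as heavyweight agent frameworks.
Results are traceable: assets/reports_paper/ provides the paper baselines, and --retry-from plus summarize_report.py support incremental fixes and report merging.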