Abstract

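Scientific Workflow Agent is an agentic AI architecture proposed by AGH University of Krakow and the Sano Centre. It automates the translation of natural-language research questions into executable scientific workflows, covering the full path from research question to execution on a Kubernetes cluster. Its core innovation is a three-layer decomposition (Semantic Layer, Deterministic Layer, Knowledge Layer) that confines LLM non-determinism to the intent-extraction stage, so identical intents always produce identical DAGs. It reduces 30-50 minutes of expert work to 106 seconds.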

Scientific Workflow Agent

Overview

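Scientific Workflow Agent is an agentic AI architecture proposed by a team from AGH University of Krakow and the Sano Centre for Computational Medicine. It aims to automate the entire process of translating natural-language research questions into executable scientific workflows. The system was published as arXiv:2604.21910 (2026) and is evaluated on the 1000 Genomes population-genetics workflow with the HyperFlow WMS. It automates the path from research question to Kubernetes cluster execution end to end, and is a notable advance in the Science Automation field.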

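Traditional scientific workflow management systems (such as Pegasus, Nextflow, and Snakemake) already handle execution-level concerns well: task scheduling, fault tolerance, and resource management. However, converting a researcher's semantic intent (for example, "compare mutation patterns between European and African populations") into a concrete workflow specification (a DAG) is still done by hand. This requires scientists to have both domain knowledge (such as population codes and genomic region naming conventions) and infrastructure expertise (such as available vCPUs and task sizing), which creates the so-called semantic gap. The gap causes three problems: a high barrier to entry (scientists without infrastructure knowledge cannot use workflow systems on their own), error-proneness (incorrect vocabulary mappings propagate into the final results), and impaired reproducibility (the translation logic is implicit and must be rebuilt for every specification).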

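The core contribution of Scientific Workflow Agent is to close this gap through a three-layer decomposition while confining LLM non-determinism to the intent-extraction stage. Identical intents therefore always produce identical workflows, which preserves the reproducibility that scientific research requires.^[raw/papers/scientific-workflow-agent.md]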

Core Contributions

The paper makes four core contributions, each aimed at a key pain point in scientific workflow automation:

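Hybrid Agentic Architecture: A three-layer decomposition (Semantic Layer, Deterministic Layer, Knowledge Layer) confines LLM non-determinism to the intent-extraction stage. The same ResearchIntent always produces the same DAG, which is the key safeguard for scientific reproducibility. The architecture limits the LLM's role to understanding what the user wants to do, while how to execute it is handled by deterministic code.
Skills as Domain-Expert-Authored Knowledge: Skills are text files written in markdown. They require no ML expertise, can be version-controlled, and can be audited directly by domain experts. Compared with traditional few-shot prompting (injecting examples into the prompt ad hoc), Skills are persistent, auditable, and shareable carriers of knowledge. They cover vocabulary mappings (such as "European" → EUR), parameter constraints, and data transfer optimization strategies.
End-to-End Agentic Pipeline: A complete path from natural-language query to Kubernetes execution, implemented by four specialized Agents (Conductor, Workflow Composer, Deployment Service, Execution Sentinel) working together, with human-in-the-loop validation gates at key points. The scientist gets one review opportunity before infrastructure provisioning and another before execution, so automation does not come at the cost of control.
Skill-Driven Execution-Time Optimization: Skills supply not only the vocabulary needed for correct translation but also optimization strategies applied at execution time. For example, the Data sources Skill specifies that Tabix region extraction needs to transfer only 50 MB instead of the full 943 MB chromosome 6 VCF. This selective data extraction ultimately cuts data transfer by 92%, with no manual intervention.^[raw/papers/scientific-workflow-agent.md]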
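
To make the Skills idea concrete, the following is a minimal sketch of how a markdown vocabulary table could be turned into a lookup. The table layout, the `POPULATIONS_SKILL` constant, and the `parse_vocabulary` and `resolve` helpers are assumptions made for illustration; the paper's actual Skill files may be structured differently. Only the three mappings shown are taken from the text above.

```python
from typing import Dict, Optional

# Hypothetical Skill content; the real Skill files may use another layout.
POPULATIONS_SKILL = """\
| Term        | Code |
|-------------|------|
| European    | EUR  |
| British     | GBR  |
| Han Chinese | CHB  |
"""


def parse_vocabulary(markdown: str) -> Dict[str, str]:
    """Read a two-column markdown table into a lowercase term -> code map."""
    mapping: Dict[str, str] = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip non-table lines, the header row, and the |---|---| separator.
        if len(cells) != 2 or cells[0] == "Term" or set(cells[0]) <= {"-", ":"}:
            continue
        mapping[cells[0].lower()] = cells[1]
    return mapping


def resolve(term: str, vocabulary: Dict[str, str]) -> Optional[str]:
    """Case-insensitive lookup; returns None for unknown terms."""
    return vocabulary.get(term.strip().lower())


vocabulary = parse_vocabulary(POPULATIONS_SKILL)
print(resolve("Han Chinese", vocabulary))
```

Because the mapping lives in a plain text file, a domain expert can review or extend it in a normal diff without touching any prompt or model code.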

Architecture / Approach

Three-Layer Decomposition

The architecture consists of three layers, each with clear responsibility boundaries and interaction contracts:

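Semantic Layer: The LLM-driven Workflow Composer interprets the natural-language query into a structured ResearchIntent, a parameter set containing analysis_type (single population, population comparison, multi-population, region analysis), populations (a list of PopulationCode), chromosomes, regions (GenomicRegion), and focus (all variants, deleterious, common, rare). This is the only stage that involves the LLM, so non-determinism is strictly confined here. The LLM output must conform to the ResearchIntent schema, and validation confirms structural correctness before the next stage begins.
Deterministic Layer: Validated generators convert the ResearchIntent into an executable DAG. After infrastructure measurement completes, the Workflow Composer generates the final workflow.json (HyperFlow format), which includes resolved parallelism levels (jobs per chromosome, J) and resource allocations. The Deployment Service creates the Kubernetes namespace, downloads data to a persistent volume, and measures actual data sizes and available vCPUs. The deterministic layer contains no LLM calls, so the same intent input always produces a byte-identical workflow.json.
Knowledge Layer: Skills, which are markdown documents written by domain experts, encode vocabulary mappings, parameter constraints, and optimization strategies. In the 1000 Genomes implementation, five Skills handle: Populations (26 population-code mappings with synonym resolution, such as "British" → GBR and "Han Chinese" → CHB), Genomic regions (gene names to GRCh37 coordinates, such as HLA → chr6:28477797–33448354), Research contexts (bridging research topics to analysis types and regions), Data sources (data locations across HTTPS, S3, and GCS, plus the trade-off between full download and Tabix extraction), and Workflow Composer (tool definitions and LLM guidance).^[raw/papers/scientific-workflow-agent.md]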
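
The contract between the Semantic and Deterministic layers can be sketched as a validated intent object feeding a pure generator function. This is an illustration under stated assumptions, not the paper's implementation: the field names follow the ResearchIntent description above, but the string encodings of the enum values, the `generate_workflow` function, and the task layout are hypothetical and do not reproduce the HyperFlow workflow.json format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple

# Assumed string encodings of the values listed in the text.
ANALYSIS_TYPES = {"single_population", "population_comparison",
                  "multi_population", "region_analysis"}
FOCUS_VALUES = {"all_variants", "deleterious", "common", "rare"}


@dataclass(frozen=True)
class ResearchIntent:
    analysis_type: str
    populations: Tuple[str, ...]
    chromosomes: Tuple[str, ...]
    regions: Tuple[str, ...]
    focus: str = "all_variants"

    def __post_init__(self) -> None:
        # Schema validation: reject malformed LLM output before generation.
        if self.analysis_type not in ANALYSIS_TYPES:
            raise ValueError(f"unknown analysis_type: {self.analysis_type}")
        if self.focus not in FOCUS_VALUES:
            raise ValueError(f"unknown focus: {self.focus}")
        if not self.populations:
            raise ValueError("at least one population is required")


def generate_workflow(intent: ResearchIntent) -> str:
    """Pure function: no LLM call, no clock, no randomness (hypothetical layout)."""
    tasks = [
        {"name": f"analyze_chr{c}_{p}", "chromosome": c, "population": p}
        for c in sorted(intent.chromosomes)
        for p in sorted(intent.populations)
    ]
    document = {"intent": asdict(intent), "tasks": tasks}
    # Sorted keys and fixed separators keep the serialization byte-stable.
    return json.dumps(document, sort_keys=True, separators=(",", ":"))


intent = ResearchIntent("population_comparison", ("EUR", "AFR"), ("6",),
                        ("chr6:28477797-33448354",))
print(generate_workflow(intent) == generate_workflow(intent))
```

Sorting inputs and serializing with fixed key order is one simple way to obtain the byte-identical output the text describes; any source of nondeterminism (timestamps, random IDs, unordered sets) would have to be kept out of the generator.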

Four Agents

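Conductor: The user entry point and orchestration coordinator. It receives natural-language queries, classifies the domain, routes to the appropriate Workflow Composer, manages multi-turn dialogue (triggering a clarification loop when the query is ambiguous), and enforces a pause at the human-in-the-loop gates (before provisioning and before execution). The scientist interacts only with the Conductor and never needs to touch the underlying infrastructure directly.
Workflow Composer: The heart of intent interpretation and DAG generation. During planning it consults the Skills and the LLM to extract a structured intent and returns a human-readable workflow plan. Only after the infrastructure is provisioned and it has received the Deployment Service's actual measurements (data sizes, available vCPUs) does it generate the final workflow.json. This Deferred Generation is deliberate: task parallelism depends on actual data volume, not on estimates.
Deployment Service: Following the approved plan, it provisions the Kubernetes environment (creates the namespace and the persistent volume), downloads the data (using the extraction pattern specified in the Skill), and measures actual data sizes and vCPUs. These measurements are fed back to the Composer and used to generate a calibrated DAG.
Execution Sentinel: Monitors running workflows asynchronously. It tracks pod status, collects logs, detects anomalies (stalled tasks, repeated failures), and reports progress and completion summaries to the Conductor.^[raw/papers/scientific-workflow-agent.md]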
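
The two anomaly classes named for the Execution Sentinel (stalled tasks and repeated failures) could be checked along the following lines. This is a rough sketch: the `TaskStatus` record, the `detect_anomalies` function, and both thresholds are hypothetical, and a real Sentinel would populate the records from Kubernetes pod status and logs rather than from hand-built objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskStatus:
    name: str
    seconds_since_progress: float  # time since the task last reported progress
    failure_count: int             # number of failed attempts so far


def detect_anomalies(tasks: List[TaskStatus],
                     stall_after_s: float = 600.0,
                     max_failures: int = 3) -> List[str]:
    """Return one human-readable finding per anomalous task (assumed thresholds)."""
    findings: List[str] = []
    for task in tasks:
        if task.failure_count >= max_failures:
            findings.append(f"{task.name}: repeated failures ({task.failure_count})")
        elif task.seconds_since_progress > stall_after_s:
            findings.append(f"{task.name}: stalled for {task.seconds_since_progress:.0f}s")
    return findings


statuses = [
    TaskStatus("extract_chr6", 12.0, 0),
    TaskStatus("frequency_chr6", 900.0, 1),
    TaskStatus("merge_chr6", 5.0, 3),
]
for finding in detect_anomalies(statuses):
    print(finding)
```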

Six-Stage Pipeline

The full flow runs through six stages in order, each with well-defined inputs, outputs, and decision points:

Routing: The Conductor receives the natural-language research query and classifies its domain to select the appropriate Workflow Composer and associated Skills.
Workflow Planning: The Workflow Composer uses the Skills and the LLM to interpret the query and extract a structured intent (populations, chromosomes, regions). If the query is ambiguous, it triggers a clarification loop through the Conductor.
User Validation: The Conductor presents the workflow plan for the scientist to review. The scientist can approve, revise, or reject it. A revision triggers a re-plan loop.
Infrastructure Provisioning: After approval, the Deployment Service creates the Kubernetes namespace, downloads data to persistent storage, and measures actual data sizes and vCPUs.
Deferred Workflow Generation: The Deployment Service returns the measurements to the Composer, which calibrates parallelism against the actual data volume and generates the final workflow.json.
Execution Approval: The Conductor presents a summary (task count, estimated peak storage, projected runtime) for the scientist's final approval, then submits the workflow to the HyperFlow engine for execution.^[raw/papers/scientific-workflow-agent.md]
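
The six stages and their two human gates can be sketched as a small control loop. The stage names come from the list above; everything else (`run_pipeline`, the callback signatures, the revision budget) is a hypothetical simplification that records which stages were visited instead of doing real work.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Stage(Enum):
    ROUTING = 1
    WORKFLOW_PLANNING = 2
    USER_VALIDATION = 3
    INFRASTRUCTURE_PROVISIONING = 4
    DEFERRED_WORKFLOW_GENERATION = 5
    EXECUTION_APPROVAL = 6


def run_pipeline(review_plan: Callable[[int], str],
                 approve_execution: Callable[[], bool],
                 max_revisions: int = 3) -> Tuple[List[Stage], bool]:
    """Return (stages visited, whether the workflow was submitted).

    review_plan(attempt) stands in for the scientist and returns
    "approve", "revise", or "reject"; any other value counts as "revise".
    """
    trace = [Stage.ROUTING]
    for attempt in range(max_revisions + 1):
        trace.append(Stage.WORKFLOW_PLANNING)
        trace.append(Stage.USER_VALIDATION)  # first human gate
        decision = review_plan(attempt)
        if decision == "approve":
            break
        if decision == "reject":
            return trace, False
    else:
        return trace, False  # revision budget exhausted without approval
    trace.append(Stage.INFRASTRUCTURE_PROVISIONING)
    trace.append(Stage.DEFERRED_WORKFLOW_GENERATION)
    trace.append(Stage.EXECUTION_APPROVAL)  # second human gate
    return trace, approve_execution()


trace, submitted = run_pipeline(
    review_plan=lambda attempt: "revise" if attempt == 0 else "approve",
    approve_execution=lambda: True,
)
print([stage.name for stage in trace], submitted)
```

In this sketch nothing is provisioned until the plan is approved, and nothing is submitted until the second gate returns true, which mirrors the ordering described in the stage list.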

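This architecture shares the design ideas of multi-agent collaboration and human-in-the-loop with the contemporaneous openhands (a software-engineering agent platform), but Scientific Workflow Agent is deeply customized for scientific workflows, particularly in domain knowledge encoding (the Skills mechanism) and its reproducibility guarantee (LLM non-determinism confined to the Semantic Layer). Compared with the Claude Code sub-agent boundary design analyzed in claude-code-analysis, the four Agents of Scientific Workflow Agent have a more explicit separation of responsibilities and clearer data-passing contracts.^[raw/papers/scientific-workflow-agent.md]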

Key Results

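In an ablation study over 150 natural-language queries (split into five difficulty tiers: T1 explicit codes, T2 synonyms, T3 implicit domain inference, T4 underspecified, T5 adversarial), the system shows clear effectiveness: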

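Intent Extraction Accuracy: Claude Opus scores 44% on the no-Skills baseline (S0) and rises to 83.3% with the full Skills configuration (S3), a gain of 39.3pp. GPT-5.4 rises from 39.3% to 80% (+40.7pp). Vocabulary Skills (S1) contribute the most, adding 36pp for Opus, which shows that lookup tables for population codes and genomic coordinates are the decisive factor. T1 and T2 reach 100% accuracy on all models.
Data Transfer Savings: Through Tabix region extraction, the Deferred Generation strategy downloads a measured 1.69 GB instead of 21.6 GB, a 92% overall reduction in data transfer. For small gene regions (HBB with 136 rows, APOE with 113 rows), savings exceed 99.9%, from 1.0-2.1 GB down to 1.1-1.4 MB. These savings come from the extraction-pattern knowledge encoded in Skills, not from LLM improvisation.
End-to-End Latency: LLM overhead stays below 15 seconds (Gemini 2.0 Flash), at a cost below $0.001 per query. The semantic layer adds almost no latency; execution time (82-97% of the total) dominates the end-to-end duration. Q1 (HLA+BRCA1, 166K rows) takes 145 minutes in total, whereas Q2 (BRCA2+BRCA1, small genes) takes only 10 minutes. The difference comes from data volume, not system overhead.
Parallelism Calibration: Deferred Generation adjusts the degree of parallelism (jobs per chromosome, J) dynamically from actual row counts. For HBB, J drops from 66 in the advisory plan to 1 after actual measurement, since creating 66 parallel tasks for 136 rows of data would waste resources. For HLA (166K rows), J drops from 100 to 51, matching the actual data volume.
Intent Interpretation Completeness: The three end-to-end queries (Q1-Q3) are fully correct on all five fields (populations, chromosomes, regions, analysis type, focus), with 0 failed tasks.^[raw/papers/scientific-workflow-agent.md]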
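
The parallelism calibration can be illustrated with a simple sizing rule. The text does not give the formula the system uses, so the rule below (one job per fixed number of rows, capped by available vCPUs) is an assumption, as are the `calibrate_jobs` name and the `rows_per_job` value of 3300, which was picked only so that the sketch lands near the reported J values.

```python
import math


def calibrate_jobs(row_count: int, rows_per_job: int, available_vcpus: int) -> int:
    """Hypothetical rule: one job per rows_per_job rows, between 1 and available_vcpus."""
    if row_count < 0 or rows_per_job <= 0 or available_vcpus <= 0:
        raise ValueError("row_count must be >= 0; other arguments must be > 0")
    wanted = math.ceil(row_count / rows_per_job)
    return max(1, min(wanted, available_vcpus))


# Row counts from the text; rows_per_job and the vCPU cap are illustrative only.
print(calibrate_jobs(136, rows_per_job=3300, available_vcpus=100))      # HBB
print(calibrate_jobs(166_000, rows_per_job=3300, available_vcpus=100))  # HLA
```

The point of the sketch is the dependency, not the constants: J is a function of measured rows and measured capacity, so it can only be fixed after provisioning, which is why generation is deferred.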

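Compared with manual specification, the system reduces 30-50 minutes of expert work (looking up GRCh37 coordinates, writing Tabix extraction commands, computing parallelism, generating the DAG, and deploying to Kubernetes via Helm) to 106 seconds (11 seconds of LLM time plus 95 seconds of infrastructure time). More importantly, the manual path usually requires two kinds of expertise, domain knowledge (gene coordinates, population codes) and infrastructure knowledge (Tabix, HyperFlow, Kubernetes), which often means two experts must collaborate. Scientific Workflow Agent encodes both kinds of knowledge as Skills, so a single scientist can complete the whole process.^[raw/papers/scientific-workflow-agent.md]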
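
The headline figures in this section can be re-derived from the numbers quoted in the text. The helper names `reduction` and `speedup` are introduced here for illustration only; the inputs are the values reported above.

```python
def reduction(before: float, after: float) -> float:
    """Fraction saved when a quantity shrinks from `before` to `after`."""
    return 1.0 - after / before


def speedup(manual_minutes: float, automated_seconds: float) -> float:
    """How many times faster the automated path is than the manual one."""
    return manual_minutes * 60.0 / automated_seconds


automated_s = 11 + 95  # LLM time + infrastructure time, in seconds

print(f"data transfer reduction: {reduction(21.6, 1.69):.1%}")
print(f"speedup: {speedup(30, automated_s):.1f}x to {speedup(50, automated_s):.1f}x")
```

With these inputs the script prints a reduction of 92.2% and a speedup of roughly 17.0x to 28.3x, consistent with the 92% and 106-second figures quoted above.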

Limitations

The paper acknowledges three limitations, which point to directions for future work:

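Limited Domain Scope: The architecture is validated only in a single domain, 1000 Genomes (population genetics). The design itself is general (the Skills plus deterministic generator framework can apply to any scientific domain), but each new domain still needs domain experts to build its own set of Skills and a matching deterministic generator. It cannot transfer zero-shot to a new domain the way a general-purpose LLM can.
Bottleneck in Implicit Domain Reasoning: Performance on tier T3 (inferring gene coordinates from disease names or disease types, such as "breast cancer susceptibility" → BRCA1/BRCA2) remains unreliable even with the full Skills configuration: Opus reaches 86.7%, but GPT-5.4 and GPT-4.1-mini reach only 63-70% and 70%, respectively. On the zero-Skills baseline, T3 accuracy is close to 0%, which shows that disease-to-gene mapping is still a weak spot for LLMs. The paper notes that this makes both richer Skill formats (such as explicit graphs encoding disease-gene associations) and more capable models necessary directions for improvement.
Skill Authoring Entry Barrier: Although Skills are written in markdown and require no ML expertise, building high-quality vocabulary mapping tables (which must cover synonyms, abbreviations, and misspellings) and meaningful optimization strategies (which require understanding extraction tools and transfer trade-offs) still demands some systems understanding from domain experts. This barrier can be lowered but not eliminated entirely.^[raw/papers/scientific-workflow-agent.md]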

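Related Entities

openhands: A general-purpose AI agent platform focused on software-engineering tasks. Like Scientific Workflow Agent, it explores multi-agent collaboration and human-in-the-loop design, and both emphasize reproducibility and safety. The separation between Conductor and Execution Sentinel in Scientific Workflow Agent reflects a division-of-labor idea similar to the AgentDelegateAction sub-agent delegation design in OpenHands.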
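optimat-alloys-agent: An AI agent for materials science that also deals with automating scientific workflows. It can be cross-referenced with the domain-transfer question for Scientific Workflow Agent (each new domain needs its own Skills and deterministic generators).
claude-code-analysis: Analyzes the Claude Code architecture through reverse engineering and reveals how sub-agent responsibility boundaries are designed, which reflects the same kind of systematic thinking as the four-Agent separation of responsibilities in Scientific Workflow Agent. Claude Code's six built-in sub-agent types (Explore, Plan, General-purpose, Claude Code Guide, Verification, Statusline-setup) and the four components of Scientific Workflow Agent (Conductor, Composer, Deployment, Execution Sentinel) both demonstrate the value of specialized division of labor in multi-agent systems.
agent-readmes: Covers evaluation and safety issues for AI agent systems, which relate closely to the reproducibility guarantee (identical intents → identical DAGs) and validation-gate design of Scientific Workflow Agent. Both are concerned with building a safe, controllable execution framework while keeping the agent flexible.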

Practical Deployment Considerations

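Deploying Scientific Workflow Agent in practice requires particular attention to integration with existing multi-agent-systems infrastructure. The deterministic layer ensures reproducibility, but the Semantic Layer's dependence on an LLM means the deployment environment needs stable model access and a suitable rate-limiting strategy. In production, it is advisable to make the Conductor's clarification loop an optional setting: clear, well-formed research queries can bypass multi-turn clarification and go straight to workflow planning, which reduces latency.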
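
One way to read the optional clarification loop is as a completeness check on the extracted intent. The sketch below is an assumption about how such a switch might behave, not a description of the Conductor: `REQUIRED_FIELDS`, `missing_fields`, `next_step`, and the fail-fast "reject" outcome are all hypothetical.

```python
from typing import Dict, List

# The five intent fields named in the text; treating all as required is an assumption.
REQUIRED_FIELDS = ("analysis_type", "populations", "chromosomes", "regions", "focus")


def missing_fields(intent: Dict[str, object]) -> List[str]:
    """List required fields that are absent or empty in the extracted intent."""
    return [name for name in REQUIRED_FIELDS if not intent.get(name)]


def next_step(intent: Dict[str, object], clarification_enabled: bool = True) -> str:
    """Decide whether to plan, ask the user, or fail fast (hypothetical policy)."""
    gaps = missing_fields(intent)
    if not gaps:
        return "plan"  # well-formed query: skip clarification entirely
    if clarification_enabled:
        return "clarify: " + ", ".join(gaps)
    return "reject"  # loop disabled: refuse rather than guess missing fields


complete = {
    "analysis_type": "population_comparison",
    "populations": ["EUR", "AFR"],
    "chromosomes": ["6"],
    "regions": ["HLA"],
    "focus": "all_variants",
}
print(next_step(complete))
print(next_step(dict(complete, regions=[])))
```

Rejecting instead of guessing when the loop is disabled keeps the latency benefit for clear queries without letting incomplete intents reach the deterministic generator.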

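Skill maintenance is another key operational consideration. Although Skills are written in markdown and require no ML expertise, every new domain needs domain experts to build complete vocabulary mappings, parameter constraints, and optimization strategies. Within the agent-memory framework, Skills can be viewed as a structured mechanism for externalizing domain knowledge, similar to long-term memory in agent systems, except that these "memories" are pre-encoded by domain experts rather than learned from interaction. This design keeps the knowledge stable, but it also means the Skill set must be updated regularly to track changes in domain knowledge.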

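Finally, the deferred generation strategy places particular demands on infrastructure monitoring. The Deployment Service must measure actual data sizes and vCPU availability accurately after provisioning, which can be difficult under resource contention in a shared Kubernetes environment. It is advisable to implement anomaly detection in the Execution Sentinel that monitors the deviation between actual and estimated execution time, so that a retry or re-plan can be triggered promptly when resources are under-allocated.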
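
The suggested runtime-deviation check could look like the sketch below. The relative-deviation measure, the `recommend_action` function, and both thresholds are assumptions chosen for illustration; suitable values would depend on the workload and the cluster.

```python
def runtime_deviation(actual_s: float, estimated_s: float) -> float:
    """Relative overrun: 0.0 means on estimate, 1.0 means twice as long."""
    if estimated_s <= 0:
        raise ValueError("estimated_s must be positive")
    return (actual_s - estimated_s) / estimated_s


def recommend_action(actual_s: float, estimated_s: float,
                     retry_above: float = 0.5,
                     replan_above: float = 2.0) -> str:
    """Map the overrun to an action using hypothetical thresholds."""
    deviation = runtime_deviation(actual_s, estimated_s)
    if deviation > replan_above:
        return "replan"
    if deviation > retry_above:
        return "retry"
    return "continue"


print(recommend_action(actual_s=100, estimated_s=100))
print(recommend_action(actual_s=160, estimated_s=100))
print(recommend_action(actual_s=350, estimated_s=100))
```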
