2603.01241v1 Mar 01, 2026 cs.IR

TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents

Junda Wang

Zonghai Tao

Hansi Zeng

Zhichao Yang

Hamed Zamani

Hong Yu

Abstract

Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, verified reasoning trajectories from previously solved cases (e.g., chain-of-thought solutions and their step-level decompositions). At test time, the agent retrieves both relevant skills and experiences from curated libraries and performs lightweight test-time adaptation to align the language model's intermediate reasoning with clinically valid logic. Concretely, we build (i) a skills library from guideline-style documents organized as executable decision rules, (ii) an experience library of exemplar clinical reasoning chains indexed by step-level transitions, and (iii) a step-aware retriever that selects the most useful skill and experience items for the current case. We then adapt the model on the retrieved items to reduce instance-step misalignment and to prevent reasoning from drifting toward unsupported shortcuts. Experiments on medical question-answering benchmarks show consistent gains over strong medical RAG baselines and prompting-only reasoning methods. Our results suggest that explicitly separating and retrieving clinical skills and experience, and then aligning the model at test time, is a practical approach to more reliable medical agents.
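
The retrieval side of this design can be illustrated with a small sketch. It is not the authors' implementation: the `Item` type, the step labels, the token-overlap score, the `0.5` step bonus, and the toy library entries are all assumptions made for illustration. It shows only how a step-aware retriever might pick one skill and one experience for the current reasoning step and assemble them into a context. The test-time adaptation step is left out because the abstract does not specify how it is done.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str  # "skill" or "experience"
    step: str  # reasoning step the item is indexed under (hypothetical labels)
    text: str


def tokens(s):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def score(query, step, item, step_bonus=0.5):
    """Jaccard overlap with the query, plus a bonus when the step matches.

    Both the overlap measure and the bonus value are illustrative stand-ins
    for whatever relevance model a real retriever would use.
    """
    q, t = tokens(query), tokens(item.text)
    union = q | t
    overlap = len(q & t) / len(union) if union else 0.0
    return overlap + (step_bonus if item.step == step else 0.0)


def retrieve(query, step, library, k=1):
    """Return the k highest-scoring items for the current case and step."""
    ranked = sorted(library, key=lambda it: score(query, step, it), reverse=True)
    return ranked[:k]


def build_context(query, step, skills, experiences, k=1):
    """Concatenate retrieved skills and experiences ahead of the case text."""
    picked = retrieve(query, step, skills, k) + retrieve(query, step, experiences, k)
    lines = [f"[{it.kind} | {it.step}] {it.text}" for it in picked]
    lines.append(f"[case | {step}] {query}")
    return "\n".join(lines)


# Toy libraries; the clinical content is placeholder text, not medical guidance.
SKILLS = [
    Item("skill", "diagnosis",
         "Suspect diabetic ketoacidosis when glucose is high with ketones and acidosis."),
    Item("skill", "treatment",
         "For diabetic ketoacidosis start intravenous fluids before insulin."),
]
EXPERIENCES = [
    Item("experience", "diagnosis",
         "Prior case: polyuria, high glucose and ketones led to a diagnosis of ketoacidosis."),
    Item("experience", "treatment",
         "Prior case: ketoacidosis patient received fluids, then insulin, then potassium checks."),
]

QUERY = "Patient with ketoacidosis and high glucose: what is the first treatment?"

if __name__ == "__main__":
    print(build_context(QUERY, "treatment", SKILLS, EXPERIENCES))
```

In this toy setup the diagnosis-indexed skill shares more words with the query, yet the treatment-indexed skill is returned when the current step is `"treatment"`, because the step bonus outweighs the lexical overlap. That is the sense in which the sketch is step-aware; a real system would presumably use a learned retriever rather than token overlap.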
