2604.07669v1 Apr 09, 2026 cs.LG

Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization

Zhichun Guo (citations: 28, h-index: 3)
Tuan Vinh (citations: 2, h-index: 1)
Tao Li (citations: 1, h-index: 1)
Monika Raj (citations: 2, h-index: 1)
Carl Yang (citations: 1, h-index: 1)
Kaiyuan Hou (citations: 0, h-index: 0)

Abstract

Lead optimization in drug discovery requires improving therapeutic properties while ensuring that proposed molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enforcing synthesizability, or rely on expensive enumeration over large reaction networks, while direct application of Large Language Models (LLMs) frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent serves as a dynamic reaction environment that invokes specialized chemical analysis tools to identify reactive sites and propose chemically grounded transformations from matched templates. A policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long-term oracle reward across multi-step reaction trajectories. A SMILES-based caching mechanism further reduces end-to-end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutic Data Commons and one structure-based docking task, MolReAct achieves an average Top-10 score of 0.563, outperforming the strongest synthesizable baseline by 10.4% in relative improvement, and attains the best sample efficiency on 10 of 14 tasks. Ablations confirm that both tool-augmented reaction proposals and trajectory-level policy optimization contribute complementary gains. By grounding every step in validated reaction templates, MolReAct produces molecules that are property-improved and each accompanied by an explicit synthetic pathway.
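
The abstract names two mechanisms without giving implementation details: a SMILES-based cache and GRPO-trained action selection. The following is a minimal sketch of both ideas, not the authors' implementation. `CachedOracle`, `group_relative_advantages`, and `toy_score` are hypothetical names introduced here for illustration. The sketch assumes that SMILES strings are already canonical (a real system would canonicalize them with a cheminformatics toolkit before lookup) and that the group-relative advantage is the reward standardized within one sampled group, which is the usual GRPO formulation.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


class CachedOracle:
    """Memoize an expensive scoring function, keyed by SMILES string."""

    def __init__(self, score_fn: Callable[[str], float]) -> None:
        self.score_fn = score_fn
        self.cache: Dict[str, float] = {}
        self.calls = 0  # number of real (uncached) oracle evaluations

    def __call__(self, smiles: str) -> float:
        # Assumption: `smiles` is already canonical, so identical
        # molecules always map to the same cache key.
        if smiles not in self.cache:
            self.calls += 1
            self.cache[smiles] = self.score_fn(smiles)
        return self.cache[smiles]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_score(smiles: str) -> float:
    # Hypothetical stand-in for a property oracle: fraction of 'N' characters.
    return smiles.count("N") / len(smiles)


oracle = CachedOracle(toy_score)
# Final products of four sampled trajectories; one product appears twice.
group = ["CCN", "CCO", "CCN", "NCCN"]
rewards = [oracle(s) for s in group]
advantages = group_relative_advantages(rewards)
```

In this toy run the repeated product `CCN` is scored only once, so the oracle is evaluated three times for four trajectories. Trajectories whose reward is above the group mean receive a positive advantage and the rest a negative one, which is the signal a policy update would use to prefer some constrained actions over others.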

Citations: 0 | Influential citations: 0 | Altmetric: 1.5 | Score: 7.5
