2604.17175v1 Apr 19, 2026 cs.LG

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

Allen Nie

Citations: 129

h-index: 5

Ching-An Cheng

Citations: 185

h-index: 6

Meghana Kshirsagar

Citations: 431

h-index: 8

Fanglei Xue

Citations: 551

h-index: 7

R. Dodhia

Citations: 888

h-index: 15

J. Ferres

Citations: 1,612

h-index: 22

Kevin K. Yang

Citations: 33

h-index: 2

F. Dimaio

Citations: 1,600

h-index: 10

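This paper introduces RosettaSearch, an inference-time multi-objective optimization method for protein sequence optimization. It uses a large language model (LLM) as a generative optimizer inside a search algorithm with controlled exploration and exploitation, with rewards computed from the structure prediction model RosettaFold3. In a large-scale evaluation, RosettaSearch is applied to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art protein sequence design model) and recovers high-quality designs that LigandMPNN's single-pass decoding cannot produce. The resulting designs improve structural fidelity metrics by 18% to 68%, which raises the design success rate by a factor of 2.5. The success-rate gains hold when the designed sequences are evaluated with an independent structure prediction model (Chai-1), and performance is consistent across two LLM families (o4-mini and Gemini-3). RosettaSearch also improves sequence fidelity for ProteinMPNN-designed sequences on *de novo* backbones, which suggests the approach extends beyond native protein structures to computationally generated backbones. A multi-modal extension uses vision-language models, with images of predicted protein structures as feedback that supplies structural context to guide sequence generation. The sequence trajectories generated in this work can serve as training or post-training data for sequence design models and will be released with the code and datasets upon publication.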

Original Abstract

We introduce RosettaSearch, an inference-time multi-objective optimization approach for protein sequence optimization. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging from 18% to 68%, translating to a 2.5× improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves sequence fidelity for ProteinMPNN-designed sequences on *de novo* backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context and guide protein sequence generation. The sequence trajectories generated by our approach can be used as training data for sequence design models or for post-training, and will be released along with the code and datasets upon publication.
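
The abstract does not spell out the search procedure, so the following is only a minimal sketch of the general idea it describes: a proposer generates candidate sequences, each candidate is scored on several objectives, and an exploration rate controls whether the next parent comes from the current Pareto front (exploitation) or from the whole pool (exploration). Everything here is an assumption made for illustration: `mutate` is a random point mutation standing in for the LLM proposer, `toy_score` is an arbitrary two-objective function standing in for rewards derived from a structure prediction model, and the names `search`, `pareto_front`, `dominates`, and the `explore` parameter are hypothetical.

```python
import random
from typing import Callable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Scores = Tuple[float, ...]
Entry = Tuple[str, Scores]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(pool: List[Entry]) -> List[Entry]:
    """Keep only entries that no other entry in the pool dominates."""
    return [(s, f) for s, f in pool if not any(dominates(g, f) for _, g in pool)]


def mutate(seq: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the LLM proposer: one random point mutation."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]


def toy_score(seq: str) -> Scores:
    """Toy, competing objectives (NOT structure-based): fraction of 'A'
    residues and fraction of 'L' residues."""
    n = len(seq)
    return (seq.count("A") / n, seq.count("L") / n)


def search(
    start: str,
    score: Callable[[str], Scores],
    propose: Callable[[str, random.Random], str] = mutate,
    steps: int = 200,
    explore: float = 0.2,
    seed: int = 0,
) -> List[Entry]:
    """Multi-objective search sketch; returns the final Pareto front."""
    rng = random.Random(seed)
    pool: List[Entry] = [(start, score(start))]
    seen = {start}
    for _ in range(steps):
        front = pareto_front(pool)
        # Explore: sample any scored sequence; exploit: sample from the front.
        candidates = pool if rng.random() < explore else front
        parent, _ = rng.choice(candidates)
        child = propose(parent, rng)
        if child in seen:
            continue  # skip sequences that were already scored
        seen.add(child)
        pool.append((child, score(child)))
    return pareto_front(pool)


if __name__ == "__main__":
    for seq, scores in search("GGGGGGGG", toy_score):
        print(seq, scores)
```

In a real setting the `propose` callable would wrap an LLM call that sees earlier candidates and their scores, and `score` would wrap a structure predictor; both are left as plain callables here so the loop stays runnable without external services.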
