2603.14405v1 Mar 15, 2026 cs.LG

ES-Merging: Biological MLLM Merging via Embedding Space Signals

S. Hwang (Citations: 92, h-index: 3)

Won-Pil Lee (Citations: 1, h-index: 1)

Dongki Kim (Citations: 228, h-index: 6)

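Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are each specialized to a single modality, which limits their ability to solve inherently cross-modal problems. Model merging is an efficient way to combine different modalities into a unified MLLM, but existing methods rely on input-agnostic parameter-space heuristics and therefore fail to accurately reflect modality specialization. To overcome this limitation, we propose a representation-aware merging framework that estimates merging coefficients from embedding space signals. First, we design a probe input composed of tokens from different modalities and pass it through each specialized MLLM to obtain layer-wise embedding responses that reflect modality-specific representation changes. We then estimate complementary merging coefficients at two levels in the embedding space: layer-wise coefficients from coarse-grained signals and element-wise coefficients from fine-grained signals, which are combined for robust coefficient estimation. In experiments on interactive effect prediction benchmarks, our method outperforms existing merging methods and even surpasses task-specific fine-tuned models. This shows that embedding space signals provide a principled and effective foundation for cross-modal MLLM merging.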

Original Abstract

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose a representation-aware merging framework that estimates merging coefficients from embedding space signals. We first design a probe input that consists of different modality tokens and forward it through each specialized MLLM to obtain layer-wise embedding responses that reflect modality-specific representation changes. We then estimate complementary merging coefficients at two granularities from the embedding space: layer-wise coefficients from coarse-grained signals and element-wise coefficients from fine-grained signals, which are jointly combined for robust coefficient estimation. Experiments on interactive effect prediction benchmarks show that our method outperforms existing merging methods and even surpasses task-specific fine-tuned models, establishing that embedding space signals provide a principled and effective foundation for cross-modal MLLM merging.
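
The abstract gives no formulas, so the following is only a toy sketch of the two-granularity idea it describes, not the authors' method. It assumes a shared base model, one embedding response per specialized model for a single layer, and a per-dimension parameter vector whose length matches the embedding width. The function names (`layer_coefficients`, `element_coefficients`, `merge_layer`), the use of softmax, and the product rule for combining the two coefficient sets are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_coefficients(base_emb, expert_embs):
    """Coarse signal (assumed): one weight per expert for the whole layer,
    from the L2 norm of its embedding shift relative to the base model."""
    shifts = [
        math.sqrt(sum((e - b) ** 2 for e, b in zip(emb, base_emb)))
        for emb in expert_embs
    ]
    return softmax(shifts)


def element_coefficients(base_emb, expert_embs):
    """Fine signal (assumed): one weight per expert per dimension,
    from the absolute embedding shift in that dimension."""
    coeffs = [[0.0] * len(base_emb) for _ in expert_embs]
    for d, b in enumerate(base_emb):
        weights = softmax([abs(emb[d] - b) for emb in expert_embs])
        for k, w in enumerate(weights):
            coeffs[k][d] = w
    return coeffs


def merge_layer(base_w, expert_ws, base_emb, expert_embs):
    """Merge one layer's parameter vector. The layer-wise and element-wise
    coefficients are multiplied and renormalized per dimension (an assumed
    combination rule), then applied to each expert's delta from the base."""
    lw = layer_coefficients(base_emb, expert_embs)
    ew = element_coefficients(base_emb, expert_embs)
    merged = []
    for d, b in enumerate(base_w):
        raw = [lw[k] * ew[k][d] for k in range(len(expert_ws))]
        norm = sum(raw)
        delta = sum(
            (raw[k] / norm) * (expert_ws[k][d] - b)
            for k in range(len(expert_ws))
        )
        merged.append(b + delta)
    return merged


# Toy usage: two experts whose probe responses move in different dimensions.
base_emb = [0.0, 0.0]
expert_embs = [[1.0, 0.0], [0.0, 1.0]]
base_w = [0.0, 0.0]
expert_ws = [[1.0, 1.0], [-1.0, -1.0]]
print(merge_layer(base_w, expert_ws, base_emb, expert_embs))
```

In this toy setup each merged dimension leans toward the expert whose probe embedding shifted more in that dimension, which is the input-dependent behavior the abstract contrasts with parameter-space heuristics.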

