2602.05993v1 Feb 05, 2026 cs.LG

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

N. Boffi (citations: 2,007; h-index: 15)
T. Jaakkola (citations: 56,729; h-index: 110)
Peter Holderrieth (citations: 355; h-index: 5)
Douglas Chen (citations: 482; h-index: 1)
L. Eyring (citations: 159; h-index: 4)
Ishin Shah (citations: 1; h-index: 1)
Giri Anantharaman (citations: 16; h-index: 1)
Yutong He (citations: 21; h-index: 2)
Zeynep Akata (citations: 1,547; h-index: 21)
Max Simchowitz (citations: 4; h-index: 1)

Original Abstract

Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.
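The abstract gives no formulas, so the following is only a minimal sketch of why a stochastic one-step sampler could make value estimation cheap. It assumes the soft value function common in reward alignment, V(x_t) = temperature * log E[exp(r(x_1) / temperature) | x_t], which is an assumption here and not taken from the paper. The names `toy_stochastic_map`, `reward`, `estimate_value`, and `resample_by_value` are hypothetical; the toy sampler is a Gaussian stand-in, not Diamond Maps or GLASS Flows.

```python
import math
import random


def toy_stochastic_map(x_t, t, rng):
    """Hypothetical stand-in for a learned one-step stochastic sampler.

    Draws one endpoint sample x_1 given the state x_t at time t in [0, 1].
    The noise scale shrinks to zero as t approaches 1.
    """
    return x_t + (1.0 - t) * rng.gauss(0.0, 1.0)


def reward(x_1):
    """Toy reward (assumption): prefers endpoints near 2.0."""
    return -(x_1 - 2.0) ** 2


def estimate_value(x_t, t, rng, num_samples=64, temperature=1.0):
    """Monte Carlo estimate of the assumed soft value function at x_t.

    Each sample costs one call to the one-step sampler instead of a
    full multi-step simulation.
    """
    scaled = [
        reward(toy_stochastic_map(x_t, t, rng)) / temperature
        for _ in range(num_samples)
    ]
    # Log-mean-exp with the maximum subtracted for numerical stability.
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / num_samples)
    return temperature * log_mean_exp


def resample_by_value(particles, t, rng, temperature=1.0):
    """One SMC-style resampling step with weights proportional to exp(V / temperature)."""
    values = [estimate_value(x, t, rng, temperature=temperature) for x in particles]
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    return rng.choices(particles, weights=weights, k=len(particles))
```

For example, `resample_by_value([-3.0, 0.0, 2.0], 0.5, random.Random(0))` returns three particles drawn with replacement, favoring states whose estimated value is higher. The same estimate could in principle score candidates in a search procedure; how the paper actually does this is not stated in the abstract.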

Citations: 1
Influential: 0
Altmetric: 30
Score: 151.0
