2603.04366v1 Mar 04, 2026 cs.SD

Low-Resource Guidance for Controllable Latent Audio Diffusion

Zachary Novack (Citations: 11, h-index: 1)

Zack Zukowski (Citations: 557, h-index: 9)

CJ Carr (Citations: 740, h-index: 9)

Julian Parker (Citations: 482, h-index: 5)

Zach Evans (Citations: 666, h-index: 5)

Josiah Taylor (Citations: 575, h-index: 4)

Taylor Berg-Kirkpatrick (Citations: 693, h-index: 10)

Julian J. McAuley (Citations: 26, h-index: 1)

Jordi Pons (Citations: 657, h-index: 5)

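Generative audio calls for fine-grained controllable outputs, yet most existing methods either require retraining the model for specific controls or rely on inference-time controls (e.g., guidance) that are computationally expensive. This work analyzes the bottlenecks of existing guidance-based controls, in particular the high cost per step caused by decoder backpropagation, and proposes a guidance-based approach that uses selective TFG and Latent-Control Heads (LatCHs). The approach controls latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, bypassing the expensive decoder step, and need only minimal training resources (7M parameters and about 4 hours of training). Experiments with Stable Audio Open show effective control over intensity, pitch, and beats (and combinations of them) while maintaining generation quality. The method balances precision and audio fidelity at a far lower computational cost than standard end-to-end guidance. Demo examples are available at https://zacharynovack.github.io/latch/latch.html.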

Original Abstract

Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost-per-step due to decoder backpropagation, we introduce a guidance-based approach through selective TFG and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step, and requiring minimal training resources (7M parameters and approximately 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
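To make the core idea concrete, the following is a toy sketch of guidance applied directly in latent space. It is not the paper's implementation: the linear `head`, the squared-error loss, the step size, and all numeric values are assumptions chosen for illustration, and the denoising steps of a real diffusion sampler are omitted. The point it shows is that when a small head predicts the control signal from the latent itself, the guidance gradient can be computed without decoding to audio or backpropagating through a decoder.

```python
# Toy latent-space guidance. Everything here is hypothetical and only
# illustrates the idea; it is not the LatCH architecture from the paper.

def head(w, z):
    """Hypothetical linear control head: maps a latent z to a scalar control value."""
    return sum(wi * zi for wi, zi in zip(w, z))

def guidance_grad(w, z, target):
    """Gradient of (head(z) - target)^2 with respect to z, in closed form."""
    err = head(w, z) - target
    return [2.0 * err * wi for wi in w]

def guided_step(z, w, target, step_size):
    """One guidance update on the latent. No decoder is evaluated."""
    grad = guidance_grad(w, z, target)
    return [zi - step_size * gi for zi, gi in zip(z, grad)]

def run_guidance(z, w, target, step_size=0.05, n_steps=50):
    """Repeat the guidance update; a real sampler would interleave denoising steps."""
    for _ in range(n_steps):
        z = guided_step(z, w, target, step_size)
    return z

# Illustrative values (assumptions, not taken from the paper).
w = [0.5, -0.25, 1.0, 0.75]   # head weights
z0 = [1.0, 0.0, -1.0, 2.0]    # initial latent
target = 0.5                  # desired control value, e.g. an intensity level

z = run_guidance(z0, w, target)
err_before = abs(head(w, z0) - target)
err_after = abs(head(w, z) - target)
print(f"control error before: {err_before:.4f}, after: {err_after:.6f}")
```

In standard end-to-end guidance, the control feature would instead be measured on decoded audio, so each step would also pay for a decoder forward and backward pass; that is the per-step cost the abstract identifies as the bottleneck.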

1 Citations

0 Influential

5 Altmetric

26.0 Score
