2605.05626v1 May 07, 2026 cs.CL

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

Ziang Ye

Citations: 17

h-index: 3

Brinnae Bent

Citations: 2,065

h-index: 18

Vihaan Nama

Citations: 9

h-index: 1

Shreya Mendi

Citations: 0

h-index: 0

Original Abstract

Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., Macro F1 increased by 60% on average across 4B+ parameter models, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions: the Missed Intervention Rate (MIR) averaged 0.50, and the problem persists even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.
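
The metrics named in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: MIR is taken to be the fraction of gold SPEAK turns predicted as SILENT (that is, 1 minus SPEAK recall, which is consistent with the reported MIR of 0.186-0.218 alongside recall of 0.78-0.81), and Macro F1 is the unweighted mean of the SPEAK and SILENT F1 scores. The function names and the reward values in `asymmetric_reward` are hypothetical and chosen only to show the idea of penalizing a missed intervention more heavily than an unnecessary one; they are not the paper's implementation.

```python
from typing import Dict, Sequence, Tuple

SPEAK, SILENT = "SPEAK", "SILENT"


def confusion(gold: Sequence[str], pred: Sequence[str]) -> Tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn), treating SPEAK as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == SPEAK and p == SPEAK for g, p in pairs)
    fp = sum(g == SILENT and p == SPEAK for g, p in pairs)
    fn = sum(g == SPEAK and p == SILENT for g, p in pairs)
    tn = sum(g == SILENT and p == SILENT for g, p in pairs)
    return tp, fp, fn, tn


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class; defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    tp, fp, fn, tn = confusion(gold, pred)
    speak_f1 = f1(tp, fp, fn)
    # For the SILENT class the roles swap: tn are its hits,
    # fn are its false positives, fp are its false negatives.
    silent_f1 = f1(tn, fn, fp)
    gold_speak = tp + fn
    return {
        "macro_f1": (speak_f1 + silent_f1) / 2,
        "speak_recall": tp / gold_speak if gold_speak else 0.0,
        # Assumed definition: share of warranted interventions that were missed.
        "mir": fn / gold_speak if gold_speak else 0.0,
    }


def asymmetric_reward(
    gold: str,
    pred: str,
    correct: float = 1.0,             # hypothetical value
    miss_penalty: float = -2.0,       # hypothetical: staying silent when SPEAK was warranted
    interrupt_penalty: float = -1.0,  # hypothetical: speaking when SILENT was warranted
) -> float:
    """Scalar reward for one turn; misses cost more than interruptions."""
    if gold == pred:
        return correct
    return miss_penalty if gold == SPEAK else interrupt_penalty


if __name__ == "__main__":
    gold = [SPEAK, SPEAK, SILENT, SILENT, SPEAK, SILENT]
    pred = [SPEAK, SILENT, SILENT, SPEAK, SILENT, SILENT]
    print(evaluate(gold, pred))
    print([asymmetric_reward(g, p) for g, p in zip(gold, pred)])
```

In this toy example the predictor misses two of three warranted interventions, so MIR is 2/3 and SPEAK recall is 1/3. With the hypothetical penalties above, an over-conservative policy accumulates a lower reward than one that errs toward speaking, which is the kind of pressure the abstract attributes to asymmetric reward shaping.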
