2603.22765v1 Mar 24, 2026 cs.CL

DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona

Jaewon Lee

J. Choi

Sungzoon Cho

Abstract

Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas--such as attorneys, prosecutors, and judges--to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation achieves improvement in lexical diversity as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve competitive or superior recall performance compared to those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.
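
The abstract measures lexical diversity with Self-BLEU, where each generated query is scored by BLEU against all the other queries in the same set and the scores are averaged; a lower value indicates a more diverse set. The following is a minimal sketch of that metric, not the authors' implementation. Whitespace tokenization, 4-gram order, add-one smoothing, and the example queries are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU with add-one smoothing (an assumed simplification)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty against the reference whose length is closest.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - hyp_len), length))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(log_precisions) / max_n)


def self_bleu(sentences, max_n=4):
    """Average BLEU of each sentence against all others (needs >= 2 sentences)."""
    tokenized = [s.lower().split() for s in sentences]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


# Hypothetical queries, invented here; they do not come from CLERC or COLIEE.
repetitive = [
    "what is the standard for summary judgment",
    "what is the standard for summary judgment",
]
varied = [
    "what is the standard for summary judgment",
    "can prior convictions be admitted to impeach a witness",
]

print("repetitive set:", self_bleu(repetitive))
print("varied set:", self_bleu(varied))
```

Under these assumptions, a set of identical queries scores 1.0 and a set with little n-gram overlap scores much lower, which is the direction of improvement the abstract attributes to persona-based augmentation.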
