2603.03072v1 Mar 03, 2026 cs.AI

TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

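Large language models (LLMs) are increasingly used to support scientists across a range of research workflows. A central challenge is generating high-quality figures from text descriptions, which are often expressed as TikZ programs that render to scientific images. Earlier work proposed various datasets and modeling approaches for this task. Existing Text-to-TikZ datasets, however, are too small and noisy to reflect the complexity of TikZ, which leads to mismatches between the text and the rendered figure. Existing approaches also rely only on supervised fine-tuning (SFT), so the model never learns the rendered semantics of a figure and can produce errors such as loops, irrelevant content, and incorrect spatial relations. To address these problems, we build DaTikZ-V4, a dataset more than four times larger and higher in quality than DaTikZ-V3, augmented with LLM-generated figure descriptions. With this dataset we train the TikZilla model family, small open-source Qwen models (3B and 8B) that undergo SFT followed by reinforcement learning (RL). For RL, an image encoder trained via inverse graphics supplies semantically faithful reward signals. In an extensive human evaluation with more than 1,000 judgments, TikZilla improves on its base models by 1.5 to 2 points on a 5-point scale, scores 0.5 points above GPT-4o, and matches GPT-5 in the image-based evaluation, at a much smaller model size. Code, data, and models will be released.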

Original Abstract

Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
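The abstract says only that an image encoder supplies the RL reward; it does not specify how the reward is computed. As a rough illustration of one common pattern, and not the authors' method, the sketch below scores a generated figure by the cosine similarity between its embedding and the embedding of the reference figure, and falls back to a fixed penalty when the TikZ program fails to render. The function names, the `failure_reward` value, and the use of plain float vectors in place of real encoder outputs are all assumptions made for this example.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def image_reward(
    generated_embedding: Optional[Sequence[float]],
    reference_embedding: Sequence[float],
    failure_reward: float = -1.0,  # hypothetical penalty, not taken from the paper
) -> float:
    """Hypothetical reward: similarity between rendered and reference figures.

    `generated_embedding` is None when the generated TikZ code did not compile,
    in which case there is no image to embed and the penalty is returned.
    """
    if generated_embedding is None:
        return failure_reward
    return cosine_similarity(generated_embedding, reference_embedding)
```

In a real pipeline the two vectors would come from rendering each TikZ program to an image and passing it through the encoder; here they are stand-ins so the scoring step can be read in isolation.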
