2603.12597v1 Mar 13, 2026 cs.LG

Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs

Zixin Wen

Aarti Singh

Yifu Cai

Kyle Lee

S. Estep

Joshua Sunshine

Yuejie Chi

Wode Ni

Carnegie Mellon University

Original Abstract

Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components ("ideas") and performs code planning based on the ideas. Given the plan, Feynman translates ideas into simple declarative programs and iterates, receiving feedback to visually refine the diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. The optimization-based rendering of Penrose preserves the visual semantics while injecting fresh randomness into the layout, thereby producing diagrams with both visual consistency and diversity. As a result, Feynman can author diagrams along with grounded captions at very low cost and in little time. Using Feynman, we synthesized a dataset of more than 100k well-aligned diagram-caption pairs. We also curate a vision-language benchmark, Diagramma, from freshly generated data. Diagramma can be used to evaluate the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and full agent pipeline as an open-source project.
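The claim that optimization-based rendering keeps visual semantics fixed while randomizing the layout can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not Penrose's implementation or API: the functions `inside_penalty`, `numeric_grad`, and `layout`, the padding, the step size, and plain finite-difference gradient descent are all hypothetical choices. It encodes a single relation, "circle A lies inside circle B", as a penalty energy, samples random starting centers from a seed, and minimizes the energy, so each seed yields a different layout that satisfies the same relation.

```python
import math
import random

# Toy illustration only (hypothetical, not Penrose's API): the relation
# "circle A lies inside circle B" is encoded as a penalty energy that is
# zero exactly when the relation holds with a small padding margin.


def inside_penalty(ax, ay, bx, by, ra, rb, pad=5.0):
    """Squared violation of 'A inside B'; 0.0 when satisfied."""
    d = math.hypot(ax - bx, ay - by)
    violation = d + ra + pad - rb
    return max(0.0, violation) ** 2


def energy(params, ra, rb):
    ax, ay, bx, by = params
    return inside_penalty(ax, ay, bx, by, ra, rb)


def numeric_grad(params, ra, rb, eps=1e-4):
    """Central finite-difference gradient of the energy."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grad.append((energy(plus, ra, rb) - energy(minus, ra, rb)) / (2 * eps))
    return grad


def layout(seed, ra=20.0, rb=60.0, canvas=200.0, steps=2000, lr=0.01):
    """Sample random centers, then descend the energy to satisfy the relation."""
    rng = random.Random(seed)
    # Fresh randomness: each seed starts both centers somewhere different.
    params = [rng.uniform(-canvas / 2, canvas / 2) for _ in range(4)]
    for _ in range(steps):
        grad = numeric_grad(params, ra, rb)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params  # [ax, ay, bx, by]


if __name__ == "__main__":
    for seed in range(3):
        ax, ay, bx, by = layout(seed)
        d = math.hypot(ax - bx, ay - by)
        print(f"seed {seed}: A=({ax:.1f}, {ay:.1f}) B=({bx:.1f}, {by:.1f}) "
              f"inside={d + 20.0 <= 60.0}")
```

Each seed produces a different placement of the two circles, yet every result keeps A inside B. In Penrose itself, such constraints are declared in Style programs rather than hard-coded; they are written out by hand here only to keep the sketch self-contained.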
