2605.05499v1 May 06, 2026 cs.AI

FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

Onat Gungor (Citations: 15, h-index: 1)

Tajana Rosing (Citations: 38, h-index: 3)

Ye Tian (Citations: 33, h-index: 4)

Woojin Lee (Citations: 3, h-index: 1)

Pranav Mekkoth (Citations: 0, h-index: 0)

Original Abstract

The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component of real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute-level discrimination. To ensure practical deployability, FoodCHA utilizes the compact Moondream-2B vision-language model, which provides strong reasoning capability while maintaining lower computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a striking 153.2% improvement in cooking style classification precision.
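The hierarchical anchoring idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the taxonomy, the `Chooser` callable standing in for a vision-language model, and the helper names `snap_to_label` and `classify_food` are all hypothetical. The sketch only shows the control flow the abstract describes, where each stage restricts the candidate labels for the next stage and free-text answers are mapped back onto canonical labels.

```python
import difflib
from typing import Callable, Dict, Sequence

# Hypothetical taxonomy (category -> subcategory -> cooking styles).
# Illustration only; this is NOT the FoodNExTDB label set.
TAXONOMY: Dict[str, Dict[str, Sequence[str]]] = {
    "vegetables": {"potato": ["fried", "boiled", "baked"],
                   "tomato": ["raw", "grilled"]},
    "protein": {"egg": ["fried", "boiled", "scrambled"],
                "chicken": ["grilled", "fried"]},
}

# Stand-in for a vision-language model call (an assumption):
# (image bytes, prompt, candidate labels) -> free-text answer.
Chooser = Callable[[bytes, str, Sequence[str]], str]


def snap_to_label(answer: str, candidates: Sequence[str]) -> str:
    """Map a free-text answer onto one canonical candidate label."""
    text = answer.lower().strip()
    # Prefer a candidate that literally appears in the answer.
    contained = [c for c in candidates if c in text]
    if contained:
        return max(contained, key=len)
    # Otherwise fall back to the closest string match (cutoff=0.0
    # guarantees a result for a non-empty candidate list).
    return difflib.get_close_matches(text, candidates, n=1, cutoff=0.0)[0]


def classify_food(image: bytes, choose: Chooser) -> Dict[str, str]:
    """Three-stage prediction; each stage is anchored on the previous one."""
    categories = list(TAXONOMY)
    category = snap_to_label(
        choose(image, "Which food category is shown?", categories),
        categories)

    # Only subcategories of the predicted category are offered.
    subcategories = list(TAXONOMY[category])
    subcategory = snap_to_label(
        choose(image, f"The category is {category}. "
                      "Which subcategory is shown?", subcategories),
        subcategories)

    # Only cooking styles valid for the predicted subcategory are offered.
    styles = list(TAXONOMY[category][subcategory])
    style = snap_to_label(
        choose(image, f"The food is {subcategory} ({category}). "
                      "Which cooking style was used?", styles),
        styles)

    return {"category": category, "subcategory": subcategory,
            "cooking_style": style}


def fake_vlm(image: bytes, prompt: str, candidates: Sequence[str]) -> str:
    """Canned, deliberately non-canonical answers for demonstration."""
    if "cooking style" in prompt:
        return "Fried in oil"
    if "subcategory" in prompt:
        return "eggs"
    return "Protein foods"


result = classify_food(b"", fake_vlm)
print(result)
```

In a real system the `fake_vlm` stub would be replaced by a call to an actual model, and the label-snapping step would likely need something more careful than substring and string-similarity matching.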

Citations: 0, Influential: 0, Altmetric: 2, Score: 10.0
