2603.03280v1 Mar 03, 2026 cs.RO

How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference

Toru Lin (Citations: 235, h-index: 5)

Shuying Deng (Citations: 106, h-index: 2)

Zhao-Heng Yin (Citations: 274, h-index: 7)

Pieter Abbeel (Citations: 71, h-index: 3)

Jitendra Malik (Citations: 965, h-index: 15)

Original Abstract

Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.
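The abstract does not say how the learned reward model is built, so the following is only a minimal sketch of one common way to learn a reward from pairwise human preferences: a Bradley-Terry (logistic) loss over a linear reward. It is not the authors' method. The feature names (peeled-area coverage, mean peel thickness), the toy preference pairs, and every function name are hypothetical and chosen purely for illustration.

```python
import math

def reward(w, features):
    """Linear reward: weighted sum of per-trajectory quality features."""
    return sum(wi * fi for wi, fi in zip(w, features))

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def preference_loss_and_grad(w, pairs):
    """Mean Bradley-Terry loss, -log sigmoid(r(better) - r(worse)), and its gradient."""
    loss = 0.0
    grad = [0.0] * len(w)
    for better, worse in pairs:
        margin = reward(w, better) - reward(w, worse)
        loss += softplus(-margin)
        p_wrong = sigmoid(-margin)  # probability mass on the wrong ordering
        for i in range(len(w)):
            grad[i] -= p_wrong * (better[i] - worse[i])
    n = len(pairs)
    return loss / n, [g / n for g in grad]

def fit_reward(pairs, dim, lr=0.5, steps=500):
    """Plain gradient descent on the preference loss, starting from zero weights."""
    w = [0.0] * dim
    for _ in range(steps):
        _, grad = preference_loss_and_grad(w, pairs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical features per trajectory: (peeled-area coverage, mean peel thickness in mm).
# Each pair is (trajectory the annotator preferred, trajectory the annotator rejected).
pairs = [
    ((0.90, 1.0), (0.60, 1.2)),
    ((0.80, 0.8), (0.80, 2.0)),
    ((0.95, 1.5), (0.50, 1.4)),
    ((0.70, 0.9), (0.40, 2.5)),
]

initial_loss, _ = preference_loss_and_grad([0.0, 0.0], pairs)
w = fit_reward(pairs, dim=2)
final_loss, _ = preference_loss_and_grad(w, pairs)
```

In this toy setup the fitted weights score each preferred trajectory above its rejected counterpart. A reward of this kind could then be used to score rollouts during policy finetuning, although the abstract does not describe how the authors do so.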
