2603.11950v1 Mar 12, 2026 cs.AI

Learning Transferable Sensor Models via Language-Informed Pretraining

Yu Wu

Yuliang Chen

Arvind Pillai (Dartmouth College)

Tess Z. Griffin

Lisa A. Marsch

Michael V. Heinz

Nicholas C. Jacobson

Andrew T. Campbell

Abstract

Modern sensing systems generate large volumes of unlabeled multivariate time-series data. This abundance of unlabeled data makes self-supervised learning (SSL) a natural approach for learning transferable representations. However, most existing approaches are optimized for reconstruction or forecasting objectives and often fail to capture the semantic structure required for downstream classification and reasoning tasks. While recent sensor-language alignment methods improve semantic generalization through captioning and zero-shot transfer, they are limited to fixed sensor configurations, such as predefined channel sets, signal lengths, or temporal resolutions, which hinders cross-domain applicability. To address these gaps, we introduce SLIP (Sensor Language-Informed Pretraining), an open-source framework for learning language-aligned representations that generalize across diverse sensor setups. SLIP integrates contrastive alignment with sensor-conditioned captioning, facilitating both discriminative understanding and generative reasoning. By repurposing a pretrained decoder-only language model via cross-attention and introducing an elegant, flexible patch-embedder, SLIP supports different temporal resolutions and variable-length input at inference time without additional retraining. Across 11 datasets, SLIP demonstrates superior performance in zero-shot transfer, signal captioning, and question answering. It achieves a 77.14% average linear-probing accuracy, a 5.93% relative improvement over strong baselines, and reaches 64.83% accuracy in sensor-based question answering.
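The abstract names contrastive alignment between sensor and language representations but does not spell out the objective. As a rough illustration only, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. This is an assumption about the general family of objective, not SLIP's actual loss; the function names, the `temperature` value, and the embedding shapes are hypothetical.

```python
import numpy as np


def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical sketch, not SLIP's code).

    sensor_emb, text_emb: arrays of shape (batch, dim); row i of each
    array is assumed to describe the same sample (a positive pair).
    """
    # L2-normalise so the dot product is a cosine similarity.
    s = sensor_emb / np.linalg.norm(sensor_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = s @ t.T / temperature
    idx = np.arange(len(s))

    # Cross-entropy with the matching pair (the diagonal) as the target,
    # computed in both directions: sensor-to-text and text-to-sensor.
    loss_s2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2s = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2t + loss_t2s)
```

Calling `symmetric_contrastive_loss` on correctly paired embeddings should give a lower value than calling it on the same embeddings with the text rows shuffled, which is the property a contrastive pretraining stage relies on.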
