2604.03121v1 Apr 03, 2026 cs.CR

An Independent Safety Evaluation of Kimi K2.5

Ida Caspary, Zheng-Xin Yong, Parv Mahajan, Andy Y. T. Wang, Yernat Yestekov, Zora Che, Mosh Levy, Elle Najt, Dennis M. Murphy, Prashant Kulkarni, Lev E McKinney, Kei Nishimura-Gasparian, Ram Potham, Aengus Lynch, Michael Chen

Abstract

Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying safety evaluation. In this work, we conduct a preliminary safety assessment of Kimi K2.5 focusing on risks likely to be exacerbated by powerful open-weight models. Specifically, we evaluate the model for CBRNE misuse risk, cybersecurity risk, misalignment, political censorship, bias, and harmlessness, in both agentic and non-agentic settings. We find that Kimi K2.5 shows similar dual-use capabilities to GPT 5.2 and Claude Opus 4.5, but with significantly fewer refusals on CBRNE-related requests, suggesting it may uplift malicious actors in weapon creation. On cyber-related tasks, we find that Kimi K2.5 demonstrates competitive cybersecurity performance, but it does not appear to possess frontier-level autonomous cyberoffensive capabilities such as vulnerability discovery and exploitation. We further find that Kimi K2.5 shows concerning levels of sabotage ability and self-replication propensity, although it does not appear to have long-term malicious goals. In addition, Kimi K2.5 exhibits narrow censorship and political bias, especially in Chinese, and is more compliant with harmful requests related to spreading disinformation and copyright infringement. Finally, we find the model refuses to engage in user delusions and generally has low over-refusal rates. While preliminary, our findings highlight how safety risks exist in frontier open-weight models and may be amplified by the scale and accessibility of open-weight releases. Therefore, we strongly urge open-weight model developers to conduct and release more systematic safety evaluations required for responsible deployment.
