Ruixiao Lin
Publications
Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning
Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while the other interventions mitigate them. Pathological analysis attributes the exacerbation to backdoor gradient amplification and the mitigation to activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework, SCC, for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.
Shattering the Echo Chamber: Hidden Safeguards in Manuscripts Against the AI Takeover of Peer Review
As LLMs become increasingly capable, editorial boards and program committees are growing concerned about reviewers who fully outsource peer review to commercial chatbots. This concern stems from prior findings that current chatbots lack the independent critical thinking and depth of reasoning required to assess scientific novelty. One promising direction for mitigating this concern is to embed hidden instructions into manuscripts that disrupt or alter chatbot-generated reviews. However, existing methods remain intuitive and fragile, as they typically rely on homogeneous payloads injected in an inter-stream manner, rendering them susceptible to sanitization or neutralization. In this paper, we identify End-to-End Review Outsourcing as an emerging threat and propose IntraGuard, a black-box, venue-agnostic defense framework grounded in the structural–visual decoupling inherent to the PDF. Designed for committee-side deployment, IntraGuard supports both explicit strategies that trigger refusal or warning signals, and implicit strategies that embed predefined textual markers into the generated review. These strategies can be deployed via any of three intra-stream injection mechanisms, each of which seamlessly embeds heterogeneous defensive text objects within the PDF's underlying structure without altering its visual presentation. Extensive evaluations across 7 real-world commercial chatbot settings and 12 venues spanning diverse disciplines show that IntraGuard achieves a defense success rate of up to 84%, while preserving peer-review invariance for human reviewers. IntraGuard is lightweight and hardware-independent, incurring an average overhead of only one second per manuscript on a commodity personal computer. We further evaluate 11 adaptive attacks spanning manuscript sanitization and instruction interference, and discuss the implications of constructing ensemble defenses.