Gang Chen
Publications
FastBUS: A Fast Bayesian Framework for Unified Weakly-Supervised Learning
Machine learning often involves various imprecise labels, leading to diverse weakly supervised settings. While recent methods aim for universal handling, they usually suffer from complex manual pre-work, ignore the relationships between associated labels, or cannot process batches because of their computational design, resulting in long running times. To address these limitations, we propose a novel general framework that efficiently infers latent true label distributions across various weak supervisions. Our key idea is to express the brute-force label search process as a probabilistic transition of label variables, compressing diverse weakly supervised DFS tree structures into a shared Bayesian network. From this, we derive a latent probability calculation algorithm based on generalized belief propagation and propose two joint acceleration strategies: 1) introducing a low-rank assumption to approximate the transition matrix, reducing time complexity; 2) designing an end-to-end state evolution module to learn batch-scale transition matrices, facilitating multi-category batch processing. In addition, we demonstrate the equivalence of our method with the EM algorithm in most scenarios. Extensive experiments show that our method achieves SOTA results under most weakly supervised settings and runs up to hundreds of times faster than other general methods.
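The low-rank strategy mentioned in this abstract can be illustrated with a small sketch. It assumes the transition matrix factors as T = U V^T with rank r much smaller than the number of label states K, so one propagation step costs O(K r) instead of O(K^2). The function names, the toy factors, and the update rule are illustrative assumptions, not the paper's algorithm.

```python
def propagate_full(T, belief):
    """Dense update: out[j] = sum_i belief[i] * T[i][j], O(K^2) work."""
    K = len(T)
    return [sum(belief[i] * T[i][j] for i in range(K)) for j in range(K)]


def propagate_low_rank(U, V, belief):
    """Same update with T = U V^T (U, V are K x r), O(K * r) work."""
    r = len(U[0])
    # Project the belief onto the r latent factors: z = U^T belief.
    z = [sum(belief[i] * U[i][k] for i in range(len(U))) for k in range(r)]
    # Expand back to the K label states: out = V z.
    return [sum(V[j][k] * z[k] for k in range(r)) for j in range(len(V))]


# Hypothetical, unnormalized toy factors with K = 3 states and rank r = 2.
U = [[0.5, 0.1], [0.2, 0.3], [0.1, 0.4]]
V = [[0.6, 0.2], [0.3, 0.5], [0.1, 0.3]]
# Materialize T only to check that both routes agree.
T = [[sum(U[i][k] * V[j][k] for k in range(2)) for j in range(3)]
     for i in range(3)]
belief = [0.2, 0.5, 0.3]

full = propagate_full(T, belief)
fast = propagate_low_rank(U, V, belief)
```

In the factored form the K x K matrix never has to be built, which is where the saving would come from when K is large.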
Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority
The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshaping process that aligns raw data with the ideal alignment manifold. We analyze recent breakthroughs through this unified lens, categorizing them into two distinct regimes: Positive Priority for noise filtration and Signed Priority for unlearning toxic modes. We revisit existing progress and limitations, identify key challenges, and suggest directions for future research.
kNN-Graph: An adaptive graph model for $k$-nearest neighbors
The k-nearest neighbors (kNN) algorithm is a cornerstone of non-parametric classification in artificial intelligence, yet its deployment in large-scale applications is persistently constrained by the computational trade-off between inference speed and accuracy. Existing approximate nearest neighbor solutions accelerate retrieval but often degrade classification precision and lack adaptability in selecting the optimal neighborhood size (k). Here, we present an adaptive graph model that decouples inference latency from computational complexity. By integrating a Hierarchical Navigable Small World (HNSW) graph with a pre-computed voting mechanism, our framework completely transfers the computational burden of neighbor selection and weighting to the training phase. Within this topological structure, higher graph layers enable rapid navigation, while lower layers encode precise, node-specific decision boundaries with adaptive neighbor counts. Benchmarking against eight state-of-the-art baselines across six diverse datasets, we demonstrate that this architecture significantly accelerates inference, achieving real-time performance, without compromising classification accuracy. These findings offer a scalable, robust solution to the long-standing inference bottleneck of kNN, establishing a new structural paradigm for graph-based non-parametric learning.
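The idea of moving the vote to training time can be sketched as follows. This is a simplified illustration under stated assumptions: a brute-force nearest-node search stands in for the HNSW graph, k is fixed rather than chosen adaptively per node, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def precompute_votes(points, labels, k):
    """Training phase: store, for every node, the majority label of its k nearest nodes."""
    votes = []
    for p in points:
        # Sort all node indices by distance to p; the node itself comes first.
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        counts = Counter(labels[j] for j in order[:k])
        votes.append(counts.most_common(1)[0][0])
    return votes


def predict(points, votes, query):
    """Inference phase: find the nearest node and return its stored vote."""
    # Brute-force search is a stand-in for graph navigation (assumption).
    nearest = min(range(len(points)), key=lambda j: math.dist(query, points[j]))
    return votes[nearest]


# Hypothetical toy data: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
votes = precompute_votes(points, labels, k=3)

near_origin = predict(points, votes, (0.2, 0.1))
near_far_cluster = predict(points, votes, (5.5, 5.5))
```

At inference time no neighbor set is collected and no vote is counted; the only remaining work is locating a single node, which is the step a navigable graph would accelerate.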
Data-Free Privacy-Preserving for LLMs via Model Inversion and Selective Unlearning
Large language models (LLMs) exhibit powerful capabilities but risk memorizing sensitive personally identifiable information (PII) from their training data, posing significant privacy concerns. While machine unlearning techniques aim to remove such data, they predominantly depend on access to the training data. This requirement is often impractical, as training data in real-world deployments is commonly proprietary or inaccessible. To address this limitation, we propose Data-Free Selective Unlearning (DFSU), a novel privacy-preserving framework that removes sensitive PII from an LLM without requiring its training data. Our approach first synthesizes pseudo-PII through language model inversion, then constructs token-level privacy masks for these synthetic samples, and finally performs token-level selective unlearning via a contrastive mask loss within a low-rank adaptation (LoRA) subspace. Extensive experiments on the AI4Privacy PII-Masking dataset using Pythia models demonstrate that our method effectively removes target PII while maintaining model utility.
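Token-level selective unlearning with a privacy mask can be sketched roughly as below. The abstract does not give the form of the contrastive mask loss, so this uses a simple stand-in: the negative log-likelihood is minimized on unmasked tokens and maximized on masked (PII) tokens. The function name, the `forget_weight` parameter, and the toy numbers are assumptions for illustration only.

```python
def masked_unlearning_loss(token_logps, pii_mask, forget_weight=1.0):
    """Stand-in objective: retain non-PII tokens, push down PII tokens.

    token_logps: per-token log-probabilities of the target tokens.
    pii_mask:    1 where the token is part of a PII span, else 0.
    """
    retain = [-lp for lp, m in zip(token_logps, pii_mask) if m == 0]
    forget = [-lp for lp, m in zip(token_logps, pii_mask) if m == 1]
    # Mean NLL per group; an empty group contributes nothing.
    retain_nll = sum(retain) / len(retain) if retain else 0.0
    forget_nll = sum(forget) / len(forget) if forget else 0.0
    # Minimizing this lowers NLL on retained tokens and raises it on PII tokens.
    return retain_nll - forget_weight * forget_nll


# Hypothetical log-probabilities for a four-token sequence; tokens 1 and 3 are PII.
token_logps = [-0.1, -2.0, -0.3, -1.0]
pii_mask = [0, 1, 0, 1]
loss = masked_unlearning_loss(token_logps, pii_mask)
```

In practice such an objective would be applied only to the LoRA parameters, and an unbounded ascent term usually needs clipping or a reference-model constraint to avoid degrading utility; neither is shown here.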