Lingfeng Li
Publications
System Design for Maintaining Internal State Consistency in Long-Horizon Robotic Tabletop Games
Long-horizon tabletop games pose a distinct systems challenge for robotics: small perceptual or execution errors can invalidate accumulated task state, propagate across decision-making modules, and ultimately derail interaction. This paper studies how to maintain internal state consistency in turn-based, multi-human robotic tabletop games through deliberate system design rather than isolated component improvement. Using Mahjong as a representative long-horizon setting, we present an integrated architecture that explicitly maintains perceptual, execution, and interaction state, partitions high-level semantic reasoning from time-critical perception and control, and incorporates verified action primitives with tactile-triggered recovery to prevent premature state corruption. We further introduce interaction-level monitoring mechanisms to detect turn violations and hidden-information breaches that threaten execution assumptions. Beyond demonstrating complete-game operation, we provide an empirical characterization of failure modes, recovery effectiveness, cross-module error propagation, and hardware-algorithm trade-offs observed during deployment. Our results show that explicit partitioning, monitored state transitions, and recovery mechanisms are critical for sustaining executable consistency over extended play, whereas monolithic or unverified pipelines lead to measurable degradation in end-to-end reliability. The proposed system serves as an empirical platform for studying system-level design principles in long-horizon, turn-based interaction.
Decoupling Return-to-Go for Efficient Decision Transformer
The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT's performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.
Adapting Rules of Official International Mahjong for Online Players
As one of the worldwide spread traditional game, Official International Mahjong can be played and promoted online through remote devices instead of requiring face-to-face interaction. However, online players have fragmented playtime and unfixed combination of opponents in contrary to offline players who have fixed opponents for multiple rounds of play. Therefore, the rules designed for offline players need to be modified to ensure the fairness of online single-round play. Specifically, We employ a world champion AI to engage in self-play competitions and conduct statistical data analysis. Our study reveals the first-mover advantage and issues in the subgoal scoring settings. Based on our findings, we propose rule adaptations to make the game more suitable for the online environment, such as introducing compensatory points for the first-mover advantage and refining the scores of subgoals for different tile patterns. Compared with the traditional method of rotating positions over multiple rounds to balance first-mover advantage, our compensatory points mechanism in each round is more convenient for online players. Furthermore, we implement the revised Mahjong game online, which is open for online players. This work is an initial attempt to use data from AI systems to evaluate Official Internatinoal Mahjong's game balance and develop a revised version of the traditional game better adapted for online players.