ArtI-Insight

#1 2602.07943v1 Feb 08, 2026

IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery

In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable. Identifying valid instruments requires interdisciplinary knowledge, creativity, and contextual understanding, making it a non-trivial task. In this paper, we investigate whether large language models (LLMs) can aid in this task. We conduct a two-stage evaluation. First, we test whether LLMs can recover well-established instruments from the literature, assessing their ability to replicate standard reasoning. Second, we evaluate whether LLMs can identify and avoid instruments that have been empirically or theoretically discredited. Building on these results, we introduce IV Co-Scientist, a multi-agent system that proposes, critiques, and refines IVs for a given treatment-outcome pair. We also introduce a statistical test to contextualize consistency in the absence of ground truth. Our results show the potential of LLMs to discover valid instrumental variables from a large observational database.

Ivaxi Sheth Mario Fritz Zhijing Jin Bryan Wilder D. Janzing

0 Citations
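
For readers unfamiliar with the instrumental-variable idea that the IV Co-Scientist abstract builds on, the following is a minimal sketch of how a valid instrument removes confounding bias. It is not the paper's method or code. The simulated data, the coefficients (0.8, 3.0, a true effect of 2.0), and the function names `covariance`, `ols_slope`, `iv_slope`, and `simulate` are all assumptions made for illustration. With a single instrument, the IV estimate reduces to the ratio cov(z, y) / cov(z, x), which coincides with two-stage least squares.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    """Naive regression slope of y on x; biased when x and y share a confounder."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Single-instrument IV estimate: cov(z, y) / cov(z, x)."""
    return covariance(z, y) / covariance(z, x)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Hypothetical data: u confounds x and y; z affects y only through x."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                           # unobserved confounder
        zi = rng.gauss(0, 1)                          # instrument, independent of u
        xi = 0.8 * zi + u + rng.gauss(0, 1)           # endogenous treatment
        yi = true_effect * xi + 3.0 * u + rng.gauss(0, 1)  # outcome
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


if __name__ == "__main__":
    z, x, y = simulate()
    # The naive slope absorbs the confounder's influence and overshoots 2.0;
    # the IV slope should land near the simulated true effect.
    print("OLS slope:", round(ols_slope(x, y), 3))
    print("IV slope: ", round(iv_slope(z, x, y), 3))
```

In this simulation the instrument is valid by construction. In practice, checking that a candidate instrument is relevant and affects the outcome only through the treatment is the hard part, and that is the task the abstract describes.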

#2 2602.01146v1 Feb 01, 2026

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

Conversational assistants are increasingly integrating long-term memory with large language models (LLMs). This persistence of memories, e.g., the user is vegetarian, can enhance personalization in future conversations. However, the same persistence can also introduce safety risks that have been largely overlooked. Hence, we introduce PersistBench to measure the extent of these safety risks. We identify two long-term memory-specific risks: cross-domain leakage, where LLMs inappropriately inject context from the long-term memories; and memory-induced sycophancy, where stored long-term memories insidiously reinforce user biases. We evaluate 18 frontier and open-source LLMs on our benchmark. Our results reveal a surprisingly high failure rate across these LLMs: a median failure rate of 53% on cross-domain samples and 97% on sycophancy samples. To address this, our benchmark encourages the development of more robust and safer long-term memory usage in frontier conversational systems.

Sidharth Pulipaka T. S. Bajwa Vyas Raina Ivaxi Sheth Oliver Chen +1

1 Citation

#3 2601.18483v1 Jan 26, 2026

Funny or Persuasive, but Not Both: Evaluating Fine-Grained Multi-Concept Control in LLMs

Large Language Models (LLMs) offer strong generative capabilities, but many applications require explicit and fine-grained control over specific textual concepts, such as humor, persuasiveness, or formality. Prior approaches in prompting and representation engineering can provide coarse or single-attribute control, but systematic evaluation of multi-attribute settings remains limited. We introduce an evaluation framework for fine-grained controllability for both single- and dual-concept scenarios, focusing on linguistically distinct concept pairs (e.g., persuasiveness vs. humor). Surprisingly, across multiple LLMs and generative tasks, we find that performance often drops in the dual-concept setting, even though the chosen concepts should in principle be separable. This reveals a fundamental limitation of naive prompting-based control: models struggle with compositionality even when concepts are intuitively independent. Our framework provides systematic evidence of this gap and offers a principled approach for measuring the ability of future methods for multi-concept control.

Vyas Raina Ivaxi Sheth Mario Fritz Arya Labroo Amaani Ahmed

1 Citation
