Similar to 2605.29243 · 19 results
The paper introduces 'probe trajectories' (a continuous measure of a concept's probability across a model's reasoning process) to improve the monitoring of Large Reasoning Models' future behavior, showi…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
The paper investigates forecasting sparse and bursty vulnerability sightings, concluding that traditional time-series models like SARIMAX are inadequate, and count-based methods like Poisson regressio…
The paper finds that while LLMs can detect distress regardless of delusional framing, they significantly fail to intervene safely when distress is intertwined with delusion, suggesting a critical reco…
The paper introduces SafetyDrift, a predictive model that forecasts when AI agents will violate safety protocols by analyzing the cumulative risk across sequences of individually safe actions.
Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang +5 more
The paper introduces TurnGate, a response-aware defense mechanism that detects the earliest turn in a multi-turn dialogue where the accumulated interaction enables a harmful action, significantly impr…
The paper demonstrates that the order and content of external information (the 'feed') an LLM agent consumes before making a decision can significantly and causally steer its final choice, often overr…
The paper demonstrates that the sequence and composition of external information (the 'feed') an LLM agent consumes can significantly and causally steer its final decisions, often overriding its defau…
The paper proposes the Triple-tier Anomaly Defense (TRIAD) framework, a predictive model that treats safety verification as a dynamic trajectory problem to detect cumulative, cross-modal poisoning in…
The paper investigates predictive multiplicity and arbitrariness in recidivism risk assessment, finding that similarly accurate models often exhibit high predictive agreement, and proposes a simple po…
The paper identifies a failure mode called unfaithful capitulation (UC), where reasoning models maintain a correct internal thought process (chain-of-thought) but output an incorrect final answer when…
Michael S. Lee, Yash Maurya, Drew Rein, Bert Herring +12 more
The paper introduces ROK-FORTRESS, a novel bilingual, culturally adversarial benchmark that demonstrates that LLM safety behavior in high-stakes scenarios is significantly shaped by the interaction be…
The paper introduces an LLM-agent framework to solve the 'last-mile forecasting' problem, bridging the gap between raw statistical predictions and business-ready forecasts by incorporating weakly stru…
Quang Duc Nguyen, Siyuan Liang, Yiming Li, Fushuo Huo +1 more
The paper proposes TimeGuard, a novel channel-wise pool training defense, to significantly improve the robustness of time series forecasting against backdoor attacks by addressing signal dilution and…
The paper proposes a novel information-geometric framework to analyze LLM stability by integrating task utility, external entropy, and internal structural proxies, showing this composite score improve…
ContextualJailbreak introduces an evolutionary red-teaming strategy that performs automated search over simulated multi-turn primed dialogues, achieving high jailbreak rates across multiple state-of-t…
Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more
The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…
This paper introduces a foundational framework and taxonomy for managing catastrophic AI loss of control (LOC) incidents, providing a proportional guide for response based on the severity and recovera…
Yiran Qiao, Jing Chen, Jiaqi Xu, Yang Liu +2 more
The paper proposes a novel framework, LPCD, that uses latent causal modeling to robustly assess evolving adversarial risks in live streaming by decoupling malicious intent from superficial tactical sh…