The paper introduces 'probe trajectories', the token-by-token evolution of a probe's predicted probability for a concept across a model's reasoning process, to improve monitoring of the future behavior of Large Reasoning Models. It shows that future behavior is more distinguishable from the full trajectory than from a single static prediction.
Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference-and-labeling pass. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
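To make the idea of a probe trajectory concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear probe with a sigmoid output applied to each token's hidden state. The function names, the random stand-in hidden states, and the specific feature definitions (least-squares slope for trend, standard deviation of token-to-token changes for volatility, mean over the final tokens for steady state) are illustrative assumptions; the abstract names only the feature categories.

```python
import numpy as np

def probe_trajectory(hidden_states, w, b):
    """Per-token probe probability for a (T, d) array of hidden states.

    Assumes a linear probe (weights w, bias b) followed by a sigmoid.
    """
    logits = hidden_states @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

def trajectory_features(p, tail=4):
    """Illustrative summary features of a 1-D trajectory p (length >= 2)."""
    steps = np.arange(len(p))
    trend = np.polyfit(steps, p, 1)[0]    # least-squares slope per token
    volatility = np.std(np.diff(p))       # spread of token-to-token changes
    steady_state = np.mean(p[-tail:])     # mean over the final `tail` tokens
    return {"trend": trend, "volatility": volatility, "steady_state": steady_state}

# Hypothetical stand-in data: 16 generated tokens, hidden size 8.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 8))
w = rng.normal(size=8)

p = probe_trajectory(hidden, w, 0.0)
feats = trajectory_features(p)
```

In a monitoring setup of this kind, `feats` would be computed for each response and passed to a downstream classifier in place of a single probe score; which features separate outcomes best would depend on the task.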
Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect…
Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language…
The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framewo…
TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches
This paper introduces TRAP, an adversarial attack that demonstrates how physical…
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Fra…
The paper proposes a structured prompt engineering framework to enhance the inte…
Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor
The paper proposes Two-stage Backdoor Hijacking (TSBH) to create persistent, tri…
Conflicts Make Large Reasoning Models Vulnerable to Attacks
The paper demonstrates that confronting Large Reasoning Models (LRMs) with confl…
MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
MirageBackdoor introduces a novel, highly stealthy backdoor attack that forces L…
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that…