Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

This paper introduces a novel framework, the Reasoning Safety Monitor, to detect…

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language…

The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framewo…

TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches

This paper introduces TRAP, an adversarial attack that demonstrates how physical…

Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Fra…

The paper proposes a structured prompt engineering framework to enhance the inte…

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

The paper proposes Two-stage Backdoor Hijacking (TSBH) to create persistent, tri…

Conflicts Make Large Reasoning Models Vulnerable to Attacks

The paper demonstrates that confronting Large Reasoning Models (LRMs) with confl…

MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

MirageBackdoor introduces a novel, highly stealthy backdoor attack that forces L…

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that…