Papers similar to 2603.25412v2

~ similar to 2603.25412v2· 20 results

cs.AIcs.CRRecentMay 11, 2026

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu +2 more

The paper introduces EditRisk-Bench, a novel benchmark designed to systematically evaluate the safety risks and downstream reasoning corruption caused by malicious knowledge editing in large language…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…

View →

cs.CRRecentMay 12, 2026

Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis

Zhenhao Xu, Wenhan Chang, Yichuan Chen, Yuxin Fang +2 more

The paper proposes Safety Context Injection (SCI), an inference-time framework that prepends a structured external risk report to protect Large Reasoning Models (LRMs) against sophisticated jailbreaks…

View →

cs.CLcs.AIRecentMay 27, 2026

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more

This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…

View →

cs.CRcs.AIRecentApr 12, 2026

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Vu Tuan Truong, Long Bao Le

The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framework that equips LLMs with critical thinking abilities to detect and reject malicious reasoning steps introduced by advanc…

View →

cs.CRcs.AIRecentApr 10, 2026

Conflicts Make Large Reasoning Models Vulnerable to Attacks

Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more

The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…

View →

cs.CRRecentApr 8, 2026

MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao +2 more

MirageBackdoor introduces a novel, highly stealthy backdoor attack that forces Large Language Models to generate correct reasoning steps (Think Well) but output an incorrect final answer (Answer Wrong…

View →

cs.CLcs.CRRecentMay 18, 2026

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński +1 more

The paper introduces 'probe trajectories'—a continuous measure of a concept's probability across a model's reasoning process—to improve the monitoring of Large Reasoning Models' future behavior, showi…

View →

cs.CLRecentJun 1, 2026

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li +3 more

SentGuard introduces a novel sentence-level streaming guardrail that operates in parallel with LLM generation, achieving high detection rates of unsafe content early in the response while maintaining…

View →

cs.CRcs.AIRecentMay 12, 2026

CoT-Guard: Small Models for Strong Monitoring

Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi +5 more

The paper introduces CoT-Guard, a small, cost-effective 4B-parameter model that significantly outperforms large, expensive monitors like GPT-5 in detecting hidden objectives in code generation tasks.

View →

cs.CRcs.CVRecentMar 18, 2026

Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.

View →

cs.CRcs.AIcs.CLRecentMay 5, 2026

Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.

View →

cs.CRcs.CVRecentApr 17, 2026

TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin +5 more

This paper introduces TwoHamsters, a new benchmark that rigorously tests Multi-Concept Compositional Unsafety (MCCU) in text-to-image models, demonstrating that current state-of-the-art models and saf…

View →

cs.CLRecentMay 29, 2026

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi +8 more

The paper introduces ConsisGuard, a framework that addresses the 'deliberation-to-enforcement gap' in LLM guardrails by ensuring that the reasoning process is faithfully and consistently translated in…

View →

cs.AIcs.LGRecentJun 1, 2026

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov +2 more

The paper analyzes the failure modes of aggressive 2-bit quantization in large reasoning models, proposing lightweight controls like FP16 planning and loop rescue to restore accuracy and achieve pract…

View →

cs.CRcs.AIRecentApr 6, 2026

Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala +1 more

The paper proposes a structured prompt engineering framework to enhance the integrity and reliability of Chain-of-Thought (CoT) reasoning in LLMs, demonstrating significant improvements in security-se…

View →

cs.CRcs.AIRecentApr 25, 2026

Semantic Denial of Service in LLM-controlled robots

Jonathan Steinberg, Oren Gal

The paper demonstrates a semantic denial-of-service attack against LLM-controlled robots by injecting short, safety-plausible phrases into the audio channel, causing the robot to halt or disrupt execu…

View →

cs.CRRecentApr 10, 2026

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

Wenhan Chang, Tianqing Zhu, Ping Xiong, Faqian Guan +1 more

The paper proposes Two-stage Backdoor Hijacking (TSBH) to create persistent, trigger-activated malicious behaviors by manipulating the observable Chain-of-Thought (CoT) process in Large Language Model…

View →

cs.CRcs.CLRecentMay 13, 2026

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan +3 more

The EvoSafety framework enhances LLM safety by externalizing attack and defense mechanisms, enabling persistent, transferable, and model-agnostic robustness against adversarial prompts.

View →