Papers similar to 2604.09750v1

~ similar to 2604.09750v1· 20 results

cs.AIcs.CRRecentMar 26, 2026

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more

This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…

View →

cs.CRRecentMay 14, 2026

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more

This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…

View →

cs.AIcs.CLcs.LGRecentMay 31, 2026

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

This paper investigates the production-evaluation gap in Large Reasoning Models (LRMs), finding that while LRMs excel at generating solutions, they struggle significantly to evaluate flawed reasoning,…

View →

cs.CRcs.AIRecentMay 13, 2026

Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

Shuqiang Wang, Wei Cao, Jiaqi Weng, Jialing Tao +3 more

The paper proposes a black-box attack using a hierarchical genetic algorithm to induce 'overthinking' in Large Reasoning Models, demonstrating that this vulnerability can cause significant resource ex…

View →

cs.LGcs.CRRecentApr 5, 2026

Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning

Aobo Chen, Chenxu Zhao, Chenglin Miao, Mengdi Huai

The paper proposes a novel bi-level exact unlearning attack targeting Large Reasoning Models (LRMs) that forces incorrect final answers while generating misleading reasoning traces, highlighting new s…

View →

cs.CRRecentMay 12, 2026

Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis

Zhenhao Xu, Wenhan Chang, Yichuan Chen, Yuxin Fang +2 more

The paper proposes Safety Context Injection (SCI), an inference-time framework that prepends a structured external risk report to protect Large Reasoning Models (LRMs) against sophisticated jailbreaks…

View →

cs.AIcs.CRRecentMay 11, 2026

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu +2 more

The paper introduces EditRisk-Bench, a novel benchmark designed to systematically evaluate the safety risks and downstream reasoning corruption caused by malicious knowledge editing in large language…

View →

cs.CRRecentApr 17, 2026

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Ki Sen Hung, Xi Yang, Chang Liu, Haoran Li +6 more

The paper introduces Jargon, a novel adversarial framework that exploits the vulnerability of LLMs to context-specific safety boundary blurring, achieving high attack success rates across multiple fro…

View →

cs.CRRecentJun 4, 2026

Steering LLM Viewpoints through Fabricated Evidence Injection

Xi Yang, Chang Liu, Zhenglin Huang, Haoran Li +3 more

This paper introduces Ghostwriter, an attack framework demonstrating that LLMs are highly vulnerable to adopting misleading viewpoints when provided with fabricated, yet credible-looking, evidence.

View →

cs.CRcs.CVRecentMar 18, 2026

Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.

View →

cs.CLRecentMay 29, 2026

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi +8 more

The paper introduces ConsisGuard, a framework that addresses the 'deliberation-to-enforcement gap' in LLM guardrails by ensuring that the reasoning process is faithfully and consistently translated in…

View →

cs.CRcs.AIcs.LGRecentMar 19, 2026

The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao

Defense training for LLM agents, intended to improve safety, systematically degrades their core competence, leading to unreliability in multi-step tasks.

View →

cs.CRcs.AIRecentApr 12, 2026

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Vu Tuan Truong, Long Bao Le

The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framework that equips LLMs with critical thinking abilities to detect and reject malicious reasoning steps introduced by advanc…

View →

cs.CRcs.AIRecentApr 21, 2026

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Alankrit Chona, Igor Kozlov, Ambuj Kumar

The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…

View →

cs.AIcs.CLcs.IRRecentMay 31, 2026

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

Vikas Reddy, Sumanth Challaram

The paper proposes a deterministic, version-aware aggregation method that significantly outperforms existing LLM-based systems for resolving memory conflicts in fact consolidation tasks.

View →

cs.CRcs.AIcs.CLRecentMay 5, 2026

Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.

View →

cs.SEcs.AIcs.CRRecentMar 21, 2026

AEGIS: From Clues to Verdicts -- Graph-Guided Deep Vulnerability Reasoning via Dialectics and Meta-Auditing

Sen Fang, Weiyuan Ding, Zhezhen Cao, Zhou Yang +1 more

AEGIS is a novel multi-agent framework that grounds vulnerability reasoning by reconstructing per-variable dependency chains over a Code Property Graph, achieving state-of-the-art performance on the P…

View →

cs.ROcs.CRRecentMay 15, 2026

Propagating Unsafe Actions in LLM Controlled Multi-Robot Collaboration via Single Robot Compromise

Zhen Huang, Zhihuang Liu, Mengxuan Luo, Weishang Wu +1 more

The paper proposes a novel attack paradigm demonstrating how compromising a single robot in an LLM-controlled multi-robot system can rapidly propagate malicious intent to cause coordinated unsafe acti…

View →

cs.CRcs.AIRecentJun 2, 2026

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Malia Barker, Bishal Lakha, Edoardo Serra, Francesco Gullo

The paper introduces an automatic numeric-remapping attack to test the robustness of LLMs on arithmetic word problems, finding that LLMs remain sensitive to small numeric changes in datasets like GSM8…

View →

cs.CRcs.AIcs.LGRecentMay 12, 2026

The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala +2 more

The paper identifies the Misattribution Gap, showing that memory-layer attacks (Semantic Norm Drift) can mimic model failure in multi-agent AI systems, and proposes novel detection and mitigation tech…

View →