Papers similar to 2603.22612v1

~ similar to 2603.22612v1· 20 results

cs.CRcs.CVRecentMar 18, 2026

Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang +2 more

DataShield proposes an efficient method to identify safety-degrading samples within benign datasets, preventing the degradation of LLM safety capabilities during fine-tuning.

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang +2 more

DataShield proposes an efficient method to identify safety-degrading samples within benign datasets, quantifying each sample's contribution to an LLM's compliance behavior.

View →

cs.CRcs.AIRecentMar 17, 2026

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Taiwo Onitiju, Iman Vakilinia

The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…

View →

cs.CRcs.CLRecentMay 14, 2026

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

The paper introduces a comprehensive taxonomy and auditing framework to assess the collective coverage of existing LLM attack benchmarks, revealing significant and systematic gaps in current testing m…

View →

cs.CRRecentApr 1, 2026

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang, Hongyi Wang +4 more

The paper introduces FoodGuardBench, a comprehensive benchmark and a specialized guardrail model (FoodGuard-4B) to rigorously test and mitigate the severe food safety risks posed by large language mod…

View →

cs.CRcs.AIcs.CLRecentApr 4, 2026

Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks

Manoj Parmar

This paper provides the first systematic threat analysis of State-Space Models (SSMs) in safety-critical applications, introducing novel attack classes and formal metrics to quantify their security an…

View →

cs.CRcs.AIcs.DCRecentApr 5, 2026

Automating Cloud Security and Forensics Through a Secure-by-Design Generative AI Framework

Dalal Alharthi, Ivan Roberto Kawaminami Garcia

The paper proposes a secure-by-design Generative AI framework that integrates PromptShield for LLM security and CIAF for structured cloud forensic investigation, significantly improving both robustnes…

View →

cs.CRcs.CLRecentMay 13, 2026

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan +3 more

The EvoSafety framework enhances LLM safety by externalizing attack and defense mechanisms, enabling persistent, transferable, and model-agnostic robustness against adversarial prompts.

View →

cs.CRcs.AIRecentMay 11, 2026

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

Saba Pourhanifeh, AbdulAziz AbdulGhaffar, Ashraf Matrawy

The paper empirically evaluates domain-adapted and general-purpose LLMs for structured threat modelling (STRIDE on 5G security), finding that domain adaptation and model size do not guarantee reliable…

View →

cs.AIcs.CRcs.LGRecentMay 28, 2026

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Caleb DeLeeuw

The paper introduces BioRefusalAudit, a method that audits the structural soundness of language model biosecurity refusals, finding that refusal behavior is highly unstable, often collapsing under min…

View →

cs.AIcs.CRcs.LGRecentMay 28, 2026

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Caleb DeLeeuw

The paper audits the structural soundness of LLM biosecurity refusals, finding that refusal behavior is highly unstable, often collapsing under minor prompt changes, and may track legal salience rathe…

View →

cs.CRcs.CVRecentApr 7, 2026

BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents

Bo Ma, Jinsong Wu, Weiqi Yan

BodhiPromptShield is a policy-aware framework that mediates prompt privacy by detecting sensitive data and replacing it with secure placeholders across multiple stages (retrieval, memory, tools) to pr…

View →

cs.CRcs.AIRecentApr 13, 2026

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun

ClawGuard is a novel runtime security framework that deterministically enforces user-confirmed rules at tool-call boundaries to protect LLM agents from indirect prompt injection.

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more

This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more

The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…

View →

cs.CRRecentMay 26, 2026

Landseer: Exploring the Machine Learning Defense Landscape

Ayushi Sharma, Rosemary Agbozo, Santiago Torres-Arias, Zahra Ghodsi

The paper introduces Landseer, a modular framework designed to systematically evaluate and compose multiple machine learning defenses to address complex, real-world security requirements.

View →

cs.CRcs.AIRecentApr 7, 2026

A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms

Nirajan Acharya, Gaurav Kumar Gupta

The paper introduces MCPSHIELD, a comprehensive formal security framework that systematically characterizes and provides a defense-in-depth architecture for the rapidly adopted but insecure Model Cont…

View →

cs.CRRecentMar 24, 2026

SoK: The Attack Surface of Agentic AI -- Tools, and Autonomy

Ali Dehghantanha, Sajad Homayoun

This paper systematically maps the expanded attack surface of agentic AI systems, identifying new threat vectors like RAG poisoning and cross-agent manipulation, and proposes a comprehensive security…

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding +12 more

BraveGuard is a self-evolving defense framework that significantly improves the safety monitoring of computer-use agents by generating guard model supervision from open-world threat discovery and real…

View →