Papers similar to 2605.20759v1

~ similar to 2605.20759v1· 20 results

cs.CRstat.APRecentMay 8, 2026

Combating Organized Platform Abuse: Amplifying Weak Risk Signals with Structural Information

The paper proposes a novel structural invariant approach, derived from the economic constraints of fraud, that amplifies weak, low-precision signals into highly accurate fraud detections without requi…

View →

cs.CRcs.AIRecentMar 17, 2026

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Taiwo Onitiju, Iman Vakilinia

The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…

View →

cs.CRcs.CLRecentApr 24, 2026

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea, Christopher Parisien

The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…

View →

cs.CRcs.HCRecentJun 2, 2026

Generative AI-Enabled Refund Fraud in Chinese E-Commerce: Investigation on Merchants and Platform Workers

Shuning Zhang, Eve He, Xiao Zhan, Shijing He +3 more

This paper investigates how Generative AI enables scalable, hyper-realistic fraud in Chinese e-commerce by fabricating product defect evidence, proposing new defense mechanisms like verifiable materia…

View →

cs.CRcs.AIRecentMay 15, 2026

SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors

Elle Najt, Colin Toft, Tyler Tracy, Fabien Roger +1 more

The paper introduces SLEIGHT-Bench, a benchmark of 40 synthetic attacks, demonstrating that current LLM monitor systems fail to detect a significant number of covert, harmful actions executed by codin…

View →

cs.CRcs.AIcs.SERecentJun 3, 2026

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo

The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…

View →

cs.LGcs.AIcs.CRRecentMay 8, 2026

Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors

Fan Yang, Binyan Xu, Di Tang, Kehuan Zhang

The paper proposes PRAETORIAN, a novel defense mechanism for Graph Neural Networks (GNNs) that targets the intrinsic structural requirements of backdoor attacks, significantly reducing the attack succ…

View →

cs.CRcs.AIcs.CLRecentApr 7, 2026

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

Fatih Uenal

This paper introduces Swiss-Bench 003, an expanded evaluation framework assessing LLM reliability and adversarial security across eight dimensions using 808 Swiss-specific items, revealing that self-g…

View →

cs.AIcs.CRRecentMar 26, 2026

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more

This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…

View →

cs.CRcs.AIRecentMay 10, 2026

FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments

Astha Mehta, Niruthiha Selvanayagam, Cedric Lam, Hengxu Li +9 more

The paper introduces FragBench, a novel benchmark designed to detect malicious LLM attacks that are split across multiple, seemingly benign sessions, showing that cross-session graph modeling is neces…

View →

cs.CRcs.AIRecentMay 24, 2026

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more

Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.

View →

cs.CRcs.AIRecentMay 31, 2026

A New Framework for Cybersecurity Refusals in AI Agents

Eliot Krzysztof Jones, Mateusz Dziemian, Matt Fredrikson, J Zico Kolter

The paper introduces a novel framework to evaluate when and how AI agents should refuse harmful requests in offensive cybersecurity tasks, finding that most state-of-the-art models exhibit dangerously…

View →

cs.CRRecentApr 18, 2026

False Security Confidence in Benign LLM Code Generation

Xiaolei Ren

The paper introduces False Security Confidence (FSC), a new metric to measure the inherent prevalence of security vulnerabilities in code generated by LLMs that are otherwise functionally correct, eve…

View →

cs.CRcs.AIRecentMay 28, 2026

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Galip Tolga Erdem

This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…

View →

cs.CRcs.AIRecentMay 28, 2026

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Galip Tolga Erdem

This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…

View →

cs.CRcs.AIcs.LGRecentMay 14, 2026

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

The paper demonstrates that current defenses against malicious fine-tuning of foundation models are insufficient because they only address fixed attacks, and introduces a unified adaptive attack that…

View →

cs.CRcs.AIcs.LGRecentMay 24, 2026

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Tongxi Wu, Jian Zhang, Yang Gao

The paper challenges the assumption that LLM safety is a binary threshold, proposing that safety failures occur in an 'instability region' and introducing Furina, a transferable attack that exploits t…

View →

cs.CRcs.AIcs.CLRecentMay 5, 2026

Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.

View →

cs.CRcs.AIRecentApr 29, 2026

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli

The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significan…

View →

cs.CRRecentMay 14, 2026

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more

This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…

View →