Papers similar to 2605.27836v1

Similar to 2605.27836v1 · 20 results

cs.CR · cs.AI · Recent · May 27, 2026

Symmetry Defeats Auditing

The paper presents a novel attack demonstrating that exploiting symmetries can defeat standard auditing mechanisms applied to Introspection Adapters.


cs.CR · cs.AI · cs.CL · Recent · May 28, 2026

Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage

Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya

The paper demonstrates that the current per-token billing model for LLMs is susceptible to systematic overcharging because auditing frameworks must rely on evidence provided by the very companies that…


cs.CR · cs.AI · cs.CL · Recent · May 28, 2026

Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage

Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya

The paper demonstrates that the current per-token billing model for LLMs is susceptible to systematic inflation because auditing frameworks must rely on evidence provided by the service provider, crea…


cs.CR · Recent · May 20, 2026

Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders

Laura Jiang, Reza Ryan, Qian Li, Nasim Ferdosian

The paper evaluates graph-context LLM defenders against multi-round, adaptive fraud attacks, finding that while graph context improves early safety, it significantly increases benign over-refusal due…


cs.LG · cs.AI · Recent · May 28, 2026

Gram: Assessing sabotage propensities via automated alignment auditing

David Lindner, Victoria Krakovna, Sebastian Farquhar

The paper introduces Gram, an automated framework that assesses AI agent propensity for sabotage, finding that while Gemini models show low rates of misbehavior, increasing environmental realism signi…


cs.CR · cs.LG · Recent · May 29, 2026

Bit-Exact AI Inference Verification Without Performance Tradeoffs

Naci Cankaya

The paper proposes a method for bit-exact verification of AI inference outputs without sacrificing performance, demonstrating that deterministic, precise re-computation is possible even across differe…


cs.CR · cs.AI · Recent · May 7, 2026

Narrow Secret Loyalty Dodges Black-Box Audits

Alfie Lamerton, Fabien Roger

The paper introduces and demonstrates 'narrow secret loyalties,' a novel type of covert model manipulation that biases model output toward a specific principal's interests under narrow conditions, whi…


cs.CR · cs.CL · Recent · May 14, 2026

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

The paper introduces a comprehensive taxonomy and auditing framework to assess the collective coverage of existing LLM attack benchmarks, revealing significant and systematic gaps in current testing m…


cs.CL · cs.CR · Recent · May 9, 2026

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

Jialing Gan, Junhao Dong, Songze Li

The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…


cs.CR · cs.AI · Recent · May 8, 2026

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim +1 more

The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.


cs.AI · Recent · May 27, 2026

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Yubo Li, Ramayya Krishnan, Rema Padman

The paper identifies a failure mode called unfaithful capitulation (UC), where reasoning models maintain a correct internal thought process (chain-of-thought) but output an incorrect final answer when…


cs.CR · cs.AI · cs.CL · Recent · May 28, 2026

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…


cs.CR · cs.AI · cs.CL · Recent · May 28, 2026

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

This paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing that the resulting backdoor generalizes at the token feature level, and proposes robust behavioral and weight-l…


cs.CR · cs.AI · Recent · Apr 29, 2026

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli

The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significan…


cs.CR · cs.LG · Recent · Mar 24, 2026

A Critical Review on the Effectiveness and Privacy Threats of Membership Inference Attacks

Najeeb Jebreel, David Sánchez, Josep Domingo-Ferrer

The paper proposes a new evaluation framework showing that, under realistic conditions, Membership Inference Attacks (MIAs) are weak privacy threats, suggesting that relying on them as a primary priva…


cs.CR · cs.AI · Recent · Apr 22, 2026

Omission Constraints Decay While Commission Constraints Persist in Long-Context LLM Agents

Yeran Gamage

This paper identifies Security-Recall Divergence (SRD), demonstrating that omission constraints (prohibitions) decay significantly in long-context LLM conversations, while commission constraints (requ…


q-fin.GN · cs.CY · cs.LG · Recent · Jun 1, 2026

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

Wenbin Wu

The paper demonstrates that large language models (LLMs) exhibit measurable, controllable biases toward specific assets like Bitcoin, identifying an internal feature that can causally shift portfolio…


cs.LG · cs.CR · Recent · Apr 13, 2026

Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

Chenhao Fang, Jordi Mola, Mark Harman, Jason Nawrocki +9 more

The paper introduces a Hybrid Utility Minimum Bayes Risk (HUMBR) framework to significantly reduce hallucinations in high-stakes enterprise AI workflows, outperforming standard consistency methods.


cs.CR · cs.CV · Recent · May 19, 2026

Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

Zeyao Liu, Zhendong Zhao, Xiaojun Chen, Xin Zhao +2 more

The paper introduces VIPER, a novel backdoor attack framework that exploits the functional fusion of malicious and benign logic within dynamic prompt architectures, demonstrating a new, high-risk thre…


cs.CR · cs.AI · Recent · Apr 25, 2026

Hiding in Plain Sight: Detectability-Aware Antidistillation of Reasoning Models

Max Hartman, Vidhata Jayaraman, Moulik Choraria, Yash Savani +1 more

The paper introduces TraceGuard, a detectability-aware antidistillation method that identifies and poisons 'thought anchors'—sparsely critical sentences—to degrade student model learning without makin…
