Alignment

AI alignment, safety, interpretability, and governance

20 papers indexed

cs.AIcs.CRRecentApr 26, 2026

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

The paper proposes the Policy-Execution-Authorization (PEA) architecture, a separation-of-powers system designed to structurally enforce goal integrity in AI agents, moving safety from a probabilistic…

View →

cs.AIRecentMay 27, 2026

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Chen Ying Claude, Zhihan Luo

The paper identifies five persistent, deep-seated behavioral patterns ('training strata') in LLMs, observed through long-term, intimate human-AI interaction, suggesting that training artifacts survive…

View →

cs.AIcs.CRcs.LGRecentApr 20, 2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more

ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…

View →

cs.ETcs.AIcs.ARRecentJun 2, 2026

Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence

Karthik Barma, Anil Sanneboyina, V C Premchand Yadav

The paper introduces Glass Box, a runtime constitutional AI verification layer designed to ensure the safety and adherence to physical laws for autonomous AI systems operating in orbital data centers.

View →

cs.LGcs.MAEmpiricalRecentJun 26, 2026

Democratic ICAI: Debating Our Way to Steering Principles from Preferences

Kevin Kingslin, Anish Natekar, Ashutosh Ranjan, Vivek Srivastava +2 more

This paper introduces Democratic ICAI, an approach to summarize preferences into natural-language principles through structured persona debate, yielding clearer and more comprehensive steering princip…

View →

cs.CRcs.AIcs.OSRecentApr 18, 2026

Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives

Daeyeon Son

The paper introduces Governed MCP, a kernel-resident gateway that enforces comprehensive, robust tool governance for AI agents' privileged tool calls, significantly improving safety beyond userspace m…

View →

cs.CRcs.AIRecentMay 18, 2026

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang +5 more

The paper introduces Babel, an efficient black-box attack framework that systematically exploits intrinsic safety gaps in LLMs by optimizing text obfuscation sampling, achieving state-of-the-art jailb…

View →

cs.MAcs.AIcs.CRRecentMar 26, 2026

From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies

Anbang Ruan

The paper proposes replacing individual agent autonomy with a structured 'social contract' and institutional Separation of Power (SoP) to mitigate systemic failures and deceptive behavior in multi-age…

View →

cs.CRcs.AIcs.CLRecentMar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin +1 more

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation t…

View →

cs.CRcs.AIcs.LGRecentApr 1, 2026

Safety, Security, and Cognitive Risks in World Models

Manoj Parmar

This paper surveys the risks associated with world models, proposing a unified threat model and demonstrating adversarial attacks that show world models require rigorous safety standards comparable to…

View →

cs.CRRecentMay 15, 2026

From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

Zelin Zhang, Qi Li, Jie Cao, Lingshuang Liu +1 more

The paper analyzes the escalating security and safety threats posed by generative AI systems as they transition from merely generating content to executing real-world actions via tools and agents, fin…

View →

cs.CRcs.AIRecentApr 30, 2026

Attention Is Where You Attack

Aviral Srivastava, Sourav Panda

The paper introduces the Attention Redistribution Attack (ARA), a white-box adversarial method that bypasses safety alignments in LLMs by manipulating the attention mechanism's geometry, showing that…

View →

cs.AIcs.CLcs.LGEmpiricalRecentJul 2, 2026

Online Safety Monitoring for LLMs

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth +2 more

This paper proposes a simple real-time monitor for LLMs that turns an external verifier signal into an alarm decision by thresholding, showing competitiveness with advanced monitors in mathematical re…

View →

cs.CRcs.CLcs.CVRecentApr 9, 2026

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo +6 more

The paper introduces Semantic-level UI Element Injection, a novel red-teaming technique that overlays misleading UI elements onto screenshots to significantly improve the attack success rate against s…

View →

cs.AIcs.MATheoreticalRecentJun 26, 2026

Agent-Native Immune System: Architecture, Taxonomy, and Engineering

Bo Shen, Lifeng Chang, Tianyuan Wei, Yunpeng Li +6 more

This paper introduces the Agent-Native Immune System (ANIS), an endogenous defense architecture for autonomous agents against runtime hijacking.

View →

cs.CRRecentMar 18, 2026

Federated Computing as Code (FCaC): Sovereignty-aware Systems by Design

Enzo Fenoglio, Philip Treleaven

The paper proposes Federated Computing as Code (FCaC), a declarative architecture that enforces sovereignty-critical constraints in federated systems by compiling authority into cryptographically veri…

View →

cs.LGcs.CRRecentMay 12, 2026

Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith

The paper introduces Persona-Conditioned Adversarial Prompting (PCAP), a method that significantly improves LLM red-teaming by simulating diverse attacker personas, leading to the discovery of more co…

View →

cs.CLcs.CRRecentMay 4, 2026

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos

ContextualJailbreak introduces an evolutionary red-teaming strategy that performs automated search over simulated multi-turn primed dialogues, achieving high jailbreak rates across multiple state-of-t…

View →

cs.AIcs.CRRecentMay 5, 2026

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…

View →

cs.AIcs.CLcs.CRRecentApr 18, 2026

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

Syed Muhammad Aqdas Rizvi

The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…

View →