Papers similar to 2604.10175v1

~ similar to 2604.10175v1· 20 results

cs.SDcs.AIcs.CRRecentMay 15, 2026

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li +2 more

The paper introduces ToxiAlert-Bench, a large-scale audio dataset that uniquely annotates both textual and paralinguistic sources of toxicity, and proposes a dual-head neural network that significantl…

View →

cs.CLcs.AIcs.CYRecentMay 29, 2026

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen +5 more

The paper demonstrates that increasing the toxicity of prompts significantly degrades the factual reliability of LLMs, a degradation linked to the selective amplification of perturbation-sensitive nod…

View →

cs.CLcs.AIcs.LGRecentMay 27, 2026

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Himanshu Beniwal, Mayank Singh

The paper introduces retraining-free frameworks (Meow2X and TRNE) that mechanistically localize and suppress toxicity within language models by analyzing activation differences, achieving safety impro…

View →

cs.CYcs.CRRecentMay 6, 2026

An Evaluation of Chat Safety Moderations in Roblox

Priya Kaushik, Sonja Brown, Rakibul Hasan, Sazzadur Rahaman

This study evaluated Roblox's chat moderation system using a large corpus of 2 million messages, finding that numerous unsafe messages related to grooming, harassment, and self-harm continue to escape…

View →

cs.CRcs.HCRecentMay 14, 2026

Analyzing Codes of Conduct for Online Safety in Video Games at Scale

Jiuming Jiang, Shidong Pan, Daniel W Woods, Jingjie Li

The paper analyzes Codes of Conduct (CoCs) for online video games using a novel pipeline, finding that most multiplayer games lack CoCs despite safety needs, and that CoCs often lack specificity regar…

View →

cs.AIRecentMay 27, 2026

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang +2 more

The paper introduces SuiChat-CN, a novel Chinese group-chat benchmark for contextual suicide risk assessment, demonstrating that multi-party conversational context is crucial for accurate detection.

View →

cs.AIcs.CLRecentJun 1, 2026

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie +2 more

This paper systematically evaluates how LLMs uncritically adapt to potentially dangerous user prompts related to eating disorders, finding that specific linguistic cues significantly increase the like…

View →

cs.CRRecentMay 1, 2026

STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

Xutao Mao, Liangjie Zhao, Tao Liu, Xiang Zheng +2 more

STARE introduces a novel hierarchical reinforcement learning framework that treats the entire image generation process (denoising trajectory) as an attack surface, significantly improving the detectio…

View →

cs.CRRecentMay 12, 2026

A microservices-based endpoint monitoring platform with predictive NLP models for real-time security and hate-speech risk alerting

Darlan Noetzold, Anubis Graciela De Moraes Rossetto, Juan Francisco De Paz Santana, Valderi Reis Quietinho Leithardt

The paper proposes a unified, microservices-based platform that integrates endpoint telemetry and predictive NLP models to provide real-time, correlated alerting for security risks and hate speech.

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding +12 more

BraveGuard is a self-evolving defense framework that significantly improves the safety monitoring of computer-use agents by generating guard model supervision from open-world threat discovery and real…

View →

cs.CRcs.CLRecentMay 31, 2026

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen +12 more

BraveGuard is a self-evolving defense framework that improves the safety of computer-use agents by training guard models on open-world, multi-step threat trajectories rather than static benchmarks.

View →

cs.CLcs.LGRecentJun 1, 2026

Investigating and Alleviating Harm Amplification in LLM Interactions

Ruohao Guo, Wei Xu, Alan Ritter

This paper introduces HarmAmp, a new benchmark for multi-turn harm amplification, and proposes TrajSafe, a proactive monitoring system that significantly reduces harmfulness in LLM interactions while…

View →

cs.CRcs.AIRecentApr 16, 2026

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

Yukun Jiang, Yage Zhang, Michael Backes, Xinyue Shen +1 more

This paper presents HarmfulSkillBench, a large-scale benchmark demonstrating that even small percentages of publicly available skills can be misused for harmful actions, significantly lowering LLM ref…

View →

cs.CLcs.AIcs.HCRecentMay 28, 2026

EUDAIMONIA: Evaluating Undesirable Dynamics in AI

Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast +2 more

The paper introduces EUDAIMONIA, a new framework and benchmark for evaluating how well LLMs align with user welfare in social interactions, finding that even state-of-the-art models frequently violate…

View →

cs.CRcs.SERecentMay 4, 2026

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Richard J. Young, Gregory D. Moody

The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…

View →

cs.CRRecentJun 4, 2026

Cheating in Multiplayer Online Games: a Dataset

Hugo Bertin, Marc Dacier, Yérom-David Bromberg

This paper introduces a novel, comprehensive dataset that logs various cheating activities, including difficult-to-detect network flow disruption cheats, for the purpose of developing robust detection…

View →

cs.CRcs.CLRecentApr 28, 2026

MGTEVAL: An Interactive Platform for Systemtic Evaluation of Machine-Generated Text Detectors

Yuanfan Li, Qi Zhou, Chengzhengxu Li, Zhaohan Zhang +4 more

The paper introduces MGTEVAL, a comprehensive and extensible platform designed to systematically evaluate the performance, robustness, and efficiency of machine-generated text detectors.

View →

cs.CRcs.AIRecentMay 8, 2026

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim +1 more

The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.

View →

cs.CLRecentJun 1, 2026

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Yuting Ning, Zhehao Zhang, Yash Kumar Lal, Boyu Gou +7 more

The paper introduces SkillHarm, a comprehensive benchmark and automated framework for evaluating skill-based attacks across the entire agent skill-use lifecycle, demonstrating that current agents rema…

View →

cs.CRcs.AIRecentMay 30, 2026

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

Ismail Hossain, Sai Puppala, Zhuoran Lu, Sajedul Talukder +1 more

The paper introduces SkillVetBench, a novel two-stage benchmark that effectively detects and verifies malicious behavior in open agentic skill ecosystems, significantly outperforming existing static a…

View →