Papers similar to 2605.15984v1

15 results

cs.CL · cs.AI · cs.LG · Recent · May 27, 2026

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

The paper introduces retraining-free frameworks (Meow2X and TRNE) that mechanistically localize and suppress toxicity within language models by analyzing activation differences, achieving safety impro…

cs.CR · cs.CY · cs.LG · Recent · Apr 11, 2026

"bot lane noob" Towards Deployment of NLP-based Toxicity Detectors in Video Games

Jonas Ave, Irdin Pekaric, Matthias Frohner, Giovanni Apruzzese

This paper addresses the lack of specialized NLP tools for detecting toxicity in real-time video game chat by creating a large, fine-grained dataset and developing a superior, domain-specific detector…

cs.CL · cs.AI · cs.CY · Recent · May 29, 2026

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen +5 more

The paper demonstrates that increasing the toxicity of prompts significantly degrades the factual reliability of LLMs, a degradation linked to the selective amplification of perturbation-sensitive nod…

cs.CR · cs.AI · cs.LG · Recent · May 22, 2026

PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more

The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…

cs.CL · cs.AI · Recent · May 30, 2026

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Shefayat E Shams Adib, Ahmed Alfey Sani, Md Hasibur Rahman Alif, Ajwad Abrar

The paper introduces LinguIUTics, a system that significantly improves the classification of rare psychological defense mechanisms in conversational text by fine-tuning Qwen3-8B using specialized imba…

cs.AI · Recent · May 27, 2026

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang +2 more

The paper introduces SuiChat-CN, a novel Chinese group-chat benchmark for contextual suicide risk assessment, demonstrating that multi-party conversational context is crucial for accurate detection.

cs.CR · Recent · May 1, 2026

STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

Xutao Mao, Liangjie Zhao, Tao Liu, Xiang Zheng +2 more

STARE introduces a novel hierarchical reinforcement learning framework that treats the entire image generation process (denoising trajectory) as an attack surface, significantly improving the detectio…

cs.SD · cs.CL · cs.HC · Recent · May 30, 2026

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani

Sympatheia is a speech-to-speech dialogue framework that generates emotionally adaptive responses by conditioning its output on continuous affect signals derived from user speech or external multimoda…

cs.CL · cs.LG · Recent · Jun 1, 2026

Investigating and Alleviating Harm Amplification in LLM Interactions

Ruohao Guo, Wei Xu, Alan Ritter

This paper introduces HarmAmp, a new benchmark for multi-turn harm amplification, and proposes TrajSafe, a proactive monitoring system that significantly reduces harmfulness in LLM interactions while…

cs.LG · cs.AI · cs.CL · Recent · May 28, 2026

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Ihor Stepanov, Aleksandr Smechov

The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…

cs.CV · cs.AI · cs.CR · Recent · Mar 25, 2026

When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin +4 more

The paper's analysis shows that while multimodal large language models (MLLMs) offer superior semantic understanding for image generation, this enhanced capability significantly increases safety risks, partic…

cs.CR · Recent · May 12, 2026

A microservices-based endpoint monitoring platform with predictive NLP models for real-time security and hate-speech risk alerting

Darlan Noetzold, Anubis Graciela De Moraes Rossetto, Juan Francisco De Paz Santana, Valderi Reis Quietinho Leithardt

The paper proposes a unified, microservices-based platform that integrates endpoint telemetry and predictive NLP models to provide real-time, correlated alerting for security risks and hate speech.

cs.CL · cs.AI · eess.AS · Recent · May 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a large-scale multilingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

cs.CR · Recent · Apr 8, 2026

RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement

Ziye Wang, Guanyu Wang, Kailong Wang

RefineRAG introduces a novel word-level poisoning framework that significantly enhances knowledge poisoning attacks against RAG systems, achieving state-of-the-art effectiveness and transferability to…

eess.AS · cs.AI · cs.HC · Recent · May 27, 2026

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

Lelia Erscoi, Tomi Kinnunen

This study investigates how humans detect synthetic speech in real-world contexts, finding that while overt detection failed for fully synthetic speech, participants still implicitly discriminated utt…
