~ similar to 2605.15984v1· 15 results
The paper introduces retraining-free frameworks (Meow2X and TRNE) that mechanistically localize and suppress toxicity within language models by analyzing activation differences, achieving safety impro…
This paper addresses the lack of specialized NLP tools for detecting toxicity in real-time video game chat by creating a large, fine-grained dataset and developing a superior, domain-specific detector…
The paper demonstrates that increasing the toxicity of prompts significantly degrades the factual reliability of LLMs, a degradation linked to the selective amplification of perturbation-sensitive nod…
Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more
The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…
The paper introduces LinguIUTics, a system that significantly improves the classification of rare psychological defense mechanisms in conversational text by fine-tuning Qwen3-8B using specialized imba…
Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang +2 more
The paper introduces SuiChat-CN, a novel Chinese group-chat benchmark for contextual suicide risk assessment, demonstrating that multi-party conversational context is crucial for accurate detection.
Xutao Mao, Liangjie Zhao, Tao Liu, Xiang Zheng +2 more
STARE introduces a novel hierarchical reinforcement learning framework that treats the entire image generation process (denoising trajectory) as an attack surface, significantly improving the detectio…
Sympatheia is a speech-to-speech dialogue framework that generates emotionally adaptive responses by conditioning its output on continuous affect signals derived from user speech or external multimoda…
This paper introduces HarmAmp, a new benchmark for multi-turn harm amplification, and proposes TrajSafe, a proactive monitoring system that significantly reduces harmfulness in LLM interactions while…
The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…
Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin +4 more
The paper analyzes that while multimodal large language models (MLLMs) offer superior semantic understanding for image generation, this enhanced capability significantly increases safety risks, partic…
Darlan Noetzold, Anubis Graciela De Moraes Rossetto, Juan Francisco De Paz Santana, Valderi Reis Quietinho Leithardt
The paper proposes a unified, microservices-based platform that integrates endpoint telemetry and predictive NLP models to provide real-time, correlated alerting for security risks and hate speech.
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
RefineRAG introduces a novel word-level poisoning framework that significantly enhances knowledge poisoning attacks against RAG systems, achieving state-of-the-art effectiveness and transferability to…
This study investigates how humans detect synthetic speech in real-world contexts, finding that while overt detection failed for fully synthetic speech, participants still implicitly discriminated utt…