~ similar to 2606.01298· 18 results
Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank +1 more
This paper proposes a unified evaluation framework for hate speech detection that systematically assesses model performance and explainability across various label and rationale representation spaces,…
The paper introduces FBHM, a new benchmark for hateful memes, and proposes LSV, a steering vector method that significantly improves VLM performance by addressing the generalization gap.
Darlan Noetzold, Anubis Graciela De Moraes Rossetto, Juan Francisco De Paz Santana, Valderi Reis Quietinho Leithardt
The paper proposes a unified, microservices-based platform that integrates endpoint telemetry and predictive NLP models to provide real-time, correlated alerting for security risks and hate speech.
The paper evaluates LLM-generated reactions to Spanish online news, finding that off-the-shelf models fail to accurately reproduce the measurable properties of real audience discourse, and even fine-t…
Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li +2 more
The paper introduces ToxiAlert-Bench, a large-scale audio dataset that uniquely annotates both textual and paralinguistic sources of toxicity, and proposes a dual-head neural network that significantl…
The paper introduces SONAR, a prompt sanitization framework that uses natural language inference metrics to identify and remove malicious instructions injected into LLM prompts, achieving near-zero at…
Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more
The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.
AsmRAG is a novel framework that improves malware detection by treating it as an evidence-based retrieval task using a code-specialized LLM, achieving high accuracy while providing transparent forensi…
The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…
The paper introduces a synthetic dataset of multi-round conversations to detect conversational smishing, finding that XGBoost with TF-IDF features achieved the best performance (72.5% accuracy).
The paper introduces LinguIUTics, a system that significantly improves the classification of rare psychological defense mechanisms in conversational text by fine-tuning Qwen3-8B using specialized imba…
Kıvanç Kuzey Dikici, Serdar Kara, Semih Çağlar, Eray Tüzün +1 more
SERSEM introduces a selective entropy-weighted scoring framework to significantly improve Membership Inference Attacks (MIAs) against code LLMs by focusing on human-centric coding anomalies rather tha…
Steven Seiden, Triss Ren, Caroline Zhang, Taein Kim +2 more
The paper proposes a novel, scalable technique using unique canary tokens to automatically and accurately identify which web scrapers are feeding data to specific Large Language Models (LLMs).
SALLIE introduces a lightweight, modal-agnostic runtime detection framework that effectively safeguards LLMs and VLMs against both textual and visual jailbreaks and prompt injections without performan…
Shang Shang, Ruiqi Wang, Ruijie Qi, Hao Li +3 more
PhishSigma++ is a novel entity-relation-based detector that improves malicious email detection by focusing on invariant functional relationships between typed entities, significantly outperforming tex…
RefineRAG introduces a novel word-level poisoning framework that significantly enhances knowledge poisoning attacks against RAG systems, achieving state-of-the-art effectiveness and transferability to…
The paper proposes 'Uncertainty,' a multiscale uncertainty estimator that focuses on low-probability tokens to improve the detection of AI-generated text by addressing boilerplate dominance and score…
The paper demonstrates that relying on strict regular-expression parsing for evaluating LLM-based security log classifiers introduces systematic errors, potentially causing a functional model to appea…