~ similar to 2606.00647· 20 results
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
This paper proposes a domain-specialized large language model, PoetryQwen, for precise translation and emotional understanding of classical poetry.
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more
This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…
ThinkSwitch introduces a low-compute co-training procedure that distills the reasoning benefit of large language models into weights, significantly improving performance on specific reasoning tasks.
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li +2 more
The paper introduces ToxiAlert-Bench, a large-scale audio dataset that uniquely annotates both textual and paralinguistic sources of toxicity, and proposes a dual-head neural network that significantl…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar +3 more
The paper introduces LLUMI, an open-source framework that improves LLM writing assistance for mental health support using community feedback, demonstrating comparable performance to proprietary models…
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng +4 more
The paper proposes EKSFT, a selective fine-tuning method that masks high-entropy or high-KL divergence tokens during Supervised Fine-Tuning (SFT) to prevent distribution shift and improve subsequent R…
Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more
This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…
The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…
The paper demonstrates that increasing the toxicity of prompts significantly degrades the factual reliability of LLMs, a degradation linked to the selective amplification of perturbation-sensitive nod…
This paper introduces Swiss-Bench 003, an expanded evaluation framework assessing LLM reliability and adversarial security across eight dimensions using 808 Swiss-specific items, revealing that self-g…
The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.
Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more
The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
SkillSieve introduces a three-layer hierarchical framework to detect malicious AI agent skills, achieving high F1 scores (0.920) on a large-scale benchmark while maintaining low operational costs.