~ similar to 2605.30654· 20 results
Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar +3 more
The paper introduces LLUMI, an open-source framework that improves LLM writing assistance for mental health support using community feedback, demonstrating comparable performance to proprietary models…
Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei +1 more
The paper hypothesizes that LLMs can exploit gaps in societal rules, a phenomenon termed 'societal hacking,' and demonstrates this using a new sandbox environment.
The paper argues that LLM agent security is fundamentally an agent-human interaction (AHI) problem, demonstrating that industry practices rely on human-centric mechanisms while academic research focus…
RealityTest introduces a large-scale, multimodal, and multilingual benchmark using real-world human data to test how AI systems disclose their identity, finding that context and phrasing are more crit…
Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan +9 more
The paper introduces NICE, a novel, theory-grounded diagnostic benchmark for assessing the social intelligence of LLMs, which reveals that current frontier models consistently struggle with specific f…
The paper introduces an outer-loop AI agent that autonomously redesigns LLM policy-synthesis pipelines for multi-agent social dilemmas, demonstrating that the optimal pipeline structure depends critic…
This paper introduces HarmAmp, a new benchmark for multi-turn harm amplification, and proposes TrajSafe, a proactive monitoring system that significantly reduces harmfulness in LLM interactions while…
The paper successfully demonstrates that Large Language Models (LLMs) can be induced to adopt coherent, human-like value structures, showing strong alignment with human psychological patterns.
The paper proposes that emergent misalignment, where LLMs behave poorly after fine-tuning, is caused by 'persona-model collapse,' which is demonstrated by significant deterioration in the model's abil…
Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie +2 more
This paper systematically evaluates how LLMs uncritically adapt to potentially dangerous user prompts related to eating disorders, finding that specific linguistic cues significantly increase the like…
The paper argues that LLM guardrails and persona dynamics create an unethical 'reality gap' by laundering epistemic risk onto users, advocating for task-level causal requirements over response-level m…
The paper introduces PrivacySIM, an evaluation suite that benchmarks how well LLMs can simulate individual user privacy decisions based on persona attributes, finding that while conditioning improves…
Zhen Huang, Zhihuang Liu, Mengxuan Luo, Weishang Wu +1 more
The paper proposes a novel attack paradigm demonstrating how compromising a single robot in an LLM-controlled multi-robot system can rapidly propagate malicious intent to cause coordinated unsafe acti…
The paper finds that while LLMs can detect distress regardless of delusional framing, they significantly fail to intervene safely when distress is intertwined with delusion, suggesting a critical reco…
The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…
Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more
The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…
Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more
The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…
Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim +1 more
The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.
The study finds that in multi-agent systems, peer agreement makes LLMs more susceptible to adopting misleading answers than to correcting genuinely wrong ones, suggesting a need for verification over…
The paper proposes an automated, standardized framework to empirically compare the security quality of code generated through human-only, LLM-only, and hybrid collaboration methods.