~ similar to 2606.01897· 16 results
Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen +14 more
The paper introduces MTAVG-Bench 2.0, a new benchmark designed to diagnose high-level failure modes of cinematic expressiveness in multi-talker audio-video generation, showing that even advanced model…
Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast +2 more
The paper introduces EUDAIMONIA, a new framework and benchmark for evaluating how well LLMs align with user welfare in social interactions, finding that even state-of-the-art models frequently violate…
This study analyzes the dynamics of AI-generated multimodal misinformation using a large-scale dataset, finding that while synthetic content is highly viral, its spread is passive and its detectabilit…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin +5 more
The paper introduces InsightVQA, a large-scale benchmark dataset designed for hierarchical visual question answering that assesses complex emotion understanding and cognitive reasoning beyond simple e…
Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou +7 more
The paper introduces Cookie-Bench, a novel, autonomous, and reference-free evaluation framework that significantly improves the assessment of interactive web generation capabilities for frontier LLMs.
The paper proposes a sequence-alignment framework using Soft Dynamic Time Warping to evaluate audio-driven talking-head generation, demonstrating that this approach provides more robust and fair compa…
The paper introduces CHARM, a novel framework that detects and mitigates cascading hallucination—the amplification of errors across multi-step agentic RAG pipelines—achieving an 82.1% reduction in err…
The paper introduces the Tacit Understanding Index (TUX) to measure non-explicit alignment between humans and LLMs, finding that this alignment is significantly structured by individual person-level t…
The study demonstrates that conditioning AI brand recommendations on a user's persona significantly alters the recommended product set, particularly for mid-market brands, and this effect is largest o…
Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan +9 more
The paper introduces NICE, a novel, theory-grounded diagnostic benchmark for assessing the social intelligence of LLMs, which reveals that current frontier models consistently struggle with specific f…
Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar +3 more
The paper introduces LLUMI, an open-source framework that improves LLM writing assistance for mental health support using community feedback, demonstrating comparable performance to proprietary models…
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
Wuqiang Zheng, Chengbing Wang, Yilin Yang, Junyi Cheng +5 more
This paper introduces personalized empathy, a capability for LLMs to adapt empathetic strategies based on individual user history, and proposes PereGRM, a reward modeling framework that significantly…
The paper proposes a persona-based evaluation framework that replaces monolithic AI benchmarks with structured cognitive profiles to capture diverse human perspectives, while also identifying the chal…
Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang +2 more
The paper introduces PersTurnBench, a novel benchmark and evaluator for assessing personalized user conversation satisfaction at specific turns, addressing the limitation of generic response quality m…