~ similar to 2605.30151· 20 results
This study surveyed higher education practitioners to map their beliefs and behaviors regarding AI integration, finding that while they view AI favorably, institutional barriers and gaps in design-ori…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong +1 more
The paper proposes an interpretable, AI-driven decision layer that ranks course topics needing attention using multiple student and teacher signals, successfully identifying learning gaps before forma…
F. Carichon, S. Sharma, M. Girard, R. Rampa +1 more
The paper introduces IDEAFix, a systematic evaluation framework designed to analyze how structured prompting and task design influence the divergent thinking and originality of idea generation in LLMs…
The study found that providing skills to LLM agents significantly boosts task success, but the specific granularity of how those skills are presented (e.g., low vs. high abstraction) has only small, u…
Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar +2 more
The paper introduces a new benchmark, E2V-Bench, to evaluate text-to-image models on generating pedagogically accurate visuals from arithmetic equations, finding that current models often fail due to…
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more
The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…
Hang Li, Fedor Filippov, Yuling Lin, Pengfei He +5 more
This paper investigates the vulnerability of LLM-based automatic grading systems to prompt injection (PI) attacks, demonstrating that current systems are highly susceptible to manipulation that can le…
Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang +5 more
The paper argues that observed gains in multimodal agents using tools may be due to learning tool-calling patterns rather than genuine capability expansion, finding that tool access provides little co…
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more
The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala +1 more
The paper proposes a structured prompt engineering framework to enhance the integrity and reliability of Chain-of-Thought (CoT) reasoning in LLMs, demonstrating significant improvements in security-se…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
Riju Marwah, Ritvik Garimella, Vishal Pallagani, Atishay Jain +2 more
The paper formalizes LLM degradation during long generation as 'cognitive fatigue' and introduces the Fatigue Index (FI), a measurable, model-agnostic diagnostic tool for real-time monitoring.
The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
This study compares different levels of LLM access in a statistics course, finding that structured, guided use significantly improves students' reasoning skills and independent learning compared to un…
Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more
The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.
The paper proposes a multimodal data analytics framework combining knowledge graphs and temporal modeling to accurately predict advanced mathematics learning difficulties and provide early academic wa…
The paper introduces COVCAL, a risk-controlled method that precisely determines when a partial formalization signal from an autoformalizer can be trusted to certify the correctness of natural-language…
The paper demonstrates that LLM performance in zero-shot annotation is significantly limited by the alignment between the model's internal understanding and the task definition, showing that prompt-ba…