~ similar to 2605.28483· 20 results
Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang +7 more
This paper proposes four guidelines and two novel data ordering methods (STR and SAW) to systematically optimize data organization, significantly enhancing the stability and performance of LLM trainin…
Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki +1 more
The paper introduces a robust four-stage NLP framework that uses schema-constrained LLMs and ESCO vocabulary to accurately extract and align educational competencies with labor market demands, quantif…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
SkillPager is a novel two-stage framework that efficiently selects minimal, execution-sufficient context from large procedural skill documents by leveraging typed semantic nodes, significantly reducin…
Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu +8 more
MIRA proposes a novel source-aware filtering framework that discovers and anchors evaluation rubrics during data selection, significantly improving code-oriented mid-training data quality while reduci…
The paper introduces a novel, scalable framework to monitor and classify dataset usage within research literature, addressing the current lack of infrastructure for tracking data citations.
The paper proposes InSemRAG, an enhanced RAG framework that improves retrieval accuracy and knowledge integrity by incorporating intent-aware retrieval and semantics-preserving chunking, achieving sta…
Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song +2 more
The paper proposes a graph-constrained approach to scale multi-hop training data by decoupling path discovery from path verbalization, significantly expanding the usable corpus size for LLMs.
Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu +5 more
The paper introduces SkillBrew, a multi-objective framework that treats skill bank curation as a constrained optimization problem to build efficient and well-curated skill repositories for LLM agents.
The paper proposes a rigorous, fixed-budget, cluster-aware standard for LLM-as-a-judge evaluation of multi-hop RAG systems, demonstrating that current evaluation methods often overstate performance.
This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…
The paper introduces a novel, training-free method to automatically generate fine-grained evaluation rubrics for LLM-as-a-Judge, and further proposes an iterative fine-tuning strategy that significant…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
The paper proposes a comprehensive benchmark to systematically audit how varying persona prompts and model choices affect the technical quality and social representativeness of scholar recommendations…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
This study compares different levels of LLM access in a statistics course, finding that structured, guided use significantly improves students' reasoning skills and independent learning compared to un…
Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang +1 more
The paper introduces CITETRACE, a large-scale dataset and evaluation framework that systematically measures structural citation failures in search-augmented LLMs, revealing a pattern called Verified M…
MEMENTO proposes a novel framework that treats the open web as a continuous learning signal, enabling agents to acquire task-specific expertise and reusable research strategies in low-data domains wit…
The paper introduces a hybrid system, HYBRIDSOURCETRACKER (HST), that combines vector search and Winnowing fingerprinting to achieve scalable, high-precision provenance tracking for code generated by…