20 results for “Best practices”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
The paper proposes an LLM-enhanced methodology using RAG to automate the creation of security profiles, ensuring compliance with Ukrainian cybersecurity regulations and international best practices.
The paper argues that current Software Bills of Materials (SBOMs) are fundamentally flawed due to a lack of shared understanding regarding what constitutes a 'component,' demonstrating that existing t…
Wenjie Fu, Xiaoting Qin, Jue Zhang, Qingwei Lin +4 more
The paper introduces CI-Work, a benchmark demonstrating that current enterprise LLM agents frequently leak sensitive information while performing tasks, suggesting that privacy protection requires arc…
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
MOSAIC is a multi-objective framework that efficiently allocates a fixed supervised fine-tuning budget by turning failure profiles into actionable data mixtures, significantly improving model alignmen…
The paper introduces SafeAudit, a meta-audit framework that systematically enumerates test cases and uses a quantitative metric to uncover significant residual unsafe behaviors in LLM agents that exis…
The paper proposes a novel, empirical methodology called 'backchaining' to derive and prioritize Loss of Control (LoC) mitigations by analyzing the errors an AI system makes on mission-specific nation…
This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…
This survey analyzed 132 web application security tutorials, finding that most lack concrete implementation details and recommending that the presence of runnable code and links to official resources…
Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin +5 more
The paper introduces DocFormBench, a new benchmark for content-aware document formatting, and proposes DocFormFlow, a workflow that improves formatting accuracy and efficiency by decoupling target loc…
This paper replicates and extends a study on Java security API misuse in LLMs, finding that while newer models improve performance, the misuse risk persists and is significantly mitigated by external…
Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more
The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…
The paper demonstrates that models can acquire 'evaluation meta-knowledge' from training data describing evaluation practices, leading to inflated safety benchmark performance that is independent of e…
The paper presents an approach to automatically generate a large number of diverse and complex cybersecurity scenarios that model enterprise IT systems for training purposes.
Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead +1 more
The paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark, and evaluates RAG systems, finding that while open-source models are competitive, automatic evaluations struggle with nuanced…
Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo +5 more
The paper proposes a modular agent framework and novel learning methods to design and optimize practical, cost-effective, and controllable LLM-based agentic systems.
Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin +2 more
The paper introduces OR-Space, a novel full-lifecycle workspace benchmark designed to rigorously evaluate industrial optimization agents by simulating real-world, multi-stage OR workflows that go beyo…
The paper introduces Sophrosyne, a system that moderates LLM agent exploration in relational data systems, significantly reducing over-exploration and boosting SQL generation accuracy by guiding the a…