ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

20 results for “Best practices”

CS papers only

Hybrid search: Keyword + semantic, ranked by combined score.ⓘ

Want pure semantic search? Try claim verification →

stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →
cs.CRcs.AIRecentApr 7, 2026

Towards the Development of an LLM-Based Methodology for Automated Security Profiling in Compliance with Ukrainian Cybersecurity Regulations

Daniil Shafranskyi, Iryna Stopochkina, Mykola Ilin

The paper proposes an LLM-enhanced methodology using RAG to automate the creation of security profiles, ensuring compliance with Ukrainian cybersecurity regulations and international best practices.

View →
cs.SEcs.CRRecentJun 1, 2026

Poking Around in the Dark: Why a Shared Understanding of Components Matters

Felix Reichmann, Wolfgang Krane, Alena Naiakshina, Martin Johns +1 more

The paper argues that current Software Bills of Materials (SBOMs) are fundamentally flawed due to a lack of shared understanding regarding what constitutes a 'component,' demonstrating that existing t…

View →
cs.CRcs.CLRecentApr 23, 2026

CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

Wenjie Fu, Xiaoting Qin, Jue Zhang, Qingwei Lin +4 more

The paper introduces CI-Work, a benchmark demonstrating that current enterprise LLM agents frequently leak sensitive information while performing tasks, suggesting that privacy protection requires arc…

View →
cs.AIRecentMay 31, 2026

"Skill issues'': data-centric optimization of lakehouse agents

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…

View →
cs.AIRecentMay 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

View →
cs.CRcs.CLRecentMar 19, 2026

MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment

Yipu Dou, Wang Yang

MOSAIC is a multi-objective framework that efficiently allocates a fixed supervised fine-tuning budget by turning failure profiles into actionable data mixtures, significantly improving model alignmen…

View →
cs.SEcs.CRRecentMar 18, 2026

Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

Xuan Chen, Lu Yan, Ruqi Zhang, Xiangyu Zhang

The paper introduces SafeAudit, a meta-audit framework that systematically enumerates test cases and uses a quantitative metric to uncover significant residual unsafe behaviors in LLM agents that exis…

View →
cs.CYcs.CRRecentMay 20, 2026

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

Matteo Pistillo, Samantha Faraone, Joshua Herman

The paper proposes a novel, empirical methodology called 'backchaining' to derive and prioritize Loss of Control (LoC) mitigations by analyzing the errors an AI system makes on mission-specific nation…

View →
cs.CLRecentMay 31, 2026

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…

View →
cs.CRcs.SERecentMar 23, 2026

A Survey of Web Application Security Tutorials

Bhagya Chembakottu, Martin P. Robillard

This survey analyzed 132 web application security tutorials, finding that most lack concrete implementation details and recommending that the presence of runnable code and links to official resources…

View →
cs.CLRecentJun 1, 2026

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin +5 more

The paper introduces DocFormBench, a new benchmark for content-aware document formatting, and proposes DocFormFlow, a workflow that improves formatting accuracy and efficiency by decoupling target loc…

View →
cs.CRcs.SERecentMay 29, 2026

R+R: Reassessing Java Security API Misuse in Current LLMs: A Replication on JCA and JSSE APIs with External Security Knowledge

Tianhe Lu, Eric Spero, Sakuna Harinda Jayasundara, Robert Biddle +1 more

This paper replicates and extends a study on Java security API misuse in LLMs, finding that while newer models improve performance, the misuse risk persists and is significantly mitigated by external…

View →
cs.SEcs.CRRecentMay 31, 2026

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more

The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…

View →
cs.CLcs.AIRecentMay 27, 2026

Models That Know How Evaluations Are Designed Score Safer

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

The paper demonstrates that models can acquire 'evaluation meta-knowledge' from training data describing evaluation practices, leading to inflated safety benchmark performance that is independent of e…

View →
cs.CRcs.SERecentApr 1, 2026

Automated Generation of Cybersecurity Exercise Scenarios

Charilaos Skandylas, Mikael Asplund

The paper presents an approach to automatically generate a large number of diverse and complex cybersecurity scenarios that model enterprise IT systems for training purposes.

View →
cs.CLRecentMay 28, 2026

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead +1 more

The paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark, and evaluates RAG systems, finding that while open-source models are competitive, automatic evaluations struggle with nuanced…

View →
cs.LGcs.AIRecentMay 29, 2026

Learning to Construct Practical Agentic Systems

Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo +5 more

The paper proposes a modular agent framework and novel learning methods to design and optimize practical, cost-effective, and controllable LLM-based agentic systems.

View →
cs.AIRecentMay 27, 2026

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin +2 more

The paper introduces OR-Space, a novel full-lifecycle workspace benchmark designed to rigorously evaluate industrial optimization agents by simulating real-world, multi-stage OR workflows that go beyo…

View →
cs.DBcs.AIRecentMay 29, 2026

Sophrosyne: Agentic Exploration of Relational Data Systems Needs Moderation

Madhav Jivrajani, Ramnatthan Alagappan, Aishwarya Ganesan

The paper introduces Sophrosyne, a system that moderates LLM agent exploration in relational data systems, significantly reducing over-exploration and boosting SQL generation accuracy by guiding the a…

View →