Papers similar to 2606.06454

~ similar to 2606.06454· 20 results

cs.SEcs.AIRecentMay 28, 2026

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…

View →

cs.LGcs.AIRecentMay 27, 2026

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

Nishal Thomas, Noel Thomas

The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…

View →

cs.CLcs.AIcs.LGRecentMay 29, 2026

Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

Xiaonan Xu, Wenjing Wu

The study found that providing skills to LLM agents significantly boosts task success, but the specific granularity of how those skills are presented (e.g., low vs. high abstraction) has only small, u…

View →

cs.CRcs.SERecentMay 4, 2026

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Richard J. Young, Gregory D. Moody

The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…

View →

cs.CRcs.SERecentApr 30, 2026

How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection

Maofei Chen, Laifu Wang, Yue Qin, Yuan Wang +2 more

The paper demonstrates that using raw source text for fine-tuning LLMs on vulnerability detection causes high false-positive rates by memorizing surface-level syntax, a problem mitigated by using Abst…

View →

cs.CRRecentMay 4, 2026

PHANTOM: Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry

Abraham Itzhak Weinberg

PHANTOM is a novel framework that generates highly convincing, context-aware honeytokens by incorporating deep organizational knowledge, significantly improving their believability and detection resis…

View →

cs.CRRecentMay 13, 2026

Do Skill Descriptions Tell the Truth? Detecting Undisclosed Security Behaviors in Code-Backed LLM Skills

Wenhui He, Yue Li, Bang Fu, Huan Xing +3 more

The paper introduces SKILLSCOPE, a system that detects security-relevant behaviors in code-backed LLM skills that are not disclosed in the natural language description, finding that 9.4% of skills exh…

View →

cs.SEcs.AIRecentMay 28, 2026

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun +2 more

The paper introduces Code-QA-Bench, a novel framework that rigorously separates genuine code reasoning from mere documentation memorization in repository-level code understanding benchmarks.

View →

cs.CRcs.LGRecentApr 17, 2026

Surgical Repair of Insecure Code Generation in LLMs

Gustavo Sandoval, Brendan Dolan-Gavitt, Siddharth Garg

This paper identifies the 'Format-Reliability Gap'—where LLMs know about code vulnerabilities but generate insecure code anyway—and proposes a localized, per-vulnerability steering vector fix that sig…

View →

cs.CLRecentMay 29, 2026

Mellum2 Technical Report

Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok +5 more

Mellum 2 is an open-weight 12B Mixture-of-Experts (MoE) language model specialized for software engineering, achieving performance competitive with larger models while maintaining the efficiency of a…

View →

cs.PLcs.AIcs.CLRecentMay 27, 2026

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

FPMoE introduces a sparse Mixture-of-Experts (MoE) architecture to improve functional code generation across multiple functional programming languages, achieving state-of-the-art performance with fewe…

View →

cs.CLcs.AIcs.LGRecentJun 1, 2026

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Atoosa Chegini, Soheil Feizi

The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…

View →

cs.CVcs.AIRecentMay 29, 2026

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more

The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…

View →

cs.CLcs.AIcs.LGRecentMay 29, 2026

XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju

The paper introduces XLGoBench, a synthetic benchmark of algorithmic tasks designed to detect persistent cross-lingual skill gaps in large language models.

View →

cs.AIcs.CLcs.LORecentMay 27, 2026

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov +1 more

The paper introduces COVCAL, a risk-controlled method that precisely determines when a partial formalization signal from an autoformalizer can be trusted to certify the correctness of natural-language…

View →

cs.CLcs.IRRecentJun 3, 2026

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

Zhenyu Yu, Shuigeng Zhou

This paper evaluates the causal reasoning abilities of large language models and finds that they rely heavily on lexical pattern matching rather than structural reasoning.

View →

cs.CRcs.AIcs.SERecentJun 3, 2026

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo

The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…

View →

cs.CRcs.AIcs.MARecentMay 1, 2026

Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes

Alfredo Metere

The paper proposes a trust schema and verification framework to ensure that agent skills, which augment LLMs, are rigorously verified before deployment, thereby making human-in-the-loop oversight scal…

View →

cs.CRcs.AIcs.LGRecentMay 22, 2026

An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

Mohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh, Mohammad Hammoudeh +1 more

The paper empirically evaluates the security quality of LLM-generated code across various prompting methods, finding that while prompting alters the structure of weaknesses, it is insufficient to reli…

View →

cs.SEcs.AIcs.LGRecentMay 29, 2026

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Nazmus Ashrafi

The study found that while multi-agent LLM code generation architectures significantly affect code complexity, the added complexity does not translate into better functional correctness, suggesting ar…

View →