Papers similar to 2605.29788

~ similar to 2605.29788 · 19 results

cs.LG · cs.AI · stat.ML · Recent · May 28, 2026

The Sample Complexity of Multiclass and Sparse Contextual Bandits

Liad Erez, Fan Chen, Alon Cohen, Tomer Koren +3 more

The paper analyzes the sample complexity of contextual bandits in the $s$-sparse setting, achieving optimal sample bounds for identifying an $\epsilon$-optimal policy.


cs.LG · cs.AI · Recent · May 30, 2026

Extending Causal Metamodeling to a non-Markovian Queue

Pracheta Amaranath, Anant Bhide, David Jensen, Peter Haas

The paper extends modular dynamic Bayesian networks (MDBNs) to model non-Markovian queues, providing the first causal metamodeling technique for such systems with significant speedup.


cs.AI · cs.LG · stat.ML · Recent · May 31, 2026

Transferring Information Across Interventions in Causal Bayesian Optimization

Mohammad Ali Javidian

The paper proposes graph-coupled causal Bayesian optimization, a method that improves efficiency by sharing information across related interventions through a shared set of causal parameters.


math.OC · cs.AI · cs.LG · Recent · Jun 1, 2026

MINTS: Minimalist Thompson Sampling

Kaizheng Wang

The paper introduces MINTS, a minimalist Bayesian framework that simplifies sequential decision-making by placing priors only on the optimum location, allowing for the incorporation of structural cons…


cs.LG · cs.AI · Recent · May 30, 2026

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding +1 more

The paper theoretically explains that optimizing LLMs solely on outcomes leads to brittle reasoning (Reward-Induced Manifold Collapse) by favoring low-complexity shortcuts, and proposes process-based…


cs.LG · cs.AI · Recent · May 28, 2026

Test Time Training for Supervised Causal Learning

Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun +4 more

The paper proposes Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training data aligned with specific test instances to significantly improve…


cs.AI · cs.CL · cs.LG · Recent · May 28, 2026

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang +5 more

The paper introduces Contextual Belief Management (CBM) to address how LLMs should manage accumulating information over long interactions, showing that reinforcement learning significantly improves be…


cs.LG · stat.ML · Recent · Jun 1, 2026

Minimax-Optimal Policy Regret in Partially Observable Markov Games

Raman Arora

The paper develops an optimistic maximum-likelihood algorithm that achieves $\tilde{O}(\sqrt{T})$ policy regret for sequential decision-making in partially observable Markov games against adaptive oppo…


cs.LG · cs.AI · Recent · May 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…


cs.AI · Recent · May 27, 2026

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

Shuaike Li, Kai Zhang, Xianquan Wang, Jiachen Liu +1 more

The paper introduces Causal Editing (CODE), a new paradigm that improves knowledge updates in LLMs by grounding fact injection in causal narratives, drastically reducing self-refutation rates.


cs.LG · cs.AI · Recent · May 29, 2026

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

William Overman, Mohsen Bayati

The paper analyzes the performance of an annealed softmax policy in a Bayesian bandit setting, proving that under specific prior conditions, it achieves near-optimal regret rates by effectively sampli…


stat.ML · cs.LG · Recent · Jun 2, 2026

Resource-Constrained Adaptive Inference for Sequential Pricing

Ruicheng Ao, Jiashuo Jiang, David Simchi-Levi

The paper addresses the failure of fixed-price inference in resource-constrained pricing controllers by developing a target-aware controller that tracks local densities and provides certified, shrinki…


cs.AI · Recent · May 28, 2026

Structure-Induced Information for Rerooting Levin Tree Search

Jake Tuero, Michael Buro, Laurent Orseau, Levi H. S. Lelis

The paper introduces a learned 'rerooter' mechanism to improve subgoal-based policy tree search, allowing scalable search in complex environments without the overhead of explicit subgoal generation.


cs.LG · cs.AI · Empirical · Recent · Jun 4, 2026

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter

This paper introduces RREDCoT, a method for approximating optimal reward redistribution in Chain-of-Thought reasoning language models without additional generation.


cs.LG · cs.AI · Recent · Jun 1, 2026

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

Peter Chen, Xi Chen

The paper proposes 2FFS, a two-fidelity tree-search algorithm that efficiently identifies the best action in stochastic minimax trees by adaptively combining cheap, biased heuristic evaluations with e…


cs.CR · Recent · May 2, 2026

Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly

Igor Santos-Grueiro

The paper identifies and measures a critical failure mode where LLM agents violate policies by losing or corrupting directive-bearing state during the process of assembling the decision context, and p…


cs.CR · cs.LG · cs.MA · Recent · Apr 6, 2026

Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning

Yiyao Zhang, Diksha Goel, Hussain Ahmad

The paper introduces C-MADF, a causally constrained multi-agent framework that significantly reduces false positives in autonomous cyber defense by restricting response actions to structurally consist…


cs.AI · cs.LG · stat.ML · Recent · Jun 1, 2026

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

Zelin He, Haotian Lin, Boran Han, Wei Zhu +5 more

ReSkill is an RL-in-the-loop framework that reconciles skill creation and policy optimization by automatically creating, testing, and refining modular skills alongside the agent's policy learning, lea…
