Papers similar to 2605.29500

20 results

cs.LG · cs.AI · Recent · May 27, 2026

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang +4 more

The paper proposes ProRL, an effective Reinforcement Learning framework that rectifies gradient estimation deficiencies to optimize proactive recommendation paths, significantly outperforming existing…

cs.IR · cs.AI · Recent · Jun 1, 2026

Time-Aware Diffusion based on Preference Disentanglement for Generative Recommendation

Bangguo Zhu, Peng Huo, Yuanbo Zhao, Zhicheng Du +2 more

The paper proposes TDPM, a time-aware diffusion model for generative recommendation, which significantly improves recommendation accuracy by explicitly modeling the non-stationary, time-evolving natur…

stat.ML · cs.AI · cs.LG · Recent · May 28, 2026

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

This paper analyzes Best-of-$N$ preference data, deriving explicit reward targets for independent-reference variants and establishing design principles for choosing $N$ and the base distribution to op…

stat.ML · cs.LG · Recent · Jun 1, 2026

ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation

David Rundel, Fabian Fumagalli, Maximilian Muschalik, Bernd Bischl +1 more

ShaplEIG introduces a Bayesian experimental design framework to efficiently and adaptively estimate Shapley values by minimizing the number of required costly function evaluations.

cs.LG · cs.CL · Recent · Jun 3, 2026

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye +3 more

This paper proposes a new framework called STRIDE for training data attribution in Large Language Models.

cs.LG · cs.CL · Recent · May 31, 2026

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang +4 more

OmniOPD introduces a logit-free, chunk-level distillation framework that improves on standard On-Policy Distillation by using semantic similarity and peak-entropy scheduling, achieving state-of-the-ar…

cs.CV · cs.AI · cs.LG · Recent · May 30, 2026

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

Abdullah Al Shafi, Kazi Saeed Alam, Sk Imran Hossain, Engelbert Mephu Nguifo

DASH introduces a dual-branch distillation framework to effectively compress class-conditional diffusion models by independently supervising both score branches, significantly preserving guidance fide…

cs.AI · Recent · May 27, 2026

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more

The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…

cs.LG · cs.AI · Recent · May 30, 2026

COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs

Sheng'en Li, Dongmian Zou

The paper introduces COPF, an online framework that ensures deployment-stable counterfactual fairness in link recommendation systems operating on evolving graphs by monitoring and controlling group di…

cs.LG · cs.AI · cs.CL · Recent · May 28, 2026

Reasoning with Sampling: Cutting at Decision Points

Felix Zhou, Anay Mehrotra, Quanquan C. Liu

The paper introduces Entropy-Cut Metropolis-Hastings, an efficient sampling method that uses next-token entropy to identify and resample from critical decision points in a reasoning trace, significant…

cs.LG · cs.AI · Recent · May 29, 2026

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi +7 more

The paper proposes S2L-PO, a framework that uses smaller, naturally diverse models as structured explorers to enhance the policy-level diversity and performance of larger language models during traini…

cs.AI · cs.LG · Recent · May 28, 2026

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

Tim Woydt, Paul-David Zuercher

The paper introduces Nested Contextual Causal Bandits (NCCBs) to model multi-timescale sequential decisions and proposes a certified policy optimization method, NCTS, that provides quantifiable risk b…

cs.LG · cs.AI · cs.CR · Recent · May 28, 2026

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

Anany Kotawala

The paper introduces NumLeak, a framework demonstrating that top-tier LLMs often exhibit high-fidelity recall of specific public numeric benchmarks (like financial factors) due to memorization, which…

cs.LG · cs.AI · cs.CR · Recent · May 28, 2026

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

Anany Kotawala

The paper introduces NumLeak, a framework demonstrating that top-tier LLMs often exhibit high-fidelity recall of specific public numeric benchmarks, suggesting that their apparent skill may be due to…

cs.AI · Recent · May 29, 2026

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris +4 more

PReMISE introduces a framework to audit and improve the quality of rubrics used to guide LLM judges, demonstrating that it can significantly increase judge accuracy and reduce the exploitability of re…

cs.LG · cs.AI · Recent · May 29, 2026

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

Rodney Lafuente-Mercado

The paper introduces ARCA, a novel credit assignment method that measures token salience directly from the adapter's residual hidden state, addressing the degeneracy of standard intrinsic signals when…

cs.CR · Recent · Apr 10, 2026

Hagenberg Risk Management Process (Part 3): Operationalization, Probabilities, and Causal Analysis

Eckehard Hermann, Harald Lampesberger

The paper introduces a comprehensive framework, Realtime Risk Studio, that operationalizes qualitative risk models (Bowtie diagrams) into formal, probabilistic, and intervention-ready runtime models u…

cs.LG · cs.CR · Recent · Jun 1, 2026

Outsmarting the Chameleon: Counterfactual Decoupling for Tactical OOD Shifts in Live Streaming Risk Assessment

Yiran Qiao, Jing Chen, Jiaqi Xu, Yang Liu +2 more

The paper proposes a novel framework, LPCD, that uses latent causal modeling to robustly assess evolving adversarial risks in live streaming by decoupling malicious intent from superficial tactical sh…

cs.CL · cs.AI · Recent · May 28, 2026

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more

The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…

cs.CL · cs.AI · cs.LG · Recent · Jun 1, 2026

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Atoosa Chegini, Soheil Feizi

The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…
