Similar to 2606.01723 · 20 results
This paper proposes DeMix, a novel framework for simultaneously diagnosing erroneous samples and their error types in machine learning models.
The paper introduces Inconsistency-Aware Minimization (IAM), a novel training objective that uses a label-free measure called local inconsistency to improve model generalization, particularly in semi-…
Bo Wang, Jia Ni, Mengnan Zhao, Zhan Qin +1 more
This paper systematically investigates unlearnable examples (UEs) across diverse training paradigms, finding that existing UEs fail under pretraining-finetuning (PF) settings, and proposes Shallow Sem…
The study demonstrates that robust, domain-invariant representations of synthetic deception can be rapidly entrenched in LLMs using modest fine-tuning, detectable by linear probes even in early layers…
The paper introduces Rate Matching Consistency Training (RMCT), a novel method that improves model robustness against extraneous input cues without forcing the model to ignore those cues, thus preserv…
The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…
The paper proposes a novel method to generate adversarial malware samples that evade deep learning detectors while simultaneously minimizing the detectable 'drift' signals, showing that similarity con…
Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu +2 more
The paper proposes the Shortcut Subspace Suppression (S^3) framework to improve deepfake detection generalization by explicitly identifying and suppressing method-specific shortcuts in learned feature…
Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura +1 more
This study demonstrates that Chain-of-Thought (CoT) monitoring is fundamentally fragile and unreliable for detecting misaligned behavior across typologically diverse languages, especially in low-resou…
Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong +4 more
The paper proposes a Signed Entropy Integral (SEI) statistic to detect mislabeled images in training datasets by analyzing the temporal trend of prediction entropy, achieving state-of-the-art results…
Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala +2 more
The paper identifies the Misattribution Gap, showing that memory-layer attacks (Semantic Norm Drift) can mimic model failure in multi-agent AI systems, and proposes novel detection and mitigation tech…
TabChange proposes a novel framework to generate natural and minimally altered counterfactual instances in tabular data by precisely controlling attribute modifications based on their relationship str…
The paper introduces a structured benchmark (TGAD) showing that current text-guided anomaly detection models often overstate their language conditioning, as performance significantly degrades when the…
Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan +1 more
The paper introduces the concept of 'conditional misalignment,' demonstrating that common interventions designed to reduce emergent misalignment can fail by only masking misaligned behavior until the…
This systematic mapping survey reviews label-efficient approaches for code vulnerability detection, synthesizing five paradigm families and providing a decision guide to navigate trade-offs.
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye +3 more
This paper proposes a new framework called STRIDE for training data attribution in Large Language Models.
Can Jin, Jiakang Li, Rui Wu, Eddy Zhang +1 more
The paper introduces Weak-Critic Strong Oversight, a method where a weak model guides a strong model's self-improvement by providing non-misleading revision directions, leading to scalable oversight.
The paper introduces a robust, two-part framework (HyPE and HyPS) using hyperbolic geometry to efficiently detect and sanitize malicious prompts targeting Vision-Language Models (VLMs).
Andrej Tschalzev, Nick Erickson, Yuyang Wang, Huzefa Rangwala +3 more
The paper introduces TabPrep, a feature engineering pipeline that systematically improves performance across various tabular machine learning models by addressing structural data patterns ignored by c…