Similar to 2606.02342 · 18 results
The paper demonstrates that passive motion traces recorded during a mobile selfie capture can serve as a measurable, low-friction auxiliary signal for enhancing both spoof screening and user identity…
This paper systematically diagnoses the failure modes of linear deception probes in LLMs, finding that while single-direction probes are insufficient, multi-dimensional probes can recover robust detec…
GAFSV-Net introduces a novel 2D vision framework by encoding temporal signature data into a six-channel Gramian Angular Field image, significantly improving online signature verification accuracy over…
This paper demonstrates that typographic attacks pose a significant, measurable, and physically consequential threat to household robot manipulation systems by causing the robot to grasp and transport…
The paper introduces a structured benchmark (TGAD) showing that current text-guided anomaly detection models often overstate their language conditioning, as performance significantly degrades when the…
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
The paper introduces Multi-Clip Video (MCV) SafetyBench, a dataset demonstrating that the vulnerability of Multimodal Large Language Models (MLLMs) to jailbreaking increases with the diversity and num…
The paper introduces OpAI-Bench, a novel benchmark designed to study how AI authorship signals evolve and accumulate during the progressive co-editing process between humans and AI.
This paper proposes a 3D CNN detector that leverages temporal artifacts to accurately identify high-quality deepfake videos, demonstrating robust detection even after social media re-encoding.
The paper demonstrates that using synthetic hand images containing accessories, generated via inpainting, significantly improves the robustness of hand detectors for safety-critical applications by cl…
NeuroLip proposes an event-based spatiotemporal framework for visual speaker recognition that achieves robust cross-scene generalization by capturing fine-grained lip dynamics, outperforming existing…
The paper systematically compares multimodal transformer and LLM approaches for document type classification, finding that specialized multimodal Transformers outperform LLM-based models, especially w…
KidsNanny is a two-stage multimodal content moderation pipeline that achieves high accuracy and efficiency in detecting child safety threats, particularly excelling in text-embedded content.
Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan +3 more
The paper introduces Pocket-Dentist, an efficiency-aware benchmark and model that demonstrates that compact, smaller Vision-Language Models (VLMs) can outperform larger models in accuracy while drasti…
Zijian Ling, Jianbang Chen, Hongwei Li, Hongda Zhai +5 more
BioMoTouch is a multi-modal touch authentication framework that jointly models physiological contact structures (from capacitive screens) and behavioral motion dynamics (from inertial sensors) to achi…
The paper develops and validates Quantitative Movement Testing (QMT), a computer vision pipeline that accurately extracts 3D kinematic biomarkers from standard smartphone videos, providing an objecti…
This paper proposes using homoglyphic substitution, replacing characters with visually similar alternatives, as a method to degrade and prevent the extraction of personal information via adversarial s…
Zhiwei Chen, Yijie Li, Yimo Zhang, Shiyun Shao +8 more
GaMi is a multimodal material identification system that uses mmWave and acoustic sensing with a cross-modal subtractive disentanglement framework to achieve high accuracy (95.2%) for material identif…