~ similar to 2606.05149· 19 results
Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu +3 more
The paper introduces PaSBench-Video, a comprehensive streaming video benchmark designed to rigorously test multimodal LLMs' ability to issue proactive safety warnings, finding that current models stru…
The paper addresses the difficulty of using general vision-language models (VLMs) for fine-grained driver behavior recognition by creating a new, richly described dataset and demonstrating that fine-t…
FedTrident proposes a comprehensive framework to defend Federated Learning-based Road Condition Classification against Targeted Label-Flipping Attacks, achieving robust performance comparable to non-a…
David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin +2 more
The paper proposes a novel global multi-modal alignment framework to robustly learn video representations from noisy and complementary sensor data, significantly improving driver distraction detection…
KidsNanny is a two-stage multimodal content moderation pipeline that achieves high accuracy and efficiency in detecting child safety threats, particularly excelling in text-embedded content.
Qingwen Pu, Kun Xie, Hong Yang, Di Yang +1 more
The paper develops a novel deep reinforcement learning framework, SMamba-DDPG, to accurately model vehicle-type-specific pedestrian crash avoidance behavior, finding that pedestrians react faster and…
This paper investigates the application of Parameter-Efficient Fine-Tuning (PEFT) methods, specifically adapters and LoRA, to large pretrained models for instance segmentation, demonstrating that thes…
Jingtao He, Hongliang Lu, Xiaoyun Qiu, Yixuan Wang +1 more
The paper introduces a structured multi-level visual perturbation framework to systematically analyze how dependent VLA-based driving behavior is on visual information, revealing uneven visual groundi…
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
TrafficRAG is a multimodal retrieval-augmented framework that automates traffic accident liability determination by integrating visual evidence, structured legal knowledge, and advanced LLM reasoning.
DeepIPCv3 is a novel multi-modal framework that fuses LiDAR and DVS event streams using cross-modal attention to achieve state-of-the-art, highly reactive avoidance maneuvers for sudden pedestrian cro…
This study comparatively evaluates four CNN architectures (VGG16, ResNet50, EfficientNetB0, and XceptionNet) for fake image detection, finding VGG16 achieved the highest accuracy (91%).
The paper demonstrates that using synthetic hand images containing accessories, generated via inpainting, significantly improves the robustness of hand detectors for safety-critical applications by cl…
This paper proposes a systematic joint workflow combining HARA and TARA to comprehensively identify and analyze risks stemming from inherent limitations of Deep Neural Networks (DNNs) used in autonomo…
Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa +4 more
The paper proposes an attention-enhanced deep learning framework using EfficientNet and CBAM to achieve high accuracy (93.3%) in classifying peach leaf damage, demonstrating improved robustness under…
The paper introduces a structured benchmark (TGAD) showing that current text-guided anomaly detection models often overstate their language conditioning, as performance significantly degrades when the…
Shuo Ju, Qingzhao Zhang, Huashan Chen, Xuheng Wang +5 more
The paper introduces a novel adversarial attack that uses static, view-dependent camouflage on a vehicle to induce consistent feature drift, causing autonomous systems to predict false, yet plausible,…
This paper systematically analyzes the high cross-architecture transferability of physical adversarial attacks on Vision-Language Models (VLMs) used in autonomous driving, demonstrating that attacks e…
Aoyu Pang, Maonan Wang, Yuejiao Xie, Chung Shue Chen +2 more
ReasonLight is a multimodal foundation model-enhanced RL framework that enables zero-shot traffic signal control by semantically refining RL-proposed actions using heterogeneous sensor and camera data…