Papers similar to 2604.05642v1

~ similar to 2604.05642v1· 16 results

cs.CRcs.AIcs.MMRecentApr 9, 2026

Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

Longgang Zhang, Xiaowei Fu, Fuxiang Huang, Lei Zhang

The paper introduces a new benchmark (BGTD) and a multimodal framework (mmTraffic) that enables explainable, evidence-grounded interpretation of encrypted network traffic using LLMs.

View →

cs.CVRecentJun 1, 2026

Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs +1 more

The paper addresses the difficulty of using general vision-language models (VLMs) for fine-grained driver behavior recognition by creating a new, richly described dataset and demonstrating that fine-t…

View →

cs.CVcs.LGeess.IVRecentJun 3, 2026

An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers

Gandhimathi Padmanaban, Fred Feng

This paper presents an open-source computer vision pipeline for classifying vehicle body types from naturalistic roadway video.

View →

cs.CVcs.CRRecentMar 17, 2026

KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

Viraj Panchal, Tanmay Talsaniya, Parag Patel, Meet Patel

KidsNanny is a two-stage multimodal content moderation pipeline that achieves high accuracy and efficiency in detecting child safety threats, particularly excelling in text-embedded content.

View →

cs.CRRecentMay 31, 2026

Privacy-Preserving Smart Surveillance with Cross-Dataset Violence Detection and Decentralized Evidence Governance

Hasan Coşkun, Furkan Çolhak, Andrea Kulakov, Vesna Dimitrova

The paper proposes a privacy-preserving smart surveillance framework that uses a MobileNetV2-based classifier for violence detection and employs decentralized, threshold-based encryption for evidence…

View →

cs.AIRecentMay 31, 2026

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

Siyan Li, Zehao Wang, Jiachen Li, Kanok Boriboonsomsin +2 more

This survey reviews how Large and Multi-modal Language Models (LLMs/MM-LLMs) are being applied to integrate diverse data sources for enhanced decision support in transportation systems management and…

View →

cs.CLcs.AIcs.CVRecentJun 1, 2026

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu +3 more

The paper introduces PaSBench-Video, a comprehensive streaming video benchmark designed to rigorously test multimodal LLMs' ability to issue proactive safety warnings, finding that current models stru…

View →

cs.CRcs.CYRecentMay 8, 2026

Binge, Bot, Repeat: Unpacking the Ecosystem of Video Piracy on Telegram

Sadikshya Gyawali, Jaishnoor Kaur, Taylor Graham, Josef Horacek +3 more

This study provides the first large-scale analysis of video piracy on Telegram, quantifying its massive financial impact and developing a resilient detection framework, Anti-RIP, to combat it.

View →

cs.CLcs.AIcs.LGRecentMay 28, 2026

PhoneWorld: Scaling Phone-Use Agent Environments

Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li +20 more

The paper introduces PhoneWorld, a scalable pipeline that automatically converts real-world GUI trajectories and screenshots into controllable, reproducible phone-use environments, significantly impro…

View →

cs.CVcs.AIcs.CLRecentJun 1, 2026

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

The paper introduces Multi-Clip Video (MCV) SafetyBench, a dataset demonstrating that the vulnerability of Multimodal Large Language Models (MLLMs) to jailbreaking increases with the diversity and num…

View →

cs.CRcs.CYRecentApr 30, 2026

Tracking Conversations: Measuring Content and Identity Exposure on AI Chatbots

Muhammad Jazlan, Ethan Wang, Yash Vekaria, Zubair Shafiq

This paper systematically measured web tracking across 20 popular AI chatbots, finding that a majority share both conversational content and user identity information with third parties.

View →

cs.CRcs.CYRecentMar 25, 2026

A Large-Scale Study of Telegram Bots

Taro Tsuchiya, Haoxiang Yu, Tina Marjanov, Alice Hutchings +2 more

This paper provides a large-scale characterization of Telegram bots, revealing that while they serve useful functions like crowdsourcing, they are also extensively used for malicious activities such a…

View →

cs.CVcs.AIRecentJun 1, 2026

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more

The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…

View →

cs.CRcs.ETcs.LGRecentApr 30, 2026

Selfie-Capture Dynamics as an Auxiliary Signal Against Deepfakes and Injection Attacks for Mobile Identity Verification

Erkka Rantahalvari, Olli Silvén, Zinelabidine Boulkenafet, Constantino Álvarez Casado

The paper demonstrates that passive motion traces recorded during a mobile selfie capture can serve as a measurable, low-friction auxiliary signal for enhancing both spoof screening and user identity…

View →

cs.CVcs.CRRecentMay 28, 2026

On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection

Gudrun Schappacher-Tilp, Nicoletta Kaehling, Jan Kornberger, Egon Teiniker

The paper proposes a privacy-preserving visual monitoring system that performs object detection and generates natural language alerts entirely on an edge device, ensuring GDPR compliance by never tran…

View →

cs.CRcs.AIcs.MMRecentMar 23, 2026

Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee

This paper introduces ComicJailbreak, a new benchmark demonstrating that structured visual narratives can effectively jailbreak Multimodal Large Language Models (MLLMs), requiring new safety alignment…

View →