Computer Vision

Object detection, segmentation, recognition, video

20 papers indexed

cs.CV · Empirical · Recent · Jul 7, 2026

Vision as Unified Multimodal Generation

Xiaoyang Han, Jianhua Li, Kewang Deng, Zukai Chen +13 more

The paper presents SenseNova-Vision, a unified multimodal model for computer vision tasks using natural language instructions and optional visual prompts, trained primarily on a new corpus and requiri…

cs.CV · cs.CR · cs.SI · Recent · May 14, 2026

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed +2 more

This study systematically evaluates Vision Mamba models for detecting AI-generated images, finding that while they show promise, their current strengths and limitations must be understood relative to…

cs.CV · cs.RO · Empirical · Recent · Jul 17, 2026

PIXIE: A Zero-Shot texture-invariant 6D pose estimation framework for unseen objects with assembly defects

Leon Jungemeyer, Alejandro Magaña, Gautham Mohan, Matthias Karl +1 more

The paper introduces PIXIE, a zero-shot framework for estimating the 6D pose of an object from an RGB image using only an untextured 3D model.

cs.CV · Empirical · Recent · Jun 12, 2026

Scratched Lenses, Shifted Depth: Passive Camera-Side Optical Attacks

Qinlin He, Zeming Zhuang, Yongji Wu, Lan Zhang +2 more

This paper identifies and explores a new type of physical adversarial attack on vision systems called Scratch-induced Lens Adversarial Streak Hijacking (SLASH), which causes persistent and selective o…

cs.CV · cs.AI · Empirical · Recent · Jun 12, 2026

Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori +1 more

This paper investigates acoustic attacks on AI-based computer vision systems using lower frequencies in the audible range and explores the impact on various image and object…

cs.CV · cs.AI · cs.RO · Empirical · Recent · Jul 17, 2026

DPNeXt: A Lightweight Multi-Scale Feature Fusion Framework for Efficient ViT-Based Multi-Task Dense Prediction

Jehun Kang, Jungha Wang, Youngjun Hwang, David Hyunchul Shim

This paper proposes DPNeXt, a streamlined multi-scale feature fusion decoder for Multi-Task Learning (MTL) in robotics perception systems, improving frozen VFM utilization and mitigating negative indu…

cs.LG · cs.CR · cs.CV · Recent · Mar 25, 2026

Amplified Patch-Level Differential Privacy for Free via Random Cropping

Kaan Durmaz, Jan Schuchardt, Sebastian Schmidt, Stephan Günnemann

The paper shows that using random cropping, a standard data augmentation technique, can naturally amplify differential privacy guarantees for machine learning models without requiring any changes to t…

cs.CV · Recent · Jun 1, 2026

Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research

Michelle R. Greene

Places in the Wild introduces a massive, high-resolution RAW photograph dataset of 67,574 images captured in situ across 810 locations, providing unprecedented detail for ecologically valid vision res…

cs.CV · cs.AI · cs.LG · Recent · May 27, 2026

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

This study empirically benchmarks classical and quantum machine learning models for image recognition, finding that while quantum models offer superior accuracy and resource efficiency at high dimensi…

cs.CV · Empirical · Recent · Jul 7, 2026

ProxyPose: 6-DoF Pose Tracking via Video-to-Video Translation

Ruihang Zhang, Felix Taubner, Pooja Ravi, Kiriakos N. Kutulakos +1 more

The paper introduces ProxyPose, a method for six-degree-of-freedom (6-DoF) pose tracking using a video diffusion model and a single marked pixel in the first frame.

cs.CV · cs.AI · Recent · May 31, 2026

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao +2 more

The paper introduces Multi-temporal Referring Segmentation (MTRS), a new task requiring models to segment language-described temporal changes, and proposes MTRefSeg-R1, a specialized framework that ac…

cs.CV · cs.LG · eess.IV · Recent · Jun 3, 2026

An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers

Gandhimathi Padmanaban, Fred Feng

This paper presents an open-source computer vision pipeline for classifying vehicle body types from naturalistic roadway video.

cs.CV · cs.AI · cs.CR · Recent · Mar 29, 2026

Towards Context-Aware Image Anonymization with Multi-Agent Reasoning

Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage +2 more

The paper introduces CAIAMAR, a multi-agent reasoning framework that achieves context-aware and high-fidelity anonymization of personally identifiable information (PII) in street imagery, significantl…

cs.CV · Recent · Jun 4, 2026

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more

The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.

cs.CV · cs.AI · Recent · May 28, 2026

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

The paper proposes xModel-KD, a cross-modal knowledge distillation framework, to improve 3D point cloud segmentation by effectively transferring rich appearance cues from 2D images to sparse 3D geomet…

cs.CV · cs.AI · cs.RO · Recent · May 28, 2026

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

Chenxi Tao, Seung-Kyum Choi

The paper reframes industrial visual sim-to-real transfer as a domain-gap problem categorized by the availability of explicit object geometry (CAD), arguing that the required prior evidence dictates t…

cs.CV · Recent · Jun 1, 2026

Edge Prediction for Roof Wireframe Reconstruction with Transformers

Gustav Hanning, Ludvig Dillén, Jonathan Astermark, Johanna Lidholm +1 more

The paper proposes a Transformer-based end-to-end architecture to reconstruct 3D house roof wireframes from sparse point clouds and semantic data, achieving state-of-the-art results on the S23DR Chall…

cs.CV · cs.AI · Empirical · Recent · Jul 10, 2026

Evolution of Accuracy and Visual-Cognitive Errors in a Decade of Vision-Language AI Models

Shravan Murlidaran, Miguel P. Eckstein

This paper introduces the Complex Social Behavior (CSB) dataset and evaluates the progress of scene description accuracy in vision language models (VLMs) from 2017 to 2025. The authors find that MLLMs…

cs.CV · cs.AI · Recent · May 30, 2026

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

Arun Sharma

GeoSAM-3D proposes a novel framework for open-vocabulary 3D scene segmentation from simple monocular video by propagating object prompts using a geodesic distance kernel on a reconstructed Gaussian sc…

cs.CV · cs.AI · Recent · May 28, 2026

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more

This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…
