Computer Vision
Object detection, segmentation, recognition, video
20 papers indexed
Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation
This study systematically evaluates Vision Mamba models for detecting AI-generated images, finding that while they show promise, their current strengths and limitations must be understood relative to…
Amplified Patch-Level Differential Privacy for Free via Random Cropping
The paper shows that using random cropping, a standard data augmentation technique, can naturally amplify differential privacy guarantees for machine learning models without requiring any changes to t…
Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research
Places in the Wild introduces a massive, high-resolution RAW photograph dataset of 67,574 images captured in situ across 810 locations, providing unprecedented detail for ecologically valid vision res…
Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study
This study empirically benchmarks classical and quantum machine learning models for image recognition, finding that while quantum models offer superior accuracy and resource efficiency at high dimensi…
xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR
The paper proposes xModel-KD, a cross-modal knowledge distillation framework, to improve 3D point cloud segmentation by effectively transferring rich appearance cues from 2D images to sparse 3D geomet…
PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding
Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more
The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.
An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers
This paper presents an open-source computer vision pipeline for classifying vehicle body types from naturalistic roadway video.
An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation
Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao +2 more
The paper introduces Multi-temporal Referring Segmentation (MTRS), a new task requiring models to segment language-described temporal changes, and proposes MTRefSeg-R1, a specialized framework that ac…
Edge Prediction for Roof Wireframe Reconstruction with Transformers
The paper proposes a Transformer-based end-to-end architecture to reconstruct 3D house roof wireframes from sparse point clouds and semantic data, achieving state-of-the-art results on the S23DR Chall…
Towards Context-Aware Image Anonymization with Multi-Agent Reasoning
The paper introduces CAIAMAR, a multi-agent reasoning framework that achieves context-aware and high-fidelity anonymization of personally identifiable information (PII) in street imagery, significantl…
GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video
GeoSAM-3D proposes a novel framework for open-vocabulary 3D scene segmentation from simple monocular video by propagating object prompts using a geodesic distance kernel on a reconstructed Gaussian sc…
Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes
The paper reframes industrial visual sim-to-real transfer as a domain-gap problem categorized by the availability of explicit object geometry (CAD), arguing that the required prior evidence dictates t…
VLM3: Vision Language Models Are Native 3D Learners
Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more
The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…
PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation
PixVOD proposes a fully parallelizable, pixel-distributed framework for visual odometry and depth estimation that performs computations directly on the sensor using Gaussian Belief Propagation.
Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more
This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…
Privacy-Preserving Semantic Segmentation without Key Management
The paper introduces a privacy-preserving semantic segmentation method that supports model training and inference on images encrypted independently for each client and each image.
CamGeo: Sparse Camera-Conditioned Image-to-Video Generation with 3D Geometry Priors
Xuanyi Liu, Deyi Ji, Liqun Liu, Lanyun Zhu +7 more
CamGeo is a novel framework that improves sparse camera-conditioned image-to-video generation by distilling rich 3D geometric priors into the diffusion backbone, resulting in geometrically consistent…
GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes
GeM-NR proposes a training-free framework for general multi-view image editing, enabling consistent edits that drastically change both the geometry and appearance of a nonrigid scene.
Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video
The paper develops and validates Quantitative Movement Testing (QMT), a computer vision pipeline that accurately extracts 3D kinematic biomarkers from standard smartphone videos, providing an objecti…
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma +2 more
The paper proposes GASP, a framework that injects fundamental geometric priors directly into Vision-Language Models (VLMs) using ground-truth video geometry, significantly enhancing 3D spatial reasoni…