20 results for “Knowledge of computer vision and machine learning”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
This paper addresses the vulnerability of DNNs used in robotic semantic segmentation to adversarial attacks by proposing specialized detection strategies to enhance safety in robotic perception system…
This book provides a compact, derivation-oriented mathematical primer that connects major families of generative AI models, showing their underlying structural relationships.
Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen +11 more
The paper introduces CV-Arena, a large-scale open benchmark for instructional computer vision, demonstrating that professional-grade image editing requires advanced capabilities in physical reasoning…
Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more
The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…
Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang +2 more
This paper formalizes the binding problem using information theory and develops a probing method to measure binding information in deep learning representations, demonstrating that binding is crucial…
This paper presents an open-source computer vision pipeline for classifying vehicle body types from naturalistic roadway video.
Yu Xue, Haoxuan Qu, Zhuoling Li, Yihang Lou +3 more
The paper introduces ToolFG, a novel tool-integrated MLLM framework that enhances fine-grained image classification by enabling models to autonomously use external tools to gather verifiable visual cu…
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer +1 more
The paper introduces PRAXIS, a novel algorithm that efficiently approximates the computation of 'Rashomon sets' for decision trees, significantly reducing memory and runtime complexity.
This study systematically evaluates Vision Mamba models for detecting AI-generated images, finding that while they show promise, their current strengths and limitations must be understood relative to…
Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao +5 more
The paper introduces BilliardPhys-Bench, a new benchmark that demonstrates that current multimodal LLMs struggle with complex physical reasoning and predicting object dynamics in simulated environment…
Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more
The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…
The paper establishes a theoretical information-theoretic bound proving that for Vision-Language-Action (VLA) models, capability and robustness cannot both be arbitrarily high, quantifying the trade-o…
Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar +2 more
The paper introduces RoboTrustBench, a comprehensive benchmark that evaluates the trustworthiness of video world models for robotic manipulation across challenging scenarios, finding that current mode…
This paper systematically analyzes 48 studies on perception attacks against autonomous vehicles, revealing that the increasing reliance on multi-sensor fusion creates new, complex vulnerabilities that…
The paper argues that the standard FID metric is unreliable because its performance depends significantly on the geometric structure and density of the reference dataset, not just the sample quality.
Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton +2 more
This paper introduces a new evaluation framework, SpatialUncertain, demonstrating that current Vision-Language Models (VLMs) are prone to overconfident and incorrect answers to spatial questions when…
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more
The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.
The paper evaluates the adversarial robustness of two open-source Vision-Language Models (LLaVA and Qwen2.5-VL) in a simulated e-commerce environment, finding that while LLaVA is vulnerable to gradien…
Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang +2 more
The paper introduces TRON, an online, rule-verifiable environment substrate that generates an unbounded stream of fresh, controllable visual reasoning training instances, significantly improving RL pe…