Papers similar to 2606.00153 (18 results)

cs.CV, cs.AI · Recent · May 31, 2026

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

The paper proposes a novel cross-axis feature fusion architecture and an auxiliary joint-difference prediction task to significantly improve text-based 3D human motion editing by better understanding…

cs.CV, cs.GR · Recent · Jun 1, 2026

Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances

Dominik Hollidt, Tommaso Bendinelli, Christian Holz

Ultra Diffusion Poser is a novel diffusion model that improves human motion tracking from sparse IMUs and UWB ranging by explicitly modeling the geometric constraints imposed by inter-sensor distances…

cs.CV, cs.AI · Recent · May 29, 2026

Variational Adapter for Cross-modal Similarity Representation

WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye +1 more

The paper proposes a Variational Adapter (VACSR) to improve cross-modal similarity representation by treating fine-grained image-text matching as a variational inference problem, thereby mitigating th…

cs.CV, cs.AI · Recent · May 28, 2026

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

The paper proposes xModel-KD, a cross-modal knowledge distillation framework, to improve 3D point cloud segmentation by effectively transferring rich appearance cues from 2D images to sparse 3D geomet…

cs.LG, cs.AI, cs.CR · Recent · May 8, 2026

UMEDA: Unified Multi-modal Efficient Data Fusion for Privacy-Preserving Graph Federated Learning via Spectral-Gated Attention and Diffusion-Based Operator Alignment

Shih-Yu Lai, Hirozumi Yamaguchi, Shang-Tse Chen, Yu-Lun Liu +1 more

UMEDA introduces a novel graph federated learning framework that uses spectral signal processing and diffusion models to enable privacy-preserving, robust localization across clients with highly heter…

cs.CV, cs.AI, cs.LG · Recent · Jun 1, 2026

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang +1 more

The paper proposes a lightweight post-processing framework that enhances identity continuity in thermal pedestrian tracking by leveraging scene-level spatial-temporal consistency, achieving improved t…

cs.CV, cs.AI · Recent · May 28, 2026

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen +3 more

The paper introduces AnyMo, a unified multimodal framework that enables high-quality, scalable conditional human motion generation by leveraging a massive, cross-modal dataset and a masked modeling tr…

cs.AI, cs.DB, cs.IR · Recent · May 29, 2026

Vector Linking via Cross-Model Local Isometric Consistency

Ziying Chen, Yang Cao, He Sun, Beining Yang +1 more

The paper proposes a novel geometric embedding hashing method to recover object correspondences (vector links) between two embedding clouds generated by different black-box encoders using only a small…

cs.LG, cs.CV · Recent · Jun 1, 2026

Closing the Alignment-Maturity Gap in Federated Prototype Learning

Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas

The paper proposes FedSAP, a framework that stabilizes federated prototype learning by delaying global alignment and enforcing inter-class structure, significantly improving representation quality und…

cs.CV, cs.RO · Recent · Jun 3, 2026

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

Yurim Jeon, Dongseong Seo, Seung-Woo Seo

CIPER is a unified transformer framework that simultaneously performs cross-view image retrieval and precise 3-DoF pose estimation, overcoming the limitations of cascaded, separate methods.

cs.RO, cs.AI, cs.CV · Recent · May 31, 2026

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky, Jazi Eko Istiyanto +1 more

DeepIPCv3 is a novel multi-modal framework that fuses LiDAR and DVS event streams using cross-modal attention to achieve state-of-the-art, highly reactive avoidance maneuvers for sudden pedestrian cro…

cs.CV, cs.AI, cs.LG · Recent · Jun 1, 2026

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…

cs.CV · Recent · Jun 1, 2026

Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin +2 more

The paper proposes a novel global multi-modal alignment framework to robustly learn video representations from noisy and complementary sensor data, significantly improving driver distraction detection…

cs.ET, cs.AI, cs.SD · Recent · May 29, 2026

GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement

Zhiwei Chen, Yijie Li, Yimo Zhang, Shiyun Shao +8 more

GaMi is a multimodal material identification system that uses mmWave and acoustic sensing with a cross-modal subtractive disentanglement framework to achieve high accuracy (95.2%) for material identif…

cs.GR, cs.AI, cs.CV · Recent · May 31, 2026

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

The paper proposes a sequence-alignment framework using Soft Dynamic Time Warping to evaluate audio-driven talking-head generation, demonstrating that this approach provides more robust and fair compa…

cs.CV · Recent · Jun 1, 2026

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim +2 more

MORPHOS is a novel autoregressive framework that generates dynamic 3D assets (like meshes and radiance fields) from videos by using a unified 4D representation to ensure temporal consistency and handl…

cs.CV, cs.AI, eess.IV · Recent · Jun 1, 2026

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han +4 more

The paper proposes a novel render-free framework that conditions video diffusion models directly on compressed 3D human mesh tokens, enabling robust 3D-aware human motion control without relying on re…

cs.RO, cs.AI, cs.CV · Recent · Jun 2, 2026

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin +9 more

The paper introduces Humanoid-GPT, a large-scale generative Transformer model that achieves robust zero-shot motion tracking and control by training on a massive, unified corpus of motion data.
