Kai Wang

11 indexed papers

Top categories

NLP ×6, Crypto ×6, AI ×5, Vision ×4, ML ×4, Multiagent ×1

Frequent co-authors

Chenkai Wang (2×)

Gang Wang (2×)

Zeming Wei (2×)

Hezhen Hu (1×)

Wangbo Zhao (1×)

Lanqing Guo (1×)

Research Timeline

2026

CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training

The paper introduces ReCAP, a native GUI agent that significantly improves CAPTCHA solving success (from 30% to 80%) by integrating specialized CAPTCHA capabilities into a general-purpose, end-to-end vision-language model.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates against major LLMs.

On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

This paper demonstrates a novel attack against the shuffling defense used in secure Transformer inference, showing that randomly permuted activations can still be exploited to recover model weights.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rather than just malicious user prompts.

Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models

The paper proposes Hydra, a framework to stabilize and control the injection of multiple, conflicting backdoor triggers into text-to-image diffusion models, ensuring high attack reliability while maintaining clean generation quality.

CCLab: Adversarial Testing of Learning- and Non-Learning-Based Congestion Controllers

The paper introduces CCLab, an adversarial testing framework, to systematically evaluate the robustness of both learning-based and traditional congestion controllers, finding that learning-based controllers are generally more robust and can be further improved using adversarial traces.

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

The paper proposes a dual-interventional framework to characterize how linguistic structures and contextual cues influence LLMs' spatial reasoning for navigation, finding that topological information is crucial, while semantic details can be unreliable.

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

The paper introduces EASE, a method that enhances multimodal Reinforcement Learning with Verifiable Rewards (RLVR) by providing spatial attention supervision anchored to visual evidence, significantly improving visual grounding and reasoning capabilities in VLMs.

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

The paper introduces PMC-InterCPT, a refined biomedical interleaved corpus that enhances multimodal continued pretraining by integrating figure-referencing body text alongside captions, leading to improved medical and general multimodal model performance.

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

HumanNOVA introduces a photorealistic, universal, and rapid model capable of generating high-quality 3D human avatars from a single input RGB image.

TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

The paper introduces TVIR, a new benchmark and multi-agent framework for deep research, to evaluate and improve the generation of factually reliable, text-visual interleaved reports.

Papers

cs.CV · Recent · Jun 1, 2026

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang +5 more

HumanNOVA introduces a photorealistic, universal, and rapid model capable of generating high-quality 3D human avatars from a single input RGB image.

cs.CL · Recent · Jun 1, 2026