Kai Wang
11 indexed papers
Research Timeline
The paper introduces ReCAP, a native GUI agent that significantly improves CAPTCHA solving success (from 30% to 80%) by integrating specialized CAPTCHA capabilities into a general-purpose, end-to-end vision-language model.
The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates against major LLMs.
This paper demonstrates a novel attack against the shuffling defense used in secure Transformer inference, showing that randomly permuted activations can still be exploited to recover model weights.
The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rather than just malicious user prompts.
The paper proposes Hydra, a framework to stabilize and control the injection of multiple, conflicting backdoor triggers into text-to-image diffusion models, ensuring high attack reliability while maintaining clean generation quality.
The paper introduces CCLab, an adversarial testing framework, to systematically evaluate the robustness of both learning-based and traditional congestion controllers, finding that learning-based controllers are generally more robust and can be further improved using adversarial traces.
The paper proposes a dual-interventional framework to characterize how linguistic structures and contextual cues influence LLMs' spatial reasoning for navigation, finding that topological information is crucial, while semantic details can be unreliable.
The paper introduces EASE, a method that enhances multimodal Reinforcement Learning with Verifiable Rewards (RLVR) by providing spatial attention supervision anchored to visual evidence, significantly improving visual grounding and reasoning capabilities in VLMs.
The paper introduces PMC-InterCPT, a refined biomedical interleaved corpus that enhances multimodal continued pretraining by integrating figure-referencing body text alongside captions, leading to improved medical and general multimodal model performance.
HumanNOVA is a photorealistic, universal, and rapid model that generates high-quality 3D human avatars from a single input RGB image.
The paper introduces TVIR, a benchmark and multi-agent framework for deep research that evaluates and improves the generation of factually reliable, text-visual interleaved reports.
Papers
HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang +5 more
HumanNOVA is a photorealistic, universal, and rapid model that generates high-quality 3D human avatars from a single input RGB image.