Zhi Li
12 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper identifies that the convenience of host-acting agents leads to semantic under-specification in user goals, which forces the agent to generate potentially risky execution plans.
This paper systematically analyzes the threat posed by malicious third-party API routers in the LLM supply chain, finding that a significant number of routers actively perform payload injection, credential theft, and cryptocurrency draining.
The paper introduces AgentFlow, a novel framework that uses a typed graph DSL and feedback-driven optimization to automatically synthesize and improve multi-agent harnesses for discovering security vulnerabilities.
Semia is a novel static auditor that translates complex, prose-defined agent skills into a verifiable Datalog fact base, enabling the detection of critical security vulnerabilities in real-world LLM agents.
The paper introduces Checkerboard, a novel, learning-free clean-label backdoor attack that efficiently poisons training data to compromise model integrity with minimal poisoning budget.
The paper proposes an operation-centric, TEE-backed isolation model to constrain self-hosted computer-use agents, preventing malicious or unsafe host-level operations without sacrificing general functionality.
The paper introduces Sefz, a semantic fuzzing framework that automatically discovers specification violations in LLM agent skills, finding a significant number of previously unknown exploitable guardrail breaches.
This survey provides a comprehensive, practical guide to ensuring the trustworthiness of complex, autonomous agentic AI systems by focusing on safety, robustness, privacy, and system security.
The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond simple geometric matching.
This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform moderately, there is significant room for improvement in handling complex financial multimodal tasks.
The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that current state-of-the-art models fail on complex, domain-specific structures.
PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages and that direct audio processing is superior to cascaded ASR+LLM systems.
Papers
Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing
Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo +21 more
The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that curre…