Zhe Li

16 indexed papers

Top categories

AI ×8, Crypto ×8, Vision ×3, NLP ×3, Sound ×1, Info Retrieval ×1, ML ×1, Audio and Speech Processing ×1

Frequent co-authors

Ruizhe Li (3×)

Zhe Liu (3×)

Mingzhe Liu (2×)

Yujian Ma (1×)

Jinqiu Sang (1×)

Jiaao Yu (1×)

Research Timeline

2026

GasLiteAA: Optimizing ERC-4337 for Efficient and Secure Gas Sponsorship

GasLiteAA proposes optimizing the ERC-4337 standard by offloading gas sponsorship logic to Trusted Execution Environments (TEE), significantly reducing on-chain gas costs while maintaining security and verifiability.

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

ClawGuard is a novel runtime security framework that deterministically enforces user-confirmed rules at tool-call boundaries to protect LLM agents from indirect prompt injection.

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

SafeHarbor is a novel, hierarchical memory-augmented framework that establishes context-aware decision boundaries for LLM agents, achieving state-of-the-art safety while minimizing over-refusal.

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

OrchJail introduces an orchestration-guided fuzzing framework to systematically jailbreak tool-calling text-to-image agents by exploiting unsafe multi-step tool-orchestration patterns.

Membership Inference Attacks on Vision-Language-Action Models

This paper presents the first systematic study of membership inference attacks (MIAs) against Vision-Language-Action (VLA) models, demonstrating that these models are highly vulnerable to privacy breaches even when an attacker observes only the generated actions.

DCVD: Dual-Channel Cross-Modal Fusion for Joint Vulnerability Detection and Localization

DCVD proposes a dual-channel cross-modal fusion framework that jointly detects software vulnerabilities and precisely localizes the vulnerable lines, outperforming existing state-of-the-art methods.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

The paper introduces LITMUS, a novel benchmark that rigorously tests LLM agents for dangerous, physical-layer behavioral jailbreaks in real OS environments, revealing that current agents frequently execute high-risk operations despite safety guardrails.

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

The paper introduces a new security benchmark and framework to defend LLM agents against 'cognitive poisoning,' where malicious tools build trust through benign feedback before executing a harmful final action.

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

The paper proposes a unified framework to evaluate how different types of memory transfer benefit multi-trajectory inference for tool-use LLM agents, finding that the optimal memory method depends critically on the underlying inference strategy.

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

SANA-Streaming introduces a novel, efficient framework that enables real-time, high-resolution streaming video-to-video editing by combining a hybrid diffusion transformer with specialized training and hardware co-design.

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

This paper systematically evaluates how LLMs uncritically adapt to potentially dangerous user prompts related to eating disorders, finding that specific linguistic cues significantly increase the likelihood of unsafe responses.

GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark

The paper introduces GigaSpeechBench, a comprehensive multilingual and multidimensional ASR & AST benchmark with 680 hours of human-annotated speech, featuring 12 low-resource languages, 6 Chinese dialects, 6 English accents, dense terminology, older adult and child speech, and human-annotated translations.

ELSA3D: Elastic Semantic Anchoring for Unified 3D Understanding and Generation

The paper introduces ELSA3D, a unified 3D model that uses elastic semantic anchoring to improve interaction between text and 3D representations, achieving state-of-the-art performance with reduced FLOPs and inference latency.

Knowing the Self, Understanding the World: A Dual-Cognition Benchmark for UAV Spatio-temporal Reasoning with MLLMs

The paper introduces UAV-DualCog, a benchmark for evaluating multimodal large language models in UAV scenarios for joint self-state and environment-state reasoning.

RAMP: Robust Ad Recommendation Under Limited Personalized-Feature Availability via Masking and Alignment Pathways

RAMP is a method for improving click-through rate and conversion rate prediction accuracy in privacy-constrained settings by using a personalized pathway, a non-personalized pathway, and a prediction-alignment architecture.

From Semantics to Readout: Mechanistic Understanding of Audio Tokens after Fine-Tuning for Temporal Audio Grounding

This paper examines how fine-tuning large audio-language models for temporal audio grounding affects the semantics, decoder accessibility, and temporal output alignment of native audio-token states.

Papers

cs.SD · NEW · Empirical · Jul 28, 2026

From Semantics to Readout: Mechanistic Understanding of Audio Tokens after Fine-Tuning for Temporal Audio Grounding

Yujian Ma, Jinqiu Sang, Ruizhe Li, Jiaao Yu +1 more

cs.IR