The paper introduces HLL, a benchmark that tests whether multimodal agents can substitute for humans at verification boundaries (such as CAPTCHAs) in protected real-world workflows, finding that current agents remain brittle and degrade under realistic conditions.
Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity's Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL
Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification
The paper introduces Opt-Verifier, a novel LLM-based framework that significantl…
Self-Trained Verification for Training- and Test-Time Self-Improvement
The paper proposes Self-Trained Verification (STV), a novel method that trains v…
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration
The paper introduces OmniVerifier-M1, a multimodal meta-verifier that uses symbo…
Federated Formal Verification: Cross-Backend Citation, Cross-Axis Convergence, and AI-Orchestrated P…
The paper proposes a federated formal verification architecture that treats veri…
FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search
FineVerify introduces a fine-grained self-verification framework that improves a…
EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
The paper introduces EgoBench, the first interactive multimodal benchmark design…
FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verifica…
The paper introduces FinVerBench, a comprehensive benchmark for financial statem…
Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding
The paper introduces Hybrid Verified Decoding, a method that predicts the accept…