Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard | ArxivCSExplorer