Bole Ma
2 indexed papers
Research Timeline
The paper proposes moving the query rather than the KV cache during cross-instance attention and shows that this is significantly cheaper, especially on modern GPU fabrics.
Leyline introduces a novel serving-side primitive that allows agentic LLMs to perform targeted, efficient edits to the KV cache, avoiding costly full re-prefilling after content modification.
Papers
Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics