Seongyong Ju
4 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.
The paper introduces Temporal Logit Observability (TLO), a training-free diagnostic that analyzes the decoding process to reveal the temporal patterns of LLM safety failures, showing that failure mechanisms are often distinct even when the final Attack Success Rate is the same.
The paper introduces Persona Attack, a novel memory injection jailbreak method that demonstrates how accumulating instructions in the model's context window can override internal safety alignments, achieving high success rates.
The paper introduces Persona Attack, a novel memory injection jailbreak method that demonstrates that accumulating instructions in the model's context window can override internal safety alignments, achieving high attack success rates.
Papers
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models
The paper introduces Persona Attack, a novel memory injection jailbreak method that demonstrates how accumulating instructions in the model's context window can override internal safety alignments, ac…