Francesco Belardinelli
2 indexed papers
Research Timeline
The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significantly across different LLMs and domains.
The paper introduces a novel shielding framework for Robust MDPs (RMDPs) that guarantees safety under worst-case transition probabilities, enabling safe reinforcement learning even when transition dynamics are unknown.
Papers
Robust Shielding for Safe Reinforcement Learning