Alignment priors wash out under RL

Reading moral books doesn’t matter much if the later training regime teaches different habits.

From Tomek Korbak and team at OpenAI:

Can midtraining on docs about aligned AI bake in alignment priors for agents? We report an experiment where those priors are quickly washed away by RL and fail to generalize to agentic settings. But that cuts both ways: priors that AIs are misaligned fade too!
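To make the dynamic concrete, here's a minimal toy sketch (my own illustration, not the paper's setup): a one-parameter bandit policy whose starting logit stands in for a midtrained prior. Under plain REINFORCE updates, the reward signal pulls an "aligned" and a "misaligned" initialization to the same behaviour, so the prior washes out in either direction.

```python
# Toy sketch, not the paper's experiment: the initial logit plays the role of
# a prior baked in by midtraining; RL reward overwrites it regardless of sign.
import numpy as np

def train(prior_logit, steps=2000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    logit = prior_logit                       # >0 favours the rewarded action
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-logit))      # P(take the rewarded action)
        took_it = rng.random() < p            # sample an action
        reward = 1.0 if took_it else 0.0      # reward ignores the prior
        logit += lr * (took_it - p) * reward  # REINFORCE: r * grad log pi
    return logit

for prior in (+3.0, -3.0):                    # "aligned" vs "misaligned" prior
    print(f"prior logit {prior:+.1f} -> after RL {train(prior):+.1f}")
```

Both initializations end up strongly favouring whatever the reward pays for, which is the "cuts both ways" point in miniature.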

Tuesday, 31 March 2026