No, we don't do anything about that at the moment. In theory we could judge several times with different orderings and average the results.
We could measure order bias really easily, though: we just need to look at the average score by rollout position across many runs. I'll add that to my list of experiments!
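In case it's useful, here's a minimal sketch of what I mean (plain Python; `runs` and `score_by_position` are made-up names, and it assumes each judging run yields a list of scores in the order the rollouts were shown to the judge):

```python
from collections import defaultdict
from statistics import mean

def score_by_position(runs: list[list[float]]) -> dict[int, float]:
    """Average judge score at each presentation position across many runs."""
    buckets: defaultdict[int, list[float]] = defaultdict(list)
    for run in runs:
        for position, score in enumerate(run):
            buckets[position].append(score)
    return {pos: mean(scores) for pos, scores in sorted(buckets.items())}

# If rollouts are shuffled before judging and the judge is unbiased,
# every position should converge to roughly the same average score.
runs = [
    [0.8, 0.6, 0.7],
    [0.9, 0.5, 0.6],
    [0.7, 0.6, 0.5],
]
print(score_by_position(runs))  # e.g. {0: 0.8, 1: 0.57, 2: 0.6}
```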
I was looking at https://arxiv.org/abs/2506.18254, but your approach is even more general.
EDIT: by "shared" I only mean the adjacency to LLMs/AI/ML; RL is a pretty big differentiator, though, and the project looks great.