Are you sure? Because at the end of page 3 in that paper it is stated:
The benchmark is NOT about figuring out what the problem is but ability to learn to solve multiple problems.
And it makes sense if you consider that one task is “open the window” and another, in the same set, is “close the window”.
If you claim “the agent should infer a policy from the state of the window”, then it will end up opening the window when it is closed and closing it when it is open.
Try to apply that policy in a real application and see what happens: you get stuck in a window open/close loop forever. What would be the purpose of that?
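To make the loop concrete, here is a minimal sketch (plain Python, not the Meta-World API; the `window_is_open` flag and the inference rule are hypothetical) of a policy that infers its goal only from the observed window state:

```python
# Hypothetical sketch, not Meta-World code: a "policy" that infers its goal
# purely from the current window state ends up toggling forever.

def infer_goal_from_state(window_is_open: bool) -> str:
    # The state alone cannot tell us which task we are in, so the agent
    # always picks the action that changes whatever state it observes.
    return "close" if window_is_open else "open"

window_is_open = False
for step in range(6):
    action = infer_goal_from_state(window_is_open)
    window_is_open = not window_is_open  # executing the action flips the state
    print(f"step {step}: action={action}, window_is_open={window_is_open}")

# The printed actions alternate open/close/open/close... -- the agent never
# settles, because "open the window" and "close the window" are
# indistinguishable without some task identifier beyond the state itself.
```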
Inferring local goals within various contexts might be an important step towards general AI, but that is NOT the aim of the Meta-World benchmark, as you claim.