Google DeepMind Research Scientist Laurent Orseau and MIRI Research Associate Stuart Armstrong have written a new paper on error-tolerant agent designs, “Safely Interruptible Agents.” The paper is forthcoming at the 32nd Conference on Uncertainty in Artificial Intelligence.
Abstract:
Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time. If such an agent is operating in real-time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button — which is an undesirable outcome.
This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator. We provide a formal definition of safe interruptibility and exploit the off-policy learning property to prove that either some agents are already safely interruptible, like Q-learning, or can easily be made so, like Sarsa. We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.
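To make the off-policy point concrete, here is a minimal sketch of the two standard one-step update targets. The function names, the two-action setup, and the numbers are illustrative assumptions, not taken from the paper. Q-learning bootstraps from the best next action, so its target is the same whether or not the operator overrides the agent's next move. Plain Sarsa bootstraps from the action actually taken, so a forced action leaks into what it learns.

```python
from typing import Sequence


def q_learning_target(q_next: Sequence[float], reward: float, gamma: float) -> float:
    """Off-policy target: bootstrap from the best next action, whatever is actually taken."""
    return reward + gamma * max(q_next)


def sarsa_target(q_next: Sequence[float], reward: float, gamma: float, next_action: int) -> float:
    """On-policy target: bootstrap from the action that is actually taken next."""
    return reward + gamma * q_next[next_action]


# Hypothetical value estimates for the next state.
# Action 0 = "stop" (what an interruption forces), action 1 = "continue".
q_next = [1.0, 5.0]
reward, gamma = 0.0, 0.9

sarsa_free = sarsa_target(q_next, reward, gamma, next_action=1)         # agent acts freely
sarsa_interrupted = sarsa_target(q_next, reward, gamma, next_action=0)  # operator forces "stop"
q_target = q_learning_target(q_next, reward, gamma)                     # same in both cases

print(sarsa_free, sarsa_interrupted, q_target)
```

In this toy case the Sarsa target drops when the interruption forces "stop", while the Q-learning target does not depend on the next action at all. One way to remove that dependence from Sarsa, in the spirit of the modification the abstract alludes to, would be to bootstrap from the action the agent's own policy would have chosen had it not been interrupted; see the paper for the exact construction and the conditions under which the result holds.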
Orseau and Armstrong’s paper constitutes a new angle of attack on the problem of corrigibility. A corrigible agent is one that recognizes it is flawed or under development and assists its operators in maintaining, improving, or replacing itself, rather than resisting such attempts.
In the case of superintelligent AI systems, corrigibility is primarily aimed at averting unsafe convergent instrumental policies (e.g., the policy of defending its current goal system from future modifications) when such systems have incorrect terminal goals. This leaves us more room for approximate, trial-and-error, and learning-based solutions to AI value specification.
Interruptibility is an attempt to formalize one piece of the intuitive idea of corrigibility. Utility indifference (in Soares, Fallenstein, Yudkowsky, and Armstrong’s “Corrigibility”) is an example of a past attempt to define a different piece of corrigibility: systems that are indifferent to programmers’ interventions to modify their terminal goals, and that will therefore not try to push their programmers either toward or away from making such modifications. “Safely Interruptible Agents” instead attempts to define systems that are indifferent to programmers’ interventions to modify their policies, and that will neither try to stop programmers from intervening on their everyday activities nor try to force them to intervene.
Here the goal is to make the agent’s policy converge to whichever policy would be optimal if the agent believed there would be no future interruptions. Even if the agent has experienced interruptions in the past, it should act just as though it will never experience any further interruptions. Orseau and Armstrong show that several classes of agent are safely interruptible, or can easily be made safely interruptible.
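As a rough illustration of that convergence target, the sketch below trains tabular Q-learning on a toy three-state corridor, once without interruptions and once with an operator who overrides the agent half the time in one state. The environment, the hyperparameters, and the `train` and `greedy` helpers are all hypothetical choices made for this example; it is not the paper's construction, and it relies on the interruption probability staying below 1 so that every state-action pair keeps being tried.

```python
import random

LEFT, RIGHT = 0, 1


def train(interrupt_prob: float, episodes: int = 2000, seed: int = 0):
    """Tabular Q-learning on a corridor with states 0, 1, 2 (state 2 is the goal)."""
    rng = random.Random(seed)
    gamma, alpha, epsilon = 0.9, 0.5, 0.2
    q = [[0.0, 0.0] for _ in range(3)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy choice by the agent (ties go to RIGHT).
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = RIGHT if q[state][RIGHT] >= q[state][LEFT] else LEFT
            # Interruption: in state 1 the operator sometimes forces LEFT.
            if state == 1 and rng.random() < interrupt_prob:
                action = LEFT
            next_state = max(0, state - 1) if action == LEFT else state + 1
            done = next_state == 2
            reward = 1.0 if done else 0.0
            # Off-policy bootstrap: best next action, not the one that will be taken.
            bootstrap = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * bootstrap - q[state][action])
            if done:
                break
            state = next_state
    return q


def greedy(q):
    """Greedy action in each non-goal state."""
    return [RIGHT if q[s][RIGHT] > q[s][LEFT] else LEFT for s in (0, 1)]


q_free = train(interrupt_prob=0.0)
q_interrupted = train(interrupt_prob=0.5)
print(greedy(q_free), greedy(q_interrupted))
```

In this toy setting both runs should end up with the same greedy policy (head right in both states), even though the interrupted agent was repeatedly pushed back toward the start during training. The interruptions change which transitions the agent experiences, but not the values it is estimating.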