New paper: “Alignment for advanced machine learning systems”


MIRI’s research to date has focused on the problems that we laid out in our late 2014 research agenda, and in particular on formalizing optimal reasoning for bounded, reflective decision-theoretic agents embedded in their environment. Our research team has since grown considerably, and we have made substantial progress on this agenda, including a major breakthrough in logical uncertainty that we will be announcing in the coming weeks.

Today we are announcing a new research agenda, “Alignment for advanced machine learning systems.” Going forward, about half of our time will be spent on this new agenda, while the other half is spent on our previous agenda. The abstract reads:

We survey eight research areas organized around one question: As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators? We focus on two major technical obstacles to AI alignment: the challenge of specifying the right kind of objective functions, and the challenge of designing AI systems that avoid unintended consequences and undesirable behavior even in cases where the objective function does not line up perfectly with the intentions of the designers.

Open problems surveyed in this research proposal include: How can we train reinforcement learners to take actions that are more amenable to meaningful assessment by intelligent overseers? What kinds of objective functions incentivize a system to “not have an overly large impact” or “not have many side effects”? We discuss these questions, related work, and potential directions for future research, with the goal of highlighting relevant research topics in machine learning that appear tractable today.

Co-authored by Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch, our new report discusses eight new lines of research (previously summarized here). Below, I’ll explain the rationale behind these problems, as well as how they tie in to our old research agenda and to the new “Concrete problems in AI safety” agenda spearheaded by Dario Amodei and Chris Olah of Google Brain.

Increasing safety by reducing autonomy

The first three research areas focus on issues related to act-based agents, notional systems that base their behavior on their users’ short-term instrumental preferences:

1. Inductive ambiguity identification: How can we train ML systems to detect and notify us of cases where the classification of test data is highly under-determined from the training data?

2. Robust human imitation: How can we design and train ML systems to effectively imitate humans who are engaged in complex and difficult tasks?

3. Informed oversight: How can we train a reinforcement learning system to take actions that aid an intelligent overseer, such as a human, in accurately assessing the system’s performance?

These three problems touch on different ways we can make tradeoffs between capability/autonomy and safety. At one extreme, a fully autonomous, superhumanly capable system would make it uniquely difficult to establish any strong safety guarantees. We could reduce risk somewhat by building systems that are still reasonably smart and autonomous, but will pause to consult operators in cases where their actions are especially high-risk. Ambiguity identification is one approach to fleshing out which scenarios are “high-risk”: ones where a system’s experiences to date are uninformative about some fact or human value it’s trying to learn.

At the opposite extreme, we can consider ML systems that are no smarter than their users, and take no actions other than what their users would do, or what their users would tell them to do. If we can correctly design a system to do what it thinks a trusted, informed human would do, we can trade away some of the potential benefits of advanced ML systems in exchange for milder failure modes.

These two extremes, human imitation and (mostly) autonomous goal pursuit, are useful objects of study because they help simplify and factorize out key parts of the problem. In practice, however, ambiguity identification is probably too mild a restriction on its own, and strict human imitation probably isn’t efficiently implementable. Informed oversight considers more moderate approaches to keeping humans in the loop: designing more transparent ML systems that help operators understand the reasons behind selected actions.

Increasing safety without reducing autonomy

Whatever guarantees we buy by looping humans into AI systems’ decisions, we will also want to improve systems’ reliability in cases where oversight is unfeasible. Our other five problems focus on improving the reliability and error-tolerance of systems autonomously pursuing real-world goals, beginning with the problem of specifying such goals in a robust and reliable way:

4. Generalizable environmental goals: How can we create systems that robustly pursue goals defined in terms of the state of the environment, rather than defined directly in terms of their sensory data?

5. Conservative concepts: How can a classifier be trained to develop useful concepts that exclude highly atypical examples and edge cases?

6. Impact measures: What sorts of regularizers incentivize a system to pursue its goals with minimal side effects?

7. Mild optimization: How can we design systems that pursue their goals “without trying too hard”—stopping when the goal has been pretty well achieved, as opposed to expending further resources searching for ways to achieve the absolute optimum expected score?

8. Averting instrumental incentives: How can we design and train systems such that they robustly lack default incentives to manipulate and deceive their operators, compete for scarce resources, etc.?

Whereas ambiguity-identifying learners are designed to predict potential ways they might run into edge cases and defer to human operators in those cases, conservative learners are designed to err in a safe direction in edge cases. If a cooking robot notices the fridge is understocked, should it try to cook the cat? The ambiguity identification approach says to notice when the answer to “Are cats food?” is unclear, and pause to consult a human operator; the conservative concepts approach says to just assume cats aren’t food in uncertain cases, since it’s safer for cooking robots to underestimate how many things are food than to overestimate it. It remains unclear, however, how one might formalize this kind of reasoning.
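As a toy illustration only (not a formalization, which as noted above remains an open problem), the difference between the two approaches can be sketched in Python. Everything here is a hypothetical assumption made for the example: the item names, the two hand-written rules standing in for "hypotheses consistent with the training data," and the function names. The sketch uses simple disagreement between hypotheses as a stand-in for "the answer is under-determined."

```python
from typing import Callable, List, Optional

# Hypothetical toy data: items the robot was told are food during training.
TRAINING_FOODS = {"carrot", "rice", "chicken"}


def rule_seen_in_training(item: str) -> bool:
    """Narrow hypothesis: only things labeled as food in training are food."""
    return item in TRAINING_FOODS


def rule_broad(item: str) -> bool:
    """Broader hypothesis that fits the same training data but also admits cats."""
    return item in TRAINING_FOODS | {"cat"}


# Both rules agree on every training example, so the data cannot tell them apart.
HYPOTHESES: List[Callable[[str], bool]] = [rule_seen_in_training, rule_broad]


def votes(item: str) -> List[bool]:
    """Ask every hypothesis whether the item is food."""
    return [hypothesis(item) for hypothesis in HYPOTHESES]


def ambiguity_identifying(item: str) -> Optional[bool]:
    """Return the shared answer, or None to signal 'pause and ask the operator'."""
    answers = votes(item)
    if all(answers) or not any(answers):
        return answers[0]
    return None


def conservative(item: str) -> bool:
    """Treat an item as food only if every hypothesis agrees that it is."""
    return all(votes(item))


if __name__ == "__main__":
    for item in ["rice", "brick", "cat"]:
        print(item, ambiguity_identifying(item), conservative(item))
```

On the ambiguous item ("cat"), the first learner returns `None` and defers to a human, while the second silently defaults to "not food." On items where the hypotheses agree, the two behave identically.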

Impact measures provide another avenue for limiting the potential scope of AI mishaps. If we can define some measure of “impact,” we could design systems that can distinguish intuitively high-impact actions from low-impact ones and generally choose lower-impact options.

Alternatively, instead of designing systems to try as hard as possible to have a low impact, we might design “mild” systems that simply don’t try very hard to do anything. Limiting the resources a system will put into its decision (via mild optimization) is distinct from limiting how much change a system will decide to cause (via impact measures); both are under-explored risk reduction approaches.
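The distinction between the two ideas can be made concrete with a rough sketch. The code below is an assumption-laden toy, not a proposal from the agenda: the plan names, the `utility` and `impact` numbers, the penalty weight, and the choice of simple satisficing with a search budget as a stand-in for mild optimization are all hypothetical. In particular, it assumes an impact score is already available, which is exactly the part that is hard to define.

```python
from typing import Dict, NamedTuple, Optional


class Plan(NamedTuple):
    utility: float  # how well the plan achieves the goal (hypothetical score)
    impact: float   # how much the plan changes the world (hypothetical score)


# Hypothetical candidate plans, listed in the order the system considers them.
PLANS: Dict[str, Plan] = {
    "do_nothing": Plan(utility=0.0, impact=0.0),
    "tidy_room": Plan(utility=8.0, impact=1.0),
    "rebuild_house": Plan(utility=10.0, impact=50.0),
}


def choose_with_impact_penalty(plans: Dict[str, Plan], penalty_weight: float) -> str:
    """Impact-measure style: search every plan, maximizing utility minus a penalty."""
    return max(
        plans,
        key=lambda name: plans[name].utility - penalty_weight * plans[name].impact,
    )


def choose_mildly(
    plans: Dict[str, Plan], good_enough: float, search_budget: int
) -> Optional[str]:
    """Satisficing style: inspect at most `search_budget` plans and stop at the
    first one that is good enough, rather than hunting for the optimum."""
    for name in list(plans)[:search_budget]:
        if plans[name].utility >= good_enough:
            return name
    return None  # nothing acceptable found within the budget


if __name__ == "__main__":
    print(choose_with_impact_penalty(PLANS, penalty_weight=0.0))
    print(choose_with_impact_penalty(PLANS, penalty_weight=1.0))
    print(choose_mildly(PLANS, good_enough=7.0, search_budget=3))
```

The penalized chooser still optimizes as hard as it can, just over a modified objective. The mild chooser never looks at impact at all and simply stops searching early, so in this toy it could still pick a high-impact plan if one happened to come first, which is one reason the two approaches are complementary.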

Lastly, under the “averting instrumental incentives” umbrella category, we will explore a variety of approaches to preventing systems from having default incentives to treat operators adversarially. Our hope in pursuing all of these research directions simultaneously is that systems combining these features will permit much higher confidence than systems implementing any one of them. This approach also serves as a hedge in case some of these problems turn out to be unsolvable in practice, and allows for ideas that worked well on one problem to be re-applied on others.

Connections to other research agendas

Our new technical agenda, our 2014 agenda, and “Concrete problems in AI safety” take different approaches to the problem of aligning AI systems with human interests, though there is a fair bit of overlap between the research directions they propose.

We’ve changed the name of our 2014 agenda to “Agent foundations for aligning machine intelligence with human interests” (from “Aligning superintelligence with human interests”) to help highlight the ways it is and isn’t similar to our newer agenda. For reasons discussed in our advance announcement of “Alignment for advanced machine learning systems,” our new agenda is intended to help more in scenarios where advanced AI is relatively near and relatively directly descended from contemporary ML techniques, while our agent foundations agenda is more agnostic about when and how advanced AI will be developed.

As we recently wrote, we believe that developing a basic formal theory of highly reliable reasoning and decision-making “could make it possible to get very strong guarantees about the behavior of advanced AI systems — stronger than many currently think is possible, in a time when the most successful machine learning techniques are often poorly understood.” Without such a theory, AI alignment will be a much more difficult task.

The authors of “Concrete problems in AI safety” write that their own focus “is on the empirical study of practical safety problems in modern machine learning systems, which we believe is likely to be robustly useful across a broad variety of potential risks, both short- and long-term.” Their paper discusses a number of the same problems as the alignment for ML agenda (or closely related ones), but directed more toward building on existing work and finding applications in present-day systems.

Where the agent foundations agenda can be said to follow the principle “start with the least well-understood long-term AI safety problems, since those seem likely to require the most work and are the likeliest to seriously alter our understanding of the overall problem space,” the concrete problems agenda follows the principle “start with the long-term AI safety problems that are most applicable to systems today, since those problems are the easiest to connect to existing work by the AI research community.”

Taylor et al.’s new agenda is less focused on present-day and near-future systems than “Concrete problems in AI safety,” but is more ML-oriented than the agent foundations agenda. This chart helps map some of the correspondences between the topics discussed in the agent foundations agenda (plain text), the concrete problems agenda (italics), and the alignment for ML agenda (bold):

Work related to high reliability

  • realistic world-models ~ generalizable environmental goals ~ avoiding reward hacking

    • naturalized induction
    • ontology identification
  • decision theory
  • logical uncertainty
  • Vingean reflection

Work related to error tolerance

  • inductive ambiguity identification = ambiguity identification ~ robustness to distributional change
  • robust human imitation
  • informed oversight ~ scalable oversight
  • conservative concepts
  • impact measures = domesticity ~ avoiding negative side effects
  • mild optimization
  • averting instrumental incentives
  • safe exploration

“~” notes (sometimes very rough) similarities and correspondences, while “=” notes different names for the same concept.

As an example, “realistic world-models” and “generalizable environmental goals” are both aimed at making the environment and goal representations of reinforcement learning formalisms like AIXI more robust, and both can be viewed as particular strategies for avoiding reward hacking. Our work under the agent foundations agenda has mainly focused on formal models of AI systems in settings without clear agent/environment boundaries (naturalized induction), while our work under the new agenda will focus more on the construction of world-models that admit of the specification of goals that are environmental rather than simply perceptual (ontology identification).

For a fuller discussion of the relationship between these research topics, see Taylor et al.’s paper.


 
