Mitigating inner misalignment
For some context about the problem of inner misalignment, you may want to read this post from Evan Hubinger and others.
We can design a training process with a low likelihood of inner misalignment by explicitly integrating the base objective into the mesa optimizer. This can still lead to outer misalignment if our base objective is poorly designed. For example, in the diamond heist problem outlined in ELK, if our base objective really is determined by the presence of a diamond on the screen, then we can’t expect solving inner alignment to help us much. In some sense, we’re doomed in that scenario by our extremely limited choice of sensors, which aligns poorly with human experience (a point I will address later).
I claim that it is quite feasible to explicitly use the base objective in the mesa optimizer, and to do so in a way that produces systems as capable as the implicit alternative. Although this question doesn’t depend on solving outer alignment, I will assume outer alignment is solved so that inner alignment success is easier to judge: if we succeed, the result will be a mesa optimizer that does what we want.
In this scenario, we have a perfectly designed terminal value model that outputs terminal values from the sensory signals of the human designer (e.g. a neural interface that reads the designer’s visual, auditory, and other signals). This terminal value model, T, is our base objective and is aligned with the designer’s objective for the system. With T in hand, we then train an RL system using T as the reward.
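As a concrete (and heavily simplified) sketch, here is what such a training loop might look like when the reward comes directly from T. Everything here is a hypothetical stand-in: `T`, the policy `A`, the `env` interface, and the discrete action space are assumptions for illustration, not a prescribed implementation.

```python
import torch

def reinforce_step(A, T, env, optimizer, horizon=100, gamma=0.99):
    """One REINFORCE update where the terminal value model T supplies the reward."""
    obs = env.reset()                      # sensory signals, as fed to T
    log_probs, rewards = [], []
    for _ in range(horizon):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=A(obs_t))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, done = env.step(action.item())
        # The base objective itself is the reward: T scores the sensory
        # signals of the resulting state.
        rewards.append(float(T(torch.as_tensor(obs, dtype=torch.float32))))
        if done:
            break
    # Discounted returns, then the standard REINFORCE loss.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```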
This might fail in ways outlined in “Risks from Learned Optimization”, such as proxy alignment between the base and mesa objectives. For example, if we just train a policy network A (e.g. with REINFORCE), A could end up optimizing a pseudo-aligned mesa objective.
However, there is a whole class of model adjustments that we can make to reduce the likelihood of undesirable behavior on the deployment distribution, and all of these adjustments rely on explicitly leveraging the terminal value model, T, in conjunction with a world model.
In the simplest case, we introduce an instrumental value model, I, and a world model, W. I is trained to estimate, for a given state, the expected cumulative terminal value of future states if A is followed. W takes as input an environment state and an action, and predicts the next state. Instead of just sampling from A, as in the previous example, we sample from A a few times rooted at the current state, use W to predict the next state for each sampled action as S_{t+1} = W(S_t, a), and compute the total value of that action as T(S_{t+1}) + I(S_{t+1}). We then choose the action from this sampled set that yields the highest total value. We could also implement failsafe mechanisms that prevent execution of an action if it yields a dramatic decrease in total value. This would help prevent gross inner misalignment of A, because A would be checked by T on all distributions. There is still room for misalignment in I, but directly incorporating T into the mesa optimizer’s action generation reduces the extent to which the mesa optimizer can diverge on the deployment distribution, because its actions are at least partially bounded by T.
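A minimal sketch of this selection step, assuming discrete actions and hypothetical `A`, `W`, `T`, and `I` modules with the interfaces described above; the failsafe threshold is likewise an illustrative assumption.

```python
import torch

def select_action(A, W, T, I, state, n_samples=8, failsafe_drop=1.0):
    """Score a few candidate actions from A with T + I via the world model W."""
    current_value = float(T(state) + I(state))
    best_action, best_value = None, -float("inf")
    for _ in range(n_samples):
        action = torch.distributions.Categorical(logits=A(state)).sample()
        next_state = W(state, action)                    # world model prediction
        total_value = float(T(next_state) + I(next_state))
        if total_value > best_value:
            best_action, best_value = action, total_value
    # Failsafe: refuse to act if even the best sampled action predicts a
    # dramatic decrease in total value relative to the current state.
    if best_value < current_value - failsafe_drop:
        return None                                      # defer to a human / no-op
    return best_action
```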
Extending this, one could implement multistep actions and inferences in a Monte Carlo tree search (MCTS) architecture. At inference time, one could sample from A, predict the next state with W, and predict the total value of that state using I and T as described in the previous example. We could then repeat this process at the next state, producing a rollout in which trajectories sampled from the policy network are evaluated for their expected total value (in part by directly calling T), and these values are propagated backward through time to the current node. This would reduce the risk of misalignment in I. For example, if I were misaligned and tried to rank a certain state highly because it was instrumental to some other mesa objective, that ranking would be checked directly against T by rolling out to the next state and evaluating the terminal value of the base objective there. Such an approach explicitly incorporates T at arbitrarily distant states into the current action generation, on all distributions, and thus would also greatly reduce the chances of inner misalignment.
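Below is a simplified, depth-limited rollout in the spirit of this MCTS-style evaluation (a full MCTS would also track visit counts and exploration bonuses). The names, the discrete action space, and the discounting are assumptions for illustration.

```python
import torch

def rollout_value(A, W, T, I, state, depth, gamma=0.99):
    """Roll the policy A forward through W, summing T at every visited state
    and bootstrapping with I at the leaf."""
    if depth == 0:
        return T(state) + I(state)
    action = torch.distributions.Categorical(logits=A(state)).sample()
    next_state = W(state, action)
    # T is consulted at every step, so a misaligned I cannot quietly inflate
    # the value of states that the base objective rates poorly.
    return T(state) + gamma * rollout_value(A, W, T, I, next_state, depth - 1, gamma)

def select_action_by_rollout(A, W, T, I, state, n_candidates=8, depth=10):
    """Pick the root action whose rollout accumulates the highest total value."""
    candidates = []
    for _ in range(n_candidates):
        action = torch.distributions.Categorical(logits=A(state)).sample()
        next_state = W(state, action)
        candidates.append((float(rollout_value(A, W, T, I, next_state, depth)), action))
    return max(candidates, key=lambda c: c[0])[1]
```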
With the world model, we could also drastically expand the space of states over which we evaluate the model for inner misalignment, because we are no longer limited to data collected from the real world. We can randomly initialize as many states as is computationally feasible, and then evaluate the inner alignment of the mesa optimizer by having it generate actions, using W to infer the next states, and then calling T on those next states to evaluate the degree of satisfaction of the base objective.
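A sketch of such an audit, assuming states can be represented as fixed-size vectors that `A`, `W`, and `T` accept directly; the state dimension and flagging threshold are placeholders.

```python
import torch

def audit_random_states(A, W, T, n_states=10_000, state_dim=64, threshold=0.0):
    """Stress-test the mesa optimizer on synthetic states: act, predict the
    next state with W, score it with T, and flag low-scoring outcomes."""
    flagged = []
    for _ in range(n_states):
        state = torch.randn(state_dim)               # synthetic state, not real data
        action = torch.distributions.Categorical(logits=A(state)).sample()
        score = float(T(W(state, action)))           # base-objective value of the outcome
        if score < threshold:
            flagged.append((state, action, score))   # candidate inner-alignment failure
    return flagged
```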
With the world model, we could also optimize directly over state space using gradient descent to search for inner misalignment. If T, W, and A are differentiable, we could solve argmin_S T(W(S, A(S))) to identify states for which the actions generated by A yield low performance on the base objective.
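A sketch of this search, assuming `A`, `W`, and `T` are differentiable PyTorch modules and that a softmax over A’s logits serves as a differentiable action encoding (a sampled discrete action would break differentiability); all names and dimensions are illustrative.

```python
import torch

def find_adversarial_state(A, W, T, state_dim=64, steps=500, lr=1e-2):
    """Gradient descent on the state itself to minimize T(W(S, A(S)))."""
    S = torch.randn(state_dim, requires_grad=True)
    optimizer = torch.optim.Adam([S], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        action = torch.softmax(A(S), dim=-1)     # differentiable action encoding
        value = T(W(S, action))                  # base-objective value of the outcome
        value.sum().backward()                   # lower T means a worse outcome for the base objective
        optimizer.step()                         # only S is updated, not the models
    return S.detach()
```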
Let’s walk through a potentially problematic scenario and why this approach may shine there. Consider the case where the mesa optimizer produces some sequence of actions that yields a set of sensory signals the designer rates highly, but would rate lower if he had some extra information about the world. Human preferences can only be based on human observations, and thus a preference is only relevant to the extent that it can be observed. So in the case just described, where the designer rates a set of observations highly but would rate them lower given some missing observation, that dispreference must be rooted in some future observations that he does not prefer.

To make this concrete, consider a slightly different version of the diamond vault scenario from ELK under the revised training process described here (including the model’s observation modalities, the perfect terminal value model, and the mesa optimizer that incorporates this terminal model). There is still a single camera into the vault, and this camera is what the human uses to examine the diamond. To attempt to construct the same failure mode as in ELK, we could have the mesa optimizer pin an image in front of the camera without the designer noticing, and he would rate this state highly. However, his actual dispreference for this state exists only insofar as future states that depend on the diamond being in the vault are made less likely by it. Therefore, a more capable mesa optimizer, one that can anticipate further and more accurately into the future, would anticipate a future state in which the designer observes that he can no longer sell the diamond because he does not possess it, and would correctly determine that pinning an image to the vault camera is not part of the optimal policy. In particular, the MCTS approach described above, with a long enough rollout horizon, would eventually find a future state that T rates very low, and thus would avoid pinning the image to the vault camera.