I’m still getting better at moving my half-formed writing from my personal notes into something presentable to the public. Here’s what I have so far.
-
Mitigating inner misalignment
What happens when gradient descent (an optimizer) produces a model that is itself an optimizer (i.e. a mesa-optimizer)? How can we ensure alignment between the base optimizer objective and the mesa-optimizer objective?
-
Proving the absence of deceptive alignment
An answer I gave to an interview question about designing a model that avoids deceptive alignment.
-
Absolute morality and AI freedom
Are we humans satisfied with another agent successfully achieving its goals (assuming they don't interfere with ours) if those goals are not human goals?
-
Considerations for building AGI - from a newbie to the field
I wrote this in 2023, when I was first gaining exposure to ideas around AGI and alignment, as part of an assignment for an engineering ethics class.