I’m still getting better at moving my half-formed writing from my personal notes into something presentable to the public. Here’s what I have so far.
-
Mitigating inner misalignment
What happens when gradient descent (an optimizer) produces a model that is itself an optimizer (i.e. a mesa-optimizer)? How can we ensure alignment between the base optimizer objective and the mesa-optimizer objective?
-
Proving the absence of deceptive alignment
An answer I gave to an interview question about designing a model that avoids deceptive alignment.
-
Absolute morality and AI freedom
Are we humans satisfied with another agent successfully achieving its goals (assuming they don't interfere with ours) if those goals are not human goals?
-
Considerations for building AGI - from a newbie to the field
I wrote this in 2023, when I was first gaining exposure to ideas around AGI and alignment, as part of an assignment for an engineering ethics class.