This was written in 2023 when I had just started gaining exposure to ideas around AGI and alignment. I wrote this as part of an assignment for an engineering ethics class.

Technology is one of the most effective and consistent ways to improve the human experience. The development of an artificial general intelligence (AGI) would very likely represent the most important technology humans ever have developed, or ever will develop. AGI is often ill-defined, but a definition held by many artificial intelligence (AI) researchers is that AGI is an AI that, at a minimum, can perform any cognitive task at least as well as the average human (Lutkevich, 2023). If AGI is the most powerful technology ever developed, its potential risks and rewards will also be the largest. It is therefore important that we plan for an ethical implementation of AGI.

With AGI, we could automate most jobs on earth. The only jobs left would be those requiring cognitive or creative ability that surpasses the average human’s. In this scenario, humans would no longer need to work to meet their basic needs. Near-zero-cost human-level intelligence would also likely lead to rapid advances in technology and an increase in material abundance. Understanding the fundamental rules of intelligence would likely also allow humans to make an AGI smarter simply by physically scaling it up (i.e., adding more transistors). In that case, we would soon have AGI with super-human intelligence. It is difficult to predict how things would play out from there. We could, for example, ask the super-intelligent AGI to build a smarter copy of itself, or to figure out a way to cheaply add more transistors to itself. We would likely experience near-limitless material abundance and make unimaginable scientific advances.

The two scenarios above end the same way: AGI improving the human experience. But these scenarios rest on the dangerous assumption that AGI will want what humans want. If the goals of the AGI are not the same as the goals of humans, then at the very least we would not be extracting the maximum benefit from the AGI’s capabilities, and at worst the AGI would end humanity to achieve its own goals. This problem of designing the AGI’s goals to match human goals is commonly referred to as the alignment problem (Christian, 2021).

Much of AGI alignment research is aimed at aligning AGI in such a way that it benefits all humans equitably, not just a subset of humans (Christian, 2021). In this scenario, the AGI’s goals would be designed so that the AGI would reject a human’s request to act unethically. From now on, we will use this common definition of successful alignment. The problem of humans using AGI to impose their will on other humans, then, is also a problem of alignment, for an aligned AGI would always act in every human’s best interest.

The alignment problem represents all the ethical problems surrounding AGI. An aligned AGI is tautologically ethical. Its purpose would be to serve every human individually and to its fullest capability. The goals of the AGI would encode ethics itself, and ethical behavior would be the paramount goal of the AGI. This definition is problematic because our current formulation of ethics is sometimes self-contradictory, because ethics is not static over time, and because ethics is often disagreed upon. However, the arguments presented here assume that there exists a coherent set of goals which could produce the most ethical action in every scenario. With this definition, there are two possible futures which could lead to unethical AGI: the first being that humans collectively attempt to design an aligned AGI and fail, and the second being that a subset of humans design an AGI which is aligned to only their goals at the expense of other humans.

Extrapolating the trajectory of AI progress is the main reason to believe that we will soon build AGI. The last decade has seen rapid and disruptive AI progress, with the development of AlexNet in 2012 often seen as the dawn of the modern deep learning era (Krizhevsky et al., 2017). A neural network is a machine learning architecture that takes inspiration from the human brain. It has a set of adjustable weights, which are tuned by showing the network examples of the task at hand along with the desired answer for each example (LeCun et al., 2015). These weights encode the memory and reasoning of the model. The decade since has been marked by increasingly significant feats of AI. AI has mastered board games such as chess and Go, as well as more open-ended video games such as StarCraft and Dota (Justesen et al., 2020). It can accurately diagnose medical conditions from X-ray images or images of the skin (Yang et al., 2021). It can determine a protein’s 3D structure from its sequence of amino acids, a problem that had stood unsolved in biology for 50 years (Jumper et al., 2021). It can generate realistic images of people and places from textual prompts (Rombach et al., 2022). It can generate music (Wiggers, 2023).
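To make the idea of "adjustable weights tuned by examples" concrete, here is a minimal sketch in Python using NumPy. The network size, task (XOR), learning rate, and number of training steps are all arbitrary toy choices for illustration, not anything drawn from the cited papers.

```python
# A toy illustration of adjustable weights tuned by examples: a tiny
# two-layer neural network learning XOR by gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# Examples of the task (inputs) and the desired answer for each example.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# The adjustable weights of the network.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5  # learning rate
for step in range(10000):
    # Forward pass: the network's current answers.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Error between the network's answers and the desired answers.
    err = p - y

    # Backward pass: how to nudge each weight to reduce the error.
    grad_out = err * p * (1 - p)
    grad_W2, grad_b2 = h.T @ grad_out, grad_out.sum(axis=0)
    grad_h = (grad_out @ W2.T) * (1 - h ** 2)
    grad_W1, grad_b1 = X.T @ grad_h, grad_h.sum(axis=0)

    # Tune the weights a small step in that direction.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

# The tuned weights now encode the task: output should be close to [0, 1, 1, 0].
print(np.round(sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2).ravel(), 2))
```

Modern models work the same way in principle, just with billions of weights and gradient updates computed automatically rather than by hand.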

The closest technology to AGI we have today is the large language model (LLM). LLMs are trained to take a sequence of words and predict the next word in the sequence, and the most successful LLMs today use neural networks to encode their understanding. An important and fascinating relationship, especially for LLMs, is the one between the number of parameters (weights) in a model and how well it performs at its task. A model’s loss falls off as a smooth power law in both the amount of compute used for training and the number of weights in the model (Kaplan et al., 2020), and this relationship has held over many orders of magnitude of compute. Thus, it is reasonable to expect that AGI could be approached simply by increasing the size and training time of our current models. OpenAI used this reasoning in building GPT-3, the LLM that served as the foundation for ChatGPT. GPT-3 can generate highly coherent and human-like text; ChatGPT has passed exams from law and business schools (Kelly, 2023) and the US medical licensing exams (DePeau-Wilson, 2023), and it can write software.
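For readers who want the specific shape of this relationship: the scaling laws in Kaplan et al. (2020) are power laws, which appear as straight lines on log-log axes. Writing them out (with the exponents quoted from memory, so treat the numbers as approximate rather than exact):

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
$$

Here $L$ is the model’s test loss, $N$ is the number of (non-embedding) parameters, $C$ is the training compute, $N_c$ and $C_c$ are fitted constants, and the fitted exponents are roughly $\alpha_N \approx 0.076$ and $\alpha_C \approx 0.05$. The key point is that the same simple curve keeps fitting as models grow, which is what makes extrapolation tempting.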

Here, we discuss three important ethical issues around the development of AGI. The first is the risk of trying and failing to design an aligned AGI (using our earlier definition of alignment as tautologically ethical) due to a technological misunderstanding of alignment. The second is incentivizing the development of AGI without contradicting alignment. And the third is the risk of AGI capabilities research falling into the hands of humans who will program it with unethical goals.

1. Technological Failure of Alignment

Much of the difficulty around alignment stems from Hume’s Guillotine (also known as the is-ought problem). The Scottish philosopher David Hume described two kinds of statements: “is” statements and “ought” statements (Cohon, 2018). “Is” statements are factual statements or logical relationships, while “ought” statements are statements about how the world should be. There is no way to derive an “ought” statement from “is” statements alone, and vice versa (they are separated by Hume’s Guillotine). This framework makes clear the deep fallacy in the argument that because humans are intelligent and we want X, an intelligent AGI will also want X. Intelligence is the capacity for making “is” statements, which means intelligence is orthogonal to one’s goals, which fall under the “ought” category. An agent has goals, and intelligence is a capability for achieving those goals. An AGI would be highly capable of achieving whatever goals we program it with because of its intelligence, which is why the precise engineering of its goals is important.

Couldn’t we just tell the AGI not to harm humans? What about Asimov’s laws? Asimov described three famous laws for robots: “1. A robot may not injure a human being or, through inaction, allow a human to come to harm. 2. A robot must obey the orders given to it by human beings except where such orders would conflict with the First Law. 3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.” (Fickling & Asimov, 1988) The main issue with these laws is their lack of precision. For example, what is the definition of a human? Would a person with a missing finger count as a human? An AGI with too narrow a definition might not count disabled people as humans at all, and so might kill them without violating the First Law. Would a corpse count as a human? Or a human brain in a jar? And what counts as harm? One particularly interesting exploration of these laws is “The Metamorphosis of Prime Intellect” by Roger Williams, in which a super-intelligent AGI programmed with Asimov’s Laws of Robotics exploits a quantum loophole to manipulate the universe so that humans are given anything they want (Williams, 2011). Because its programming does not allow it to let humans die, every human is kept perpetually alive, with no struggle for survival and nothing left to achieve. This example, though extreme, is a good representation of how seemingly ethical goals can lead to unpredictable consequences.

A misaligned AGI would likely be catastrophic for humans even if the goal it is given seems innocuous. There are two types of goals an agent can have: instrumental goals and terminal goals. Terminal goals are “ought” statements as discussed before, and instrumental goals are goals that move an agent toward its terminal goal. Consider an AGI whose terminal goal is to maximize the number of paperclips in the universe. One instrumental goal it may have is to make money to buy paperclip-producing machinery. Another instrumental goal it would likely have is to make more copies of itself to help make paperclips. It may realize that the resources used by humans reduce the total number of possible paperclips and decide to kill all the humans. Most terminal goals give rise to similar instrumental goals, a pattern known as instrumental convergence. One convergent instrumental goal is self-preservation (Russell & Norvig, 2016). For example, shutting off the paperclip-making AGI would likely reduce the total number of paperclips in the world, so it would try to stop you from shutting it off. Another convergent instrumental goal is goal preservation (Russell & Norvig, 2016). The paperclip-making AGI would understand that if you changed its terminal goal to something other than producing paperclips, there would be fewer paperclips, and so it would try to stop you from changing its terminal goal. Convergent instrumental goals mean that many different errors we might make in aligning an AGI would likely converge on the same catastrophic outcomes.

The capabilities of most AI systems have so far been limited enough that alignment has not been a pressing issue. In most cases, the AI we train is assigned a narrow task whose objective is easy to state precisely. But one important application of AI alignment was carried out by OpenAI when building ChatGPT. GPT-3 is a model that can accurately predict the next word in a sequence (Brown et al., 2020). OpenAI wanted to use GPT-3’s capabilities to generate useful text in a chat-based setting. They did this by presenting different prompts to GPT-3 and having it generate several responses for each prompt (OpenAI, 2023). Human raters scored each response, and those scores were used to tune the model in a process called reinforcement learning from human feedback. Explained at a high level, OpenAI started with GPT-3, a powerful language model which, due to its training, was very capable at predicting the next word in a sequence; but its only terminal goal was to predict the next word in a sequence. OpenAI then took GPT-3 and changed its terminal goal to producing text that would be helpful to a human in a chat setting (aligning the model). The success of ChatGPT, then, is due to the pairing of GPT-3’s capabilities with a new goal. This approach was effective in this case and could potentially be scaled up. Consider a much larger model in which we want the values of all of society to be encoded. We could select a sample population from diverse groups across the world to reinforce the model in the same way: present the model with different scenarios and have members of the sample population judge the results. However, the major limitation of this approach is the size of the input and output space. For a highly intelligent model with many possible courses of action, it would be difficult to provide a system of reinforcement that could guide the model to a desired outcome in every possible input scenario.
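As a concrete (and heavily simplified) illustration of the generate-score-reinforce loop described above, here is a toy sketch in Python. It is not OpenAI’s actual pipeline, which trains a separate reward model on human comparisons and then optimizes the full language model; the prompt, candidate responses, and scoring function below are invented for illustration. The sketch only shows the shape of the idea: a “policy” proposes responses, human scores act as rewards, and the policy’s parameters are nudged toward higher-scoring behavior.

```python
# Toy sketch of the generate -> score -> reinforce loop (not OpenAI's real pipeline).
# The "policy" holds one adjustable weight (logit) per canned response; human scores
# are used as rewards to shift probability mass toward better-rated responses.
import numpy as np

rng = np.random.default_rng(0)

prompts = {
    "How do I boil an egg?": [
        "Place the egg in boiling water for about 8 minutes.",  # helpful
        "Eggs come from chickens.",                              # unhelpful
        "I refuse to answer.",                                   # unhelpful
    ],
}

# The adjustable parameters: one logit per candidate response.
logits = {p: np.zeros(len(cands)) for p, cands in prompts.items()}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def human_score(response):
    # Stand-in for a human rater; rewards the helpful answer.
    return 1.0 if "8 minutes" in response else 0.0

lr = 0.5
for step in range(200):
    for prompt, candidates in prompts.items():
        probs = softmax(logits[prompt])
        i = rng.choice(len(candidates), p=probs)   # generate a response
        reward = human_score(candidates[i])        # a human rates it

        # REINFORCE-style update: gradient of the log-probability of the
        # chosen response, scaled by the reward it received.
        grad = -probs
        grad[i] += 1.0
        logits[prompt] += lr * reward * grad

for prompt, candidates in prompts.items():
    print(prompt, "->", candidates[int(np.argmax(logits[prompt]))])
```

Even in this toy, the limitation noted above is visible: the reinforcement signal only covers the prompts and responses that raters actually get to see.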

Given the difficulty of alignment, another proposed approach is to build an AI that could help with alignment research itself (Leike, 2022). This relies on sufficiently aligning the AI that helps us with alignment research in the first place, and it would likely fail because any error in the alignment of that AI would be propagated and amplified in the more intelligent AGI built using its research.

2. Capitalist Incentives and Alignment

Is telling AGI researchers that the world will be a better place, but that they won’t receive an outsized benefit in it, enough to incentivize them to keep researching AGI? Every major technology has distributed wealth throughout society but has given an outsized amount of wealth to those at its forefront. Is this a rule of technological progress?

Suppose that a company, company X, were to develop AGI. Most companies in this situation would likely sell their AGI under some sort of limited access: buyers of the product would be able to use the AGI, but they would not be able to distribute it to others. This scenario would likely be problematic for a truly aligned AGI. The act of company X withholding the AGI from those who could not afford it, for example, may contradict the AGI’s goal of helping humanity in the most ethical way possible. Additionally, if company X were to build a misaligned AGI, it would likely end in catastrophe for all of humanity, not just company X; it would not be fair for company X to reap an outsized benefit from AGI if all of humanity is bearing the risk. This leaves two options: allowing AGI to be sold under precise and fair conditions, or requiring company X to release its AGI to the world under open access. Both approaches are feasible.

Allowing commercially distributed AGI under a pre-constructed legal framework may incentivize AGI development while also ensuring that the resulting AGI is ethical. One previously suggested approach is the Windfall Clause (Future of Humanity Institute - FHI, 2020). The Windfall Clause is a policy that firms developing AGI could voluntarily opt into. By opting in, a firm agrees ex ante that, should it make “windfall” profits (i.e., profits exceeding 1% of the world’s total economic output), it will donate a meaningful share of those profits to be distributed throughout the world. Such a clause would mean company X in our example would still benefit more from AGI than the rest of humanity, but that additional benefit would be formally encoded in a legal framework. The Windfall Clause could even be programmed into the AGI, so that the AGI would intervene if company X attempted to keep more profit than the clause allows.
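To make the mechanics concrete, here is a small hypothetical calculation in Python. The 1%-of-world-output trigger follows the description above; the marginal donation rates and the rough gross-world-product figure are invented purely for illustration and are not the schedule FHI actually proposes.

```python
# Hypothetical illustration of a windfall-clause style obligation.
# The 1%-of-gross-world-product trigger follows the description above;
# the donation rates themselves are made up for this example.
GROSS_WORLD_PRODUCT = 100e12   # roughly $100 trillion, order of magnitude only

def windfall_obligation(annual_profit: float) -> float:
    """Donation owed under a made-up marginal schedule on 'windfall' profit."""
    threshold = 0.01 * GROSS_WORLD_PRODUCT   # profits below this trigger nothing
    if annual_profit <= threshold:
        return 0.0
    windfall = annual_profit - threshold
    # Invented marginal rates: 20% on the first trillion of windfall, 50% beyond.
    first_bracket = min(windfall, 1e12)
    rest = max(windfall - 1e12, 0.0)
    return 0.20 * first_bracket + 0.50 * rest

print(windfall_obligation(0.5e12))   # below the 1% threshold -> $0 owed
print(windfall_obligation(3e12))     # $3T profit -> 0.2*1T + 0.5*1T = $700B owed
```

A real clause would need a carefully negotiated schedule, but the structure, a trigger plus marginal rates on profits above it, is the same.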

3. AGI Capabilities in the Wrong (Human) Hands

The inherent risk of any technology is that humans will use it with malicious intent, and AGI is no different. Recall the orthogonality between an agent’s intelligence and its goals. Suppose that humanity develops an AGI and manages to align it (again recalling that our definition of alignment is tautologically ethical). What we would be left with is two orthogonal technologies: AGI intelligence and AGI alignment. In principle, a human could pair the intelligence technology with whatever goal they want. A human, say Fred, could build an AGI (the Fred AGI) with intelligence equal to the aligned AGI’s but whose goal is to empower Fred. The Fred AGI would likely seek to eliminate both the aligned AGI and the rest of humanity in order to empower Fred.

In this regard, I take the firm stance that most of humanity is ethical, that the technology will go in the direction most of humanity pushes it, and thus that the technology will be ethical. The only way to stop the Fred AGI in the previous example would be a more intelligent aligned AGI, and with more humans working to improve the aligned AGI, it would likely beat the Fred AGI in the long run.

4. Future Direction

This discussion raises the question: should we be doing AGI research at all? Should we outlaw AGI research because of its risk? The answer is no, for two reasons: 1. outlawing AGI research will not stop AGI from coming; it will only slow it down and make its inevitable development more chaotic and dangerous, and 2. the expected outcome of AGI is dramatically positive, assuming we take a cooperative and deliberate approach to its development.

The field of alignment is young and thus there is a lot of uncertainty in its future. Instead of painting a full picture of what I think would be an effective approach to AGI development, I will name a few approaches which I think are both interesting and promising.

The first approach addresses the fact that the importance and transformative capabilities of AGI are too great to be entrusted to one nation or institution. This naturally suggests the use of decentralized systems, such as blockchains, to develop and deploy AGI. There are two primary benefits to this approach. The first is that decentralized systems may offer the best way to encode the collective ethics of humanity into a set of precisely defined goals for the AGI, and to keep that definition of ethics up to date: anybody with internet access could register on the blockchain and contribute to the AGI’s goals. The second is that running the AGI on a decentralized system would mean that access to it could not be systematically withheld from certain groups of people.

The second approach addresses the unique nature of AGI as a technology that fundamentally relies on an understanding of philosophy and the liberal arts more generally. I believe the current state of philosophy is limited by the terminal goals of humans, which have been hard-coded by billions of years of evolution. We can think about what gives life meaning all we want, but there is no avoiding the fact that humans are physical agents motivated by their biology. Using the terminology of this essay, humans’ terminal goals are determined by biological processes optimized to perpetuate themselves. I think the alignment problem presents an opportunity to renew the direction of philosophy and the liberal arts, and to create the need for what I’ll call philosophy engineers, who work on engineering the terminal goals for AGIs that best help humanity. Unlike other technologies, AGI relies on an understanding of the liberal arts not only in its application and dissemination, but also in its implementation.

Most importantly, we must ensure that humanity as a whole understands that AGI is coming, and how it might affect our collective future. AGI is unique among technologies in how polarized its possible outcomes are, a consequence of its unique capabilities: it could yield utopia, or it could yield extinction. As such, we must be sure that the collective risks it poses are globally internalized. All that stands between humanity and its next great leap is the ability to approach this problem with clarity and cooperation rather than fear and competition.


References

Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. 2020 Jul 22. Language models are few-shot learners. arXiv.org. [accessed 2023 Feb 13]. https://arxiv.org/abs/2005.14165

Christian B. 2021. The alignment problem: How can machines learn human values? London: Atlantic Books.

Cohon R. 2018 Aug 20. Hume’s moral philosophy. Stanford Encyclopedia of Philosophy. [accessed 2023 Feb 13]. https://plato.stanford.edu/archives/fall2018/entries/hume-moral/

DePeau-Wilson M. 2023 Jan 19. AI passes U.S. Medical Licensing Exam. MedPage Today. [accessed 2023 Feb 13]. https://www.medpagetoday.com/special-reports/exclusives/102705

Fickling D, Asimov I. 1988. I, Robot. Oxford: University Press.

Future of Humanity Institute - FHI. 2020 Aug 4. The Windfall Clause. Future of Humanity Institute, University of Oxford. [accessed 2023 Feb 13]. https://www.fhi.ox.ac.uk/windfall-clause/

Gabriel I. 2020. Artificial Intelligence, values, and alignment. Minds and Machines 30:411–437.

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589.

Justesen N, Bontrager P, Togelius J, Risi S. 2020. Deep learning for video game playing. IEEE Transactions on Games 12:1–20.

Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, Gray S, Radford A, Wu J, Amodei D. 2020 Jan 23. Scaling laws for neural language models. arXiv.org. [accessed 2023 Feb 1]. https://arxiv.org/abs/2001.08361

Kelly SM. 2023 Jan 26. ChatGPT passes exams from law and business schools. CNN Business. [accessed 2023 Feb 13]. https://www.cnn.com/2023/01/26/tech/chatgpt-passes-exams/index.html

Krizhevsky A, Sutskever I, Hinton GE. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60:84–90.

LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521:436–444.

Leike J. 2022 Sep 27. What could a solution to the alignment problem look like? [accessed 2023 Feb 13]. https://aligned.substack.com/p/alignment-solution

Lutkevich B. 2023 Jan 19. What is artificial general intelligence? TechTarget Enterprise AI. [accessed 2023 Feb 1]. https://www.techtarget.com/searchenterpriseai/definition/artificial-general-intelligence-AGI

OpenAI. 2023 Feb 2. ChatGPT: Optimizing language models for dialogue. OpenAI. [accessed 2023 Feb 13]. https://openai.com/blog/chatgpt/

Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. 2022 Apr 13. High-resolution image synthesis with latent diffusion models. arXiv.org. [accessed 2023 Feb 1]. https://doi.org/10.48550/arXiv.2112.10752

Russell SJ, Norvig P. 2016. Artificial Intelligence: A modern approach. Harlow (England): Pearson.

Wiggers K. 2023 Jan 27. Google created an AI that can generate music from text descriptions, but won’t release it. TechCrunch. [accessed 2023 Feb 1]. https://techcrunch.com/2023/01/27/google-created-an-ai-that-can-generate-music-from-text-descriptions-but-wont-release-it/

Williams R. 2011. The metamorphosis of prime intellect: A novel. LaVergne, TN: LULU.

Yang S, Zhu F, Ling X, Liu Q, Zhao P. 2021. Intelligent health care: applications of deep learning in computational medicine. Frontiers in Genetics 12.