Hi everyone! I just finished a 15-part blog post series:
I think it hits on some themes that y’all would find interesting, and that have come up here on the HTM forum. (And in some cases, this forum is where I first learned about them!)
The primarily-neuroscience posts are 2–7 and 13, while the rest are more directly about AGI safety.
Here are post titles & summaries:
1. What’s the problem & Why work on it now?—Covers all the fun (non-neuroscience) basics like “What does AGI mean?”, “What does AGI safety mean?”, “What does brain-like AGI mean?”, “Why are we talking about AGI safety when AGI doesn’t even exist yet?”, and “You can’t seriously believe that trope about AGI sending robot armies to battle humans to extinction, like as a literal thing in real life, can you??”
2. “Learning from scratch” in the brain—I define “learning from scratch” as a particular family of computational systems / algorithms, a family that includes any ML algorithm initialized from random weights, and also includes pretty much anything you’d think of as a “memory system” (e.g. a blank hard disk drive). I argue that 96% of the human brain by volume—basically the telencephalon and cerebellum—“learns from scratch” in this sense. This post also touches on cortical uniformity, and its even-more-obscure cousins “allocortical uniformity”, “striatal uniformity”, “pallidal uniformity”, and “universal cerebellar transform”. (A toy code sketch of “learning from scratch” is included after the list below.)
3. Two subsystems: Learning & Steering—Following up the above, I claim that the brain is split into two subsystems, based on whether or not they “learn from scratch” as defined in Post #2: the “Learning Subsystem” (telencephalon & cerebellum) which “learns from scratch”, and the “Steering Subsystem” (hypothalamus & brainstem) which doesn’t. This two-subsystem take has some resemblance to triune brain theory, Jeff’s “New brain / Old brain” distinction, @Bitking’s “Dumb boss / smart advisor”, etc., but I think my version is a really nice clean conceptual distinction, and offers elegant insights not only into mammal brains but even fruit fly brains. I also include a section directly responding to Jeff’s argument in A Thousand Brains that AGI wouldn’t pose a risk of catastrophic accidents, and another section arguing that brain-like AGI is probably not centuries away but may arrive even in the next decade or two.
4. The “short-term predictor”—A “short-term predictor” takes a supervisory signal (a.k.a. “ground truth”) from somewhere, and uses a supervised learning algorithm to build a predictive model that anticipates that signal a short time (e.g. a fraction of a second) into the future. I talk about how these can be implemented in the brain, and a few of the functions that I think they serve, including my grand theory of the cerebellum. (A toy sketch of a short-term predictor is included after the list.)
5. The “long-term predictor”, and TD learning—I claim that you can take a short-term predictor, wrap it up into a closed loop involving a bit more circuitry, and wind up with a new module that I call a “long-term predictor”. The way it works is closely related to TD learning. I claim that there is a large collection of side-by-side long-term predictors in the brain, each comprising a short-term predictor in certain parts of the telencephalon (e.g. the amygdala, medial prefrontal cortex, and ventral striatum) that loops down to the brainstem, and then back up via a dopamine neuron. For example, one long-term predictor might predict whether I’ll feel pain in my arm, another whether I’ll get goosebumps, another whether I’ll release cortisol, and so on. (A toy TD-learning sketch follows the list.)
6. Big picture of motivation, decision-making, and RL—Here I fill in the last ingredients to get a whole big picture of motivation and decision-making in the brain. There’s also a section in which I argue against the common idea (close to what Jeff often says) that the Learning Subsystem is the home of ego-syntonic, internalized “deep desires”, whereas the Steering Subsystem is the home of ego-dystonic, externalized “primal urges”.
7. From hardcoded drives to foresighted plans: A worked example—I was concerned that Post #6 was too abstract, so here I work my way through a simple, concrete example: I ate a yummy cake a couple years ago, and now I want to eat that kind of cake again, and so I devise and execute a plan to make that happen. What’s happening under the surface during each step of this process, in the Post #6 model?
8. Takeaways from neuro 1/2: On AGI development—Given the discussion of neuroscience in Posts 2-7, how should we think about the software development process for brain-like AGI? Some relevant topics here include training time, the importance (and safety problems) of online learning, and whether we should expect programmers to do an outer-loop search analogous to evolution.
9. Takeaways from neuro 2/2: On AGI motivation—Given the discussion of neuroscience in Posts 2-7, what lessons do we learn about how the motivation of an AGI would work? I dive a bit into (what I call) “credit assignment” (i.e. changes to valence and other learned visceral reactions), the question of whether AGIs will want to wirehead, and, more generally, why an AGI will not necessarily be trying to maximize its future reward, along with a few other topics. (A toy credit-assignment sketch follows the list.)
10. The alignment problem—Suppose we have a particular thing that we want our AGI to be doing—“clean my house”, “invent a better solar cell”, or more simply “do whatever I would find most helpful”. How do we design the AGI such that it wants to do that particular thing, and not something totally different? This open problem is called “the alignment problem”. I discuss lots of the reasons that the problem seems hard: “Goodhart’s Law”, “Instrumental Convergence”, “Inner Alignment”, misinterpreted reward signals, wrong reward signals (e.g. rewarding the AGI for doing the right thing for the wrong reason), “ontological crises”, the AGI manipulating its own training process, and more. (A tiny Goodhart’s-Law example follows the list.)
11. Safety ≠ alignment (but they’re close!)—In my terminology, “AGI alignment” means that an AGI is trying to do things that the AGI designer had intended for it to be trying to do, while “AGI safety” is about what the AGI actually does, not what it’s trying to do. Safety and alignment can come apart in principle, but I argue that in practice, alignment is more-or-less necessary and sufficient for safety. For example, intuitively it seems that a simple solution to the problem of out-of-control AGIs is to build the AGI in an air-gapped box, and power it off if it tries anything funny. However, on closer examination, this “solution” turns out to be hopelessly inadequate.
12. Two paths forward: “Controlled AGI” and “Social-instinct AGI”—I suggest two broad research paths that might lead to aligned AGI. (1) In the “Controlled AGI” path, we try, more-or-less directly, to manipulate what the AGI is trying to do; (2) In the “Social-instinct AGI” path, our first step is to reverse-engineer some of the “innate drives” in the human Steering Subsystem (hypothalamus & brainstem), particularly the ones that underlie human social and moral intuitions. Next, we would presumably make some edits, and then install those “innate drives” into our AGIs. I talk about some relevant considerations, and conclude that we should pursue both research paths in parallel. I also talk about “life experience” a.k.a. training data, and why we can’t get safe AGI merely by raising it in a loving human family.
13. Symbol grounding & human social instincts—A key step in the “Social-instinct AGI” path would be reverse-engineering the circuits in the hypothalamus and brainstem that underlie human social instincts. I talk a bit about how these circuits might work, with a strong emphasis on the open question of how these circuits solve a certain “symbol grounding problem”, and end with a plea for more theoretical & experimental research.
14. Controlled AGI—Here I switch over to the “Controlled AGI” path mentioned above. I don’t currently see any path forward that seems promising to me for solving this problem, but I talk about some intriguing proto-ideas, like building systems to continually refine the AGI’s goals when it hits edge-cases, or building tools to directly make sense of an AGI’s giant world-model.
15. Conclusion: Open problems, how to help, AMA—I list 7 open problems from the series where I strongly endorse further research (two are traditional neuroscience, two are traditional CS, three are directly about AGI). Then I talk about practical aspects of doing AGI safety (a.k.a. AI alignment) research, including funding sources, connecting to the relevant research community, and where to learn more. I wrap up with some takeaway messages.
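To make a few of the concepts above a bit more concrete, here are some toy Python sketches. They’re my own illustrative glosses with made-up names and numbers, not code from the posts, and certainly not claims about how the brain literally implements any of this. First, for Post #2: the defining feature of a “learning from scratch” system is that its adjustable parameters start out as random noise (or a blank slate), and every bit of useful structure comes from the learning signal. A randomly-initialized linear model trained by gradient descent is about the simplest member of that family:

```python
# Toy "learning from scratch": the weights start as pure random noise (they
# encode nothing about the task), and all the useful structure they end up
# with comes from the training signal.
import numpy as np

rng = np.random.default_rng(0)

# Some arbitrary target function the system has to discover from data.
true_w = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_w + 0.1 * rng.normal(size=200)

# "From scratch": random initial weights, no built-in knowledge of true_w.
w = rng.normal(size=3)

for step in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    w -= 0.05 * grad                       # plain gradient descent

print("learned weights:", w.round(2))      # ends up close to [2.0, -3.0, 0.5]
```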
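For Post #4, a minimal sketch of the “short-term predictor” idea as summarized above: a supervised learner sees the current context, predicts what the supervisory (“ground truth”) signal will be a few timesteps later, and updates its weights once that ground truth actually arrives. The delta-rule update, the 3-step delay, and the sinusoidal stand-in signals are placeholder choices for illustration only:

```python
# Toy "short-term predictor": predict the supervisory signal DELAY timesteps
# ahead, then learn from the prediction error once the real signal shows up.
import numpy as np

rng = np.random.default_rng(1)
DELAY = 3            # "a fraction of a second" into the future
LEARNING_RATE = 0.01

w = rng.normal(size=5) * 0.01   # learned from scratch (random init)
pending = []                    # predictions waiting for their ground truth

def context(t):
    # Stand-in for whatever sensory/cortical context is available at time t.
    return np.array([np.sin(0.3 * t + k) for k in range(5)])

def supervisory_signal(t):
    # Stand-in "ground truth" signal the predictor is trying to anticipate.
    return np.sin(0.3 * t)

for t in range(5000):
    ctx = context(t)
    pending.append((ctx, t + DELAY, float(ctx @ w)))    # prediction about time t+DELAY

    # Once the ground truth for an old prediction arrives, learn from the error.
    while pending and pending[0][1] <= t:
        old_ctx, target_t, prediction = pending.pop(0)
        error = supervisory_signal(target_t) - prediction
        w += LEARNING_RATE * error * old_ctx            # delta-rule update

# After training, a prediction made at time t closely matches the signal at t+DELAY.
t = 4000
print(round(float(context(t) @ w), 3), round(float(supervisory_signal(t + DELAY)), 3))
```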
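For Post #5, the plain textbook version of the TD-learning idea behind the “long-term predictor”: instead of anticipating the supervisory signal a moment from now, the predictor estimates a discounted long-run sum of it, and the teaching signal is the TD error (the kind of “surprise” signal that, on my account, dopamine carries). This is just tabular TD(0) on a toy chain of situations, not the actual telencephalon-brainstem loop from the post:

```python
# Tabular TD(0) on a toy chain: the "long-term predictor" V learns to predict
# the discounted future value of a primary signal that only arrives at the end.
import numpy as np

N_STATES = 6   # situations 0..5; situation 5 is terminal and delivers the signal
GAMMA = 0.9    # discount factor
ALPHA = 0.1    # learning rate

V = np.zeros(N_STATES)   # long-term prediction for each situation

for episode in range(2000):
    s = 0
    while s < N_STATES - 1:
        s_next = s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0   # primary signal only at the end
        td_error = r + GAMMA * V[s_next] - V[s]      # "dopamine-like" surprise signal
        V[s] += ALPHA * td_error                     # nudge the long-term predictor
        s = s_next

# Earlier situations learn to predict the discounted future signal:
# roughly [0.66, 0.73, 0.81, 0.9, 1.0, 0.0]
print(V.round(2))
```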
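For Post #9, a toy gloss on the credit-assignment idea: when a reward prediction error arrives, the learned “valence” of whatever concepts were recently active gets nudged, with more recently active concepts getting more of the credit. I’m using a bog-standard eligibility-trace trick here purely as an illustration; the story in the post is richer than this:

```python
# Toy credit assignment via eligibility traces: a reward-prediction-error nudges
# the learned "valence" of recently active concepts, weighted by recency.
import numpy as np

concepts = ["cake", "oven", "smoke alarm", "frosting"]
valence = np.zeros(len(concepts))   # learned visceral reaction to each concept
trace = np.zeros(len(concepts))     # how recently/strongly each concept was active

TRACE_DECAY = 0.8
LEARNING_RATE = 0.3

def step(active_concept, reward_prediction_error):
    global trace, valence
    trace *= TRACE_DECAY                           # older activity fades
    trace[concepts.index(active_concept)] += 1.0   # mark the current thought as active
    valence += LEARNING_RATE * reward_prediction_error * trace

# Thinking about cake, then frosting, then a positive surprise when tasting it:
step("cake", 0.0)
step("frosting", 0.0)
step("frosting", +1.0)   # reward arrives; frosting gets most of the credit, cake a little
print(dict(zip(concepts, valence.round(2))))
```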
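Finally, for Post #10, a tiny made-up example of the Goodhart’s-Law failure mode: if the reward signal is only a proxy for what we actually want (here, “visible cleanliness” standing in for “clean house and nothing broken”), then an agent that optimizes the proxy hard enough tends to land somewhere that scores great on the proxy and terribly on the thing we actually cared about:

```python
# Toy Goodhart's Law: optimize a proxy reward hard, and the true objective suffers.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Each candidate plan is summarized by two numbers. In this toy setup, being
# careless about breaking things makes it easier to look clean, so they're linked.
carelessness = rng.uniform(0, 1, size=n)
visible_cleanliness = np.clip(0.5 + 0.5 * carelessness + 0.1 * rng.normal(size=n), 0, 1.5)
stuff_broken = carelessness

proxy_reward = visible_cleanliness                         # what we actually rewarded
true_objective = visible_cleanliness - 5.0 * stuff_broken  # what we actually wanted

best = np.argmax(proxy_reward)                             # the agent optimizes the proxy
print("proxy reward  :", round(float(proxy_reward[best]), 2))
print("true objective:", round(float(true_objective[best]), 2))   # badly negative: Goodharted
```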
Happy for feedback, pushback, and discussion!!