I spent the past two weeks visiting the Alignment Research Center (ARC) in Berkeley. ARC is a research organization dedicated to solving the “AI alignment problem”: How do we ensure that current and future AI systems are safe and aligned with humanity’s goals and values? I had known of ARC’s existence since their inception (they were founded in 2021 by Paul Christiano) and had always been curious what they were up to. While there are many different AI safety organizations, ARC had a reputation for being unusually theoretical and abstract in their approach (compared to other AI safety research directions like mechanistic interpretability, which seemed more concrete and approachable).

I was invited by Mike Winer, a physics postdoc at the Institute for Advanced Study (IAS). The IAS—located in Princeton, New Jersey—has perhaps the most storied history of any research institution in theoretical physics: it’s where Einstein and von Neumann worked after moving to the US, and it continues to host a high concentration of great researchers.

I chatted with Mike when we were both attending the AI + Physics workshop at CUNY. Over the course of the conversation, I realized that we had mutual contacts (after the conversation, I also realized that we had actually spoken on the phone many years ago when I was applying to PhD programs!). As chance would have it, he soon after came to Yale to give a talk on his work with ARC. While he was in town, I took him out to dinner (pizza night at BAR, a New Haven classic), and we had a wide-ranging conversation covering a lot of ground at the intersection of physics, machine learning, and even philosophy. At the end of the night, he said that if I’m ever free, I should visit ARC (which he has an official affiliation with).

I try to avoid giving unsolicited advice, but I’m going to break my rule just this once: if someone offers you the opportunity to go on a wacky rationalist adventure in California, you drop everything and go.

Constellation

The on-site headquarters for ARC is contained within the Constellation co-working space. Co-working spaces are office spaces where more than one organization shares the same resources (e.g. common areas, kitchens, and bathrooms).

Co-working spaces seem to serve several important purposes, one of which is to delegate the responsibility of managing an office space to a dedicated organization. The Constellation co-working space was really nice: a stylish professional look, balanced by just enough plants and warm furniture to make it feel like your own space. The space also had a lot of quality-of-life benefits like providing complimentary breakfast, lunch, and dinner. The fridges were filled with a cornucopia of protein shakes, yogurts in artisanal ceramic containers, and carbonated beverages. I had never drunk Vitamin Water before—because why in the world would you pay for water—but by the end of my stay, I had cut all the normal water out of my diet in favor of the tastier alternative. They also had a well-supplied snack bar—well, mostly well-supplied. I recently developed some adult-onset allergies which limit what sorts of foods I can eat, so the only snack in their offerings that I could eat was the fig bars. People of Constellation: if you were wondering why all the fig bars had mysteriously disappeared, now you know why.

[Image: fig bars]

If I save the world from the killer robots, you now know who to thank.

I was told during my initial tour that there are fifteen to twenty AI safety organizations at Constellation. This caught me off guard because I hadn’t realized that there were twenty AI safety organizations in the world—let alone that many in the same co-working space! To be fair, quite a few of the organizations at Constellation are small, scrappy efforts dedicated to tackling one specific problem. But there were larger organizations as well, such as Redwood Research, which has really emerged as one of the leading organizations in the AI safety space.

One of the benefits of having so many organizations in the same co-working space is that it encourages the cross-pollination of ideas. During my stay, I made a concerted effort to sit at different tables and get to know as many people as possible. While most of the organizations are safety-oriented, there are also organizations working on other AI-adjacent topics. For example, Eleos AI is an organization dedicated to model welfare: How do we ensure that, as models become more capable and possibly more sentient, they are treated in a fair and humane manner?

There was a wonderful openness to the conversations at Constellation. While most of the conversations that I was a part of were centered on what people were working on, there were occasional forays into other topics that could get surprisingly deep (e.g. the intricacies of Mandarin versus Cantonese). And the nature of the work naturally invites philosophical inquiry: What is intelligence? What is consciousness? To what extent can one actually predict the future?

ARC’s Research Agenda

The last time I was closely tracking the state of technical AI safety work was back in 2022. At the time, everyone seemed to be talking about mechanistic interpretability (mech interp). Mech interp can be thought of as developing a neuroscience for machine learning models.

Machine learning models are famously black boxes: we can observe that, over the course of training, the models get better at performing the tasks we train them to do. But we don’t know how they do it. This is different from traditional computer programs which, while potentially incredibly complicated, can still be understood by decomposing them into distinct subroutines and algorithms—a feat made possible by the nature of high-level programming languages.

Mech interp aims to transform machine learning models from black boxes into interpretable programs. For example, the discovery of induction heads could be put under the mech interp umbrella. An induction head is a circuit in transformers composed of two attention heads that performs the following subroutine: if earlier in the text the sequence AB is present, then the next time that A appears, B is more likely to be the output (I am using “A” and “B” to denote variables, not the literal letters). If it turns out that one could build a library of similar basic circuits that machine learning models tend to build over the course of training, then one could interpret a particular network as a composition of these circuits.
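To make the pattern concrete, here is a toy Python caricature of the behavior an induction head implements. The function and its name are my own illustration; real induction heads are attention circuits inside a transformer, not lookup code like this.

```python
# A toy caricature of the behavior an induction head implements:
# look back for the previous occurrence of the current token and
# predict whatever followed it. (Real induction heads are attention
# circuits inside a transformer; this only mimics the input-output pattern.)

def induction_style_prediction(tokens: list[str]) -> str | None:
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the final token, if there is one."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan earlier positions right-to-left
        if tokens[i] == current:              # found an earlier "A"
            return tokens[i + 1]              # predict the "B" that followed it
    return None                               # no earlier occurrence; no guess

# "... A B ... A" -> predict "B"
print(induction_style_prediction(["Harry", "Potter", "said", "Harry"]))  # -> "Potter"
```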

The 2021-2022 upswell in mech interp was spearheaded by many great researchers, including Neel Nanda and Chris Olah. I remember reading papers like Toy Models of Superposition and being impressed by the work being done. But eventually, I got busier with PhD work and the various blog posts that I skimmed on LessWrong were failing to capture my attention, so I stopped following technical AI safety work all that closely.

I was never overly familiar with ARC’s research agenda. My only exposure to ARC’s approach to AI safety had been through the concept of Eliciting Latent Knowledge (ELK). The idea behind ELK is that if you have an AI system that builds an internal model of the world, it might “know” things that aren’t reflected in its outputs. For example, imagine a model trained to predict camera footage with the goal of preventing bank robberies. If a burglar tampers with the camera, the model might predict that everything will look fine—even if it knows on some level that the footage no longer reflects the actual state of the world. How do you get the model to report this latent knowledge? ELK made sense from a philosophical point of view, but I had trouble translating it in my head to concrete experiments you could run on current machine learning models.

Since then, ARC has shifted their agenda. Over the past year, they’ve been working within a framework organized around what they call the matching sampling principle: the idea that we should be able to understand machine learning models using mechanistic approaches at least as well as we can understand them by feeding them data points (sampling). There is a really nice blog post by Eric Neyman explaining the matching sampling principle and how it informs their research agenda.

Here’s my understanding of the matching sampling principle: we want our AI models not to do things that are bad. So during training, we will train them not to do bad things. However, once the model has been trained, how do you know if you’ve succeeded? You could try asking the AI, “Hey, are you by any chance secretly misaligned?” but it doesn’t seem all that wise to trust whatever answer it gives you. Another, more reasonable approach would be to give the AI model a bunch of data (you could imagine giving it data that you expect to closely reflect the real-world environment that it will be deployed in, or data specifically designed to probe whether the model is misaligned) and see how often it does a bad thing. This would give you a rough upper bound on the probability of catastrophe.

But this sampling-based approach won’t work at scale. Evaluating the model on a single sample involves not only the computational cost of running the model, but also the cost of whatever evaluation process is used to determine whether the model’s output is harmful (e.g., you can imagine a committee of GPT-5-level AIs working in concert to evaluate the superintelligence). Let’s say that with this process you can establish that, on a randomly chosen input, the probability the model does something bad is at most $10^{-6}$. But you want to deploy it in the real world, where it will be used by potentially billions of people for years on end. Even though $10^{-6}$ might seem like a small number, it’s not good enough. How do you get guarantees on the order of $10^{-30}$, when feeding the AI the equivalent number of data points would not be feasible?
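For a sense of scale, here is a back-of-the-envelope sketch (my own, using the standard “rule of three” bound for trials with zero observed failures) of how many samples a purely sampling-based argument would need:

```python
# Back-of-the-envelope arithmetic for the sampling-based approach.
# Rule of three: if you run N independent trials and observe zero
# failures, you can conclude (at roughly 95% confidence) that the
# per-trial failure probability is at most about 3 / N.

def samples_needed(target_bound: float, confidence_constant: float = 3.0) -> float:
    """Roughly how many zero-failure trials are needed to certify that
    the failure probability is below `target_bound`."""
    return confidence_constant / target_bound

# Certifying "at most 1e-6 per input" is expensive but imaginable:
print(f"{samples_needed(1e-6):.0e} samples needed")   # ~3e+06
# Certifying "at most 1e-30 per input" by sampling alone is hopeless,
# especially when each sample costs a model run plus an expensive evaluation:
print(f"{samples_needed(1e-30):.0e} samples needed")  # ~3e+30
```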

The way that you would do that is by actually understanding the model, rather than just treating it as a black box. You could say something like “Catastrophe happens only when criteria A, B, and C are met. A, B, and C can be reasonably modeled as independent. Since we can show that each of A, B, and C happens with probability less than $10^{-10}$, the overall probability of catastrophe is no higher than $10^{-30}$.”
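Written out, the arithmetic behind that toy argument is just a product bound, with the independence assumption doing most of the work:

$$
\Pr[\text{catastrophe}] \le \Pr[A \text{ and } B \text{ and } C] = \Pr[A]\,\Pr[B]\,\Pr[C] \le \left(10^{-10}\right)^{3} = 10^{-30}.
$$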

This doesn’t seem so different from how bridge builders approach their craft. With modern-day safety standards, it’s not good enough for bridge builders to say “Well, bridges don’t collapse all that often. Trust.” Instead, there need to be physically motivated reasons—with numbers attached—for why, over the wide range of possible conditions a bridge can endure (sweltering heat, heavy traffic load, brittle winters, rainstorms), it should still fail only a vanishingly small fraction of the time.

(It also seems similar to what is called “probabilistic risk assessment” in the context of engineering projects like nuclear power plants and aircraft design.)

This research agenda is nice because it (a) has a clear story of how it could help align even super-intelligent AIs while (b) remaining tractable and operationalizable on smaller toy problems.

During my time there, they had me work on some mini-research projects that fit into their overall research agenda while also being a good fit for my statistical physics background. I had a lot of fun with the problems they gave me and hope to keep puzzling over them. ARC seems to be managed much like a computer science lab you would find in academia: there are researchers with their own projects, regular meetings to address bottlenecks, and papers and talks about the work. And the problems they work on are astonishingly concrete: while the problems aren’t being solved for their own sake but for the overall goal of making AIs aligned, the day-to-day work involves arguing with colleagues at the whiteboard about how to solve some challenging but intellectually stimulating combinatorics problem.

ARC is hiring! Their typical hiring process is a ten-week research stint as a visiting researcher, followed by a transition to a full-time position if it’s a good fit for both sides. They are especially interested in people with backgrounds in computer science, math, and theoretical physics. If you are interested, you can apply here.