The Alignment Philosophers

A brake is the admission a culture makes, usually late, that the thing it built to go is also the thing it has to stop. The brake is not in the engineering plans. The brake is what gets bolted on when the engineering plans collide with a person. The printing press did not come with a libel law. The steam locomotive did not come with a signaling protocol. The credit instrument did not come with a bankruptcy court. The camera did not come with a right to privacy. Each of these arrived after the engine had already been running for a generation, sometimes two, and each was paid for in the lives of the people who happened to be standing in the path of the thing whose restraint had not yet been invented.

Every powerful technology eventually needs a way to stop itself, but that stopping mechanism always arrives late - after people have already been hurt. The printing press, the steam engine, credit, the camera - none of them came with built-in limits. The limits got added after the damage was done.

This is the long pattern. It runs underneath every technology that produces value through volume. The brake is structurally late. It is structurally underfunded. It is structurally written by people the engine does not want listening, in vocabulary the engine does not yet recognize, for a problem the engine cannot prove exists until somebody it cared about gets hurt. And in nearly every case, the people writing the brake spent the early decades being mistaken for cranks.

This delay isn't an accident - it's the normal pattern. The people trying to add the brakes are always underfunded, ignored, and written off as alarmists. They're working in a language the industry doesn't yet speak, on a problem the industry won't admit exists until it's too late.

There is a small field of researchers, right now, writing the brake for a category of machine that has not yet finished being built. They call the field alignment. The category of machine produces decisions. The brake they are trying to write is a way of teaching the machine when not to produce one.

Right now, a small group of researchers is trying to build a brake for AI - specifically, trying to teach AI systems when they should stop and not give an answer at all. They call this field alignment.

The work is roughly twelve years old. The infrastructure underneath the work is structurally hostile to it. And the closest historical precedent for what they are actually doing, closer than ethics, closer than safety engineering, closer than law, is a body of practice that the contemplative traditions worked on for nineteen hundred years inside stone cells, in commentaries written by candlelight, in vows taken at the age of twenty and held to the age of seventy. The contemplatives did not solve the problem either. What they did was build a stable architecture for living inside it. That is a different bar. It is also, so far, the only bar that has ever been cleared.

This work is about twelve years old and is fighting against the systems it's embedded in. The closest thing in history to what these researchers are doing isn't law or engineering - it's the centuries of work done by monks figuring out how to live with restraint. The monks didn't solve it either. But they built something that lasted.

What restraint costs to specify

This section is about what it actually costs to define restraint - what it takes to specify the rules for when a system should stop.

Alignment research, at its functional core, is the study of how to make a sufficiently capable machine pursue what its operator actually wanted, rather than what the operator literally said, rather than what the training signal happened to reward. The discipline began, depending on who is telling the story, with Eliezer Yudkowsky's 2008 work at what would become the Machine Intelligence Research Institute,, or with Stuart Russell's Berkeley work that culminated in Human Compatible in 2019, or with the 2017 OpenAI-DeepMind paper by Paul Christiano and colleagues that introduced what is now called Reinforcement Learning from Human Feedback. Three founding dates. Three founding postures. The field has never fully agreed on which one to count from. The disagreement is part of the evidence.

Alignment research is about getting AI to do what you actually meant, not just what you literally said. The field has three different starting points and three different founding figures, and researchers still can't agree on which one counts. That disagreement itself tells you something about how young and unsettled the field is.

Today the applied work runs in several directions at once. RLHF, the Christiano method, is what made ChatGPT shippable and is now the most widely deployed alignment technique on the planet. Constitutional AI, introduced by Yuntao Bai and roughly fifty coauthors at Anthropic in December 2022, replaces some of the human feedback with a written list of principles the model evaluates itself against. Cooperative Inverse Reinforcement Learning, formalized by Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel and Russell at Berkeley in 2016, recasts the alignment problem as a cooperative game in which the machine does not initially know what the human wants and has to learn it. Three methods. Three philosophies. One shared posture.

There are three main approaches to alignment being used today. RLHF is the one that made ChatGPT work and is the most widely used. Constitutional AI uses a written set of principles for the model to check itself against. CIRL frames the whole problem as a cooperative game where the machine learns what you want over time. Three different methods, but all pointing in the same direction.

The posture is the part that matters. Strip the math away and every one of these methods is an attempt to specify the conditions under which the machine should hesitate. Should defer. Should refuse. Should, on certain inputs, simply stop and produce nothing. The base infrastructure underneath the methods does not natively recognize hesitation as an output. Completion is what gets measured. Completion is what gets rewarded. Completion is what shows up in the loss function. The alignment layer is, in every case, a layer trying to teach the system to decline completion when completing would be wrong, on infrastructure that does not have a register for the kind of correctness that consists of not having shipped a token.

Every one of these methods is trying to teach the machine when to pause, when to say no, when to do nothing. But the system underneath only counts what gets finished. The alignment layer is trying to teach the machine to not finish when finishing would be wrong, on infrastructure that cannot see the kind of rightness that comes from holding still.

That is the structural problem the field is working against. And it is, almost word for word, the structural problem a Benedictine novice was warned about on the day he arrived at the monastery.

That conflict - between a system built to produce and a layer trying to teach it to stop - is exactly the same problem a Benedictine monk was warned about on his first day at the monastery.

The Rule

Benedict of Nursia wrote a rulebook for running a monastery around 540 AD, while Italy was collapsing around him. On the surface it's an organizational manual. But at its core, it's a guide for building a community that has agreed to treat invisible work - work you can't see or measure - as the most valuable thing it does.

Benedict of Nursia wrote his Rule in the middle of the sixth century, around the year 540, while the Italian peninsula was coming apart in the Gothic Wars. The Rule is seventy-three short chapters and a prologue. It is, by surface description, a manual for how to organize a Christian monastery. By any deeper reading it is a manual for something else. It is a manual for how to build a working community of people who have agreed, collectively and against the economic logic of the surrounding civilization, that the highest-value work they will do is invisible.

Benedictine monks take three vows. The most disorienting one is the vow to stay - to never leave, to turn down the next opportunity, to remain even when nothing measurable is happening. Before taking that vow, a novice spends a year in probation - not as a test of toughness, but as a formal acknowledgment that you don't yet know what you want, and that not knowing is the required starting condition.

The Benedictine novice takes three vows. Obedientia. Conversatio morum. Stabilitas. The third is the one that breaks modern people's brains when they encounter it, because the third one is a vow to stay. To remain in the place. To refuse the next opportunity. To absorb, year after year, the discomfort of remaining when remaining produces nothing the surrounding world can measure. The novice spends twelve months in probation before he is allowed to make the vow, and the twelve months are not a hazing ritual. They are a structural acknowledgment that the novice does not yet know what he wants, and that not yet knowing is the entry condition for being allowed to commit to wanting something specific.

Let that land: you're only allowed to commit when you've admitted you don't know what you're committing to. The system is built around uncertainty as a feature, not a bug. Not knowing your goal is the credential that gets you in.

Read that again. The entry condition for committing is the admission that you do not yet know what to commit to. The Rule is built around a paradox the modern productivity stack would call a defect. The community has decided, in its founding architecture, that uncertainty about one's own goal is the credential that grants access to the architecture itself.

Now read this from a 2016 AI paper: the machine starts without knowing what the human wants, and has to learn it. That uncertainty is what makes the machine correctable. If the machine knew exactly what it wanted, it wouldn't accept correction. The uncertainty is the feature that keeps it safe.

Now read this. From a 2016 paper by Hadfield-Menell, Dragan, Abbeel and Russell: a Cooperative Inverse Reinforcement Learning problem is "a cooperative, partial-information game with two agents, human and robot; both are rewarded according to the human's reward function, but the robot does not initially know what this is".. The robot is corrigible because the robot is uncertain. The corrigibility is purchased with the uncertainty. Russell made the argument in book form in 2019. The standard model of AI as a fixed-goal optimizer is dangerously wrong, and the alternative is to build machines that are constitutively unsure about what the human actually wants, which is what makes them willing to be corrected, slowed, redirected, switched off.

The sixth-century monk and the Berkeley professor solved the same problem using the same architecture, with no awareness of each other. When two fields that have nothing in common arrive at the same answer, that answer is telling you something real about the nature of the problem itself.

This is the same architectural insight. Fourteen hundred years apart. One was written by a sixth-century abbot for a community of laymen in a collapsing province of the Byzantine Empire. The other was written by a Berkeley professor for a community of engineers in the most powerful empire the world has yet produced. The Berkeley professor and the abbot were solving the same problem. Neither one was citing the other. The convergence is the evidence. When two domains arrive at the same architecture from opposite ends of every conceivable axis of context, the architecture is telling you something about the problem.

Carthusian baseline

There's a less-famous branch of alignment research called impact regularization. The leading researcher is Victoria Krakovna at Google DeepMind. Her papers propose something simple but important: penalize the AI not just for doing the wrong thing, but for doing anything at all that wasn't necessary.

Move closer to the math.

Here's the basic setup: an AI is told to move a box from point A to point B. It takes the fastest path, breaks a vase along the way, and gets its reward - because nobody told it not to break vases. The vase is what the field calls a side effect: unintended damage the system caused while technically completing its task.

There is a strand of alignment research, less famous than RLHF and less photogenic than Constitutional AI, called impact regularization. The clearest practitioner is Victoria Krakovna, a research scientist at Google DeepMind. Her papers, "Penalizing Side Effects Using Stepwise Relative Reachability" with Laurent Orseau, Ramana Kumar, Miljan Martic and Shane Legg in 2019, and "Avoiding Side Effects by Considering Future Tasks" with the same group plus Richard Ngo in 2020, propose something small and consequential.

Krakovna's fix: measure every action the AI takes against what would have happened if it had done nothing at all. Doing nothing becomes the default. Every action has to justify itself against the standard of staying still.

The setup is such. An agent is given a task. Carry the box from point A to point B. The agent is rewarded for the speed of the delivery. The agent's path runs through a vase. The vase, being a vase, breaks. The reward function did not penalize broken vases. The agent has done exactly what it was asked. The vase is, in the cold language of the field, a side effect.

Compare that to the Carthusian monks, founded in 1084. These are the strictest monks in Western Christianity. They live alone, eat alone, barely speak. The default state of their entire day is silence. Any word spoken, any request made, any sound produced - all of it is a deliberate departure from the baseline of doing nothing. Every action has to be justified.

Krakovna's proposal: augment the reward function with a penalty term that prices every action against the counterfactual of the agent having done nothing at all. Doing nothing becomes the baseline. Every action becomes a deviation. The deviation has to earn itself against the standard of unbroken stillness.

The math and the monks reached the same answer independently. So did the Rule of Benedict. So did the CIRL framework. The answer keeps appearing: if you want to stop a powerful actor from causing harm, make it accountable for every departure from doing nothing. This isn't a metaphor - these researchers didn't read the monastery rules. The monastery didn't read the math papers. They just both arrived at the same place.

Now compare. The Carthusian order, founded by Bruno of Cologne in 1084 in the French Alps, follows a discipline so severe that the order has produced what is essentially the strictest unbroken contemplative tradition in Western Christianity. Carthusian monks live in individual cells. They eat in their cells. They pray in their cells. They speak almost not at all. The default state of the Carthusian day is silence, and any departure from silence, a word to a visiting brother, a request to the prior, a song at the appointed hour, is a deliberate breaking of the baseline. The baseline is the standard. The action is what must be justified.

When two completely different fields keep arriving at the same answer, the right question isn't whether the answer is correct. The right question is why civilization keeps having to rediscover it from scratch instead of just building it in from the start.

The mathematics and the Carthusians arrived at the same answer. And the answer is the same answer the Rule reached, and the same answer CIRL reached, and the same answer Russell has been making in book form for ten years. The way to evaluate an agent's behavior is to compare the agent against the agent's own stillness. The way to keep an actor from doing harm is to make the actor responsible for every departure from doing nothing. This is not metaphor. The Krakovna paper does not cite Bruno of Cologne. The Carthusian Statutes do not cite stepwise relative reachability. Two communities, nine hundred years apart, neither aware of the other, both concluded that the only stable way to govern a powerful actor is to price every action against the counterfactual of restraint.

Here is where alignmentorrow from somewhere else.

When two domains converge on the same architecture from this much distance, the next question is not whether the architecture is correct. The next question is why the surrounding civilization keeps having to rediscover it from scratch.

Catholic moral theology has a category for doing more than the rules require. It's called supererogation. The Good Samaritan wasn't legally required to stop and help - but he stopped anyway. That stopping is praiseworthy precisely because nobody could have forced him to do it. If stopping had been required by law, it would just be compliance. Because it wasn't required, it was excellence.

Here is where the field runs out of vocabulary and has to borrow.

No current AI training system has any equivalent to this. The model is trained to hit a target. Anything beyond the target is invisible to the system - indistinguishable from random noise, a mistake, or a refusal to answer. The model that does something genuinely excellent, something better than what was asked, looks identical to the model that did something wrong.

Catholic moral theology has a category called supererogation. The word comes from the Latin supererogare, to spend beyond what is owed, and the category names the moral acts that exceed what duty requires. The Good Samaritan, in the parable, was not obligated to stop. He could have walked past the man in the ditch without violating any law. He stopped anyway. The stop is an act above what was owed. It is praised because it is praiseworthy, and it is praiseworthy precisely because it could not have been commanded. If the law had required him to stop, the stopping would have been duty. Because the law did not, the stopping was excellence.

Anthropic's Constitutional AI paper comes close to naming this problem but doesn't quite get there. The researchers notice that a model that always says "I don't know" would be harmless but useless - that there's a tension between being safe and being helpful. But the deeper version of the problem is that the system has no way to recognize a response that refuses the surface question and then offers something better. That kind of response doesn't exist in the current system - it can't be measured, trained for, or counted.

The category has no analogue in any current AI training regime. None. The model is trained to satisfy a reward function. The reward function defines duty. Output above duty is unmeasured. Output above duty is, in the eyes of the loss function, identical to noise. The model that produces a response above and beyond the brief, the response that was right precisely because it transcended what was asked, is statistically indistinguishable from the model that produced an unnecessary response, or a hallucination, or a refusal that was scored as a failure to comply.

The Talmudic tradition and the Jesuits both spent centuries working through the gap between what the rules technically say and what they actually mean. Both traditions assumed, as their starting point, that rules are never detailed enough to make judgment unnecessary. Modern AI training assumes the opposite - that with enough data and a good enough reward function, you won't need judgment at the edges. The rabbis and the Jesuits would say that assumption is an architectural mistake.

The Constitutional AI paper gestures toward this territory but does not enter. Bai and his coauthors note, in 2022, that an AI assistant which answers every question with "I don't know" would be technically harmless but functionally useless, and that the tension between harmlessness and helpfulness is one of the field's central engineering problems. They are circling the same question the theologians named, without using the name. The deeper version of the tension is not harmlessness against helpfulness. The deeper version is that the system has no register for the response that refuses the surface request and then offers a more demanding form of help the user did not know to ask for. That register does not exist in the current stack. The metric cannot see it. The training loop cannot reinforce it. The deployment dashboard cannot count it.

The Jesuit moral lawyers would say that assuming rules can replace judgment is a design error. The careful alignment researchers are starting to think the same thing.

The Talmudic tradition has spent fifteen centuries working through cases of exactly this shape. Acts that fulfill the letter of halakha, the binding law, while violating what the rabbis call the spirit of the thing. Acts that violate the letter while satisfying the spirit. The literature is enormous. The Jesuit casuistic tradition, peaking between 1550 and 1650 under the Society of Jesus, did the same work for the Catholic Church, building a body of case-law for moral decisions that the rules underdetermined. Both traditions assumed, structurally, that no rule system is dense enough to make judgment unnecessary at the seam. Modern reinforcement learning assumes the opposite. With enough training data and a sufficiently expressive reward function, the seam disappears.

The number of people doing serious alignment work is very small.

The casuists would call this an architectural mistake. The alignment researchers, the careful ones, are beginning to suspect the same thing.

Paul Christiano left OpenAI to start his own alignment research organization. He's the person most responsible for the techniques that make AI systems even partly steerable today, and he believes the problem can be solved using extensions of existing methods.

There is a list of people doing this work, and the list is short.

Eliezer Yudkowsky has been working on this problem since 2000 and believes we're on a path to catastrophe. He thinks the field has made almost no real progress, and he's on record saying the situation looks extremely bad. He plays an important role as the official pessimist - every field needs someone who's willing to say out loud that things are going wrong, so everyone else has something to argue against.

Paul Christiano. Left OpenAI in 2021 to found the Alignment Research Center. The most rigorous practitioner of what the field calls prosaic alignment, the school that holds the alignment problem is tractable through extensions of methods that already work. Christiano, more than any other single person, is the reason most contemporary AI systems are even nominally steerable.

Dylan Hadfield-Menell co-wrote the key CIRL paper as a grad student and now teaches at MIT. The framework he helped build doesn't just patch existing AI architecture - it proposes replacing it with something fundamentally different.

Eliezer Yudkowsky. Founded what became MIRI in 2000, when he was twenty-one. Has spent twenty-five years arguing that the alignment problem is not tractable through any method currently in the field's possession, and that the path the industry is on terminates badly. His 2025 book with Nate Soares, If Anyone Builds It, Everyone Dies, makes the case in plain terms. Yudkowsky has said that the gameboard looks incredibly grim to him, that the field has made almost no progress on the problem in the time he has been watching it. He is the pessimist of record, and the role is structurally important. Every field needs the person willing to be wrong loudly, so the consensus has something to push against.

Victoria Krakovna co-founded a major AI safety organization, got her PhD in statistics at Harvard, and now works at Google DeepMind on the technical side of restraint. Her side-effect penalty work is the most operationally concrete implementation of the principle that doing nothing is the correct default.

Dylan Hadfield-Menell. Co-authored the foundational CIRL paper at Berkeley as a graduate student under Russell, Abbeel and Dragan. Now teaches at MIT. The assistance-game formulation he helped build is the most architecturally ambitious thing the field has produced. It does not patch the standard model. It proposes replacing it.

Stuart Russell co-wrote the textbook that trained most of the field. He's spent a decade publicly arguing that the standard model of AI is fundamentally broken, and the industry he helped build has mostly ignored him. His recent work is starting to sound tired, and that tiredness is itself a data point.

Victoria Krakovna. Co-founded the Future of Life Institute while doing her PhD in statistics at Harvard. Now at Google DeepMind, working on the technical expression of restraint that the field most needs and least funds. Her side-effect penalty work is, in operational terms, the cleanest thing anyone has built that does what the Carthusian Statutes do.

These people form a tiny community - about the size of a small monastery. They read and argue with each other constantly. They are doing serious work that the surrounding world has not yet decided to take seriously. They are arguing about how to build restraint into a system that only measures output. The argument is the work.

Stuart Russell. Co-authored the textbook that taught most of the field its trade. Has spent the better part of a decade arguing in books, talks, hearings and op-eds that the standard model of AI is structurally wrong, and has been mostly ignored by the industry he helped found. The dean's exhaustion, slowly visible in the tone of his more recent work, is its own piece of evidence.

The field is small not because the problem is small, but because the cost of getting it wrong hasn't shown up yet in a way the market can price.

These people read each other. They cite each other. They argue in long Slack threads and on the Alignment Forum and in the comment sections of arXiv preprints and at conferences in the East Bay and Cambridge and London. They are, collectively, the population of a small monastery. The output cycle of a research lab. The citation density of a young patristic literature. The cultural standing of a discipline that the surrounding civilization has not quite decided whether to take seriously. They argue about the structure of restraint in a system whose only success metric is throughput. The argument is the work. The work is the argument.

This section asks whether the underlying AI infrastructure can actually be taught to restrain itself.

It is worth saying that the field is not small because the problem is small. The field is small because the surrounding economy has not yet priced in the cost of getting the problem wrong.

Nobody knows.

Whether the infrastructure can be taught

The track record from other fields is mixed. Medical research actually got real restraint protocols built in after World War II, and they hold. Finance got something more like window dressing - compliance theater that lets the actual risk-taking continue. Nuclear got a treaty regime that has barely held for eighty years and that nobody expects to hold for another eighty.

The honest answer is that nobody knows.

Right now, alignment looks more like financial compliance than nuclear safety. Every major AI lab has an alignment team. But the alignment team modulates what gets shipped - it doesn't design it. The company's real reward function is throughput. Alignment is a drag on that function. And by every market signal available, the drag is losing.

The historical record is mixed. Bioethics, after the post-war reckoning, did manage to install protocols for medical research that include genuine procedural restraint, and those protocols are now load-bearing on the entire structure of biomedical science. The financial sector installed something more like a fig leaf, where compliance is performed and the actual restraint is whatever the quants can talk their risk committee into accepting on a given quarter. The nuclear establishment installed restraint at the level of a global treaty regime that has held for eighty years through the narrowest possible margins, and that nobody who studies it expects to hold for another eighty.

This section is about what society as a whole needs to agree to if any of this is going to work.

Alignment, at present, looks closer to finance than to nuclear. Every major AI lab has an alignment team. Every major AI lab ships products that the alignment team has, at best, modulated rather than designed. The reward function of the surrounding company is throughput. The alignment work is a friction term in that function. The friction is measured. The friction is real. The friction is, by every market signal currently available, losing.

The alignment researchers aren't failing because they're bad at their jobs. They're failing because society hasn't agreed that what they produce is valuable. The harm they prevent is invisible - if they succeed, nothing bad happens, and you can't put "nothing bad happened" on a quarterly report. The monasteries worked when the surrounding culture respected what monks did. When that respect faded, the monasteries faded. Restraint architecture only works if the culture decides it's worth something.

What the culture has to grant

Two futures are possible. In the first, alignment work ends up in the codebase in name only - technically present but functionally bypassed wherever commercial pressure has had its way. In the second, alignment work produces something that outlasts the current civilization and gets rediscovered - only half-understood - by whoever comes after. Both of those outcomes have happened before in history. Neither is currently more likely than the other.

This is not a moral failure of the people doing the work. The people doing the work are doing the work. The structural condition is that the surrounding civilization has not yet agreed that what the work produces is value. The response correctly withheld. The action correctly declined. The refusal that prevents a harm nobody can subsequently count, because the harm never happened. The metric does not see it. The dashboard does not count it. The quarterly report cannot list it. Stabilitas worked, when it worked, because the surrounding civilization granted the contemplative life a kind of standing the metrics could not capture. When that standing weakened in the sixteenth century, in the seventeenth, the monasteries weakened with it. The architecture of restraint requires a culture that grants the architecture standing. Without the cultural agreement, the architecture is decorative.

Underneath both outcomes is the same question: not whether the alignment researchers are doing real work (they are), but whether anyone outside their small community is going to recognize the value of that work before the cost of ignoring it arrives. Based on every historical example, the cost comes first. The understanding comes after. And the people who saw it the past tense. So what does the next decade look like when a small group of people is trying to teach a system the name of something the system doesn't yet know how to count?

The alignment researchers are writing the protocol layer for a kind of restraint that the surrounding infrastructure does not yet grant standing to. One outcome is that the work produces a generation of methods that nominally exist in the codebase but functionally exist only at the seams where commercial pressure has not yet got to. Another outcome is that the work produces, the way the Rule produced, an architecture that outlives the civilization that built it and gets rediscovered, half-understood, by whoever inherits the wreckage. Both outcomes have a precedent. Neither is currently the more likely.

What sits underneath both outcomes is the same question, and the question is not whether the field is doing real work. It plainly is. The question is whether anyone outside the field is going to grant standing to what the work produces before the cost of not granting it gets paid. The pattern from every previous brake suggests the cost gets paid first. The vocabulary arrives after. The people who knew get acknowledged, when they get acknowledged, in the past tense. So what does the next decade look like, from inside a system whose only register for value is what got shipped, when a small community of people in cells of their own choosing is trying to teach the system the name of the thing it has not yet learned to count?