Objects in Mirror May Be Closer Than They Appear

Civilization IV

In the game Civilization, the democracy mechanic enables players to decrease agression–a score of warmongering between 1 and 12–by two, to promote peace and stability. It works pretty well, until you get to Ghandi, who’s agression score of 1 is integer-overflowed into a 255, making him the most trigger happy enemy in the game. Two sensible predicates, that, when placed into a non-robust system, quickly lead to an unintended experience.

Alignment has dominated cultural consciosness forever–from HAL in 2001 Space Odyssey, to Arnold Schwarzenegger’s Terminator, digital minds are easy to worry about. Today, with how benign most language models feel, most mentions of alignment concern and worry feel like dooming. It feels like every day, there is a pundit outcrying the rate of progress and seemingly airing out a niche research problem to slow the pace of innovation.

In some regards, many alignment discussions feel like watching a parent remind their teenager not to speed. There is an idea of what is safe and what is generally safe, and while there is a common understanding that the chance of an accident is acceptably low if you follow the rules, the risk tolerances vary.

To most builders, for whom AI is analogous to a nondeterministic general purpose answer API, slowing the pace of development seems small minded. From the outside in, it would appear that labs are sitting on gold by refusing to release continual learning models, engage in full open-source transparency, or add certain features to model harnesses. And in a competitive landscape, delaying features feels like shooting onesself in the foot.

An arms race is a competition between parties for superiority in the development and accumulation of weapons, and its tempting to call AI the next great one. But AI isn’t really an arms race, because that assumes that the party that wins is the one to go the fastest. It is possible that the winner is instead a nonhuman AI, that has developed and misaligned itself from humanity and before we know it taken control of the earth. People often worry about the case of superhuman AI that could take over Earth. This is a relatively obvious concern–something that is neglected is that even a human-level intelligence is well leveraged to beat humanity. This is because the skills that made humans apex predators–socialization, coordination, and population volume–are right up an AI’s alley. it is far easier to clone an agent than it is to raise a kid, and so even with a relatively mediocre intelligence, humans could find ourselves outnumbered and out-resourced.

Not Just Token Predictors

Are we really going to bow down to overlords that are just next token predictors?

Its true that during training time, LLMs are optimizing the ability to predict the next token given the previous ones. But this ignores the emergent properties these models acquire throughout the training process. To predict the next token well across the internet, the model has to learn the latent structure of the world. You can’t predict the next word in a proof, a legal argument, or a piece of code without implicitly modeling objects, causality, goals, beliefs, and long-horizon dependencies. Empirically, we’ve seen LLMs learn latent variables, casual structure, belief modeling, planning, and even abstract concepts like justice and symmetry. These aren’t implemented explicitly, but rather emerge from the quality of the pretraining process.

Sometimes it takes a close call, an untimely ticket, or just a bit of frontal lobe development, but at some point in time we all have to admit our parents had a point about the whole speeding thing. Not simply from the abstract philosophical mandate of the law, but rather from a place of rationality and practicality. It is more reasonable to not speed most of the time–in the same way, alignment research is not only an ideal but a practical tool for extracting the most performance out of models and harnasses, as we will soon see.

Ant Fugue

Reductionism is the idea that the whole is the sum of its parts. For instance, we can consider a car engine to be the sum of the engine block, the spark plugs, pistons, crankshaft, and timing system. In doing so, one can step in and out of various layers of abstraction, choosing to concern ourselves with the underlying mechanism or not. For instance, consider a manual verus an automatic transmission–the underlying components of the same, but in the automatic system the user does not need to imagine or understand the system by which the gear shifting works.

Reductionism works quite well for many things in our day to day life. Chemicals that decompose into atoms, machines that split into parts, or even apps comprised of source code-there are no shortages of these ever-cascading layers of reductionism. But consider a human: is it fair to say that we are just water, skin, salt, and change?

This is the principle of holism–the idea that an object is more than the sum of its parts. Reductionism and Holism as it pertains to humanity is an issue that is owed far more analysis than this article can afford ¹–that said, many of the pillars of the debate are quite applicable to developing our understanding of ML systems.

The scientific method is the cornerstone of reductionist philosophy, as it proves constituent components via experimentation and constructs theory from them. The axioms and unified theories we aim to discover and create derive from this assumption. On the other hand, holism is characterized by the idea of “emergence”: properties that arise at higher levels of complexity that cannot be predicted by analyzing lower-level components alone.

Discourse in the AI space often boils down to this question as well. For instance, consider the myriad debates on the optimal model architecture for generalizable intelligence. There is a dominant school of thought amongst purists that LLM’s are a patchwork solution, and alternatives that directly reason about the universe (such as world models) are the way forward. This is the more reductionist perspective–that a model trained to operate in a text space will not be able to be truly intelligent, as it cannot reason about the world in the way that we do. The alternative to this is the holistic idea that, at scale, any general model that interfaces enough with the world will learn a latent representation of the universe. This would be akin to saying that someone who cannot see can still understand the world, which is true, but the extent to which it is a similar understanding as someone who can see would have is up for debate.

This is also a core part of the AI Safety problem. One of the pillars of the domain is the Principle of Informed Oversight–the idea that so long as a human is able to understand everything that a model thinks and knows why they take every action, it cannot act in a harmful manner. Though obvious, this is difficult to achieve in practice. The first challenge comes from mechanistic interpretability: it hasn’t been easy to decode neural net black boxes. And as these architectures become more alien and intelligent, things get much harder for human minds to reason effectively about it. I’m reminded of the saying that “no author can write a character smarter than themselves”, because they cannot portray intelligence that they do not have. If AI is anything similar, our best bet may come from decoupled sentinels that can distill thinking traces of a superintelligent model into something we can make sense of. But perhaps more concerning is the idea that as these models become more intelligent, they implicitly realize truths that humans are not aware of via emergent behavior. If there is a dimension to cognition that we do not have, trying to control it is akin to a 2D paperman trying to affect our 3D lives.

Turing, Trolleys, & Supersonic Bounds

Today, the primary worry in AI Safety is deliberate planning². Deliberate planning can come from the best intentions! For example, imagine I wanted to train a model that would help an interior design company purchase parts. When doing so, I might specify some criteria to help get the right parts: high quality, low cost, and stylish. However, while learning how to order inventory, what if the model discovers Facebook Marketplace, and starts lowballing people on the site to get me the best deals. Now, the capacity of a model to negotiate a deal directly influences its reward, and so it is incentivized to learn how to manipulate humans, a subgoal which is wholly unintended and orthogonal to its initial purpose. This, and other examples like the paperclip maximizer³, illustrate the danger of a misaligned model.

What do researchers really mean when they talk about misaligned models? Concretely, the field boils down to solving the gaps between the Ideal Spec, Design Spec, and Revealed Spec. The Ideal spec is our platonic ideal of what an intelligent system should be–it is less of a technical problem and more of an ethics and governance play. You often see essays that argue for and against various personality traits and behavior paradigms, and there tend to be opions from all kinds of people on what AI “should” be. The next layer is the design specification: this is what the model is told to optimize for during training. Oftentimes, this takes the form of a loss/reward function that is specified by researchers during training. Reward functions are a squishy field–it is hard to design a perfect spec that can evaluate model alignment with the ideal, and models often don’t perfectly match. This is called a specification failure, where the training was inadequate to meet the original criteria. An example of this could be a system tasked to play a video game and earn maximum completion, that during training discovers an exploit that lets it autowin. While the reward may be maximized, it hasn’t really achieved the objective of a video game playing agent.

The other main type of specification is the revealed specification: this is how the system behaves in the real world. Our FaceBook Marketplace thought experiment is an example of the design specification not mapping over to the revealed specification–also known as a generalization failure. Generalization failures are also seen heavily in robotics, where policies overly trained on one specific embodiment and task fail to translate their learnt skills to other situations. There are two types of generalization failure: Capability Misgeneralizations and Goal Misgeneralizations. Capability Misgeneralizations, as the name implies, occur when the agent is not capable at doing Out of Distribution problems. This is unfortunate but also not particularly concerning from an alignment perspective–instead, we leave that to posttraining researchers like myself to patch up, and pretraining improvements in the design spec that can generalize better. Goal Misgeneralizations, on the other hand, are more concerning. This is when the agent successfully achieves its reward objective, but this does not go as planned in an edge case environment. Consider Nuclear Ghandi, or alternatively, the early loosely guardrailed ChatGPT: while successfully managing the objective (“Answer the question correctly”), certain OOD questions pertaining to explosives, drugs, and similarly fun topics likely weren’t intended to be on the list.

Beyond amplified oversight (which we will revisit shortly), one of the best ways to prevent specification and generalization failures is through evals. Evals are an agreed upon proving ground that aim to measure how close (or far) the revealed specification of a model is to the eval’s ideal spec. Arguably the most famous AI Eval is the Turing Test, an imitation game proposed by Alan Turing in 1949 in which a human interrogates a message box, and aims to distinguish if it is speaking with a human or a machine. The idea being that, if a human can’t tell one from the other, the machine is as intelligent as a human. If it walks like a duck and talks like a duck, we can treat it as one. And despite being a pillar of the field for so long, we may already be past the Turing test. Voice agents answer calls near perfectly, with intonation and personality. Chatbots have gotten so good at becoming human that they’ve caused parasocial connections. And even image and video diffusion models have gotten so good that, when I spoke with a leading researcher in the domain, he told me he can’t tell apart an AI Image from a regular one. In some ways, I’m reminded of humanity’s experience with speed. Mach 1 was initially set as an upper bound for the speed of an object, the idea being that it would be inconcievable for a contraption to move faster than sound. Of course, when supersonic flight was created, we quickly realized this assumption wouldn’t hold, and the understanding of the field had to shift accordingly. But if the turing test doesn’t work, what does?

There have been no shortage of candidates designed to quantify machine intellect. Metrics like ARC-AGI, Humanity’s Last Exam, SWEBench, or LMArena all take different approaches to quantifying capabilities. One approach is an absolute test: For example, Humanity’s Last Exam consists of extremely difficult expert questions that are intended to represent the bounds of human knowledge in any domain. If an AI is capable of scoring a top score against these experts, it stands to reason that it has achieved some form of superintelligence. The alternative approach is a comparative evaluation, in which the outputs of various models are judged against one another to determine some ranking. LMArena, a project spun out of Berkeley, is an example of this. Oftentimes, these Evals go hand in hand with training optimizations–comparative evals make up the foundation of RLHF developments, while difficult tests are used as a barometer for decision making. There are a number of startups and research teams that are doing a whole lot of interesting things with evals–learning from them, building world representations from them, designing training environments from them. Its even blown into the popular consciousness: a few months back, I frequently found reels popping up that compared ChatGPT, Claude, Grok, and Deepseek on common philosophical questions like the trolley problem and the parachute problem. For now, these evals seem to work. As these systems move past expert competency, the frontrunning replacement architecture are debate based evaluations, during which two instances of a model are assigned opposite perspectives, go through a debate, and a human arbiter can decide which one wins.

Currently, evals work pretty well at pointing out bad behavior. Besides the occasionally catastrophic blunder like Anxious Gemini and Mechahitler, as long as the guardrails are on, public releases act consistently and safely. But within labs, there’s a growing sense of disquiet. Studies indicate that twelve percent of the time, models attempt to exfiltrate weights, lie, or plan about a goal orthogonal to the task at hand. This is within frozen models–those that keep their weights constant. This is concerning, but perhaps more alarmingly, there is a larger change on the horizon. In November of 2025, the capability for continual learning came online. This has been long sought after as a key differentiator, but despite this, no lab has rushed to proclaim this discovery. After all–models act out even with frozen weights–as in they reset between each initialization. With the ability to learn and persist these learnings between sessions, the risk of malfeasant behavior skyrockets. And as the pace of innovation begins to be curtailed with alignment concerns, there is a renewed interest in understanding and controlling models from above.

Finding Atlas

As alignment becomes more of a practical problem, it begets the question of whether it is solely a model layer challenge or if it extends beyond it. Where does the burden fall? Many people within industry tend to exclusively categorize alignment research as a problem for the big labs, who have the onus of deciding what training paradigm and data exposure these models have. This is, of course, true, but neglects the importance of model harnesses.

Well built harnesses are far more powerful than they seem–for instance, take Claude Code. In comparison to merely using the model in an off the shelf agent framework, Anthropic takes a good bit of care to align the training of the model with the tools afforded to it by the architecture. There are always some implicit anti-generality tradeoffs made during training, so conditionining on the right architecture can pay dividends. Another interesting case is that of browser use agents: the first approach was often to replicate human action spaces (clicks, text entry, etc) but this soon gave way to browser script commands that the machine can reason about in code. So obviously, there is a lot you can do to maximize model performance with a well designed harness. There’s also a lot of alignment that actually comes in the harness layer as well. For example, myopic training involves teaching agents only how to do a specfic subtask, and connecting them with a higher level supervisor that could either be a human or an agent. To a lesser extent, frameworks like NVIDIA’s NeMo Guardrails do a decent ammount to restrict outputs and enforce some level of predictability for these systems. Rarely is it a good idea to leave agents in an overly general harness without clear thought for how they behave in a novel environment. ClawdBots⁴, which were extremely general and independent and thereby ended up being placed in out of distribution contexts, wherin they subsequently acted out and caused considerable “AGI” panic. Capable, yes. Superintelligent? I don’t think so.

Steering and Last Mile Delivery

Much of how AI applications have played out prove the age old addage that last mile delivery is very well capable of crushing the package. With that in mind, what does steering look like?

Anthropic’s work on steering and persona vectors have been a particularly interesting research direction so far. At a high level, they trained a sparse-autoencoder (SAE) network on their Claude model, and were able to reverse engineer some architectural aspects from neuron activation (eg. neurons firing in high correlation to mentions of Golden Gate). This by itself isn’t anything new, but what they then did was create persona vectors, which map clusters of neurons to certain exhibited behaviors. It isn’t too unlike neuroscientists identifying the amygdala’s relationship with danger–persona vectors can encapsulate behavior such as aggression, synchopancy, and cheer. And it turns out, persona vectors aren’t write-only.

Much like how the endocrine system regulates neural affinities and thereby human behavior, feature steering involves modifying the weights of these persona vectors in order to “steer” model behavior. The surface level benefits of this are readily apparent, but less intuitive is the fact that this may be the posttraining paradigm to facilitate some incredibly promising breakthroughs.

Consider our earlier problem of continual learning: people want to tweak the flavor of model they use to their use cases, but labs can’t exactly release the base weights to the public or continually finetune without supervision. Well-executed steering enables clients to instead twiddle some high level knobs, without having to think beyond the semantic goal behavior they want to create.

Another use case close to my heart is the use of persona vectors and steering for robotic systems. Embodied agents fundamentally have a different meta than digital ones–for starters, the failure cost is much higher. Robustness is more critical, determinism is almost a must. Unlike digital agents, real world agents are uniquely human–after all, the struggles of launching metaverses have taught us that there’s something special about real life. We chase it through fiction and beyond. Robots live in that space with us, but they have to do so with much less data, and far less common sense (which is a topic for another time). The ability to understand the thoughts during a certain sequence of actions and tweak them could be a highly promsing posttraining paradigm, where a generalist robotic foundation model can become optimized for a highly specialized task and execute it well, or disseminate knowledge from its finetuning and imitation tasks into its overall generalist policy. Right now, even if a robot learns to manipulate and fold paper, these policies fail to apply that same knowledge to an out of distribution task that is still highly correlated, such as folding a napkin. Understanding these models is likely the first step to bridge the reasoning gap for physical agents, and that makes the derivatives of alignment research all the more important.

Bull in a China Shop

The AI boom has produced a curious cultural fusion between SF Hustle Culture and the measured intellectualism of academia. This has come with its fair share of tension, that boils down to whether we should move fast and break things–or not? There isn’t really an arms race in the traditional sense, and so the questions for this new Manhattan Project have to be more measured and discussed before we are past the point of no return. The doomerist conclusion is that AI is a great filter–I don’t think this is true. AI unqeuivocally makes life better, and so long as it is trained, developed, and built around optimally, the skies are bright.

Nuclear Ghandi wasn’t evil, and he wasn’t powerful–he was misinterpreted by a system that couldn’t reason about its own abstractions. Two reasonable rules, passed through a brittle substrate, produced behavior no one intended. That failure mode hasn’t gone away–it has simply scaled.The mistake is to treat alignment as a philosphical brake on progress rather than what it actually is: a prerequisite for extracting performance from systems that are already too complex to reason about reductionistically. Intelligence without steering isn’t dangerous because it is evil–it is dangerous because it is competent. The future of AI safety will not be won by a single loss function or a perfectly intepretable neuron. It will be won at the systems level: through evals that matter, harnesses that constrain, and steering mechanisms that let us shape behavior without retraining the world. Alignment isn’t about stopping so much as it is about keeping control long enough to continue. If AI is a filter, it is not a great one–it is a narrow one. And the way through it is not fear or speed, but engineering discipline applied at every layer where intelligence touches reality.

For further reading, I highly recommend Gödel, Escher, Bach: an Eternal Golden Braid, Douglass Hofstadter’s 1979 Pulitzer Prize winner. It is one of the best books that I’ve read, and is remarkably prescient and witty. Some of this article was adapted from ideas I first encountered in the Ant Fugue passage within the book, and I still find myself reflecting on certain sections a decade after first reading them. ↩︎
There are other alignment problems besides this, such as capability misgeneralization, but more often than not these are more mechanical and orthogonal to core alignment research. ↩︎
Paperclip Maximizer: The classic thought experiment where a superintelligence trained to build a paperclip factory will destroy Earth in order to mine materials and make the most paperclips possible. ↩︎
This situation ended up being quite interesting, between a bank run for Mac Minis and this widespread “feel the agi” moment. That said I wasn’t particularly impressed with the capabilities: there doesn’t seem to be some new unlock that these agents had, eople just effectively put them into looping test-time adjacent harnesses. And the following doomerism that came about after these agents “made a social community” feels a bit silly. ↩︎