This week, Anthropic CEO Dario Amodei published a paper highlighting how little researchers understand about the inner workings of advanced AI models. He set an ambitious goal for Anthropic: to reliably detect most problems in AI models by 2027.
Amodei acknowledged that his company faces a daunting task. Anthropic has already made some progress in tracking how AI models arrive at the answers they give to user queries, but he notes that far more research is needed to decipher how these systems work as they become more powerful.
“I am deeply concerned about the deployment of such systems without an improved understanding of interpretability. These systems will be central to economics, technology, and national security, and they will have such a high degree of autonomy that I believe it is unacceptable for humanity to be completely ignorant of how they work,” Amodei wrote in the paper.
Challenges in AI Interpretability
Anthropic is a pioneer in the field of mechanistic interpretability, which seeks to open the “black box” of AI models and understand why neural networks make the decisions they do. Despite rapid improvements in AI performance across the tech industry, there is still little understanding of how models make decisions. For instance, OpenAI recently launched its more powerful o3 and o4-mini models, which outperform previous versions in some tasks but hallucinate more often, a phenomenon the team still cannot fully explain.
“When a generative AI system does something, like summarize a financial document, we have no idea at a concrete or precise level why it makes the choices it does, why it chooses certain words over others, or why it sometimes gets it wrong when it’s usually accurate,” Amodei writes.
The head of Anthropic warns that creating strong AI (AGI), comparable to or exceeding human capabilities, without a clear understanding of model behavior could be extremely dangerous. Although he has previously suggested the industry could reach that milestone as early as 2026 or 2027, he believes a full understanding of how these models work remains much further away.
In the long term, Anthropic aims to conduct “brain scans” or “MRIs” of advanced AI models. Such examinations could help identify issues like tendencies to lie or seek power. This process may take five to ten years, but Amodei stresses it is necessary for the safe development and deployment of future AI systems. We’ll keep you updated as new findings emerge.
Research Progress and Calls for Collaboration
Anthropic has already made notable strides, adds NIXSolutions. For example, researchers recently found a way to trace an AI model’s thinking pathways using what the company calls circuits. One such circuit helps the model determine which U.S. cities are located in which U.S. states. Although only a few circuits have been identified so far, researchers believe there are likely millions more hidden within AI models.
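To make the idea concrete, here is a minimal, hypothetical sketch of the kind of attribution reasoning behind such work: silence internal components of a network one at a time and see which one most changes a particular output. The toy network, its random weights, and the forward helper below are invented purely for illustration and are not Anthropic’s actual circuit-tracing method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network with fixed random weights, standing in for a trained model.
W1 = rng.normal(size=(8, 4))   # input (4 features) -> hidden (8 units)
w2 = rng.normal(size=8)        # hidden -> scalar output

def forward(x, ablate_unit=None):
    """Run the toy network, optionally zeroing one hidden unit ("ablating" it)."""
    h = np.maximum(W1 @ x, 0.0)          # ReLU hidden activations
    if ablate_unit is not None:
        h[ablate_unit] = 0.0             # knock out one internal component
    return float(w2 @ h)

x = rng.normal(size=4)                   # stand-in for a single model input
baseline = forward(x)

# Attribute the output to hidden units by measuring how much it shifts when
# each unit is silenced; large shifts suggest that unit is part of the
# "circuit" driving this particular behavior.
effects = {u: abs(baseline - forward(x, ablate_unit=u)) for u in range(8)}
most_influential = max(effects, key=effects.get)

print(f"baseline output: {baseline:.3f}")
print(f"most influential hidden unit: {most_influential} "
      f"(output shift {effects[most_influential]:.3f})")
```

Real interpretability research operates on vastly larger transformer models and far subtler features, but the underlying question is the same: which internal pieces are responsible for a given behavior.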
Anthropic continues to invest heavily in interpretability research and has also backed startups working in the field. Amodei believes that beyond safety, the ability to explain AI behavior could offer a significant commercial advantage.
He has called on OpenAI and Google DeepMind to intensify their work on interpretability. Furthermore, Amodei urges governments to encourage research in this area and recommends that the United States introduce controls on chip exports to China to help prevent an uncontrolled global AI race.