Chapter 15: Attention Is All You Need

From here forward, the history becomes contested.

The earlier chapters could maintain a certain distance: the debates were settled, the outcomes known, the historical judgments relatively secure. But the events of the past decade are still being argued. The researchers who made them are still working, still defending their choices, still disputing the interpretations. The harms alleged by critics are still being litigated in courts and contested in policy debates. The questions of what happened and what it meant remain genuinely open.

This does not mean neutrality is impossible. It means that the reader should understand: we are now in territory where the narrator has views, where the sources disagree, where the dust has not settled. The story continues, but it becomes a different kind of story—less chronicle, more contested present.


The paper had an almost playful title: "Attention Is All You Need." Published in June 2017 by eight researchers at Google, most of them young, some still in graduate school, it was a technical contribution to machine translation, proposing an architecture that eliminated recurrent neural networks in favor of something called "self-attention." The authors could not have known they were laying the foundation for everything that would follow.

The insight came from Jakob Uszkoreit, who suspected that attention mechanisms, a technique for allowing neural networks to focus on relevant parts of their input, might be sufficient on their own, without the sequential processing that had defined language models for years. The hypothesis went against conventional wisdom. Even his father, Hans Uszkoreit, a well-known computational linguist, was skeptical. But Uszkoreit pursued it anyway, and when the team needed a name for their architecture, he chose "Transformer" because, as he later said, he "liked the sound of that word."

Ashish Vaswani and Illia Polosukhin designed and implemented the first models. Noam Shazeer contributed the core mathematical insights: scaled dot-product attention, multi-head attention, parameter-free position representations. Niki Parmar tuned countless variants. Łukasz Kaiser and Aidan Gomez built the software infrastructure. Llion Jones wrote the initial codebase.
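
In the paper's own notation, scaled dot-product attention maps a matrix of queries Q to a weighted combination of values V, with the weights determined by how well each query matches the keys K:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

Here d_k is the dimension of the keys, and dividing by its square root keeps the softmax from saturating. Multi-head attention runs several such attention functions in parallel over different learned projections of Q, K, and V, then concatenates the results.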

The paper reported that their architecture trained faster than alternatives while achieving better results on translation benchmarks. The improvement was meaningful but not explosive. What mattered was what the architecture enabled: parallelization across sequences, allowing training to scale with available compute in ways that recurrent networks could not.
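
To make that concrete, here is a minimal sketch of self-attention in NumPy (an illustration of the idea, not the authors' implementation). The attention weights for every position in a sequence come out of a single matrix multiplication; there is no step-by-step loop over positions of the kind a recurrent network requires.

```python
# Minimal self-attention sketch in NumPy. Illustrative only; the original work
# used learned projections, multiple heads, and ran on specialized hardware.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # every position scores every other position at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # weighted sum of values, computed for all positions in parallel

# Toy example: a "sentence" of 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)             # self-attention: Q, K, V all derived from the same input
print(out.shape)                                        # (4, 8)
```

Because the whole sequence is handled in one batched computation, the work maps naturally onto GPUs; a recurrent network, by contrast, cannot begin position t+1 until it has finished position t.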

Training the models described in the Transformer paper cost an estimated $900 in compute, a trivial sum even for academic research. Within six years, training a model based on the same architecture would cost over $100 million.


The scaling began almost immediately.

OpenAI released GPT (Generative Pre-trained Transformer) in 2018, demonstrating that language models could be pre-trained on vast amounts of text and then fine-tuned for specific tasks. Google's BERT, released the same year, showed that bidirectional attention—looking at context in both directions—could achieve state-of-the-art results across a wide range of natural language understanding benchmarks.

But it was the scaling itself that became the story. GPT-2, released in 2019, had 1.5 billion parameters, large enough that OpenAI initially withheld the full model, citing concerns about misuse. GPT-3, in 2020, had 175 billion parameters and cost an estimated $2 to $4.6 million to train. The model could write essays, answer questions, generate code, and produce text that many readers could not distinguish from human writing.

The scaling hypothesis emerged as a guiding principle: perhaps the path to more capable AI was simply to make models larger, train them on more data, and use more compute. The architectural innovations mattered less than the exponential increase in resources. If you could afford to train a bigger model, you would get better results.

The costs exploded accordingly. GPT-4, released in 2023, reportedly cost over $100 million to train—Sam Altman publicly confirmed "more than $100 million," and some estimates ranged as high as $540 million. Google's Gemini Ultra was estimated at $191 million. The trajectory suggested that frontier training runs would soon exceed one billion dollars.

The cost structure revealed something important. Computing hardware accounted for 47 to 65 percent of total amortized costs. R&D staff, including equity compensation, accounted for 29 to 49 percent. Energy—the environmental cost that critics highlighted—was only 2 to 6 percent. The bottleneck was not electricity. It was GPUs, and the people who knew how to use them.

This meant that frontier AI research was increasingly concentrated among organizations with access to billions of dollars in capital. Academic researchers, who had driven AI through most of its history, found themselves unable to compete. The scaling hypothesis, if true, implied that the future of AI belonged to whoever could pay for it.


Not everyone accepted the premise.

In late 2020, four researchers submitted a paper to the ACM Conference on Fairness, Accountability, and Transparency. The title was "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" The lead authors were Emily Bender, a linguistics professor at the University of Washington, and Timnit Gebru, then co-lead of Google's Ethical AI team. Margaret Mitchell, Gebru's co-lead, was also an author—listed under the pseudonym "Shmargaret Shmitchell," a detail that would later seem both defiant and prophetic.

The paper made several arguments. First, the environmental and financial costs of large language models were significant and unevenly distributed—borne by the planet and by those without access to the technology. Second, the training data encoded biases that were difficult to identify and impossible to fully remove. Third, and most provocatively, the models did not understand language in any meaningful sense. They were "stochastic parrots," stitching together sequences of linguistic forms observed in their training data "according to probabilistic information about how they combine, but without any reference to meaning."

The metaphor cut to the heart of the scaling hypothesis. If larger models were simply better parrots (more fluent, more convincing, but no closer to understanding), then the billions of dollars being invested might be building an elaborate illusion rather than a path to intelligence.

Google's response transformed an academic debate into a public reckoning.

Management asked Gebru to retract the paper or remove the Google employees' names. Jeff Dean, Google's AI lead, said the paper "didn't meet our bar for publication." Gebru listed conditions under which she would consider staying; Google characterized her response as "accepting her resignation." Nine days before the conference accepted the paper, she was fired from her role as co-lead of the Ethical AI team. Two months later, Margaret Mitchell was also terminated.

Google employees protested, believing the intent was to censor internal criticism of the company's core business. Media coverage in The New York Times, Wired, and elsewhere amplified the controversy. The paper itself—which might have remained an academic contribution read by specialists—became widely known. The term "stochastic parrot" entered common usage; the American Dialect Society designated it the 2023 AI-related Word of the Year.

The firings revealed something about the institutional structures surrounding AI research. Critics could exist within companies, but only to a point. When their work threatened the narrative that justified billions in investment, the institutions responded.


The fundamental critique went deeper than any single paper.

Emily Bender, continuing her work after the controversy, argued that the framing of large language models as "intelligent" was itself a harm. The claims made by companies like OpenAI about "artificial general intelligence" shaped public understanding and regulatory response in misleading ways. "AI hurts consumers and workers and isn't intelligent," she stated bluntly. In 2025, she and the sociologist Alex Hanna published "The AI Con," arguing that fears about AI taking over the world were "symptoms of the hype being used by tech corporations to justify data theft, motivate surveillance capitalism, and devalue human creativity."

Timnit Gebru founded the Distributed AI Research Institute (DAIR) after leaving Google. She challenged the focus on hypothetical future harms—superintelligence, existential risk—at the expense of present ones. When prominent AI researchers published an open letter calling for a pause on AI development, citing civilizational risks, Gebru called it "sensationalist" and noted that it "amplified some futuristic, dystopian sci-fi scenario instead of current problems." The harms from AI, she argued, "are real and present and follow from the acts of people and corporations deploying automated systems."

Margaret Mitchell echoed this critique: "Ignoring active harms right now is a privilege that some of us don't have."

The critics were pointing at a pattern that had repeated throughout AI's history. The field's attention gravitated toward dramatic possibilities—machines that think, systems that could surpass human intelligence—while the mundane harms accumulated. Biased hiring algorithms. Surveillance systems disproportionately affecting marginalized communities. Training data scraped from the internet without consent, including copyrighted works, personal blog posts, and (as later investigations revealed) material that should never have existed.

The scaling race proceeded anyway. But the questions the critics raised did not disappear.


In January 2025, a Chinese company called DeepSeek claimed to have trained a model matching GPT-4's capabilities for approximately $5.6 million in compute for its final training run, a fraction of OpenAI's reported costs. On January 27, NVIDIA lost $589 billion in market capitalization as investors questioned whether the scaling hypothesis, and the infrastructure it demanded, had been necessary at all.

The claim remained contested. But it illustrated the uncertainty at the heart of the enterprise. The transformer architecture was real. The capabilities it enabled were real. But the story told about those capabilities, that scaling was the path to intelligence and that only the richest organizations could build the future, was always partly speculative.

What did "attention is all you need" actually mean? Technically, it meant that a specific mechanism could replace recurrence in neural networks. Metaphorically, it came to mean something different: that with enough attention (enough compute, enough data, enough money) intelligence itself might emerge.

The fundamental critics challenged this metaphor at its root. A stochastic parrot, no matter how large, is still a parrot. The patterns it produces may be beautiful, useful, even convincing. But they are patterns, not understanding. The distinction matters—not just philosophically, but practically, for how we regulate these systems, how we assign responsibility for their outputs, how we decide what to build next.

The transformer revolution was real. The language models it enabled were powerful. The question of what they actually were (intelligent systems approaching human cognition, or very sophisticated pattern-matching machines) remained open.

The architecture had attention. Whether it had understanding was a different question entirely.