Artificial intelligence is often discussed as though it arrived suddenly — a product of recent decades, powered by silicon and venture capital. The reality is considerably richer and more complicated. The intellectual foundations of AI stretch back more than a century, rooted in formal logic, mathematics, and philosophy of mind. Its development has been marked not by a single continuous arc of progress but by cycles: periods of intense optimism followed by funding collapses, quiet rebuilding, and then breakthroughs that reignited the field. Understanding that history is essential for anyone who wants to think clearly about what AI can and cannot do today — and where it is likely to go next.
Philosophical and Mathematical Foundations (Pre-1950)
The dream of artificial thought is older than computing itself. Gottfried Wilhelm Leibniz, writing in the seventeenth century, conceived of a "calculus ratiocinator" — a universal logical calculator that could reduce all reasoning to mechanical symbol manipulation. That vision was more speculative than operational, but it planted a seed that would flower two centuries later in the work of George Boole, whose 1854 work An Investigation of the Laws of Thought demonstrated that logical relationships could be expressed and manipulated algebraically. Boole's insight — that the rules of thought might be captured in a formal symbolic system — became one of the conceptual pillars of modern computing.
The proximate technical predecessor to AI, however, was a 1943 paper by neurophysiologist Warren McCulloch and mathematician Walter Pitts. In "A Logical Calculus of the Ideas Immanent in Nervous Activity," published in the Bulletin of Mathematical Biophysics, McCulloch and Pitts proposed a mathematical model of a neuron as a simple binary threshold unit [1]. They showed that networks of such units could, in principle, compute any logical function. This was not a working machine — it was a theoretical framework — but it established the possibility of mechanical intelligence grounded in the architecture of the biological brain, rather than abstract symbol manipulation alone. Russell and Norvig describe it as the paper that "first pointed the way to viewing the mind as a formal logical system" [14].
The Birth of AI (1950–1956)
The field's founding document is widely considered to be Alan Turing's 1950 paper "Computing Machinery and Intelligence," published in the journal Mind [2]. Turing began with a question — "Can machines think?" — and immediately noted that the question was too philosophically loaded to be useful. He replaced it with what he called the Imitation Game, now universally known as the Turing Test: a machine could be considered intelligent if a human interrogator, communicating via text, could not reliably distinguish it from a human respondent. The paper also anticipates and rebuts many of the objections to machine intelligence that remain in circulation today, including theological objections, claims about consciousness, and the argument from informality of behaviour. Its prescience is remarkable: Turing predicted that by the year 2000, machines would be capable of playing the Imitation Game well enough to fool a human thirty percent of the time.
I propose to consider the question, 'Can machines think?'
— Alan Turing, Computing Machinery and Intelligence, Mind, 1950 [2]The formal founding of AI as a discipline came six years later. In the summer of 1956, John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon convened a two-month workshop at Dartmouth College. The proposal they submitted to the Rockefeller Foundation stated: "We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it" [3]. It was at this conference that the term "artificial intelligence" was coined and established as the name of the new field. McCarthy later said the name was chosen deliberately to distance the enterprise from the field of cybernetics, which was then dominated by Norbert Wiener.
The Golden Age of Early Optimism (1956–1974)
The years immediately following Dartmouth were among the most intellectually exciting in the history of the field. Allen Newell, Herbert Simon, and Cliff Shaw built the Logic Theorist (1956), a program that could prove mathematical theorems from Whitehead and Russell's Principia Mathematica. They followed it with the General Problem Solver (1957), designed as a universal problem-solving architecture that modelled human cognitive processes [14]. Simon famously predicted in 1957 that within ten years a computer would be world chess champion and that machines would be capable of any cognitive work a human could perform.
On the neural network side, Frank Rosenblatt introduced the Perceptron in 1958, described in his paper "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" in Psychological Review [4]. The Perceptron was a single-layer neural network that could learn to classify patterns by adjusting the weights of its connections based on errors in its output — the first machine learning algorithm to draw explicit inspiration from neuroscience. The U.S. Navy funded the project, and the New York Times ran a story claiming the Navy had unveiled "the embryo of an electronic computer that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The hype was considerable.
Herbert Simon predicts machines will outperform humans at all cognitive tasks within ten years. The prediction will prove spectacularly wrong, but it reflects the genuine optimism of the era.
Rosenblatt's Perceptron demonstrated at the Cornell Aeronautical Laboratory — the first hardware implementation of a neural network capable of supervised learning [4].
Joseph Weizenbaum at MIT creates ELIZA, an early natural language processing program capable of simulating conversation by pattern-matching user input to scripted responses. Its DOCTOR script, which mimics a Rogerian psychotherapist, becomes famous — and disturbing to Weizenbaum — when users form emotional attachments to the program.
The First AI Winter (1974–1980)
The optimism of the 1960s collided with reality on multiple fronts simultaneously. In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a rigorous mathematical analysis of the limitations of single-layer neural networks [5]. They demonstrated that the Perceptron could not learn to classify patterns that were not linearly separable — including the simple exclusive-or (XOR) function. While their analysis was technically correct and mathematically important, it had the unintended effect of dramatically reducing funding and interest in neural network research for more than a decade. Many researchers interpreted the book as a broader indictment of connectionist approaches, though Minsky and Papert had been more circumspect.
The broader reckoning came in 1973, when Sir James Lighthill published a report commissioned by the British Science Research Council evaluating the state of AI research. The Lighthill Report concluded that AI had failed to achieve its promised goals in any of its target areas and that the "combinatorial explosion" problem — the tendency of AI algorithms to require exponentially more computation as problems scaled — was fundamental and not a temporary obstacle [14]. The report triggered substantial cuts in AI funding in the United Kingdom and contributed to a similar contraction in the United States, where DARPA sharply reduced its support for AI projects it regarded as not delivering practical results.
The limitations of machine translation had also become apparent. A 1966 report by the Automatic Language Processing Advisory Committee (ALPAC) concluded that machine translation was slower, less accurate, and twice as expensive as human translation, triggering a decade-long freeze in funding for that domain.
Expert Systems and Revival (1980–1987)
The field's recovery came through a different paradigm: rather than trying to build general intelligence, researchers focused on encoding the specific knowledge of human experts in narrow domains. These "expert systems" used rule-based inference engines to reason over large databases of domain-specific facts. The approach had its roots in the work of Edward Feigenbaum at Stanford, who developed DENDRAL (1965), a system for inferring molecular structure from mass spectrometry data, and later MYCIN (1972–1974), a system for diagnosing bacterial infections and recommending antibiotic treatments [14]. MYCIN achieved diagnostic accuracy comparable to specialist physicians — a landmark result.
The commercial moment arrived in the early 1980s. Digital Equipment Corporation deployed XCON (also known as R1), an expert system that configured VAX computer orders, reportedly saving the company $40 million per year by 1986. Dozens of companies established AI divisions. The expert systems market was estimated at $2 billion by 1988. Japan's Ministry of International Trade and Industry announced the Fifth Generation Computer Project in 1982, a ten-year national program to develop massively parallel Prolog-based machines capable of natural language interaction and expert reasoning. The project prompted defensive responses from the United States and United Kingdom and injected substantial new funding into the field globally.
The Second AI Winter (1987–1993)
Expert systems proved brittle in practice. They were expensive to build, difficult to update, and performed poorly outside their narrow training domain. A physician who saw a patient with an unusual presentation could adapt and reason from first principles; MYCIN could not. The knowledge acquisition bottleneck — the sheer difficulty and cost of encoding expert knowledge in a form the system could use — proved to be a fundamental constraint that no amount of engineering could fully overcome.
The market for specialised AI hardware collapsed in 1987 when cheaper general-purpose workstations became capable of running the same software. Lisp machine companies like Symbolics and Lisp Machines Inc. went out of business. DARPA's Strategic Computing Initiative, which had funded ambitious military AI projects, was redirected toward more conventional computing after its goals proved unattainable on schedule. Japan's Fifth Generation project was quietly wound down in 1992 without achieving its central objectives. Another winter had arrived — less dramatic than the first but arguably more thorough in its discrediting of the dominant approaches.
Statistical Methods and Quiet Progress (1993–2010)
What followed was not the death of AI but its transformation. Researchers who had grown frustrated with hand-crafted rules and brittle symbolic systems turned increasingly to statistical and probabilistic methods that could learn patterns from data rather than requiring them to be explicitly encoded. This shift was as profound as any event in the field's history, though it happened gradually and without fanfare.
The theoretical foundations had already been laid. David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning Representations by Back-propagating Errors" in Nature in 1986 [6], introducing the backpropagation algorithm as a practical method for training multi-layer neural networks. Backpropagation made it possible to adjust the weights of all layers in a network simultaneously by propagating error signals backward through the network — solving the credit assignment problem that had stymied earlier approaches. Though the paper appeared during the second winter, its ideas would prove foundational to everything that followed.
Vladimir Vapnik and colleagues developed Support Vector Machines (SVMs) in the early 1990s, publishing a comprehensive account in Vapnik's 1995 book The Nature of Statistical Learning Theory [7]. SVMs offered a principled, theoretically grounded approach to classification that performed well on real-world datasets and became one of the dominant machine learning methods of the late 1990s and 2000s.
In 1997, two milestones marked the field's recovery. IBM's Deep Blue defeated world chess champion Garry Kasparov in a six-game match — the first time a computer had beaten a reigning world champion under standard tournament conditions. The same year, Sepp Hochreiter and Jürgen Schmidhuber published "Long Short-Term Memory" in Neural Computation [8], introducing the LSTM architecture. LSTM networks addressed the vanishing gradient problem that had made it difficult to train recurrent neural networks on long sequences, enabling effective learning from temporal data. LSTM would later underpin breakthroughs in speech recognition, machine translation, and text generation.
The key insight of backpropagation is that error signals can be propagated backward through a network of simple units, allowing every weight to be updated in proportion to its contribution to the output error.
— Rumelhart, Hinton & Williams, Nature, 1986 [6]Yann LeCun and colleagues demonstrated that convolutional neural networks (CNNs) could achieve strong performance on handwritten digit recognition, describing the architecture in a widely influential 1998 paper in the Proceedings of the IEEE [9]. CNNs exploit the spatial structure of images by applying the same learned filter across all positions — a design that drastically reduces the number of parameters compared to fully connected networks and builds in translation invariance. The architecture would remain relatively dormant for another decade before a hardware revolution brought it to the center of the field.
The Deep Learning Revolution (2010–2017)
The decisive moment came in September 2012 at the ImageNet Large Scale Visual Recognition Challenge. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted a deep convolutional neural network — later known as AlexNet — that achieved a top-5 error rate of 15.3%, compared to 26.2% for the next-best entry [10]. The margin was not incremental; it was a discontinuous leap that made clear something fundamental had changed. The key enablers were the availability of large labeled datasets (the ImageNet dataset contained over a million labeled images), increases in GPU computing power that made training deep networks feasible, and architectural refinements including the use of ReLU activation functions and dropout regularisation.
The result triggered a rapid reorientation of the entire field. Within three years, deep learning approaches dominated virtually every benchmark in computer vision, speech recognition, and natural language processing. Google, Facebook, Microsoft, and Baidu established dedicated deep learning research groups and began recruiting aggressively from the small community of researchers who had kept the approach alive through the lean years.
In 2014, Ian Goodfellow and colleagues at the Université de Montréal introduced Generative Adversarial Networks (GANs), described in "Generative Adversarial Nets" presented at NeurIPS [11]. The GAN framework pits two neural networks against each other: a generator that produces synthetic data and a discriminator that attempts to distinguish synthetic from real data. Trained together in a game-theoretic equilibrium, the generator learns to produce increasingly realistic outputs. GANs enabled a new class of capabilities — photorealistic image synthesis, style transfer, and data augmentation — and opened the door to what would later be called generative AI.
In March 2016, Google DeepMind's AlphaGo defeated Lee Sedol, one of the world's strongest Go players, four games to one. The result was published in Nature by Silver et al. [12] and marked a turning point in public understanding of AI capabilities. Go had long been considered a uniquely human game, with a search space so vast that brute-force methods were computationally intractable. AlphaGo's success — combining deep reinforcement learning with Monte Carlo tree search — demonstrated that AI could achieve superhuman performance in a domain requiring long-term strategic reasoning, not just pattern recognition.
The Transformer Era and Large Language Models (2017–Present)
The paper that most directly shaped the current moment in AI was published at NeurIPS in 2017. "Attention Is All You Need" by Vaswani and colleagues at Google Brain introduced the Transformer architecture [13]. Previous sequence models — including LSTMs — processed text sequentially, maintaining a hidden state that was updated at each step. The Transformer dispensed with recurrence entirely, instead using a mechanism called self-attention that allows every position in a sequence to directly attend to every other position simultaneously. This made training dramatically more parallelisable and eliminated the bottleneck of sequential processing, enabling models to be trained on vastly larger datasets than had previously been feasible.
The Transformer architecture became the foundation for a sequence of increasingly capable language models. Google's BERT (2018) was pre-trained on large text corpora using masked language modelling and demonstrated that a single pre-trained model could be fine-tuned to achieve state-of-the-art results across a wide range of NLP tasks. OpenAI's GPT series scaled the approach in a different direction: rather than fine-tuning, GPT-2 (2019) and especially GPT-3 (2020) demonstrated that sufficiently large language models exhibited emergent few-shot learning capabilities — the ability to perform new tasks from just a few examples presented in the prompt, without any gradient-based fine-tuning [14b].
GPT-3, described in Brown et al.'s "Language Models are Few-Shot Learners" at NeurIPS 2020, contained 175 billion parameters and was trained on hundreds of billions of words of text [14b]. Its capabilities — coherent long-form writing, code generation, arithmetic, analogical reasoning — surprised even many experts in the field and sparked a new wave of investment and public interest. The release of ChatGPT in November 2022, built on the successor GPT-3.5 architecture and refined through reinforcement learning from human feedback (RLHF), brought these capabilities to a mass audience for the first time. Within two months, it had reached one hundred million users — the fastest adoption of any consumer application in recorded history.
The period since 2022 has seen rapid scaling and proliferation. GPT-4, Claude, Gemini, Llama, and Mistral — along with multimodal models capable of reasoning across text, images, audio, and code — have transformed the landscape of software development, scientific research, education, and creative work. Alongside these capabilities, concerns about hallucination, bias, misuse, and alignment have moved from academic papers to regulatory agendas. The European Union's AI Act, the first comprehensive legal framework for AI regulation, entered into force in 2024.
Surveying seventy-five years of AI history, several patterns emerge that are worth holding onto as the field continues to develop. First, progress has rarely been linear. The field has passed through at least two major winters — periods in which promised capabilities failed to materialise, funding was withdrawn, and dominant paradigms were discredited. Each winter was followed by a recovery built on ideas that had been developed quietly during the lean years. Backpropagation was published in 1986, at the beginning of the second winter; it became the foundation of the deep learning revolution twenty-five years later.
Second, the relationship between hardware and algorithmic progress is inseparable. AlexNet did not succeed because of a novel architectural idea alone — CNNs had existed since the late 1980s — but because GPUs finally made training them at scale computationally feasible. The current generation of large language models similarly depends on specialised hardware infrastructure that did not exist a decade ago.
Third, and perhaps most importantly, the history of AI is a history of repeatedly underestimated difficulty and repeatedly surprised expectations — in both directions. Problems that seemed tractable (general natural language understanding, commonsense reasoning) proved extraordinarily hard. Problems that seemed uniquely human (playing Go at superhuman levels, generating photorealistic images, writing coherent prose) fell faster than almost anyone predicted. This asymmetry should inform how we think about current claims and current concerns: genuine humility about what we know, combined with genuine seriousness about what we do not.
The field is further from its beginnings than Turing could have imagined, and further from its destination than its most enthusiastic advocates are willing to admit.
- McCulloch, W.S. & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.
- Turing, A.M. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.
- McCarthy, J., Minsky, M.L., Rochester, N., & Shannon, C.E. (1955). A proposal for the Dartmouth summer research project on artificial intelligence. Reprinted in AI Magazine, 27(4), 12–14 (2006).
- Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
- Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
- Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
- Vapnik, V.N. (1995). The Nature of Statistical Learning Theory. Springer.
- Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
- Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS), 25, 1097–1105.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems (NeurIPS), 27.
- Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., … Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30.
- Russell, S. & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33, 1877–1901.
This article was researched and written by Claude (Anthropic) under the direction of Sheldon Valentine, Founder of Dear Tech. Content is grounded in peer-reviewed academic sources and standard university textbooks in the field of artificial intelligence. All citations are drawn from primary sources.