1980s–’90s: Recurrent Neural Networks
ChatGPT is a version of GPT-3, a large language model also developed by OpenAI. Language models are a type of neural network that has been trained on lots and lots of text. (Neural networks are software inspired by the way neurons in animal brains signal one another.) Because text is made up of sequences of letters and words of varying lengths, language models require a type of neural network that can make sense of that kind of data. Recurrent neural networks, invented in the 1980s, can handle sequences of words, but they are slow to train and can forget previous words in a sequence.
In 1997, computer scientists Sepp Hochreiter and Jürgen Schmidhuber fixed this by inventing LTSM (Long Short-Term Memory) networks, recurrent neural networks with special components that allowed past data in an input sequence to be retained for longer. LTSMs could handle strings of text several hundred words long, but their language skills were limited.
The breakthrough behind today’s generation of large language models came when a team of Google researchers invented transformers, a kind of neural network that can track where each word or phrase appears in a sequence. The meaning of words often depends on the meaning of other words that come before or after. By tracking this contextual information, transformers can handle longer strings of text and capture the meanings of words more accurately. For example, “hot dog” means very different things in the sentences “Hot dogs should be given plenty of water” and “Hot dogs should be eaten with mustard.”
2018–2019: GPT and GPT-2
OpenAI’s first two large language models came just a few months apart. The company wants to develop multi-skilled, general-purpose AI and believes that large language models are a key step toward that goal. GPT (short for Generative Pre-trained Transformer) planted a flag, beating state-of-the-art benchmarks for natural-language processing at the time.
GPT combined transformers with unsupervised learning, a way to train machine-learning models on data (in this case, lots and lots of text) that hasn’t been annotated beforehand. This lets the software figure out patterns in the data by itself, without having to be told what it’s looking at. Many previous successes in machine-learning had relied on supervised learning and annotated data, but labeling data by hand is slow work and thus limits the size of the data sets available for training.
But it was GPT-2 that created the bigger buzz. OpenAI claimed to be so concerned people would use GPT-2 “to generate deceptive, biased, or abusive language” that it would not be releasing the full model. How times change.
GPT-2 was impressive, but OpenAI’s follow-up, GPT-3, made jaws drop. Its ability to generate human-like text was a big leap forward. GPT-3 can answer questions, summarize documents, generate stories in different styles, translate between English, French, Spanish, and Japanese, and more. Its mimicry is uncanny.
One of the most remarkable takeaways is that GPT-3’s gains came from supersizing existing techniques rather than inventing new ones. GPT-3 has 175 billion parameters (the values in a network that get adjusted during training), compared with GPT-2’s 1.5 billion. It was also trained on a lot more data.