ChatGPT and Large Language Models: Six Evolutionary Steps
The evolution of language models is nothing less than a super-charged industrial revolution. Google lit the spark in 2017 with the development of transformer models, which enable language models to focus on, or attend to, key elements in a passage of text. The next breakthrough — language model pre-training, or self-supervised learning — came in 2020 after which LLMs could be significantly scaled up to drive Generative Pretrained Transformer 3 (GPT-3). While large language models (LLMs) like ChatGPT are far from perfect, their development will only accelerate in the months and years ahead. The rapid expansion of the ChatGPT plugin store hints at the rate of acceleration. To anticipate how they will shape the investment industry, we need to understand their origins and their path thus far. So what were the six critical stages of LLMs’ early evolution? The Business of GPT-4: How We Got Here ChatGPT and GPT-4 are just two of the many LLMs that OpenAI, Google, Meta, and other organizations have developed. They are neither the largest nor the best. For instance, we prefer LaMDA for LLM dialogue, Google’s Pathways Language Model 2 (PaLM 2) for reasoning, and Bloom as an open-source, multilingual LLM. (The LLM leaderboard is fluid, but this site on GitHub maintains a helpful overview of model, papers, and rankings.) So, why has ChatGPT become the face of LLMs? In part, because it launched with greater fanfare first. Google and Meta each hesitated to launch their LLMs, concerned about potential reputational damage if they produced offensive or dangerous content. Google also feared its LLM might cannibalize its search business. But once ChatGPT launched, Google’s CEO Sundar Pichai, reportedly declared a “code red,” and Google soon unveiled its own LLM. GPT: The Big Guy or the Smart Guy? The ChatGPT and ChatGPT Plus chatbots sit on top of GPT-3 and GPT-4 neural networks, respectively. In terms of model size, Google’s PaLM 2, NVIDIA’s Megatron-Turing Natural Language Generation (MT-NLG), and now GPT-4 have eclipsed GPT-3 and its variant GPT-3.5, which is the basis of ChatGPT. Compared to its predecessors, GPT-4 produces smoother text of better linguistic quality, interprets more accurately, and, in a subtle but significant advance over GPT-3.5, can handle much larger input prompts. These improvements are the result of training and optimization advances — additional “smarts” — and probably the pure brute force of more parameters, but OpenAI does not share technical details about GPT-4. ChatGPT Training: Half Machine, Half Human ChatGPT is an LLM that is fine-tuned through reinforcement learning, specifically reinforcement learning from human feedback (RLHF). The process is simple in principle: First humans refine the LLM on which the chatbot is based by categorizing, on a massive scale, the accuracy of the text the LLM produces. These human ratings then train a reward model that automatically ranks answer quality. As the chatbot is fed the same questions, the reward model scores the chatbot’s answers. These scores go back into fine-tuning the chatbot to produce better and better answers through the Proximal Policy Optimization (PPO) algorithm. ChatGPT Training Process Source: Rothko Investment Strategies The Machine Learning behind ChatGPT and LLMs LLMs are the latest innovation in natural language processing (NLP). A core concept of NLP are language models that assign probabilities to sequences of words or text — S = (w1,w2, … ,wm) — in the same way that our mobile phones “guess” our next word when we are typing text messages based on the model’s highest probability. Steps in LLM Evolution The six evolutionary steps in LLM development, visualized in the chart below, demonstrate how LLMs fit into NLP research. The LLM Tech (R)Evolution 1. Unigram Models The unigram assigns each word in the given text a probability. To identify news articles that describe fraud in relation to a company of interest, we might search for “fraud,” “scam,” “fake,” and “deception.” If these words appear in an article more than in regular language, the article is likely discussing fraud. More specifically, we can assign a probability that a piece of text is about. More specifically, we can assign a probability that a piece of text is about fraud by multiplying the probabilities of individual words: In this equation, P(S) denotes the probability of a sentence S, P(wi) reflects the probability of a word wi appearing in a text about fraud, and the product taken over all m words in the sequence, determines the probability that these sentences are associated with fraud. These word probabilities are based on the relative frequency at which the words occur in our corpus of fraud-related documents, denoted as D, in the text under examination. We express this as P(w) = count(w) / count(D), where count(w) is the frequency that word w appears in D and count(D) is D’s total word count. A text with more frequent words is more probable, or more typical. While this may work well in a search for phrases like “identify theft,” it would not be as effective for “theft identify” despite both having the same probability. The unigram model thus has a key limitation: It disregards word order. 2. N-Gram Models “You shall know a word by the company it keeps!” — John Rupert Firth The n-gram model goes further than the unigram by examining subsequences of several words. So, to identify articles relevant to fraud, we would deploy such bigrams as “financial fraud,” “money laundering,” and “illegal transaction.” For trigrams, we might include “fraudulent investment scheme” and “insurance claim fraud.” Our fourgram might read “allegations of financial misconduct.” This way we condition the probability of a word on its preceding context, which the n-gram estimates by counting the word sequences in the corpus on which the model was trained. The formula for this would be: This model is more realistic, giving a higher probability to “identify theft” rather than “theft identify,” for example. However, the counting method has some pitfalls. If a word sequence does not occur in the corpus, its probability will be zero, rendering the entire product as zero. As the value of the
ChatGPT and Large Language Models: Six Evolutionary Steps Read More »










