ChatGPT and Large Language Models: Their Risks and Limitations
For more on artificial intelligence (AI) in investment management, check out The Handbook of Artificial Intelligence and Big Data Applications in Investments, by Larry Cao, CFA, from the CFA Institute Research Foundation.

Performance and Data

Despite its seemingly “magical” qualities, ChatGPT, like other large language models (LLMs), is just a giant artificial neural network. Its complex architecture consists of about 400 core layers and 175 billion parameters (weights), all trained on human-written texts scraped from the web and other sources. All told, these textual sources total about 45 terabytes of initial data. Without the training and tuning, ChatGPT would produce only gibberish.

We might imagine that LLMs’ astounding capabilities are limited only by the size of their networks and the amount of data they train on. That is true to an extent. But LLM inputs cost money, and even small improvements in performance require significantly more computing power. According to estimates, training GPT-3 consumed about 1.3 gigawatt hours of electricity and cost OpenAI about $4.6 million in total. The larger GPT-4 model, by contrast, likely cost $100 million or more to train.

OpenAI researchers may have already reached an inflection point, and some have admitted that further performance improvements will have to come from something other than increased computing power.

Still, data availability may be the most critical impediment to the progress of LLMs. GPT-4 has been trained on all the high-quality text that is available from the internet. Yet far more high-quality text is stored away in individual and corporate databases and is inaccessible to OpenAI or other firms at reasonable cost or scale. Such curated training data, layered with additional training techniques, could fine-tune pre-trained LLMs to better anticipate and respond to domain-specific tasks and queries.
Such LLMs would not only outperform larger LLMs but also be cheaper, more accessible, and safer.

But inaccessible data and the limits of computing power are only two of the obstacles holding LLMs back.

Hallucination, Inaccuracy, and Misuse

The most pertinent use case for foundational AI applications like ChatGPT is gathering, contextualizing, and summarizing information. ChatGPT and other LLMs have helped write dissertations and extensive computer code and have even taken and passed complicated exams. Firms have commercialized LLMs to provide professional support services. The company Casetext, for example, has deployed ChatGPT in its CoCounsel application to help lawyers draft legal research memos, review and create legal documents, and prepare for trials.

Yet whatever their writing ability, ChatGPT and other LLMs are statistical machines. They provide “plausible” or “probable” responses based on what they “saw” during their training. They cannot always verify or describe the reasoning and motivation behind their answers. While GPT-4 may have passed multi-state bar exams, an experienced lawyer should no more trust its legal memos than they would those written by a first-year associate.

The statistical nature of ChatGPT is most obvious when it is asked to solve a mathematical problem. Prompt it to integrate some multiple-term trigonometric function and ChatGPT may provide a plausible-looking but incorrect response. Ask it to describe the steps it took to arrive at the answer, and it may again give a plausible-looking response. Ask again and it may offer an entirely different answer. Yet there should be only one right answer, and each analytical step toward it should follow from the last. This underscores the fact that ChatGPT does not “understand” math problems and does not apply the computational algorithmic reasoning that mathematical solutions require.
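
To make the point about changing answers concrete, here is a minimal Python sketch of weighted random sampling. It is a toy illustration, not how ChatGPT is actually implemented: real LLMs sample one token at a time from a vocabulary of many thousands, whereas this sketch draws a whole answer from three hypothetical candidates. The candidate answers, their scores, and the function names are all assumptions invented for the example.

```python
import math
import random

# Hypothetical candidate answers to "integrate sin(x)" with made-up scores.
# Only the first candidate is mathematically correct; the scores are
# illustrative and do not come from any real model.
CANDIDATES = ["-cos(x) + C", "cos(x) + C", "-sin(x) + C"]
SCORES = [2.0, 1.6, 1.5]


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_answer(candidates, scores, rng, temperature=1.0):
    """Draw one candidate at random, weighted by its softmax probability."""
    weights = softmax(scores, temperature)
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so repeated runs can differ
    for attempt in range(1, 6):
        print(f"Attempt {attempt}: {sample_answer(CANDIDATES, SCORES, rng)}")
```

With these made-up scores, the correct answer is the single most likely choice yet is drawn less than half the time, so asking repeatedly can surface a wrong answer that looks just as confident. Lowering the hypothetical `temperature` argument makes the top-scoring candidate dominate, but it does nothing to check whether that candidate is actually right.
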
The random statistical nature of LLMs also makes them susceptible to what data scientists call “hallucinations,” flights of fancy that they pass off as reality. If they can provide wrong yet convincing text, LLMs can also spread misinformation and be used for illegal or unethical purposes. Bad actors could prompt an LLM to write articles in the style of a reputable publication and then disseminate them as fake news, for example. Or they could use it to defraud clients by obtaining sensitive personal information. For these reasons, firms like JPMorgan Chase and Deutsche Bank have banned the use of ChatGPT.

How can we address LLM-related inaccuracies, accidents, and misuse? Fine-tuning pre-trained LLMs on curated, domain-specific data can help improve the accuracy and appropriateness of their responses. The company Casetext, for example, relies on pre-trained GPT-4 but supplements its CoCounsel application with additional training data (legal texts, cases, statutes, and regulations from all US federal and state jurisdictions) to improve its responses. CoCounsel also recommends more precise prompts based on the specific legal task the user wants to accomplish, and it always cites the sources from which it draws its responses.

Certain additional training techniques, such as reinforcement learning from human feedback (RLHF), applied on top of the initial training can reduce an LLM’s potential for misuse or misinformation as well. RLHF “grades” LLM responses based on human judgment. These grades are then fed back into the neural network as part of its training to reduce the possibility that the LLM will provide inaccurate or harmful responses to similar prompts in the future. Of course, what counts as an “appropriate” response depends on perspective, so RLHF is hardly a panacea.

“Red teaming” is another improvement technique through which users “attack” the LLM to find its weaknesses and fix them.
Red teamers write prompts to persuade the LLM to do what it is not supposed to do, in anticipation of similar attempts by malicious actors in the real world. By identifying potentially bad prompts, LLM developers can then set guardrails around the LLM’s responses. While such efforts do help, they are not foolproof. Despite extensive red teaming on GPT-4, users can still engineer prompts to circumvent its guardrails.

Another potential solution is deploying additional AI to police the LLM by creating a secondary neural network in parallel with the LLM. This second AI is trained to judge the LLM’s responses based on certain ethical principles or policies. The “distance” between the LLM’s response and the “right” response, according to the judge AI, is fed back into the LLM as part of its training process. This way, when the LLM considers its choice of











