NOTE: an update to this post for 2024 was recently released! Some of the information below may be outdated. 

GPT-4

Since April, there have been significant improvements at the lower and middle parts of the technical spectrum, but relatively limited improvements in the upper part of the spectrum. That upper part is, of course, GPT-4.

OpenAI, the company behind ChatGPT and GPT-4, has not exactly been sitting still, but seems to have been focusing on massaging the public perception of LLMs. OpenAI’s CEO, Sam Altman, has openly stated that the launch of GPT-5 is still far away, and that the preparations for the next version have not even started.

In the meantime, even though OpenAI has not disclosed key details about GPT-4’s internal structure and architecture — despite the openness promised by its company name — some details have emerged on how GPT-4 works internally. According to several experts close to OpenAI, GPT-4 allegedly operates as a combination of eight smaller LLMs that act as a kind of “expert committee”, where each LLM independently formulates an answer, and the best answer is then selected.

Another interesting fact is that, according to Sam Altman, the daily operating cost of GPT-4 is “eye-wateringly” high. Many industry watchers assume that the infrastructure costs alone are so high that OpenAI must be operating at a significant loss on each prompt submitted by end-users. They therefore assume that OpenAI is using the traditional Silicon Valley technique of capturing market share and making users addicted to the technology, postponing the profitability question to the future.

Many users are convinced that they have seen the quality of GPT-4’s answers degrade since March. There have been consistent rumours that OpenAI has silently introduced trimmed-down versions of GPT-4 in order to reduce operating costs. OpenAI and others contest that this is the case, and explain that these impressions of quality degradation are probably caused by a combination of the inherent unpredictability of GPT-4 (no question is ever answered exactly the same way) and the “magic” wearing off (i.e., when GPT was new and magical, small mistakes were easily forgiven).

A faster version of GPT-4, exclusively targeted at enterprise customers, was released at the end of August. Content-wise it is the same as the March version, but — apart from the speed improvement — it distinguishes itself through enterprise-grade security and privacy. Which brings us to the confidentiality issue.

Confidentiality concerns

When showcasing how our own product portfolio integrates with GPT-4, we consistently get three questions from legal experts:

  • Where does the legal information come from?
  • Is it possible for legal experts to incorporate their own legal knowledge?
  • How should we deal with the confidentiality issues reported by the newspapers?

For all its cleverness, it is remarkable how OpenAI has completely dropped the marketing ball on this issue. The short answer is that OpenAI does indeed learn from the input of its users, similar to how Google learns which answers are preferred by users, by checking whether users reformulate their query or keep clicking on different hyperlinks in order to find their answer.

OpenAI limits this learning process to the end-user versions of ChatGPT and GPT-4. Initially, OpenAI also learned from user queries submitted through its API (i.e., when third-party developers incorporate the LLM in their own products), but the company claims to have stopped this practice since the end of March.

Even though end-users can opt out of the learning process through a simple software setting, and even though it no longer applies at all to API use, the damage was done. For most legal experts, the reputation of OpenAI is burned, and — setting aside the quality problem of hallucinations — many law firms downright refuse to allow any use of LLMs at all, for fear that client-confidential data would be exposed in the next version of the LLM.

While OpenAI has obviously done this to itself, the fear in the legal community is unfounded. LLMs learn from billions of data points, so adding a particular fragment of client information is like adding a single drop of water to an ocean of existing information. The likelihood that a client’s information is exposed through other means (gossip, data breaches, a lost laptop, …) is likely much higher. Moreover, as explained below, LLMs do not literally store text, so the analogy is probably closer to adding “a vague drop of water” to an existing ocean.

Customers can also use Microsoft’s version of GPT-4, through Microsoft’s Azure cloud services. Microsoft is not in the business of developing GPT, so it can guarantee in very strong language that it does not reuse any customer data for improving either GPT or any of its own products, and does not share customer data with OpenAI. These confidentiality guarantees are similar to how Microsoft will never reuse a customer’s DOCX files that are stored in its Office 365 cloud.

Once GPT-5 is released by OpenAI, Microsoft will simply make a copy of that version and make it available to its customers. In other words, Microsoft’s customers can “piggyback” on all the data that was implicitly provided by the millions of end-users who interact with the end-user (non-API) version of OpenAI’s ChatGPT and GPT-4.

Regulating LLMs

OpenAI’s CEO testified before the US Senate in May, accompanied by a written testimony, proactively calling for some level of regulation, and even proposing the formation of an AI authority that would license the most powerful AI systems.

OpenAI’s position in this regard appears strange at first sight, but clearly serves its own interests. OpenAI and its investor Microsoft are frontrunners when it comes to investing heavily in “safe AI”, so it is more likely that OpenAI’s competitors would run into trouble with an AI authority than OpenAI itself. Obviously, an even more important reason to call for at least some regulation is to avoid the introduction of very strict or rigid regulation, or a turn in public opinion against LLMs once the hype is over and realism kicks in.

That strict regulation is, of course, the European Union’s AI Act, which has been in the making for several years, but was significantly updated in June to reflect the sudden introduction of LLMs. As the most visible company in this space, OpenAI is well aware how the requirements of the AI Act — in terms of testing, liability, sanctions and so on — can undermine its business model. Many commentators expect that the AI Act could become the global “gold standard” for AI legislation, similar to the role of the EU General Data Protection Regulation (GDPR) for privacy and data protection. Having witnessed the GDPR’s impact on the business models of Google and Facebook/Meta, it is probably a smart approach for OpenAI to try to get soft regulation instead of no regulation at all.

Copyright lawsuits

In the public perception, there is an increasing amount of negative sentiment against LLMs, mostly due to the imminent threat of job loss, with GPT for example being the subtext of the recent Hollywood writers’ strike. There are also pending lawsuits against AI companies, such as the class-action suit against OpenAI by fiction writers such as John Grisham and George R.R. Martin, claiming that GPT’s training material incorporates illegal copies of their works. Previously, lawsuits were also launched against AI art generators (Stable Diffusion and Midjourney) and against GitHub for training Copilot on open-source lines of code.

Recognising the significant profits that generative AI is promising, online publishers are actively starting to disallow the webscraping done by LLM training bots. For example, the NY Times has updated its terms of use, prohibiting its contents from being used by AI training bots, while Reddit and X (the former Twitter) have changed their API pricing in order to get paid by companies such as OpenAI that want to use their contents for training purposes. OpenAI itself is also taking some proactive steps in this regard, by publishing instructions on how webmasters can easily exclude the contents of their domain from any webscraping at all, similar to how webmasters can opt out from a search engine’s scraping bots.

Such lawsuits are bound to increase, and their legal analysis will be very interesting because of the way LLMs technically incorporate texts. In the discussion around literal copying of texts (or images), LLMs operate much more like human brains than like traditional databases, in the sense that they do not literally store text. Instead, during the months of training, texts are converted into enormous “vectors” (i.e., ordered lists) of numbers that approximate the original texts. During the training process, LLMs are repeatedly fed different combinations of text fragments, and will slightly update the associated number vectors each time to make them fit all those different combinations. This is roughly similar to how a human brain learns that a particular verb can be used in different contexts and with different meanings.

When, after the training process, the LLM is asked to answer a particular question from a user, it will convert the user’s prompt into a number vector, and subsequently search its memory for vectors that are mathematically close. Once found, the LLM will then combine and reconstruct the matching vectors into the final text answer.
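To make the notion of “mathematically close” number vectors slightly more concrete, the following minimal Python sketch compares a few made-up vectors using cosine similarity, one of the standard closeness measures. The vectors, their labels and their tiny dimensionality are purely illustrative; real models work with vectors of hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return a closeness score between -1 and 1 for two number vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Purely illustrative 4-dimensional vectors; real LLMs use far larger ones.
termination_clause = np.array([0.8, 0.1, 0.3, 0.5])
notice_period      = np.array([0.7, 0.2, 0.4, 0.5])
recipe_for_soup    = np.array([0.1, 0.9, 0.0, 0.2])

print(cosine_similarity(termination_clause, notice_period))   # high score: related concepts
print(cosine_similarity(termination_clause, recipe_for_soup)) # low score: unrelated concepts
```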

A crucial point is that number vectors are much more compact than the literal texts: even the most powerful computers would be unbearably slow if they had to interactively juggle the trillions of words that were read during the training process. Instead, the number vectors capture a (detailed) “gist” of the texts, which allows for significant size reductions. The downside is that it is no longer possible to literally reproduce the initial texts: only close approximations of them can be reconstructed. Another consequence of the use of number vectors is the creativity and unpredictability of LLMs. Because they operate by calculating with number approximations of original texts, they will inherently generate slightly different results, even when the user’s prompt remains identical.
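This unpredictability can be illustrated with a toy sketch of how an LLM picks its next word: the model assigns a probability to each candidate word and then samples from that distribution, so the identical prompt can yield different continuations each time. The candidate words and probabilities below are invented for illustration only.

```python
import random

# Invented probabilities for the next word after "The contract shall ..."
next_word_probabilities = {
    "terminate": 0.40,
    "commence":  0.30,
    "remain":    0.20,
    "expire":    0.10,
}

def sample_next_word(probs):
    """Pick the next word by sampling from the probability distribution,
    rather than always taking the single most likely word."""
    words = list(probs.keys())
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

# Running this several times for the identical "prompt" gives varying words.
print([sample_next_word(next_word_probabilities) for _ in range(5)])
```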

All of this is similar to how human brains operate. When someone is asked to recall last week’s discussion with his wife, he will probably think about the place where the discussion took place, the emotions that were felt, etc., and then reconstruct the discussion based on those elements. If that same person were asked the same question next year, in a different place and a different context, he would probably give a somewhat different answer. Similar to how psychological studies have demonstrated that through “anchoring” people will remember different details, LLMs will also give different answers depending on the context, such as the keywords and previous questions that were asked.

It remains to be seen how courts will apply the technicalities behind number vectors to copyright legislation, which traditionally focuses on (semi-)literal copying. As ideas are free, copyright legislation will traditionally not prohibit a situation where a person hears the outline of a story and then reproduces parts of that story in his own words several weeks later. However, due to their reliance on approximating number vectors, LLMs operate in a way that is closer to “recall of detailed ideas” than to literal copying.

History may repeat itself, as courts and legal scholars will once again be confronted with new fundamental questions, similar to the questions that arose in the late nineties, e.g. regarding the legality of hyperlinking or whether search engines are allowed to store copies of web pages in their indices. While we nowadays regard those practices as legitimate, this was not necessarily the case at the time, when viewed through the lens of traditional copyright legislation. Perhaps in 20 years, we will look back at today’s questions regarding the way LLMs happen to technically operate, and again wonder what all the fuss was about.

It will be particularly interesting to see whether the “fair use” doctrine in US copyright legislation — which allows for some limited amount of copying and reuse — will once again allow for much more manoeuvring room, as compared to the exhaustive list of statutory exceptions under EU copyright legislation.

Another question that courts will have to answer is how LLM vendors should deal with takedown requests. Unlike traditional web pages and databases, in which it is easy to find and update all the instances of a certain infringing text, it is impossible for humans to “dive into” an LLM in order to find and remove the infringing material, because all the texts are stored as approximate number vectors that no human can even begin to understand — somewhat similar to how nobody can look into your brain or order you to forget a certain fact. Only recently have OpenAI’s own researchers been able to partially reconstruct how GPT-2 stores knowledge.

Commercial competitors for GPT-4

GPT-4 remains the Rolls Royce among all the LLMs. Its quality remains unmatched, and even though its competitors are coming closer, the gap remains significant for the time being. Perhaps that is also one of the reasons why OpenAI is not in a hurry to introduce GPT-5.

The closest competitor to the GPT-4 family is Google- and Amazon-backed Anthropic, founded by several former OpenAI employees. According to independent reviews, Anthropic’s “Claude” LLM is significantly faster and cheaper than GPT-4, has more recent knowledge (up to early 2023, as opposed to GPT-4’s September 2021), and is slightly better at legal writing.

One of Claude 2's most impressive features is its ability to handle large contexts up to about 75,000 words, which is a significant step up from the 24,000 words that a special, more expensive version of GPT-4 offers. We come back to this topic below because the character limit remains an Achilles heel for all LLMs.

By now it is well-known that Google is playing catch-up in the LLM space — which is ironic because Google researchers came up with the breakthrough “attention” algorithm in 2017, which laid the foundations for today’s LLMs. Unlike Microsoft and OpenAI, Google deliberately held off from launching an LLM, fearing that LLMs would cannibalise its search engine from a commercial perspective, while the LLMs’ hallucinations would undermine the public’s trust. When Google eventually hurried the launch of its Bard chatbot in February, it realised too late that the demo contained an embarrassing factual error. During a relaunch in May at its developers conference, Google announced a series of LLMs (from small to large, and some specialisations, e.g. for the medical industry) and promised to include the LLMs in many of its applications. So far, however, Google’s LLMs have failed to impress. In our daily interactions with legal experts, we also sense that most law firms and corporates are reluctant to entrust Google with their internal data, so it remains to be seen whether Google will catch up in the short term.

Several smaller players receive much less media attention. Cohere offers an array of large language models, and is often used for the niche application of “reranking” search results. German company Aleph Alpha presents itself as the European alternative to the American players.

Open-source LLMs

Probably the biggest development since April is the emergence of open-source LLMs. They present themselves as alternatives to the commercial players, with two specific advantages: zero licensing cost and the possibility of private hosting. Given the fear among legal experts regarding OpenAI’s confidentiality hiccups and Google’s ambiguous privacy track record, it should not come as a surprise that many law firms are actively looking into open-source LLMs.

Quite ironically, then, Facebook/Meta is the driving force behind these LLMs, having released a first version of its “Llama” LLM back in February. While from a formal licensing perspective that first version was only provided to some academic researchers and could not be used for commercial purposes, Meta took an ambiguous position on licensing and distribution, remaining completely silent about the leaks that quickly happened. Many assumed this was a commercial tactic to quickly release a non-safe version of an LLM, to test the waters and the public’s perception, without exposing itself to legal claims over unsafe interactions or copyright issues.

Llama became an instant hit, and was followed by a publicly available second version. It is not an “open-source” version in the strict sense, because its use by companies with more than 700 million active users is prohibited — obviously targeting Microsoft and Google. In practice, however, everyone treats it as a freely available open-source version.

Llama comes in several different model sizes, ranging from 7 to 70 billion parameters. The smaller models consume proportionally less memory and are significantly faster, but the trade-off is a reduction in quality.

Before Llama version 2, several other open-source LLMs had already appeared (such as Falcon and Mosaic’s MPT), which either modified the first version of Llama or were trained on similar data sets. Even companies that are primarily known for traditional, commercial software sales are suddenly releasing freely available open-source LLMs.

Even though the open-source LLMs are not a match for GPT-4 in terms of quality, their creators claim that the quality is often equal to ChatGPT, and at least better than GPT-3. Some industry analysts, as well as Google insiders, are convinced that in the longer term, commercially available LLMs are destined to be beaten by open-source LLMs, due to their free availability and collaborative nature.

However, the quality claims regarding the open-source LLMs must be taken with a grain of salt, because the testing methodology for LLMs remains a bit of a black box. In principle, the only way to measure whether an LLM can produce good legal content is to check most answers by hand. In practice, this is not feasible, so developers must create automated tests with easy indicators that can act as a proxy for quality. This allows for some degree of comparison, but in reality users are often underwhelmed when they notice that the quality of the answers generated by open-source LLMs remains significantly below GPT-4.
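As a purely hypothetical illustration of such an automated proxy test, the sketch below scores a model on a tiny set of yes/no legal questions with known answers. The `ask_llm` function is a placeholder for whatever model is being evaluated, and the benchmark questions are invented.

```python
# Hypothetical benchmark: a few yes/no questions with known answers,
# used as a rough proxy for legal quality.
benchmark = [
    ("Does this clause contain a commencement date? CLAUSE: ...", "yes"),
    ("Is a verbal amendment valid under this contract? CLAUSE: ...", "no"),
]

def ask_llm(question: str) -> str:
    """Placeholder for a call to the LLM under test (e.g. an open-source model)."""
    return "yes"  # dummy answer, for illustration only

def accuracy(benchmark) -> float:
    """Share of benchmark questions for which the model's answer matches the expected one."""
    correct = sum(1 for question, expected in benchmark
                  if ask_llm(question).strip().lower() == expected)
    return correct / len(benchmark)

print(f"Proxy score: {accuracy(benchmark):.0%}")
```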

In the daily reality of legal teams, quality of course remains the most important criterion, and unfortunately many of the smaller LLMs (such as the 7-billion-parameter version of Llama 2) are simply not usable for legal tasks. This often comes as a surprise to legal teams, particularly when they have been reading about the positive results that users in other sectors seem to have obtained with their own use cases. The unfortunate reality is that those other use cases often involve tasks where a significant degree of error is allowed, e.g. estimating whether a certain product review is positive or negative, or writing ideas for marketing campaigns.

Accordingly, even when a press release claims that a new open-source LLM approximates the quality of ChatGPT, this does not mean that the LLM is actually useful yet for complex real-life legal drafting work. Good open-source LLMs are capable of categorising content (e.g., “Does this clause contain a commencement date?”) and drafting straightforward individual paragraphs, but they cannot yet be expected to draft entire documents or exhibit advanced reasoning capabilities.

Finally, a very important remark is that almost all the open-source LLMs are English-only, with little knowledge of other languages. For example, 89% of the source material used for training Llama 2 consists of English texts, with other major languages making up less than 2% of the training data. This is yet another reason why the commercial LLMs (GPT-4 and Claude 2) still wear the crown in LLM-town.

Emergence of an ecosystem

LLMs themselves are not primarily intended to be used by end-users. Sure, OpenAI offers a chatbot interface that allows anybody to ask questions and get quick responses. However, that chatbot remains fairly bare-bones in its feature set, despite the various tweaks that have been added since April. OpenAI continues to primarily see itself as a developer of the foundational engine, and wants to leave sufficient room for the thousands of other companies to develop their own products around the LLM engines.

The most important company using OpenAI’s engine is OpenAI’s primary investor Microsoft. It has quickly built the GPT-4 engine into its own Bing search service, which initially seemed to threaten Google. This caused Google to rush its painful introduction of Bard back in February, although the positive effects for Bing seem to have already worn off. Through its Azure cloud services for developers, Microsoft offers its own version of the various GPT flavours, with similar pricing but without the confidentiality worries described above.

In late September, Microsoft also announced its intent to integrate OpenAI’s generative technology directly into Windows from October onwards, with the various Microsoft Office tools following from early November for enterprise customers. This means that every legal expert will be able to provide ChatGPT-like instructions directly from within Microsoft Word and Outlook, to, for example, quickly summarise a certain text, highlight the key topics discussed in a report, or optimise the wording of selected paragraphs. The details remain scarce however, and it remains to be seen whether the actual integration of GPT into Word and Outlook will be as slick as the demo video suggests.

Beyond Microsoft, thousands of start-ups have jumped on the LLM train. In practice, good developers can integrate simple text drafting and rephrasing capabilities in a matter of a few hours, but end-users are generally not aware of this. Start-ups and established companies alike abuse this lack of knowledge, claiming that the wonderful capabilities of their new product are the result of intense research, while in reality they simply provide a shiny “wrapper” around a standard LLM engine.

The legal tech sector is of course not an exception, with start-ups offering little more than a simple shell around GPT-4, while simultaneously claiming that their product has advanced legal knowledge built in. It remains to be seen how the legal sector, which has little experience with buying software, will cope with these featherweight products.

Lack of legal knowledge

While lawyers are consistently mentioned as one of the professions most threatened by LLMs, most legal experts seem to be merely superficially interested in actually using LLMs. One of the reasons is probably that the legal knowledge of LLMs is only so-so, particularly for specialised legal domains and small jurisdictions.

It should not come as a surprise that the performance of LLMs is strongest in sectors that have a relatively open data culture. For example, in software development, millions of open technical discussions and billions of lines of programming code are openly available. Even though LLMs cannot yet build entire software packages, astonishing results with shorter pieces of programming code are already possible today, thanks to the existence of this openly available material. Similarly, in the medical sector, we are witnessing the emergence of specialised LLMs, such as Google's MedPalm-2.

In comparison, only limited information is openly available for the legal sector.

First, the legal sector is hampered by fragmentation across jurisdictions. Compare this to medical knowledge, which roughly applies globally: a study done by Spanish ophthalmologists will in most cases be very relevant to their Brazilian colleagues. Conversely, legal domains such as employment law differ drastically between even neighbouring countries, which can easily lead to hallucinations when an LLM starts mixing employment law information from countries that happen to use the same language.

While legislation is usually publicly available online, this is only partially true for case law (despite trends towards more open data). Also, in most jurisdictions, only a small percentage of legal doctrine is publicly available online, with the majority locked behind the paywalls of legal publishers. Many law firms publish newsletters and blog posts, but most are intended as commercial teasers, from which the truly interesting questions tend to be deliberately left out.

Probably the most interesting missing piece of the legal information puzzle is practical knowledge. In the software development world, millions of online forums exist that provide high-quality Q&A-style discussions, ranging from very practical questions ("What does error message #1239874 mean in software X?") to advice about best practices and real-world behind-the-scenes stories. Those forums and written knowledge exchanges do exist for some jurisdictions, but their popularity and the volume of content pale in comparison to what is available in other industries. As we all know, true practical legal knowledge tends to be mostly acquired through experience and oral discussions.

Even if GPT-5 were five times better than GPT-4 from a technical point of view, this lack of legal content will remain a problem, as there are no signs that the amount of publicly available legal content will have drastically increased by 2025. As long as LLMs lack access to up-to-date, relevant legal information, their use in legal teams will be confined to light drafting tasks. In popular legal domains of large jurisdictions such as the United States and the United Kingdom, for which quite some legal content is already available online, more intense legal tasks (such as research, contract drafting and memo drafting) will obviously also be possible, but even there the quality of GPT-5 will probably remain stuck at the level of a bright-but-unpredictable general assistant.

What is required to elevate LLMs from the level of a somewhat useful general assistant to a truly useful legal assistant is being able to use your own, enriched LLM that somehow contains your practical and up-to-date knowledge, in addition to the general knowledge it already has.

In our post of April, we presented several different methods for reaching these goals: directly submitting knowledge into an instruction prompt, training a new LLM, finetuning an existing LLM and using semantic search. Let’s look again at each of them.

Training your own LLM

The big advantage of creating your own LLM is that it does not require legal teams to do significant preparatory work. Under this “lazy upload” approach, you simply take the existing datasets from open-source projects, throw the contents of your entire legal document management system at the training software, perhaps augment it with focused legal content (such as the Cambridge Law Corpus), and then let the software chew on those gigabytes of data. After several weeks, you get your own LLM as a result, with up-to-date legal information, mimicking your own style.

Many legal teams assume that this approach is probably within reach: if open-source projects can manage to train their own LLM, why would this not be possible for an ambitious legal team?

In our previous post, we warned that training a new LLM is not for the faint of heart, because of all the engineering problems involved.

First, the idea that open-source teams can easily create their own LLMs must be taken with a grain of salt. In reality, very few of the hyped open-source projects effectively train their own LLM. Instead, most of them merely create an optimised version of an existing LLM — e.g. a version that does not actively refuse to generate content about “dangerous” topics, or a version that better understands a user’s questions.

Only a few technical teams (such as those at Meta) have the technical know-how and the resources available to truly train an LLM from scratch, instead of merely optimising an existing LLM. This should not come as a surprise when you realise that thousands of expensive servers are required to jointly perform the required training calculations over the course of many weeks. The costs run into the millions, which is far beyond what law firms — let alone in-house legal teams — are willing to spend to create their own LLMs.

In addition, there is enormous scarcity in the hardware components required for training LLMs. While some alternative hardware options are being actively explored, the only viable training hardware at this moment consists of Nvidia’s professional processing units (“GPUs” such as the A100 and H100), which cost between $10,000 and $40,000 each. Those components can be rented from specialised providers, but because hundreds (if not thousands) of those GPUs are required simultaneously during the training process, demand far exceeds the available supply, and this will remain the case for the next year. Many specialised cloud providers are in fact rationing the hardware components. There is a reason that Nvidia’s quarterly results are skyrocketing, and that Elon Musk is saying that “GPUs are at this point considerably harder to get than drugs”. Even for OpenAI, the GPU shortage is said to be a factor in its relative lack of new product announcements over the last few months.

Even so, companies such as Mosaic (creator of the highly regarded MPT open-source LLM) have started to offer off-the-shelf tools to train LLMs from scratch. Through clever optimisations and easy tooling, these companies claim to bring down the entire training cost to less than half a million dollars, while requiring only a relatively limited amount of technical expertise.

So perhaps a future where every legal team can train its own LLM is not that far off after all? Price drops by a factor of 10 (i.e., $50,000 to train your own LLM) frequently happen in information technology — e.g. for hard disk and memory capacity — but take at least seven years in optimal circumstances.

Reinforcement learning

Instead of training an LLM from scratch, most of the open-source projects focus on adding optimisations to existing LLMs. Those optimisations particularly consist of adding a steering layer that integrates a human feedback loop into the answers of the standard version of the LLM.

It turns out that this so-called "reinforcement learning from human feedback" (RLHF — see the extensive introduction) is what separates a useful LLM from an unpredictable LLM. The underlying reason is that, during their training, LLMs are essentially taught to guess the most suitable next word from the number vectors they created. However, what statistical calculations determine to be the "next best word" is ultimately somewhat random, due to the billions of ingested words. An LLM uses several clever algorithms to remove garbage answers from its output, but there remains a significant divide between the answers that the software determines to be suitable and the answers that a human would consider relevant.

RLHF is essentially a layer of human feedback in which the standard answers from the LLM are ranked by humans. The commercial LLM vendors (OpenAI, Anthropic, Cohere and Google) are known to integrate thousands of manually rated answers ("high quality", "bad answer structure", "inappropriate content", "unhelpful", "not following instructions", ...) into their LLMs. Most technical experts see the integration of RLHF as one of the most important reasons why ChatGPT is able to generate answers that are so much more useful than the answers generated by GPT-3, even though the underlying training data of both LLMs is the same. However, the investments required for this manual labour are significant, the work tends to be laborious and monotonous, and vendors are criticised for working with poorly paid contractors to drive down the cost of this labour.
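For the technically curious, a common way to turn such human rankings into a training signal is a pairwise preference loss: a reward model scores two candidate answers to the same prompt, and is penalised whenever the answer humans preferred does not receive the higher score. The sketch below shows that calculation with invented scores; real RLHF pipelines wrap this in a full reward-model and reinforcement-learning setup.

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: small when the human-preferred answer
    scores higher than the rejected answer, large otherwise."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_rejected))))

# Invented reward-model scores for two answers to the same prompt.
print(pairwise_preference_loss(score_preferred=2.1, score_rejected=0.3))  # low loss: ranking respected
print(pairwise_preference_loss(score_preferred=0.3, score_rejected=2.1))  # high loss: ranking violated
```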

Open-source LLMs — created by volunteers and non-profit organisations — obviously do not have budgets to pay contractors. Projects such as Vicuna therefore use workarounds that approximate direct human feedback, e.g. by using old chat sessions that were uploaded and rated by community volunteers (e.g., from sharegpt.com). While this is a cheap substitute that is certainly not as good as using dedicated human workers, the quality of the output is effectively enhanced.

The good news is that open-source projects are increasingly coming up with new and clever ways for increasing the quality of their LLMs. The bad news is that RLHF is seen as a cornerstone for LLM quality, and thus constitutes yet another obstacle for legal teams that want to train their LLMs. After all, rating thousands upon thousands of answers in an average law firm is not very realistic.  

Finetuning an existing LLM

Instead of training an LLM from scratch, a legal team can also consider finetuning an LLM. This involves adding a layer on top of an existing LLM. Instead of deeply integrating data into the LLM (as is the case with training an LLM from scratch), one can simply start from an existing model, and add additional data to it. Instead of training a model for months and requiring stellar budgets, finetuning promises to bring LLM heaven to mere mortals, with only a couple of hours of training.

Is the “lazy upload” dream of legal teams now possible through finetuning, either by finetuning an open-source LLM or by finetuning ChatGPT?

Unfortunately, no. As is already evident from the examples given in OpenAI’s guide, finetuning is primarily intended for teaching LLMs how to generate output. Instead of drafting difficult-to-describe prompt instructions, thousands of examples can be given during the finetuning process, from which the LLM can then learn how to generate output and how to behave, e.g. which wording style to use, how output should be formatted, which elements in the texts to pay attention to, how to deal with specific edge cases, and so on. Many users have reported that through finetuning with high-quality examples, less capable LLMs are easily able to surpass the quality of GPT-4. (Note that OpenAI has announced that finetuning GPT-4 will be available by the end of the year.)

A good example of what finetuning is really about is Meta’s Llama 2. This LLM is available as both a “raw” version and a chat version. The latter is actually a finetuned version of the former, in which thousands of example chat conversations were used to teach the LLM how to “behave” in chat conversations with users, e.g. which unsafe questions to refuse, how to be helpful towards users, how to remain polite, how to understand references to previous parts of the conversation, and so on. In other words, finetuning is intended to improve the form of the output, not to introduce new facts or teach new knowledge.

Applied to the legal sector, finetuning would thus be ideal for teaching an LLM tasks such as identifying specific contract clauses to support a due diligence process, formatting legal citations, or optimising the tone of emails that are sent to clients.

Another reason why the “lazy upload” dream of legal teams will probably not come true is that finetuning requires careful preparation of legal data. Data must not only be presented to the LLM in prompt/answer pairs, but should also be carefully cleaned and reviewed, to avoid errors sneaking into the finetuned LLM. Due to the relatively limited number of examples that will be uploaded during finetuning — typically only a few hundred or a few thousand — any factual error (e.g., a reference to outdated legislation) risks getting amplified when the LLM is actually used. This risk also exists when training an LLM, but due to the sheer amount of data uploaded during the training process (think millions instead of hundreds), the statistical risk that a few bad examples will ruin the quality of the LLM is significantly reduced.
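As a rough illustration of what such preparation looks like in practice, the sketch below writes a handful of cleaned prompt/answer pairs to a JSONL file, using the chat-message format that OpenAI’s finetuning service accepted at the time of writing; the example pairs and file name are invented, and other vendors use slightly different formats.

```python
import json

# Invented, manually reviewed prompt/answer pairs for one narrow task
# (here: identifying commencement dates in contract clauses).
examples = [
    {
        "messages": [
            {"role": "system", "content": "You identify commencement dates in contract clauses."},
            {"role": "user", "content": "Clause: This Agreement takes effect on 1 January 2024."},
            {"role": "assistant", "content": "Commencement date: 1 January 2024."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You identify commencement dates in contract clauses."},
            {"role": "user", "content": "Clause: The parties shall negotiate in good faith."},
            {"role": "assistant", "content": "No commencement date found."},
        ]
    },
]

# One JSON object per line, as expected by typical finetuning pipelines.
with open("finetuning_examples.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```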

In other words, except for a few specific use cases (e.g., mass-scale document reviewing), legal teams should probably forget about the entire finetuning process, because the intended use cases are not what they probably have in mind.

Semantic vector search

GPT-4’s character limit is a total of about 6,000 words, within which both the prompt and the LLM’s answer must fit. In practice, this limits GPT-4 to several paragraphs of input text and a similar amount of output text.
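Strictly speaking, the limit is expressed in “tokens” (word fragments) rather than words: roughly 1,000 tokens correspond to about 750 English words. Assuming OpenAI’s tiktoken library is installed, a quick way to check whether a prompt fits within the standard GPT-4 budget of 8,192 tokens is shown below.

```python
import tiktoken  # OpenAI's tokeniser library (pip install tiktoken)

encoding = tiktoken.encoding_for_model("gpt-4")
prompt = "Please summarise the following termination clause: ..."

token_count = len(encoding.encode(prompt))
print(f"{token_count} tokens used out of a total budget of 8192 (prompt + answer combined).")
```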

With the launch of GPT-4 in March, OpenAI also introduced a “32K” version of GPT-4 that can handle four times the amount of text of standard GPT-4; a recently updated version of ChatGPT allows for 12,000 words. Since April, there has been a race among the LLMs to increase the character limit through a combination of clever tricks and better hardware: Claude 2 can for example handle about 75,000 words. Factor in some hyperbolic projects (like Magic’s LTM-1, which can allegedly handle almost four million words), and one would get the impression that at least this limitation of LLMs will soon be gone.

Unfortunately, this is not yet reality, because the raw character limit is only half the story, according to recent research. What we also noticed ourselves when we tried to use GPT-4 32K for reviewing simple legal documents is that LLMs tend to lose their attention with large texts, particularly in the middle of the text, no matter their character limit. In other words, LLMs will easily find relevant information at the start and near the end of a long text, but often skip information situated in the middle. (Isn’t this yet another striking similarity between LLMs and human beings?)

Asking an LLM to review a 100-page syndicated loan agreement, or asking specific questions about recently introduced pages of detailed legislation, is therefore not yet a reality. However, not all is lost, because semantic vector search can partially alleviate the character limits problem.

The idea behind this technique is that, instead of simply submitting huge amounts of data to an LLM, a smart “pre-filtering” process takes place first, which limits the information to just a few paragraphs. As long as the amount of information submitted to the LLM remains small, the LLM will not suffer from the “lost in the middle” problem and instead provide a high-quality answer.

Over the last months, it has become evident that semantic vector search is almost always the best approach for integrating new knowledge into LLMs.

To prepare for semantic search, source texts — case law, legislation, legal memos, contracts, clauses, etc. — first need to be split into smaller segments. Each segment is subsequently submitted to an LLM, which converts the segment of text into a number vector; these vectors are then stored in a dedicated “vector database”.
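A minimal sketch of that indexing step is shown below. The `embed` function is a stand-in for a call to a real light-weight embedding model; here it simply produces a toy bag-of-words vector over a made-up vocabulary so that the example runs without any external service, and the segments are invented.

```python
import re
import numpy as np

VOCABULARY = ["termination", "notice", "liability", "payment", "confidential"]

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: counts vocabulary words.
    A real light-weight LLM would return a dense vector with hundreds of dimensions."""
    words = re.findall(r"[a-z]+", text.lower())
    return np.array([words.count(term) for term in VOCABULARY], dtype=float)

# Split source texts into small segments and store (vector, segment) pairs.
segments = [
    "Either party may invoke termination with three months notice.",
    "The liability of the supplier is capped at the annual payment.",
    "All confidential information must be returned upon termination.",
]
index = [(embed(segment), segment) for segment in segments]
```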

If you are interested: check out a good introduction to such databases, although you should be aware that this technology currently seems to be over-hyped due to venture capitalist investments.

This entire conversion process is relatively simple, fast, and cheap, because the LLM merely has to correlate the segment of text with the existing number vectors in its memory. As there is no need for any “reasoning” or language generation, a lightweight LLM can be used, instead of a heavyweight LLM such as GPT-4.

Once the vector database is filled with information, it can be used to answer questions. When a legal expert asks a question, that question is first converted into a number vector by the lightweight LLM. The resulting number vector is then submitted to the vector database, which searches across the millions of stored number vectors and returns the text fragments associated with the best vector matches. Both the question and the preselected text fragments are then submitted to a heavyweight LLM, which answers the user’s question using the text fragments as new information.
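Continuing the toy index built in the sketch above, the retrieval step could look as follows; `embed` and `index` are the toy objects defined earlier, and the final call to a heavyweight LLM is left as a placeholder.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Closeness score between two number vectors (0 if either vector is empty)."""
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denominator) if denominator else 0.0

question = "What is the notice period for termination?"
question_vector = embed(question)  # embed() and index come from the previous sketch

# Rank the stored segments by closeness to the question and keep the best two.
ranked = sorted(index, key=lambda item: cosine(question_vector, item[0]), reverse=True)
best_segments = [segment for _, segment in ranked[:2]]

# Both the question and the preselected segments are handed to a heavyweight LLM.
prompt = (
    "Answer the question using only the context below.\n"
    "Context:\n" + "\n".join(best_segments) + "\n"
    "Question: " + question
)
print(prompt)  # in a real system: send this prompt to GPT-4 or a similar model
```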

Over the last couple of months, this pre-filtering procedure — “retrieval-augmented generation” or “RAG” — has become the standard approach in the industry for inserting new knowledge into LLMs, and many services have been introduced that rely on this approach.

For example, Microsoft offers the Azure Cognitive Search service, which allows corporate users to “interact“ with gigabytes of their own content. The service essentially ingests various files (Word, Excel, PowerPoint, PDF, …) found on the corporate network, splits those files into individual paragraphs, converts them into number vectors, and ultimately feeds those paragraphs into GPT-4 for intelligent interaction. Through RAG, corporate users can ask corporate-specific questions, which GPT-4 will answer through a combination of its general knowledge and the corporate-specific information that was selected by the Cognitive Search service (e.g., background information about the company, historical data, policies, etc.).

The popular ChatPDF service uses a similar approach to allow end-users to “interact” with the content of a PDF (e.g., asking questions or requesting explanations), while various helpdesk products allow website visitors and chatbot users to interact with the content available on a support website.

RAG is by no means perfect. One of its drawbacks is that it requires splitting a text into smaller segments, such as individual paragraphs. As long as the end-user’s question is sufficiently correlated with one or more specific segments, RAG works fine. The approach breaks down, however, in situations where the user’s question is only partially answered by each individual segment, because the number vectors of the individual segments will then not sufficiently match the number vector of the question, so none of the relevant segments gets selected, even though they would match when combined.

Using RAG is also not feasible when the aggregate text length of all matching segments exceeds the LLM’s character limit.

Conclusion & outlook

LLMs are quickly finding their place in all kinds of sectors. As usual, the legal sector is running behind when it comes to technical adoption. This may prove to become particularly bothersome, because legal work is among the most frequently cited types of work that will be impacted by LLMs.

Most legal teams are still acting as bystanders (or even downright non-believers) when it comes to LLMs, due to the sector’s conservative approach, strong risk-aversion, and general negativity towards automation. Nevertheless, we do notice that the legal world is adopting LLMs, even if at a slow pace.

Through our contacts with law firms, we have learned that a meaningful percentage is indeed experimenting with LLMs — almost all of them firms with strong innovation departments. Most firms experiment silently; others are more open about their research and innovation. Some interesting community efforts are also arising.

Legal teams of any size are strongly advised to start experimenting with LLMs as soon as possible, to avoid running behind peers and neighbouring sectors such as accountancy and consultancy. It will take significantly more time for legal teams to prepare their knowledge for use by a future LLM than it will take for that future LLM to appear on the market.

Even today, with all the technical limitations described above, there are many good use cases for legal teams:

  • Automatically rewriting texts, for example to change a text from the client’s perspective to the third person.
  • Drafting a table with a timeline of events, based on a set of input texts (e.g., a set of emails forwarded by a client in the context of litigation).
  • Interacting with the contents of a long document (e.g., asking questions about a PDF).
  • Summarising long texts (including court cases). Studies find that humans often prefer LLM-generated summaries to human-generated summaries.
  • Redrafting contract clauses to make them much shorter, or instead much longer. As long as no specific legal knowledge is required, LLMs can do a very good job.
  • Drafting marketing material, particularly the kind that nobody likes to write, such as pitching texts and submissions for the legal directories.
  • Translating short legal documents. Unlike standard translation services (such as Google Translate and DeepL), LLMs will remain consistent in their terminology, and will happily take into account specific drafting instructions.
  • Acting as a general legal sparring partner during brainstorming sessions. Unlike most highly specialised lawyers today, LLMs can also act as legal generalists, integrating ideas from other legal areas. Perhaps even more interesting is that LLMs will have in-depth knowledge about other disciplines (e.g., biology or information technology), and can raise interesting questions or remarks with that knowledge.

None of those are real “killer” use cases. Combined with a good dose of uncertainty and doubt, this is probably the reason why so many legal teams are still waiting to jump on board. This is really unfortunate, because LLMs are the exact opposite of a solution in search of a problem: they are the solution to far more problems than their developers even knew existed. Law is a profession of words, and LLMs just happen to be great at dealing with language. As soon as you start experimenting with LLMs, you will quickly find workflow improvements that you probably had not even considered optimising. It has been less than a year since ChatGPT woke the world up to this new era of artificial intelligence, and exciting times are ahead.