It’s been about a year since we posted our Fall 2023 update on Generative AI. In this blog post, we will discuss what happened in the last twelve months and what you can expect in the near future.
We also hosted a webinar on the topics discussed below and what they mean for lawyers.
Hardware costs are dropping
Training an LLM used to be extremely expensive due to the cost and scarcity of the hardware required for it (specifically NVIDIA’s “H100” GPUs). Those extremely high unit costs are now gone, because the price of renting an H100 has been slashed: the market has flipped from shortage (around $8 per hour) to oversupply, with relatively cheap pricing (around $2 per hour) as a result.
Of course, even though the unit cost has gone down, training LLMs is not exactly cheap, because you need hundreds if not thousands of these devices, as well as expensive engineers, for many months to train a new model. This entire exercise needs to be repeated frequently to refresh the knowledge of the LLMs and to try out better techniques, so all of this boils down to cash drains of many millions. It is therefore frequently predicted that creating generically applicable “foundational” models (such as OpenAI’s GPT, Anthropic’s Claude and Google’s Gemini) will continue to be reserved for companies with deep pockets, because training them probably requires billions per year.
Somewhat paradoxically, however, the costs have come down to a level where it has become possible for smaller venture-backed companies to train their own specialised LLMs from scratch (i.e., not simply finetune an existing foundational model). Harvey is the obvious example here because of its ties with OpenAI, but there are others, such as the German Noxtua LLM. We have not yet reached a stage where it is commercially viable to offer highly specialised LLMs for small markets (e.g., a French M&A LLM or a Spanish tax law LLM), but we expect that such LLMs will be offered in two to three years, when hardware costs have dropped further and legal publishers have caught up with LLM techniques.
Of course, in light of the ever-more ubiquitous nature of LLMs (even confessions can be done using LLMs), the LLM-hardware space remains very lucrative. There’s a reason why NVIDIA has surpassed Apple as the most valuable company in the world and competitors are racing into this same market. Some of those competitors (such as AMD) offer products that are roughly equivalent, others focus on solving specific pain points (e.g., Groq and Cerebras focus on answering speed), while NVIDIA is also launching promising new products.
Open-source models are becoming mainstream
Training your own specialised LLM continues to require massive investments that are beyond the reach of most organisations, including most vendors in the legaltech area. Many legaltech vendors claim that they are “training their own model”, but in reality this usually means that they’re doing some machine learning analysis of relevant data, for which the input or output is then combined with a commercially available LLM. Loosely speaking this qualifies as “building your own model”, so from a marketing perspective it’s a great idea to use this terminology, as most legal professionals wouldn’t understand the difference anyway. Technically speaking, however, these approaches are miles away from building your own Large Language Model.
The alternative to training your own model is to modify an open-source model through finetuning. Those open-source models are offered by a limited number of companies — such as Meta, Alibaba and Stability AI — which have the deep pockets and technical capacity to train new models. While the models are offered for free, the companies behind them benefit indirectly through a combination of thought leadership and commercial goodwill.
Through finetuning, the behaviour of models can be adapted. For example, a model can be taught to use a certain style of answer, or to avoid it (e.g., to sound legalese, or to steer clear of legalese altogether). However, finetuning a model does have downsides, because it overrules the answer that the LLM would normally give. The experience of the last two years has taught us that quality also frequently drops with finetuned models, because the overruled answer is not always the best answer, as Google learned to its embarrassment. Many see finetuning as limiting the freedom of whoever is behind the wheel: you can prevent the car from ever making a sharp turn, but occasionally those sharp turns are simply necessary.
For most models, only the trained end-result of the model gets released, i.e. the so-called “weights”. Only a few organisations also release the texts and software that were used during the training. The difference between the two approaches has sparked a semantic discussion in the open-source community, because in order to be truly “open source” according to the traditional definition, the base materials should also be released. It has therefore been proposed to differentiate between “open weights” models and truly “open source” models, but the cat is already out of the bag, with companies such as Meta consistently using the “open source” label for their “open weights” Llama model, and most users adopting this terminology as well.
The open source models are also crucial for offering models that can be run on low-performance hardware, such as consumer laptops or even cell phones. For these devices, scaled-down versions of open source models such as Meta’s Llama are increasingly used. The advantages are that no network connectivity is required and that the user’s privacy is better protected, because no data gets sent to a third party.
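For readers who want to see what “local” means in practice, here is a minimal sketch of running a scaled-down open-source model on an ordinary laptop. It assumes the llama-cpp-python bindings and a quantised Llama model file that has already been downloaded (the file name below is purely illustrative); nothing is sent over the network.

```python
# Minimal sketch: running a quantised open-source model locally with
# llama-cpp-python (assumed installed via `pip install llama-cpp-python`).
# The model path is hypothetical; any GGUF-format model works the same way.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3.2-3b-instruct.Q4_K_M.gguf", n_ctx=4096)

response = llm(
    "Summarise the main obligations of a data processor under the GDPR "
    "in three bullet points.",
    max_tokens=256,
)
print(response["choices"][0]["text"])
# Everything above runs on the local CPU/GPU: no network call, no data shared.
```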
Microsoft has also published specifications for AI-enhanced personal computers with Windows 11 that have hardware components dedicated to running local LLMs (the so-called “Neural Processing Units” or “NPUs”); Apple has similarly been adding AI-focused hardware to its iPhones and MacBooks. So far, however, these initiatives have been mostly marketing and hype — selling a vision of what you can expect in the future. Practical applications are still relatively limited, and results have been underwhelming in more general applications.
Model sizes have not (really) grown
The fundamental breakthrough in the development of language models has been the observation that every jump in model size — e.g., from 1.5 billion parameters with GPT-2 to 175 billion with GPT-3 — corresponded to a new level of intelligence. Accordingly, there has been a race among LLM vendors to increase the size of their LLMs.
However, it seems the industry has reached a limit, because of a combination of diminishing returns, hardware constraints and energy consumption. Essentially, to double the quality of the results, you need a model that is an order of magnitude (10x) larger. However, each time you double the size of the model, the hardware requirements more than double, so we quickly run out of capable hardware: to double the quality, you need hardware that is much more than ten times as powerful.
GPT-4, which appeared on the market in March 2023, is believed to have around 1,000 billion parameters, but split up into a separate sub-model per knowledge domain (“mixture of experts”), with each sub-model likely to be around 200 billion parameters. Since its arrival, some competitors have appeared on the market that offer more than 200 billion parameters — e.g. Google Bard (1,600 billion), Meta Llama 3 (405 billion) and Anthropic Claude Opus (500 to 2,000 billion) — but their performance is substantially below (Bard), somewhat below (Llama) or roughly on par with (Claude Opus) GPT-4’s performance. It is also striking that Anthropic is pushing its smaller model (Claude Sonnet), because it is not only twice as fast as its larger sibling, but also offers significantly better output quality.
All of this remains guesswork, because for competitive reasons the large LLM-vendors deliberately remain silent about the exact size of their foundational models. Even so, the race for the largest model has clearly been replaced by a race in other areas, such as speed, quality and input size. Of course, a minimum model size remains necessary for the LLM to be useful, because it needs to understand the question it is being asked and needs to be able to formulate a proper answer.
As for the impact on the legal domain, it really depends on the task at hand. For straightforward legal drafting and reviewing, larger model sizes do not really lead to better results, because there is no need for very deep reasoning. Instead, speed and context windows are much more important.
Focus on better data sources
Instead of focusing on the sheer size of LLMs, vendors have focused on better sources for their data, for three reasons.
A first reason is data quality, following the rule of garbage in, garbage out. LLMs can offer good quality if they are fed with deep discussions and well-written, in-depth coverage of subjects, as commonly found in discussion forums, books, scientific contributions and newspaper articles. It is therefore not surprising that OpenAI and Microsoft have secured partnerships with companies such as Associated Press, Axel Springer, the Financial Times and Reddit. These partnerships are quite expensive to maintain, while offering LLMs is already a loss-making exercise. We therefore expect that the price of future foundational models will go up in the not so distant future to compensate for these expensive partnerships. At the time of writing, OpenAI has indeed indicated that it is considering a price increase.
The second reason why the LLM-vendors focus on better data sources is to avoid litigation. It remains to be seen how the pending lawsuits against OpenAI and Microsoft, against Stability AI, and against Anthropic will turn out. Copyright legislation was never designed to deal with situations where a copyrighted work is used as mere “fertilizer” together with millions of other copyrighted works. Even if such use qualifies as fair use under US copyright law, the stricter EU copyright framework (which lacks such flexibility) may spoil the party for the LLM-vendors.
Training data shortage
The third reason is that those partnerships allow the LLM-vendors to include additional data that is simply not available on the public internet, because it is locked behind digital paywalls. The commercial LLM-vendors have reached a stage where they are running out of available data that meets their minimum quality threshold. Accordingly, any additional data of decent quality is highly welcomed.
In the legal sector, the data shortage remains a Very Big Problem that is hampering the widespread uptake of LLMs. In most scientific sectors, for example, data can be applied worldwide — experimental chemical results obtained in Australia are equally valid in Spain. In the legal sector, however, we are fragmented into separate jurisdictions, so that for most legal tasks it’s not a good idea to mix data from different jurisdictions.
Even more problematic is the confidentiality issue, which causes the most interesting data to remain behind the digital bars of companies and law firms. To solve this, an LLM-vendor would not only have to strike costly partnership deals with international law firms, but probably also have to create separate LLMs for separate jurisdictions, causing training costs to explode.
Even before the arrival of LLMs, there were examples where “deep learning” could deliver results that went significantly beyond what human curation of knowledge can achieve. The unexpected arrival of LLMs has made this even clearer. But it all assumes that data is available, and for some jurisdictions and legal domains this is problematic. Jurisdictions and legal domains that are largely driven by case law may be impacted less, but for legal tasks where high-quality data is locked behind digital bars, it will remain difficult to enjoy all the benefits that LLMs can provide.
Larger input sizes, but decreasing quality
In many legal tasks we are dealing with large amounts of text. However, there’s a limit to the size of the information that can be submitted to LLMs: the so-called “context window”.
For GPT-4, this limit was initially around 12 pages of text, with a special version that could process about 50 pages. Vendors such as Anthropic then tried to differentiate their products by offering context windows of around 200 pages. Today, the 200-page limit (usually expressed as a “128K token limit”) has become somewhat of a standard for commercial LLM-vendors, as it is offered by GPT-4o and Llama 3.1, with Anthropic’s Claude going up to about 300 pages of text and Google’s Gemini Pro 1.5 to about 3,000 pages.
Scientific studies quickly pointed out that the quality of the LLM’s answers can significantly drop with large input sizes, particularly in the middle of the input text (the “lost in the middle” problem). In the last year, LLM-vendors have focused on remediating this problem, but even with the best commercial LLMs, the quality of the output remains inversely related to the size of the input. Sometimes it’s possible to chop up the input text and sequentially submit the individual segments to an LLM, but this is not possible for all tasks. New techniques such as deliberate information-intensive training and attention calibration mechanisms are promising, although they seem to be more of a stopgap solution than a fundamental fix.
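To make the “chop up and submit sequentially” workaround mentioned above more concrete, the sketch below splits a long text into fixed-size segments and reviews each one separately. The call_llm helper is a hypothetical stand-in for whatever LLM API the application uses, and the character-based splitting is deliberately simplistic.

```python
# Illustrative sketch of sequential chunk-by-chunk processing.
# call_llm() is a hypothetical wrapper around an LLM API of your choice.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def split_into_chunks(text: str, max_chars: int = 8000) -> list[str]:
    """Naive splitter; real applications split on clause or paragraph boundaries."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def review_document(document: str, instruction: str) -> list[str]:
    findings = []
    for chunk in split_into_chunks(document):
        # Each segment is reviewed in isolation, which is exactly why this
        # workaround fails for tasks where segments depend on each other.
        findings.append(call_llm(f"{instruction}\n\n---\n{chunk}"))
    return findings
```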
Even with the significantly larger input sizes, it remains impossible to simply submit every relevant document to an LLM as part of the prompt. Not only would this lead to the quality deterioration described above: it would also quickly become relatively costly and slow, at about $0.25 per 100 pages and with processing times that can easily reach half a minute for each and every submission to the LLM.
This is also the reason why legal professionals must be cautious with some vendors’ claims about being able to perform compliance checks against specific rule sets or legislation. Except when LLMs have been specifically trained on such texts, or when such texts are very widely discussed online (as is for example the case with the EU GDPR), vendors will have to resort to workarounds to submit both the ruleset and the document to be checked to the LLM. Those workarounds — such as segmentation and semantic search — frequently give “OK” results, but must not be used in situations where a high degree of accuracy is required.
Semantic search has become a staple, but is not a silver bullet
The technical solution to this problem is to filter the input, so that only the most relevant parts get submitted to the LLM, in order to reduce the input size. While many techniques can be used for such filtering, the most common technique is to split the input into text segments and then store each segment together with its semantic vector. This semantic vector is a numeric representation — typically a series of between 100 and 1,000 numbers — that is generated by a kind of “light LLM” and represents the semantic location of the text fragment in a semantic space. For example, the semantic vector of a paragraph that talks about liability would be closely located to a paragraph about indemnity claims, but far away from a paragraph about human rights.
When the user wants to submit a question to the LLM, the software application will then search in the semantic vector database for those text segments that are semantically related to the user’s question. Only the top-matching text segments then get submitted to the LLM.
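A minimal sketch of this prefiltering step is shown below. The embed function is a hypothetical call to whichever embedding model is used (anything returning a fixed-length vector will do), and similarity is plain cosine distance; real applications use a vector database rather than an in-memory loop.

```python
# Sketch of semantic prefiltering: only the top-matching segments are sent
# to the LLM. embed() is a hypothetical embedding call; cosine() measures
# how close two semantic vectors are.
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in your embedding model here")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_segments(question: str, segments: list[str], k: int = 5) -> list[str]:
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(seg)), seg) for seg in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:k]]

# Only the k selected segments (not the whole document) are placed in the prompt.
```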
The performance of the light LLMs that are being used for converting text into semantic vectors (the “embedding models”) has reached a plateau in the last 12 months, with newer embedding models providing only slightly better performance than previous ones. Quality improvements are still being made, but this entire process remains a stopgap solution to deal with the fundamental limitations of LLMs (i.e. the limited size of the context window and training being extremely slow and expensive). Technical questions such as how to best split a text into segments have also not been solved; they may seem trivial but significantly influence the quality, as recently demonstrated by a technical report from international law firm Addleshaw Goddard.
Two years ago, semantic search was still praised as opening a new world of intelligent search with better results than traditional keyword-based searches. In practice, however, semantic searches often include surprising results that baffle humans because it’s so utterly unclear why the results are included. The reason is that the “intelligence” that gets included through the semantic vectors is only a very thin layer, miles away from the intelligence level offered by regular LLMs. Semantic search engines have the advantage of being able to search in millions of text fragments in less than a second — something that LLMs will never be able to do — but their thin layer of intelligence remains too much of a black box for humans.
We have also noticed a surprising psychological problem, where user expectations are simply much higher with semantic search. It seems that an average user somehow accepts the limitations of “dumb” keyword-based searches, because many IT-applications in day-to-day life don’t offer anything better. However, as soon as users notice that the search results are semi-intelligent (not simply keyword-based), they seem to expect Google-like quality and then get disappointed when the results aren’t so intelligent after all. That’s where semantic search tends to break down, because even though there’s some level of “understanding”, it doesn’t go very deep. For example, when a user would search for “indemnification by the customer”, semantic search technology will probably also include results that talk about indemnification by the supplier, because both are semantically close to each other, with similar wording. The average legal professional will however be very disappointed with such search results, because they are the opposite of what was expected.
Hybrid search algorithms that combine semantic search and traditional keyword-based search are becoming more widespread. Another technique is to use a “re-ranker”, where an intermediary LLM that is fast, yet offers more intelligence, reorders the search results, so that the text fragments that effectively get sent to the real LLM are more relevant.
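As a rough illustration of the hybrid idea, the sketch below blends a crude keyword-overlap score with the semantic (cosine) score from the previous sketch. The 50/50 weighting and the keyword heuristic are purely illustrative, not a recommendation; a re-ranker would then pass the best-scoring segments through a small but smarter model before the final selection is sent to the main LLM.

```python
# Sketch of a hybrid score: keyword overlap blended with a semantic score.
def keyword_score(question: str, segment: str) -> float:
    q_terms = set(question.lower().split())
    s_terms = set(segment.lower().split())
    return len(q_terms & s_terms) / len(q_terms) if q_terms else 0.0

def hybrid_score(question: str, segment: str, semantic: float) -> float:
    # semantic = cosine similarity from the embedding-based search;
    # the weights are illustrative and would be tuned in practice.
    return 0.5 * semantic + 0.5 * keyword_score(question, segment)
```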
Every legal professional’s dream of simply dumping all their “knowledge” (read: the raw contents of the organisation’s email and case management system) into a system and having it intelligently processed will therefore remain a dream. Many vendors have promised to implement this dream, most of them using some variation of semantic search, but this problem continues to be a very thorny one to tackle.
Output size limitations persist
While the input-size limitations of LLMs are well known by now, and semi-solutions such as semantic prefiltering are widely applied, the output-size limitations get much less attention. Few legal professionals know that in practice the output size of most commercial LLMs is limited to about 6 pages of text. Even Google’s Gemini Pro (which allows 3,000 pages of input) is limited to about 12 pages of output.
Over the summer, OpenAI launched an experimental model, GPT-4o Long Output, that offers 16x the output size, but at a significantly higher cost. At the time of writing, this model is still experimental and access to it remains limited.
The underlying reason for this limitation is the hardware that is required. For both input size and output size the hardware requirements grow quadratically: they essentially quadruple each time the input size or output size doubles.
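A back-of-the-envelope illustration of that quadrupling, under the standard assumption that self-attention compares every token with every other token, so cost grows roughly with the square of the sequence length:

```python
# Relative attention cost as the sequence length doubles (illustrative only).
for tokens in (1_000, 2_000, 4_000, 8_000):
    relative_cost = (tokens / 1_000) ** 2
    print(f"{tokens:>6} tokens -> ~{relative_cost:>3.0f}x the cost of 1,000 tokens")
# 1,000 -> 1x, 2,000 -> 4x, 4,000 -> 16x, 8,000 -> 64x
```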
LLM-based applications must continue to work around these limitations. When there are few interdependencies between the various parts of a long document, the solution is of course simple: repetitively ask the LLM to produce the next part of the document.
However, many long legal documents do have significant dependencies between their various parts, such as cross-references, definition lists, terminology consistency requirements, and the typical approach of having an introductory section that describes the general principles which are then detailed or carved-out in some subsequent part many pages later. For these situations, brittle orchestration layers must be built to align the interdependent parts, which kind of works but remains an unsatisfactory solution.
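A stripped-down sketch of such an orchestration layer is shown below: the document is drafted section by section, and the previously generated sections are fed back in so that defined terms and cross-references stay aligned. The call_llm helper is again a hypothetical LLM wrapper, and in practice the accumulated sections would themselves need to be summarised once they exceed the context window.

```python
# Sketch of generating a long document section by section, feeding previously
# generated sections back in to keep terminology and cross-references consistent.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def draft_document(outline: list[str]) -> str:
    sections: list[str] = []
    for heading in outline:
        context = "\n\n".join(sections) if sections else "(none yet)"
        prompt = (
            "You are drafting one section of a longer agreement.\n"
            f"Sections drafted so far:\n{context}\n\n"
            f"Now draft the section: {heading}\n"
            "Stay consistent with the defined terms and cross-references used above."
        )
        sections.append(call_llm(prompt))
        # In a real system, `context` would be summarised once it grows too large,
        # which is exactly where the brittleness creeps in.
    return "\n\n".join(sections)
```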
New legal tech products continue appearing on the market
The arrival of LLMs has prompted many new applications to appear on the market.
The first type of legaltech application depends on functionality that simply wasn’t possible in the past, because it required a level of intelligence that was unavailable. In the legaltech arena, this is true for many applications, because law is a profession of words, so natural language understanding is key. Applications such as deep document review, legal research assistants and document summarisation could not exist in the past. Sure, for a long time, applications have existed that promised some of these benefits, but they relied on superficial tricks that only brought you so far, so they were never really useful outside a few niche cases. Established vendors have also embraced LLMs, enabling them to develop the modules that were the “missing link” to make their applications truly useful for legal professionals.
Here, we can witness the typical phenomenon that the original inventors of a technology cannot imagine all the diverse scenarios in which their new technology will be applied. As recently acknowledged by one of the authors of the seminal research paper on LLM technology, “I [was] amazed by just how quickly people got super creative using that stuff”. LLMs are already used in both big ways and small corners, and we have probably only scratched the surface.
The second type of new legaltech application consists of “LLM wrappers”: applications that almost entirely depend on LLMs for all their functionality. The market is getting flooded with these applications. LLMs enable them to appear on the market quickly, because it’s just so simple: all the hard work gets offloaded to the LLM-engine. The problem with these applications is that they have no technical moat, so it is very easy to build a competing product. Expect many of them to disappear once their vendors learn that (1) the legal community is a particularly demanding audience, (2) the legal sector is notoriously resistant to change, and (3) the market is very crowded.
Fear of data reuse
What hasn’t changed in the last year is the legal community’s fear of confidentiality problems. It’s the one central question that gets asked in every software demo: “Do we need to worry that our confidential data will end up in the next version of [insert LLM of choice here]?”
This fear is completely justified for free LLM-based applications, such as the free version of ChatGPT and popular search engines such as Bing and Google that are increasingly powered by LLMs. In accordance with the adage that if you’re not paying for the product, you are the product, those products will use your questions and your reactions to their responses as raw training material for the next version of their LLM. This is one of the ways in which the LLM-vendors are trying to tackle the data shortage problem discussed above.
The legal press has done an impressive job of deeply embedding this fear into the hearts and minds of legal professionals, back when LLMs appeared on the market in early 2023. Even though many legal professionals still don’t understand LLM-technology after all those months, they continue to be profoundly affected by this fear, probably because confidentiality is such an essential value in the legal profession.
On the one hand, this fear is completely unjustified. The entire trillion-dollar business model of Anthropic, Google, Microsoft, OpenAI and all the other LLM-vendors depends on trust, so each of them is quick to point out in their legal fine print that in the paid version of their LLM, no data will ever get reused for training the next version. Similarly, in the minimal user interface of the enterprise version of ChatGPT, precious screen real estate is used to show the message that your “chats aren’t used to train our models”. As far as we are aware, there have been no incidents and not even the slightest indication that the large LLM-vendors would disregard their promises in this regard. (Of course, all the LLM-vendors reserve the right to keep data for a few days to provide support to customers or guard against harmful content, but that’s not really different from standard operations in the IT-industry.)
On the other hand, some fear remains justified. First, gathering training data is the main raison d’être for the free versions of the LLMs. Second, even when the LLM-vendors themselves will not reuse your data to train their next model, you should remain aware of what your application vendor is doing. Legaltech apps are flooding the market, with many vendors having no existing links with the industry and no intention of staying in it for long. Gathering massive amounts of data to either build your own “legal GPT”, or harvesting data to allow some other company to build one, is one of the quickest ways to garner the interest of venture capitalists.
The fear of data reuse and confidentiality leaks therefore remains deeply embedded in the minds of legal professionals. Over the last year, we have witnessed a small contingent of businesses that exploit this fear to sell expensive LLM-filters and “anonymisation” engines that allegedly protect against the looming danger of the LLM-vendors, by scrubbing your data before it gets sent to the LLM-engine. We consider these filters and anonymisation engines to be digital snake oil: expensive tools that are not harmful in themselves, but really boil down to a solution in search of a problem. As you can see in the overview below, the danger does not emanate from the LLM-vendors, but instead from the application-vendors and users.
Security issues due to oversharing
Linked to the data reuse issue is the security issue. We’re currently in a phase where LLM technology is so overhyped that it is being applied everywhere, down to the smallest corners. Vendors are over-promising that LLMs can unlock so many benefits that they are pushing IT-managers to apply them everywhere: no data should be left behind, and LLMs must be brought into contact with every byte present in an organisation.
Not surprisingly, accidents are happening because of this oversharing, with LLMs giving employees access to HR-documents and the contents of CEO emails because access rights were not properly configured. As IT-experts continue to be pushed by their managers to apply LLMs in the strangest of places, we expect these problems to keep increasing in the next year. Microsoft is very aware of this danger but is fighting an uphill battle.
As a legaltech vendor, we see this problem as much more dangerous than the data reuse problem discussed above. Security and ease of use are a difficult marriage, and legal professionals strongly tend to favour the latter over the former: when we talk with our customers about how to set up repositories of legal drafting histories, we notice that the “share everything” approach almost always trumps data isolation.
We continue to be amazed at how contradictory legal professionals are about the confidentiality subject. Customers who may have spent an entire hour discussing the data reuse issue with us will, after having signed the contract, send us substantial amounts of highly sensitive information from their own clients through insecure channels such as email. It’s probably a side-effect of frequently negotiating non-disclosure agreements, but we truly observe an over-reliance on contractual confidentiality obligations, while those are just one factor — and far from the most important one — contributing to security and confidentiality. Security consultants will have a commercial field day once they discover the legal community’s conceptual approach to security.
Benchmarking madness
In the technical community, LLM benchmarking has become a bit of a joke. Almost every week some shiny new LLM is announced, with a press release claiming that it is “SOTA” (State Of The Art). This claim is then backed up by benchmark results that use beautiful graphs to illustrate how the new model beats a handful of competing LLMs.
By now, every technical expert who follows the LLM-space knows that it’s very difficult to objectively benchmark LLMs. There tend to be so many subjective elements in the assessment that it’s not difficult to find a few examples where the new model indeed has a slightly better result than previous models, even when in practical interactions the LLM really isn’t that good. Moreover, there’s the inherent problem that new LLMs may have ingested the contents of the benchmarking tests during their training, similar to how a student who gets access to the previous year’s exam questions will tend to have better grades.
LLM-experts therefore no longer rely on the traditional benchmarks, and instead wait a few weeks and check the results of Chatbot Arena or one of the other LLM leaderboards, where human users subjectively evaluate the output of an LLM. It’s simply too difficult to create objective benchmarks, so subjective, human assessments of LLMs are necessary.
Slow but increasing maturity
If we compare where we are today (November 2024) with the moment GPT-4 was introduced (March 2023), we can conclude that an evolution has taken place. LLMs have become faster, more capable (through larger context windows) and more versatile (through various integrations). Various models have appeared that offer similar quality to GPT-4, with a mix of commercial foundational models (Anthropic, Google), open-source foundational models (Llama and Qwen) and a few specialised models (e.g., Harvey, Noxtua and KL3M).
The availability of LLMs has also improved, although vendors keep struggling with occasional outages; neither OpenAI nor Anthropic reaches the industry-standard 99.9% availability. Rate limits also continue to plague users, with vendors being accused of serving dumbed-down versions of their LLMs at moments of high traffic.
The sector is getting more mature, with better tooling for software developers and boring but necessary initiatives like better integrations and interoperability standards, such as Anthropic’s recently announced Model Context Protocol.
The fundamental limitations of LLMs have not been resolved, however. Hallucinations have decreased but certainly not disappeared in the foundational models. LLMs remain relatively slow compared to other software. They can digest much more information, but throwing thousands of pages at them still requires patchy solutions such as prefiltering using semantic search. They allow you to get OK results quickly, but remain highly unpredictable in small corners where they cannot be forced to behave the way you ask them to, even if you rewrite the prompt twenty different times.
We therefore conclude that LLM performance is reaching a “plateau” phase.
We expect further evolutions, but not new revolutions in the short term, and no new big jumps like the ones we had from GPT-2 to GPT-3, and from GPT-3 to GPT-4.
Current state of affairs
So, where does this put us at this moment in the legal sector?
As legaltech vendors, we continue to struggle with the tension between the high expectations from users and the workarounds that we have to apply behind-the-scenes. Perhaps we would therefore summarise the situation as “wonderful at first glance, still frustrating at second glance”. LLMs can make for impressive demos that initially blow legal professionals away, but once they sit down and carefully analyse the results, there remains a nagging feeling that we’re not quite there yet.
For example, in contract review, we notice that reasonably good results can be achieved when a small contract gets subjected to a set of customer-specified rules. However, when larger contracts are submitted, the results tend to deteriorate and inexplicable errors sneak in. A workaround could be to split a long contract into individual segments and to sequentially submit each segment to an LLM. However, this often simply doesn’t work, because a legal topic (e.g., a party’s contractual liability) is often spread across multiple clauses. Just as you would risk significant errors if you split a single contract into five segments and asked five juniors to each review their own segment, you risk similar errors when sending separate segments to an LLM.
LLMs also continue to be somewhat imprecise when it comes to small details, even in smaller documents. Usually this shouldn’t bother legal professionals too much for general drafting/reviewing tasks, but for some legal tasks this lack of precision does cause disturbances. For example, a typical paragraph of text in an MS Word file will often contain quite a few markup codes, such as the paragraph’s number, cross-references, bold/italic/underline codes, inserted changes/deletions, and so on. To add to the frustration: the imprecision is always in a different corner.

It also continues to be amazing how unpredictable LLMs remain in their understanding of complex prompts, such as the ones that are submitted to LLMs behind the scenes in a legaltech application. A particular headscratcher we repeatedly run into is language detection. Foundational LLMs such as GPT-4o can understand and draft texts in any human language, but if you ask them, as part of a complex prompt, to draft the answer “in the language of the input text”, they fail all too frequently. Usually the LLM then responds in English, probably because we write our instruction prompt in English. Strangely enough, by rephrasing the prompt to explicitly say “write your answer in French” instead of “write your answer in the language of the input text provided to you”, the problem disappears. Why?
It's one of these problems that end-users cannot understand, because they don’t run into these issues in typical usage of ChatGPT, as the prompts they submit are orders of magnitude simpler, usually being limited to a handful of instructions or questions.
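A sketch of the workaround described above: detect the input language first and then state it explicitly in the instruction, instead of the self-referential “answer in the language of the input text”. The detect_language helper is hypothetical (it could be a library such as langdetect, or a separate, very small LLM call).

```python
# Sketch: make the output language explicit rather than self-referential.
def detect_language(text: str) -> str:
    # Could be a library such as langdetect, or a tiny separate LLM call.
    raise NotImplementedError("plug in a language detector here")

def build_prompt(instruction: str, input_text: str) -> str:
    language = detect_language(input_text)  # e.g. "French"
    return (
        f"{instruction}\n\n"
        f"Write your answer in {language}.\n\n"  # explicit, not "in the language of the input"
        f"Input text:\n{input_text}"
    )
```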
We therefore expect that in the next two years the LLMs will, of course, become ubiquitous in the legal sector, but will remain concentrated on a variety of smaller tasks — very useful, quality-enhancing and time-saving. But we do not expect LLMs to take over a lawyer’s more challenging tasks.
The legal community remains divided
At the same time, what LLMs can already do today for legal teams is so compelling that it’s a bad idea to ignore them.
Even so, the legal community remains divided on this subject.
We continue to be amazed how many legal professionals remain completely clueless about GenAI, up to the point where many haven’t even used ChatGPT as a consumer and don’t understand even the most basic concepts, e.g. on how LLMs are trained. Every day, we receive emails such as “Can you guarantee me that the output of your product is completely compliant with Italian employment law?” and “We have 20 different share purchase agreements from the past. Can your product automatically select the best clauses of each and merge them together into a single document?”
Clearly, those users don’t have the slightest idea of where to apply LLMs, and where to avoid them. When we explain to such users that LLMs are not (yet) capable of such tasks, we notice a layered response. Initially, the users tend to be very disappointed, because somehow they had hoped that LLMs would reduce their workload and fix their internal knowledge management issues. However, those same users then usually also realise that it’s probably a good thing that LLMs cannot (yet) perform these complex legal tasks, as that would threaten their own jobs.
At the level of an organisation, we have noticed that only a small minority of organisations actively refuse any use of LLMs. Nevertheless, compliance-driven organisations tend to err on the side of caution, with exhaustive, checkbox-driven screenings that frequently take months and drive potential users to despair. But things are slowly getting better, and we expect that these compliance pains will mostly go away over time.
Future developments
To finish, let’s look at the new developments for which we have great expectations in the coming months.
Agentic GenAI
A major new trend in the legal sector’s AI developments is “agentic GenAI”, where LLMs and other software components work together to execute a complex task without any human oversight. A good example is legal research, where a human lawyer would traditionally search a database for relevant case law and legal doctrine, analyse each search result, perform some additional searches, and ultimately write a memo about the findings. In an agentic system, a software component replaces the human: it asks the LLM to formulate queries and analyse the results, submits additional queries based on those results, and perhaps even writes the memo.
Agentic GenAI may sound incredibly cool, and it’s definitely the future, but there are some technical limitations that prevent its widespread adoption in the legal world.
To be truly useful, many third-party services will have to be adapted to offer server-to-server (“API”) interactions, so that an agent can directly access information from such a service. Some large legal research databases already offer this possibility, but smaller, specialised national research databases and local case management systems lack such functionality. GenAI systems will then have to fall back on brittle alternatives, such as trying to use the corresponding websites directly through the LLM’s recently acquired “vision” capabilities (as if they were a human being operating the website). To this end, Anthropic recently started offering its “Computer Use” feature.
Another hurdle for agentic use in the legal sector is the limited context window. When the information to be temporarily remembered gets large — something which is bound to happen with legal research — it will need to be summarised at some point. LLMs are very good at summarisation, but details will inherently get lost. In an agentic setup there will potentially be hundreds of interactions with an LLM, with each interaction resulting in new information. At some point, the intermediate results must be summarised, and at some further point the summaries themselves must be summarised to remain within the context window.
Small errors and missed details will add up and can easily lead to hit & miss results. The results will depend on the difficulty of the task and how many “hops” are necessary. For example, if the answer to the research question can be directly compiled from search results — e.g. asking for cases where local courts allowed the dismissal of an employee due to substance abuse — then the results will likely be good. Conversely, if the initial search results open up new legal questions, whose answers in turn trigger yet other questions, then at some point the agent will need to start summarising search results that were obtained using a query that was itself formulated on the basis of a summary of previous intermediate results. At that point, errors start compounding and the results may be poor.
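A highly simplified sketch of such an agentic research loop is shown below: the LLM proposes a query, the results are analysed, and intermediate findings are periodically summarised to stay within the context window. Both call_llm and search_case_law are hypothetical stand-ins for an LLM API and a legal research database API; real agent frameworks are considerably more elaborate.

```python
# Simplified agentic research loop with periodic summarisation.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def search_case_law(query: str) -> list[str]:
    raise NotImplementedError("plug in a research database API here")

def research(question: str, max_hops: int = 5, max_notes_chars: int = 20_000) -> str:
    notes = ""
    query = question
    for _ in range(max_hops):
        results = "\n".join(search_case_law(query))
        notes += call_llm(f"Extract the findings relevant to '{question}':\n{results}")
        if len(notes) > max_notes_chars:
            # Each summarisation pass loses detail; this is where errors compound.
            notes = call_llm(f"Summarise these research notes, keeping citations:\n{notes}")
        query = call_llm(
            f"Question: {question}\nNotes so far:\n{notes}\n"
            "Suggest the next search query, or reply DONE."
        )
        if query.strip() == "DONE":
            break
    return call_llm(f"Write a short memo answering '{question}' based on:\n{notes}")
```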
If you’ve already used ChatGPT to replace Google, you will probably have noticed this hit & miss yourself. If you ask an easily understandable question for which not too many “hops” are necessary and the results are easily interpretable (e.g., “Find a local pizzeria that offers pizzas with both pepperoni and olives”), it will give very good results in a few seconds. Conversely, if you give it a query for which it has to deeply think about each result and combine them with yet other information to be researched from other websites (e.g., “What’s the best air purifier for pet owners in small apartments?”), it will give poor results. It is often claimed that LLMs will replace the traditional web-searches and end Google’s monopoly, but we expect that both will thrive side by side, like TV and radio sharing the airwaves.
Deep reasoning models
In September 2024 OpenAI released its “O1” model, which is claimed to offer deeper reasoning possibilities. Essentially, this model takes a lot more time to think carefully about the instruction given to it, often going over the same problem many times in order to arrive at the result.
OpenAI advertises the O1 model for use in the hard sciences, e.g. for solving problems in physics, chemistry and biology. For some complex legal tasks, the deep reasoning also seems to help, although the results remain inconsistent. In our experience, many legal tasks where LLMs are currently involved do not require deep reasoning, so O1’s slowness becomes a source of frustration. Also, because it requires exponentially more hardware resources, OpenAI imposes limits on the number of submissions. Various other technical limitations also restrict the widespread use of O1, but it’s a promising new development that suggests in which direction OpenAI will go with GPT-5. Meanwhile, open-source models already seem to be catching up with similar “deep reasoning” models.
The jury is therefore still out on whether LLMs can really think, or whether they don’t even reach the intelligence level of our pets. Some claim that the approach of models such as O1 isn’t all that spectacular (“it’s merely pressuring an LLM to repetitively think about the same problem 50x”), and just a way for OpenAI to keep up some enthusiasm because GPT-5 will probably not be a big jump in quality due to diminishing returns. At least when it comes to deep mathematical thinking, LLMs apparently continue to disappoint. So-called AGI (artificial general intelligence) should probably not be expected anytime soon.