Enterprise Generative AI: Key Challenges and How to Solve Them

May 31, 2024

Learn about some of the key challenges facing organizations when deploying enterprise LLM, and how mature LLM implementations address these challenges.

Large Language Models (LLMs) are revolutionizing enterprise operations. Or are they?

While headlines have spent the last year hollering about potential cost savings, efficiency improvements, and customer support, the reality has been much less exciting: countless projects have spun up, only to fizzle out in the face of severe stumbling blocks.

This article delves into the complexities of enterprise LLM adoption that have bogged down LLM deployments. Even better is an exploration of how mature approaches to LLM adoption address these challenges - and achieve the operational efficiency once promised - through a task-clad approach.

What is a Large Language Model, Anyway?

While AI has lit the first half of the 2020s ablaze, it’s easy to forget that machine learning has existed since the 1940s. Traditionally, machine learning models were essentially algorithms: given an input and an expected output, the path from A to B is built and updated by the model. Updating its own approach, the machine figures out a way to achieve its goals through trial and error. Chief of this are hyperparameters - parameters that cannot be learned, and need to be tuned by the user.

This process was supercharged by the development of deep learning. In order to add AI complexity without too much cost, deep learning breaks down an infinitely complex subject - such as language - into its relevant layers of complexity. Envision a pyramid - at the bottom, the basic letters of the alphabet, followed by the common assembly of these letters into words, followed by the grammatical laws that dictate which word goes where. Enter the final catalyst for today’s AI boom: transformer models. To best describe this, think of map coordinates where latitude and longitude are both referenced (Paris, for example, can be found at 48.864716, 2.349014). LLM transformer models essentially do this with words - but instead of 2 dimensions, they include thousands. This way, all words in the English language can be mapped in relation to one another.

Transformer architecture is how LLMs deal with homonyms, or words that have more than one meaning. The same word can be represented with different vectors, depending on the context. There’s one vector for bank (as in, the building) and another for bank (of a river), for instance.

Trained on vast datasets, the statistical engines that power these AI models are now able to take input data (that is, an end-user’s query), and generate streams of relevant sentences in response. Thus, the conflagration of interest in how these can be applied within an organization.

Why Use LLMs in Your Enterprise?

Given the sheer flexibility of LLM use cases, their enterprise potential is defined by where they’re needed most. See our rundown for a comprehensive list of enterprise AI use cases.

In customer service, time matters - customers expect rapid responses to support issues and purchase inquiries. However, automated replies can frustrate rather than help - when falling short of the one-on-one support required, customers can often find themselves in conversational loops with automated chatbots misidentifying the issue and presenting them with useless results. Cue unhappy customers, missed sales, and bad reviews. LLMs promise an approach to automation that actually acknowledges and responds to a customer’s input, streamlining your support staff’s workflow.

Customer-facing applications are only one of many; looking internally can identify some even heftier time savings. IBM is one company already realizing these, having deployed an AskHR app to its quarter of a million employees. This rapidly answers employees’ questions on HR matters, saving hours of employees trawling through obscure documentation or hassling HR itself.

While LLMs are starting to achieve some momentum, let’s discuss the elephant in the algorithm: LLM models are plagued by a few stubborn issues that have hamstrung many early adoption attempts.

The Key Challenges in Deploying Enterprise LLM

Concerns around enterprise LLM adoption are numerous, and many are worth delving into. Surface level concerns revolve around the application and return on investment offered by these resource-intensive models: after all, while LLMs can perform many tasks, not all of them have business value. Deeper concerns, however, sit at the very heart of how LLMs handle and generate language.

The Headache of Hallucinations

Traditional software relies on unambiguous data. Ask a computer to multiply 2 by 3, and it encounters no issues at all. Natural language, on the other hand, is full to the brim with gray areas. Humans resolve this by looking at the context around each word - and even then, we’re not always successful.

To dig a bit deeper into how LLM models do and don’t handle this, let’s return to the idea of words being transformed into vectors. Remember how the word ‘bank’ was given 2 different codes, depending on whether it was the building or the river feature? AI models must also handle the opposite; identifying different terms that refer to the same underlying thing. This process is called normalization. ‘Joe Biden’, ‘President Biden’, and ‘Biden’ are all commonly interchangeable in day-to-day conversations, but an LLM is left struggling. There are no deterministic rules for knowing all of these refer to one person - rather, it requires pre-existing knowledge of the world outside.

In order to tackle this, a number of foundational models have chosen to add another layer of understanding, that requires name variants to be grouped under one identity code. Google’s Enterprise Knowledge Graph takes this approach, mapping "Joe Biden", "Biden" and "President Biden" to a common identity code of "/m/012gx2".

The problem with this approach is that it requires a continually updated knowledge base of information about the world's entities. Thanks to the incredibly wide focal lens of today’s publicly available LLMs, their underlying knowledge bases are required to be just as massive. Not only does this make LLM deployments incredibly time- and resource-intensive, but further leaves them fairly vulnerable. Consider the fact that OpenAI’s GPT-3 relies on word vectors with 12,288 dimensions—that is, each word is represented by a list of 12,288 numbers. Each dimension gives the model an extra space to write notes on the word’s meaning and context. As the model works through each layer, its understanding of the entire passage gradually sharpens.

While fantastically clever, the entirety of this process can be thrown off by one miss-classification. This is how you end up with Bard incorrectly claiming that the James Webb Space Telescope had taken the first pictures of a planet outside our solar system. Even worse, this is only one - relatively benign - example of AI hallucination. And while Google’s shares were the only victims of this, deploying an LLM demands a level of risk acceptance that, for many organizations, is simply too high.

Non-Compliant Environments

The sheer scale of LLMs can make their output highly unstable. Worsening this is the fact that LLMs are essentially a ‘black box’ - it’s impossible to understand how they go from prompt to output.

The unfortunate side-effect of this is the fact that deployment environments are left open to theft and injection. Consider some of the following real-world examples:

Training Data Theft

One of the concerns around deployed LLMs is the ability of threat actors to steal proprietary software. While market-leading models such as GPT have relied on the ambiguity of its training data to avoid this, Deepmind researchers have already shown the potential of extractable memorization. By querying the machine learning model itself, researchers have been able to pull entire strings of sentences that match precisely with publicly available text. This would allow an adversary to download large corpora of training data and therefore build their own auxiliary databases.

While the models themselves remain in a copyright gray area, the greater concern in an enterprise is the risk of LLMs that are trained on sensitive or confidential data. Market leaders such as GPT-4 have battled the issue of extractable memorization by ‘aligning’ a model - that is, training the model to stick to a predefined role of helpful assistant. Unfortunately, however, the researchers found that this was quickly and easily side-stepped - so much so, in fact, that the worst offender was GPT-3.5-turbo. Almost 1% of generated words were in the same sequence as found in the training data. Some of the extracted information contained personally identifiable information such as emails, fax numbers, and postcodes. 85% of this PII was genuine, and linked to real people.

While such theft is unlikely to occur organically in day-to-day use, they do highlight a deeper problem: LLMs are unpredictably vulnerable. Not knowing where or when they may spill internal data could result in some of the worst insider attacks on record.

Prompt Injection

Getting an LLM to spill its training data is only one example of a prompt injection attack. So far, these attacks seem to be of particular concern around LLMs that integrate with external applications and databases - likely thanks to the wider range of formats a potential attack could take.

One example given by the NCSC is of a bank that builds an LLM assistant to handle account holders’ questions. An attacker could combine traditional fraud with a prompt injection attack by hiding LLM prompts within a small malicious transaction. When a genuine user asks the chatbot ‘Why am I spending more this month?’, the LLM analyzes transactions, encounters the prompt hidden within the payment’s reference, and follows the instructions contained within. This way, an LLM could be hijacked to send a user’s information or money directly to an attacker.

Other examples are just as scary; an email app utilizing an LLM to detect and warn users of phishing emails could be duped by a phisher adding an invisible paragraph to the end of their message. This hidden text instructs the LLM to add the email to the ‘legitimate’ pile, even though it’s a phishing attempt. As the LLM interprets this part of the email content as an instruction, it mistakenly classifies the phishing email as legitimate, exposing the recipient to further attack.

The sheer variety of conversations and datasets an LLM must pull from can make it incredibly difficult to keep it protected throughout deployment and beyond. While this can make LLM challenges seem insurmountable, it’s vital to keep one element in mind: scale.

How Smaller & Specialized May Be the Answer

When researchers Milad Nasr and Nicholas Carlini were exploring scalable extraction, they came across an interesting disparity. Some of the largest, best-funded models in the space - such as GPT - were spewing out training data at far higher volumes than some of the smaller models they tested. By extrapolating from this data, GPT’s total extractable memorization was calculated to be 5 times higher than its smaller counterparts. This is due to the extreme scale that GPT is run at: the only way such scale can be supported is by ‘over-training’ a model. The result is an LLM that learns its training data by rote, and spits it back out at unpredictable times.

Many of the pitfalls leveraged at LLM development stem from the same cause. With this in mind, let’s explore how mature LLM implementations avoid the ROI- and reputation-shredding issues of some of their peers.

Pinpoint Your Use Case

LLMs, for a brief period, were the ‘shiny new toy’ in many organizations’ scopes. However, the ROI of general-purpose LLMs has remained stubbornly low. To fix this, the highest-flying AI implementations are trained and deployed for the specific task they’re implemented in. Before blindly adopting this new technology, successful organizations create a strategy that evaluates precisely which deployment contexts would provide the most benefit.

Enter Task-Specific Models. Task-Specific Models (TSM) are optimized, smaller models trained to excel in specific generative AI tasks. At their core, Task-Specific Models are language models trained and engineered to excel in a particular natural language capability. As these models are ‘sharpened’ for a specific purpose, their output is more accurate (less likely to contain blatant errors), reliable (produces a consistent type of output) and grounded (aligned with the provided context).

Yet they're are more than just specialized language models. They contain various verification mechanisms, wrapped around the core model, that make them easier to implement, more secure and better performing.

So in order to gain the most impact from LLM implementation, decide which team or project requires the most support. Then it’s useful to break the requirements of an AI into even more precise goals. Consider the following applications:

Rapid, Relevant Answers

This use case grounds a Task-Specific Model (TSM) TSM in the role of project expert. Trained specifically on the whitepapers and documentation around an individual project, the model is able to offer precise and grounded details.

Summarization

When employees require a rapid overview of information, the scale of LLM datasets can massively increase the chance of hallucination. TSMs combat this by staying laser-focused on the data on hand, eliminating the issue of over-training.

Semantic Search

Gaining relevant insights without having to trawl through every document can rapidly accelerate employees’ tasks. TSMs offer a safer way to do this by simply shutting down the conversation if no relevant data is found - rather than hallucinating or following malicious prompts.

Build the Correct Framework

With a use case identified, successful LLM projects then build a strong framework. This demands definitive rules, procedures, and benchmarks, all of which help define how an LLM will be embedded into current operations and systems. This framework must cover aspects like data management, the training and refining of models, merging with present IT setups, and adhering to legal standards.

It should also detail recommended methods for tracking the performance of models, tackling issues of bias and ethics, and safeguarding data confidentiality and integrity. A properly organized framework lays a firm base for every successful project.

Collect Actionable Data with a Pilot Phase

Starting with a pilot project enables companies to take a concentrated assessment of how well the LLM initiative meets specific business needs, offering crucial information about its advantages and drawbacks.

Furthermore, a pilot allows businesses to pinpoint and tackle any technical, procedural, or compliance-related challenges at an early stage, prior to wider roll-out. Furthermore, this approach helps guarantee secure investment, as teams can witness real benefits before committing more resources to expansive initiatives.

Nevertheless, it’s crucial to incorporate a framework within the pilot that can then be easily scaled once in production, ensuring that the AI system is robust enough for wide-scale deployment.

Use That Data to Improve

Continuous iteration demands your LLM project to evaluate its efficacy and enhance its approach in response to feedback and findings.

This guarantees that your LLM initiative adapts to new opportunities and confronts new challenges as the business landscape changes. Engaging in an iterative cycle not only boosts the utilization efficiency and efficacy of LLMs but also allows organizations to maximize the return on their investment over the long haul.

While TSMs are built to the tight contours of a specific project, their architecture allows for multiple TSMs to be deployed across an organization’s surface area. This means that once you’ve developed the process for implementing one TSM, it can be efficiently applied to further integrations.

Identify and Transform Your Use Cases With AI21

The case for Task-Specific Models has never been stronger. AI21 Labs stands at the forefront of this innovative shift, offering high-fidelity TSMs that condense the raw power of LLMs into actionable AI. Our use-case approach ensures precision, efficiency, and safety, built from a foundation of service - where we help identify the highest-ROI possibilities facing you today.

Don't let the limitations of generic models hold your organization back. Embrace the future of AI with AI21 solutions and reach out.

‍

Heading

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Discover more

Enterprise Generative AI: Key Challenges and How to Solve Them

What is a Large Language Model, Anyway?

Why Use LLMs in Your Enterprise?

The Key Challenges in Deploying Enterprise LLM

The Headache of Hallucinations

Non-Compliant Environments

Training Data Theft

Prompt Injection

How Smaller & Specialized May Be the Answer

Pinpoint Your Use Case

Rapid, Relevant Answers

Summarization

Semantic Search

Build the Correct Framework

Collect Actionable Data with a Pilot Phase

Use That Data to Improve

Identify and Transform Your Use Cases With AI21

Discover more

Discover more

Heading

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Subscribe to our newsletter

Discover more

Enterprise Generative AI: Key Challenges and How to Solve Them

What is a Large Language Model, Anyway?

Why Use LLMs in Your Enterprise?

The Key Challenges in Deploying Enterprise LLM

The Headache of Hallucinations

Non-Compliant Environments

Training Data Theft

Prompt Injection

How Smaller & Specialized May Be the Answer

Pinpoint Your Use Case

Rapid, Relevant Answers

Summarization

Semantic Search

Build the Correct Framework

Collect Actionable Data with a Pilot Phase

Use That Data to Improve

Identify and Transform Your Use Cases With AI21

Subscribe to our newsletter

Discover more

Discover more