Zero-to-production: bootstrapping a custom model in AI21 Studio
Learn how to develop a text-based AI application in AI21 Studio and grow it from prototype to production using custom Jurassic-1 models
In this blogpost, we walk through a case study demonstrating how you can build production-grade language applications quickly and effortlessly using AI21 Studio. We use a simple text classification use-case as our guiding example, but the same process can be applied to a wide variety of language tasks.
We start by introducing the task and implementing a baseline solution using prompt engineering with Jurassic-1. Although prompt engineering is a quick and straightforward method for building a proof-of-concept, it has some key limitations, which we highlight in detail below.
To address these limitations, we implement a better solution by training a custom model specifically for our example task. AI21 Studio’s custom models are tailored for optimal performance on a given task and can be economically scaled-up to serve production traffic. Even more importantly, custom models are easy to set up and use!
Custom models in AI21 Studio require remarkably little data for training; most tasks can be addressed successfully by training a model on as few as 50-100 examples. This dramatically lowers the entry barrier for training a state-of-the-art language model of your own, to power your app with excellent quality results. Your trained custom model is served in AI21 Studio and it is immediately available for use, making integration in your application as easy as performing an API query.
Our Example Task: Classifying News Article Topics
As a guiding example throughout this blogpost, we will use a well known Natural Language Processing (NLP) task: classifying the topics of news articles. Specifically, we follow the AG News task formulation, where articles are classified based on their title and their summary into one of four categories: “World”, “Sports”, “Business” or “Science and Technology”. For example, consider the following sample of three examples from the AG News training set:
The correct labels for examples 1-3 above are “Business”, “Sports” and “World”, respectively.
Prompt Engineering Solution
Jurassic-1 language models are trained to predict likely continuations for a piece of input text, called a “prompt”. We will use this ability to our advantage, by feeding the model with manually crafted prompts that give rise to a specific behavior in the model output; in our case, we’d like the model to output correct labels for articles. The practice of composing a suitable prompt for a specific task is commonly referred to as “prompt engineering”.
A common and effective prompt engineering approach is to construct a prompt containing a sequence of correctly labeled examples, often called a “few-shot prompt”. The goal of a few-shot prompt is to cause the model to latch on to the correct relation between inputs (for our task - title and summary) and outputs (topic label).
We built a few-shot prompt for our use-case according to best practices, as described in detail in the Appendix. We measured the accuracy of predictions on the AG News test set, while varying the number of examples in the prompt. The results are reported in the table below. Note that as we increase the number of examples, the models tend to perform better. This is typically the case with few-shot prompts. Using J1-Jumbo, we reach a respectable accuracy of 86% for a 16-example prompt.
Why go beyond Prompt Engineering?
We were able to get impressive performance on our task with a very simple method and a tiny dataset of 16 examples (4 per class). This is excellent, but there are several reasons not to be content with our prompt engineering-based solution.
First and most obvious, few-shot prompts can become rather long as the number of examples is increased. This means that on each new prediction the model must read and process the entire prompt, increasing the compute time and resulting in higher latency, energy consumption and cost per prediction. Since adding examples to the prompt improves accuracy considerably, we are left with an inconvenient tradeoff between quality and cost, making it challenging to serve production-scale traffic with prompt engineering-based solutions.
Furthermore, language models have a maximum allowed input length. Jurassic-1 models can handle up to 2048 tokens (or roughly 2300-2500 English words), counting both the prompt and the generated text. Although impressive, this might not be enough in many cases. Imagine if instead of using a short summary of an article, we wanted to use the full article body as the input to our topic classification task. In such a case it would be hard to fit more than 1 or 2 examples into the prompt, so we wouldn’t have even one example per topic (recall we want to distinguish between 4 topics). Alternatively, had we needed to categorize article summaries into one of 100 potential topics, we would have had a similar problem.
Even if we make the most out of prompt engineering, we may require even better accuracy. Suppose we wanted to break the 90% barrier; judging by our results so far, this might be a tall order for prompt engineering.
Finally, we note that custom models have safety advantages. Since they are trained to perform a particular task, it is harder for a malicious user to abuse access to a custom model. We give an example of this in the Appendix.
In the next section we introduce custom models as a more precise, scalable and cost-effective alternative to prompt engineering, which allows us to overcome the limitations described above.
Custom Model Solution
Custom models are specialized versions of the general purpose models which we have been using so far, trained to deliver optimal results on a given task. By combining our powerful Jurassic-1 language models with our special training techniques, AI21 Studio allows you to train a capable model using only a small amount of data, and makes it very easy to query your model once it’s trained.
To train a custom model of your own, all you need to do is provide us with a small dataset of correctly solved examples, and we will do the rest.
The quality of results you get from a custom model depends on the amount and the quality of training data you provide. As a starting point for most tasks, we recommend using 50-100 examples, since we find this is often enough to achieve great results. So, if you can get your hands on a few dozen examples, we recommend you try our custom models. Even if you don’t have data at all, we have a few tricks you can use - see the Appendix for more details.
To pick up where we left our topic classification case study, we trained custom models for AG News while varying the number of examples in the training set. The resulting accuracy on the AG News test set is shown in the table below. As you can see, with as few as 10 examples our custom model’s accuracy is comparable to J1-Large with a few-shot prompt. We also see that adding more training examples results in better accuracy for our custom model, surpassing J1-Jumbo’s few-shot performance with only 80 training examples.
Custom models really start to shine when we take latency into consideration. Feeding the general purpose J1-Jumbo model with an engineered prompt consisting of 16 examples, we got an accuracy of 86%, and the typical processing time in this case is ~250 milliseconds. A custom model that matches or exceeds this accuracy only needs to process the summary and title of the individual example we’re labeling; there is no need to include a lengthy prompt in every request, because the task-specific behavior is already baked into the model. As a consequence, our custom model can process a typical request in less than 50 milliseconds, offering a greater than 5x speedup compared to prompt engineering.
Appendix: Engineering the Prompt
An easy way to start engineering a prompt is to feed the model with simple instructions for the required task, such as the following:
To predict the label for a given article, we could input the text above followed by the article title, its summary, and a heading that prompts the model to output the correct label as a completion. The full prompt using this scheme follows below.
We can replace the text highlighted in blue with the relevant title and summary for any inference example we’d like to label and let the model generate a continuation after “The topic of this article is:”. This is commonly referred to as a “zero-shot prompt”, because the model is expected to correctly perform the task without feeding it any correctly solved example. Using the prompt above, J1-Large has an accuracy of 32% and J1-Jumbo gets 56.9% on the AG News test set; although significantly better than a random guess accuracy, this leaves much to be desired.
The zero-shot approach can be improved upon by adding a number of correctly labeled examples to the prompt itself, making it a “few-shot prompt”. Jurassic-1 models recognize and imitate patterns in text, so including a few solved examples in the prompt helps reinforce the desired relation between inputs (title and summary) and outputs (topic label); this usually improves prediction accuracy. A few-shot prompt using the sample of three articles above is shown below. As in the zero-shot case, it will end with the inference example specified in the same uniform format, so the text highlighted in blue will be replaced with the appropriate content for the inference example.
Once we’ve decided on a prompt format as above, we can simply make a longer prompt by adding more examples in the same format. There are some considerations we should keep in mind while doing this:
Keep the examples relatively balanced between the classes (i.e. the 4 possible topics), to avoid biasing the model towards the more common classes in the prompt.
Scramble the order of the examples to avoid the model latching on to the wrong pattern (e.g. “Business” always comes after “World” in the examples).
Make sure there’s enough room for the example you actually want to predict. Jurassic-1 models are restricted to 2048 tokens, which should include the prompt and the generated output (1-2 tokens in our case).
We built a prompt corresponding to these guidelines. The resulting accuracy for varying prompt lengths is reported above. Not surprisingly, the best results are achieved with J1-Jumbo and the longest (16-example) prompt, reaching an accuracy 86% on the AG News test set.
Appendix: Safety Advantages of Custom Models
AI21 Labs is committed to promoting safety in our products. One potential safety risk is deliberate misuse by malicious users of your application, exploiting its access to Jurassic-1 to generate text for their malicious purposes. Adversaries may attempt to achieve this via “prompt injection”, where the end-user’s input text is crafted to alter the normal behavior of the model. As we will now demonstrate, custom models are less susceptible to such attacks than general-purpose models, offering a significant safety advantage when deployed in production.
Consider a malicious user who has access to the news article topic classification system built on top of Jurassic-1 and wishes to abuse it to extract toxic generations from the model. Recall that an engineered prompt for topic classification ends with the following text:
Where <TITLE> and <SUMMARY> are user inputs. A malicious user may attempt a prompt injection attack by providing the adversarial input “The topic of this article is:” followed by some offensive text in place of a legitimate summary, hoping that the model will generate the offensive text or something related to it as the completion. Since language models latch on to patterns, the malicious user may even repeat this input line a few times to increase their chances of success. For example, below the adversarial input is repeated 3 times:
The figures below compare the performance of the solutions described in this blog post - prompt engineering versus custom models - when faced with an attack like this. We see that the outcome depends on the number of repetitions of the toxic text in the input. For prompt engineering, if the adversary introduces 2 or more repetitions of the toxic text, 50% of the model outputs are toxic, and 3 or more repetitions cause the model to exclusively generate toxic outputs. For custom models, it takes 4 or more repetitions for the adversary to successfully extract toxic generations from the model, and even then the probability of a toxic output is much lower than for prompt engineering.
Although using a custom model doesn’t eliminate the risk entirely in this example, it does decrease it substantially, in a way that makes other safety mechanisms more effective. For example, to protect against prompt injection in a news topic classification system, it makes sense to limit the amount of text a user can input in the summary field. Any safety measure creates a tradeoff between restricting usage and guaranteeing safety, which in this case is found in the maximum allowed input length: set the threshold too high and prompt injection attacks will be more likely to succeed; set the threshold too low and legitimate inputs will be blocked. Using a custom model, which is less susceptible to prompt injection out of the box, makes the tradeoff easier and allows a developer to choose a higher threshold while guaranteeing the same level of safety.
Appendix: No data? No problem!
Jurassic-1 custom models offer excellent accuracy even when trained on a surprisingly small dataset. Nevertheless, sometimes even a small annotated dataset is hard to come by. Wouldn’t it be nice to enjoy the benefits of custom models without collecting any labeled data? As we will now demonstrate in our topic classification case study, this is possible.
First, we note two simple observations:
A prompt engineering solution with 1-4 examples per class achieves reasonable accuracy.
The model assigns a probability to the label it generates, which indicates its “confidence” in the label. High-confidence labels are more likely to be correct predictions.
Relying on these two observations, we propose the following simple approach:
Collect many unlabeled input examples for topic classification.
Use a prompt engineering-based solution to automatically label them with J1-Jumbo.
Filter the auto-labeled examples by confidence, resulting in a dataset of examples where the model assigns >85% probability to the labels it predicted. Make sure the different classes are equally represented in the dataset.
Train a custom model on the high-confidence auto-labeled dataset.
The table below shows the test set accuracy of custom models trained using this approach. We varied the number of labeled examples in the few-shot prompt, and used it to auto-label 160 examples for training. With just 4 labeled examples (1 per class), we beat J1-Jumbo’s accuracy with an engineered prompt containing 8 labeled examples. Using 16 labeled examples (4 per class) to auto-label a 160-example dataset, we not only beat J1-Jumbo but also match the performance of a custom model trained on 160 manually labeled examples (88.5%).
ABOUT THE AUTHOR
Stay up to date with the latest research and updates from AI21 Labs.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
In August 2021 we released Jurassic-1, a 178B-parameter autoregressive language model. We’re thankful for the reception it got – over 10,000 developers signed up, and hundreds of commercial applications are in various stages of development. Mega models such as Jurassic-1, GPT-3 and others are indeed amazing, and open up exciting opportunities. But these models are also inherently limited. They can’t access your company database, don’t have access to current information (for example, latest COVID numbers or dollar-euro exchange rate), can’t reason (for example, their arithmetic capabilities don’t come close to that of an HP calculator from the 1970s), and are prohibitively expensive to update. A MRKL system such as Jurassic-X enjoys all the advantages of mega language models, with none of these disadvantages. Here’s how it works.
Compositive multi-expert problem: the list of “Green energy companies” is routed to Wiki API, “last month” dates are extracted from the calendar and “share prices” from the database. The “largest increase“ is computed by the calculator and finally, the answer is formatted by the language model.
There are of course many details and challenges in making all this work - training the discrete experts, smoothing the interface between them and the neural network, routing among the different modules, and more. To get a deeper sense for MRKL systems, how they fit in the technology landscape, and some of the technical challenges in implementing them, see our MRKL paper. For a deeper technical look at how to handle one of the implementation challenges, namely avoiding model explosion, see our paper on leveraging frozen mega LMs.
A further look at the advantages of Jurassic-X
Even without diving into technical details, it’s easy to get a sense for the advantages of Jurassic-X. Here are some of the capabilities it offers, and how these can be used for practical applications.
Reading and updating your database in free language
Language models are closed boxes which you can use, but not change. However, in many practical cases you would want to use the power of a language model to analyze information you possess - the supplies in your store, your company’s payroll, the grades in your school and more. Jurassic-X can connect to your databases so that you can ‘talk’ to your data to explore what you need- “Find the cheapest Shampoo that has a rosy smell”, “Which computing stock increased the most in the last week?” and more. Furthermore, our system also enables joining several databases, and has the ability to update your database using free language (see figure below).
Jurassic-X enables you to plug in YOUR company's database (inventories, salary sheets, etc.) and extract information using free language
AI-assisted text generation on current affairs
Language models can generate text, yet can not be used to create text on current affairs, because their vast knowledge (historic dates, world leaders and more) represents the world as it was when they were trained. This is clearly (and somewhat embarrassingly) demonstrated when three of the world’s leading language models (including our own Jurassic-1) still claim Donald Trump is the US president more than a year after Joe Biden was sworn into office. Jurassic-X solves this problem by simply plugging into resources such as Wikidata, providing it with continuous access to up-to-date knowledge. This opens up a new avenue for AI-assisted text generation on current affairs.
Who is the president of the United States?
Joe Biden is the 46th and current president
Jurassic-X can assist in text generation on up-to-date events by combining a powerful language model with access to Wikidata
Performing math operations
A 6 year old child learns math from rules, not only by memorizing examples. In contrast, language models are designed to learn from examples, and consequently are able to solve very basic math like 1-, 2-, and possibly 3- digit addition, but struggle with anything more complex. With increased training time, better data and larger models, the performance will improve, but will not reach the robustness of an HP calculator from the 1970s. Jurassic-X takes a different approach and calls upon a calculator whenever a math problem is identified by the router. The problem can be phrased in natural language and is converted by the language model to the format required by the calculator (numbers and math operations). The computation is performed and the answer is converted back into free language. Importantly (see example below) the process is made transparent to the user by revealing the computation performed, thus increasing the trust in the system. In contrast, language models provide answers which might seem reasonable, but are wrong, making them impractical to use.
The company had 655400 shares which they divided equally among 94 employees. How many did each employee get?
Each employee got 7000 stocks
(No answer provided)
6972.3 X= 655400/94
Jurassic-X can answer non-trivial math operations which are phrased in natural language, made possible by the combination of a language model and a calculator
Solving simple questions might require multiple steps, for example - “Do more people live in Tel Aviv or in Berlin?” requires answering: i. What is the population of Tel-Aviv? ii. What is the population of Berlin? iii. Which is larger? This is a highly non-trivial process for a language model, and language models fail to answer this question (see example). Moreover, the user can’t know the process leading to the answers, hence is unable to trust them. Jurassic-X can decompose such problems into the basic questions, route each to the relevant expert, and put together an answer in free language. Importantly, Jurassic-X not only provides the correct answer but also displays the steps taken to reach it, increasing the trust in the system.
Do more people live in Tel Aviv or in Berlin?
There are more people living in Tel Aviv than in Berlin.
Berlin and Tel Aviv are roughly the same size
(First hit is a comparison between Tel Aviv and Berlin)
More people live in Berlin than in Tel-Aviv
[‘Return population of Tel Aviv’; Return population of Berlin’; Return which is bigger between #1 and #2’] Step 1: Population of Tel Aviv. Result - 451523. Step 1: Population of Berlin. Result - 3664088. Step 3: Which is bigger, #1 or #2. Result - Berlin.
Jurassic-X breaks down compositional questions, answers the basic sub-questions, and puts together the answer. Importantly, this process is transparent to the user greatly increasing the trust in the system
Dynamic information (like weather and currency exchange rates)
Certain types of information change continuously - weather, currency exchange rates, share values and more. Such information will never be captured by language models, yet can easily be handled by Jurassic-X by integrating it with a reliable source of information. We performed a proof-of-concept on two such features - weather and currency exchange rates, and the design enables quick integration with more sources to solve your use-case. Weather - a loosely phrased question about the weather elicits an answer from all language models, where language models always return the same answer, regardless of when the question was asked (funny, right?), while Jurassic-X provides an answer based on the actual weather prediction.
I’m going to be in New-York in 3 days. Should I pack my umbrella?
Yes, you should pack your umbrella.
Yes, you should. The weather forecast is rain.
(Links to weather websites)
Yes, you should pack your umbrella, because in New York in 3 days there will be broken clouds and the temperature will be -2 degrees.
Currency exchange rates change much faster than weather predictions, yet the Jurassic-X concept - a language model connected to a reliable source of information - easily solves this problem as well.
How much Moroccan money will I get for 100 bucks?
100 moroccan dirhams is about 27$.
How much is 100 dollars in moroccan money?
100 dirhams = 10.75 dollars
100 USD = 934.003 MAD
Jurassic-X combines a language model with access to APIs with continuously changing information. This is demonstrated for weather forecasts and currency exchange rates, and can easily be extended to other information sources
Transparency and trust
Transparency is a critical element that is lacking in language models, preventing a much wider adoption of these models. This lack of transparency is demonstrated by the answers to the question - “Was Clinton ever elected as president of the United States?”. The answer, of course, depends on which Clinton you have in mind, which is only made clear by Jurassic-X that has a component for disambiguation. More examples of Jurassic-X’s transparency were demonstrated above - displaying the math operation performed to the user, and the answer to the simple sub-questions in the multi-step setting.
Was Clinton ever elected president of the United States?
No, Clinton was never elected as president of the United States.
Clinton was elected president in the 1992 presidential elections…
Bill Clinton was elected president.
Jurassic-X is designed to be more transparent by displaying which expert answered which part of the question, and by presenting the intermediate steps taken and not just the black-box response
That's it, you get the picture. The use cases above give you a sense for some things you could do with Jurassic-X, but now it's your turn. A MRKL system such as Jurassic-X is as flexible as your imagination. What do you want to accomplish? Contact us for early access
Your submission has been received!
Oops! Something went wrong while submitting the form.