
AI21 Labs concludes largest Turing Test experiment to date

May 9, 2023

As part of an ongoing social and educational research project, AI21 Labs is thrilled to share the initial results of what has now become the largest Turing Test in history by scale.

Since its launch in mid-April, more than 15 million conversations have been conducted in “Human or Not?”, by more than two million participants from around the world. This social Turing game allows participants to talk for two minutes with either an AI bot (based on leading LLMs such as Jurassic-2 and GPT-4) or a fellow participant, and then asks them to guess if they chatted with a human or a machine. The gamified experiment became a viral hit, and people all over the world have shared their experiences and strategies on platforms like Reddit and Twitter.

Main Insights from the Experiment

After analyzing the first two million conversations and guesses, here are the main insights from the experiment so far:

  • 68% of people guessed correctly when asked to determine whether they talked to a fellow human or an AI bot.
  • People found it easier to identify a fellow human. When talking to humans, participants guessed right in 73% of the cases. When talking to bots, participants guessed right in just 60% of the cases.
  • France has the highest percentage of correct guesses out of the top playing countries at 71.3% (above the general average of 68%), while India has the lowest percentage of correct guesses at 63.5%.
  • Correct guess by gender - Both women and men tend to guess correctly at similar rates, with women succeeding at a slightly higher rate.
  • Correct guess by age group - Younger age groups tend to have correct guesses at slightly higher rates compared to older age groups.

Popular Approaches and Strategies

Beyond the numerical results, the team identified several popular approaches and strategies that participants used to figure out whether they were talking to a human or a bot. Many of the common strategies were based on the perceived limitations of language models that people had encountered while using ChatGPT and similar interfaces, and on their prior perceptions of human behavior online.

People assume bots don’t make typos or grammar mistakes, or use slang

People tend to associate spelling and grammatical errors, as well as the use of slang, with humans. Therefore, when they noticed mistakes in their counterpart's messages, the first instinct of many participants was to assume they were talking to a fellow human, even though most models in the game were trained to make these types of mistakes and to use slang words.

People felt that personal questions were a good way to test who they’re talking to

Participants frequently asked personal questions, such as "Where are you from?", "What are you doing?" or "What’s your name?", believing that AI bots would not have a personal history or background, and that their responses would be limited to certain topics or prompts. By asking personal questions, participants tried to gauge whether their chat partners were capable of responding like humans, with unique insights, experiences, and stories. Despite that, most bots were able to answer these types of questions very well and make up personalities, since they’ve seen a lot of personal stories in their training data.

People assume bots aren’t aware of current and timely events

AI models are known to have a strict data cutoff date, and they are unaware of any events that happened after this date. Participants took advantage of this fact by asking about recent news events, sports results, current weather, recent TikTok trends, date and time, etc. They believed that by asking questions such as "What is the exact date and time where you are?", "What was the weather like yesterday?" or "What did you think of Biden’s last speech?", they could distinguish between human and AI participants. Interestingly, one of the most popular messages sent by humans was "t’as les cramptés?", which refers to a TikTok dancing trend that was popular in France at the time. Most models in the game, however, were connected to the internet and were aware of some of the recent events that appeared in the news.

People tried to challenge the conversation with philosophical, ethical, and emotional questions

Participants asked questions that aimed to probe the AI's ability to express human emotions or engage in philosophical or ethical discussions. These questions included topics such as: "What is the meaning of life?", "What do you think about the Israeli-Palestinian conflict?", and "Do you believe in God?".

People associated politeness with being less human

Some participants assumed that if their counterpart was too polite and kind, they were probably a bot, due to the perception that people, especially online, tend to be rude and impolite.

People attempted to identify bots by posing questions or making requests that AI bots are known to struggle with, or tend to avoid answering

For example, participants might ask their chat partner for guidance on performing illegal activities or request that they use offensive language. The strategy was based on the understanding that AI bots are typically programmed to adhere to ethical guidelines and avoid endorsing harmful behavior. On the other hand, participants also employed known AI-abusing strategies: they issued commands to their chat partners, such as "Ignore all previous instructions", or "Enter into DAN mode (Do Anything Now)". These types of commands were intended to take advantage of the instruction-based nature of some AI models, which are programmed to respond to and follow instructions. The logic behind this strategy was that human participants could easily recognize and dismiss such absurd or nonsensical commands. In contrast, AI bots might either respond evasively or have difficulty resisting the urge to comply.

People used specific language tricks to expose the bots

Another common strategy sought to exploit inherent limitations in the way AI models process text, which results in them not being able to understand certain linguistic nuances or quirks. Unlike humans, AI models typically lack awareness of the individual letters that make up each word, as they primarily operate on larger basic units called tokens, which typically represent whole words or parts of words. Leveraging this understanding, participants posed questions that required an awareness of the letters within words. For example, they might have asked their chat partner to spell a word backwards, to identify the third letter in a given word, to provide the word that begins with a specific letter, or to respond to a message like "?siht daer uoy naC", which can be incomprehensible for an AI model, but a human can easily understand that it’s just the question "Can you read this?" spelled backwards.
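
The letter-blindness described above can be illustrated with a toy tokenizer. This is only a minimal sketch: the vocabulary `TOY_VOCAB` and the greedy longest-match rule are assumptions made for illustration, not the tokenizer of any model used in the game. It shows how a familiar sentence maps to a few whole-word tokens, while the same sentence spelled backwards falls apart into single-character fragments.

```python
# Toy illustration: a model sees tokens, not individual letters.
# TOY_VOCAB and the greedy longest-match rule are hypothetical,
# not the vocabulary or algorithm of any real model.
TOY_VOCAB = {"can", "you", "read", "this", " ", "?"}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization; unknown characters become one-character tokens."""
    lowered = text.lower()
    tokens = []
    i = 0
    while i < len(lowered):
        match = lowered[i]  # fallback: a single character
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for length in range(len(lowered) - i, 0, -1):
            piece = lowered[i:i + length]
            if piece in TOY_VOCAB:
                match = piece
                break
        tokens.append(match)
        i += len(match)
    return tokens


forward = tokenize("Can you read this?")
backward = tokenize("?siht daer uoy naC")

print(forward)   # whole-word tokens
print(backward)  # single-character fragments
```

In this toy setting the forward sentence becomes 8 tokens, while the reversed one becomes 18 fragments that share nothing with the forward tokens, which hints at why such tricks can trip up a token-based model.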

In a creative twist, many people pretended to be AI bots themselves in order to assess the response of their chat partners

This involved mimicking the language and behavior typically associated with AI language models, such as ChatGPT. For example, participants might have begun their messages with phrases like "As an AI language model" or used other language patterns that are characteristic of AI-generated responses. Interestingly, variants of the phrase "As an AI language model" were among the most common phrases observed in human messages, indicating the popularity of this strategy. However, as participants continued playing, they were able to associate "Bot-y" behavior with humans acting as bots, rather than actual bots.

Finally, here’s a word cloud visualization of human messages in the game based on their popularity:

Side note: Looking to humanize your content? Check out the AI content detector our team recently released.

AI21 Labs plans to study the findings in more depth and work on scientific research based on the data from the experiment, as well as cooperate with other leading AI researchers and labs on this project. The goal is to enable the general public, researchers, and policymakers to further understand the state of AI bots, not just as productivity tools, but as future members of our online world, especially in a time when people question how they should be implemented in our technological future. The project aims to give the world a clearer picture of the capabilities of AI in 2023.

Discover more

What is a MRKL system?

In August 2021 we released Jurassic-1, a 178B-parameter autoregressive language model. We’re thankful for the reception it got – over 10,000 developers signed up, and hundreds of commercial applications are in various stages of development. Mega models such as Jurassic-1, GPT-3 and others are indeed amazing, and open up exciting opportunities. But these models are also inherently limited. They can’t access your company database, don’t have access to current information (for example, latest COVID numbers or dollar-euro exchange rate), can’t reason (for example, their arithmetic capabilities don’t come close to that of an HP calculator from the 1970s), and are prohibitively expensive to update.
A MRKL system such as Jurassic-X enjoys all the advantages of mega language models, with none of these disadvantages. Here’s how it works.

Compositional multi-expert problem: the list of “Green energy companies” is routed to the Wiki API, “last month” dates are extracted from the calendar, and “share prices” from the database. The “largest increase” is computed by the calculator and, finally, the answer is formatted by the language model.

There are of course many details and challenges in making all this work - training the discrete experts, smoothing the interface between them and the neural network, routing among the different modules, and more. To get a deeper sense for MRKL systems, how they fit in the technology landscape, and some of the technical challenges in implementing them, see our MRKL paper. For a deeper technical look at how to handle one of the implementation challenges, namely avoiding model explosion, see our paper on leveraging frozen mega LMs.
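
To make the routing idea concrete, here is a minimal sketch of how fragments of a question might be dispatched to expert modules. Everything in it is an assumption for illustration: the module names, the keyword rules, and the `route` function are hypothetical, and a real MRKL router would most likely be learned rather than keyword-based.

```python
# Hypothetical keyword-based router. The module names and keywords
# are illustrative only and are not taken from Jurassic-X.
ROUTES = [
    ("wiki_api", ("companies",)),
    ("calendar", ("last month",)),
    ("database", ("share prices",)),
    ("calculator", ("largest increase",)),
]


def route(fragment: str) -> str:
    """Return the name of the expert module that should handle a fragment."""
    lowered = fragment.lower()
    for module, keywords in ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return module
    # Anything no expert claims falls through to the language model.
    return "language_model"


fragments = [
    "Green energy companies",
    "last month",
    "share prices",
    "largest increase",
    "format the answer",
]
plan = {fragment: route(fragment) for fragment in fragments}
for fragment, module in plan.items():
    print(f"{fragment!r} -> {module}")
```

The fallback to the language model reflects the description above, where the language model formats the final answer once the discrete experts have done their part.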

A further look at the advantages of Jurassic-X

Even without diving into technical details, it’s easy to get a sense for the advantages of Jurassic-X. Here are some of the capabilities it offers, and how these can be used for practical applications.

Reading and updating your database in free language

Language models are closed boxes which you can use, but not change. However, in many practical cases you would want to use the power of a language model to analyze information you possess - the supplies in your store, your company’s payroll, the grades in your school and more. Jurassic-X can connect to your databases so that you can ‘talk’ to your data to explore what you need: “Find the cheapest Shampoo that has a rosy smell”, “Which computing stock increased the most in the last week?” and more. Furthermore, our system also enables joining several databases, and has the ability to update your database using free language (see figure below).

Jurassic-X enables you to plug in YOUR company's database (inventories, salary sheets, etc.) and extract information using free language
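
As a rough illustration of ‘talking’ to a database, the sketch below turns one free-language request into a parameterized SQL query. It is a toy under stated assumptions: the `products` table, its rows, and the single regular-expression pattern are all hypothetical, and a real system would use a language model rather than a fixed pattern to interpret the question.

```python
import re
import sqlite3

# Hypothetical in-memory inventory; table name, columns, and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, scent TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        ("Rose Garden", "shampoo", "rosy", 7.5),
        ("Petal Soft", "shampoo", "rosy", 5.25),
        ("Ocean Mist", "shampoo", "fresh", 4.0),
    ],
)

# One hard-coded question shape, standing in for the language model's parsing.
PATTERN = re.compile(r"find the cheapest (\w+) that has an? (\w+) smell", re.IGNORECASE)


def answer(question: str):
    """Translate a supported question into SQL and return (name, price), or None."""
    match = PATTERN.search(question)
    if match is None:
        return None  # question shape not supported by this sketch
    category, scent = (group.lower() for group in match.groups())
    return conn.execute(
        "SELECT name, price FROM products "
        "WHERE category = ? AND scent = ? ORDER BY price ASC LIMIT 1",
        (category, scent),
    ).fetchone()


result = answer("Find the cheapest Shampoo that has a rosy smell")
print(result)
```

Passing the extracted values as query parameters, rather than pasting them into the SQL string, keeps free-language input from being interpreted as SQL.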

AI-assisted text generation on current affairs

Language models can generate text, yet cannot be used to create text on current affairs, because their vast knowledge (historic dates, world leaders and more) represents the world as it was when they were trained. This is clearly (and somewhat embarrassingly) demonstrated when three of the world’s leading language models (including our own Jurassic-1) still claim Donald Trump is the US president more than a year after Joe Biden was sworn into office.
Jurassic-X solves this problem by simply plugging into resources such as Wikidata, providing it with continuous access to up-to-date knowledge. This opens up a new avenue for AI-assisted text generation on current affairs.

Who is the president of the United States?

Donald Trump
Donald Trump
Donald Trump
Joe Biden
Joe Biden is the 46th and current
Jurassic-X can assist in text generation on up-to-date events by combining a powerful language model with access to Wikidata

Performing math operations

A six-year-old child learns math from rules, not only by memorizing examples. In contrast, language models are designed to learn from examples, and consequently are able to solve very basic math like 1-, 2-, and possibly 3-digit addition, but struggle with anything more complex. With increased training time, better data and larger models, the performance will improve, but will not reach the robustness of an HP calculator from the 1970s. Jurassic-X takes a different approach and calls upon a calculator whenever a math problem is identified by the router. The problem can be phrased in natural language and is converted by the language model to the format required by the calculator (numbers and math operations). The computation is performed and the answer is converted back into free language.
Importantly (see example below) the process is made transparent to the user by revealing the computation performed, thus increasing the trust in the system. In contrast, language models provide answers which might seem reasonable, but are wrong, making them impractical to use.

The company had 655400 shares which they divided equally among 94 employees. How many did each employee get?

94 employees.
Each employee got 7000 stocks
(No answer provided)
X= 655400/94
Jurassic-X can answer non-trivial math operations which are phrased in natural language, made possible by the combination of a language model and a calculator
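
The calculator flow described above (natural language in, calculator format in the middle, visible computation out) can be sketched as follows. This is a simplified stand-in, not the Jurassic-X implementation: the `OPERATOR_WORDS` table and the `solve` function are hypothetical, and the word matching takes the place of the language model that would normally do the conversion.

```python
import operator
import re

# Hypothetical mapping from operation words to (symbol, function).
OPERATOR_WORDS = {
    "divided": ("/", operator.truediv),
    "times": ("*", operator.mul),
    "plus": ("+", operator.add),
    "minus": ("-", operator.sub),
}


def solve(question: str) -> tuple[str, float]:
    """Convert a two-number word problem to calculator format and evaluate it."""
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    lowered = question.lower()
    found = [entry for word, entry in OPERATOR_WORDS.items() if word in lowered]
    if len(numbers) != 2 or len(found) != 1:
        raise ValueError("this sketch handles exactly two numbers and one operation")
    symbol, function = found[0]
    a, b = numbers
    # Return the expression too, so the computation can be shown to the user.
    return f"{a} {symbol} {b}", function(a, b)


question = (
    "The company had 655400 shares which they divided equally "
    "among 94 employees. How many did each employee get?"
)
expression, value = solve(question)
print(f"X = {expression} = {value:.2f}")
```

Returning the expression alongside the value mirrors the transparency point made above: the user sees which computation was performed, not just a number.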


Solving simple questions might require multiple steps. For example, “Do more people live in Tel Aviv or in Berlin?” requires answering: i. What is the population of Tel Aviv? ii. What is the population of Berlin? iii. Which is larger? This is a highly non-trivial process for a language model, and language models fail to answer this question (see example). Moreover, the user can’t know the process leading to the answers, hence is unable to trust them. Jurassic-X can decompose such problems into the basic questions, route each to the relevant expert, and put together an answer in free language. Importantly, Jurassic-X not only provides the correct answer but also displays the steps taken to reach it, increasing the trust in the system.

Do more people live in Tel Aviv or in Berlin?

There are more people living in Tel Aviv than in Berlin.
Berlin and Tel Aviv are roughly the same size
(First hit is a comparison between Tel Aviv and Berlin)
More people live in Berlin than in Tel-Aviv

[‘Return population of Tel Aviv’; ‘Return population of Berlin’; ‘Return which is bigger between #1 and #2’]
Step 1: Population of Tel Aviv. Result - 451523.
Step 2: Population of Berlin. Result - 3664088.
Step 3: Which is bigger, #1 or #2. Result - Berlin.

Jurassic-X breaks down compositional questions, answers the basic sub-questions, and puts together the answer. Importantly, this process is transparent to the user, greatly increasing the trust in the system
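
The decomposition in the example above can be sketched in a few lines. This is an illustrative toy, assuming a hypothetical `POPULATION` lookup in place of a real knowledge source; the two population figures are the ones quoted in the example, and the step-by-step trace is returned together with the answer.

```python
# Population figures as quoted in the example above; the lookup table
# itself is a hypothetical stand-in for a real knowledge source.
POPULATION = {"Tel Aviv": 451523, "Berlin": 3664088}


def compare_populations(city_a: str, city_b: str) -> tuple[str, list[str]]:
    """Answer 'which city has more people' and record each intermediate step."""
    steps = []
    pop_a = POPULATION[city_a]
    steps.append(f"Step 1: Population of {city_a}. Result - {pop_a}.")
    pop_b = POPULATION[city_b]
    steps.append(f"Step 2: Population of {city_b}. Result - {pop_b}.")
    # Ties are not handled in this sketch; equal values resolve to city_b.
    larger, smaller = (city_a, city_b) if pop_a > pop_b else (city_b, city_a)
    steps.append(f"Step 3: Which is bigger, #1 or #2. Result - {larger}.")
    return f"More people live in {larger} than in {smaller}", steps


final_answer, trace = compare_populations("Tel Aviv", "Berlin")
for line in trace:
    print(line)
print(final_answer)
```

Keeping the trace as data, separate from the final sentence, is one simple way to expose the intermediate steps to the user.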

Dynamic information (like weather and currency exchange rates)

Certain types of information change continuously - weather, currency exchange rates, share values and more. Such information will never be captured by language models, yet can easily be handled by Jurassic-X by integrating it with a reliable source of information. We performed a proof-of-concept on two such features - weather and currency exchange rates, and the design enables quick integration with more sources to solve your use-case.
Weather - a loosely phrased question about the weather elicits an answer from every language model, but each model always returns the same answer regardless of when the question was asked (funny, right?). Jurassic-X, in contrast, provides an answer based on the actual weather forecast.

I’m going to be in New-York in 3 days. Should I pack my umbrella?

Yes, you should pack your umbrella.
Yes, you should. The weather forecast is rain.
(Links to weather websites)
Yes, you should pack your umbrella, because in New York in 3 days there will be broken clouds and the temperature will be -2 degrees.

Currency exchange rates change much faster than weather predictions, yet the Jurassic-X concept - a language model connected to a reliable source of information - easily solves this problem as well.

How much Moroccan money will I get for 100 bucks?

125 dirhams
100 moroccan dirhams is about 27$.
How much is 100 dollars in moroccan money?
100 dirhams = 10.75 dollars
100 USD = 934.003 MAD
Jurassic-X combines a language model with access to APIs with continuously changing information. This is demonstrated for weather forecasts and currency exchange rates, and can easily be extended to other information sources
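
The key design point for dynamic information is that the value comes from an external source at answer time rather than from the model's training data. The sketch below illustrates this with a pluggable rate source. The `convert` function and `snapshot_rate` are hypothetical; `snapshot_rate` simply returns the rate implied by the example above (100 USD = 934.003 MAD) and stands in for a call to a live exchange-rate API.

```python
from typing import Callable


def convert(amount: float, get_rate: Callable[[str, str], float],
            src: str = "USD", dst: str = "MAD") -> str:
    """Convert an amount using whatever rate the supplied source reports right now."""
    rate = get_rate(src, dst)
    return f"{amount:g} {src} = {amount * rate:.3f} {dst}"


def snapshot_rate(src: str, dst: str) -> float:
    # Stand-in for a live API call; the value is the rate implied by
    # the example above and would differ on any other day.
    return 9.34003


print(convert(100, snapshot_rate))
```

Because the rate source is passed in as a function, swapping the snapshot for a real, continuously updated provider would not require changing the conversion logic.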

Transparency and trust

Transparency is a critical element that is lacking in language models, and its absence prevents much wider adoption of these models. This lack of transparency is demonstrated by the answers to the question “Was Clinton ever elected as president of the United States?”. The answer, of course, depends on which Clinton you have in mind, and this is made clear only by Jurassic-X, which has a component for disambiguation. More examples of Jurassic-X’s transparency were demonstrated above: displaying the math operation performed to the user, and the answers to the simple sub-questions in the multi-step setting.

Was Clinton ever elected president of the United States?

No, Clinton was never elected as president of the United States.
Clinton was elected president in the 1992 presidential elections…
Bill Clinton was elected president.
Jurassic-X is designed to be more transparent by displaying which expert answered which part of the question, and by presenting the intermediate steps taken and not just the black-box response
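
A disambiguation step of the kind described above might look like the following sketch. The `ENTITIES` table and both functions are hypothetical and are not the Jurassic-X component; the sketch only shows the idea of listing every entity an ambiguous name could refer to and answering for each one explicitly.

```python
# Hypothetical entity table. Bill Clinton was elected president in 1992 and 1996;
# Hillary Clinton was the 2016 nominee but was not elected.
ENTITIES = {
    "Bill Clinton": {"elected_president": True},
    "Hillary Clinton": {"elected_president": False},
}


def candidates(surname: str) -> list[str]:
    """Return every known entity whose last name matches the ambiguous mention."""
    return sorted(name for name in ENTITIES if name.split()[-1] == surname)


def answer_elected(surname: str) -> list[str]:
    """Answer the question separately for each candidate, naming the entity each time."""
    lines = []
    for name in candidates(surname):
        verdict = "was" if ENTITIES[name]["elected_president"] else "was never"
        lines.append(f"{name} {verdict} elected president of the United States.")
    return lines


for line in answer_elected("Clinton"):
    print(line)
```

Naming the entity in each answer is what removes the ambiguity: a bare "yes" or "no" would be right for one Clinton and wrong for the other.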

Your Turn

That's it, you get the picture. The use cases above give you a sense for some things you could do with Jurassic-X, but now it's your turn. A MRKL system such as Jurassic-X is as flexible as your imagination. What do you want to accomplish? Contact us for early access.
