This study examines the performance of the AI21 Summarize API, powered by a task-specific summarization model, and compares it to general-purpose Large Language Models (LLMs), specifically davinci-003 and gpt-3.5-turbo, available via the OpenAI API. We apply both human evaluation and automatic metrics to evaluate the quality of the summaries generated by the different models.
We use source texts from an established academic benchmark (XSum) as well as proprietary real-world data. The sensitivity of the LLMs to different prompting methods is also investigated. We find that the AI21 Summarize API generally performs better than or on par with both OpenAI LLMs across the different tests. For real-world data, human evaluation shows a preference for the AI21 Summarize API over the OpenAI LLMs, regardless of prompting method. In particular, the AI21 Summarize API exhibits a significantly lower rate of unreliable summaries, that is, summaries containing incorrect information ("hallucinations") and/or a misleading rearrangement of source facts ("reasoning violations"). The AI21 Summarize API also outperforms the OpenAI LLMs on automatic metrics on the same data, irrespective of prompting method.