Exploring AI-Generated Summaries: A Comparative Analysis of Human vs. Language Model Summaries
Reading long forms of text, whether blogs or research papers, can be time-consuming and sometimes even tiring. The abstracts in research papers typically provide an idea of the paper’s content. However, I occasionally ponder whether, if left to my own devices, I would write a different summary. Or, if I were to ask AI models to summarize the papers for me, how different would the result be from the given abstract? Before delving into my curiosity-driven experiment, let’s first explore the two types of automatic summarization techniques: abstractive and extractive.
In abstractive summarization, the model takes in the context of the given text and generates new sentences while preserving the essence of the original. Technically speaking, for an NLP model or a large language model (LLM) to produce a human-like summary, a comprehensive understanding of the context is crucial, and this is where the attention mechanism comes into play. Transformer models built on this mechanism, such as BERT and GPT, are advanced enough to generate abstractive summaries. For further information, you can refer to the blog post titled “Understanding Automatic Text Summarization-2: Abstractive Methods.”
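To make this concrete, here is a minimal sketch of abstractive summarization with a pretrained transformer, assuming the Hugging Face transformers library; it is purely illustrative and not part of my experiment, and the model name and length settings are just example choices.

# Minimal sketch of abstractive summarization with a pretrained transformer.
# Illustrative only: the model name and length limits are example choices.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Abstractive summarization rewrites the source in new words. "
    "Transformer models trained with attention capture enough context "
    "to paraphrase rather than merely copy sentences."
)

result = summarizer(text, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])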
On the other hand, extractive summarization primarily involves extracting essential sentences from the original text without any paraphrasing or natural language generation. The core of this technique lies in how sentences are scored to distinguish important from less important ones. Each sentence receives a score based on various criteria: sentences containing keywords and key phrases score higher, while redundant sentences are penalized to maintain diversity. A sentence’s position in the original text also affects its score, with opening and closing sentences scoring higher. Other factors, such as sentence length and linguistic features, may also be considered. Extractive summarization is typically better suited to generating top highlights from the original text, such as news headlines or personalized ads. BERT, RNNs, and graph-based algorithms are often employed for this type of summarization. For a deeper technical understanding of this technique, check out the paper by Derek Miller.
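To illustrate the scoring idea, here is a toy sketch that scores sentences by keyword frequency plus a position bonus and keeps the top scorers; real extractive systems (BERT-based, RNN-based, or graph-based) are far more sophisticated, and everything in this snippet is illustrative.

# Toy extractive summarizer: score sentences by average word frequency plus a
# small bonus for the first and last positions, then keep the top scorers in
# their original order. Illustrative only.
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    scored = []
    for i, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if not tokens:
            continue
        keyword_score = sum(freq[t] for t in tokens) / len(tokens)
        position_bonus = 1.0 if i in (0, len(sentences) - 1) else 0.0
        scored.append((keyword_score + position_bonus, i, sentence))

    top = sorted(scored, reverse=True)[:num_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda x: x[1]))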
Let’s get into the experiment I conducted to understand the difference between a human-generated summary and one generated by an LLM. In this experiment, I used the text summarization model from Cohere on the research papers from the Association for Computational Linguistics (ACL) repository.
Here are the steps I took to parse the PDF files:
Some preprocessing of the dataset (the research papers) was necessary before Cohere’s summarize endpoint could be used: the model accepts a string as input, so the PDF files first had to be converted into text files.
The second step involved writing a Python script to read the PDF files, parse them with the Python package pdfminer, and save the contents in text format; a sketch of this script follows the steps below.
The objective was to exclude the abstract from each paper and feed the model the content from the “Introduction” through the “Results” section. Sections like Discussion, Conclusion, References, and Appendix were intentionally excluded. The script was hard-coded to match the formatting of ACL papers, tailoring it to this specific project.
The output produced by this script was saved as individual text files for each paper.
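Here is a minimal sketch of that parsing step, assuming the pdfminer.six fork (which provides extract_text); the section markers are illustrative stand-ins for the hard-coded ACL formatting rules in the actual script.

# Sketch of the PDF-parsing step. The section markers below are illustrative;
# the real script was hard-coded to the formatting of ACL papers.
from pdfminer.high_level import extract_text

def paper_body(pdf_path):
    """Return the text from the Introduction up to the Discussion section."""
    text = extract_text(pdf_path)
    start = text.find("Introduction")
    end = text.find("Discussion")  # drop Discussion, Conclusion, References, Appendix
    if start == -1:
        start = 0
    if end == -1:
        end = len(text)
    return text[start:end]

# Each parsed paper was saved as its own text file.
with open("paper.txt", "w", encoding="utf-8") as f:
    f.write(paper_body("paper.pdf"))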
Extractive Summarization:
After preparing the data, the next step was to access Cohere’s API in order to use their summarize model. I followed the instructions in the documentation on Cohere’s website and examined the details of the parameters used in the co.summarize endpoint. The full code can be found in my GitHub repository; a minimal sketch of the call appears below, followed by the output generated with the specified parameter settings.
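The parameter names in this sketch follow Cohere’s documented summarize endpoint at the time, and the values shown are placeholders rather than the exact settings I used.

# Rough sketch of the co.summarize call; values are placeholders, not the
# exact settings used in the experiment.
import cohere

co = cohere.Client("YOUR_API_KEY")

with open("paper.txt", encoding="utf-8") as f:
    paper_text = f.read()

response = co.summarize(
    text=paper_text,
    model="command",
    length="long",           # short | medium | long
    format="paragraph",      # paragraph | bullets
    extractiveness="high",   # how closely the summary sticks to the source text
    temperature=0.3,
)
print(response.summary)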
Input Paper: Generating Datasets with Pretrained Language Models
Abstract from the paper: To obtain high-quality sentence embeddings from pre-trained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, fine-tuning, or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.
Output: We explore unsupervised methods for generating sentence embeddings that mimic the way humans create text pairs. We demonstrate that recent advances in language modeling using attention mechanisms can be applied to this task. To build large datasets, we mimic the creation of unlabeled text examples by crowd workers and replace the human annotators with large language models. We show that the resulting datasets can be used for training and fine-tuning plms for a variety of semantic similarity tasks.
Abstractive Summary:
The output provided above is an extractive summary, so comparing it directly with the abstract may not be entirely fair. Consequently, I turned to the Cohere Playground and prompted the model to generate an abstract using the co.generate method.
A couple of noteworthy aspects to consider: you have the flexibility to adjust the desired word count for your summary and select from various baseline models for execution. In this particular instance, I directed the model to generate an abstractive summary of the research paper. The context I supplied was the preprocessed text obtained from parsing the PDF file. The code used for the model is as follows:
import cohere

co = cohere.Client('')  # This is your trial API key

# Placeholder for the parsed paper text that serves as context for the prompt.
paper_text = 'Content of the research paper'

response = co.generate(
    model='command',
    prompt='Can you generate an abstractive summary from a research paper?\n'
           + paper_text,
    max_tokens=2440,
    temperature=0.9,
    k=0,
    stop_sequences=[],
    return_likelihoods='NONE')
print('Prediction: {}'.format(response.generations[0].text))
In conclusion, the abstractive summary generated from the Cohere Playground is robust, effectively conveying the paper’s subject matter, methodology, dataset, and results. This project was primarily driven by my curiosity to experiment with Cohere’s command model in the Playground and their co.summarize endpoint, to observe the differences between human-generated summaries and those produced by LLMs. In my opinion, human-generated summaries still hold an edge, but the output from Cohere’s model is notably strong. There is still significant room for improvement in this code, including further model fine-tuning and more thorough data preprocessing; currently, the preprocessed file includes content from tables and figures within the paper. To satisfy my ongoing curiosity, I’d also like to explore OpenAI’s models, Google’s Bard, and others for summary generation, and to ask ChatGPT for summaries for a comparative analysis.
P.S.: A special thanks to my friend Alicia for helping me write the PDF parsing code as well as brainstorming ideas for this project.