Your TLDR by an ai: a Deep Reinforced Model for Abstractive Summarization

The last few decades have witnessed a fundamental change in the challenge of taking in new information. The bottleneck is no longer access to information; now it’s our ability to keep up. We all have to read more and more to keep up-to-date with our jobs, the news, and social media. We’ve looked at how AI can improve people’s work by helping with this information deluge and one potential answer is to have algorithms automatically summarize longer texts.

Training a model that can generate long, coherent, and meaningful summaries remains an open research problem. In fact, generating any kind of longer text is hard for even the most advanced deep learning algorithms. In order to make summarization successful, we introduce two separate improvements: a more contextual word generation model and a new way of training summarization models via reinforcement learning (RL).

Combining these two improvements enables the system to create relevant and highly readable multi-sentence summaries of long texts, such as news articles, significantly improving on previous results. Our algorithm can be trained on a variety of text types and summary lengths. In this blog post, we present the main contributions of our model and an overview of the natural language challenges specific to text summarization.

Figure 1: Illustration of our model generating a multi-sentence summary from a news article. For each generated word, the model pays attention to specific words of the input and the previously generated output.

Extractive vs. abstractive summarization

Automatic summarization models can work in one of two ways: by extraction or by abstraction. Extractive models perform "copy-and-paste" operations: they select relevant phrases of the input document and concatenate them to form a summary. They are quite robust since they use existing natural-language phrases taken straight from the input, but they lack flexibility since they cannot use novel words or connectors. They also cannot paraphrase like people sometimes do. In contrast, abstractive models generate a summary based on the actual “abstracted” content: they can use words that were not in the original input. This gives them a lot more potential to produce fluent and coherent summaries, but it is also a much harder problem, since the model must now generate coherent phrases and connectors on its own.

Even though abstractive models are more powerful in theory, it is common for them to make mistakes in practice. Typical mistakes include incoherent, irrelevant or repeated phrases in generated summaries, especially when trying to create long text outputs. They historically lacked a sense of general coherence, flow and readability. In this work, we tackle these issues and design a more robust and coherent abstractive summarization model.

In order to understand our new abstractive model, let’s first define the basic building blocks and then introduce our new training scheme.

Reading and generating text with encoder-decoder models

Recurrent neural networks (RNNs) are deep learning models that can process sequences (e.g. text) of variable length and compute useful representations (or hidden state) for each phrase. These networks process each element of the sequence (in this case, each word) one by one; for each new input in the sequence, the network outputs a new hidden state as a function of that input and the previous hidden state. In this sense, the hidden state calculated at each word is a function of all the words read up to that point.

Figure 2: A recurrent neural network reads an input sentence by applying the same function (in green) on individual words.
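To make the recurrence concrete, here is a minimal sketch of a plain RNN reading a sequence, written with the Python standard library only. The tanh update, the weight names (W_x, W_h, b) and the toy values are assumptions for illustration; they are not the architecture or parameters used in our model.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: h = tanh(W_x x + W_h h_prev + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def encode(inputs, W_x, W_h, b):
    """Apply the same step function to every input vector, left to right."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_x, W_h, b)
        states.append(h)
    return states

# Toy 2-dimensional word vectors and weights (all values are illustrative).
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = encode(sentence, W_x, W_h, b)
```

Because each state is computed from the previous one, changing the first word of the toy sentence changes every later hidden state, including the last.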

RNNs can also be used to generate output sequences in a similar fashion. At each step, the RNN hidden state is used to generate a new word that is added to the final output text and fed in as the next input.

Figure 3: RNNs can generate output sequences, and re-use the output word as the input of the next function.
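The feedback loop described above can be sketched as follows. The step function here is a hypothetical stand-in for a trained decoder (a fixed lookup table), and picking the highest-scoring word at each step is an assumption made for simplicity; real systems often use other search strategies.

```python
def generate(step_fn, h0, start_token, end_token, max_len=20):
    """Greedy generation: each emitted word becomes the next input."""
    h, word, output = h0, start_token, []
    for _ in range(max_len):
        h, scores = step_fn(word, h)          # new hidden state + score per word
        word = max(scores, key=scores.get)    # pick the highest-scoring word
        if word == end_token:
            break
        output.append(word)
    return output

# Hypothetical stand-in for a trained decoder: the "hidden state" is just a
# step counter, and the scores follow a fixed lookup table.
TABLE = {
    "<s>": {"the": 0.9, "cat": 0.1, "</s>": 0.0},
    "the": {"the": 0.1, "cat": 0.8, "</s>": 0.1},
    "cat": {"the": 0.1, "cat": 0.1, "</s>": 0.8},
}

def toy_step(word, h):
    return h + 1, TABLE[word]

summary = generate(toy_step, 0, "<s>", "</s>")
print(summary)
```

With this toy table the loop emits "the", then "cat", then stops when the end token scores highest.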

The input (reading) and output (generating) RNNs can be combined in a joint model where the final hidden state of the input RNN is used as the initial hidden state of the output RNN. Combined in this way, the joint model is able to read any text and generate a different text from it. This framework is called an encoder-decoder RNN (or Seq2Seq) and is the basis of our summarization model. In addition, we replace the traditional encoder RNN by a bidirectional encoder, which uses two different RNNs to read the input sequence: one that reads the text from left-to-right (as illustrated in Figure 4) and another that reads from right-to-left. This helps our model to have a better representation of the input context.

Figure 4: Encoder-decoder RNN models can be used to solve sequence-to-sequence tasks in natural language such as summarization.
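As a rough illustration of the bidirectional encoder, the sketch below runs one toy RNN left-to-right and another right-to-left, then pairs the two states at each position. The scalar inputs, the tanh update and the weight values are assumptions chosen to keep the example short.

```python
import math

def run_rnn(xs, w_x, w_h):
    """Scalar toy RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_encode(xs, fwd=(0.7, 0.5), bwd=(0.4, 0.9)):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]   # re-align to input order
    return [(f, b) for f, b in zip(forward, backward)]

inputs = [1.0, -0.5, 0.25]          # stand-ins for word embeddings
encoded = bidirectional_encode(inputs)
```

The forward state at the first position has only seen the first input, while the backward state at that position has seen the whole sequence, which is why the pair gives a fuller picture of the context.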

A new attention and decoding mechanism

To make our model outputs more coherent, we allow the decoder to look back at parts of the input document when generating a new word with a technique called temporal attention. Instead of relying entirely on its own hidden state, the decoder can incorporate contextual information about different parts of the input with an attention function. This attention is then modulated to ensure that the model uses different parts of the input when generating the output text, hence increasing information coverage of the summary.
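One way to picture the modulation step is the sketch below: the exponentiated score of each input position is divided by the sum of the exponentiated scores that position received at earlier decoding steps, then the results are normalized. This is a simplified reading of the idea, not a definitive implementation; the raw scores are taken as given, and how they are computed from the hidden states is left out.

```python
import math

def temporal_attention(scores_so_far, encoder_states):
    """scores_so_far[t][i]: raw score of input position i at decoding step t.
    Returns attention weights and context vector for the latest step."""
    current = scores_so_far[-1]
    n = len(current)
    modulated = []
    for i in range(n):
        e = math.exp(current[i])
        # Divide by the exponentiated scores this position received earlier,
        # so inputs that were already attended to are down-weighted.
        past = sum(math.exp(prev[i]) for prev in scores_so_far[:-1])
        modulated.append(e / past if past > 0 else e)
    total = sum(modulated)
    weights = [m / total for m in modulated]
    dim = len(encoder_states[0])
    context = [sum(weights[i] * encoder_states[i][d] for i in range(n))
               for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0]]   # toy encoder hidden states
w1, c1 = temporal_attention([[2.0, 0.0]], encoder_states)
w2, c2 = temporal_attention([[2.0, 0.0], [2.0, 0.0]], encoder_states)
```

In this toy run the first input position dominates at the first step. At the second step the raw scores are identical, but the earlier attention cancels the advantage and both positions end up with equal weight.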

In addition, to make sure that our model doesn't repeat itself, we define an intra-decoder attention function that lets it look back at the previous hidden states of the decoder RNN. Finally, the decoder combines the context vector from the temporal attention with the one from the intra-decoder attention to generate the next word in the output summary. Figure 5 illustrates the combination of these two attention functions at a given decoding step.

Figure 5: Two context vectors (marked “C”) are computed from attending over the encoder hidden states and decoder hidden states. Using these two contexts and the current decoder hidden state (“H”), a new word is generated (on the right) and added to the output sequence.
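The combination in Figure 5 can be sketched as below. Scoring previous decoder states with a plain dot product and returning a zero vector at the first step are assumptions made for illustration, as is the simple concatenation of the three vectors; the function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def intra_decoder_context(h_t, previous_states):
    """Attend over earlier decoder states; zero vector at the first step."""
    if not previous_states:
        return [0.0] * len(h_t)
    weights = softmax([dot(h_t, h_prev) for h_prev in previous_states])
    return [sum(w * h[d] for w, h in zip(weights, previous_states))
            for d in range(len(h_t))]

def decoder_features(h_t, encoder_context, previous_states):
    """Concatenate H with the two context vectors C from Figure 5."""
    return h_t + encoder_context + intra_decoder_context(h_t, previous_states)

h_t = [1.0, 0.0]                      # current decoder hidden state (toy)
encoder_context = [0.2, 0.8]          # context from attention over the input
first = decoder_features(h_t, encoder_context, [])
later = decoder_features(h_t, encoder_context, [[1.0, 0.0], [0.0, 1.0]])
```

A final layer, not shown here, would map this combined vector to a score for every word in the vocabulary.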

Supervised learning vs. reinforcement learning

To train this model on real-world data like news articles, a common approach is the teacher forcing algorithm: the model generates a summary while being fed the words of a reference summary, and it is assigned a word-by-word error (or “local supervision”, as shown in Figure 6) each time it generates a new word.

Figure 6: Model training with supervised learning. Each generated word gets a training supervision signal, calculated by comparing it against the ground truth summary word at the same position.
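The word-by-word error is typically a negative log-likelihood. A minimal sketch, assuming the model's predicted distribution at each position is already available as a dictionary (the example words and probabilities are made up):

```python
import math

def teacher_forcing_loss(step_distributions, reference):
    """Sum of per-word negative log-likelihoods of the reference summary.
    step_distributions[t] maps each vocabulary word to its predicted
    probability at position t, given the reference words before t."""
    loss = 0.0
    for dist, target in zip(step_distributions, reference):
        loss -= math.log(dist[target])
    return loss

reference = ["police", "arrest", "suspect"]
predictions = [
    {"police": 0.7, "arrest": 0.2, "suspect": 0.1},
    {"police": 0.1, "arrest": 0.6, "suspect": 0.3},
    {"police": 0.25, "arrest": 0.25, "suspect": 0.5},
]
loss = teacher_forcing_loss(predictions, reference)
```

The loss reaches zero only when the model puts all of its probability on the reference word at every position, which is exactly why a valid but differently worded summary is penalized.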

This method can be used to train any sequence generation model based on recurrent neural networks, with very decent results. However, for our particular task, summaries don't have to match a reference sequence word by word in order to be correct. As you can imagine, two humans may generate very different summaries of the same news article, sometimes using different styles, words or sentence orders, while still being considered good summaries. The problem with teacher forcing here is that as soon as the first few words are generated, the training is misguided: it sticks strictly to the one officially correct summary and cannot adjust to a potentially correct but different beginning.

Taking this into consideration, we can do better than the word-by-word approach of teacher forcing. A different kind of training called reinforcement learning (RL) can be applied here. First, the RL algorithm lets the model generate its own summary; then it uses an external scorer to compare the generated summary against the ground truth. This scorer indicates to the model how "good" the generated summary was. If the score is high, the model updates itself to make such summaries more likely to appear in the future. If the score is low, the model is penalized and changes its generation procedure to make similar summaries less likely. This reinforced model is very good at increasing a summarization score that evaluates the entire sequence rather than a word-by-word prediction.

Figure 7: In reinforcement learning, the model doesn’t have a local supervision signal for every predicted word, but instead is trained with a reward signal that depends on the entire output and the reference summary.
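A REINFORCE-style loss captures this behavior in a few lines. The sketch below assumes a baseline score is subtracted from the sampled summary's score (for example, the score of the model's own greedy output); that baseline, the function name and the numbers are illustrative assumptions rather than a description of our exact training code.

```python
def policy_gradient_loss(sampled_log_probs, sampled_reward, baseline_reward):
    """REINFORCE-style loss for one sampled summary.
    Minimizing it raises the log-probability of samples that score above
    the baseline and lowers it for samples that score below."""
    advantage = sampled_reward - baseline_reward
    return -advantage * sum(sampled_log_probs)

log_probs = [-0.5, -1.0, -0.2]       # log-probability of each sampled word
better = policy_gradient_loss(log_probs, sampled_reward=0.45, baseline_reward=0.40)
worse = policy_gradient_loss(log_probs, sampled_reward=0.30, baseline_reward=0.40)
```

Note that the reward depends on the whole summary, so every word in the sample receives the same signal.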

How to evaluate summarization

What exactly is this scorer, and how does it tell if summaries are "good"? Since asking humans to manually evaluate millions of summaries is slow and impractical at scale, we rely on an automated evaluation metric called ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE works by matching sub-phrases in the generated summaries against sub-phrases in the ground truth reference summaries, even if they are not perfectly aligned. The variants of ROUGE work in the same fashion but match different units: ROUGE-1 counts overlapping single words, ROUGE-2 counts overlapping word pairs, and ROUGE-L uses the longest common subsequence.
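To give a feel for what the scorer measures, here is a simplified sketch of the recall side of ROUGE-N and of the longest common subsequence behind ROUGE-L. It skips stemming, multiple references and the precision/F-score variants that full ROUGE implementations provide, so it is not a substitute for the official scorer.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def lcs_length(a, b):
    """Length of the longest common subsequence, the basis of ROUGE-L."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
r1 = rouge_n_recall(candidate, reference, 1)   # 5 of 6 reference words matched
r2 = rouge_n_recall(candidate, reference, 2)   # 3 of 5 reference bigrams matched
lcs = lcs_length(candidate, reference)         # "the cat on the mat" -> 5
```

A single changed word costs one unigram match but two bigram matches, which is why ROUGE-2 is the stricter of the two.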

While ROUGE scores have a good correlation with human judgment in general, the summaries with the highest ROUGE aren't necessarily the most readable or natural ones. This became an issue when we trained our model to maximize the ROUGE score with reinforcement learning alone. We observed that our models with the highest ROUGE scores also generated barely-readable summaries.

To get the best of both worlds, our model is trained with teacher forcing and reinforcement learning at the same time, using both word-level and whole-summary-level supervision. In particular, we find that ROUGE-optimized RL helps improve recall (i.e., all the important information that needs to be summarized is indeed summarized), while word-level supervision ensures good language flow, making the summary more coherent and readable.

Figure 8: Combination of supervised learning (in red) and reinforcement learning (in purple), showing how our model can learn both local and global rewards and optimize both for readability and overall ROUGE score.
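The joint objective can be thought of as a weighted sum of the two losses. In the sketch below, the weight gamma and the function name are hypothetical; how the balance is set in practice is a tuning decision not covered here.

```python
def mixed_loss(rl_loss, ml_loss, gamma):
    """Weighted combination of the whole-summary RL loss and the
    word-level teacher forcing loss; gamma in [0, 1] sets the balance."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    return gamma * rl_loss + (1.0 - gamma) * ml_loss

only_supervised = mixed_loss(rl_loss=0.2, ml_loss=1.5, gamma=0.0)
only_rl = mixed_loss(rl_loss=0.2, ml_loss=1.5, gamma=1.0)
balanced = mixed_loss(rl_loss=0.2, ml_loss=1.5, gamma=0.5)
```

Setting gamma to 0 recovers pure teacher forcing and setting it to 1 recovers pure RL; values in between trade readability against the whole-summary score.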

Until recently, the highest ROUGE-1 score for abstractive summarization on the CNN/Daily Mail dataset was 35.46. Our intra-decoder attention RNN model improves this score to 39.87 with joint supervised and RL training, and to 41.16 with RL only. Figure 9 shows other summarization scores for existing models and ours. Even though our pure RL model has higher ROUGE scores, our supervised+RL model produces more readable summaries and is therefore better suited to this summarization task. Note that See et al. use a slightly different data format, so their results (marked with an asterisk) are not directly comparable with ours and the others, but they still give a good reference point.

Model                                          ROUGE-1   ROUGE-L
Nallapati et al. 2016 (abstractive)            35.46     32.65
Nallapati et al. 2017 (extractive baseline)    39.2      35.5
Nallapati et al. 2017 (extractive)             39.6      35.3
See et al. 2017 (abstractive)                  39.53*    36.38*
Our model (RL only)                            41.16     39.08
Our model (supervised + RL)                    39.87     36.90

Figure 9: Summarization results on the CNN/Daily Mail dataset, comparing our model against existing abstractive and extractive approaches.

Sample outputs

What does such a large improvement mean in terms of real summaries? Here we show a couple of multi-sentence summaries based on documents from the development split of the dataset. Our model and its simpler baselines generated these, after training on the CNN/Daily Mail dataset. As you can see, the summaries have significantly improved but there’s still more work needed to make them perfect.

Article
Google Wallet says it has changed its policy when storing users' funds as they will now be federally-insured (file photo) For those who use Google Wallet, their money just became safer with federal-level insurance. Google confirmed to Yahoo Finance in a statement that its current policy changed - meaning the company will store the balances for users of the mobile transfer service (similar to PayPal and Venmo) in multiple federally-insured banking institutions. This is good news for people who place large amounts of money in their Wallet Balance because the Federal Deposit Insurance Corporation insures funds for banking institutions up to $250,000. Currently, Google's user agreement says funds are not protected by the FDIC. However, a Google spokesperson told Yahoo Finance that the current policy has changed. (...)
Summary (ground truth)
As a non-banking institution, Google Wallet, along with competitors PayPal and Venmo, is not legally required to be federally insured. With the new change to its policy, funds in wallet balance are protected if anything were to happen to the company like bankruptcy.
Summary (our model)
(...) the mobile transfer service (similar to PayPal and Venmo) in multiple federally-insured banking institutions. Google's user agreement says funds are not protected by the federal deposit insurance corporation.

Article
Talk about a chain reaction! This is the moment a billiards player performs a complex trick shot by setting up a domino train to pot four balls. Video footage shows a white ball being rolled down a positioned cue. It then bounces off one side of the red-clothed table and hits the first in a long line of dominoes. One by one the small counters fall down, tapping balls into various pockets as they go. First a yellow, then a blue, then a red. Finally, the last domino gently hits an orange ball, causing it to roll down another positioned cue lying on the table. The orb then knocks a green ball into the center pocket. In less than 30 seconds the stunt comes to a close. (...)
Summary (ground truth)
(...) four balls in a single shot-and for those who miss it there's a slow motion version.
Summary (our model)
Video footage shows a white ball being rolled down a jumper. It then bounces off one side of the red-clothed table and hits the first in a long line of dominoes. One by one the small counters fall down, tapping balls into pockets as they go-first a yellow. It comes to a close. The clip was uploaded by youtube user honda4ridered.

Article
Kelly Osbourne didn't always want to grow up to be like her famous mom - but in a letter published in the new book A Letter to My Mom, the TV personality admitted that she is now proud to be Sharon Osbourne's daughter. For author Lisa Erspamer's third collection of tributes, celebrities such as Melissa Rivers, Shania Twain, will.i.am, Christy Turlington Burns, and Kristin Chenoweth all composed messages of love and gratitude to the women who raised them. And the heartwarming epistolary book, which was published last week, has arrived just in time for Mother's Day on May 10. 'Like all teenage girls I had this ridiculous fear of growing up and becoming just like you,' Kelly Osbourne wrote in her letter, republished on Yahoo Parenting. 'I was so ignorant and adamant about creating my "own" identity.' Scroll down for video Mini-me: In Lisa Erspamer's new book A Letter to My Mom, Kelly Osbourne (R) wrote a letter to her mother Sharon (L) saying that she's happy to have grown up to be just like her (...)
Summary (ground truth)
Author Lisa Erspamer invited celebrities and a number of other people to write heartfelt notes to their mothers for her new book a letter to my mom. Stars such as Melissa Rivers, will.i.am, and Christy Turlington participated in the moving project.
Summary (our model)
Kelly didn't always want to grow up to be like her famous mom. Lisa Erspamer's third collection of tributes, celebrities such as Melissa rivers, Shania Twain, will.i.am, Christy Turlington, and Kristin Chenoweth all composed messages of love and gratitude to the women who raised them. Kelly wrote a letter to her mom before Joan's death last year. She has arrived just in time for Mother's Day on May 10.

Figure 10: More examples of summaries generated by our model, compared against summaries written by humans for the same article.

In order to illustrate the impact of our main contributions on text summarization, Figure 11 shows how the output of our model diverges if we remove intra-attention and RL training.

Article
Tony Blair has said he does not want to retire until he is 91 – as he unveiled plans to set up a ‘cadre’ of ex-leaders to advise governments around the world. The defiant 61-year-old former Prime Minister said he had ‘decades’ still in him and joked that he would ‘turn to drink’ if he ever stepped down from his multitude of global roles. He told Newsweek magazine that his latest ambition was to recruit former heads of government to go round the world to advise presidents and prime ministers on how to run their countries. In an interview with the magazine Newsweek Mr Blair said he did not want to retire until he was 91 years old Mr Blair said his latest ambition is to recruit former heads of government to advise presidents and prime ministers on how to run their countries Mr Blair said he himself had been ‘mentored’ by US president Bill Clinton when he took office in 1997. And he said he wanted to build up his organisations, such as his Faith Foundation, so they are ‘capable of changing global policy’. Last night, Tory MPs expressed horror at the prospect of Mr Blair remaining in public life for another 30 years. Andrew Bridgen said: ‘We all know weak Ed Miliband’s called on Tony to give his flailing campaign a boost, but the attention’s clearly gone to his head.’ (...)
Summary (ground truth, written by a human)
The former Prime Minister claimed he has 'decades' of work left in him. Joked he would 'turn to drink' if he ever stepped down from global roles. Wants to recruit former government heads to advise current leaders. He was 'mentored' by US president Bill Clinton when he started in 1997.
Summary (our model)
Blair said he did not want to retire until he was 91 years old. 61-year-old former prime minister said he would 'turn to drink' if he ever stepped down from his own. He said he wanted to build up his charity to advise presidents and prime ministers on how to run their countries. Mr Blair says he is to recruit former heads of government to go round the world to advise ministers. He says he wants to emulate ex-Israeli president Shimon Peres.
Summary (without intra-attention and reinforcement learning)
61-year-old former prime minister said he did not want to retire until he was 91 years old. He said he wanted to build up his organisations, such as his Faith Foundation. He said he wanted to emulate ex-Israeli president Shimon Peres. Mr Blair said he wanted to emulate ex-Israeli President Shimon Peres. He said he wanted to be seeing someone when he took office in 1997. Mr Blair said he wanted to be seeing someone when he took office in 1997. Mr Blair said he wanted to
Figure 11: Example summaries generated by our model, with and without our main contributions. New words that are not present in the original article are shown in green. Repeated phrases in the summaries are shown in red.

Conclusion

Our model significantly improves the state of the art in multi-sentence summary generation, outperforming existing abstractive models and extractive baselines. We believe that our contributions - the intra-decoder attention module and the combined training objective - could improve other sequence generation tasks, especially for long outputs.

Our work also touches on the limits of automatic evaluation metrics such as ROUGE, and shows that better metrics are required to evaluate - and optimize - summarization models. An ideal metric would correlate well with human judgment in terms of both summary coherence and readability. When such a metric is used with our reinforced summarization model, summaries may improve even further.

Citation credit

Romain Paulus, Caiming Xiong, and Richard Socher. 2017.
A Deep Reinforced Model for Abstractive Summarization

Acknowledgements

Special thanks to Melvin Gruesbeck for his help with visuals and figures.