Fine-Tuning a Summarization Model: A Practical Guide
NLP
Summarization
Transformers
Fine-tuning
Machine Learning
Python
Hugging Face
AI


Learn how to fine-tune a pre-trained language model for text summarization, creating your own custom summarizer.

March 23, 2024
3 minutes


Introduction

Automatic text summarization is a crucial task in Natural Language Processing (NLP). It involves condensing a longer piece of text into a shorter, coherent, and informative summary. While pre-trained models like those from Hugging Face's Transformers library can perform summarization, fine-tuning them on a specific dataset allows you to create a model tailored to your particular needs and writing style.

This guide will walk you through the process of fine-tuning a pre-trained summarization model using the Hugging Face Transformers library and a publicly available dataset. We'll focus on a practical, hands-on approach, providing code examples and explanations along the way. The specific model and datasets are just examples; the focus is on the general workflow, which carries over to other choices. This guide assumes a basic understanding of Python and machine learning concepts.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python 3.7+: A recent version of Python is recommended.
  • pip: The Python package installer.
  • Required Libraries: We'll install these using pip.
    • transformers (Hugging Face Transformers library)
    • datasets (Hugging Face Datasets library)
    • evaluate and rouge_score (for computing ROUGE scores during evaluation)
    • accelerate (for optimized training)
    • sentencepiece (for tokenization)
    • gdown (for downloading from Google Drive)
  • A GPU (Recommended): Fine-tuning large language models is computationally intensive. A GPU (e.g., an NVIDIA GPU with CUDA support) will significantly speed up the training process. Services like Google Colab (with GPU enabled) or Kaggle Kernels provide free access to GPUs. A quick way to confirm that a GPU is visible is shown right after this list.
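
If you want to confirm that a GPU is actually visible before training, the short check below does the trick. It's a minimal sketch that only assumes PyTorch, which is installed as part of transformers[torch] in Step 1.

import torch

# Prints True and the device name when a CUDA-capable GPU is visible to PyTorch.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))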

Step 1: Installing the Necessary Libraries

Let's install the required libraries using pip. It's highly recommended to create a virtual environment for this project to avoid conflicts with other Python projects.

pip install "transformers[torch]" datasets evaluate rouge_score accelerate==0.20.3 sentencepiece gdown

Note:

  • After installing the requirements, restart the current session so the newly installed package versions are picked up.
  • If you are using Colab, change the runtime type to "GPU".

Step 2: Loading the Dataset

We will use two datasets for this demonstration, and you can choose the one that best suits your needs:

  1. SumArabic Dataset: This dataset contains over 80,000 articles, each paired with an extremely short, headline-like summary.
  2. Arabic-article-summarization-30-000: This dataset contains 8,378 articles with their summaries.

First, let's download and load the SumArabic dataset.

# Download data from Drive
import gdown
import zipfile
import os
from datasets import load_dataset

# Ensure we're in the right directory (adjust as needed)
if not os.path.exists("/kaggle/working/content"):
    os.makedirs("/kaggle/working/content", exist_ok=True)

os.chdir("/kaggle/working")

# Download the dataset
!gdown 18hoo7Tql8NRMjLvabWYgigkfDrqds4m1

# Unzip the dataset
with zipfile.ZipFile("/kaggle/working/SumArabic.zip", 'r') as zip_ref:
    zip_ref.extractall("/kaggle/working/content")

print("=========================")
!du -sh /kaggle/working/content/SumArabic

data_dir = "/kaggle/working/content/SumArabic"

sumArabic = load_dataset(
    "json",
    data_files={
        "train": f"{data_dir}/sumarabic-1.0-train.jsonl",
        "validation": f"{data_dir}/sumarabic-1.0-valid.jsonl",
        "test": f"{data_dir}/sumarabic-1.0-test.jsonl",
    },
)

print(sumArabic.keys())
print(sumArabic["train"].num_rows)
print(sumArabic["validation"].num_rows)
print(sumArabic["test"].num_rows)

# Print the first example in the training set
print(sumArabic["train"][0])

Now, let's load the Arabic-article-summarization-30-000 dataset:

# Load ar_article_sum
from datasets import load_dataset

ar_article_sum = load_dataset("Abdelkareem/Arabic-article-summarization-30-000")

print(ar_article_sum["train"][0])

Choose one of these datasets for the rest of the tutorial. The code will assume you've stored your chosen dataset in a variable called dataset. For example:

# Choose either SumArabic or ar_article_sum
# dataset = sumArabic
dataset = ar_article_sum

Step 3: Loading the Pre-trained Model and Tokenizer

We'll use the UBC-NLP/AraT5v2-base-1024 model, a T5-based model pre-trained on a large Arabic corpus. It's suitable for various text generation tasks, including summarization. We'll load both the model and its corresponding tokenizer.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model_name = "UBC-NLP/AraT5v2-base-1024"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

The DataCollatorForSeq2Seq is a crucial component. It batches input sequences and labels and dynamically pads each batch so that all of its examples have the same length, which sequence-to-sequence models require. Label padding uses -100 so that padded positions are ignored by the loss.
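
To see concretely what the collator produces, here's a minimal sketch on two toy examples with made-up token IDs (it assumes the tokenizer and data_collator defined above):

features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [8, 9]},
    {"input_ids": [5, 6, 7, 10, 11], "attention_mask": [1, 1, 1, 1, 1], "labels": [8, 9, 12, 13]},
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # Both examples padded to the length of the longer one
print(batch["labels"])           # The shorter label sequence is padded with -100, which the loss ignores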

Step 4: Preprocessing the Data

Before we can train the model, we need to preprocess the data. This involves:

  1. Adding a prefix: We add a prefix like "summarize: " to the input text. This helps the model understand that it should perform a summarization task.
  2. Tokenizing: Converting the text (both input and summary) into numerical representations (tokens) that the model can understand. We use the tokenizer associated with our pre-trained model.
  3. Truncating: Limiting the length of the input and output sequences to a maximum length. This is necessary because transformers have a limited input context size.

Here's the preprocessing function, adapted for both datasets. We define a separate function for each dataset because their column names differ.

prefix = "لخص :"  # "summarize: " in Arabic

def process_function_ar_article_sum(examples):
    inputs = [prefix + doc for doc in examples["Processed Text"]]  # Use the "Processed Text" column
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["summarizer"], max_length=220, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def process_function_sumArabic(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["headline"], max_length=128, truncation=True)  # Use the "headline" column
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Choose the appropriate preprocessing function based on your chosen dataset.
if dataset is ar_article_sum:
    process_function = process_function_ar_article_sum
elif dataset is sumArabic:
    process_function = process_function_sumArabic
else:
    raise ValueError("Invalid dataset choice.")

tokenized_dataset = dataset.map(process_function, batched=True)

We use the .map() function to apply the preprocessing function to all examples in the dataset efficiently. The batched=True argument speeds up the process by applying the function to batches of examples.
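
As an optional sanity check, you can inspect the result of the mapping. The snippet below just prints the columns and the token counts of the first training example (the Trainer later drops the columns the model doesn't use):

print(tokenized_dataset["train"].column_names)
print(len(tokenized_dataset["train"][0]["input_ids"]))  # Number of input tokens for the first example
print(len(tokenized_dataset["train"][0]["labels"]))     # Number of summary tokens for the first example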

Step 5: Defining the Evaluation Metric (ROUGE)

We'll use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric to evaluate the quality of the generated summaries. ROUGE compares the generated summary to a reference summary (the ground truth) and measures the overlap of n-grams (sequences of words).

import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

This function decodes the model's output, calculates ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L), and computes the average generation length.
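
If you'd like to see the metric in isolation before training, a quick sanity check on toy strings looks like this (the scores are purely illustrative; note that the stemmer is English-specific, so it has little effect on Arabic text):

toy_scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat is on the mat"],
    use_stemmer=True,
)
print(toy_scores)  # A dict with rouge1, rouge2, rougeL, and rougeLsum scores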

Step 6: Setting up the Training Arguments

The Seq2SeqTrainingArguments class from the Transformers library allows us to configure various aspects of the training process.

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="my_summarizer",      # Directory to save the fine-tuned model
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # Batch size per GPU for training
    per_device_eval_batch_size=2,    # Batch size per GPU for evaluation
    weight_decay=0.001,              # Weight decay for regularization
    save_total_limit=3,              # Limit the number of saved checkpoints
    num_train_epochs=3,              # Number of training epochs
    predict_with_generate=True,      # Generate summaries during evaluation
    # fp16=True,                     # Use mixed-precision training (if your GPU supports it)
    push_to_hub=False,               # Set to True to push the model to the Hugging Face Hub
)

Key arguments:

  • output_dir: Where to save the fine-tuned model.
  • evaluation_strategy: How often to evaluate the model (here, after each epoch).
  • learning_rate: The learning rate for the optimizer.
  • per_device_train_batch_size and per_device_eval_batch_size: Batch sizes. Adjust these based on your GPU memory.
  • num_train_epochs: The number of times to iterate over the entire training dataset.
  • predict_with_generate: Enables summary generation during evaluation, which is necessary for ROUGE calculation.
  • fp16=True: Enables mixed-precision training, which can significantly speed up training and reduce memory usage if your GPU supports it (e.g., NVIDIA GPUs with Tensor Cores). It is commented out above; uncomment it only if you have a compatible GPU.
  • push_to_hub: If set to True, the model and tokenizer will be automatically pushed to your Hugging Face Hub account after training. You'll need to be logged in first; a minimal login sketch follows this list.
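
Logging in is only required if you enable push_to_hub. A minimal sketch, assuming you have a Hugging Face access token:

from huggingface_hub import login

login()  # Prompts for your access token; you can also pass it directly, e.g. login(token="hf_...")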

Step 7: Creating the Trainer and Training

We now create a Seq2SeqTrainer instance, passing in the model, training arguments, datasets, tokenizer, data collator, and evaluation function. Then, we start the training process.

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],  # Use the test split for evaluation ("validation" if your dataset has no test split)
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

This is the most time-consuming part. The training time will depend on the dataset size, the model size, the batch size, and the number of epochs. With a GPU, it can take anywhere from a few minutes to several hours. Without a GPU, it could take significantly longer.
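
One practical note: checkpoints land in output_dir as training progresses, so if a long run is interrupted you can usually pick up where you left off. A minimal sketch (it assumes at least one checkpoint has already been saved):

# Resume training from the most recent checkpoint in output_dir.
trainer.train(resume_from_checkpoint=True)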

Step 8: Evaluating the Model (Optional)

Although we evaluate during training, you can evaluate the final model on a separate test set (if you have one) or re-evaluate on the validation set:

# Optional: Evaluate after training is complete
results = trainer.evaluate()
print(results)

Step 9: Using the Fine-Tuned Model

Once the model is fine-tuned, you can use it to generate summaries:

from transformers import pipeline

summarizer = pipeline("summarization", model="my_summarizer")  # Or the path to your saved model

text = """
ضع هنا النص الذي تريد تلخيصه.
"""  # Placeholder: "Put the text you want to summarize here."

# The model was fine-tuned with the "لخص :" prefix, so prepend it at inference time as well.
summary = summarizer(prefix + text, max_length=130, min_length=30, do_sample=False)

print(summary[0]['summary_text'])

Replace "my_summarizer" with the path to your saved model directory if you didn't push it to the Hugging Face Hub.

Step 10: Saving and Loading (Local)

During training, the trainer automatically saves checkpoints to output_dir (keeping at most save_total_limit of them); if you also set load_best_model_at_end=True, it will reload the best checkpoint according to the evaluation metric when training finishes. You can also explicitly save the model and tokenizer:

# Save the model and tokenizer
trainer.save_model("my_summarizer")
tokenizer.save_pretrained("my_summarizer")

To load the model later:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("my_summarizer")
tokenizer = AutoTokenizer.from_pretrained("my_summarizer")

Conclusion

This guide provides a comprehensive overview of fine-tuning a pre-trained summarization model. By following these steps, you can create a custom summarizer tailored to your specific needs. Remember to experiment with different hyperparameters (learning rate, batch size, number of epochs) and datasets to achieve the best results. Using a GPU is highly recommended for faster training. The Hugging Face Transformers library makes this process relatively straightforward, even for those with limited machine-learning experience. ✅✨
