Fine-Tuning a Summarization Model: A Practical Guide ✨
Introduction
Automatic text summarization is a crucial task in Natural Language Processing (NLP). It involves condensing a longer piece of text into a shorter, coherent, and informative summary. While pre-trained models like those from Hugging Face's Transformers library can perform summarization, fine-tuning them on a specific dataset allows you to create a model tailored to your particular needs and writing style.
This guide will walk you through the process of fine-tuning a pre-trained summarization model using the Hugging Face Transformers library and a publicly available dataset. We'll focus on a practical, hands-on approach, providing code examples and explanations along the way. Although the examples use specific models and datasets, the emphasis is on the general process, which you can apply to your own data. This guide assumes a basic understanding of Python and machine learning concepts.
Prerequisites
Before we begin, make sure you have the following installed:
- Python 3.7+: A recent version of Python is recommended.
- pip: The Python package installer.
- Required Libraries: We'll install these using pip.
  - transformers (Hugging Face Transformers library)
  - datasets (Hugging Face Datasets library)
  - evaluate (Hugging Face Evaluate library, used for metrics)
  - rouge_score (for ROUGE evaluation)
  - accelerate (for optimized training)
  - sentencepiece (for tokenization)
  - gdown (for downloading files from Google Drive)
- A GPU (Recommended): Fine-tuning large language models is computationally intensive. A GPU (e.g., NVIDIA GPU with CUDA support) will significantly speed up the training process. Services like Google Colab (with GPU enabled) or Kaggle Kernels provide free access to GPUs.
Step 1: Installing the Necessary Libraries
Let's install the required libraries using pip. It's highly recommended to create a virtual environment for this project to avoid conflicts with other Python projects.
```bash
pip install transformers[torch] datasets evaluate rouge_score accelerate==0.20.3 sentencepiece gdown
```
Note:
- After installing the requirements, restart the current session so the newly installed package versions are picked up.
- If you are using Colab, change the runtime type to "GPU". You can verify that a GPU is visible with the quick check below.
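This optional snippet only assumes PyTorch is installed (which transformers[torch] provides) and confirms that a GPU is available before you start training:

```python
# Optional: confirm that PyTorch can see a GPU before training.
import torch

if torch.cuda.is_available():
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; training will run on the CPU and be much slower.")
```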
Step 2: Loading the Dataset
We will use two datasets for this demonstration, and you can choose the one that best suits your needs:
- SumArabic Dataset: This dataset contains over 80,000 articles paired with very short, headline-style summaries.
- Arabic-article-summarization-30-000: This dataset contains 8,378 articles with their summaries.
First, let's download and load the SumArabic dataset.
```python
# Download the data from Google Drive
import gdown
import zipfile
import os
from datasets import load_dataset

# Ensure we're in the right directory (adjust as needed)
if not os.path.exists("/kaggle/working/content"):
    os.makedirs("/kaggle/working/content", exist_ok=True)

os.chdir("/kaggle/working")

# Download the dataset (notebook shell command)
!gdown 18hoo7Tql8NRMjLvabWYgigkfDrqds4m1

# Unzip the dataset
with zipfile.ZipFile("/kaggle/working/SumArabic.zip", 'r') as zip_ref:
    zip_ref.extractall("/kaggle/working/content")

print("=========================")
!du -sh /kaggle/working/content/SumArabic

data_dir = "/kaggle/working/content/SumArabic"

sumArabic = load_dataset(
    "json",
    data_files={
        "train": f"{data_dir}/sumarabic-1.0-train.jsonl",
        "validation": f"{data_dir}/sumarabic-1.0-valid.jsonl",
        "test": f"{data_dir}/sumarabic-1.0-test.jsonl",
    },
)
print(sumArabic.keys())
print(sumArabic["train"].num_rows)
print(sumArabic["validation"].num_rows)
print(sumArabic["test"].num_rows)

# Print the first example in the training set
print(sumArabic["train"][0])
```
Now, let's load the Arabic-article-summarization-30-000 dataset:
```python
# Load ar_article_sum
from datasets import load_dataset

ar_article_sum = load_dataset("Abdelkareem/Arabic-article-summarization-30-000")

print(ar_article_sum["train"][0])
```
Choose one of these datasets for the rest of the tutorial. The code below assumes you've stored your chosen dataset in a variable called dataset. For example:
```python
# Choose either SumArabic or ar_article_sum
# dataset = sumArabic
dataset = ar_article_sum
```
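Before moving on, it is worth checking which splits and columns your chosen dataset actually contains, since the preprocessing in Step 4 and the evaluation split in Step 7 depend on them. A small optional check:

```python
# Inspect the chosen dataset: available splits, their sizes, and column names.
print(dataset)
for split_name, split in dataset.items():
    print(split_name, split.num_rows, split.column_names)
```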
Step 3: Loading the Pre-trained Model and Tokenizer
We'll use the UBC-NLP/AraT5v2-base-1024 model, a T5-based model pre-trained on a large Arabic corpus. It's suitable for various text generation tasks, including summarization. We'll load both the model and its corresponding tokenizer.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model_name = "UBC-NLP/AraT5v2-base-1024"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```
The DataCollatorForSeq2Seq is a crucial component. It handles batching of the input sequences and labels, padding each batch to a common length so the model receives the rectangular tensors that sequence-to-sequence training requires. A small illustration follows.
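To make this concrete, here is a purely illustrative sketch (the short Arabic strings are placeholders I made up, not examples from the datasets) showing how the collator pads two examples of different lengths into one batch:

```python
# Illustrative only: the collator pads variable-length examples into rectangular tensors.
examples = [
    {"input_ids": tokenizer("لخص : نص قصير")["input_ids"],
     "labels": tokenizer("ملخص")["input_ids"]},
    {"input_ids": tokenizer("لخص : نص أطول قليلا من النص السابق")["input_ids"],
     "labels": tokenizer("ملخص آخر")["input_ids"]},
]

batch = data_collator(examples)
print(batch["input_ids"].shape)  # (2, length of the longest input in this batch)
print(batch["labels"].shape)     # (2, length of the longest label); label padding uses -100
```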
Step 4: Preprocessing the Data
Before we can train the model, we need to preprocess the data. This involves:
- Adding a prefix: We add a prefix like "summarize: " to the input text. This helps the model understand that it should perform a summarization task.
- Tokenizing: Converting the text (both input and summary) into numerical representations (tokens) that the model can understand. We use the tokenizer associated with our pre-trained model.
- Truncating: Limiting the length of the input and output sequences to a maximum length. This is necessary because transformers have a limited input context size.
Here's the preprocessing function, adapted for both datasets. We define a separate function for each dataset because their column names differ.
```python
prefix = "لخص :"  # "summarize: " in Arabic

def process_function_ar_article_sum(examples):
    inputs = [prefix + doc for doc in examples["Processed Text"]]  # Use "Processed Text"
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["summarizer"], max_length=220, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def process_function_sumArabic(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["headline"], max_length=128, truncation=True)  # Use "headline"
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Choose the appropriate preprocessing function based on your chosen dataset.
if dataset == ar_article_sum:
    process_function = process_function_ar_article_sum
elif dataset == sumArabic:
    process_function = process_function_sumArabic
else:
    raise ValueError("Invalid dataset choice.")

tokenized_dataset = dataset.map(process_function, batched=True)
```
We use the .map() function to apply the preprocessing function to all examples in the dataset efficiently. The batched=True argument speeds up the process by passing batches of examples to the function instead of one example at a time.
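One practical note: the training setup in Step 7 evaluates on a "test" split. If the dataset you chose ships only a single "train" split, you can carve a test set out of the tokenized data yourself. This is an optional sketch, and the 10% split size is an arbitrary choice of mine rather than something from the original datasets:

```python
# Optional: create a held-out test split if the tokenized dataset only has "train".
# The 10% size and fixed seed are arbitrary illustrative choices.
if "test" not in tokenized_dataset:
    tokenized_dataset = tokenized_dataset["train"].train_test_split(test_size=0.1, seed=42)

print(tokenized_dataset)  # should now contain both "train" and "test"
```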
Step 5: Defining the Evaluation Metric (ROUGE)
We'll use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric to evaluate the quality of the generated summaries. ROUGE compares the generated summary to a reference summary (the ground truth) and measures the overlap of n-grams (sequences of words).
```python
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
```
This function decodes the model's output, replaces the -100 values in the labels (which the loss function ignores) with the pad token so they can be decoded too, calculates ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L), and computes the average generation length.
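If you want to see what the metric returns before training, here is a small optional check on a toy pair of strings. I use English strings because the default rouge_score tokenizer is built around Latin characters, so it is worth verifying on your own Arabic outputs that the scores behave sensibly:

```python
# Optional toy check of the ROUGE metric (English strings used purely for illustration).
toy_result = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat is on the mat"],
)
print(toy_result)  # rouge1, rouge2, rougeL, rougeLsum scores between 0 and 1
```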
Step 6: Setting up the Training Arguments
The Seq2SeqTrainingArguments class from the Transformers library allows us to configure various aspects of the training process.
```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="my_summarizer",      # Directory to save the fine-tuned model
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # Batch size per GPU for training
    per_device_eval_batch_size=2,    # Batch size per GPU for evaluation
    weight_decay=0.001,              # Weight decay for regularization
    save_total_limit=3,              # Limit the number of saved checkpoints
    num_train_epochs=3,              # Number of training epochs
    predict_with_generate=True,      # Generate summaries during evaluation
    # fp16=True,                     # Use mixed-precision training (if your GPU supports it)
    push_to_hub=False,               # Set to True to push the model to the Hugging Face Hub
)
```
Key arguments:
- output_dir: Where to save the fine-tuned model.
- evaluation_strategy: How often to evaluate the model (here, after each epoch).
- learning_rate: The learning rate for the optimizer.
- per_device_train_batch_size and per_device_eval_batch_size: Batch sizes per GPU. Adjust these based on your GPU memory.
- num_train_epochs: The number of times to iterate over the entire training dataset.
- predict_with_generate: Enables summary generation during evaluation, which is necessary for ROUGE calculation.
- fp16=True: Enables mixed-precision training, which can significantly speed up training and reduce memory usage if your GPU supports it (e.g., NVIDIA GPUs with Tensor Cores). It is commented out above; uncomment it only if you have a compatible GPU.
- push_to_hub: If set to True, the model and tokenizer will be automatically pushed to your Hugging Face Hub account after training (you'll need to be logged in, as shown below).
Step 7: Creating the Trainer and Training
We now create a Seq2SeqTrainer instance, passing in the model, training arguments, datasets, tokenizer, data collator, and evaluation function. Then, we start the training process.
```python
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],  # Use the test set for evaluation
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```
This is the most time-consuming part. The training time will depend on the dataset size, the model size, the batch size, and the number of epochs. With a GPU, it can take anywhere from a few minutes to several hours. Without a GPU, it could take significantly longer.
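If you just want to confirm that everything is wired up correctly before committing to a long run, you can first fine-tune on a small subset. This is an optional sketch; the subset sizes (1,000 training and 200 evaluation examples) are arbitrary choices of mine:

```python
# Optional dry run on a small, shuffled subset before the full training run.
# The subset sizes are arbitrary illustrative values; shrink them further if needed.
small_train = tokenized_dataset["train"].shuffle(seed=42).select(range(1000))
small_eval = tokenized_dataset["test"].select(range(200))

quick_trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
# quick_trainer.train()  # uncomment to run the quick test
```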
Step 8: Evaluating the Model (Optional)
Although we evaluate during training, you can evaluate the final model on a separate test set (if you have one) or re-evaluate on the validation set:
```python
# Optional: Evaluate after training is complete
results = trainer.evaluate()
print(results)
```
Step 9: Using the Fine-Tuned Model
Once the model is fine-tuned, you can use it to generate summaries:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="my_summarizer")  # Or path to your saved model

# Placeholder text; the Arabic line means "Put the text you want to summarize here."
text = """
ضع هنا النص الذي تريد تلخيصه.
"""

summary = summarizer(text, max_length=130, min_length=30, do_sample=False)

print(summary[0]['summary_text'])
```
Replace "my_summarizer"
with the path to your saved model directory if you didn't push it to the Hugging Face Hub.
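One detail worth noting: during preprocessing we prefixed every input with "لخص :", and the pipeline call above does not add that prefix for you. Below is an optional sketch of manual generation that reuses the training prefix; the generation parameters (num_beams, max_new_tokens) are illustrative choices, not values from the original tutorial:

```python
# Manual generation with the fine-tuned model, reusing the training prefix.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("my_summarizer")
tokenizer = AutoTokenizer.from_pretrained("my_summarizer")

prefix = "لخص :"  # "summarize: ", the same prefix used during preprocessing
text = prefix + "ضع هنا النص الذي تريد تلخيصه."  # "Put the text you want to summarize here."

inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=128)  # illustrative settings

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```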
Step 10: Saving and Loading (Local)
The trainer.train() method automatically saves checkpoints during training (limited here to the three most recent by save_total_limit). You can also explicitly save the final model and tokenizer:
```python
# Save the model and tokenizer
trainer.save_model("my_summarizer")
tokenizer.save_pretrained("my_summarizer")
```
To load the model later:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("my_summarizer")
tokenizer = AutoTokenizer.from_pretrained("my_summarizer")
```
Conclusion
This guide provides a comprehensive overview of fine-tuning a pre-trained summarization model. By following these steps, you can create a custom summarizer tailored to your specific needs. Remember to experiment with different hyperparameters (learning rate, batch size, number of epochs) and datasets to achieve the best results. Using a GPU is highly recommended for faster training. The Hugging Face Transformers library makes this process relatively straightforward, even for those with limited machine-learning experience. ✅✨