Fine-Tuning a Summarization Model: A Practical Guide ✨
Introduction
Automatic text summarization is a crucial task in Natural Language Processing (NLP). It involves condensing a longer piece of text into a shorter, coherent, and informative summary. While pre-trained models like those from Hugging Face's Transformers library can perform summarization, fine-tuning them on a specific dataset allows you to create a model tailored to your particular needs and writing style.
This guide will walk you through the process of fine-tuning a pre-trained summarization model using the Hugging Face Transformers library and a publicly available dataset. We'll focus on a practical, hands-on approach, providing code examples and explanations along the way. Although the examples use specific models and datasets, the emphasis is on the general process, which you can apply to your own data. This guide assumes a basic understanding of Python and machine learning concepts.
Prerequisites
Before we begin, make sure you have the following installed:
- Python 3.7+: A recent version of Python is recommended.
- pip: The Python package installer.
- Required Libraries: We'll install these using pip.
  - transformers (Hugging Face Transformers library)
  - datasets (Hugging Face Datasets library)
  - evaluate (Hugging Face Evaluate library, used for metrics)
  - rouge_score (for ROUGE evaluation)
  - accelerate (for optimized training)
  - sentencepiece (for tokenization)
  - gdown (for downloading files from Google Drive)
- A GPU (Recommended): Fine-tuning large language models is computationally intensive. A GPU (e.g., NVIDIA GPU with CUDA support) will significantly speed up the training process. Services like Google Colab (with GPU enabled) or Kaggle Kernels provide free access to GPUs.
Step 1: Installing the Necessary Libraries
Let's install the required libraries using pip. It's highly recommended to create a virtual environment for this project to avoid conflicts with other Python projects.
```bash
pip install transformers[torch] datasets evaluate rouge_score accelerate==0.20.3 sentencepiece gdown
```
Note:
- After installing the requirements, restart the current session so the newly installed package versions are picked up.
- If you are using Colab, change the runtime type to "GPU". You can verify that a GPU is visible with the quick check below.
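This optional snippet only assumes PyTorch is installed (which transformers[torch] provides) and confirms that a GPU is available before you start training:

```python
# Optional: confirm that PyTorch can see a GPU before training.
import torch

if torch.cuda.is_available():
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; training will run on the CPU and be much slower.")
```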
Step 2: Loading the Dataset
We will use two datasets for this demonstration, and you can choose the one that best suits your needs:
- SumArabic Dataset: This dataset contains over 80,000 articles paired with very short, headline-style summaries.
- Arabic-article-summarization-30-000: This dataset contains 8,378 articles with their summaries.
First, let's download and load the SumArabic dataset.
```python
# Download the data from Google Drive
import gdown
import zipfile
import os
from datasets import load_dataset

# Ensure we're in the right directory (adjust as needed)
if not os.path.exists("/kaggle/working/content"):
    os.makedirs("/kaggle/working/content", exist_ok=True)

os.chdir("/kaggle/working")

# Download the dataset (notebook shell command)
!gdown 18hoo7Tql8NRMjLvabWYgigkfDrqds4m1

# Unzip the dataset
with zipfile.ZipFile("/kaggle/working/SumArabic.zip", 'r') as zip_ref:
    zip_ref.extractall("/kaggle/working/content")

print("=========================")
!du -sh /kaggle/working/content/SumArabic

data_dir = "/kaggle/working/content/SumArabic"

sumArabic = load_dataset(
    "json",
    data_files={
        "train": f"{data_dir}/sumarabic-1.0-train.jsonl",
        "validation": f"{data_dir}/sumarabic-1.0-valid.jsonl",
        "test": f"{data_dir}/sumarabic-1.0-test.jsonl",
    },
)
print(sumArabic.keys())
print(sumArabic["train"].num_rows)
print(sumArabic["validation"].num_rows)
print(sumArabic["test"].num_rows)

# Print the first example in the training set
print(sumArabic["train"][0])
```
Now, let's load the Arabic-article-summarization-30-000 dataset:
```python
# Load ar_article_sum
from datasets import load_dataset

ar_article_sum = load_dataset("Abdelkareem/Arabic-article-summarization-30-000")

print(ar_article_sum["train"][0])
```
Choose one of these datasets for the rest of the tutorial. The code below assumes you've stored your chosen dataset in a variable called dataset. For example:
```python
# Choose either SumArabic or ar_article_sum
# dataset = sumArabic
dataset = ar_article_sum
```
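Before moving on, it is worth checking which splits and columns your chosen dataset actually contains, since the preprocessing in Step 4 and the evaluation split in Step 7 depend on them. A small optional check:

```python
# Inspect the chosen dataset: available splits, their sizes, and column names.
print(dataset)
for split_name, split in dataset.items():
    print(split_name, split.num_rows, split.column_names)
```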
Step 3: Loading the Pre-trained Model and Tokenizer
We'll use the UBC-NLP/AraT5v2-base-1024 model, a T5-based model pre-trained on a large Arabic corpus. It's suitable for various text generation tasks, including summarization. We'll load both the model and its corresponding tokenizer.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model_name = "UBC-NLP/AraT5v2-base-1024"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```
The DataCollatorForSeq2Seq is a crucial component. It handles batching of the input sequences and labels, padding each batch to a common length so the model receives the rectangular tensors that sequence-to-sequence training requires. A small illustration follows.
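To make this concrete, here is a purely illustrative sketch (the short Arabic strings are placeholders I made up, not examples from the datasets) showing how the collator pads two examples of different lengths into one batch:

```python
# Illustrative only: the collator pads variable-length examples into rectangular tensors.
examples = [
    {"input_ids": tokenizer("لخص : نص قصير")["input_ids"],
     "labels": tokenizer("ملخص")["input_ids"]},
    {"input_ids": tokenizer("لخص : نص أطول قليلا من النص السابق")["input_ids"],
     "labels": tokenizer("ملخص آخر")["input_ids"]},
]

batch = data_collator(examples)
print(batch["input_ids"].shape)  # (2, length of the longest input in this batch)
print(batch["labels"].shape)     # (2, length of the longest label); label padding uses -100
```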
Step 4: Preprocessing the Data
Before we can train the model, we need to preprocess the data. This involves:
- Adding a prefix: We add a prefix like "summarize: " to the input text. This helps the model understand that it should perform a summarization task.
- Tokenizing: Converting the text (both input and summary) into numerical representations (tokens) that the model can understand. We use the tokenizer associated with our pre-trained model.
- Truncating: Limiting the length of the input and output sequences to a maximum length. This is necessary because transformers have a limited input context size.
Here's the preprocessing function, adapted for both datasets. We define a separate function for each dataset because their column names differ.
```python
prefix = "لخص :"  # "summarize: " in Arabic

def process_function_ar_article_sum(examples):
    inputs = [prefix + doc for doc in examples["Processed Text"]]  # Use "Processed Text"
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["summarizer"], max_length=220, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def process_function_sumArabic(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["headline"], max_length=128, truncation=True)  # Use "headline"
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Choose the appropriate preprocessing function based on your chosen dataset.
if dataset == ar_article_sum:
    process_function = process_function_ar_article_sum
elif dataset == sumArabic:
    process_function = process_function_sumArabic
else:
    raise ValueError("Invalid dataset choice.")

tokenized_dataset = dataset.map(process_function, batched=True)
```
We use the .map() function to apply the preprocessing function to all examples in the dataset efficiently. The batched=True argument speeds up the process by passing batches of examples to the function instead of one example at a time.
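One practical note: the training setup in Step 7 evaluates on a "test" split. If the dataset you chose ships only a single "train" split, you can carve a test set out of the tokenized data yourself. This is an optional sketch, and the 10% split size is an arbitrary choice of mine rather than something from the original datasets:

```python
# Optional: create a held-out test split if the tokenized dataset only has "train".
# The 10% size and fixed seed are arbitrary illustrative choices.
if "test" not in tokenized_dataset:
    tokenized_dataset = tokenized_dataset["train"].train_test_split(test_size=0.1, seed=42)

print(tokenized_dataset)  # should now contain both "train" and "test"
```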
Step 5: Defining the Evaluation Metric (ROUGE)
We'll use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric to evaluate the quality of the generated summaries. ROUGE compares the generated summary to a reference summary (the ground truth) and measures the overlap of n-grams (sequences of words).
```python
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
```
This function decodes the model's output, replaces the -100 values in the labels (which the loss function ignores) with the pad token so they can be decoded too, calculates ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L), and computes the average generation length.
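If you want to see what the metric returns before training, here is a small optional check on a toy pair of strings. I use English strings because the default rouge_score tokenizer is built around Latin characters, so it is worth verifying on your own Arabic outputs that the scores behave sensibly:

```python
# Optional toy check of the ROUGE metric (English strings used purely for illustration).
toy_result = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat is on the mat"],
)
print(toy_result)  # rouge1, rouge2, rougeL, rougeLsum scores between 0 and 1
```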
Step 6: Setting up the Training Arguments
The Seq2SeqTrainingArguments class from the Transformers library allows us to configure various aspects of the training process.
```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="my_summarizer",      # Directory to save the fine-tuned model
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # Batch size per GPU for training
    per_device_eval_batch_size=2,    # Batch size per GPU for evaluation
    weight_decay=0.001,              # Weight decay for regularization
    save_total_limit=3,              # Limit the number of saved checkpoints
    num_train_epochs=3,              # Number of training epochs
    predict_with_generate=True,      # Generate summaries during evaluation
    # fp16=True,                     # Use mixed-precision training (if your GPU supports it)
    push_to_hub=False,               # Set to True to push the model to the Hugging Face Hub
)
```
Key arguments:
- output_dir: Where to save the fine-tuned model.
- evaluation_strategy: How often to evaluate the model (here, after each epoch).
- learning_rate: The learning rate for the optimizer.
- per_device_train_batch_size and per_device_eval_batch_size: Batch sizes per GPU. Adjust these based on your GPU memory.
- num_train_epochs: The number of times to iterate over the entire training dataset.
- predict_with_generate: Enables summary generation during evaluation, which is necessary for ROUGE calculation.
- fp16=True: Enables mixed-precision training, which can significantly speed up training and reduce memory usage if your GPU supports it (e.g., NVIDIA GPUs with Tensor Cores). It is commented out above; uncomment it only if you have a compatible GPU.
- push_to_hub: If set to True, the model and tokenizer will be automatically pushed to your Hugging Face Hub account after training (you'll need to be logged in, as shown below).
Step 7: Creating the Trainer and Training
We now create a Seq2SeqTrainer instance, passing in the model, training arguments, datasets, tokenizer, data collator, and evaluation function. Then, we start the training process.
```python
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],  # Use the test set for evaluation
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```
This is the most time-consuming part. The training time will depend on the dataset size, the model size, the batch size, and the number of epochs. With a GPU, it can take anywhere from a few minutes to several hours. Without a GPU, it could take significantly longer.
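If you just want to confirm that everything is wired up correctly before committing to a long run, you can first fine-tune on a small subset. This is an optional sketch; the subset sizes (1,000 training and 200 evaluation examples) are arbitrary choices of mine:

```python
# Optional dry run on a small, shuffled subset before the full training run.
# The subset sizes are arbitrary illustrative values; shrink them further if needed.
small_train = tokenized_dataset["train"].shuffle(seed=42).select(range(1000))
small_eval = tokenized_dataset["test"].select(range(200))

quick_trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
# quick_trainer.train()  # uncomment to run the quick test
```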
Step 8: Evaluating the Model (Optional)
Although we evaluate during training, you can evaluate the final model on a separate test set (if you have one) or re-evaluate on the validation set:
```python
# Optional: Evaluate after training is complete
results = trainer.evaluate()
print(results)
```
Step 9: Using the Fine-Tuned Model
Once the model is fine-tuned, you can use it to generate summaries:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="my_summarizer")  # Or path to your saved model

# Placeholder text; the Arabic line means "Put the text you want to summarize here."
text = """
ضع هنا النص الذي تريد تلخيصه.
"""

summary = summarizer(text, max_length=130, min_length=30, do_sample=False)

print(summary[0]['summary_text'])
```
Replace "my_summarizer"
with the path to your saved model directory if you didn't push it to the Hugging Face Hub.
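One detail worth noting: during preprocessing we prefixed every input with "لخص :", and the pipeline call above does not add that prefix for you. Below is an optional sketch of manual generation that reuses the training prefix; the generation parameters (num_beams, max_new_tokens) are illustrative choices, not values from the original tutorial:

```python
# Manual generation with the fine-tuned model, reusing the training prefix.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("my_summarizer")
tokenizer = AutoTokenizer.from_pretrained("my_summarizer")

prefix = "لخص :"  # "summarize: ", the same prefix used during preprocessing
text = prefix + "ضع هنا النص الذي تريد تلخيصه."  # "Put the text you want to summarize here."

inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=128)  # illustrative settings

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```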
Step 10: Saving and Loading (Local)
The trainer.train() method automatically saves checkpoints during training (limited here to the three most recent by save_total_limit). You can also explicitly save the final model and tokenizer:
```python
# Save the model and tokenizer
trainer.save_model("my_summarizer")
tokenizer.save_pretrained("my_summarizer")
```
To load the model later:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("my_summarizer")
tokenizer = AutoTokenizer.from_pretrained("my_summarizer")
```
Conclusion
This guide provides a comprehensive overview of fine-tuning a pre-trained summarization model. By following these steps, you can create a custom summarizer tailored to your specific needs. Remember to experiment with different hyperparameters (learning rate, batch size, number of epochs) and datasets to achieve the best results. Using a GPU is highly recommended for faster training. The Hugging Face Transformers library makes this process relatively straightforward, even for those with limited machine-learning experience. ✅✨