Fine-Tuning a Summarization Model: A Practical Guide ✨

Learn how to fine-tune a pre-trained language model for text summarization, creating your own custom summarizer.

March 23, 2024 · 3 min read
Tags: NLP, Summarization, Transformers, Fine-tuning, Machine Learning, Python, Hugging Face, AI

Introduction

Automatic text summarization is a crucial task in Natural Language Processing (NLP). It involves condensing a longer piece of text into a shorter, coherent, and informative summary. While pre-trained models like those from Hugging Face's Transformers library can perform summarization, fine-tuning them on a specific dataset allows you to create a model tailored to your particular needs and writing style.

This guide will walk you through the process of fine-tuning a pre-trained summarization model using the Hugging Face Transformers library and publicly available datasets. We'll take a practical, hands-on approach, providing code examples and explanations along the way. Although the examples use specific Arabic models and datasets, the general process applies equally to other languages and domains. This guide assumes a basic understanding of Python and machine learning concepts.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python 3.7+: A recent version of Python is recommended.
  • pip: The Python package installer.
  • Required Libraries: We'll install these using pip.
    • transformers (Hugging Face Transformers library)
    • datasets (Hugging Face Datasets library)
    • rouge_score (for evaluation)
    • accelerate (for optimized training)
    • sentencepiece (for tokenization)
    • gdown (for downloading from Google Drive)
  • A GPU (Recommended): Fine-tuning large language models is computationally intensive. A GPU (e.g., NVIDIA GPU with CUDA support) will significantly speed up the training process. Services like Google Colab (with GPU enabled) or Kaggle Kernels provide free access to GPUs.

Step 1: Installing the Necessary Libraries

Let's install the required libraries using pip. It's highly recommended to create a virtual environment for this project to avoid conflicts with other Python projects.
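
```bash
pip install transformers datasets transformers[torch] evaluate rouge_score accelerate==0.20.3 sentencepiece gdown
```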

Note:

  • After installing the requirements, restart the current session.
  • If you are using Colab, change the runtime type to "GPU".

Step 2: Loading the Dataset

We will use two datasets for this demonstration, and you can choose the one that best suits your needs:

  1. SumArabic Dataset: This dataset contains over 80,000 articles, each paired with an extremely short, headline-style summary.
  2. Arabic-article-summarization-30-000: This dataset contains 8,378 articles with their summaries.

First, let's download and load the SumArabic dataset.
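
```python
# Download and extract the SumArabic dataset.
# Notebook code: lines starting with "!" are shell commands in Colab/Kaggle.
import os
import zipfile

from datasets import load_dataset

# Ensure we're in the right directory (adjust these paths as needed)
os.makedirs("/kaggle/working/content", exist_ok=True)
os.chdir("/kaggle/working")

# Download the dataset archive from Google Drive
!gdown 18hoo7Tql8NRMjLvabWYgigkfDrqds4m1

# Unzip the dataset
with zipfile.ZipFile("/kaggle/working/SumArabic.zip", "r") as zip_ref:
    zip_ref.extractall("/kaggle/working/content")

print("=========================")
!du -sh /kaggle/working/content/SumArabic

data_dir = "/kaggle/working/content/SumArabic"

sumArabic = load_dataset(
    "json",
    data_files={
        "train": f"{data_dir}/sumarabic-1.0-train.jsonl",
        "validation": f"{data_dir}/sumarabic-1.0-valid.jsonl",
        "test": f"{data_dir}/sumarabic-1.0-test.jsonl",
    },
)

print(sumArabic.keys())
print(sumArabic["train"].num_rows)
print(sumArabic["validation"].num_rows)
print(sumArabic["test"].num_rows)

# Print the first example in the training set
print(sumArabic["train"][0])
```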

Now, let's load the Arabic-article-summarization-30-000 dataset:
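
```python
from datasets import load_dataset

# Load the Arabic-article-summarization-30-000 dataset from the Hugging Face Hub
ar_article_sum = load_dataset("Abdelkareem/Arabic-article-summarization-30-000")

print(ar_article_sum["train"][0])
```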

Choose one of these datasets for the rest of the tutorial. The code will assume you've stored your chosen dataset in a variable called dataset. For example:
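
```python
# Choose either SumArabic or ar_article_sum
# dataset = sumArabic
dataset = ar_article_sum
```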

Step 3: Loading the Pre-trained Model and Tokenizer

We'll use the UBC-NLP/AraT5v2-base-1024 model, a T5-based model pre-trained on a large Arabic corpus. It's suitable for various text generation tasks, including summarization. We'll load both the model and its corresponding tokenizer.
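
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model_name = "UBC-NLP/AraT5v2-base-1024"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```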

The DataCollatorForSeq2Seq is a crucial component. It handles the padding and batching of input sequences and labels, ensuring they have the same length, which is a requirement for sequence-to-sequence models.

Step 4: Preprocessing the Data

Before we can train the model, we need to preprocess the data. This involves:

  1. Adding a prefix: We add a prefix such as "summarize: " (here, its Arabic equivalent "لخص :") to the input text. This tells the model that it should perform a summarization task.
  2. Tokenizing: Converting the text (both input and summary) into numerical representations (tokens) that the model can understand. We use the tokenizer associated with our pre-trained model.
  3. Truncating: Limiting the length of the input and output sequences to a maximum length. This is necessary because transformers have a limited input context size.

Here's the preprocessing function, adapted for both datasets. We define a separate function for each dataset to handle their different column names.
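
```python
prefix = "لخص :"  # "summarize: " in Arabic

def process_function_ar_article_sum(examples):
    # This dataset stores articles in "Processed Text" and summaries in "summarizer"
    inputs = [prefix + doc for doc in examples["Processed Text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["summarizer"], max_length=220, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def process_function_sumArabic(examples):
    # This dataset stores articles in "text" and summaries in "headline"
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["headline"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Choose the appropriate preprocessing function based on your chosen dataset.
if dataset == ar_article_sum:
    process_function = process_function_ar_article_sum
elif dataset == sumArabic:
    process_function = process_function_sumArabic
else:
    raise ValueError("Invalid dataset choice.")

tokenized_dataset = dataset.map(process_function, batched=True)
```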

We use the .map() function to apply the preprocessing function to all examples in the dataset efficiently. The batched=True argument speeds up the process by applying the function to batches of examples.

Step 5: Defining the Evaluation Metric (ROUGE)

We'll use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric to evaluate the quality of the generated summaries. ROUGE compares the generated summary to a reference summary (the ground truth) and measures the overlap of n-grams (sequences of words).
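
```python
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 (positions ignored by the loss) with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average length of the generated summaries (in tokens)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
```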

This function decodes the model's output, calculates ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L), and computes the average generation length.

Step 6: Setting up the Training Arguments

The Seq2SeqTrainingArguments class from the Transformers library allows us to configure various aspects of the training process.
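
```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="my_summarizer",      # Directory to save the fine-tuned model
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # Batch size per GPU for training
    per_device_eval_batch_size=2,    # Batch size per GPU for evaluation
    weight_decay=0.001,              # Weight decay for regularization
    save_total_limit=3,              # Limit the number of saved checkpoints
    num_train_epochs=3,              # Number of training epochs
    predict_with_generate=True,      # Generate summaries during evaluation
    # fp16=True,                     # Mixed-precision training (if your GPU supports it)
    push_to_hub=False,               # Set to True to push the model to the Hugging Face Hub
)
```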

Key arguments:

  • output_dir: Where to save the fine-tuned model.
  • evaluation_strategy: How often to evaluate the model (here, after each epoch).
  • learning_rate: The learning rate for the optimizer.
  • per_device_train_batch_size and per_device_eval_batch_size: Batch sizes. Adjust these based on your GPU memory.
  • num_train_epochs: The number of times to iterate over the entire training dataset.
  • predict_with_generate: Enables summary generation during evaluation, which is necessary for ROUGE calculation.
  • fp16=True: Enables mixed-precision training, which can significantly speed up training and reduce memory usage if your GPU supports it (e.g., NVIDIA GPUs with Tensor Cores). It is commented out in the example code; enable it only if you have a compatible GPU.
  • push_to_hub: If set to True, the model and tokenizer will be automatically pushed to your Hugging Face Hub account after training (you'll need to be logged in).

Step 7: Creating the Trainer and Training

We now create a Seq2SeqTrainer instance, passing in the model, training arguments, datasets, tokenizer, data collator, and evaluation function. Then, we start the training process.
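
```python
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],  # Use the test split for evaluation
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```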

This is the most time-consuming part. The training time will depend on the dataset size, the model size, the batch size, and the number of epochs. With a GPU, it can take anywhere from a few minutes to several hours. Without a GPU, it could take significantly longer.

Step 8: Evaluating the Model (Optional)

Although we evaluate during training, you can evaluate the final model on a separate test set (if you have one) or re-evaluate on the validation set:
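
```python
# Optional: evaluate after training is complete
results = trainer.evaluate()
print(results)
```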

Step 9: Using the Fine-Tuned Model

Once the model is fine-tuned, you can use it to generate summaries:
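
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="my_summarizer")  # Or the path to your saved model

text = """
ضع هنا النص الذي تريد تلخيصه.
"""  # "Put the text you want to summarize here."

summary = summarizer(text, max_length=130, min_length=30, do_sample=False)

print(summary[0]["summary_text"])
```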

Replace "my_summarizer" with the path to your saved model directory if you didn't push it to the Hugging Face Hub.

Step 10: Saving and Loading (Local)

The trainer saves checkpoints to output_dir during training (keeping at most save_total_limit of them). You can also explicitly save the final model and tokenizer:
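
```python
# Save the model and tokenizer
trainer.save_model("my_summarizer")
tokenizer.save_pretrained("my_summarizer")
```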

To load the model later:
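
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("my_summarizer")
tokenizer = AutoTokenizer.from_pretrained("my_summarizer")
```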

Conclusion

This guide provides a comprehensive overview of fine-tuning a pre-trained summarization model. By following these steps, you can create a custom summarizer tailored to your specific needs. Remember to experiment with different hyperparameters (learning rate, batch size, number of epochs) and datasets to achieve the best results. Using a GPU is highly recommended for faster training. The Hugging Face Transformers library makes this process relatively straightforward, even for those with limited machine-learning experience. ✅✨
