How to fine tune NLP Huggingface transformers model 🤗 using your own dataset in 6 steps


The Huggingface 🤗 package offers powerful yet accessible transformer-based natural language processing (NLP) models. Some models are optimised for Natural Language Understanding (NLU) and others are geared towards Natural Language Generation (NLG).

When I was starting out with Huggingface transformers 🤗, I found that, despite (or maybe because of!) the huge amount of documentation and the numerous ways to do things, it was not clear how to use the library on my own data, outside the realms of the curated Huggingface Dataset Hub. If you have also found this to be a challenge, this tutorial is my contribution to providing you with clear steps.


This tutorial walks you through the steps to fine-tune an NLP Huggingface transformers 🤗 model on your own custom dataset, using the Huggingface Transformers API for training and the Huggingface Datasets library for downloading, storing and preprocessing the training and testing data. The tutorial takes you from starting with your own dataset right through to evaluating the results of the fine-tuned model.

Table Of Contents
  1. Introduction to Huggingface Transformers 🤗
  2. Inspecting an existing Huggingface Transformers 🤗 Hub dataset
  3. Loading your own Custom Dataset directly into Huggingface 🤗 Datasets
  4. Preprocessing the dataset using the 🤗 Huggingface Datasets Library
  5. Transformer model fine-tuning with the Huggingface Trainer API
  6. Retrieving Predictions from the fine-tuned transformer model
  7. Conclusion

1. Introduction to Huggingface Transformers 🤗

The Huggingface package offers powerful yet accessible transformer-based natural language processing (NLP) models. Some models are optimised for Natural Language Understanding (NLU) and others are geared towards Natural Language Generation (NLG). The types of transformer model available in the Huggingface transformers library, their high-level functions and typical tasks are set out as examples in Table 1. There is also a hub of other pre-trained Huggingface models. In addition, Huggingface offers a Datasets library, which provides methods to download, store and preprocess datasets for use with transformer models and which is also used in this tutorial.

This tutorial will cover the steps required if you want to use your own data set to fine tune a Huggingface Model, and the tutorial fine-tuning example will be to make a multi-class sentence classifier. Python 3 is used throughout.

1.1 Types of Transformer Models offered by Huggingface 🤗

Table 1 shows transformer model families offered by Huggingface. A particular model architecture (BERT, GPT-2, Marian etc.) will be best matched to whatever natural language processing task you wish to perform. For instance it can be seen in Table 1 that natural language understanding (NLU) is best carried out by an encoder based transformer model architecture. For a task which involves both natural language understanding and natural language generation, the most suitable transformers model will be from the family of encoder-decoder models.

| Transformer Model Type | High-level function of the model architecture | Example model architectures | Typical natural language processing tasks |
| --- | --- | --- | --- |
| Encoder | Receives input and builds a representation. Model optimised to acquire understanding from an input (primarily for natural language understanding, NLU) | BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa | sentence classification, named entity recognition, extractive question answering |
| Decoder | Uses an existing encoder representation and inputs to generate a sequence. Model optimised for generating outputs (primarily for natural language generation, NLG) | CTRL, GPT, GPT-2, Transformer XL | text generation |
| Encoder-Decoder | Combines the features of encoder and decoder, generating an output from an input, such that there is NLU of an input and NLG of an output (for sequence to sequence) | BART, T5, Marian, mBART | text summarization, translation, generative question answering |
Table 1: Summary of Transformer model types, examples and tasks, adapted and extended from a Huggingface summary page [1] https://huggingface.co/course/chapter1/9?fw=pt

1.2 Selecting the Pre-trained Model based on the Desired NLP Task

From Table 1, it can be seen that typically sentence classification is carried out by an encoder architecture, hence a trained BERT model with a multi-class classifier ‘head’ will be chosen as a starting point for the classifier being trained on your own dataset.

The BERT model (an encoder architecture – see Table 1) is not originally trained as a classifier, but instead via a masked language modeling (MLM) task, and on next sentence prediction (NSP) objectives. The training of BERT involves randomly masking some tokens in a text sequence, and then independently recovering the masked tokens by conditioning on the encoding vectors obtained by a bidirectional Transformer [2] https://arxiv.org/pdf/2004. . It is efficient at predicting masked tokens and at natural language understanding (NLU) in general, but not optimised for text generation. [3] https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertTokenizer
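The masking step described above can be illustrated with a small sketch in plain Python. This is a simplification under stated assumptions: the function name mask_tokens, the whitespace tokenization and the masking probability are all hypothetical choices for illustration, and BERT's actual pre-training recipe is more involved (for instance, a selected token is not always replaced by the mask token).

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK].

    Returns (masked_tokens, targets), where targets holds the original
    token at each masked position and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(token)   # the model would be trained to recover this
        else:
            masked.append(token)
            targets.append(None)    # nothing to predict at this position
    return masked, targets

# toy input, split on whitespace purely for illustration
tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Each masked position is predicted from the context on both sides of it, which is why this objective suits understanding tasks better than left-to-right text generation.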

1.3 Installation of 🤗 Huggingface Transformers and Datasets Libraries

The Huggingface Datasets library must be installed separately from Huggingface Transformers. If you are installing from within a Jupyter notebook, the pip install commands are as below (note the exclamation mark ‘!’ at the beginning of each command, which indicates that a bash/command line command is being executed within a Jupyter notebook). If you are installing directly from the command line, remove the exclamation marks.

# install huggingface transformers and huggingface datasets from jupyter notebook
!pip install transformers
!pip install datasets
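After installing, it can be useful to confirm which versions ended up in your environment. The following is a minimal sketch using only the Python standard library; the helper name installed_versions is hypothetical and not part of either Huggingface package.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(["transformers", "datasets"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```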

2. Inspecting an existing Huggingface Transformers 🤗 Hub dataset

Choosing an existing dataset from the hub gives you a good indication of what format Huggingface datasets adopt; you can see directly what the raw features, and labels look like and what information is stored in a Dataset object.

A complete list of datasets which are on Huggingface can be found on the Huggingface website [4]https://huggingface.co/datasets . On that page, the datasets can be filtered by, for example:

  • Task Categories,
  • Tasks,
  • Languages,
  • Multi-linguality

Because the bespoke dataset (a set of .csv files which will be converted into Huggingface Dataset and DatasetDict objects) is for a sentence classification task, a multi-class classification dataset from the Huggingface Hub is chosen for comparison.

2.1 Downloading an Existing Dataset Object from the Huggingface Dataset Hub

As the task being solved by the custom dataset is going to be a multi-class classification task, I have chosen a similar dataset to download from the Huggingface Dataset hub – the Emotion Dataset. The Emotion Dataset comprises six mutually exclusive classes or labels for short pieces of text (anger, fear, joy, love, sadness, and surprise), so it poses a very similar natural language processing task to the bespoke dataset – a multi-class classification problem. As an emotion classification dataset, it can be used to train or fine-tune a model for natural language understanding (NLU).

Loading the Emotion Dataset Hub dataset creates a DatasetDict object [5] https://huggingface.co/docs/datasets/master/package_reference/main_classes.html#datasetdict :

from datasets import load_dataset

# emotion_dataset is a DatasetDict object
emotion_dataset = load_dataset('emotion')
print(emotion_dataset)

Printing out the emotion dataset shows that it is a DatasetDict (a dataset dictionary object) which is comprised of ‘train’, ‘validation’ and ‘test’ datasets with listed features called ‘text’ and ‘label’:

# note from the output below that the key names are
# 'train', 'validation' and 'test'
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In order to look at the features of a single split of the emotion Dataset (here the split called train), the emotion dataset is loaded by its split, and then its features attribute is inspected:

emotion_dataset_train = load_dataset('emotion', split='train')
print(emotion_dataset_train.features)

Showing the features of the training split in the emotion dataset; this displays the ClassLabel in detail

Display of the emotion dataset training features shows that there is additional information on the labels i.e. ClassLabel information where the number of classes is 6, and the class names are listed in order as strings – ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

To look at this Emotion Dataset Hub in more detail (especially to display the content of the dataset), this DatasetDict can be converted to a Pandas dataframe. This is done in the following section.

2.2 Converting the Huggingface 🤗 Hub Dataset to a Pandas Dataframe

In order to inspect the dataset in detail and see its content in the form of dataframes (using the Pandas package), the individual ‘train’, ‘validation’ and ‘test’ Dataset objects are each selected from the DatasetDict by key and converted:

# the entire DatasetDict cannot be directly loaded into pandas, as it is a
# dictionary of Dataset objects:
# emotion_dataset.to_pandas() # this won't work

# instead, select a Dataset from within the DatasetDict, by using
# key name options: 'train', 'validation', 'test'

emotion_validation = emotion_dataset['validation'].to_pandas()

Looking at the emotion_validation dataset as a dataframe, it has the following structure:

A sample of the emotion validation dataset from the Huggingface Transformers Dataset hub, loaded into a pandas dataframe with text and labels

Likewise, the ‘train’ and ‘test’ options can similarly be converted to pandas dataframe:

# call the train and test dictionaries from within the DatasetDict, by key 

emotion_train = emotion_dataset['train'].to_pandas()
emotion_test = emotion_dataset['test'].to_pandas()

From inspection of the Huggingface emotion_validation dataset (or either of the other two datasets, ‘train’ and ‘test’), it can be seen that the labels (column ‘label’) are not strings, but are already encoded as integers. Note that in the bespoke dataset used for training as part of this tutorial (see Loading your own Custom Dataset directly into Huggingface Datasets), the labels are strings and have to be encoded first. Looking at the unique labels in the dataset using pandas:

emotion_validation.label.unique()

Looking at the output of unique labels for the Huggingface Dataset Hub emotion dataset, assuming all labels are present in the validation portion of the emotions dataset, you can see the dataset shows six mutually exclusive labels (0-5):

unique integer labels for validation set
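The caveat above (that all labels are assumed to be present in the validation split) can be checked rather than assumed. The sketch below does this on a small hand-made dataframe standing in for emotion_validation; the toy rows and the variable names are hypothetical, and only standard pandas calls are used.

```python
import pandas as pd

# toy stand-in for emotion_validation: integer-encoded labels, some classes absent
toy_validation = pd.DataFrame({
    "text": ["i feel low", "what a lovely day", "so happy today", "this makes me furious"],
    "label": [0, 1, 1, 3],
})

expected_labels = set(range(6))  # six classes, encoded 0-5
present_labels = {int(value) for value in toy_validation["label"].unique()}

# classes that never occur in this split
missing_labels = sorted(expected_labels - present_labels)
print("missing labels:", missing_labels)

# number of rows per label, ordered by label value
label_counts = toy_validation["label"].value_counts().sort_index()
print(label_counts)
```

Running the same check on each real split also shows how balanced the classes are before any training is attempted.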

The mapping between encoded labels (0-5) and the ordered list of string labels to which they relate ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'] is stored in the Dataset as a ClassLabel.

2.3 Loading Emotion dataset back into Huggingface Dataset 🤗 Objects from Pandas

The Emotion dataset splits (train, validation, test), expressed as pandas dataframes, can each be converted back from Pandas to a Dataset object [6] https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=from_pandas#datasets.Dataset.from_pandas , and then all the datasets can be collected together to re-form a DatasetDict [7] https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=datasetdict#datasets.DatasetDict.align_labels_with_mapping .

from datasets import Dataset

# convert each pandas dataframe for train, validation and test back into a Dataset

reconstituted_emotion_train = Dataset.from_pandas(emotion_train)
reconstituted_emotion_validation = Dataset.from_pandas(emotion_validation)
reconstituted_emotion_test = Dataset.from_pandas(emotion_test)

The screenshot below shows the output of the reconstituted emotion test set from pandas into Huggingface Dataset format:

Emotion test dataset reconstituted into Huggingface Dataset format from a pandas dataframe

Once each of the dataframes for train, validation and test has been converted from a pandas dataframe back into a Dataset object, these Dataset objects can be combined into a Huggingface Datasets DatasetDict object:

from datasets.dataset_dict import DatasetDict

# construct a DatasetDict from the datasets
reconstituted_datasetdict = DatasetDict({"train":reconstituted_emotion_train, 
                             "validation": reconstituted_emotion_validation, 
                             "test":reconstituted_emotion_test})

The reconstituted dataset dictionary has the structure shown in the screenshot directly below. Note that the names of the dictionary keys (here "train", "validation" and "test") can be anything you wish them to be called.

screenshot showing the reconstituted Huggingface transformers dataset dictionary object from three Huggingface dataset objects

Converting a dataset into a Pandas dataframe provides useful insight into how the dataset is constructed, for example, you can see what the underlying data and labels look like. However, using Pandas dataframes for preprocessing prior to fine-tuning a transformer is not memory efficient. If the dataset is loaded in as a Pandas object, then the whole dataset is loaded in memory; for large datasets this may cause memory problems.

For a memory-friendlier approach to dataset preprocessing, the next section shows you how to load your own .csv file(s) from disk directly into Huggingface 🤗 Datasets, rather than via a Pandas dataframe.

3. Loading your own Custom Dataset directly into Huggingface 🤗 Datasets

Principally, this tutorial is intended to show you how to use your own bespoke dataset to fine-tune a Huggingface transformer model, rather than a Dataset Hub dataset such as the Emotion dataset above. Therefore, in the following steps, we will be using datasets in .csv format and converting them into Dataset and DatasetDict objects for optimised use with the Huggingface Datasets library.

3.1 Loading your own Dataset

The example dataset used for illustration is the custom intent engine, a multiclass labelled intent classification dataset available for download [8] https://github.com/sonos/nlu-benchmark/tree/master/2017-06-custom-intent-engines , from Coucke A. et al. [9] Coucke A. et al., “Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces.” 2018, accepted for a spotlight presentation at the … We will refer to that classification dataset as the ‘Multi Intent’ dataset.

In order to download the bespoke Intent Classification datasets from a jupyter notebook, use the commands below (if running these commands directly from the command line, remove the exclamation mark ‘!’):

!gdown --id 1OlcvGWReJMuyYQuOZm149vHWwPtlboR6 --output train.csv
!gdown --id 1Oi5cRlTybuIF2Fl5Bfsr-KkqrXrdt77w --output valid.csv
!gdown --id 1ep9H6-HvhB4utJRLVcLzieWNUSG3P_uF --output test.csv

The datasets are downloaded with the assigned names ‘train.csv‘, ‘valid.csv‘ and ‘test.csv‘, all as comma-separated values (.csv) files.

3.2 Inspecting the downloaded .csv files in Pandas Dataframe format

As with the Emotion dataset from the Huggingface Datasets Hub (detailed above), it is possible to load and inspect the three downloaded .csv files in Pandas dataframes, in order to inspect the contents of the dataset.

import pandas as pd
# read the .csvs into pandas dataframes

multiintent_train = pd.read_csv("train.csv")
multiintent_valid = pd.read_csv("valid.csv")
multiintent_test = pd.read_csv("test.csv")

Looking at these .csv files in pandas dataframes, the size of these dataframes is shown:

print(
multiintent_train.shape, 
multiintent_valid.shape, 
multiintent_test.shape
)
# output of train, valid and test dataframe shapes:
>> (13084, 2) (700, 2) (700, 2)

As with the Huggingface Dataset Hub dataset Emotions, it is helpful to inspect the content and format of the Multi-Intent dataset in detail by looking at the contents. You can do this by calling a pandas dataframe method to look at the top five rows of the multi-intent train dataset:

multiintent_train.head(5)

A typical output from looking at the top 5 rows of the dataframe, is shown as follows:

Multi intent training dataset top 5 rows, which show the text and its string label, shown as a pandas dataframe

From a detailed look at the training split of the dataset (in the image above), the two columns ‘text’ and ‘intent’ show the data and its labels respectively. Note that the labels are in the form of strings – in contrast to the emotion dataset above, the multi-intent labels have not been encoded at this stage. Check the number and range of labels:

# multiintent_train dataset, intent column, unique values

multiintent_train.intent.unique()

7 unique string labels for multi intent training set (an array of strings)

This gives 7 unique labels: 'PlayMusic', 'AddToPlaylist', 'RateBook', 'SearchScreeningEvent', 'BookRestaurant', 'GetWeather', 'SearchCreativeWork'

3.3 Converting Custom Example .csv files directly into Dataset and DatasetDict objects

The datasets from the 🤗 Datasets library are Apache Arrow files stored on disk; you only keep in memory the samples you ask for.

It is more convenient to apply the various preprocessing and feature engineering steps (e.g. tokenization) to all the train, validation and test sets in one step. Therefore combining the split datasets into a single DatasetDict object means that Dataset.map() functions can be applied across all splits at once. To do this, provide a dictionary to the data_files key word argument in the load_dataset function, that maps each split name (here, “train”, “validate”, “test”) to a .csv file for that particular split [10] https://huggingface.co/course/chapter5/2?fw=pt [11] https://huggingface.co/docs/datasets/package_reference/loading_methods.html?highlight=load_dataset#datasets.load_dataset , as shown in the code snippet:

# create dictionary to include all the multi-intent.csv datasets, for train, 
# validate, test
train_valid_test_files = {"train": "train.csv", 
                          "validate":"valid.csv", 
                          "test": "test.csv"}

# load the three .csvs as a single Huggingface DatasetDict
train_valid_test_dataset = load_dataset('csv', data_files=train_valid_test_files)

# display train_valid_test_dataset object in jupyter notebook
train_valid_test_dataset

When you look at the output (i.e. the train_valid_test_dataset object) in a jupyter notebook, it will have a DatasetDict structure, listing the text itself (with the column heading text) and the labels (with the column heading intent) both as features. Note, at this stage, the Huggingface Transformers Dataset objects have not been given information about what the labels are for training, validation and testing. These labels should be generated from the ‘intent’ column in the case of this particular dataset; a description of how to do this is in the Handling Dataset Labels section of this tutorial.

# output of train_valid_test_dataset is a DatasetDict object:
# note that both 'text' (the data) and 'intent' (the label) are denoted as features

DatasetDict({
    train: Dataset({
        features: ['text', 'intent'],
        num_rows: 13084
    })
    validate: Dataset({
        features: ['text', 'intent'],
        num_rows: 700
    })
    test: Dataset({
        features: ['text', 'intent'],
        num_rows: 700
    })
})

If you choose to convert a single split of the dataset (for example, loading train.csv only) rather than multiple .csv files in a dictionary, train_dataset_example will still be loaded as a DatasetDict:

train_dataset_example = load_dataset('csv', data_files="train.csv")

Printing out train_dataset_example in a python notebook will show that the train_dataset_example object also has the DatasetDict format:

# output of the variable called 'train_dataset_example'

DatasetDict({
    train: Dataset({
        features: ['text', 'intent'],
        num_rows: 13084
    })
})

Note that you can reference an individual Dataset within the DatasetDict by its key. In this case, you can extract the train Dataset by referencing its key:

# sub-select via the 'train' key
train_dataset_example['train']

# look at features
train_dataset_example['train'].features

This gives the following outputs when run as individual cells in a python notebook:

code showing output of the sub-selecting the train key of the huggingface dataset dictionary and its features which are 'text' and 'intent'

Sub-selecting the train key of the Huggingface dataset 🤗 dictionary shows its features and the number of rows, and selecting its features shows the details of ‘text‘ (the data) and ‘intent‘ (which are actually our labels); these correspond to what would be the column names of the equivalent pandas dataframe for ‘train‘. It is notable that, in contrast to the Emotion dataset from the Huggingface Datasets Hub, the train_dataset_example['train'].features attribute does not display a ClassLabel at this stage, as class labels have not been defined. This is covered in the section Creating Encoded Labels from String Labels.

4. Preprocessing the dataset using the 🤗 Huggingface Datasets Library

Now that a DatasetDict object has been created which holds the dataset splits for the Intent Classification dataset, it must be ensured that the Multi Intent Classification dataset labels are in a form which can be processed by Huggingface Transformers 🤗.

4.1 Handling Dataset Labels

The dataset labels, which are in the ‘intent‘ column of the Multi Intent Classification dataset, are strings. The previous section, Inspecting the downloaded .csv files in Pandas Dataframe format, shows these labels as loaded in from the .csv files: 'PlayMusic', 'AddToPlaylist', 'RateBook', 'SearchScreeningEvent', 'BookRestaurant', 'GetWeather', 'SearchCreativeWork'

In order for the Multi Intent dataset to be used to fine-tune a transformer model, its string labels must first be converted to integers to represent classes, i.e. the labels must be encoded. If you try to use the labels as they are, you are likely to get an error along the lines below. The error indicates that converting the label column ‘intent’ to integers failed, because its contents were of the type object (i.e. strings):

ArrowInvalid: ("Could not convert 'PlayMusic' with type str: tried to 
convert to int64", 'Conversion failed for column intent with type object')

Note that it might be tempting to tokenize the labels, as they are strings. For example, multiintent_train['intent'], the column which contains the labels, starts out as strings. Labels themselves (of course!) should not be tokenized, but encoded instead; this is covered in the section called Creating Encoded Labels from String Labels below.

4.1.1 Creating Encoded Labels from String Labels

At this point, we have a Huggingface Dataset or DatasetDict object which has string labels, so the labels must be encoded. The class_encode_column method does this label encoding either across a whole DatasetDict object, or on a single Dataset object (which contains a dataset split like ‘train‘ or ‘test‘) [12] https://huggingface.co/docs/datasets/master/package_reference/main_classes.html#datasets.Dataset.class_encode_column ; it returns an object of the same type as the one it was called on.

from datasets import load_dataset

# class_encode_column can operate on a) an individual split within the
# DatasetDict object (a Dataset object), or b) the whole DatasetDict

# option A) encoding dataset 'train' split only (dataset object)

train_encoded_example = train_dataset_example['train'].class_encode_column(column='intent')

Printing out the train_encoded_example object, gives the following output:

# displaying train_encoded_example

Dataset({
    features: ['text', 'intent'],
    num_rows: 13084
})

When you inspect the train_encoded_example features, the output is as follows. It shows that even though the labels (intent column) have been encoded to 0-6 in the underlying data, the features retain the mapping between the string label names and their integer encodings via a datasets.ClassLabel feature (i.e. a field with a set of named classes which are stored as integers in the dataset):

# displaying train_encoded_example.features

{'text': Value(dtype='string', id=None),
 'intent': ClassLabel(num_classes=7, names=['AddToPlaylist', 'BookRestaurant', 'GetWeather', 'PlayMusic', 'RateBook', 'SearchCreativeWork', 'SearchScreeningEvent'], names_file=None, id=None)}

# option B) Encode the labels on a dataset dictionary (all train, validation, test splits at once)

train_valid_test_dataset_encoded = train_valid_test_dataset.class_encode_column(column='intent')

As with the single dataset encoding, encoding the entire DatasetDict can be carried out at once, but if you wish to view the features of that dataset, you must select a subset of the DatasetDict to view (e.g. the train subset). The following screenshot shows the difference in what the features contain before and after encoding the entire dataset dictionary:

showing features of the training dataset before and after encoding. After encoding the class labels are visible

4.1.2 Handling Dataset Label Column Names

For training the model (see the section called Training the Transformer Model), the trainer object by default expects all the labels to be encoded (as integer values), and listed in a single column named ‘label‘. In that instance, the ‘label_names‘ argument in Training Arguments should also retain its default setting of None – see Setting parameters of transformer model via Training Arguments.

In the multi-intent custom dataset, the label column is called ‘intent‘ (as has been illustrated in Handling Dataset Labels). As a result, there are two options available for handling this:

  • Renaming the multi-intent dataset column from ‘intent‘ to ‘label‘ (preferable); or
  • Explicitly setting the label_names argument in the Training Arguments to ['intent'] (although this creates problems if you wish to use the compute_metrics function easily on the validation and test datasets) – compute_metrics is covered in Running the Training (fine-tuning) of the Model

Where the label column name is not renamed to 'label' or otherwise not handled by setting the label_names argument in the Training Arguments, this will give rise to a key error [13] https://discuss.huggingface.co/t/why-am-i-getting-keyerror-loss/6948 .

4.1.2.1 Renaming the Column

Renaming the 'intent' column to 'label' can be done across the whole dataset dictionary in one go. Note that this renaming method doesn’t operate in-place; instead it returns a new dataset, which you have to assign to a variable [14] https://stackoverflow.com/questions/70263251/valueerror-when-pre-training-bert-model-using-trainer-api :

# rename column in ('old_label', 'new_label') format

train_valid_test_dataset_encoded_label = train_valid_test_dataset_encoded.rename_column('intent', 'label')

The structure of the updated train_valid_test_dataset_encoded_label (with new column called label) becomes as follows:

# output of train_valid_test_dataset_encoded_label

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13084
    })
    validate: Dataset({
        features: ['text', 'label'],
        num_rows: 700
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 700
    })
})

4.1.2.2 Bespoke label names in Training Arguments

The alternative to renaming the label column to ‘label’ is to change the label_names parameter which is passed to Training Arguments (section: Setting parameters of transformer model via Training Arguments). The disadvantage of this, as noted above, is that it creates problems if you wish to use the compute_metrics function easily on the validation and test datasets.

Note that when the label_names are explicitly set, none of them should be named "label" and they should be a list of strings of length >=1, as your model can accept multiple label arguments. Use label_names in your TrainingArguments (section Setting parameters of transformer model via Training Arguments) to indicate their names to the Trainer (section Running the training (fine-tuning) of the Model via Trainer Class).

4.2 Preprocessing the Text Data

Having processed the labels in the dataset, this section covers preprocessing the text samples across the whole dataset dictionary, prior to fine-tuning the model.

4.2.1 Applying the Tokenizer

Given that we are using a BERT based transformer model as a text classifier, a matching BERT based tokenizer should be used. Just like with loading the model, there are two options for instantiating a BERT tokenizer, a specific BertTokenizer (option 1), and a more general AutoTokenizer object (option 2), which takes a BERT checkpoint (training state):

# BertTokenizer is option 1
from transformers import BertTokenizer

# AutoTokenizer is option 2
from transformers import AutoTokenizer

# specify the pretrained 'checkpoint' (i.e. pre-trained weights)
# to load the tokenizer from, here I have chosen bert-base-cased

checkpoint = "bert-base-cased"

# option 1 - BertTokenizer
# (use_fast is an AutoTokenizer argument, so it is not passed here)
bert_cased_tokenizer = BertTokenizer.from_pretrained(checkpoint)

# Or option 2 - AutoTokenizer
bert_cased_tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)

The tokenization method is specific to the model. For example, BERT uses ‘WordPiece’, which divides rarer words into common sub-word units [15] https://arxiv.org/pdf/1609.08144.pdf . Note that the AutoTokenizer is loaded with the parameter use_fast=True. This calls one of the ‘fast’ tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models.
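To give a feel for how a word is divided into sub-word units, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece tokenizers use when applying an existing vocabulary. The function name wordpiece_tokenize and the tiny vocabulary are hypothetical; a real BERT vocabulary holds tens of thousands of entries, and the real tokenizer also handles casing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces that do not start the word carry a '##' prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # no piece matches at this position: the whole word is unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# hypothetical toy vocabulary, for illustration only
toy_vocab = {"play", "##list", "##ing", "un", "##aff", "##able"}

print(wordpiece_tokenize("playlist", toy_vocab))
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```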

4.2.1.1 Setting Padding and Truncation for tokenizer

The bert_cased_tokenizer is applied across the dataset dictionary (all three datasets) by creating a tokenizing function, and then mapping this function across the whole dataset dictionary.

The padding and truncation options are set within this tokenize_function_for_classification function. The Huggingface documentation contains a helpful table summarizing the options to set up padding and truncation [16] https://huggingface.co/docs/transformers/main/en/pad_truncation .

Note that although padding to standardise the length of tokenized text samples must be carried out, it is not carried out at this stage, hence padding=False in the function. This is because padding is performed at a later step, in Data Collator with Padding, on a batch by batch basis.

def tokenize_function_for_classification(example):
    """
    Apply the bert_cased_tokenizer to a batch of examples.

    truncation=True: truncate to the maximum model input length
    padding=False: do not pad here; padding is applied later, per batch
    """
    return bert_cased_tokenizer(example['text'], truncation=True, padding=False)

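If padding were set to True here, each sample would be padded to the longest sample in the batch that map passes to the tokenizer (1,000 examples at a time by default when batched=True), rather than to the longest sample in its own training batch.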
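The motivation for deferring padding can be sketched in plain Python: padding every sample to one common length uses more positions than padding each small batch only to its own longest sample. The token ids, the batch size of 2 and the helper name pad_batch are hypothetical and chosen purely for illustration.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of its longest member."""
    longest = max(len(sequence) for sequence in batch)
    return [sequence + [pad_id] * (longest - len(sequence)) for sequence in batch]

# hypothetical tokenized samples of different lengths
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 7, 8, 9, 10, 11, 12, 102],
]

# padding all samples together: everything grows to the overall longest sample
padded_together = pad_batch(sequences)

# padding batch by batch (batch size 2): each batch grows only to its own longest
batches = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
padded_per_batch = [pad_batch(batch) for batch in batches]

positions_together = sum(len(sequence) for sequence in padded_together)
positions_per_batch = sum(len(sequence) for batch in padded_per_batch for sequence in batch)
print(positions_together, positions_per_batch)
```

The smaller total in the batch-by-batch case is the saving that a padding data collator aims for during training.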

4.2.1.2 Mapping tokenizer over dataset dictionary

The tokenize_function_for_classification function can be mapped across the whole data dictionary (DatasetDict), based on the example in Huggingface [17] https://huggingface.co/course/chapter3/3?fw=pt :

# note that .map includes a batched=True keyword argument

tokenized_multiintent_datasets = train_valid_test_dataset_encoded_label.map(tokenize_function_for_classification, batched=True)

Looking at the fields in train_valid_test_dataset_encoded_label prior to tokenization, these are limited to ['label', 'text']:

# displaying train_valid_test_dataset_encoded_label
# 'text' and 'label' fields only prior to tokenization

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13084
    })
    validate: Dataset({
        features: ['text', 'label'],
        num_rows: 700
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 700
    })
})

After you apply the tokenize_function_for_classification function to the whole dataset dictionary, several new fields (dataset features) [18] https://huggingface.co/docs/datasets/about_dataset_features are created: ['attention_mask', 'input_ids', 'token_type_ids']:

# displaying tokenized_multiintent_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 13084
    })
    validate: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 700
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 700
    })
})

4.2.2 Input IDs, Attention Mask and Token Type IDs

The BERT tokenizer in the previous section adds the dataset feature fields ['attention_mask', 'input_ids', 'token_type_ids']. This section discusses those fields, using the single tokenized training sample below as an example (no padding has been applied at this stage):

# displaying tokenized_multiintent_datasets['train'][11]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'input_ids': [101, 2603, 1142, 1326, 170, 126, 102], 'label': 4, 'text': 'rate this series a 5', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]}

4.2.2.1 Input IDs

Looking at the sample above from the training dataset, the input_ids start with the special token 101 and end with the special token 102. These special tokens mark the beginning and end of a sequence of variable length. Inputs of variable length between training samples are to be expected in most natural language processing tasks, whether sentence classification, translation, summarization or similar.

The tokens between the start and end tokens are numerical encodings that map the input text to entries in the tokenizer vocabulary of word (or sub-word) tokens (in this case, the vocabulary of the “bert-base-cased” tokenizer).

Reversing the text mapping to input_ids shows how the 'text' relates to it [19] https://huggingface.co/docs/transformers/main/en/glossary#input-ids :

tokenized_training_sample = tokenized_multiintent_datasets['train'][11]

decoded_tokenized_training_sample = bert_cased_tokenizer.decode(tokenized_training_sample['input_ids'])

# output of decoded_tokenized_training_sample is '[CLS] rate this series a 5 [SEP]'

BERT uses classifier [CLS] and separator [SEP] tokens (which map to 101 and 102) for a single sequence of text (which is what we have here for our classifier). BERT can also accept pairs of sequences, in which case its special tokens would be [CLS] A [SEP] B [SEP] where A and B are the text sequences. As padding has not yet been applied, there are no padding tokens [PAD]; if there were, they would appear at the end after 102, as BERT models prefer padding on the right of the sequence.

4.2.2.2 Token Type IDs

BERT (and some other transformer models) use token type IDs (also called segment IDs). If there are two sequences in the input to a BERT model (for example, for next sentence prediction), the token type IDs form a mask of 0s and 1s identifying which of the two sequences each token belongs to [20] https://huggingface.co/docs/transformers/main/en/glossary#token-type-ids . Tokens of the first sequence are denoted by 0 and tokens of the second sequence by 1. Some models, like XLNetModel, use an additional token type represented by a 2 [21] https://huggingface.co/docs/transformers/main/en/glossary#token-type-ids .

For a text classification problem such as the multi-intent dataset in this tutorial, there is only one text sequence, so the token_type_ids associated with the sequence in this context is a list of zeros:

tokenized_training_sample = tokenized_multiintent_datasets['train'][11]
tokenized_training_sample['token_type_ids']

# output of the token_type_ids is zero for a single sequence
# [0, 0, 0, 0, 0, 0, 0]
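The layout of special tokens and token type IDs described above can be sketched in plain Python. This illustrates the [CLS] A [SEP] B [SEP] pattern only and is not the tokenizer's implementation; the helper name build_bert_style_inputs and the ids used for the second sequence are hypothetical.

```python
CLS_ID = 101  # [CLS] id, as seen in the tokenized sample
SEP_ID = 102  # [SEP] id, as seen in the tokenized sample


def build_bert_style_inputs(ids_a, ids_b=None):
    """Hypothetical helper: add special tokens and derive token type ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # The second sequence and its closing [SEP] belong to segment 1.
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# Single sequence: the ids for 'rate this series a 5' from the sample.
single = build_bert_style_inputs([2603, 1142, 1326, 170, 126])
print(single["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 0]

# Sequence pair: the second list holds made-up ids purely for illustration.
pair = build_bert_style_inputs([2603, 1142, 1326, 170, 126], [1103, 1273])
print(pair["token_type_ids"])    # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```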

4.2.2.3 Attention Mask

The attention mask is a binary tensor indicating the position of the padded indices, so that the model does not attend to them. For the BertTokenizer, 1 indicates a value that should be attended to, while 0 indicates a padded value. This attention mask is in the dictionary returned by the tokenizer under the key “attention_mask” [22] https://huggingface.co/docs/transformers/main/en/glossary#attention-mask , [23] https://huggingface.co/course/chapter2/5?fw=pt#attention-masks .

At this stage, there is no padding applied to the tokenized samples, so there will be no 0s which indicate padding in the attention mask – note that the attention_mask is all 1s:

# print(tokenized_multiintent_datasets['train'][11])
# where no padding is applied at tokenization stage

{'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'input_ids': [101, 2603, 1142, 1326, 170, 126, 102], 'label': 4, 'text': 'rate this series a 5', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]}

If padding were applied at this stage, the attention mask would instead look as follows, where the parts of the attention_mask which are padded (on the right hand side) are zero:

# print(tokenized_multiintent_datasets['train'][11])
# where padding is applied at tokenization stage to illustrate how this affects the attention mask

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 2603, 1142, 1326, 170, 126, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'label': 4, 'text': 'rate this series a 5', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

4.3 Data Collator with Padding

Padding has been set to False in the tokenizer function, such that dynamic padding by batch can be carried out instead, using the data collator with padding [24] https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding , [25] https://huggingface.co/course/chapter3/2?fw=pt .

We can define a collate function that applies the correct amount of padding to the dataset samples we want to batch together; this is achieved using DataCollatorWithPadding. Note that when collating, padding is applied per batch rather than across all samples: each batch is padded to the maximum sample length within that batch (note: batch size is set in the section Setting parameters of transformer model via Training Arguments):

from transformers import DataCollatorWithPadding

bert_classifier_data_collator = DataCollatorWithPadding(tokenizer=bert_cased_tokenizer)

The BERT model has absolute position embeddings, therefore padding is advised to be on the right, rather than the left https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertTokenizer . Right-hand padding is the default here, and printing the data collator object shows padding_side='right' on the tokenizer it wraps, so right-hand padding is applied at this data collator stage:

# output of 'bert_classifier_data_collator' object

DataCollatorWithPadding(tokenizer=PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')

You can check how DataCollatorWithPadding [26] https://huggingface.co/course/chapter3/2?fw=pt#dynamic-padding pads by first inspecting the lengths of a small batch of samples, as follows:

samples = tokenized_multiintent_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "text"]}
print([len(x) for x in samples["input_ids"]])

# this gives output [17, 16, 18, 10, 16, 18, 10, 12]
# therefore padding of this batch of 8 samples should be to maximum length: 18

Applying the bert_classifier_data_collator to the batch of training samples:

# apply the data collator

batch = bert_classifier_data_collator(samples)
{key: value.shape for key, value in batch.items()}

# output looks like this - all torch tensors padded to 18:

{'attention_mask': torch.Size([8, 18]),
 'input_ids': torch.Size([8, 18]),
 'token_type_ids': torch.Size([8, 18]),
 'labels': torch.Size([8])}
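The padding behaviour shown above can be mimicked in a few lines of plain Python. This is a minimal sketch for intuition, not the DataCollatorWithPadding implementation: it returns lists rather than tensors, the helper names are hypothetical, and the token ids inside each dummy sample are placeholders. Only the eight sequence lengths are taken from the batch inspected above.

```python
PAD_ID = 0  # pad id, matching the zeros in the padded sample shown earlier


def collate_with_dynamic_padding(features):
    """Pad every feature in one batch to the longest input_ids in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "token_type_ids": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padding is appended on the right-hand side.
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["token_type_ids"].append(f["token_type_ids"] + [0] * n_pad)
        batch["labels"].append(f["label"])  # 'label' becomes 'labels' in the batch
    return batch


def make_feature(length, label):
    """Build a dummy tokenized sample of the given length (ids are placeholders)."""
    return {
        "input_ids": [101] + [1000] * (length - 2) + [102],
        "attention_mask": [1] * length,
        "token_type_ids": [0] * length,
        "label": label,
    }


lengths = [17, 16, 18, 10, 16, 18, 10, 12]
features = [make_feature(n, label=0) for n in lengths]
batch = collate_with_dynamic_padding(features)
print({key: len(value[0]) for key, value in batch.items() if key != "labels"})
# {'input_ids': 18, 'attention_mask': 18, 'token_type_ids': 18}
```

The sum of each row of the attention mask still recovers the original sample length, so no information about the real tokens is lost by padding.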

The ‘bert_classifier_data_collator’ will be passed directly to the Trainer object (see the section Transformer model fine-tuning with the Huggingface Trainer API).

5. Transformer model fine-tuning with the Huggingface Trainer API

At this step, we are now ready to run the Huggingface transformer model training – a BERT model, repurposed as a classifier in this case. By ‘model’ we mean a) architecture and b) checkpoint (pre-trained weights). Note here that we are retraining the transformer model with the Huggingface Trainer API.

There are alternatives to the Huggingface Trainer API for fine-tuning this model:

  • Pytorch
  • Tensorflow/Keras

These alternative methods are outside the scope of this particular tutorial, but examples can be found here [27] https://huggingface.co/docs/transformers/custom_datasets , [28] https://huggingface.co/docs/transformers/main/en/tasks/sequence_classification , [29] https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb , [30] https://huggingface.co/docs/transformers/main/en/training#finetune-a-pretrained-model . Keep an eye out for future tutorials on these two alternative approaches.

The process described by this section is:

  • define your training hyperparameters in ‘TrainingArguments’
  • pass the training arguments to a Trainer along with
    • the model,
    • dataset,
    • tokenizer, and
    • data collator,
  • call Trainer.train() to fine-tune your model [31] https://huggingface.co/docs/transformers/master/en/custom_datasets .

This training process is shown in the next sections.

5.1 Setting parameters of transformer model via Training Arguments

Prior to fine-tuning the BERT classifier (which is explained in the section Training (fine-tuning) the Transformer Model), you must set the training arguments which determine training settings and hyperparameters.

The full list of training arguments is available from the Huggingface training arguments documentation [32] https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/trainer#transformers.TrainingArguments , but where non-default parameters are used in this tutorial, they are explained here.

from transformers import TrainingArguments

# if label_names isn't None, it should be a list of strings e.g. label_names = ['intent']
# label_names = None (the default) is preferable and assumes you have already renamed the label column
# training epochs set to 1 during experimentation, fine tuning probably needs 3-5

bert_classifier_training_args = TrainingArguments(
    output_dir = "your_directory_here", 
    label_names = None, 
    per_device_eval_batch_size=8, 
    per_device_train_batch_size=8, 
    num_train_epochs=1 
)

The output_dir contains the path where the configuration is saved as a .json file and model weights are saved as pytorch_model.bin:

[Figure: Saving Huggingface model checkpoints, configurations and model weights to the specified file path]

The label_names training arguments parameter is set above to None. This setting can only be used provided the label column (or field) has been renamed label, as was done in the section Renaming the Column. Internally, label_names eventually defaults to ["labels"], except if the model used is one of the models for question answering, in which case it defaults to ["start_positions", "end_positions"] [33] https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/trainer#transformers.TrainingArguments .

Alternatively, the label_names can be a list of keys (strings) in your dictionary of inputs that correspond to the labels. Here, assuming that the column name had not been changed from intent to label, the label_names parameter should be ["intent"].

Note however, if you set label_names to a list of strings, then the label_ids field of the predictions won’t be populated (label_ids will be None), which makes calculating metrics on the test set a bit more tricky. This is covered in more detail in the section Performance Metrics on the Huggingface Datasets 🤗 Test Set.

The default batch size for training and evaluation is 8 samples per device, but I have set it explicitly via per_device_train_batch_size and per_device_eval_batch_size. The training batch size determines the number of samples processed by the network before its learned internal weights and biases are updated; a single training pass, or epoch, is complete when the model has trained on all batches in the training dataset.

I have explicitly set the number of training epochs num_train_epochs to 1 for initial experimentation.

5.2 Defining the Classifier Model

As with the tokenizer step (in the section Applying the Tokenizer), there are two options for instantiating a BERT-based classifier model: a specific BERT classifier model (option 1), and a more general auto-model classifier (option 2), which also takes a BERT checkpoint (pre-trained weights). Note that the num_labels parameter is set to 7, as the model is being fine-tuned on the Multi-Intent dataset, a multi-class classification task with 7 distinct labels.

# BertForSequenceClassification is option 1 for the model selection
from transformers import BertForSequenceClassification

# AutoModelForSequenceClassification is option 2 for the model selection
from transformers import AutoModelForSequenceClassification

# specify the pretrained 'checkpoint' (i.e. pre-trained weights)
# to load the model from; here I have chosen bert-base-cased

checkpoint = "bert-base-cased"

# option 1 - BertForSequenceClassification

bert_classifier_model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=7)

# Or option 2 - AutoModelForSequenceClassification

bert_classifier_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=7)

The bert_classifier_model object defined here will be used as an input parameter to the Trainer class in the section Running the training (fine-tuning) of the Model via Trainer Class.

5.3 Setting Performance Metrics for Model Training

5.3.1 Choosing Appropriate Metrics

There is a range of natural language processing metrics for Huggingface transformers 🤗 [34] https://huggingface.co/docs/datasets/about_metrics [35] https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/loading_methods#datasets.load_metric available via Huggingface Datasets. Note that in more recent releases of the datasets library, list_metrics() and load_metric() have been deprecated in favour of the separate Huggingface evaluate library, so the calls below apply to older versions such as the one referenced here.

You can check the metrics available in Huggingface Datasets 🤗 by running the list_metrics() method, as shown below. When the parameter with_details is set to True, the details of how each metric is used and calculated are also displayed:

import datasets
datasets.list_metrics(with_details=True)

A detailed description of the required inputs for any given metric can be found by using the inputs_description attribute of a metric object [36] https://huggingface.co/docs/datasets/metrics . Matthews Correlation is taken as an example metric here:

# The metric loading script will instantiate and return a datasets.Metric object.
from datasets import load_metric

matthews_correlation_metric = load_metric('matthews_correlation')

# printing this object gives detailed input requirements including additional  
# settings, where problem is multi-class or multi-label

print(matthews_correlation_metric.inputs_description)

The output of printing inputs_description for the matthews_correlation metric is:

Args:
    predictions: Predicted labels, as returned by a model.
    references: Ground truth labels.
    sample_weight: Sample weights.
Returns:
    matthews_correlation: Matthews correlation.
Examples:

    >>> matthews_metric = datasets.load_metric("matthews_correlation")
    >>> results = matthews_metric.compute(references=[0, 1], predictions=[0, 1])
    >>> print(results)
    {'matthews_correlation': 1.0}
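For intuition about what this metric measures in the multi-class case, here is a from-scratch sketch of the generalised Matthews correlation, computed from class counts. It is not the code behind the Huggingface metric; the function name is hypothetical, and returning 0.0 when the denominator is zero is a convention assumed for this sketch.

```python
import math
from collections import Counter


def matthews_corrcoef_sketch(references, predictions):
    """Multi-class Matthews correlation computed from label counts."""
    n = len(references)
    correct = sum(r == p for r, p in zip(references, predictions))
    true_counts = Counter(references)   # how often each class truly occurs
    pred_counts = Counter(predictions)  # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = n * n - sum(true_counts[k] ** 2 for k in classes)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in classes)
    if cov_tt == 0 or cov_pp == 0:
        return 0.0  # undefined case; 0.0 is the convention assumed here
    return cov_tp / math.sqrt(cov_tt * cov_pp)


# Same inputs as the documented example: perfect agreement.
print(matthews_corrcoef_sketch([0, 1], [0, 1]))  # 1.0
# A made-up three-class example with two of four predictions correct.
print(matthews_corrcoef_sketch([0, 1, 2, 2], [0, 2, 2, 1]))
```

A value of 1.0 means perfect agreement, values near 0 mean the predictions are no better than chance, and -1.0 means complete disagreement in the binary case.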

More information on the metrics available is in the metrics section of the Huggingface documentation.

As this is a multi-class classification task, relevant built-in Huggingface metrics would be:

  • Matthews correlation coefficient (“matthews_correlation“)
  • Accuracy (“accuracy” )
  • Precision (“precision” )
  • Recall (“recall” )
  • F1 (“f1” )

Note that for multi-class classification, additional parameters are required for precision, recall and F1 (accuracy needs none). For example, F1 requires an additional average parameter, which must be specified when computing the metric, as illustrated in the next section, Evaluation during Model training via Metrics.

5.3.2 Evaluation during Model training via Metrics

You can write a compute_metrics() style function to evaluate the model during training [37] https://huggingface.co/course/chapter3/3?fw=pt#evaluation . Here is an example of computing a metric, using the Matthews correlation coefficient:

from datasets import load_metric
import numpy as np

def compute_BERT_classifier_matthews_correlation(eval_preds):

    # metrics for the classifier
    metric = load_metric("matthews_correlation")
    logits = eval_preds.predictions
    labels = eval_preds.label_ids

    # choose prediction with maximum probability from logits via argmax method
    # axis = -1, last dimension
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
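The argmax step in this function can be illustrated without numpy. In the sketch below the logits are made-up values for three samples over seven classes, and argmax_last_axis is a hypothetical helper that mirrors np.argmax(logits, axis=-1) for a list of rows. Because softmax preserves the ordering of the logits, the index of the largest logit is also the index of the most probable class.

```python
# Hypothetical logits for 3 samples over 7 classes (values are made up).
logits = [
    [0.1, 0.3, -1.2, 0.0, 4.2, 0.5, -0.7],
    [2.9, 0.3, -1.2, 0.0, 0.2, 0.5, -0.7],
    [0.1, 0.3, -1.2, 0.0, 0.2, 0.5, 3.3],
]


def argmax_last_axis(rows):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


predictions = argmax_last_axis(logits)
print(predictions)  # [4, 0, 6]
```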

A further example (F1) which requires an additional metric compute parameter average, would be declared as follows:

def compute_BERT_classifier_F1(eval_preds):
    """
    Compute the F1 score for the classifier.

    average: this parameter is required for multiclass/multilabel targets.
    If average = None, the scores for each class are returned.
    If average = 'micro', metrics are calculated globally by counting the total
    true positives, false negatives and false positives.
    """
    # metric for the classifier
    metric = load_metric("f1")
    logits = eval_preds.predictions
    labels = eval_preds.label_ids
    # choose prediction with maximum probability from logits via argmax
    predictions = np.argmax(logits, axis=-1)
    # average='micro' aggregates the counts over all classes
    return metric.compute(predictions=predictions, references=labels, average='micro')
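To make the effect of the average parameter concrete, the following plain-Python sketch computes micro- and macro-averaged F1 from true positive, false positive and false negative counts. It is an illustration under the assumption of single-label multi-class targets, not the metric's own implementation; the function name and the example labels are made up.

```python
def f1_scores_sketch(references, predictions):
    """Return (micro_f1, macro_f1) for single-label multi-class targets."""
    classes = sorted(set(references) | set(predictions))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for ref, pred in zip(references, predictions):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[ref] += 1   # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # macro: compute F1 per class, then take the unweighted mean
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro


references = [0, 0, 0, 0, 1, 2]
predictions = [0, 0, 0, 0, 2, 2]
micro_f1, macro_f1 = f1_scores_sketch(references, predictions)
print(round(micro_f1, 4), round(macro_f1, 4))  # 0.8333 0.5556
```

For single-label multi-class problems the micro-averaged F1 equals plain accuracy (5 of 6 correct here), whereas the macro average is pulled down by the class that is never predicted correctly.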

The compute_BERT_classifier_matthews_correlation() function (or compute_BERT_classifier_F1()) can be passed as the compute_metrics input parameter when instantiating the Trainer class (see the section Running the training (fine-tuning) of the Model via Trainer Class).

5.4 Running the training (fine-tuning) of the Model via Trainer Class

You instantiate an object of the Trainer class, which wraps the model to be trained. Here I have called that training object bert_classifier_trainer. You can see from the code below that the input parameters to bert_classifier_trainer are the previously defined:

  1. Model
  2. Training Arguments
  3. Train and Evaluation datasets
  4. Data Collator
  5. Tokenizer
  6. compute metrics

Other Trainer parameters are left as their default values.

from transformers import Trainer

bert_classifier_trainer = Trainer(
    bert_classifier_model,
    bert_classifier_training_args,
    train_dataset=tokenized_multiintent_datasets["train"],
    eval_dataset=tokenized_multiintent_datasets["validate"],
    data_collator=bert_classifier_data_collator,
    tokenizer=bert_cased_tokenizer,
    compute_metrics=compute_BERT_classifier_matthews_correlation  # or compute_BERT_classifier_F1
)

Training the model via the previously defined bert_classifier_trainer is done with the following command:

bert_classifier_trainer.train()

Calling .train() will output something similar to the following. Note that there is a warning “The following columns… have been ignored: text”. The text field is the original text, but the sequence classification model is trained using the attention_mask and the input_ids (encoded, tokenized text), so the original text field should be ignored and this advisory is nothing to worry about.

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 13084
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1636
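The ‘Total optimization steps’ figure in this log can be reproduced with simple arithmetic. The sketch below is a simplified calculation rather than the Trainer's own code; it assumes a single device and that the final, smaller batch is kept rather than dropped.

```python
import math

num_examples = 13084   # 'Num examples' in the log
train_batch_size = 8   # per_device_train_batch_size
num_devices = 1        # assumption: training on a single device
grad_accum_steps = 1   # 'Gradient Accumulation steps' in the log
num_epochs = 1         # num_train_epochs

# Samples consumed per weight update.
effective_batch = train_batch_size * num_devices * grad_accum_steps
# The last batch is partial (13084 / 8 = 1635.5), so round up.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 1636
```

Raising num_train_epochs to 3 under the same assumptions would give 3 × 1636 = 4908 optimization steps.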

As the model trains, there will be further outputs relating to where model checkpoints are saved.

6. Retrieving Predictions from the fine-tuned transformer model

6.1 Prediction Object

“The output of the predict() method is another named tuple with three fields: predictions, label_ids, and metrics. The metrics field will just contain the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average). Once we complete our compute_metrics() function and pass it to the Trainer, that field will also contain the metrics returned by compute_metrics().” [38] https://huggingface.co/course/chapter3/3?fw=pt

test_predictions = bert_classifier_trainer.predict(tokenized_multiintent_datasets["test"])

The test_predictions object is a named tuple with three fields: predictions, label_ids, and metrics.

test_predictions._fields
# outputs are ('predictions', 'label_ids', 'metrics')

The performance metrics on the test dataset can be accessed from the test_predictions object, as shown in the section Performance Metrics on the Huggingface Datasets Test set.

6.2 Performance Metrics on the Huggingface Datasets 🤗 Test Set

The performance metrics for the test sets are in the test_predictions object. To access any of the test_predictions fields individually, you can do so by accessing the test_predictions attributes, or indexing it:

# option 1: test_predictions attributes
getattr(test_predictions, 'metrics')

# option 2: index test_predictions
test_predictions[2]

Both give the following output in this particular instance (note the added metric test_f1, which appears if the F1 metric was passed to the Trainer; if matthews_correlation is used instead, you will see 'test_matthews_correlation'):

{'test_loss': 0.06436864286661148,
 'test_f1': 0.9842857142857143,
 'test_runtime': 3.262,
 'test_samples_per_second': 214.592,
 'test_steps_per_second': 26.977}
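The two access patterns behave identically because the prediction output is a named tuple. The sketch below uses a hypothetical stand-in, PredictionOutputSketch, with the same three field names; the logits and labels are made up, and the metric values are copied from the output above.

```python
from collections import namedtuple

# Hypothetical stand-in with the same three field names as the predict() output.
PredictionOutputSketch = namedtuple(
    "PredictionOutputSketch", ["predictions", "label_ids", "metrics"]
)

example_output = PredictionOutputSketch(
    predictions=[[0.1, 2.3], [1.7, -0.4]],  # made-up logits
    label_ids=[1, 0],                        # made-up labels
    metrics={
        "test_loss": 0.06436864286661148,
        "test_f1": 0.9842857142857143,
        "test_runtime": 3.262,
    },
)

# Attribute access, getattr and positional indexing return the same object.
by_attribute = example_output.metrics
by_getattr = getattr(example_output, "metrics")
by_index = example_output[2]
print(by_attribute is by_getattr is by_index)  # True

# Keep only the quality metrics, dropping timing entries.
quality = {
    name: value
    for name, value in example_output.metrics.items()
    if not name.endswith(("_runtime", "_per_second"))
}
print(sorted(quality))  # ['test_f1', 'test_loss']
```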

7. Conclusion

This tutorial has been a basic step-by-step guide to loading your own custom dataset into Huggingface Datasets, converting it to Huggingface Transformers friendly format, fine tuning and applying performance metrics.

References

1 https://huggingface.co/course/chapter1/9?fw=pt
2 https://arxiv.org/pdf/2004.
3 https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertTokenizer
4 https://huggingface.co/datasets
5 https://huggingface.co/docs/datasets/master/package_reference/main_classes.html#datasetdict
6 https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=from_pandas#datasets.Dataset.from_pandas
7 https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=datasetdict#datasets.DatasetDict.align_labels_with_mapping
8 https://github.com/sonos/nlu-benchmark/tree/master/2017-06-custom-intent-engines
9 Coucke A. et al., “Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces.” 2018, accepted for a spotlight presentation at the Privacy in Machine Learning and Artificial Intelligence workshop colocated with ICML 2018.
10 https://huggingface.co/course/chapter5/2?fw=pt
11 https://huggingface.co/docs/datasets/package_reference/loading_methods.html?highlight=load_dataset#datasets.load_dataset
12 https://huggingface.co/docs/datasets/master/package_reference/main_classes.html#datasets.Dataset.class_encode_column
13 https://discuss.huggingface.co/t/why-am-i-getting-keyerror-loss/6948
14 https://stackoverflow.com/questions/70263251/valueerror-when-pre-training-bert-model-using-trainer-api
15 https://arxiv.org/pdf/1609.08144.pdf
16 https://huggingface.co/docs/transformers/main/en/pad_truncation
17, 38 https://huggingface.co/course/chapter3/3?fw=pt
18 https://huggingface.co/docs/datasets/about_dataset_features
19 https://huggingface.co/docs/transformers/main/en/glossary#input-ids
20, 21 https://huggingface.co/docs/transformers/main/en/glossary#token-type-ids
22 https://huggingface.co/docs/transformers/main/en/glossary#attention-mask
23 https://huggingface.co/course/chapter2/5?fw=pt#attention-masks
24 https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding
25 https://huggingface.co/course/chapter3/2?fw=pt
26 https://huggingface.co/course/chapter3/2?fw=pt#dynamic-padding
27 https://huggingface.co/docs/transformers/custom_datasets
28 https://huggingface.co/docs/transformers/main/en/tasks/sequence_classification
29 https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb
30 https://huggingface.co/docs/transformers/main/en/training#finetune-a-pretrained-model
31 https://huggingface.co/docs/transformers/master/en/custom_datasets
32, 33 https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/trainer#transformers.TrainingArguments
34 https://huggingface.co/docs/datasets/about_metrics
35 https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/loading_methods#datasets.load_metric
36 https://huggingface.co/docs/datasets/metrics
37 https://huggingface.co/course/chapter3/3?fw=pt#evaluation