Hugging Face from Zero to One: A Tutorial from Initialization to Fine-Tuning

林聖晏
21 min read · Jan 6, 2024


This article is a translation, reorganization, and my own understanding of the guides on the official Hugging Face website. If you spot any errors, please point them out.

1.Transformer Models

  • Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.

=> The model's architecture, which defines every layer and every operation in the model

  • Checkpoints: These are the weights that will be loaded in a given architecture.

=> The weights that get loaded into a given architecture

  • Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.

=> The word "model" is less precise, so it is not used much here

EX: BERT is an architecture; a set of weights trained by the Google team for the first release of BERT is a checkpoint.

2.Using Transformers

(Figure from the Hugging Face website)

a.Preprocessing with a tokenizer

Transformer models cannot understand raw text directly, so we first need to convert the text into numbers the model can understand ⇒ this is the tokenizer's job, which does the following:

  • Splits the input into words, subwords, or symbols (such as punctuation), called tokens
  • Maps each token to an integer
  • Adds additional inputs that may be useful to the model

Because this preprocessing must be done in exactly the same way as when the model was pretrained, we first need to download that information from the Model Hub to learn the conversion rules.

We use the AutoTokenizer class and its from_pretrained() method.

Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
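
The paragraphs below describe the output of calling this tokenizer; a minimal sketch of that call, following the course example (raw_inputs is an illustrative pair of sentences):

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
# {'input_ids': tensor([[...], [...]]), 'attention_mask': tensor([[...], [...]])}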

The return_tensors="pt" argument specifies the format of the return value: "pt" for PyTorch tensors, "tf" for TensorFlow tensors; if it is omitted, plain Python lists are returned.

We get back a dict; these numbers then need to be converted into tensors, which is exactly what return_tensors handles.

This dict contains two keys: input_ids and attention_mask.

input_ids contains two rows (one per sentence), each a list of integers uniquely identifying each word, subword, or symbol (i.e. each token).

b.Going through the model

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

The outputs are called hidden states, a.k.a. features.

For each input we get back a high-dimensional vector representing the model's contextual understanding of that input. These outputs are typically fed into another part of the model, known as the head.

In general, the vector output by a Transformer model has three dimensions (see the sketch after this list):

  • Batch size: The number of sequences processed at a time (2 in our example).

=> In other words, the number of sentences fed in at one time

  • Sequence length: The length of the numerical representation of the sequence (16 in our example).

Because truncation=True is used, inputs longer than the model's maximum sequence length get cut down; padding fills the shorter inputs so that every sequence in the batch has the same length (16 here), and the attention_mask marks which tokens are padding.

  • Hidden size: The vector dimension of each model input.
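
A quick sketch of inspecting those three dimensions, assuming the inputs dict produced by the tokenizer step above:

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
# torch.Size([2, 16, 768])  -> (batch size, sequence length, hidden size)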

c.Model heads: Making sense out of numbers

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
#The ** is needed because, in Python, it unpacks the inputs dict so that its keys are passed to model as named (keyword) arguments
  1. model(inputs): passes the entire inputs dict as a single positional argument. This is appropriate when the function expects one dict as its input.
  2. model(**inputs): unpacks the inputs dict into multiple named arguments. If inputs contains {'input_ids': ..., 'attention_mask': ...}, then model(**inputs) is equivalent to model(input_ids=..., attention_mask=...). This is the form to use when the function expects several named parameters rather than a single dict.

If you print the outputs directly, you will notice that the raw values are not necessarily meaningful on their own.

These are not probabilities but logits, the raw, unnormalized scores output by the model's last layer. To be converted into probabilities, they need to go through a SoftMax.

dim=-1 means the softmax is applied over the last dimension of the input tensor.

The id2label attribute of the model configuration can be inspected via model.config.id2label.
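
A minimal sketch of the SoftMax and id2label steps just described, assuming outputs comes from the sequence-classification model above:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

print(model.config.id2label)
# {0: 'NEGATIVE', 1: 'POSITIVE'}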

⇒ From this we can tell:

  • First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
  • Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005

d.Creating a Transformer (architecture only, no pretrained weights)

Before initializing a model, we first need to load a configuration object.

from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)

You can inspect its contents by printing the config:

print(config)

BertConfig {
  [...]
  "hidden_size": 768,              (size of the hidden_states vector)
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,         (number of layers the Transformer model has)
  [...]
}

A model created this way (randomly initialized) can be used directly, but its outputs will be gibberish because the weights are random.

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

You can also load a pretrained model directly; its weights are downloaded and stored in a cache folder so that later calls can reuse them.

model.save_pretrained("directory_on_my_computer")

The model can also be saved for later reuse, which produces these two files:

  • config.json

⇒ Stores the config, i.e. some metadata about the model

  • pytorch_model.bin

⇒ The model's weights

e.Tokenizers

The primary goal is to turn raw text into numbers; below are several ways to do it.

1.Word-based

(Figure from the Hugging Face website)

Each word is assigned a unique ID, from 0 to n.

At the same time, we also need a token to represent words that are not in our token-to-ID table, usually "[UNK]" or "<unk>".
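
For the splitting step itself, the course uses Python's split() as the simplest word-based tokenizer:

tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
# ['Jim', 'Henson', 'was', 'a', 'puppeteer']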

2.Character-based

(Figure from the Hugging Face website)

The tokenizer splits the raw text into characters, which brings two benefits:

  • The vocabulary is much smaller.
  • There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

In English this kind of split is less meaningful, since a single character does not carry much meaning on its own, but it works better for a language like Chinese.

3.Subword tokenization

Split a word into smaller pieces that still carry meaning on their own, somewhat like splitting off prefixes and suffixes, e.g.

“annoyingly” ⇒ “annoying” + “ly”

(Figure from the Hugging Face website)

Beyond these basic approaches, there are many other splitting methods:

  • Byte-level BPE, as used in GPT-2
  • WordPiece, as used in BERT
  • SentencePiece or Unigram, as used in several multilingual models

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

from transformers import AutoTokenizer
# Compared with a model-specific class, AutoTokenizer is more general
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Save the tokenizer
tokenizer.save_pretrained("directory_on_my_computer")

f.How the tokenizer converts tokens into numbers

Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

⇒ splitting + conversion

1.Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
#['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

2.From tokens to input IDs

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
# [7993, 170, 11303, 1200, 2443, 1110, 3014]

3.Decoding

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

#'Using a Transformer network is simple'

g.Handling multiple sequences

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

Sending only one sentence as input like this will fail, because Transformers models expect multiple sequences (multiple sentences, i.e. a batch) by default.

So we need to use batching.

Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:

⇒ batched_ids = [ids, ids]
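
A minimal sketch of that fix, following the course: an extra pair of brackets gives the tensor a batch dimension, and the call no longer fails:

input_ids = torch.tensor([ids])  # note the extra brackets: a batch containing one sequence
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)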

h.Padding the inputs

This is needed when the sentences in a batch have different lengths.

We therefore fill the shorter sentences with a padding token so that all input sequences have the same length:

padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

Note, however, that simply adding a padding token does not make the model produce the same logits for the padded sentence as it would for that sentence on its own.

This is because Transformer models use attention layers to contextualize each token, so the padding tokens affect the context and change what the model computes.

⇒ We need an attention mask to tell the attention layers to ignore these padding tokens.

i.Attention masks

An attention mask is a tensor with exactly the same shape as the input, containing only 1s and 0s; a 0 means the attention layers should ignore the corresponding (padding) token.
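
A minimal sketch continuing the toy batched_ids example above, where the 0 in the mask covers the padding token (tokenizer and model are the ones loaded in section g):

import torch

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)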

j.Longer sequences

An ordinary Transformer model can only handle inputs up to a limited length, so there are two ways to deal with longer inputs:

  • Use a model with a longer supported sequence length.

⇒ Use a model designed for long inputs; Longformer is one example, and LED is another.

  • Truncate your sequences.

⇒ Limit the maximum input length by specifying a max_sequence_length value:

sequence = sequence[:max_sequence_length]

3.Fine-Tuning a Pretrained Model

a.Processing the data

import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
# Label both sentences in the batch as 1 (presumably the positive class)
batch["labels"] = torch.tensor([1, 1])

# Only a single optimization step is performed here
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

b.Load dataset

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
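
Printing raw_datasets shows a DatasetDict with train/validation/test splits; a rough sketch of what one row looks like (the sentence strings are elided here):

raw_train_dataset = raw_datasets["train"]
print(raw_train_dataset[0])
# {'sentence1': '...', 'sentence2': '...', 'label': 1, 'idx': 0}
print(raw_train_dataset.features["label"])
# ClassLabel(names=['not_equivalent', 'equivalent'], ...)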

c.Preprocessing a dataset

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
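
The Trainer code in section e below relies on a tokenized_datasets object; a minimal sketch of building it with Dataset.map, following the pattern in the course:

def tokenize_function(example):
    # Tokenize the two sentences together as a pair
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)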

d.Padding the input

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The code above pads the sentences in each batch to the appropriate length (the length of the longest sentence in that batch). This is more efficient than simply padding every input to the maximum length across all samples, which wastes time.
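
A quick sketch of this dynamic padding in action, following the course (the dropped column names come from the MRPC dataset loaded above):

samples = tokenized_datasets["train"][:8]
# Drop the string/index columns that cannot be turned into tensors
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})
# Every tensor is padded to the longest of these 8 samples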

e.Fine-tuning a model with the Trainer API

Before defining a Trainer, we first need to define a TrainingArguments object.

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

# Load a pretrained BERT
from transformers import AutoModelForSequenceClassification
# This BERT has not been pretrained on classifying pairs of sentences,
# so a new classification head is added on top of it
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

from transformers import Trainer
# Define the trainer first
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Run fine-tuning
trainer.train()

The code above only reports the loss, which is not an effective way to evaluate training, for two reasons:

  1. We didn’t tell the Trainer to evaluate during training by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).

⇒ We did not set evaluation_strategy, so no evaluation is run during training (for example at the end of each epoch)

  2. We didn’t provide the Trainer with a compute_metrics() function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).

⇒ We did not give the Trainer a compute_metrics() function to directly measure how good the model is, so only the loss is available

f.Evaluation

We need to build a compute_metrics() function, which takes an EvalPrediction object.

predictions = trainer.predict(tokenized_datasets["validation"])

import numpy as np
import evaluate

preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

This gives an accuracy of 85.78% on the validation set.

⇒ With these pieces, we can build compute_metrics:

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

⇒ and pass it to the Trainer via the compute_metrics argument:

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
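
To address point 1 above (no evaluation during training), the course also sets evaluation_strategy when building the TrainingArguments; a minimal sketch:

from transformers import TrainingArguments

# Evaluate (and report compute_metrics) at the end of every epoch
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

# Rebuild the Trainer with these arguments as above, then fine-tune again:
# trainer.train()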
