Hugging Face from Zero to One: A Tutorial from Initialization to Fine-Tuning

林聖晏
21 min read · Jan 6, 2024


This article is a translation, reorganization, and my own understanding of the guides on the official Hugging Face website. If you spot any errors, please point them out.

1.Transformer Models

  • Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.

=> The model's architecture, which defines every layer and every operation in the model

  • Checkpoints: These are the weights that will be loaded in a given architecture.

=> The weights that get loaded into a given architecture

  • Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.

=> The word "model" is less precise, so it is not used much here

EX: BERT is an architecture; a set of weights trained by the Google team for the first release of BERT is a checkpoint.

2.Using Transformers

(Figure from the Hugging Face website)

a.Preprocessing with a tokenizer

Transformer models cannot understand raw text directly, so we first need to convert the text into numbers the model can understand ⇒ this is the tokenizer's job, which does the following:

  • Splits the input into words, subwords, or symbols (such as punctuation), called tokens
  • Maps each token to an integer
  • Adds additional inputs that may be useful to the model

Because this preprocessing must be done in exactly the same way as when the model was pretrained, we first need to download that information from the Model Hub to learn the conversion rules.

We use the AutoTokenizer class and its from_pretrained() method.

Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
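
The paragraphs below describe the output of calling this tokenizer; a minimal sketch of that call, following the course example (raw_inputs is an illustrative pair of sentences):

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
# {'input_ids': tensor([[...], [...]]), 'attention_mask': tensor([[...], [...]])}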

The return_tensors="pt" argument specifies the format of the return value: "pt" for PyTorch tensors, "tf" for TensorFlow tensors; if it is omitted, plain Python lists are returned.

We get back a dict; these numbers then need to be converted into tensors, which is exactly what return_tensors handles.

This dict contains two keys: input_ids and attention_mask.

input_ids contains two rows (one per sentence), each a list of integers uniquely identifying each word, subword, or symbol (i.e. each token).

b.Going through the model

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

The outputs are called hidden states, a.k.a. features.

For each input we get back a high-dimensional vector representing the model's contextual understanding of that input. These outputs are typically fed into another part of the model, known as the head.

In general, the vector output by a Transformer model has three dimensions (see the sketch after this list):

  • Batch size: The number of sequences processed at a time (2 in our example).

=> In other words, the number of sentences fed in at one time

  • Sequence length: The length of the numerical representation of the sequence (16 in our example).

Because truncation=True is used, inputs longer than the model's maximum sequence length get cut down; padding fills the shorter inputs so that every sequence in the batch has the same length (16 here), and the attention_mask marks which tokens are padding.

  • Hidden size: The vector dimension of each model input.
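
A quick sketch of inspecting those three dimensions, assuming the inputs dict produced by the tokenizer step above:

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
# torch.Size([2, 16, 768])  -> (batch size, sequence length, hidden size)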

c.Model heads: Making sense out of numbers

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
#The ** is needed because, in Python, it unpacks the inputs dict so that its keys are passed to model as named (keyword) arguments
  1. model(inputs): passes the entire inputs dict as a single positional argument. This is appropriate when the function expects one dict as its input.
  2. model(**inputs): unpacks the inputs dict into multiple named arguments. If inputs contains {'input_ids': ..., 'attention_mask': ...}, then model(**inputs) is equivalent to model(input_ids=..., attention_mask=...). This is the form to use when the function expects several named parameters rather than a single dict.

If you print the outputs directly, you will notice that the raw values are not necessarily meaningful on their own.

These are not probabilities but logits, the raw, unnormalized scores output by the model's last layer. To be converted into probabilities, they need to go through a SoftMax.

dim=-1 means the softmax is applied over the last dimension of the input tensor.

The id2label attribute of the model configuration can be inspected via model.config.id2label.
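
A minimal sketch of the SoftMax and id2label steps just described, assuming outputs comes from the sequence-classification model above:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

print(model.config.id2label)
# {0: 'NEGATIVE', 1: 'POSITIVE'}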

⇒ From this we can tell:

  • First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
  • Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005

d.Creating a Transformer (architecture only, no pretrained weights)

Before initializing a model, we first need to load a configuration object.

from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)

You can inspect its contents by printing the config:

print(config)

BertConfig {
  [...]
  "hidden_size": 768,              (size of the hidden_states vector)
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,         (number of layers the Transformer model has)
  [...]
}

A model created this way (randomly initialized) can be used directly, but its outputs will be gibberish because the weights are random.

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

You can also load a pretrained model directly; its weights are downloaded and stored in a cache folder so that later calls can reuse them.

model.save_pretrained("directory_on_my_computer")

The model can also be saved for later reuse, which produces these two files:

  • config.json

⇒ Stores the config, i.e. some metadata about the model

  • pytorch_model.bin

⇒ The model's weights

e.Tokenizers

The primary goal is to turn raw text into numbers; below are several ways to do it.

1.Word-based

(Figure from the Hugging Face website)

Each word is assigned a unique ID, from 0 to n.

At the same time, we also need a token to represent words that are not in our token-to-ID table, usually "[UNK]" or "<unk>".
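
For the splitting step itself, the course uses Python's split() as the simplest word-based tokenizer:

tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
# ['Jim', 'Henson', 'was', 'a', 'puppeteer']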

2.Character-based

(Figure from the Hugging Face website)

The tokenizer splits the raw text into characters, which brings two benefits:

  • The vocabulary is much smaller.
  • There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

In English this kind of split is less meaningful, since a single character does not carry much meaning on its own, but it works better for a language like Chinese.

3.Subword tokenization

Split a word into smaller pieces that still carry meaning on their own, somewhat like splitting off prefixes and suffixes, e.g.

“annoyingly” ⇒ “annoying” + “ly”

(Figure from the Hugging Face website)

Beyond these basic approaches, there are many other splitting methods:

  • Byte-level BPE, as used in GPT-2
  • WordPiece, as used in BERT
  • SentencePiece or Unigram, as used in several multilingual models

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

from transformers import AutoTokenizer
# Compared with a model-specific class, AutoTokenizer is more general
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Save the tokenizer
tokenizer.save_pretrained("directory_on_my_computer")

f.How the tokenizer converts tokens into numbers

Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

⇒ splitting + conversion

1.Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
#['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

2.From tokens to input IDs

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
# [7993, 170, 11303, 1200, 2443, 1110, 3014]

3.Decoding

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

#'Using a Transformer network is simple'

g.Handling multiple sequences

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

Sending only one sentence as input like this will fail, because Transformers models expect multiple sequences (multiple sentences, i.e. a batch) by default.

So we need to use batching.

Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:

⇒ batched_ids = [ids, ids]
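
A minimal sketch of that fix, following the course: an extra pair of brackets gives the tensor a batch dimension, and the call no longer fails:

input_ids = torch.tensor([ids])  # note the extra brackets: a batch containing one sequence
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)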

h.Padding the inputs

This is needed when the sentences in a batch have different lengths.

We therefore fill the shorter sentences with a padding token so that all input sequences have the same length:

padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

Note, however, that simply adding a padding token does not make the model produce the same logits for the padded sentence as it would for that sentence on its own.

This is because Transformer models use attention layers to contextualize each token, so the padding tokens affect the context and change what the model computes.

⇒ We need an attention mask to tell the attention layers to ignore these padding tokens.

i.Attention masks

An attention mask is a tensor with exactly the same shape as the input, containing only 1s and 0s; a 0 means the attention layers should ignore the corresponding (padding) token.
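
A minimal sketch continuing the toy batched_ids example above, where the 0 in the mask covers the padding token (tokenizer and model are the ones loaded in section g):

import torch

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)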

j.Longer sequences

An ordinary Transformer model can only handle inputs up to a limited length, so there are two ways to deal with longer inputs:

  • Use a model with a longer supported sequence length.

⇒ Use a model designed for long inputs; Longformer is one example, and LED is another.

  • Truncate your sequences.

⇒ Limit the maximum input length by specifying a max_sequence_length value:

sequence = sequence[:max_sequence_length]

3.Fine-Tuning a Pretrained Model

a.Processing the data

import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
# Label both sentences in the batch as 1 (presumably the positive class)
batch["labels"] = torch.tensor([1, 1])

# Only a single optimization step is performed here
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

b.Load dataset

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
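
Printing raw_datasets shows a DatasetDict with train/validation/test splits; a rough sketch of what one row looks like (the sentence strings are elided here):

raw_train_dataset = raw_datasets["train"]
print(raw_train_dataset[0])
# {'sentence1': '...', 'sentence2': '...', 'label': 1, 'idx': 0}
print(raw_train_dataset.features["label"])
# ClassLabel(names=['not_equivalent', 'equivalent'], ...)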

c.Preprocessing a dataset

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
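
The Trainer code in section e below relies on a tokenized_datasets object; a minimal sketch of building it with Dataset.map, following the pattern in the course:

def tokenize_function(example):
    # Tokenize the two sentences together as a pair
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)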

d.Padding the input

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The code above pads the sentences in each batch to the appropriate length (the length of the longest sentence in that batch). This is more efficient than simply padding every input to the maximum length across all samples, which wastes time.
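
A quick sketch of this dynamic padding in action, following the course (the dropped column names come from the MRPC dataset loaded above):

samples = tokenized_datasets["train"][:8]
# Drop the string/index columns that cannot be turned into tensors
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})
# Every tensor is padded to the longest of these 8 samples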

e.Fine-tuning a model with the Trainer API

Before defining a Trainer, we first need to define a TrainingArguments object.

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

# Load a pretrained BERT
from transformers import AutoModelForSequenceClassification
# This BERT has not been pretrained on classifying pairs of sentences,
# so a new classification head is added on top of it
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

from transformers import Trainer
# Define the trainer first
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Run fine-tuning
trainer.train()

The code above only reports the loss, which is not an effective way to evaluate training, for two reasons:

  1. We didn’t tell the Trainer to evaluate during training by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).

⇒ We did not set evaluation_strategy, so no evaluation is run during training (for example at the end of each epoch)

  2. We didn’t provide the Trainer with a compute_metrics() function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).

⇒ We did not give the Trainer a compute_metrics() function to directly measure how good the model is, so only the loss is available

f.Evaluation

We need to build a compute_metrics() function, which takes an EvalPrediction object.

predictions = trainer.predict(tokenized_datasets["validation"])

import numpy as np
import evaluate

preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

This gives an accuracy of 85.78% on the validation set.

⇒ With these pieces, we can build compute_metrics:

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

⇒ and pass it to the Trainer via the compute_metrics argument:

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
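
To address point 1 above (no evaluation during training), the course also sets evaluation_strategy when building the TrainingArguments; a minimal sketch:

from transformers import TrainingArguments

# Evaluate (and report compute_metrics) at the end of every epoch
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

# Rebuild the Trainer with these arguments as above, then fine-tune again:
# trainer.train()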
