This article is the result of translating, organizing, and digesting the guides on the official Hugging Face website; corrections are welcome if anything is wrong.
1.Transformer Models
- Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
=> The model's architecture: the definition of every layer and every operation inside the model
- Checkpoints: These are the weights that will be loaded in a given architecture.
=> The weights that get loaded into that architecture
- Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.
=> The word "model" is not that precise, so it is not used much here
EX : BERT is an architecture, while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint
2.Using Transformers
a.Preprocessing with a tokenizer
Since Transformer models cannot process raw text directly, we first need to convert the text into numbers the model can understand ⇒ this is the tokenizer's job, which does the following:
- Splits the input into words, subwords, or symbols (like punctuation), called tokens
- Maps each token to an integer
- Adds additional inputs that may be useful to the model
Because this preprocessing must be done in exactly the same way as when the model was pretrained, we first need to download that information from the Model Hub to know the conversion rules
⇒ Use the AutoTokenizer class and its from_pretrained() method
Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
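The rest of these notes refer to an inputs dict; a minimal sketch of how it is built, using the course's two example sentences:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Tokenize both sentences at once, pad/truncate them, and return PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)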
Here return_tensors="pt" specifies the format of the returned values: "pt" for PyTorch tensors, "tf" for TensorFlow tensors; leaving it out returns plain lists
We get back a dict, and the numbers inside need to be converted into tensors (handled here by return_tensors)
The dict contains two keys, input_ids and attention_mask. input_ids is a list with two rows (one per sentence), containing the unique IDs of each word, subword, or symbol (also called tokens)
b.Going through the model
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
The outputs are called hidden states, a.k.a. features.
For each input we get back a high-dimensional vector representing the model's contextual understanding of that input; these outputs usually serve as input to another part of the model, known as the head
In general, the vector output by the Transformer model has three dimensions (a quick shape check follows the list):
- Batch size: The number of sequences processed at a time (2 in our example).
=> i.e., the number of "sentences" fed in at a time
- Sequence length: The length of the numerical representation of the sequence (16 in our example).
Because truncation=True is used, any input longer than the model's maximum sequence length is cut down; padding then makes all inputs in the batch the same length (16 here, the length of the longest sequence), and attention_mask marks which tokens are padding
- Hidden size: The vector dimension of each model input.
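A quick way to check these dimensions, assuming the inputs dict built by the tokenizer sketch above:
outputs = model(**inputs)
# For the two example sentences this prints torch.Size([2, 16, 768])
print(outputs.last_hidden_state.shape)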
c.Model heads: Making sense out of numbers
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
#The ** is needed because, in Python, it unpacks the inputs dict so that its keys are passed to model as named arguments
model(inputs): passes the whole inputs dict as a single positional argument to model. This is appropriate only if the function expects a single dict as input.
model(**inputs): unpacks the inputs dict into multiple named arguments. If inputs contains {'input_ids': ..., 'attention_mask': ...}, then model(**inputs) is equivalent to model(input_ids=..., attention_mask=...). This form is used when the function expects several named parameters rather than a single dict.
If you print the outputs directly, you can see that the raw values are not meaningful by themselves
These are not probabilities but logits, the raw, unnormalized scores output by the model's last layer. To be converted into probabilities, they need to go through a SoftMax layer
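A minimal sketch of that conversion, assuming the outputs returned by the sequence-classification model above:
import torch
# Turn the logits into probabilities along the last dimension
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)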
dim=-1 means the softmax is applied along the last dimension of the input tensor
The id2label attribute of the model's config can be inspected via model.config.id2label
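For example (for this checkpoint the mapping should be 0 → NEGATIVE, 1 → POSITIVE):
print(model.config.id2label)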
⇒ From this we can read off:
- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
d.Creating a Transformer (architecture only, not pre-trained)
Before initializing the model, a configuration object needs to be loaded first
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
You can see its contents by printing the config
⇒ print(config)
BertConfig {
  [...]
  "hidden_size": 768,                # size of the hidden_states vector
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,           # number of layers the Transformer model has
  [...]
}
A freshly initialized model like this can be used directly, but its outputs will be random
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
You can also load a pre-trained model directly; its weights are downloaded and stored in a cache folder for later reuse
model.save_pretrained("directory_on_my_computer")
The model can also be saved for later reuse, which produces these two files (a reload sketch follows the list):
- config.json ⇒ stores the configuration, i.e. some metadata about the architecture
- pytorch_model.bin ⇒ the model's weights
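To load the saved model back later, a minimal sketch (the directory name is the placeholder used above):
from transformers import BertModel
# Reads the architecture from config.json and the weights from pytorch_model.bin
model = BertModel.from_pretrained("directory_on_my_computer")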
e.Tokenizers
The primary goal is to convert raw text into numbers; below are several ways to do it
1.Word-based
Each word is assigned a unique ID, from 0 to n
We also need a token to represent words that are not in our token-to-ID table, usually "[UNK]" or "<unk>"
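A minimal word-based split, e.g. on whitespace (the sentence is just an illustrative example):
# Naive whitespace tokenization
tokens = "Jim Henson was a puppeteer".split()
print(tokens)  # ['Jim', 'Henson', 'was', 'a', 'puppeteer']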
2.Character-based
The tokenizer splits raw text into characters, which has two main benefits:
- The vocabulary is much smaller.
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.
In English such a split is not very meaningful, because a single character carries little meaning on its own, but it works better for languages such as Chinese
3.Subword tokenization
Split a word into smaller pieces that each carry meaning on their own, somewhat like splitting off prefixes and suffixes, e.g.
"annoyingly" ⇒ "annoying" + "ly"
Besides these basic approaches, there are many other splitting schemes:
- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
from transformers import AutoTokenizer
# More general than a model-specific class like BertTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Save this tokenizer
tokenizer.save_pretrained("directory_on_my_computer")
f.How the tokenizer converts tokens into numbers
Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.
⇒ splitting + conversion
1.Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
#['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
2.From tokens to input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# [7993, 170, 11303, 1200, 2443, 1110, 3014]
3.Decoding
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
#'Using a Transformer network is simple'
g.Handling multiple sequences
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)
Sending just one sequence as input like this fails, because Transformers models expect multiple sequences (a batch of sentences) by default
So we need batching
Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:
⇒ batched_ids = [ids, ids]
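A minimal sketch of building such a batch, reusing ids and model from above; wrapping ids in an outer list adds the batch dimension the model expects:
import torch
input_ids = torch.tensor([ids])  # shape: (1, sequence_length)
output = model(input_ids)
print(output.logits)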
h.Padding the inputs
This is needed when the sentences in a batch have different lengths
We therefore pad the shorter sentences with a padding token so that all input sentences have the same length
padding_id = 100
batched_ids = [
[200, 200, 200],
[200, 200, padding_id],
]
Note, however, that simply inserting padding tokens does not make the padded sentence produce the same logits as it would on its own
This is because Transformer models use attention layers to contextualize each token, so the padding tokens influence the context and change the model's understanding
⇒ We need an attention mask to tell the attention layers to ignore these padding tokens
i.Attention masks
An attention mask is a tensor with exactly the same shape as the input, containing only 1s and 0s; a 0 means the attention layer should ignore the corresponding padding token
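A minimal sketch, reusing the model and tokenizer loaded in section g; note that a real batch should pad with tokenizer.pad_token_id rather than the arbitrary placeholder 100 used above:
import torch
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],  # 0 tells the attention layers to ignore the padding position
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)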
j.Longer sequences
Ordinary Transformer models can only handle inputs up to a limited length, so there are two ways to deal with longer inputs:
- Use a model with a longer supported sequence length.
⇒ Some models are designed to handle long inputs; Longformer is one example, and another is LED.
- Truncate your sequences.
⇒ Limit the maximum input length by cutting the sequence down to max_sequence_length:
sequence = sequence[:max_sequence_length]
3.Fine-Tuning a Pretrained Model
a.Processing the data
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification
# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# This is new
#Here both sentences in the batch are labeled 1, presumably meaning positive
batch["labels"] = torch.tensor([1, 1])
#Only a single optimization step is performed here
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
b.Load dataset
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
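A quick way to inspect what was loaded:
# Show the splits and peek at one training example
print(raw_datasets)
print(raw_datasets["train"][0])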
c.Preprocessing a dataset
from transformers import AutoTokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
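The Trainer code later in these notes uses tokenized_datasets, which is built by tokenizing each sentence pair together and applying that function to the whole dataset with map; a minimal sketch following the course:
def tokenize_function(example):
    # Tokenize sentence1 and sentence2 as a pair so the model sees them jointly
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)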
d.Padding the input
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
The code above pads each batch to the appropriate length (the length of the longest sentence in that batch), rather than plain padding, which pads every input to the maximum length in the dataset and wastes time and compute
e.Fine-tuning a model with the Trainer API
Before defining the Trainer, we first need to define a TrainingArguments class
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")
#Load a pretrained BERT
from transformers import AutoModelForSequenceClassification
#This BERT has not been pretrained on classifying pairs of sentences
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
from transformers import Trainer
#First define the trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
#Run fine-tuning
trainer.train()
The code above only reports the loss, which does not tell us much about how well the model is training, because
- We didn't tell the Trainer to evaluate during training by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).
⇒ evaluation_strategy was not set, so no evaluation is run at the end of each epoch (see the sketch after this list)
- We didn't provide the Trainer with a compute_metrics() function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).
⇒ No compute_metrics() function was given to the Trainer to directly measure how good the model is, so we only see the loss
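Addressing the first point, a minimal sketch using the argument name from the course (newer transformers releases rename it to eval_strategy):
from transformers import TrainingArguments
# Evaluate at the end of every epoch instead of only reporting the training loss
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")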
f.Evaluation
We need to build a compute_metrics() function that takes an EvalPrediction object
predictions = trainer.predict(tokenized_datasets["validation"])
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
import evaluate
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
This gives an accuracy of 85.78% on the validation set
⇒ With these pieces in place, we can define compute_metrics
def compute_metrics(eval_preds):
metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
⇒ and pass it to the Trainer via the compute_metrics argument
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
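With compute_metrics passed in, training can simply be launched again and the metric will be reported at each evaluation:
trainer.train()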