Tokenizer

Workflow
  • Tokenization / Encode - text -> Token IDs
    • typically uses the BPE algorithm
  • Decode - Token IDs -> text (see the round-trip sketch below)
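
A minimal encode/decode round trip with tiktoken, assuming the o200k_base encoding from the table below (any listed encoding works the same way):

```python
import tiktoken

# Load one of the OpenAI byte-level BPE encodings listed below.
enc = tiktoken.get_encoding("o200k_base")

# Encode: text -> Token IDs
ids = enc.encode("Tokenizers map text to token IDs.")
print(ids)

# Decode: Token IDs -> text (round-trips back to the original string)
print(enc.decode(ids))
```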

| model                | vocab size | tokenizer      | notes |
| -------------------- | ---------- | -------------- | ----- |
| openai/gpt2          | 50257      | byte-level-bpe |       |
| openai/r50k_base     | 50257      | byte-level-bpe |       |
| openai/p50k_base     | 50281      | byte-level-bpe |       |
| openai/p50k_edit     | 50283      | byte-level-bpe |       |
| openai/cl100k_base   | 100276     | byte-level-bpe |       |
| openai/o200k_base    | 200018     | byte-level-bpe |       |
| openai/o200k_harmony | 201088     | byte-level-bpe |       |
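
The vocab sizes above can be cross-checked with tiktoken's n_vocab property; a small sketch over the encodings that ship with tiktoken (o200k_harmony may additionally require a recent version):

```python
import tiktoken

# Print the vocabulary size tiktoken reports for each encoding.
for name in ["gpt2", "r50k_base", "p50k_base", "p50k_edit",
             "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: n_vocab={enc.n_vocab}")
```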
  • o200k_harmony
    • vocab: 201088
    • adds extra special tokens on top of o200k_base
| abbr.   | stands for                    | meaning |
| ------- | ----------------------------- | ------- |
| BPE     | Byte Pair Encoding            |         |
| unigram | Unigram                       |         |
| NFKC    | Unicode Normalization Form KC |         |
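
NFKC is the Unicode normalization that SentencePiece's default nmt_nfkc normalizer builds on; a quick illustration with Python's standard unicodedata module:

```python
import unicodedata

# NFKC folds compatibility characters into canonical equivalents,
# e.g. full-width "Ａ" -> "A", ligature "ﬁ" -> "fi", circled "①" -> "1".
for s in ["Ａ", "ﬁ", "①"]:
    print(s, "->", unicodedata.normalize("NFKC", s))
```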

SentencePiece

  • spm_encode
  • spm_decode
  • spm_normalize
  • spm_train
  • spm_export_vocab
```bash
pip install sentencepiece
```
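
The CLI tools above map onto the Python bindings; a minimal train/encode/decode sketch, assuming a one-sentence-per-line corpus at data.txt (hypothetical path):

```python
import sentencepiece as spm

# Train a small model (equivalent to spm_train); writes m.model and m.vocab.
spm.SentencePieceTrainer.train(
    input="data.txt",      # hypothetical corpus, one sentence per line
    model_prefix="m",
    vocab_size=8000,
    model_type="unigram",  # or "bpe"
)

# Load the trained model (equivalent to spm_encode / spm_decode).
sp = spm.SentencePieceProcessor(model_file="m.model")
ids = sp.encode("Hello world", out_type=int)
print(ids)
print(sp.decode(ids))
```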

FAQ

tokenizer.json vs vocab.json

  • tokenizer.json stores the whole tokenizer pipeline (vocab, merges, normalizer, pre-tokenizer) in a single file; vocab.json + merges.txt are the legacy GPT-2-style BPE model files and can be exported from it:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
# Saves the BPE model as vocab.json + merges.txt into the "tokenizer" directory
# (the directory must already exist)
tokenizer.model.save("tokenizer")
```
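
Going the other way, the exported vocab.json + merges.txt pair can be loaded back; a sketch assuming the files were saved to the tokenizer/ directory as above:

```python
from tokenizers import ByteLevelBPETokenizer

# Rebuild a byte-level BPE tokenizer from the exported pair of files.
tokenizer = ByteLevelBPETokenizer("tokenizer/vocab.json", "tokenizer/merges.txt")

enc = tokenizer.encode("hello world")
print(enc.ids)
print(tokenizer.decode(enc.ids))
```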