Understanding SentencePiece ([Under][Standing][_Sentence][Piece])

Jacky
May 21, 2020 · 11 min read


SentencePiece is used in many cutting-edge NLP models (T5, Reformer, XLNet, ALBERT), so I decided to go into depth to explore what SentencePiece is, how and why it is used in NLP, and how it is then used in the context of transformers. Hopefully, by the end, you will make better sense of what is happening within the brackets in the title.

Table of Contents

  1. NLP Pipeline Refresher
  2. So What Exactly Does Sentencepiece Do?
  3. Subword Regularisation
  4. How Does This All Relate To Transformer Architectures?
  5. How Does SentencePiece Stack Up Against Other Sub-word Generators?
  6. My Thoughts On Future Research
  7. Conclusion / Writing Philosophy

“Sentencepiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing” — SentencePiece Paper

This is really, really dense. I assume the reader has some prior experience with natural language processing. To understand what the above quote means, we first revisit how natural language processing works in general and what the NLP pipeline looks like. If you are already very familiar with the pipeline and are looking for a more thorough dive into the SentencePiece paper, feel free to skip to “So What Does SentencePiece Do?”.

NLP Pipeline Refresher

Let us consider firstly what an example pipeline looks like.

An example of a simple NLP pipeline

The purpose of this pipeline is to turn words into numbers to feed into our model. These numbers simply represent vectors and we have different vectors for different words. An intuitive understanding of vectorised representation of words is explored here by Christopher Olah. Please read this before continuing if you do not already grasp this concept. Now, let us discuss this pipeline to better understand where SentencePiece comes in.

Tokenising refers to splitting up a text document into smaller pieces. We can split a document in a number of different ways, for example on white space, on punctuation, or into subwords.

In the example above (the image), the tokeniser splits on vowels unless the vowel is the second letter of a word. (A strange, absurd rule, I know, but please bear with me as I explain why this is necessary.) Now, there is something important to note here: some of the ‘subwords’ are prefixed with two #’s and others are not. In this tokenisation example, a leading ‘##’ means the piece is a continuation of the previous subword, as illustrated in the sketch below. While this may seem an appropriate thing to do, usually no further distinction is made about where a piece sits within its word (for example, by using multiple #’s to denote its position).
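
To make the ‘##’ convention concrete, here is a small, purely illustrative Python sketch (the pieces and the join_pieces helper are my own made-up example, not part of any tokenisation library) showing how continuation pieces can be glued back into whole words:

```python
# Hypothetical subword pieces using the "##" continuation convention.
pieces = ["token", "##iza", "##tion", "is", "fun"]

def join_pieces(pieces):
    """Rebuild whole words: a piece starting with '##' continues the previous one."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # continuation: append to the previous word
        else:
            words.append(piece)      # otherwise start a new word
    return " ".join(words)

print(join_pieces(pieces))  # -> "tokenization is fun"
```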

From here, we map these separated pieces to numerical representations. This is done via a look-up table: each subword or word has a vector representation that was established during pre-training.

Image from Soumith Chintala and Wojciech Zaremba of Facebook AI Research https://devblogs.nvidia.com/understanding-natural-language-deep-neural-networks-using-torch/

Once we vectorise these pieces (that is, convert them into numerical vectors), we can feed them through a model that has been trained on the output of the same tokeniser.
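
As a rough sketch of this look-up step (the vocabulary, the embedding dimension and the random embedding table below are made up for illustration; in a real model the vectors are learned during pre-training and the vocabulary is far larger):

```python
import numpy as np

# Toy look-up table: each subword maps to an integer ID.
vocab = {"<unk>": 0, "token": 1, "##iza": 2, "##tion": 3, "is": 4, "fun": 5}

# Toy embedding table: one vector (row) per vocabulary entry.
embedding_dim = 4
embeddings = np.random.rand(len(vocab), embedding_dim)

def vectorise(pieces):
    ids = [vocab.get(p, vocab["<unk>"]) for p in pieces]  # subword -> ID
    return embeddings[ids]                                # ID -> vector

vectors = vectorise(["token", "##iza", "##tion", "is", "fun"])
print(vectors.shape)  # (5, 4): one 4-dimensional vector per subword
```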

The rest of the article explains the problems that appear as this “simple” pipeline grows, and the vital role SentencePiece plays in it.

So what does SentencePiece do?

Now that we understand that SentencePiece essentially splits words into several parts called subwords (via the process of tokenisation), we can explore why this particular implementation is popular enough to be used across so many different models.

What I just said was perhaps slightly misleading — it doesn’t just ‘split’ words, it does a bit more than that and I will clarify what I mean. Imagine the following sentence:

“How does sentencepiece work at a fundamental level?”

What is the best way to split this question into different subwords? And how do we even quantify this approach?

To answer questions such as these — let us re-frame the question into a more understandable problem statement:

How do I capture the most frequent and diverse sub-words when I have a fixed vocabulary list?

But wait a minute — why does our word segmentation algorithm need a vocabulary size limit?

To answer this question as intuitively as possible, let us consider the inverse scenario, where we do not set a vocabulary size limit. At some point, you will end up including very rare words that appear only once, or words that are merely slight variations of other words. For example, “simple” and “simplify” are essentially the same word, slightly altered for grammatical purposes. Is it really worth storing an additional vector in memory for each? Furthermore, a word like “jentacular” (I googled unusual English words) is not something we are interested in recording, as it comes up very, very rarely! Hence, we want to place a restriction on the vocabulary list. With SentencePiece, the limit is set as the number of subword pieces allowed in the vocabulary.

Now that the vocabulary limit intuitively makes sense, the goal becomes clear: we want to capture subwords that appear frequently enough to be important, but that are also diverse enough that we are not re-capturing the same information, so that we build up a useful, varied subword vocabulary.

There are a few different algorithms for building such a vocabulary, such as byte-pair encoding (BPE) and the unigram language model, both of which are supported in SentencePiece. (subword-nmt, often mentioned alongside them, is a separate BPE implementation rather than something SentencePiece ships with.)
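
As a minimal sketch of how this looks in practice with the sentencepiece Python package (assuming a recent version; “corpus.txt”, the “example” model prefix and the 8,000-piece vocabulary are placeholder choices):

```python
import sentencepiece as spm

# Train a subword model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # hypothetical raw text file
    model_prefix="example",   # writes example.model and example.vocab
    vocab_size=8000,          # the fixed vocabulary size discussed above
    model_type="unigram",     # or "bpe"
)

# Load the trained model and segment a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="example.model")
print(sp.encode("How does sentencepiece work at a fundamental level?", out_type=str))
```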

I will not go too in-depth into these algorithms, as I found very clear explanations in this blog, which discusses NLP pre-processing in depth (although I found its examination of SentencePiece itself to be quite general). It does, however, note the distinction between SentencePiece as a concept and as a piece of software. So naturally, the next question is: if these are the algorithms, then what exactly is SentencePiece, and what do these algorithms have to do with it?

To answer this question — we explore the underlying components of SentencePiece.

SentencePiece itself comprises four components, which we will go into in more detail: the normaliser, the encoder, the decoder and the trainer.

SentencePiece components (blue) as part of the tokenisation process

Let us go through the components of SentencePiece one by one:

  • The normaliser does not refer to subtracting the mean and dividing by the standard deviation of some number. In NLP, normalisation refers to standardising the words of the text so that they follow a suitable, canonical format. In this case, SentencePiece converts words/letters into their equivalent NFKC Unicode form (e.g. code points such as U+0026), which establishes semantic equivalence by splitting/removing accents, as outlined here. Note, however, as outlined in the normalization README, there is room to try different Unicode normalisation schemes. For those interested, the code implementation of the normaliser can be found here and is implemented in C++.
  • The trainer uses the specified algorithm to build up a vocabulary of subword components. As previously mentioned, SentencePiece supports two main algorithms, BPE and the unigram language model. Each is a fairly simple concept that is clearly explained in other blogs/Wikipedia articles, so I will once again refrain from re-inventing the wheel.
  • The encoder/decoder pair is quite self-explanatory: the encoder performs the pre-processing (normalising the text and tokenising it into subword IDs), while the decoder performs the post-processing (turning IDs back into text). This can be summarised neatly in the equation presented in the SentencePiece paper below:

    Decode(Encode(Normalized(text))) = Normalized(text)

    The paper refers to the above equation as lossless tokenisation: the idea that no information is lost in encoding/decoding. There is a bit of novelty here in how SentencePiece deals with white space. It handles white space by replacing it with the meta symbol ‘▁’ (which looks like an underscore) at the beginning of the word following the white space, allowing the white space to be reconstructed. Upon decoding, it simply replaces every ‘▁’ with a space again. While not entirely lossless if the original text itself contains ‘▁’, I cannot think of a meaningful interpretation of that symbol in ordinary text anyway, and the preservation of white space is more important. A short sketch of this round trip follows the list below.
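
Here is a rough round-trip sketch of the encoder/decoder behaviour described above, reusing the hypothetical “example.model” from the earlier training sketch (the exact pieces printed will depend on the trained vocabulary):

```python
import sentencepiece as spm
import unicodedata

sp = spm.SentencePieceProcessor(model_file="example.model")

raw = "Hello world."
# SentencePiece applies NFKC-style normalisation internally; this standard-library
# call is only to illustrate what that normalisation step conceptually does.
normalised = unicodedata.normalize("NFKC", raw)

pieces = sp.encode(normalised, out_type=str)  # e.g. ['▁He', 'llo', '▁world', '.']
ids = sp.encode(normalised, out_type=int)     # the same segmentation as integer IDs

# White space survives as the '▁' marker at the start of pieces, so decoding
# the IDs reproduces the normalised text exactly: lossless tokenisation.
print(pieces)
print(sp.decode(ids) == normalised)  # True (for text covered by the model)
```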

Subword Regularisation

There is, however, one remaining problem with the tokenisation process: what happens when there are multiple ways to split up a word or sentence given the vocabulary list?

Kudo in Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

If we were to frame this into a definitive problem, it would be:

How do we best split a sentence to ensure that when words are used in the same context, they are matched to the same IDs?

Now that we have defined the problem, let us explore how SentencePiece aims to resolve it. This is largely explored in this paper, but we will go a bit into how Kudo solves it and the intuition behind the solution, as these are useful components of SentencePiece.

Subword regularization aims to “employ multiple subword segmentations to make the NMT model accurate and robust”. Here, NMT simply means “Neural Machine Translation”, referring to the use of neural networks in translating from one language to another in this context. Furthermore, “multiple subword segmentations” refer to our segmentation model considering the different ways in which a sentence can be split.
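
To make “multiple subword segmentations” concrete, here is a short sketch using the sampling interface of the sentencepiece Python package (again reusing the hypothetical “example.model”; the exact splits depend on the trained vocabulary):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="example.model")
text = "How does sentencepiece work?"

# Deterministic (single best) segmentation.
print(sp.encode(text, out_type=str))

# Sampled segmentations: with enable_sampling=True a different split can be
# drawn on each call. nbest_size=-1 samples from all candidate segmentations,
# and alpha is a smoothing parameter (smaller alpha -> more diverse samples).
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```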

Subword regularisation in SentencePiece comprises two components:

  1. A new training algorithm to “integrate multiple segmentation candidates”. This means that, during training, multiple ways of splitting each sentence are taken into consideration: NMT training with on-the-fly subword sampling (the paper keeps this computationally manageable by limiting the number of candidate segmentations considered and sampling among them according to their probabilities). Here, ‘on-the-fly’ means the sampling happens while we are training, for each example, and subword sampling refers to selecting a random segmentation according to a particular probability distribution.
  2. A new subword segmentation algorithm based on language models, which the paper calls ‘n-best decoding’. This simply means that, given the n best segmentations, we choose the one that maximises a specific score; a reconstruction of that score, for those who are interested, is given just after this list.
In that score, ‘x’ is the source sentence, ‘y’ is the target sentence, ‘lambda’ is a regularisation parameter used to penalise shorter sentences, and |y| is the number of words in the target sentence.
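
The formula itself was shown as an image; reconstructing it from the caption above and from Kudo’s subword regularisation paper, the score being maximised should look roughly like this (my transcription, so please check the paper for the authoritative form):

    score(x, y) = log P(y | x) / |y|^λ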

The reader will notice that a penalisation parameter appears in the denominator, aimed at penalising shorter sentences. However, it is not immediately obvious why this is needed. As mentioned by reader Jmkernes, this is a concept that is quite common in NLP, and we can understand the phenomenon by understanding how the probability is calculated in the first place.

The probability of a target sentence is the product of the probabilities of the words predicted in that sentence. Hence, longer sentences mean more multiplications and a lower probability of being selected as the target sentence.
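
In symbols (my own shorthand), if the target sentence y consists of the words y_1, …, y_n, its probability decomposes as a product of per-word probabilities, each of which is at most 1, so every extra factor can only shrink the total:

    P(y | x) = P(y_1 | x) × P(y_2 | x, y_1) × … × P(y_n | x, y_1, …, y_(n-1))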

Therefore, a penalisation term is introduced so that shorter sentences (which involve fewer multiplications and hence have higher probabilities) are only as likely as the longer sentences. The paper states that lambda is optimised on the development data (I am unsure of the exact method, so if any readers know, please inform me and I will happily update this section of the article).

So what does this all have to do with transformer architectures? How does this apply to models like T5/Reformer/XLNet?

It all comes down to how the model is trained in the first place. By first using SentencePiece to turn text into subword IDs and then feeding those IDs through the architecture, you can train the model, and later decode its output IDs back into words.

However, we then need to ask: why can’t you use the same SentencePiece tokeniser for all the different transformer models? This is because different architectures are pre-trained on different texts in the first place. The idea behind transfer learning is to incorporate a lot of pre-training on massive text corpora like Wikipedia and then adapt the resulting model. Because we have different text sources and different subword vocabulary generation algorithms, we end up with different vocabularies and different ID mappings for them. As a result, we end up with different tokenisers for different models.

As an example, BERT is a transformer architecture that uses a tokeniser called ‘BertTokenizer’, which is based on the WordPiece tokeniser. While WordPiece was originally Google’s own internal, closed-source tokeniser, it has been re-created in the transformers library and can be used to extend existing vocabularies.
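
A quick way to see these per-model differences is via the Hugging Face transformers library (a sketch assuming it is installed; “t5-small” and “bert-base-uncased” are just illustrative checkpoints, and the exact pieces depend on each model’s vocabulary):

```python
from transformers import AutoTokenizer

t5_tok = AutoTokenizer.from_pretrained("t5-small")             # SentencePiece-based
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece-based

text = "How does sentencepiece work at a fundamental level?"

# T5 pieces carry the '▁' whitespace marker; BERT continuations carry '##'.
print(t5_tok.tokenize(text))
print(bert_tok.tokenize(text))
```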

How Does SentencePiece Stack Up Against Other Sub-word Generators?

Comparison of different implementations as found on SentencePiece Github repo

I found the above table to be a very concise summarisation of the benefits that SentencePiece offers in comparison to other implementations.

TL;DR: SentencePiece isn’t too bad.

Where I Think Future Research Should Go

I think no discussion of a concept is complete without outlining work for future contributors and where I think research can take place. Here, I discuss three potential improvements that I think need to be made.

  1. I think tokenisation should be based on a minimum frequency criterion for BPE (or a minimum probability criterion for unigram language modelling) rather than a fixed vocabulary size. While I understand the fixed size is there to keep memory requirements bounded (or at least I believe that is why we commonly set a limit such as 32K; please let me know if I am misinformed), I think we should change the criterion from a hard total-count limit to a subword frequency threshold instead.
  2. I also think tokenisation for Asian character-based languages should be researched further. At the time of writing, I am not aware of any intra-character Chinese tokenisation, but I believe it is crucial to improving Chinese language modelling. For example, take the words 说话 (speak), 谈论 (talk) and 告诉 (tell): the characters 说, 话, 谈, 论 and 诉 all share the same left-hand ‘speech’ component (讠). One would expect that, since these words are similar in meaning, it would make sense to group them. I expect there is already research in this field, perhaps published in Chinese. If the reader knows of any, please feel free to link me!
  3. Improving our understanding of subword regularisation and of the measures we need to take to produce robust subwords.

If you enjoyed this, you might enjoy my AI newsletter (The AI Hero) and join me in building AI — https://ai-hero.beehiiv.com/subscribe

Writing/Teaching Philosophy

I share my writing philosophy in order to give exposure to teachers who have helped me improve clarity in the field and offer readers a chance to learn from them too.

My attempt at distilling the information surrounding the paper and code behind transformers’ most popular subword vocabulary generator has largely been inspired by the works of Christopher Olah (known for his recurrent neural network blogs), Chris McCormick (his YouTube channel is great for understanding BERT at a deeper level) and Jeremy Howard (fast.ai for all deep learning resources). The purpose behind writing detailed posts like these is to make the mountain easier to climb and to ensure future AI algorithms are built properly and with care. Please feel free to reach out to me via e-mail if I have helped you or connect with me on LinkedIn here.
