LLM 101 - Transformers, Encoders, Tokens

Oliver Jack Dean

What is a Large Language Model (LLM)?

Transformer-derived LLMs represent an exciting advancement in AI. And we shall begin here.

These models capture the intricacies and nuances of communication through their ability to comprehend and generate human language.

As a result of their extensive training on a wide range of text datasets, Transformer-derived LLMs perform diverse language tasks with exceptional accuracy, fluency, diction and style. From straightforward text classification to sophisticated text generation, these models pack a punch.

Generally, experts use the term LLM to refer to a subset of broader “Natural Language Understanding” (NLU) capabilities. NLU is a branch of Natural Language Processing (NLP) research focused on developing algorithms and models that understand natural language and how humans interact with it.

Before the rest of the world knew LLM products like ChatGPT, the LLM revolution had already begun within this earlier branch of NLP research.

For example, early NLU research models were used in particular for text classification, sentiment analysis, and named entity recognition.

In fact, for a while now, NLU models have been used in various industries, like healthcare, finance, and e-commerce - even streaming services like Netflix!

Since then, Transformer-based LLMs have changed the landscape.

What is a Transformer?

So, when I first heard of Transformer-based architecture, I also thought of the movie franchise. I freaked out a bit.

But the most effective way I have found to think about Transformers is through a simple analogy.

So, imagine an LLM is a really clever school student, who knows anything and everything there is to know - in fact, the student can symbolically graph out every line of text from the internet over the past 8 years and find symbolic relationships and context between each line of text. So, yeah, massive NERD.

Essentially, a Transformer is a magical schoolbag that enables the student to find symbolic relationships between texts really, really, really fast. The Transformer schoolbag for our LLM “student” also has “jetpack” capabilities. It enables the “student” to fly high and scan across large amounts of text to formulate knowledge at a massive scale.

Yeah, well… I am sure you get the idea. I hope!

How does the Transformer work?

Alright let’s take it back a step or two.

Over the past decade of NLP / NLU research, there have been many attempts to make language models perform better and compute more efficiently.

In fact, LLMs and Transformer-derived architectures work because they combine many ideas from the past 5-8 years.

A lot of the underlying ideas have been around for a while, actually. However, it’s also an active research area, and exciting breakthrough whitepapers land almost every week.

But to the point, key breakthroughs were made in “attention”, “transfer learning”, and scaling up neural networks (NN). All of these breakthroughs led to the now well-established “Transformer” LLM architecture that enables powerful LLMs like ChatGPT to work magic.

So how do Transformers work? Tricky but we can simplify.

The original Transformer architecture, created in 2017, is at its core a sequence-to-sequence (S2S) model with two main components.

You can break up the S2S into two components:

1) An “encoder” that splits raw text into core units (tokens), converts them into vectors (similar to the Word2vec process), and uses “self-attention” to understand the context of the text units.

2) A “decoder” that excels at generating text by using a modified form of “self-attention” to predict the next most appropriate word in a given sequence of text units provided by the encoder.

Here is a nice diagram highlighting these two key components:

I’ve borrowed this diagram from this wonderful and insightful article here.
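
If you like code more than diagrams, here is a minimal sketch of the same encoder/decoder flow using the Hugging Face “transformers” library (which I come back to later in this article). The “t5-small” checkpoint and the translation prompt are purely illustrative choices on my part, not something tied to the diagram above:

    # Minimal encoder/decoder (S2S) sketch with the Hugging Face "transformers"
    # library. "t5-small" is an illustrative choice of sequence-to-sequence model.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # Encoder side: raw text is split into tokens and mapped to ID/vector form.
    inputs = tokenizer("translate English to German: The schoolbag is magical.",
                       return_tensors="pt")

    # Decoder side: tokens are generated one at a time, each prediction
    # conditioned on the encoder's output and the tokens generated so far.
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))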

Now let’s break some of the language down a bit.

“Attention” or “self-attention” allows the words in a sequence to “attend to” each other.

The goal is to weigh every word in a given sequence against every other word, so the model can learn broader context, identify long-range dependencies, and position a sequence alongside other text sequences within large datasets.

Using methods like “self-attention” or “multi-headed attention” has changed the game.

So by using such a Transformer architecture, LLMs can capture relationships between words across long-range dependencies and turn unstructured text into structured, contextualised representations.

Or put another way, they take the loose context of raw, unstructured text and turn it into structure the model can work with.
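
To make “self-attention” a bit more concrete, here is a toy sketch in plain Python/NumPy. It is not how production models are written - real Transformers use learned projection matrices and many attention heads - but the core maths (scaled dot-products followed by a softmax) is the same. All shapes and values below are made up for illustration:

    # Toy scaled dot-product self-attention over a pretend sequence of 4 tokens.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    # Pretend we have 4 tokens, each represented by an 8-dimensional vector.
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 8))

    # In a real model, queries/keys/values come from learned linear layers;
    # here we reuse the raw token vectors to keep the sketch short.
    queries, keys, values = tokens, tokens, tokens

    # Each token "attends to" every other token: the score says how relevant
    # token j is when building a new, context-aware representation of token i.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    contextualised = weights @ values    # context-aware token vectors

    print(weights.round(2))  # 4x4 matrix of attention weights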

Under the hood, Transformers use different NLP “Language Modeling” encoding/decoding techniques to find relationships between words.

In fact, the architecture uses two key components - which the original masterminds call “encoder/decoder” - as outlined above.

Now, obviously no architecture is perfect, and Transformer-derived architectures have limitations!

The most obvious one is that they’re constrained by an input context window: each time they process text, they’re limited in how much of it they can take in at once. But again, this is an active research area.

Regardless of the architecture’s limitations, since its inception in 2017, the Transformer-derived architecture has created a huge ecosystem for LLM applications across different sectors.

The “Transformers” library, along with its supporting software packages and open-sourced tooling, has also played a pivotal role in making Transformers more accessible to practitioners from every walk of life.

In fact, the real USP of the Transformer ecosystem is that it has abstracted away a lot of the heavy computational machinery that was once required to make LLMs “work”. Pretty cool.
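
As a quick taste of that abstraction, this is roughly all it takes to run a pre-trained model with the “transformers” library. The pipeline picks a default sentiment model for you, so treat the exact model and scores as whatever the library happens to ship at the time:

    # A single pipeline call hides the tokenizer, the model weights and the
    # attention machinery behind one line of code.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    print(classifier("The Transformer schoolbag has jetpack capabilities!"))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]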

Pitstop: What are Tokens?

Coming back to NLP and NLU - within the scope of language modeling, many researchers and practitioners use statistical and Deep Learning (DL) techniques to model and generate human language with computers.

At the heart of this research area, many techniques have been developed for “predicting the likelihood of a sequence of tokens” in a specified vocabulary (a limited and known set of tokens).

The term “token” gets used a lot. You will hear and read about it often.

But in language modeling, a “token” refers to the smallest unit of semantic meaning.

Tokens are created by breaking down sentences or text into smaller units, and they serve as fundamental input for LLMs.

While tokens can indeed represent complete words, they can also encompass “sub-words”.

If you’ve come across the term “n-gram” before, you might find it related.

An “n-gram” denotes a sequence of n consecutive tokens, offering valuable insights into contextual relationships within a text. Think of an n-gram as a token together with a small window of its neighbouring tokens.
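
Here is a small sketch of tokens, sub-words and n-grams in practice. I’m using the “bert-base-uncased” tokenizer purely as an example - other models split text differently, so the exact tokens you get back will vary:

    # Tokens, sub-words and n-grams. "bert-base-uncased" is just one example of
    # a sub-word tokenizer; the exact splits depend on the model's vocabulary.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    tokens = tokenizer.tokenize("Transformers tokenize unbelievably long words")
    print(tokens)  # rarer words may come back as sub-word pieces like 'token', '##ize'

    # An n-gram is simply a window of n consecutive tokens.
    def ngrams(seq, n):
        return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

    print(ngrams(tokens, 2))  # bigrams: pairs of neighbouring tokens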

Auto-Regressive Language Models (ARLM):

Now, within the context of Transformers - you may hear NLP researchers or practitioners use the phrase ARLM.

ARLMs are actually the “under the hood” sub-system component often utilized for predicting tokens in sequences of text (i.e., predicting what word comes next).

So, ARLMs actually correspond to the “decoder” component of the Transformer-derived LLM architecture.

ARLMs work by applying a mask to full sentences, so that the “attention heads” can only see the tokens that came before. It’s a clever way of forcing the model to predict each token using only the tokens that precede it.

ARLMs are ideal for text generation, and an excellent example is GPT.
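
If it helps, here is a toy sketch (NumPy again, with made-up values) of the causal mask that makes a model auto-regressive - each position can only attend to itself and the positions before it:

    # Causal ("look-behind only") mask used by auto-regressive models.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len = 5
    scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))

    # Lower-triangular mask: position i may attend to positions 0..i only.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -np.inf)  # hide all "future" tokens

    weights = softmax(scores, axis=-1)
    print(weights.round(2))  # upper triangle is all zeros: no peeking ahead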

Auto-Encoding Language Models (AELM):

Auto-encoding Language Models (AELM) are another type of sub-system model, often used to reconstruct an original sentence from a corrupted version of a sentence. AELMs typically correspond to the “encoder” sub-system component of the Transformer architecture and have access to the full input without masks.

So, AELMs use bidirectional representations of whole sentences. They can be fine-tuned for a variety of tasks such as text generation, but their main application is sentence classification or token classification.

A typical example of AELM in action is BERT.
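
And here is BERT doing exactly that - reconstructing a “corrupted” sentence by filling in a masked token. The model name is the standard public checkpoint; the predictions and scores are simply whatever the model returns at run time:

    # An auto-encoding model (BERT) filling in a masked token.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in fill_mask("The Transformer is a [MASK] architecture."):
        print(prediction["token_str"], round(prediction["score"], 3))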

So what are some examples of Encoding/Decoding?

To recap - Large Language Models (LLMs) are language models that are either autoregressive, auto-encoding, or a combination of the two.

Modern LLMs like BERT or GPT use something called the Transformer architecture (the magical jetpack schoolbag), but they don’t have to. Also, the Transformer architecture combines different computational ideas - it borrows from both ARLM and AELM approaches to text prediction and traversal.

If you want to know more - you should check out the original 2017 whitepaper, “Attention Is All You Need”, which outlines some of the now-famous computational and statistical language modelling ideas.

Let’s Recap

Let’s use bullet points. I cannot write more. So, here it goes:

  • Original 2017 Transformer architecture is often utilized for modern LLMs (but not always)…
  • Depending on use of Transformers - LLMs are good at understanding or generating text (or both)…
  • LLMs run on large unstructured/structured datasets + vocabulary (depending on use-case)…
  • Modern LLMs like BERT and GPT use only part of the Transformer - BERT the encoder, GPT the decoder - to perform certain language-based tasks.
  • Encoder/Decoders utilize various NLP/NLU computational techniques like ARLM / AELM methods…
  • Unstructured datasets are often broken down into small sub-word tokens, then parsed and ingested by the LLM’s Encoder/Decoder pipeline…
  • LLMs can be placed into three buckets: 1) Auto-regressive, 2) Auto-encoding, 3) Hybrid (depending on usecase and level of flexibility required)…

Just to reinforce (excuse the pun): no matter the LLM architecture, or whether LLMs make use of Transformers, the main objective is to understand the “context” of tokens or sequences of unstructured sentences or words, and how they all relate to each other.

Cherry-picking encoder or decoder features from here and there, or deriving an LLM from a Transformer-based architecture, will only get you so far. You need very skilled practitioners to pull this off and, of course, huge computational infrastructure at hand to design, build and train an LLM from scratch.

So, for many organisations or companies, there is a lot more to think about and a lot more to experiment with first. More on this to come.

EDIT: For anyone interested in the Transformer architecture and the finer details involved - do check out the "Illustrated Transformer" article by the brilliant Jay Alammar via the following link.