LLM 101 - Pre-Training, Fine-Tuning, Alignment

Oliver Jack Dean

Alright, let's dive back into the world of Large Language Models (LLMs), picking up where we left off in my last post. See this post as the second act or second installment.

My previous post revolved around the bedrock of LLMs, delving into the transformer architecture and the various computational techniques that form the backbone of these LLMs. I also explored the principles of self-attention that empower LLMs to perform their wizardry.

This time around, I will delve into other key elements that make LLMs work magic.

So, I will outline some training techniques, the concept of transfer learning, fine-tuning, tokenization, and embedding techniques for in-context learning, among other key topics.

Pre-training

Imagine you're embarking on a journey to learn a new language, not just any language, but the language of humans.

Well, this is the journey all LLMs must undertake. Every time you use an off-the-shelf LLM like ChatGPT, that LLM was, at some point in its early origins, pre-trained.

Pre-training is an LLM's first immersion into a vast ocean of textual data, where it learns to swim in the currents of human language.

Pre-training is the foundation, the bedrock upon which LLMs build their understanding of how words relate to each other. The goal during pre-training is to help LLMs find rhythms, patterns, and symbolic relationships.

Now, the computational methods used in pre-training are not born out of thin air. They're the result of years of research in cognitive science and behavioural modelling of the brain. Think back to my previous article, where ideas like self-attention, alongside others, have been very useful in helping us create sophisticated LLMs.

Anyway, in the world of AI, there are two main schools of thought about how well LLMs can learn rhythms, patterns and relationships of words.

On one hand, we have the brain-inspired researchers, who engineer LLMs to mimic certain behaviors of the human brain, harnessing the power of Neural Networks (NN). This school of thought believes that the secrets to universal intelligence lie within the intricate workings of the brain and how brains tackle problems or games like playing Chess.

On the other hand, we have the theoretical linguists, who delve into the evolution of language over time, scrutinizing the underlying computational model of language. They view language as an organ, a machine, a component that evolves over time and encapsulates universal intelligence. However, they also acknowledge that this elegant model of computation can be perplexing and elusive for humans to comprehend.

Many AI and LLM breakthroughs have occurred by way of the brain-inspired researchers and institutes. Notable figures in AI align themselves with this school of thought. David Silver of DeepMind, intrigued by learning strategies for games like Chess and Go, has been wildly successful. And of course, there's Ilya Sutskever who, in a similar vein to Silver, was essential in enabling models like Word2Vec, GPT-1 through GPT-4, and even DALL-E.

Meanwhile, the likes of Noam Chomsky find themselves in the second camp - focused on the theoretical underpinnings of human language and its computational relationship to LLMs. They see value in the current LLMs, but do not believe them to be actual models of the brain.

Then there are those like Geoffrey Hinton, who straddle the line, drawing insights from both schools.

Anyway, coming back to pre-training.

During pre-training, an LLM learns to understand general natural language and the relationships between words.

Each LLM is pre-trained on different texts and tasks. For example, BERT learned its steps from two public text collections (English Wikipedia and the BooksCorpus) and two tasks: masked language modelling (predicting hidden words from their context) and next-sentence prediction.

Such pre-training helped BERT learn a rich set of language features and contextual relationships. However, the pre-training process for an LLM can evolve over time, like a dance that changes with each new dancer. For instance, a variant of BERT called RoBERTa was able to match and even surpass BERT's performance without one of the original tasks (next-sentence prediction).

Each LLM is pre-trained differently, which is what gives each its unique flavor or shall we say dialect.

Some LLMs, like OpenAI's GPT models, are trained on proprietary text data sources to give OpenAI and the creators using their service a competitive edge. Other LLMs are trained on narrower, more specific text data sources, giving them a leaner and more precise use of language - for example, an LLM that knows a lot about pharmacovigilance vocabulary but next to nothing about geopolitics.

So, the world of pre-training LLMs is a fascinating blend of language learning, cognitive science, and competitive computational innovation.

It's a research area that's constantly evolving, constantly changing, and endlessly fascinating.

Transfer Learning

Imagine you're a master juggler, deftly keeping multiple balls in the air. Now, someone hands you a tennis racket and a ball. You've never played tennis before, but your juggling skills give you a head start. You already understand the rhythm, the timing, the coordination. Well, this pretty much, is the essence of transfer learning.

In the realm of machine learning (ML), transfer learning is like a shortcut.

Transfer learning is a way to take the knowledge an LLM has gained from one previous task and apply it to another.

For LLMs, transfer learning is like a second act. An LLM that's been pre-trained on one body of text data is then fine-tuned for a specific task, like text classification or text generation. It's like our juggler, now a tennis player, learning to serve or volley.

The beauty of transfer learning is that the pre-trained model has already learned so much about language and word relationships. It's like it's already done the heavy lifting. So, in theory, such knowledge can be reused, giving the model a head start on a new task.

Transfer learning allows LLMs to be fine-tuned for specific tasks with much less task-specific data than if the model were trained from scratch.

It's like our juggler-turned-tennis-player doesn't need to learn how to move their arms or track a moving object. They can focus on the specifics of tennis, like serving and volleying.

In essence, transfer learning is a computational time and resource saver.

It's a way to stand on the shoulders of giants, to build on what's already been learned, to take a shortcut to success. It's the secret sauce that makes LLMs so versatile.

Fine-tuning

Once a large language model (LLM) has been pre-trained, it's ready for the next stage: fine-tuning. Well, this depends on the application of the LLM. You may not need fine-tuning.

But anyway, fine-tuning is the process of training the LLM further on a smaller, task-specific dataset.

The number of examples required for fine-tuning an LLM, however, is a variable that depends on both the task and the underlying LLM architecture.

It's a bit like seasoning a dish - a few hundred examples might noticeably change the LLM's performance, much like a pinch of salt can change a meal's flavor for better or worse.

Yet, even with a few hundred examples, the result might not be significantly better than simply prompting the LLM.

As a general rule of thumb, fine-tuning tends to outperform prompting as the number of examples increases. Well, that's the word on the street.

It's like adding more seasoning to a dish - the more you add, the richer the flavor.

However, there's no upper limit to the number of examples you can use to fine-tune an LLM, just as there's no limit to a craftsman's pursuit of perfection.

Anyway, the process of fine-tuning can be thought of as a feedback-loop, a cycle of steps that we repeat until we achieve the desired performance.

Whether we're working with open-source or closed-source LLMs, the process remains the same (a minimal sketch follows the list below):

  • Define the model and set fine-tuning parameters.
  • Gather relevant training data.
  • Compute losses and gradients to gauge learning error.
  • Update the model using back-propagation to minimize errors.
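To make that loop concrete, here's a minimal sketch in PyTorch. It uses a toy linear classifier and random tensors as stand-ins for a real LLM and real training data, purely to illustrate the loss, back-propagation, and update cycle - not an actual fine-tuning recipe.

```python
# Minimal sketch of the fine-tuning feedback loop (toy PyTorch classifier,
# not a real LLM) -- illustrative only.
import torch
import torch.nn as nn

# 1. Define the model and set fine-tuning parameters.
model = nn.Linear(768, 2)                       # stand-in for an LLM "head"
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# 2. Gather relevant training data (random tensors here as placeholders).
features = torch.randn(32, 768)                 # e.g. pooled text embeddings
labels = torch.randint(0, 2, (32,))             # task-specific labels

for epoch in range(3):
    # 3. Compute losses and gradients to gauge learning error.
    logits = model(features)
    loss = loss_fn(logits, labels)

    # 4. Update the model using back-propagation to minimise errors.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```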

In essence, fine-tuning is a process of adaptation and specialization. And there are various trade-offs to think about when either fine-tuning or prompting an LLM.

Certainly, fine-tuning can reduce the average cost of inference. The more instruction you can bake into your LLM, the less instruction you have to put into each prompt. Yet, performance must be carefully analysed.

As mentioned earlier, it's a bit like a chef trying to find the correct balance of seasoning for some special dish. For some ingredients, salt makes sense, whilst for others, there is no need.

It's a journey from general to specific, from broad knowledge to specialized expertise.

Embeddings

Think of embeddings as the DNA of words. Each word can be broken into smaller linguistic elements, such as tokens, which help build up phrases.

Embeddings are mathematical representations of that DNA. They are vectors in a high-dimensional space that capture the essence of these linguistic elements. As a mathematical tool, embeddings are very useful for encapsulating the semantic meaning of words and their relationships with other words. They're like a secret code, translating words into a language that machines can understand.

There are various types of embeddings. Position embeddings, for instance, capture the location of a token in a sequence, while token embeddings encapsulate the semantic meaning of a token. They're like the longitude and latitude of words, providing context and position. This positional and semantic context is part of what lets an LLM interpret a prompt "in context".

LLMs learn different embeddings for tokens based on their pre-training and can further refine these embeddings during fine-tuning. It's a continuous process of learning and adapting.

Embedding techniques are currently a popular method for building LLM-based applications and services. Many teams generate embeddings and then build on top of them, for example constructing a semantic search. OpenAI, for instance, offers a dedicated embedding model called 'text-embedding-ada-002' for exactly this purpose.
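As a rough illustration of that pattern, here's a tiny semantic-search sketch. The `embed` function is a hypothetical stand-in for whatever embedding API you use (for instance, a call to 'text-embedding-ada-002'); here it returns pseudo-random vectors so the sketch runs end to end, which means the ranking is only illustrative.

```python
# Sketch of a tiny semantic search over embeddings.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a real embedding API call.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(1536)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "adverse drug reaction reporting guidelines",
    "how to improve your tennis serve",
    "basic steps of the waltz",
]
doc_vectors = [embed(doc) for doc in documents]

query_vector = embed("pharmacovigilance case intake")

# Rank documents by cosine similarity to the query vector.
ranked = sorted(zip(documents, doc_vectors),
                key=lambda pair: cosine(query_vector, pair[1]),
                reverse=True)
print(ranked[0][0])  # with real embeddings, the most semantically similar document
```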

Embeddings are particularly useful for chatbots and learning. If a chatbot is to be a companion, it needs context on specific topics. For instance, if you want a chatbot for cognitive behavioral therapy (CBT), you'll need to fine-tune the LLM using contextual embeddings - embeddings of words, sentences, phrases, and relationships specifically targeting CBT. It's like giving the chatbot a crash course in CBT, equipping it with the knowledge it needs to be effective.

Tokenization

In the realm of NLP, tokenization is a key part of any word's DNA or existence. It's the process of dissecting text into its smallest units of understanding - tokens.

These tokens are the raw material that gets embedded with semantic meaning and fed into the attention calculations that make an LLM work.

Tokens form the static vocabulary of an LLM, and they're not always entire words.

Tokens can represent punctuation, individual characters, or even a sub-word if a word is unfamiliar to the LLM.

Many LLMs also have special tokens that carry specific meanings for the model.

For instance, the BERT model has a few special tokens, including the [CLS] token, which BERT automatically inserts as the first token of every input. The [CLS] token is meant to encapsulate the semantic meaning of the entire input sequence to the model.
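To see those special tokens in action, here's a small sketch assuming the Hugging Face `transformers` library and the 'bert-base-uncased' checkpoint are available:

```python
# Sketch of special tokens using a Hugging Face tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Pharmacovigilance reports arrived late.")

tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)
# The first token is [CLS] and the last is [SEP]; unfamiliar words such as
# "pharmacovigilance" are split into sub-word pieces rather than dropped.
```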

You might be familiar with traditional NLP techniques like stop words removal, stemming, and truncation.

But LLMs don't need these techniques. LLMs are designed to handle the complexity and variability of human language in all its glory, including stop words like "the", "an" and "it", and variations in word forms like tenses and misspellings.

In fact, altering the input text to an LLM using these techniques could potentially harm an LLM's performance by stripping away contextual information and distorting the original meaning of the text.

Tokenization can also involve preprocessing steps like character casing, which refers to the capitalization of tokens.

There are two types of casing: uncased and cased.

In uncased tokenization, all tokens are lowercased and accents are usually removed, while in cased tokenization, the capitalization of tokens is preserved.

The choice of casing can impact the model's performance, as capitalization can provide important clues about the meaning of a token - think of German, in particular, where all nouns are capitalised.

The choice between uncased and cased tokenization is like choosing the right tool for the job.

For straightforward tasks like text classification, uncased tokenization usually does the trick.

But for tasks that draw meaning from case, like Named Entity Recognition, cased tokenization is the way to go.
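Here's a quick sketch of the difference, again assuming the Hugging Face `transformers` library and both BERT checkpoints are available:

```python
# Sketch comparing uncased and cased tokenization.
from transformers import AutoTokenizer

uncased = AutoTokenizer.from_pretrained("bert-base-uncased")
cased = AutoTokenizer.from_pretrained("bert-base-cased")

text = "Apple hired Anna in Berlin."
print(uncased.tokenize(text))  # everything lowercased: 'apple', 'anna', 'berlin', ...
print(cased.tokenize(text))    # capitalisation preserved - handy for Named Entity Recognition
```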

Also kind of critical to note: LLMs have a limit on the number of tokens we can input at a time.

LLMs have a guest list, so the way an LLM tokenizes text can matter for particular tasks. If you are looking to write and compose a book, you may only get one chapter at a time, and connections between chapters will depend on the preceding input token stream.

So, if you are looking for an LLM to ingest an entire codebase as text and then do analysis - well, it may be difficult given token limits.

Tokenization is a well-established research area. And right now, many LLMs are battling it out for larger and larger token input limits.

Anthropic's Claude 2 currently sits at the top of the world's LLM token-limit premier league, boasting a 100,000-token input length.
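As a rough sketch of working within such limits, here's a token count check using OpenAI's `tiktoken` library. This is purely illustrative: other providers such as Anthropic use their own tokenizers, so counts will differ, and the limit shown is a made-up placeholder.

```python
# Sketch of counting tokens before sending text to an LLM.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
chapter = "It was a dark and stormy night... " * 200

token_count = len(encoder.encode(chapter))
MODEL_LIMIT = 4096  # hypothetical per-request input limit

if token_count > MODEL_LIMIT:
    print(f"{token_count} tokens: split the input into smaller chunks.")
else:
    print(f"{token_count} tokens: fits within the limit.")
```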

Prompt Engineering

Prompting in the world of LLMs is a bit like the art of conversation. It's about crafting the right question, the right input, to extract the most valuable response from an LLM. Believe me, prompt engineering is a thing.

A prompt is a text input for an LLM, made up of tokens. Each LLM has a limit for prompts, a maximum number of tokens it can handle at a time. It's like a conversation with a word limit. You have to make every word count.

The art of prompting is about getting the most accurate and valuable responses from LLM interactions. It's about understanding the LLM's limits and working within them to craft the most effective prompts.

Once you've crafted your prompt, the next step is prompt evaluation + prompt stuffing.

Both evaluation and stuffing are core components of prompt engineering. With prompt stuffing, you provide a few examples in the input prompt and hope that the LLM will generalize from them, like a few-shot learner.

Imagine you're trying to assign a controversy score to a text, like a tweet from Elon Musk. The more detail and examples you put into the prompt showcasing various controversial tweets, the better the model's performance might be.
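As a sketch of what such prompt stuffing might look like, here's a small helper that builds a few-shot prompt. The example tweets, the 1-10 scale, and the helper itself are made up for illustration.

```python
# Sketch of prompt stuffing: building a few-shot controversy-scoring prompt.
FEW_SHOT_EXAMPLES = [
    ("Excited to announce our quarterly results next week.", 1),
    ("Our competitors' products are garbage and their CEO knows it.", 8),
]

def build_prompt(tweet: str) -> str:
    lines = ["Rate how controversial each tweet is, from 1 (benign) to 10 (highly controversial)."]
    for example_tweet, score in FEW_SHOT_EXAMPLES:
        lines.append(f'Tweet: "{example_tweet}"\nControversy score: {score}')
    # The unscored tweet goes last; the LLM is expected to complete the score.
    lines.append(f'Tweet: "{tweet}"\nControversy score:')
    return "\n\n".join(lines)

print(build_prompt("We might take the company private, who knows."))
```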

But there's a catch: the more detailed your prompt, the higher the cost of inference - the more tokens you stuff in, the more work the LLM has to do.

So, prompt engineering is a real thing, and it is a bit of an art form. There are very few consistent frameworks available for LLM prompting.

As it stands, the process involves prompt versioning and optimization, using various techniques like prompt stuffing to ensure the effectiveness of your prompts and to track their performance.

Alignment + RLHF

Alignment in LLMs is a big topic. It's a bit like a dance between the LLM and the user. It's about how well the model can move in response to the user's prompts, matching user expectations.

Traditional language models predict the next word based on the preceding context, but this can be like trying to dance a waltz to a tango beat. It can limit their ability to respond to specific instructions or prompts.

Researchers are now looking at new ways to align language models to a user's intent, incorporating techniques like reinforcement learning (RL) into the training loop.

RL with Human Feedback (RLHF) is a popular method of aligning pre-trained large language models (LLMs). It's like a dance lesson, where the LLM learns from feedback on its own outputs from a small, high-quality batch of human feedback.

RLHF allows the LLM to overcome some of the limitations of traditional supervised learning (SL), like a dancer learning to improvise rather than just following a set routine.

RLHF has shown significant improvements in modern LLMs like ChatGPT.

But RLHF is just one dance in the RL repertoire. There are other emerging approaches like RL with AI feedback, such as Constitutional AI.

Anyway, such techniques are very interesting and are expanding the possibilities for how LLMs can move and respond to user prompts.

Distillation

In March 2023, an innovative idea was put forth by a group of Stanford students to help optimize both fine-tuning and embedding-based techniques.

The research paper proposed fine-tuning a smaller open-source language model, LLaMA-7B (the 7-billion-parameter version of LLaMA), using examples generated by a larger language model, text-davinci-003, which boasts 175 billion parameters.

Such a technique became known as distillation and involves training a smaller model to imitate the behavior of a larger model.

So, the smaller model is fine-tuned based on the examples generated by the larger model, effectively learning to replicate its behavior.
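A rough sketch of that recipe might look like the following, where `teacher_generate` and `fine_tune_student` are hypothetical stand-ins for a call to the large teacher model and a supervised fine-tuning loop for the smaller student model.

```python
# Minimal sketch of the distillation recipe: teacher generates examples,
# student is fine-tuned on them. Both helpers are hypothetical stand-ins.
from typing import List, Tuple

def teacher_generate(instructions: List[str]) -> List[Tuple[str, str]]:
    # Stand-in: a real implementation would call the large teacher LLM here
    # (e.g. text-davinci-003) to produce a response for each instruction.
    return [(inst, f"<teacher response to: {inst}>") for inst in instructions]

def fine_tune_student(pairs: List[Tuple[str, str]]) -> None:
    # Stand-in: a real implementation would run supervised fine-tuning of the
    # smaller model (e.g. LLaMA-7B) on the (instruction, response) pairs.
    print(f"fine-tuning student on {len(pairs)} teacher-generated examples")

seed_instructions = ["Summarise this clinical note.", "Explain transfer learning simply."]
training_pairs = teacher_generate(seed_instructions)
fine_tune_student(training_pairs)
```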

The result of this process is a fine-tuned model that behaves similarly to text-davinci-003, but with the advantage of being significantly smaller and more cost-effective to computationally run and operate across compute infrastructure.

Distillation approaches demonstrate a clever method for leveraging the power of LLMs in a more efficient and economical way.