📦 LLM 101 - Prompt Engineering

Oliver Jack Dean

Prompt Engineering (PE) Unpacked

In my previous post, I touched on the topic of prompts. Diving deeper, Prompt Engineering (PE) stands out as a pivotal tool when harnessing the capabilities of Large Language Models (LLMs).

PE allows us to direct LLMs towards desired outcomes without altering their core structure or underlying model weights.

As a methodology, PE is rich and varied, but its effectiveness can differ across models, necessitating thorough experimentation.

Prompting for the rest of us

  • Zero-shot learning: refers to the ability of a model to perform tasks for which it has seen no examples during training. In the context of LLMs, zero-shot learning means that the model can generate relevant responses to prompts without being provided any examples of the desired output. You would use this approach when you believe the model has enough knowledge from its training data to handle a task without needing any examples. It just needs some extra polite nudging.
  • Instruction prompting: involves providing the model with a clear and direct instruction or command to guide its behavior or output. Instead of showing examples, you're telling the model explicitly what you want it to do or how you want it to respond. For example, instead of just asking the model to "write a tweet," you might instruct it with "write a tweet in the style of a Financial Times correspondent." (A minimal sketch of the first two styles follows this list.)
  • Few-shot learning: when using LLMs, few-shot learning involves providing the model with a limited number of examples (or "shots") to help it understand the task at hand. These examples typically consist of both input and the desired output (demonstrations). By seeing these demonstrations, the model is expected to generalize and produce relevant outputs for new, unseen inputs related to the same task.
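
To make the first two styles concrete, here is a minimal sketch using the chat-message format that OpenAI-style chat APIs expect; the article placeholder and the exact wording are invented for illustration:

# Zero-shot: no examples at all, just the task, trusting the model's training data
zero_shot = [
    {"role": "user", "content": "Summarise the following article in two sentences:\n<article text>"},
]

# Instruction prompting: an explicit instruction about style and format
instruction = [
    {"role": "user", "content": "Write a tweet in the style of a Financial Times correspondent summarising the article below:\n<article text>"},
]

A few-shot version of the same idea appears in the next section.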

Few-shot learning

Inserting 'few-shot examples' or 'demonstrations' into an LLM prompt serves to clarify our intent. Essentially, these demonstrations guide the model on how to approach a downstream task.

Few-shot learning has become a significant subject domain and a hot topic in Machine Learning (ML) research. Its significance is plain to see: take, for example, the original GPT-3 research paper, aptly titled “Language Models are Few-Shot Learners”.

Yet it's still not perfect and the technique is not without its challenges.

Few-shot learning performance hinges on the quality and relevance of the provided examples or demonstrations. This leads us to a question: what defines a high-quality few-shot example?

Anyway, as an example, I would like ChatGPT to classify the sentiment of some text by emulating the following demonstrations, collected here as a list of chat messages:

{"role": "user", "content": "Text: I like to think of this as..."},
{"role": "assistant", "content": "Sentiment: positive"},
{"role": "user", "content": "Text: I think if you look behind the lines..."},
{"role": "assistant", "content": "Sentiment: positive"},
{"role": "user", "content": "Text: I might perceive the risk calculus differently than I did at a seemingly identical moment..."},
{"role": "assistant", "content": "Sentiment: positive"},
{"role": "user", "content": "Text: In turning my mind to this topic. I think ..."},
{...}
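
To actually send these demonstrations to a model, a minimal sketch against the openai Python client (v1-style interface) might look like this; the model name is purely illustrative and an API key is assumed to be configured in the environment:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=few_shot_messages,
)

print(response.choices[0].message.content)  # e.g. "Sentiment: ..."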

This is a cool and simple example. But I have a feeling it will need more examples, otherwise there is a risk that much of the LLM's output will be of high variance.

Luckily, there are a few well-documented biases we humans can watch for when judging how well few-shot learning is working for our LLM.

For the sake of simplicity, think of "labels" as the desired outputs attached to our examples (e.g. "Sentiment: positive"):

  • Majority label bias: if the distribution of labels among the examples is unbalanced, the model tends to favour the most frequent label (a quick check is sketched after this list).
  • Recency bias: the tendency of the model to repeat the label(s) that appear at the end of the prompt.
  • Common token bias: the LLM tends to produce common tokens more often than rare tokens.
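
As a quick sanity check against majority label bias, we can count the label distribution in the demonstrations defined earlier; this is just a hypothetical check on the few_shot_messages list from above:

from collections import Counter

# Count how often each label appears among the assistant demonstrations.
label_counts = Counter(
    msg["content"] for msg in few_shot_messages if msg["role"] == "assistant"
)
print(label_counts)  # e.g. Counter({'Sentiment: positive': 3}) -- clearly unbalanced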

Few-shot learning, by all accounts, has a lot of promise. It's a nice way to state explicitly how an LLM should return a generated result, particularly for downstream tasks involving text generation or summarisation.

But I've spoken too soon.

Already, researchers and practitioners alike have been looking into other ways to avoid high variance amongst outputs:

  • Shuffling Demonstrations: Randomize the order of provided examples to ensure that the model doesn't overly rely on the most recent ones (a minimal sketch follows this list).
  • Attention/Flash Attention: Research on how attention is distributed across input tokens can provide insights into recency bias.
  • Self-Reflection: Have the LLM assign confidence scores to its outputs and set a threshold below which it either refrains from answering or flags the answer as potentially unreliable. Self-reflection is a vital topic of importance for autonomous agents using LLMs in particular.
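
Shuffling the demonstrations from the earlier sentiment example is straightforward; this sketch reuses the hypothetical few_shot_messages list and assumes the final message is the new query to classify:

import random

# Separate the final user query from the demonstration pairs, shuffle only the
# pairs, then re-append the query so it always comes last.
demos, query = few_shot_messages[:-1], few_shot_messages[-1]
pairs = [demos[i:i + 2] for i in range(0, len(demos), 2)]
random.shuffle(pairs)
shuffled_messages = [msg for pair in pairs for msg in pair] + [query]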

Self-reflection is gaining prominence across ML research.

Instruction-tuned language models, such as InstructGPT and those trained on natural-language instruction datasets, are fine-tuned versions of pretrained models. They utilize high-quality sets of instructions, inputs, and desired outputs to enhance an LLM's ability to grasp and execute user intentions more accurately and decisively. Another notable technique is RLHF (Reinforcement Learning from Human Feedback), which ChatGPT famously employs.

Such instruction-based fine-tuning aligns the model more closely with human intent. This not only streamlines communication but also minimizes the need to repeatedly seek clarity from LLMs, making interactions more efficient.

Improving Few-shot learning

But coming back to Few-shot learning, there have been many attempts to improve its overall workflow and framework:

  • In-context instruction learning combines Few-shot learning with instruction prompting. It incorporates multiple demonstration examples across different tasks in the prompt, each demonstration consisting of instruction, task input and output (Ye et al. 2023).
  • Q-Learning can be used to optimize the selection of examples in Few-shot learning scenarios. By treating the selection of examples as an action in a given state, Q-learning can help determine which examples are most informative for the model (Zhang et al. 2022).
  • Active Learning, a subset of Supervised Learning, is particularly relevant when you have a limited budget for labeling examples, which aligns with the constraints of Few-shot learning. By intelligently selecting which examples to label, active learning can enhance the effectiveness of few-shot learning. So, given an unlabeled dataset \(\mathcal{U}\) and a fixed labeling budget \(B\), active learning aims to select a subset of \(B\) examples from \(\mathcal{U}\) to be labeled such that they maximize the improvement in model performance.
  • CEAL (Cost-Effective Active Learning) - although not exclusive to Few-shot learning, combines the principles of Active Learning with Semi-Supervised Learning. In scenarios where labeled data is scarce (as in few-shot learning), CEAL can be beneficial by leveraging both labeled and unlabeled data and combining datasets in various ways.
  • K-means clustering - a subset of Unsupervised Learning, can be used to ensure that the examples provided in Few-shot learning are diverse and representative of the entire data distribution. By clustering semantically similar examples and sampling across clusters, one can avoid biases and ensure that the model gets a holistic overview of the data and its distribution (a minimal selection sketch follows this list).
  • TALM and Toolformer - frameworks that researchers use to hook LLMs up to external tools via APIs. They can help make prompts more robust and refined for tasks like calculations, translations, and information retrieval.
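
As a rough illustration of the clustering idea, one could embed a pool of candidate demonstrations and keep one per cluster; candidate_texts and embed() are assumed placeholders for your labelled pool and any sentence-embedding function:

import numpy as np
from sklearn.cluster import KMeans

embeddings = np.array([embed(text) for text in candidate_texts])

k = 4  # number of demonstrations we want in the prompt
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# For each cluster, keep the candidate closest to the centroid,
# giving a small but diverse set of demonstrations.
selected = []
for centre in kmeans.cluster_centers_:
    distances = np.linalg.norm(embeddings - centre, axis=1)
    selected.append(candidate_texts[int(np.argmin(distances))])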

Do note that TALM and Toolformer frameworks are used to augment LLMs with external capabilities. They can be applied in the context of Few-shot Learning but are essentially tools to enhance the capabilities of LLMs in general.

Advanced Stuff

As touched upon earlier, when LLMs undertake tasks related to autonomous agents, there's a need for more sophisticated prompt engineering techniques and frameworks. To ensure these autonomous agents operate appropriately, some renowned methods include:

  • Self-Consistency Sampling: In a system with multiple agents, ensuring consistency across agents is crucial. Self-Consistency Sampling can be used to validate the outputs of different agents, ensuring that they align with the majority or a predefined criterion. For instance, if multiple agents are tasked with solving a problem, this method can help in selecting the most consistent solution (a minimal majority-vote sketch follows this list).
  • Chain-of-Thought (CoT) Prompting: CoT prompting can be used to generate a sequence of "reasoning" steps, which can be especially useful when coordinating multiple agents. Each step in the chain can represent an action or decision made by an agent, ensuring that the collective actions of all agents lead to a desired outcome.
  • Self-Ask Method: Autonomous agents often operate in dynamic environments where they need to adapt and gather more information. The Self-Ask Method allows agents to proactively seek additional information or context, which can be particularly useful when they interact with other agents or external systems. For instance, if an agent is unsure about a decision, it can use this method to ask other agents for input or clarification.
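
At its simplest, self-consistency sampling boils down to sampling several answers and taking a majority vote; ask_llm() below is a hypothetical stand-in for whatever completion call you use:

from collections import Counter

def self_consistent_answer(prompt, n_samples=5):
    # Sample several completions at a non-zero temperature, then keep the
    # answer that appears most often, along with a crude confidence score.
    answers = [ask_llm(prompt, temperature=0.8) for _ in range(n_samples)]
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common, count / n_samples

That crude confidence score also ties in nicely with the self-reflection idea mentioned earlier.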

Underlying all three of these approaches is something called the ReAct framework.

Chain-of-Thought (CoT)

Interestingly, there are two main prompting techniques for CoT-based use cases:

  • Few-shot CoT: this approach involves providing the model with a set of demonstrations, each consisting of either manually crafted or model-generated reasoning chains that exemplify high-quality thought processes.
  • Zero-shot CoT: this method uses natural language prompts alone. Starting with a statement like "Let's think step by step here," the model is encouraged to generate a sequence of reasoning. Following this, a prompt such as "Therefore, the answer should be" guides the model to produce a final answer, as outlined by Kojima et al. in 2022 (a minimal sketch follows below).

  • A related technique is STaR (Self-Taught Reasoner): this two-step process begins by prompting LLMs to create reasoning chains, retaining only those that lead to accurate conclusions. Subsequently, the model is fine-tuned using the generated rationales. This cycle is repeated until the model reaches a point of convergence, as detailed by Zelikman et al. in 2022.
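
A minimal zero-shot CoT sketch, loosely following Kojima et al. (2022), using the same hypothetical ask_llm() helper as before and an arithmetic riddle chosen purely for illustration:

question = "A bat and a ball cost 1.10 in total. The bat costs 1.00 more than the ball. How much does the ball cost?"

# First elicit the reasoning, then append it and ask for the final answer.
reasoning = ask_llm(f"Q: {question}\nA: Let's think step by step.")
answer = ask_llm(f"Q: {question}\nA: Let's think step by step.\n{reasoning}\nTherefore, the answer is")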

ReAct Framework

As mentioned above, the ReAct framework is designed to support real-time interactions and feedback, making it particularly suitable for techniques that require iterative and dynamic interactions, such as CoT and Self-Ask prompting.

By integrating CoT prompting within the ReAct framework, it becomes possible to monitor and guide the reasoning process of autonomous agents in "real-time". This ensures that any deviations or errors in the chain of thought can be promptly addressed, leading to more reliable and coordinated outcomes.
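
A toy ReAct-style loop might interleave Thought / Action / Observation steps like this; ask_llm() and run_tool() are assumed helpers, and the parsing and stop condition are deliberately simplistic:

prompt = "Question: <task for the agent>\n"
for _ in range(5):  # cap the number of reasoning/acting turns
    thought = ask_llm(prompt + "Thought:")
    prompt += f"Thought:{thought}\n"
    if "Final Answer:" in thought:
        break
    action = ask_llm(prompt + "Action:")
    observation = run_tool(action)  # e.g. a search query or calculator call
    prompt += f"Action:{action}\nObservation: {observation}\n"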

Know thy Prompts

LLMs are not limited to processing just a single prompt. For instance, dialogue-oriented models like ChatGPT can handle a system prompt along with multiple user and assistant interactions, as sketched below. Cohere, by contrast, demands a more hands-on approach to structuring prompts, but the quality of its results justifies the effort.
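
For example, a system prompt plus several user/assistant turns, in the same chat-message format used earlier; the persona and dialogue content are invented for illustration:

conversation = [
    {"role": "system", "content": "You are a concise Financial Times correspondent."},
    {"role": "user", "content": "Summarise today's market movements in one paragraph."},
    {"role": "assistant", "content": "Equities drifted lower as..."},
    {"role": "user", "content": "Now rewrite that summary as a tweet."},
]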

The effectiveness of prompts is closely tied to a model's architecture and training. For instance, ChatGPT, GPT-3 (distinct from ChatGPT), T5, and Cohere's command series models each have unique architectures, data sources, and training methods. This means a prompt that's effective for one might not be for another.

Open-source LLMs, such as GPT-J and FLAN-T5, are also noteworthy. While they can produce high-quality text akin to GPT and Cohere, their open-source nature provides an added advantage: enhanced flexibility in prompt engineering. Developers can tailor prompts to their needs, leading to more precise outputs.

However, I anticipate potential challenges, especially from regulatory perspectives. While open-source models might be ideal for sectors like medical or pharmaceutical modeling due to their prompt engineering flexibility and accuracy, regulatory bodies often view open-source solutions with a degree of skepticism and caution.

References

List of awesome resources and references: