Fine-Tuning Methods
Prompt Tuning
Complexity: High
Background: Prompt tuning, prefix tuning and P-tuning are variants of the same approach: tunable auxiliary embeddings are concatenated to the original prompt embeddings to engineer a better prompt. It is important to note that these are not human-readable prompts: they are added to the input sequence after the query has been embedded. Because these auxiliary embeddings are trainable, they can be fine-tuned to improve prompt quality and LLM output for specific downstream tasks. Prompt tuning can be considered a form of parameter-efficient fine-tuning.
Methods:
Prompt Tuning^{1}
Tunable input embeddings are prepended to the embeddings of the original prompt.
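The core mechanic can be sketched in a few lines of NumPy (shapes and names here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16    # embedding dimension
n_virtual = 4   # number of tunable "virtual token" embeddings
seq_len = 10    # length of the original (tokenized) prompt

# Frozen embeddings of the user's prompt, produced by the model's embedding layer.
prompt_embeds = rng.normal(size=(seq_len, d_model))

# Trainable soft-prompt embeddings; in prompt tuning these are the ONLY
# parameters updated, while the rest of the model stays frozen.
soft_prompt = rng.normal(size=(n_virtual, d_model))

# Prompt tuning simply prepends the soft prompt to the embedded input sequence.
model_input = np.concatenate([soft_prompt, prompt_embeds], axis=0)

print(model_input.shape)  # (14, 16): n_virtual + seq_len rows
```

Gradients flow back through the frozen network into `soft_prompt`, so the "prompt" is optimized directly in embedding space rather than in discrete tokens.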
Prefix Tuning
Tunable embeddings, fed through a regular MLP, are prepended at both the input layer and the transformer layers of the network. Conceptually similar to prompt tuning, but applied at several layers rather than just the input layer.
P-tuning^{2}^{3}
Tunable embeddings, fed through an LSTM or MLP, are inserted into the input layer. Conceptually similar to prompt tuning, but the auxiliary embeddings are not necessarily prepended. The authors find that this improves the stability of LLM outputs with respect to small discrete changes in prompts.
P-tuning v2^{4}^{5}
The primary difference between P-tuning v2 and the original P-tuning is that P-tuning v2 introduces tunable embeddings at every layer, not just the input layer. Conceptually similar to prefix tuning and P-tuning.
Parameter-Efficient Fine-Tuning
Complexity: High
Background: Fine-tuning entails updating the weights of a pretrained LLM using task-specific data to improve its in-domain performance. There are multiple variations of the technique: parameter-efficient fine-tuning updates only a small subset of model weights for computational efficiency, while full fine-tuning updates all model weights.
Methods:
LoRA and QLoRA^{6}
Low-Rank Adaptation (LoRA) is a fine-tuning approach stemming from the insight that model weights, which are large matrices, can be decomposed into smaller matrices while still approximately preserving the information they contain^{7}. QLoRA is an extension of LoRA that uses quantisation to further improve its efficiency.
For instance, a 100x100 matrix decomposed into a 100x2 matrix and a 2x100 matrix has significantly fewer parameters: 10,000 (100x100) vs. 400 (two matrices of 200 parameters each). The authors of the paper find that, in their tests on GPT-3 175B, using just 0.02% of the total parameters for LoRA produces results superior to full fine-tuning.
More specifically, low-rank decompositions are used to approximate the updates to model weights during fine-tuning, not the weights themselves^{8}. The original model weights are frozen and only the deltas are trained. These updates can then be added to the model's original weights. At training time, this significantly reduces the computational requirements for fine-tuning a model. At inference time, unlike other methods (prefix tuning, adapters, etc.), LoRA does not increase latency because no additional parameters remain once the deltas are merged into the model.
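The mechanics can be sketched in NumPy, with dimensions matching the 100x100 example above (in the paper the up-projection B is initialised to zero so training starts exactly at the pretrained model; here it is random so the merge check below is non-trivial):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100   # square weight matrix, as in the 100x100 example
r = 2     # LoRA rank

W = rng.normal(size=(d, d))          # frozen pretrained weight (not trained)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = rng.normal(size=(d, r)) * 0.01   # trainable up-projection

x = rng.normal(size=(d,))

# During training the low-rank delta B @ A is kept separate from W...
h_train = W @ x + B @ (A @ x)

# ...and after training it can be merged into W, so inference uses a single
# matmul and incurs no extra latency.
W_merged = W + B @ A
h_merged = W_merged @ x
assert np.allclose(h_train, h_merged)

# Parameter comparison from the text: full matrix vs. low-rank delta.
print(W.size, A.size + B.size)  # 10000 vs 400
```

Only `A` and `B` receive gradient updates; the 10,000-parameter `W` stays frozen, which is where the training-time savings come from.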
IA3^{9}^{10}
IA3^{11} performs fine-tuning by introducing trainable task-specific vectors that rescale the key, value and feed-forward activations of each transformer block in an encoder-decoder transformer^{12}. Similar to LoRA, IA3 adds no latency at inference because the learned vectors can be merged into the model's existing weights. Additionally, IA3 allows for mixed-task batches, since the activations for each task in the batch can be separately multiplied by that task's learned vectors. Compared to LoRA, IA3 requires even fewer parameters (around 0.01% of total parameters vs. >0.1% for LoRA) while achieving superior performance in the authors' tests^{13}.
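A minimal sketch of the rescaling mechanism, with illustrative shapes (the real method applies these vectors inside the attention and feed-forward computations of each block):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_ff, seq = 8, 32, 5

# Frozen activations inside one transformer block (illustrative values).
keys = rng.normal(size=(seq, d_k))
values = rng.normal(size=(seq, d_k))
ff_hidden = rng.normal(size=(seq, d_ff))

# IA3's trainable parameters: one rescaling vector per modified activation,
# initialised to ones so the model's behaviour starts unchanged.
l_k = np.ones(d_k)
l_v = np.ones(d_k)
l_ff = np.ones(d_ff)

# Fine-tuning learns to inhibit (<1) or amplify (>1) individual activations.
keys_scaled = keys * l_k
values_scaled = values * l_v
ff_scaled = ff_hidden * l_ff

# Total trainable parameters for this block: just three small vectors.
print(l_k.size + l_v.size + l_ff.size)  # 48
```

Because the scaling is element-wise, each vector can be folded into the frozen projection matrix that produces its activation, which is why no inference latency is added; keeping the vectors separate per task is what enables mixed-task batches.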
Full Fine-Tuning
Complexity: High
Background: In its simplest form, full fine-tuning of a model entails supervised learning over a domain-specific dataset or task. Because all model weights are being optimized, for larger models this entails a tremendous number of gradient calculations and large model checkpoints, making full fine-tuning very costly in both compute and storage. Unless a task or dataset falls significantly outside the domain of a pretrained LLM, full fine-tuning is unlikely to be necessary.
Methods:
Reinforcement Learning with Human Feedback (RLHF)
RLHF is a fine-tuning technique with the specific goal of aligning models with some framework of human preference. RLHF was used to fine-tune ChatGPT and GPT-4 to ensure they are helpful, honest and harmless. It proceeds in the following steps:
1. A set of real prompts is sampled from the input prompt distribution, and human labelers write demonstration responses for them. This initial dataset is used in a regular supervised learning procedure to fine-tune the model^{14}, which we call the SFT model.
2. A new set of prompts is sampled and several outputs are generated from the SFT model for each. These responses are ranked by human evaluators, and a reward model^{15} (RM) is trained to predict the higher-ranked output between pairs of outputs^{16}.
3. A new set of prompts is sampled, and outputs generated from the SFT model are scored by the RM. The PPO algorithm^{17} is used to iteratively optimize the SFT model, which in reinforcement learning terms is referred to as the 'policy'.
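The reward model's objective in the second step can be sketched as a pairwise ranking loss (the standard Bradley-Terry-style formulation used in InstructGPT-like pipelines; the scores below are illustrative stand-ins for real RM outputs):

```python
import math

def pairwise_rm_loss(reward_preferred, reward_rejected):
    """Negative log-sigmoid of the reward gap: the RM is trained so that
    the human-preferred response scores higher than the rejected one."""
    gap = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# Four ranked responses (1 > 2 > 3 > 4) yield 4C2 = 6 training pairs,
# as described in the footnote on ranking.
rewards = {1: 2.0, 2: 1.1, 3: 0.3, 4: -0.8}  # illustrative RM scores
pairs = [(a, b) for a in rewards for b in rewards if a < b]
assert len(pairs) == 6

loss = sum(pairwise_rm_loss(rewards[a], rewards[b]) for a, b in pairs) / len(pairs)
print(round(loss, 3))
```

The loss shrinks as the RM assigns larger reward gaps in the humans' preferred direction; when two responses score equally the loss per pair is log 2, and it approaches zero as the preferred response pulls ahead.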

https://arxiv.org/pdf/2104.08691.pdf ↩

https://arxiv.org/pdf/2103.10385.pdf ↩

https://github.com/THUDM/P-tuning ↩

https://arxiv.org/pdf/2110.07602.pdf ↩

https://github.com/THUDM/P-tuning-v2 ↩

https://arxiv.org/pdf/2106.09685.pdf ↩

https://arxiv.org/pdf/2012.13255.pdf ↩

In the paper’s testing, adapting the query and value matrices of the transformer layers performed best. ↩

https://arxiv.org/pdf/2205.05638.pdf ↩

https://huggingface.co/docs/peft/en/conceptual_guides/ia3 ↩

As the name suggests, IA3, which stands for Infused Adapter by Inhibiting and Amplifying Inner Activations, is a variation of one of the first PEFT techniques, which utilized adapter modules added to transformer blocks for fine-tuning. ↩

In the paper, IA3 is implemented together with modifications to the T0-3B model’s loss function, adding a length normalization and an unlikelihood loss. However, these should be independent of the actual IA3 mechanism. ↩

It should be noted that IA3 was tested on T0-3B whereas LoRA was tested on GPT-3 175B (and several other models), and on different datasets. T0-3B is also an encoder-decoder style model whereas popular models like Llama and GPT are decoder-only, so it is not clear if the results carry over. ↩

OpenAI’s RLHF paper notes that, for their first iteration of RLHF, prompts were manually written by labelers rather than sampled. ↩

For the RM, OpenAI used a 6B SFT GPT-3 model with the final unembedding layer removed. ↩

Assuming there are 4 generated responses, these are ranked 1 > 2 > 3 > 4. Then pairs are chosen to train the RM, yielding 4C2 = 6 data points. ↩

https://huggingface.co/learn/deep-rl-course/unit8/intuition-behind-ppo ↩