7 Fine-Tuning Strategies for LLMs: Techniques, Tips and Best Practices
Essential techniques for enhancing the fine-tuning of LLMs. A deep dive into best practices and tips for driving optimization and performance.
If you are familiar with LLMs, you already know the concept of fine-tuning. Whether you've actually done it or just heard about it, you are already a step ahead of the game.
For those who are newbies in the LLM world and looking for a guide to developing an LLM application or RAG, you can refer to my previous post on Medium and navigate through the subsequent posts.
Now, for those of you who have already gotten hands-on and fine-tuned an LLM on specific data, I believe you've gone through plenty of guides on how to do it. This article is not another how-to post, though I do want to write a comprehensive one on fine-tuning efficiently, with performance comparisons and guidance on when to fine-tune versus when to use RAG or plain prompt engineering. However, that will be for future posts.
Back to our story: in my opinion, most fine-tuning work follows a fairly straightforward sequence of steps (a minimal code sketch follows the list below).
1. You have data
2. You select a framework such as LlamaIndex or LangChain (or any other framework)
3. Choose the base LLM, which could be GPT-3.5, Llama 2, or the two brand-new LLMs that are shocking the world: Mistral and Zephyr
4. Fine-tune the model on your data
5. Compare RAG performance with the newly fine-tuned LLM versus the base LLM
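To make the five steps concrete, here is a minimal sketch of the workflow, assuming an OpenAI-style fine-tuning API and a chat-formatted `train.jsonl` file you have already prepared; the file name and model choices are placeholders.

```python
# A minimal sketch of steps 1-5 above, assuming an OpenAI-style fine-tuning
# API and a chat-formatted train.jsonl you have already prepared (placeholder name).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Steps 1-2: you have data, prepared as one {"messages": [...]} record per line.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Steps 3-4: pick a base model and launch the fine-tuning job.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print("Fine-tuning job started:", job.id)

# Step 5: once the job finishes, compare the fine-tuned model against the
# base model (with and without RAG) on the same evaluation prompts.
```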
What I’ve observed is that many tend to take guides at face value, diligently applying them to their datasets without much inquiry.
However, there’s a crucial element in step 4 that often goes overlooked: the art of fine-tuning data efficiently. To rephrase, what exactly is the optimal strategy for tailoring the fine-tuning process to your unique dataset?
Taking a naive approach to fine-tuning can be perilous. Without a well-thought-out strategy and careful consideration of the data, there's a tangible risk of degrading the model's performance or inducing drastic changes in its behaviour. Such alterations can compromise the reliability of the model, potentially leading to outputs that diverge from intended results or lack consistency with the original training, or, even worse, you waste your money for nothing.
In this short article, we’re set to explore the diverse methods at hand for refining LLMs, providing you with an extensive array of tools to guide your choices. Instead of pushing a generic approach or adhering to a one-size-fits-all method, our goal is to showcase the spectrum of options, each designed to cater to the unique intricacies of varied datasets.
Overview
If you need a refresher on how to fine-tune LLMs, then I highly recommend perusing LlamaIndex's official documentation. I frequently use LlamaIndex for many of my RAG applications, and their guide on finetuning is exceptionally thorough, catering to both beginners and advanced users.
Why do we need Finetuning?
Adjusting a model through finetuning involves refining its parameters over a specific dataset to enhance its overall performance. In other words, it’s all about getting better results, cutting down on weird outputs, remembering data better, and saving time and money.
At the heart of our tools is something called in-context learning mixed with retrieval augmentation. This typically means leveraging the models during the inference stage rather than actively training them.
Furthermore, while finetuning can indeed bake external data into a model to "enrich" it, it is complementary to retrieval augmentation: the two can be combined to achieve different kinds of improvements.
Embedding Finetuning Benefits
Finetuning the embedding model can allow for more meaningful embedding representations over a training distribution of data, which leads to better retrieval performance.
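As a rough illustration, here is a minimal sketch of fine-tuning an open-source embedding model on (query, relevant passage) pairs with the sentence-transformers library; the base model name and the training pairs are placeholders.

```python
# A minimal embedding fine-tuning sketch with sentence-transformers.
# The base model and the (query, passage) pairs below are placeholders.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative base embedder

# Positive pairs drawn from your own corpus; in practice you would mine
# thousands of these (e.g. a question plus the chunk that answers it).
train_examples = [
    InputExample(texts=["What is the refund window?", "Refunds are accepted within 30 days of purchase."]),
    InputExample(texts=["Do you ship internationally?", "We ship to over 40 countries worldwide."]),
]

loader = DataLoader(train_examples, batch_size=2, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch passages act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned-embedder")  # use this model for retrieval afterwards
```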
LLM Finetuning Benefits
Allow it to learn a style over a given dataset
Allow it to learn a DSL that might be less represented in the training data (e.g. SQL)
Allow it to correct hallucinations/errors that might be hard to fix through prompt engineering
Allow it to distill a better model (e.g. GPT-4) into a simpler/cheaper model (e.g. GPT-3.5, Llama 2)
That is why we cannot settle for "just fine-tuning" the data; we need to experiment and select the most effective strategy for fine-tuning LLMs. At the time of writing, here are a few common techniques that I know:
Full fine-tuning approaches
Parameter-efficient fine-tuning methods (PEFT)
Prompt engineering strategies
Multi-task learning
Adapter-based fine-tuning
Meta-Adapters: Parameter Efficient Few-shot Fine-tuning
Sandboxed tuning environments
Getting a handle on these techniques paves the way for adapting LLMs to a wide range of applications effectively. First up, let's dive into full fine-tuning.
Full Fine-Tuning
This is the most straightforward approach: you tweak every single model parameter to optimize performance for a particular domain. A minimal sketch appears at the end of this section.
Approaches
Train every layer from start to finish using fresh data
Gradually unfreeze the foundational layers as training progresses
Apply regularization to keep weights from deviating too far from the pretrained values
Advantages
Optimal results in the desired area
Recognizes intricate patterns unique to the domain
When to Use
When there’s a lot of domain-specific data on hand
When top-notch accuracy is essential
When there’s wiggle room in the computational budget
Full fine-tuning is suitable when customization is the top priority.
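Here is a minimal full fine-tuning sketch, assuming the Hugging Face Transformers and Datasets libraries; the base model ("gpt2") and the single training example are placeholders standing in for your chosen LLM and your domain dataset.

```python
# A minimal full fine-tuning sketch with Hugging Face Transformers: every
# weight stays trainable. "gpt2" and the single training text are placeholders;
# in practice you would use your chosen base LLM and thousands of domain examples.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in for a real base LLM such as Llama 2
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny placeholder dataset; a real run would use your full domain corpus.
train_ds = Dataset.from_dict({"text": ["Q: What is our refund window?\nA: 30 days."]})
train_ds = train_ds.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,  # all parameters are updated: this is what "full" means
    args=TrainingArguments(
        output_dir="full-ft",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=2e-5,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because every weight is updated, memory and compute costs scale with the full model size, which is why the "when to use" points above stress having wiggle room in the computational budget.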
Parameter-efficient fine-tuning (PEFT)
Unlike full fine-tuning, PEFT modifies only a limited number of model parameters when fine-tuning. This ensures the broad knowledge from the pre-trained model remains intact.
Approaches
Adjust only the adapter layers added to the core model
Refine the upper layers, keeping the foundational layers static
Pinpoint vital parameters using preliminary tasks prior to adjustments
Advantages
Maintains the extensive skills of the primary model
Needs less data to tweak effectively
Offers computational savings
When to Use
As a preliminary step before comprehensive fine-tuning
When domain-specific data is sparse
In settings with limited resources
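One widely used PEFT method is LoRA (low-rank adaptation). Below is a minimal sketch with Hugging Face's `peft` library; the base model and the target modules are placeholders chosen to match GPT-2's layer names.

```python
# A minimal LoRA sketch with the `peft` library: only the small adapter
# matrices are trained, while the base weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for your base LLM

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
# `model` drops straight into the same Trainer setup used for full fine-tuning.
```

Only the small LoRA matrices receive gradients, so the broad knowledge of the frozen base model stays intact, matching the advantages listed above.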
Prompt Engineering for Fine-Tuning
Prompt engineering is the art of crafting specific prompts to steer the model during the fine-tuning process.
Approaches:
Include clear examples within prompts
Introduce new ideas via detailed walkthroughs
Favour descriptive language instead of plain labels
Refine prompts based on the model’s responses over iterations
Advantages:
Better at imparting new ideas
Offers a deeper understanding of the domain
Boosts clarity and uniformity in outputs
When to Use:
When there’s restricted flexibility in modifying training data or labels
To supplement datasets with illustrative examples
To elevate logical thinking in intricate domains
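To illustrate, here is a minimal sketch that converts raw (question, answer) pairs into chat-formatted fine-tuning records whose system prompt carries explicit instructions and a small worked example; the policy-assistant scenario, field names, and file name are made up for illustration.

```python
# A minimal sketch of prompt engineering for fine-tuning data: every training
# record gets a system prompt with explicit instructions and a worked example.
# The scenario, field names, and output file name are placeholders.
import json

SYSTEM_PROMPT = (
    "You are a support assistant for an e-commerce store. "
    "Answer in two sentences or fewer and cite the policy name you used. "
    "Example: Q: Can I return shoes after 30 days? "
    "A: No. Under the 'Standard Returns' policy, returns close at 30 days."
)

raw_pairs = [
    {
        "question": "Do you ship to Canada?",
        "answer": "Yes. Under the 'International Shipping' policy, delivery takes 7-10 days.",
    },
]

with open("prompt_engineered_train.jsonl", "w") as f:
    for pair in raw_pairs:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```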
Multi-Task Learning
Adjusting a model on various connected tasks at the same time can enhance its broad applicability.
Approaches:
Train collectively using combined datasets
Switch between small data batches from different datasets
Merge gradients from distinct optimizers
Advantages:
Enhances efficiency across multiple tasks
The model develops a more versatile understanding
When to Use:
When all tasks rely on similar features or data
When aiming for robust adaptability across different applications
To benefit from the synergies of multiple datasets
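Here is a minimal sketch of the "switch between small data batches from different datasets" idea in plain PyTorch: two toy tasks are interleaved so each optimization step sees a batch from each. The datasets and the tiny shared model are placeholders.

```python
# A minimal multi-task sketch in plain PyTorch: batches from two task-specific
# loaders are interleaved so every optimization step sees both tasks.
# The toy datasets and the tiny shared model are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

task_a = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
task_b = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader_a = DataLoader(task_a, batch_size=8, shuffle=True)
loader_b = DataLoader(task_b, batch_size=8, shuffle=True)

model = torch.nn.Linear(16, 2)             # shared model for both tasks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Round-robin over the tasks: one small batch from each per step.
for (xa, ya), (xb, yb) in zip(loader_a, loader_b):
    for x, y in ((xa, ya), (xb, yb)):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
```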
Adapter-Based Fine-Tuning
Adapter tuning involves integrating extra adjustable modules into the core model structure to make it more specialized.
Approaches:
Incorporate compact adapters after specific layers
Focus on training the adapters while keeping the core model unchanged
Combine different reusable adapters for diverse tasks
Advantages:
Focuses modifications to prevent major shifts in the main model
Offers agile and adaptable expansion into fresh domains
When to Use:
When regularly branching out to different domains
To uphold robust foundational performance
In scenarios with tight computational limits
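To make the idea concrete, here is a minimal bottleneck-adapter sketch in plain PyTorch; the hidden size and the (commented) freezing of a base model are illustrative assumptions.

```python
# A minimal bottleneck-adapter sketch in plain PyTorch: a small trainable
# module is added after a frozen layer's output, so only the adapter weights
# change during fine-tuning. Dimensions are illustrative.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection keeps the frozen layer's output intact when
        # the adapter contributes little, limiting drift from the base model.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Freeze the base model and train only the adapters (base_model is assumed):
# for p in base_model.parameters():
#     p.requires_grad = False
adapter = BottleneckAdapter(hidden_size=768)
print(sum(p.numel() for p in adapter.parameters()), "trainable adapter parameters")
```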
Meta-Adapters: Parameter Efficient Few-shot Fine-tuning
Meta-learning focuses on prepping models to swiftly adjust using minimal examples.
Approaches:
Expose the model to a wide variety of tasks during its early training phase
Develop an optimization method tailored for rapid fine-tuning
Set the initial model weights in a way that favours speedy adjustments
Advantages:
Enables learning even with a tiny set of domain-specific data
Quick pivot to unfamiliar tasks
When to Use:
For crucial projects where data is a rarity
In user-centric apps that demand swift localization adjustments.
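As a rough illustration of the "set initial weights to favour speedy adjustments" idea, here is a highly simplified, Reptile-style meta-learning sketch in PyTorch; the synthetic regression tasks and the tiny linear model are placeholders.

```python
# A highly simplified, Reptile-style meta-learning sketch: the model is briefly
# adapted to each sampled task, then the shared initialization is nudged toward
# the adapted weights so future fine-tuning needs only a few examples.
import copy
import torch
import torch.nn as nn

def sample_task():
    # Placeholder: each "task" is a tiny random regression problem.
    w = torch.randn(4, 1)
    x = torch.randn(16, 4)
    return x, x @ w

model = nn.Linear(4, 1)
meta_lr, inner_lr = 0.1, 0.01

for step in range(100):
    x, y = sample_task()
    fast = copy.deepcopy(model)
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(5):                      # a few inner adaptation steps
        opt.zero_grad()
        nn.functional.mse_loss(fast(x), y).backward()
        opt.step()
    # Outer update: move the meta-initialization toward the adapted weights.
    with torch.no_grad():
        for p, fp in zip(model.parameters(), fast.parameters()):
            p += meta_lr * (fp - p)
```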
Sandboxed Fine-Tuning Environments
This is more of a risk-control approach. One of the main issues with fine-tuning is that the model learns whatever is in the training dataset. If a company fine-tunes on a bunch of dialogues and conversations between customers and the support team, there may be cases where a customer is toxic and speaks to the support agent inappropriately. Having a sandbox and testing separately is a way to keep those toxic conversations out of the training data (a minimal filtering sketch appears at the end of this section).
Approaches:
Keep the training hub distinct from the live production setting
Use virtualization to separate and protect against unfamiliar data and code
Adopt containerization to ensure a uniform and repeatable environment
Advantages:
Minimizes dangers linked to introducing fresh code and data
Shields the ongoing production processes
Guarantees stable and reliable tuning settings
When to Use:
When dealing with ultra-sensitive production information
In sectors with tight regulations, like healthcare
On platforms designed for developing models across multiple users or tenants
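As a small illustration of the data-hygiene side of sandboxing, here is a minimal sketch of the filtering step you might run inside the sandbox before anything touches production: toxic dialogues are dropped from the training set. The keyword list is a crude stand-in for a real toxicity classifier, and the sample dialogues are made up.

```python
# A minimal sketch of sandboxed data filtering: drop toxic dialogues before
# they ever reach a fine-tuning job. The keyword set is a placeholder for a
# real toxicity classifier, and the dialogues are illustrative.
import json

TOXIC_MARKERS = {"idiot", "stupid", "shut up"}  # placeholder heuristic

def is_clean(record: dict) -> bool:
    text = " ".join(m["content"].lower() for m in record["messages"])
    return not any(marker in text for marker in TOXIC_MARKERS)

# Placeholder raw dialogues; in practice these come from your support logs.
raw_dialogues = [
    {"messages": [{"role": "user", "content": "Where is my order?"},
                  {"role": "assistant", "content": "It ships tomorrow."}]},
    {"messages": [{"role": "user", "content": "You idiot, this is broken!"},
                  {"role": "assistant", "content": "I'm sorry to hear that."}]},
]

with open("sandboxed_train.jsonl", "w") as out:
    for record in filter(is_clean, raw_dialogues):
        out.write(json.dumps(record) + "\n")  # only clean dialogues reach training
```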
Evaluating Fine-Tuned LLMs
There are multiple ways to evaluate fine-tuned LLMs, but the main approach is to compare the results from the fine-tuned LLM against the base LLM and the base LLM + RAG (a minimal side-by-side sketch appears after the metrics below).
According to LlamaIndex, these are the metrics that are generally used to evaluate performance.
Quantitative and Qualitative Response Evaluation
Retrieval Evaluation
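For the response-evaluation side, here is a minimal sketch of side-by-side answer collection, assuming an OpenAI-style API: the same questions go to the base model and to the fine-tuned model so the outputs can be scored afterwards, manually or with an LLM-as-judge. The model IDs and the evaluation question are placeholders.

```python
# A minimal side-by-side evaluation sketch, assuming an OpenAI-style API.
# The fine-tuned model ID and the evaluation question are placeholders.
from openai import OpenAI

client = OpenAI()
eval_questions = ["What is your refund policy for damaged items?"]
models = {
    "base": "gpt-3.5-turbo",
    "fine_tuned": "ft:gpt-3.5-turbo:my-org:support:abc123",  # hypothetical ID
}

for question in eval_questions:
    for label, model_id in models.items():
        reply = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": question}],
        )
        print(label, "->", reply.choices[0].message.content)
```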
If you want to know more about how to evaluate your RAG application, you can refer to this post. Apart from fine-tuning techniques, consider these techniques to improve your RAG performance.
Summary
There is no perfect solution or one-size-fits-all approach when it comes to fine-tuning. You need to understand your data and may go through different experiments to see what is the best fit.
Fine-tuning is powerful. You can fine-tune GPT-3.5 on your data to get responses that, on your own domain, can surpass GPT-4. In addition, regular fine-tuning on your data combined with RAG is one of the best combos for getting quality answers from your LLM application.
Throughout this article, we’ve explored various strategies to fine-tune LLMs securely, all aimed at boosting performance in specific domains. While I only listed 7 techniques, I believe there are a lot more. If you feel there’s a critical approach I’ve overlooked, or if there’s something specific you’d love to see covered, please drop your thoughts in the comment section below. Your insights enrich our collective understanding!
In conclusion, by integrating these techniques, you can seamlessly tailor LLMs to meet the distinct requirements of your organization. Proper fine-tuning truly unleashes the immense capabilities of foundational models.
And don't ever forget to assess your fine-tuned LLMs. It is crucial. So always remember: "never skip the evaluation."
❤️ If you found this post helpful, I'd greatly appreciate your support by giving it a clap. It means a lot to me and demonstrates the value of my work. Additionally, you can subscribe to my Substack, as I will cover more in-depth LLM development on that channel.
Want to Connect?
If you need to reach out, don't hesitate to drop me a message via my Twitter or LinkedIn, and subscribe to my Substack, as I will cover more learning practices, especially the path of developing LLMs in depth, on that channel.
References
LlamaIndex: Finetuning
LlamaIndex: Evaluation