Introduction
Large language models (LLMs) have become increasingly valuable for answering questions in specialized domains, such as medical or legal documents. To improve their performance, it is common to inject domain-specific knowledge into LLMs through techniques such as retrieval-augmented generation (RAG) or fine-tuning. In this blog post, we explore a fine-tuning technique known as Retrieval Augmented Fine Tuning (RAFT) and evaluate its effectiveness in adapting pre-trained LLMs for RAG in specialized domains.
RAG today
RAG is a method to improve LLMs when dealing with knowledge that was not "baked in" during the pre-training stage. This usually involves specific domains or more up-to-date information. A common way to build a RAG system is to retrieve document chunks from a vector store and inject them directly into the LLM prompt. A typical prompt would look like this:
"Context information is below:\n{contexts}\nGiven the context information and not prior knowledge, answer the query.\nQuery: {question}\nAnswer: "
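As a minimal sketch, the prompt construction might look like this in Python (the `vector_store` object and its `search` method are placeholders, not a specific library's API):

```python
# Minimal sketch of injecting retrieved chunks into an LLM prompt.
# `vector_store` and its `search` method are illustrative placeholders.

PROMPT_TEMPLATE = (
    "Context information is below:\n{contexts}\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {question}\nAnswer: "
)

def build_rag_prompt(vector_store, question: str, top_k: int = 3) -> str:
    # Retrieve the top-k most relevant chunks for the question.
    chunks = vector_store.search(question, top_k=top_k)
    contexts = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(contexts=contexts, question=question)
```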
Take a look at our RAG in 4 lines of code guide.
While these systems are easy to build, there may still be room for additional performance gains. The debate revolves around whether RAG or fine-tuning is preferable for a given use case. A recent paper, RAFT, studies this problem and proposes a novel method to adapt a pre-trained LLM by fine-tuning it on retrieval-augmented question answering (QA) data.
What is RAFT?
Retrieval Augmented Fine Tuning (RAFT), introduced by Zhang et al., is a method designed to improve the performance of LLMs in specific domains. RAFT improves response quality by leveraging Chain of Thought (CoT) responses generated from the provided data. In essence, RAFT refines a model's reasoning and response generation capabilities by distilling from a large pre-trained model: the large model generates CoT responses, and a smaller, specialized model is then fine-tuned on them. This yields high-quality CoT training data, which significantly improves model performance. In doing so, RAFT bridges the gap between general-purpose LLMs and the specialized knowledge required for specific domains.
Figure 1: Example of an LLM prompt used to generate CoT responses with explanations, given the relevant context along with a set of distractor documents.
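For illustration, a single RAFT-style training example could be assembled roughly as follows. The prompt wording and the `generate_cot_answer` helper (standing in for a call to the larger teacher model) are assumptions, not the exact template from the paper:

```python
import random

# Sketch of one RAFT-style training example: the question is paired with the
# oracle (golden) chunk plus distractor chunks, and the target is a
# chain-of-thought answer produced by a larger "teacher" model.

def make_raft_example(question, oracle_chunk, distractor_chunks, generate_cot_answer):
    documents = distractor_chunks + [oracle_chunk]
    random.shuffle(documents)  # avoid positional shortcuts
    context = "\n\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(documents))
    prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer the question using the documents above. "
        "Explain your reasoning step by step and quote the relevant document."
    )
    # The teacher model (e.g. a 70B instruct model) produces the CoT target.
    completion = generate_cot_answer(question, oracle_chunk)
    return {"prompt": prompt, "completion": completion}
```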
Why use RAFT?
One of the main advantages of RAFT is its ability to adapt chat or instruct models without needing to re-align them for chat functionality. This saves time and resources that would otherwise be spent re-aligning the model for conversational use. By focusing on domain-specific fine-tuning, RAFT ensures that the LLM generates more accurate and contextually relevant responses.
The original RAFT paper presents experiments using the Llama2-7B model, demonstrating its effectiveness across various specialized domains. In particular, while RAG often improves QA performance compared to using an LLM alone, fine-tuning and RAFT consistently outperform RAG by a larger margin.
This begs the question: how does RAFT work with newer models like Llama3 8B? By comparing these models, we can gain insight into the scalability and improvements offered by the latest advancements in LLMs.
How does RAFT perform in newer LLMs?
The published code for RAFT is available in this GitHub repository. We use all the default settings with a few small changes:
- Although the paper uses GPT-4 to generate the questions and answers, we chose the Llama3-70B-Instruct model since we host it ourselves.
- We generated 1 question per chunk and included 3 distractor documents per data point.
- Instead of full supervised fine-tuning, we use LoRA.
For the data, we use the HotpotQA dataset, specifically the context chunks from the dev set, to create the data points (i.e., questions and CoT answers). The original questions and answers from the HotpotQA dataset are not included in the generated data, so the model cannot simply memorize them. For the sake of time, we created samples from only 100 chunks. The resulting dataset is available on Hugging Face.
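Below is a rough sketch of how such chunks could be sampled. The field names follow the Hugging Face `hotpot_qa` dataset, while the chunk format and sampling logic are illustrative rather than the exact pipeline from the RAFT repository:

```python
import random
from datasets import load_dataset

# Pull context paragraphs from the HotpotQA dev split to seed RAFT data generation.
dev = load_dataset("hotpot_qa", "distractor", split="validation")

# Flatten each example's paragraphs into plain-text chunks.
chunks = []
for example in dev:
    for title, sentences in zip(example["context"]["title"],
                                example["context"]["sentences"]):
        chunks.append(f"{title}: {' '.join(sentences)}")

random.seed(0)
sampled = random.sample(chunks, 100)  # we only used 100 chunks
NUM_DISTRACTORS = 3                   # 3 distractor documents per data point

data_points = []
for oracle in sampled:
    distractors = random.sample([c for c in chunks if c != oracle], NUM_DISTRACTORS)
    # A question and CoT answer would then be generated from `oracle`
    # (e.g. with Llama3-70B-Instruct), with the distractors mixed into the prompt.
    data_points.append({"oracle": oracle, "distractors": distractors})
```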
Since we focus on compute-constrained environments, we are interested in models in the 7-8B range or smaller. As such, we selected the Llama3 8B and Llama3.1 8B Instruct models and their 4-bit quantized variants for our experiments.
We also compare the results using Llama2-7B-chat as a starting point. For training we use the TRL SFTTrainer. For evaluation, we use EleutherAI's lm-evaluation-harness and evaluate the fine-tuned models on the HotpotQA validation set (1k samples) on a single NVIDIA A100-SXM4-40GB.
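For reference, here is a condensed sketch of this training setup: a 4-bit quantized base model, LoRA adapters, and TRL's SFTTrainer. The dataset path, column name, and hyperparameters shown are illustrative, and argument names may vary slightly across TRL versions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the base model with 4-bit (NF4) quantization to keep GPU memory low.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters instead of full-parameter fine-tuning.
peft_config = LoraConfig(
    r=64, lora_alpha=16, target_modules="all-linear", task_type="CAUSAL_LM"
)

# RAFT-style dataset with one "text" column holding prompt + CoT answer
# (file name is illustrative).
train_ds = load_dataset("json", data_files="raft_hotpotqa.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="llama3-8b-raft",
        dataset_text_field="text",
        num_train_epochs=3,
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```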
Results
Figure 2 below shows the F1 scores of the fine-tuned and pre-trained models. We observed a significant performance increase from fine-tuning on RAFT-style data for most of the models tested. Most notably, the increase was over 60% for the Llama3 variants and over 100% for Llama2 7B. Fine-tuning Llama3.1 8B, by comparison, yields a more modest 16% increase.
By using 4-bit quantized variants of the Llama3 models, we were able to retain 91-94% of the performance while using only 25% of the GPU memory dedicated to the model weights.
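As a back-of-envelope check on that 25% figure (weights only, ignoring quantization constants, activations, and the KV cache):

```python
# Rough weight-memory estimate for an 8B-parameter model.
params = 8e9
bf16_gb = params * 2 / 1e9    # 16-bit weights: ~16 GB
int4_gb = params * 0.5 / 1e9  # 4-bit weights:  ~4 GB
print(int4_gb / bf16_gb)      # ~0.25 -> roughly 25% of the 16-bit footprint
```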
For the LoRA configuration, we found that targeting all linear modules ("all-linear") is more effective than targeting only a subset of modules. Furthermore, a higher LoRA rank (64) yields higher scores than a lower rank (16). We report the best scores obtained after tuning these hyperparameters.
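As a rough illustration, the two settings can be expressed as peft LoraConfig objects. The explicit module list shown for the subset variant is just one common choice of attention projections (not necessarily the exact subset we tried), and lora_alpha is an assumed value:

```python
from peft import LoraConfig

# Targeting only a subset of modules (attention projections shown here as an
# example subset) with a lower rank...
subset_config = LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# ...versus targeting all linear layers with a higher rank, which scored
# better in our runs.
all_linear_config = LoraConfig(
    r=64, lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```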
Figure 2: F1 scores of fitted (blue) and pre-trained (orange) models evaluated on 1000 samples from the HotpotQA development set
Discussions and limitations
Initial runs showed that CoT responses were cut off with max_new_tokens=512. Setting max_new_tokens=800 allowed the models to generate complete CoT responses, nearly doubling performance over the lower setting, but at the cost of more GPU time and memory.
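As an illustration, with the transformers generate API the limit would be raised like this (the checkpoint path and prompt are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "llama3-8b-raft"  # illustrative path to a fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, device_map="auto")

prompt = "Context information is below: ...\nQuery: ...\nAnswer: "  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 512 new tokens truncated many CoT answers; 800 leaves room for the full
# reasoning chain plus the final answer.
outputs = model.generate(**inputs, max_new_tokens=800)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```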
Time and cost are also important factors to consider. Generating the dataset (100 rows) takes approximately 30 minutes. At the current inference price ($0.0012/request) and 2 calls per row, the dataset costs $0.24. Once we have the dataset, fine-tuning the model takes on average ~10 minutes. At the current training price ($4/hour), training costs $0.67. The fine-tuned model costs less than $1 from start to finish! Of course, some datasets may have different training needs, and tuning hyperparameters could also increase the cost.
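The arithmetic behind the sub-$1 figure, at the prices quoted above:

```python
# Back-of-envelope cost for one RAFT dataset + fine-tune.
rows = 100
calls_per_row = 2                 # question + CoT answer
price_per_request = 0.0012        # USD
data_cost = rows * calls_per_row * price_per_request     # $0.24

train_minutes = 10
train_price_per_hour = 4.0        # USD
train_cost = train_minutes / 60 * train_price_per_hour   # ~$0.67

print(round(data_cost + train_cost, 2))  # ~0.91 USD, i.e. under $1
```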
We used Llama3-70B-Instruct as the question and answer generator. Higher-ranked models on the LMSYS Chatbot Arena leaderboard could generate better-quality questions and answers.
What’s next?
RAFT appears to be an effective method for adapting smaller LLMs to domain-specific data. Starting from context chunks, CoT questions and answers can easily be generated with RAFT to form a fine-tuning dataset for instruct models. This not only eliminates the need to re-align a fine-tuned base model for chat, but also dramatically reduces the amount of data needed for fine-tuning overall. If you would like RAFT to be available on the Clarifai platform, send us a message in our Community Discord Channel!