LLM Fine-Tuning: A Guide for Domain-Specific Models
Source: DigitalOcean
By Adrien Payong and Shaoni Mukherjee

Large language models have become highly capable, but off-the-shelf models often fall short for specific domains and applications. LLM fine-tuning is the process of further training a pre-trained LLM on a custom dataset to specialize it for a particular task or domain. Fine-tuning lets you infuse domain knowledge, align a model’s tone and style with your brand, and push task performance beyond what general models deliver. Because it builds on the model’s existing knowledge, it avoids the massive cost of training a model from scratch.

Base models are more powerful than ever, but getting real value from them usually requires customization. Fine-tuning helps your model speak your company’s jargon, understand your niche context, and meet strict accuracy or tone guidelines. Fine-tuning a smaller model for your use case can also be far cheaper than calling a large generic model via an API for every request. In this crash course, we’ll cover the concepts, the tools, parameter-efficient fine-tuning (LoRA, QLoRA), best practices, and real-world examples.

Before diving into the workflow, let’s cover some foundational concepts and terminology around LLM fine-tuning.

Pre-training is the initial training of an LLM on a broad corpus using self-supervised learning. This is when the model learns to model language in general, for example by predicting the next word across billions of sentences. Pre-training is unsupervised and extremely expensive (think of the enormous compute budgets behind GPT-scale models).

Fine-tuning happens after pre-training and is a form of transfer learning. You take the pre-trained model (which is “generally knowledgeable”) and train it further on a smaller, labeled dataset for a more specific task. Fine-tuning is a supervised learning process: you give the model example inputs and the desired outputs (the “ground truth” for that task) and adjust its weights to produce those outputs. For example, after pre-training on a large slice of the internet, you could fine-tune a model on a dataset of legal question/answer pairs to build a legal assistant.

Alignment is a collection of training steps that adjust a model’s behavior to better match human intents, ethics, or preferences. The best-known alignment technique is Reinforcement Learning from Human Feedback (RLHF). In RLHF, after supervised fine-tuning, human evaluators provide feedback on the model’s outputs, and the model is then trained further to produce outputs that would be rated higher. This makes the model not only more task-effective, but also more helpful, harmless, and honest, as defined by human reviewers. Alignment typically works by first training a reward model (which scores outputs), then fine-tuning the LLM with reinforcement learning to optimize for that reward score.

To recap: pre-training equips the model with general capabilities, fine-tuning teaches it the skills for specific tasks, and alignment techniques such as RLHF adjust its behavior to make it appropriate and safe for its users. The distinction between these phases can be blurry (instruction tuning, for example, can be described as both fine-tuning and alignment), but it is still helpful to keep the differences in mind.

Continuous pre-training (also known as domain-adaptive pre-training) is a related approach: you continue training the model on unlabeled data from the target domain so it absorbs the jargon, and then do supervised fine-tuning.
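To make the distinction concrete, here is a minimal sketch of what the two kinds of training data typically look like on disk. The field names and example text are illustrative assumptions, not a required format:

```python
import json

# Continued (domain-adaptive) pre-training uses raw, unlabeled domain text;
# the model simply keeps learning to predict the next token on it.
pretraining_record = {"text": "The indemnifying party shall hold harmless the other party from ..."}

# Supervised fine-tuning uses labeled input/output pairs instead.
# The field names ("prompt", "response") are illustrative; use whatever
# format your training framework expects.
sft_record = {
    "prompt": "Does this clause limit liability for consequential damages?",
    "response": "Yes. Liability is capped at the fees paid, and consequential damages are excluded.",
}

# Both are commonly stored as one JSON object per line (JSONL).
print(json.dumps(pretraining_record))
print(json.dumps(sft_record))
```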
Continuous pre-training differs from regular fine-tuning in that it is unsupervised; it is essentially an extension of the original pre-training with specialized text. Continuous pre-training deepens the model’s domain knowledge, while fine-tuning sharpens its performance on a specific task.

Supervised fine-tuning (SFT) is the simplest kind of fine-tuning: you have pairs of inputs and outputs, and you train the model to map the inputs to the desired outputs. The outputs could be classification labels, expected continuations of a prompt, and more. Fine-tuning GPT-3 on a dataset of customer emails (input) and best answers (output) would be supervised fine-tuning; the model learns to take the email as input and produce the correct response. SFT requires a substantial amount of high-quality labeled data (which can be expensive to create), but it works very well for well-defined tasks.

Instruction tuning is a specific case of SFT where the dataset contains instructions and ideal responses. The purpose of this type of fine-tuning is to improve the LLM’s ability to follow natural language instructions. In practice, for most applications today, you will likely start from an instruction-tuned base model and fine-tune it further on your domain’s instructions (in effect, domain-specific instruction tuning). For example, you might start with an “instruct” version of a model (say, Llama-2-13b-chat) and fine-tune it on your company’s Q&A pairs. The model already knows how to respond to an instruction; now you teach it how to give your kind of answers. This works better, and requires less data, than fine-tuning a raw base model, because the model already has a general ability to follow prompts.

One of the major challenges with fine-tuning LLMs is their size. A “full” fine-tuning re-trains all of the parameters in the model. For a 7B model, that means updating billions of weights; for 70B and larger models, an order of magnitude more. This means huge GPU memory requirements just to hold the model weights and optimizer states, as well as a risk of overfitting or catastrophic forgetting of the model’s pre-trained capabilities.

Enter parameter-efficient fine-tuning (PEFT): a set of techniques that tune only a small portion of the model’s parameters, drastically reducing resource requirements. With PEFT, instead of modifying 100% of the weights, you add small adapter weights or rank-decomposition matrices and train only those, leaving the original model weights frozen. This results in far fewer parameters to update (often less than 1% of the total), lower memory use, and the ability to fine-tune very large models on a single GPU.

Two popular PEFT methods are LoRA and QLoRA. LoRA (Low-Rank Adaptation) injects small trainable low-rank matrices alongside the frozen weights; QLoRA combines LoRA with loading the frozen base model in 4-bit quantized form, cutting memory requirements even further. Beyond LoRA/QLoRA, PEFT also covers other methods such as adapters (small feed-forward modules inserted at each transformer block, with only these being trained while the main weights stay frozen) and prompt tuning (learning soft prompt vectors). However, LoRA-style methods are by far the most prevalent approach to fine-tuning LLMs because of their good balance of simplicity and effectiveness. We will show how to use them in the workflow below.

Before investing in fine-tuning, weigh whether it is the right solution at all: it is a powerful technique, but not always the best fit. Consider how it compares to other approaches. Prompt engineering is the process of writing the input to a model in a way that steers the model’s output.
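To make that concrete, here is a small, self-contained sketch of the few-shot flavor of prompt engineering; the task and examples are invented for illustration:

```python
# A minimal few-shot prompt: we steer the model purely through the input,
# without touching its weights. The task and examples are made up.
examples = [
    ("Refund not received after 10 days", "Billing"),
    ("App crashes when I open settings", "Bug report"),
]

def build_prompt(ticket: str) -> str:
    # Concatenate the instruction, the worked examples, and the new input.
    shots = "\n".join(f"Ticket: {t}\nCategory: {c}" for t, c in examples)
    return (
        "Classify each support ticket into a category.\n\n"
        f"{shots}\n"
        f"Ticket: {ticket}\nCategory:"
    )

print(build_prompt("I was charged twice this month"))
```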
Prompt engineering does not change the model’s parameters. It is fast to iterate and requires no training: you just write instructions or examples. It is also resource-efficient (no GPUs needed). The downside is that prompts can hit a ceiling: you might run up against the context-length limit, or the outputs can be inconsistent or inaccurate for complex tasks.

Fine-tuning changes the model’s weights by training it on labeled examples, which allows much deeper customization. A fine-tuned model can produce the behavior you want without a long prompt each time, because it has learned that behavior. The trade-off is that fine-tuning requires GPU compute and high-quality training data. In practice, prompt engineering works well for prototyping and for simple use cases or adjustments, while fine-tuning is more effective for longer-lasting, more robust changes once you have a well-defined task and the data to train on. The two approaches are not mutually exclusive: many projects start with prompt modifications and move to fine-tuning when prompts alone cannot reach the desired level of accuracy or consistency.

Retrieval-Augmented Generation (RAG) is another option: rather than modifying the model, you give it access to an external knowledge source. When queried, a RAG system searches for relevant documents and pulls them into the prompt. This keeps the model current with the latest knowledge and can mitigate hallucinations by grounding answers in retrieved text. RAG shines when you need up-to-date knowledge or when your data is too large or volatile to bake into the model. Fine-tuning, in contrast, bakes domain knowledge into the model’s weights: the model becomes a self-contained expert that no longer needs to look up information to handle familiar situations. That gives low-latency responses (no retrieval at run time) and lets the model internalize subtler aspects of the data, such as contextual nuance and style. However, the knowledge in a fine-tuned model is static: if the data changes, you must retrain to refresh it. Fine-tuning also does not inherently give the model sources to cite, whereas a RAG system can cite the documents it retrieved. For many applications a hybrid works best: fine-tune an LLM to give it good base behavior (it already follows instructions and speaks your domain’s jargon), then use RAG to supply up-to-date facts. Sometimes you can avoid intensive fine-tuning altogether by pairing an LLM with tools; for example, instead of fine-tuning a model to perform complex math, have it call an API for the hard part (an agent approach).

The rest of this guide walks you through an eight-step workflow for fine-tuning an LLM, from planning all the way to deployment.

Every fine-tuning project should begin with a well-defined objective. What are you trying to build? A contract analysis assistant? A customer support chatbot? A code generation helper? Define the use case as precisely as possible; it will inform every other decision (data, model choice, and so on). Along with the use case, define success criteria: pick metrics or evaluation criteria that capture the desired behavior of the model.

Next, choose which base LLM you would like to fine-tune.
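Before committing to one, it can help to run a handful of representative prompts through each candidate. Here is a minimal sketch using the Hugging Face transformers library; the model id and prompts are placeholders for your own candidates and task examples:

```python
# Quick sanity check of a candidate base model on a few representative prompts
# before committing to fine-tuning it. Model id and prompts are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # candidate base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompts = [
    "Summarize the termination clause in one sentence: ...",
    "Classify this support ticket: 'I was charged twice this month'",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```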
The base model you choose is critically important; you want one that is A) capable enough for the task at hand, B) licensed for your intended use, and C) reasonably fine-tunable given your hardware. The following table presents some considerations to weigh during this step.

Good data, tailored to your task, is the key to success, and data collection and preparation are the most time-consuming part of the project. Sub-steps include data collection, cleaning, and formatting. The table below gives a high-level view of the end-to-end workflow for preparing data to fine-tune a large language model. It walks you through three main phases: (1) collecting data from all sources (domain documents, task demonstrations, synthetic data, and public datasets), (2) cleaning and preprocessing that data for quality, privacy, and balance, and (3) formatting the data into model-ready input–output pairs that mirror how the model will be prompted in production.

Now that you have data and a model, how exactly will you fine-tune? The table below compares the most common strategies for adapting large language models: full fine-tuning, parameter-efficient fine-tuning (PEFT, including LoRA and QLoRA), in-context learning, and hybrid approaches.

With the strategy in place, set up the environment to run fine-tuning. The table below summarizes the practical environment setup for LLM fine-tuning: hardware requirements, core libraries and frameworks, optional managed platforms, and a typical workflow for configuring and testing your training script.

Time to fine-tune! This step is running the actual training process and tuning the hyperparameters so the model learns well. Here we present the key training-loop hyperparameters and operational practices for fine-tuning LLMs.

Once training finishes, the next step is to evaluate your fine-tuned model against the success criteria defined in step 1. Evaluation should include both quantitative metrics and qualitative analysis.

The final step is to put your fine-tuned model into production. Deployment for an LLM means serving inference queries at the needed scale and integrating the model with your application. The table below summarizes how to deploy and serve a fine-tuned LLM in production.

Let’s put together a high-level template for a PEFT fine-tuning project; it ties together many of the steps above in a pseudo-code/checklist style, and a consolidated code sketch follows the checklist:

Set MODEL_NAME (e.g., “mistralai/Mistral-7B-Instruct-v0.2”).
2. Load the model in 4-bit and add LoRA. Here, prepare_model_for_kbit_training performs various recommended steps (gradient checkpointing, casting layer norms to fp32, etc.) for QLoRA stability.
4. Run the training loop (using the HF Trainer or a custom loop). Use gradient accumulation to reach an effective batch size of 32. Save checkpoints regularly (every 50 steps) and keep the last two. Evaluate each epoch on val_dataset if available.
After training, load the best model (the trainer should have saved it, or use the last checkpoint) and compute metrics if you have structured outputs or references.
6. Save the LoRA adapter (or a merged model). By default, get_peft_model wraps the base model, so a call to save_pretrained saves a config plus the LoRA weights (not the base weights) to adapter_model.bin or similar; you need the base model weights separately to use the adapter. Alternatively, merge the adapter to get a standalone model; this produces a directory containing the full merged model (base + adaptation).
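Tying those checklist items together, here is a minimal, consolidated sketch of steps 2 through 6 using the Hugging Face transformers, datasets, peft, and bitsandbytes libraries. The model name, dataset layout, LoRA target modules, and hyperparameters are illustrative assumptions, not recommendations:

```python
# Minimal QLoRA fine-tuning sketch. Model name, dataset fields, LoRA settings,
# and hyperparameters are illustrative; adjust them for your task.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

# 2. Load the base model in 4-bit and attach LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # grad checkpointing, fp32 norms, etc.

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Mistral-style names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights

# Dataset assumed to have a "text" field of already-formatted prompt+response strings.
dataset = load_dataset("json", data_files={"train": "train.jsonl", "val": "val.jsonl"})

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

# 4. Training loop via the HF Trainer; accumulation gives an effective batch of 4 * 8 = 32.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=50,
    save_total_limit=2,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["val"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# 6. Save just the small LoRA adapter, or merge it into the base weights
#    for a standalone model (merging a 4-bit base may require re-loading it
#    in higher precision on some peft versions).
model.save_pretrained("outputs/lora_adapter")
merged = model.merge_and_unload()
merged.save_pretrained("outputs/merged_model")
```

In a real project you would also run a proper evaluation pass on your held-out set (the evaluation step above) before deciding whether to ship the small adapter or the merged model.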
Be careful with memory when merging: you need the whole model in memory at once. 8. Testing: run final tests in a staging environment or on a subset of real data if possible, then deploy. The template above omits some details (the exact data collation function, any custom generation settings, …), but it is a pattern you can use as a starting point for most tasks.

Fine-tuning is not just a theoretical exercise: many organizations are using it to unlock value in specific applications. Let’s look at a few use cases.

Suppose an organization has years of customer support logs: emails, chat transcripts, FAQ articles, and so on. They want an AI assistant that can quickly and consistently answer customer questions using that existing data. GPT-4 and comparable open-source models can answer almost any general question, but they know nothing about this organization’s internal product specifications, policies, or past resolutions. Fine-tuning an LLM on past support tickets and their resolutions effectively creates a custom support-domain specialist for the organization.

Legal and compliance documentation is a classic example of expert knowledge expressed in niche jargon and subtly defined concepts. A general-purpose LLM will not have prior knowledge of your company’s particular contract language, policies, or compliance obligations. By fine-tuning on your domain’s corpus (contracts, policy manuals, regulatory documents, etc.), you can build a model with that expertise. For example, you could fine-tune on a large body of contract text and then ask the model, “Does this draft contract have a non-compete clause? If so, summarize what restrictions it imposes,” with greater accuracy than a generic model, because it has seen many clause variations during training and learned how to extract and interpret them.

AI coding assistants for software developers are already widely used, but most are trained on general code and documentation; internal company frameworks, libraries, and codebase details are not necessarily represented in general-purpose LLMs. Fine-tuning an LLM on your own codebase and documentation gives you a code assistant that is an expert in your stack.

Fine-tuning LLMs is a powerful technique, but it can go badly wrong if not done carefully, and there are common antipatterns worth avoiding.

LLM fine-tuning used to be a niche optimization step, but it is quickly becoming the de facto method for turning powerful base models into reliable, domain-specific systems. By starting from pre-trained capabilities instead of training from scratch, you can imbue the model with your own data, tone, and constraints while keeping compute and engineering effort under control. The combination of supervised fine-tuning, instruction tuning, and alignment techniques such as RLHF provides a toolkit for shaping both what the model knows and how it behaves. Parameter-efficient methods such as LoRA and QLoRA make it possible to adapt massive models with modest GPUs and a tiny fraction of trainable parameters, which drastically lowers the barrier to experimentation. Combined with a principled decision framework, you can select the right technique for each use case instead of defaulting to the most expensive option.
Effective LLM fine-tuning is less about any single trick and more about a disciplined lifecycle: define your use case → choose a suitable base model → curate high-quality data → pick a strategy (full fine-tuning or PEFT) → train with sane hyperparameters → evaluate rigorously → deploy with monitoring, versioning, and rollbacks in place. If you treat fine-tuning as an iterative product process rather than a one-off experiment, you can turn generic LLMs into dependable, high-ROI components of your stack.

Thanks for learning with the DigitalOcean Community.