Generic large language models are impressive generalists. They can write poetry, summarize legal briefs, and generate working code in dozens of languages. But when we asked one to classify device management policy violations according to a client’s internal taxonomy, it hallucinated categories that did not exist. When we asked another to extract structured data from trucking compliance documents, it confidently returned fields that the schema never defined.
This is the gap that pushed us into fine-tuning. Our consulting engagements increasingly demand models that understand narrow, domain-specific vocabularies (pharmaceutical quality control terms, fleet management regulations, internal policy hierarchies) with high precision and zero tolerance for creative interpretation. Calling a frontier API works during prototyping, but in production it introduces latency, ongoing per-token costs, and a dependency on external infrastructure that many enterprise clients are unwilling to accept.
We needed models that are smaller, faster, and cheaper to run, while being more accurate on the specific tasks our clients care about. That led us to Unsloth AI, an open-source library that makes fine-tuning practical on hardware we already had access to.
Why Unsloth
Fine-tuning a 7B parameter model with standard HuggingFace and PEFT tooling requires significant GPU memory and patience. A LoRA fine-tune of a 7B model with that stack can consume 40GB+ of VRAM and take hours even on an A100. For a consulting firm running multiple client experiments in parallel, those resource requirements add up fast.
Unsloth changes the economics. It delivers 2 to 5x faster training and 60 to 80 percent lower VRAM consumption compared to the standard stack. It achieves this through custom CUDA kernels for attention and MLP layers, optimized backward passes that reduce memory allocation overhead, and intelligent gradient checkpointing that trades minimal compute for substantial memory savings. The library patches HuggingFace models in place, so the workflow remains familiar: you still write standard Trainer configs and use standard datasets.
For us, the practical impact is straightforward: we can fine-tune a 3B model on a free Google Colab T4 GPU, and a 7B to 9B model on a single A100 or an affordable RunPod instance. Per-client fine-tuning experiments cost single-digit dollars instead of triple digits. That makes it feasible to iterate quickly, test hypotheses, and discard approaches that do not work without budget anxiety.
Choosing the Right Base Model
We have spent considerable time comparing base models across our client workloads. The three families we return to most often are Gemma 2, Qwen 2.5, and Llama 3. Each has distinct strengths, and the right choice depends less on benchmark leaderboards than on where the model will actually run.
Qwen 2.5 3B has become our default starting point for structured output tasks. It handles JSON generation, classification, and entity extraction with surprising competence for its size. When fine-tuned on 500 to 1000 domain-specific examples, it consistently produces well-formed outputs with low hallucination rates. The 3B parameter count means it runs comfortably on modest hardware, including edge devices and CPU-only servers.
Gemma 2 9B is where we go when the task requires stronger reasoning. Multi-step extraction pipelines, document comparison tasks, and anything involving conditional logic benefit from the additional capacity. The trade-off is real, though: it needs roughly 3x the VRAM of the Qwen 3B at equivalent quantization levels, and inference latency scales accordingly.
Llama 3.x 8B remains the most versatile option. The ecosystem around it is unmatched: quantized formats, inference engines, deployment guides, and community-contributed fine-tunes for nearly every domain. When a client engagement has ambiguous requirements or when we need to move fast with broad community support, Llama is a safe bet.
The key insight we keep reinforcing internally: model selection is a deployment decision, not a benchmarking exercise. A 3B model that runs on a client’s existing edge hardware is worth more than a 9B model that requires provisioning a dedicated GPU server. We have seen multiple engagements where the “worse” model on paper was the correct engineering choice because it fit the infrastructure constraints.
We have also started evaluating Qwen 3.5 for newer projects. Early results suggest meaningful improvements in instruction following and structured output quality over the 2.5 series, though we do not yet have enough production data to make definitive claims.
The Fine-Tuning Pipeline
Our pipeline has stabilized after several months of iteration. Here is what it looks like in practice.
Data Preparation
Training data comes primarily from synthetic generation using teacher models (Claude and GPT-4), seeded with real enterprise documents provided by clients. We feed source documents to the teacher model with detailed instructions about the output format, then generate instruction-response pairs that mirror the tasks the fine-tuned model will handle in production. Every generated example is reviewed against validation rules and, for critical domains, spot-checked by a domain expert on the client side.
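The generation step above can be sketched as follows. The prompt template, field names (`instruction`, `response`), and the canned teacher reply are illustrative assumptions, not the exact prompts or schemas we use in client engagements; in production the reply would come from a teacher-model API call.

```python
import json

def build_teacher_prompt(chunk: str, task: str = "classification") -> str:
    """Wrap a source-document chunk in instructions for the teacher model."""
    return (
        f"You are generating {task} training examples.\n"
        "Using ONLY the document below, produce a JSON list of objects "
        'with "instruction" and "response" keys.\n\n'
        f"Document:\n{chunk}"
    )

def parse_teacher_output(raw: str) -> list[dict]:
    """Parse the teacher model's JSON reply, dropping malformed items."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return [
        ex for ex in items
        if isinstance(ex, dict) and "instruction" in ex and "response" in ex
    ]

# A canned teacher reply stands in for a real API call here.
reply = '[{"instruction": "Classify: batch deviation", "response": "QC-DEV"}]'
pairs = parse_teacher_output(reply)
```

Downstream, each surviving pair flows into the validation and review steps described above.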
LoRA Configuration
We use LoRA (Low-Rank Adaptation) exclusively. Full fine-tuning is unnecessary for our use cases and introduces overfitting risks with the dataset sizes we work with. The parameters that actually matter, based on our experimentation:
- Rank (r): 16 to 32 for most tasks. We have not seen meaningful gains above 64 on domain-specific classification and extraction work.
- Alpha: We typically set alpha equal to rank, or double the rank value. The ratio matters more than the absolute numbers.
- Target modules: Attention projections (q_proj, k_proj, v_proj, o_proj) plus the gate and up projections in the MLP layers. Targeting all linear layers is tempting but increases training time without proportional quality gains for our task profiles.
- Dropout: 0.05 to 0.1. We used to skip this entirely and paid for it with overfitting on smaller datasets.
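In Unsloth, the settings above translate into a `get_peft_model` call roughly like the following sketch. The model name and sequence length are placeholder choices, not recommendations.

```python
from unsloth import FastLanguageModel

# Load a base model in 4-bit; the model name here is a placeholder.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with the parameters discussed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # rank: 16 to 32 for most of our tasks
    lora_alpha=16,      # alpha equal to rank
    lora_dropout=0.05,  # small dropout to curb overfitting
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj"],
)
```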
Training Configuration
A typical training run looks like this: 3 to 5 epochs, learning rate of 2e-4 with cosine scheduling, batch size of 4 with gradient accumulation to an effective batch size of 16, and warmup over the first 10 percent of steps. We use Unsloth’s FastLanguageModel loader and the standard HuggingFace SFTTrainer. Most runs complete in 30 to 90 minutes on a single A100 for datasets under 5000 examples.
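A sketch of those hyperparameters in code, assuming `model`, `tokenizer`, and `dataset` come from the loading and data preparation steps, and that the dataset exposes a `text` field; exact argument names vary across trl versions.

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size of 16
        warmup_ratio=0.1,               # warmup over first 10% of steps
        output_dir="outputs",
    ),
)
trainer.train()
```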
Evaluation
This is where we made our most expensive mistakes early on. We initially evaluated models by “feel”: loading them up, running a few queries, and deciding whether the outputs seemed better. This is unreliable. A fine-tuned model that produces more confident-sounding outputs can actually be less accurate than the base model on held-out test data.
We now evaluate on task-specific accuracy metrics: exact match for classification, token-level F1 for extraction, and schema compliance rate for structured output. Every fine-tuned model is compared against both the base model and the equivalent frontier API on the same test set. If the fine-tuned model does not clear a defined accuracy threshold above the base model, we revisit the training data before adjusting model parameters.
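Minimal versions of the three metrics look like this. Tokenization here is whitespace splitting, a simplification of what we use in practice, and the normalization rules are illustrative.

```python
import json

def exact_match(pred: str, gold: str) -> bool:
    """Classification metric: normalized string equality."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Extraction metric: F1 over the multiset of overlapping tokens."""
    p, g = pred.split(), gold.split()
    overlap = sum(min(p.count(t), g.count(t)) for t in set(p))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def schema_compliance(pred: str, required_keys: set[str]) -> bool:
    """Structured-output metric: valid JSON containing every required key."""
    try:
        obj = json.loads(pred)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()
```

Averaging each metric over the held-out test set gives the numbers we compare against the base model and the frontier API.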
Mistakes That Cost Us Time
Three errors recurred before we built guardrails against them:
- Overfitting on small datasets. With fewer than 500 examples, models memorize rather than generalize. We now enforce a minimum dataset size and monitor validation loss for divergence from training loss.
- Learning rate too high. The default learning rates from many tutorials are tuned for larger datasets. For our typical dataset sizes, 2e-4 is an upper bound; we often drop to 1e-4 for datasets under 1000 examples.
- Evaluating on training-adjacent data. Our held-out test sets initially looked too similar to the training data. We now construct test sets that include distribution shifts (different document types, edge cases, and adversarial phrasing) to catch models that only learned surface patterns.
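The divergence monitoring behind the first guardrail can be sketched as a simple check over the loss curves. The 0.3 gap threshold and patience of 2 are illustrative defaults, not values from our production pipeline.

```python
def diverging(train_losses: list[float], val_losses: list[float],
              max_gap: float = 0.3, patience: int = 2) -> bool:
    """Flag likely overfitting when validation loss exceeds training loss
    by more than max_gap for `patience` consecutive evaluation steps."""
    streak = 0
    for t, v in zip(train_losses, val_losses):
        streak = streak + 1 if v - t > max_gap else 0
        if streak >= patience:
            return True
    return False
```

When the flag trips, we stop the run and revisit the dataset before touching hyperparameters.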
Synthetic Data: The Hardest Part
If fine-tuning is the engine, training data is the fuel, and we have learned that refining that fuel takes more effort than building the engine.
Our synthetic data pipeline works in stages. First, we ingest a client’s source documents (policy manuals, regulatory filings, internal knowledge bases, product specifications) and segment them into coherent chunks. Then we prompt a teacher model to generate instruction-response pairs grounded in each chunk. The instruction style varies by task: classification prompts, extraction prompts, question-answer pairs, or summarization requests.
The raw output from this generation step is not usable as-is. Teacher models introduce subtle errors: slightly wrong category labels, hallucinated details not present in the source document, inconsistent formatting across examples. We apply a multi-stage filtering pipeline:
- Schema validation: Does the output conform to the expected structure?
- Self-consistency: Generate three responses for the same prompt; keep only examples where at least two agree.
- Source grounding: Verify that extracted entities and facts appear in the original document chunk.
- Deduplication: Remove near-duplicate examples that would bias the model toward memorizing specific patterns.
The survival rate is sobering. For every 1000 examples we generate, roughly 300 to 400 pass all quality filters. We use the instructor library for structured generation from teacher models, which improves the raw pass rate on schema validation, but the grounding and consistency checks still eliminate a significant portion.
We have also started experimenting with iterative refinement: taking rejected examples, feeding the rejection reason back to the teacher model, and requesting corrected versions. This recovers approximately 15 to 20 percent of initially rejected examples, though at the cost of additional API calls to the teacher model.
What Surprised Us
Several findings contradicted our initial assumptions.
Small models can beat frontier models on narrow tasks. A Qwen 2.5 3B fine-tuned on 800 examples of pharmaceutical quality term classification outperformed GPT-4 on held-out test data by 7 percentage points. The fine-tuned model had seen the specific taxonomy and learned the precise boundary conditions between categories. The frontier model, despite its general intelligence, made classification errors that a specialist never would.
The dataset size threshold is lower than expected. We anticipated needing 5000 to 10000 examples for meaningful fine-tuning gains. In practice, 500 to 1000 high-quality examples consistently produce models that are clearly superior to the base model on targeted tasks. Beyond 2000 examples, we see diminishing returns unless the task distribution is exceptionally broad.
Quantization is less destructive than we feared. We expected significant quality degradation when quantizing fine-tuned models to 4-bit (Q4_K_M via llama.cpp). In practice, quantized models retain 90 to 95 percent of the full-precision accuracy on our evaluation suites. For deployment scenarios where memory is constrained, the trade-off is almost always worth it.
Evaluation discipline matters more than training technique. The single highest-impact change to our pipeline was not a training optimization; it was building rigorous, task-specific evaluation suites before starting any fine-tuning run. Models that “feel” better in casual testing can score worse on structured metrics. Without quantitative evaluation, we would have shipped inferior models and never known.
Where This Goes Next
Fine-tuning is not a silver bullet. It is a practical engineering technique that, when applied carefully, produces models that are cheaper to run, faster to respond, and more accurate on specific tasks than general-purpose alternatives. For an enterprise consulting firm, this is a genuine capability: the ability to build a domain-specific model for a client in days rather than months, and deploy it on infrastructure they already own.
We are continuing to refine our pipeline with each engagement. Future areas of focus include multi-task fine-tuning (training a single model to handle several related tasks within a domain), more sophisticated evaluation frameworks, and tighter integration between our synthetic data generation and fine-tuning workflows. We plan to document specific techniques and results as we deploy these models across client environments.
The tools are mature enough. The real challenge, as it always is in applied AI, is the data.