MAR 16, 2026 DATA 7 MIN READ

Our Synthetic Data Playbook: What Works and What Wastes GPU Hours

Practical lessons from generating synthetic training data for domain-specific small language models, covering teacher-student pipelines, quality filtering, and the mistakes that burn compute without improving results.

By Wenable Labs

Fine-tuning a small language model is, in relative terms, the easy part. Pick a base model, choose a LoRA rank, set a learning rate, and let the trainer run. The hard part, the part that actually determines whether the resulting model is useful, is creating the training data.

Over the past year, we have generated synthetic datasets across device management, fleet compliance, and pharmaceutical quality domains. Some approaches worked on the first attempt. Others consumed significant compute budgets and produced models that performed worse than the base. The difference between success and failure was rarely the model architecture or the training hyperparameters. It was almost always the data.

This is our playbook, written from accumulated trial and error. We are sharing it because the community discourse around synthetic data tends to focus on the generation step while glossing over the filtering, validation, and iteration that make the difference between a useful dataset and an expensive hallucination amplifier.

The Teacher-Student Pipeline

Our core architecture is a teacher-student pipeline. The concept is straightforward: a large, capable model generates training examples that a smaller, deployable model learns from.

The pipeline has five stages. First, we collect source documents: enterprise policies, regulatory texts, equipment manuals, standard operating procedures, and historical support tickets. These documents represent the ground truth for the domain. Every synthetic example must ultimately be traceable back to something in this corpus.

Second, a teacher model (typically Claude or GPT-4) generates instruction-output pairs from the source documents. The prompts are domain-specific and structured. We do not ask the teacher to “generate some training examples.” We give it a specific document segment, a specific task format, and explicit constraints on what constitutes a valid output.

Third, automated quality checks evaluate every generated example. These checks cover format compliance (does the output match the expected schema?), factual grounding (can the answer be verified against the source document?), and diversity (is this example meaningfully different from others already in the dataset?).

Fourth, aggressive filtering. This is the stage most teams underestimate. For every 1,000 examples the teacher generates, roughly 300 to 400 survive our quality pipeline. The rest are duplicates, near-duplicates, poorly grounded, or trivially easy examples that would not teach the student model anything useful.

Fifth, the curated dataset fine-tunes a smaller model (typically Qwen 3B or Gemma 2B) using LoRA adapters. The small model learns the domain-specific patterns from the teacher’s examples without requiring the teacher’s parameter count or inference cost.

Why this works in practice: the large model has sufficient knowledge and reasoning ability to generate high-quality examples, but it is too expensive and too slow for production inference at enterprise scale. The small model learns the patterns efficiently and runs on modest hardware, including edge devices. The synthetic data pipeline is the bridge between the two.
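The filtering core of this pipeline can be sketched in a few lines. This is a deliberately naive stand-in, not our production code: `passes_checks` uses substring matching for grounding and exact-match deduplication, where the real checks described later use source citation verification and semantic similarity. The `Example` record and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Example:
    instruction: str
    output: str
    source_id: str  # the source document this pair was derived from

def passes_checks(ex, source_text, seen_outputs):
    # Stage 3 in miniature: schema, grounding, and novelty checks.
    schema_ok = bool(ex.instruction.strip()) and bool(ex.output.strip())
    grounded = ex.output.lower() in source_text.lower()  # naive grounding stand-in
    novel = ex.output not in seen_outputs                # exact-match dedup stand-in
    return schema_ok and grounded and novel

def filter_examples(candidates, sources):
    # Stage 4: keep only examples that survive every check.
    kept, seen = [], set()
    for ex in candidates:
        if passes_checks(ex, sources[ex.source_id], seen):
            kept.append(ex)
            seen.add(ex.output)
    return kept
```

The structure is the point: every candidate is evaluated against its own source document, and the survivor set is built incrementally so later duplicates are rejected against earlier keepers.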

What Works

After dozens of dataset generation runs across multiple domains, several patterns consistently produce better results.

Domain-specific instruction formats matter more than volume. The format of the instruction directly affects how well the student model generalizes. “Given this device policy excerpt, identify any compliance violations and cite the relevant section” consistently outperforms generic formats like “Answer the following question about this text.” The more the training format mirrors the actual production query format, the better the fine-tuned model performs. We design instruction templates for each client engagement rather than reusing a generic template across domains.

Diversity beats volume every time. We have repeatedly observed that 500 diverse, high-quality examples outperform 5,000 repetitive ones. Diversity means variation across document types, question complexity levels, answer lengths, and edge case coverage. When the teacher model generates examples from a narrow slice of the source corpus, the student model overfits to that slice. We enforce diversity through stratified sampling across document categories, explicit difficulty bucketing, and deduplication based on semantic similarity rather than exact string matching.
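Deduplication by semantic similarity rather than exact string matching can be illustrated with a minimal sketch. A real pipeline would most likely compare embedding vectors; token-set Jaccard similarity is used here as a dependency-free stand-in, and the 0.8 threshold is an arbitrary illustrative value.

```python
def jaccard(a: str, b: str) -> float:
    # Similarity of two texts as overlap of their token sets.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def dedupe(examples, threshold=0.8):
    # Keep an example only if it is sufficiently different
    # from everything already kept.
    kept = []
    for ex in examples:
        if all(jaccard(ex, k) < threshold for k in kept):
            kept.append(ex)
    return kept
```

Near-duplicates that differ by a word or two are caught even though they would pass an exact-match check.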

Edge cases must be generated explicitly. Left to its own defaults, the teacher model gravitates toward straightforward examples with clear, confident answers. But in production, the most valuable behavior is often knowing when not to answer. We explicitly prompt the teacher to generate examples where the correct response is “insufficient information to determine compliance status” or “this scenario requires human review.” These examples teach the student model appropriate uncertainty, which is critical in regulated domains.
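One way to force edge-case generation is to make abstention its own prompt bucket rather than hoping it emerges. The template strings and bucket names below are hypothetical, shown only to make the idea concrete.

```python
ABSTAIN_INSTRUCTION = (
    "Generate a question about the excerpt below that CANNOT be answered "
    "from the excerpt alone. The correct output must be: "
    "'Insufficient information to determine compliance status.'"
)

def build_prompt(excerpt: str, bucket: str) -> str:
    # Each difficulty bucket gets its own generation template,
    # including an explicit abstention bucket.
    templates = {
        "standard": "Generate a compliance question answerable from this excerpt:\n{t}",
        "edge_case": ABSTAIN_INSTRUCTION + "\n{t}",
    }
    return templates[bucket].format(t=excerpt)
```

Sampling a fixed fraction of prompts from the edge-case bucket guarantees the student sees abstention examples at a controlled rate.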

Source traceability is non-negotiable. Every generated example includes a reference to the specific source document and section it was derived from. During validation, we verify that the generated answer is actually supported by the cited source. Examples that cannot be grounded are discarded regardless of how plausible they appear. This single practice has done more to reduce hallucination in our fine-tuned models than any other intervention.
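A traceability check of this kind reduces, at minimum, to verifying that every cited document and section actually exists in the corpus index. The corpus structure, document id, and section numbering below are hypothetical.

```python
# Hypothetical corpus index: document id -> set of known section ids.
CORPUS = {
    "device_policy_v3": {"1.1", "1.2", "4.7"},
}

def citation_valid(record: dict) -> bool:
    # Reject any example whose cited source cannot be located.
    sections = CORPUS.get(record.get("source_doc"), set())
    return record.get("section") in sections
```

A fuller version would also check that the cited section's text actually supports the generated answer, which is the expensive half of the validation.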

Multi-turn conversations outperform single-turn Q&A for agent use cases. When the deployment target is a conversational agent, training on single-turn instruction-output pairs produces a model that answers questions but does not handle follow-ups, clarifications, or multi-step reasoning well. We generate synthetic multi-turn dialogues where the teacher model plays both the user and the assistant, simulating realistic conversation flows. The additional complexity in generation pays off directly in deployment quality.
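A synthetic multi-turn example of this kind is stored as an alternating message list rather than a single instruction-output pair. The document id and dialogue content below are invented for illustration; the 11-hour figure reflects the federal HOS driving limit mentioned later, but the scenario itself is fabricated.

```python
# One synthetic training dialogue: the teacher model authored
# both the "user" and "assistant" turns.
dialogue = {
    "source_doc": "fleet_policy_v2",  # hypothetical source reference
    "messages": [
        {"role": "user",
         "content": "Is driver 114 within HOS driving limits today?"},
        {"role": "assistant",
         "content": "Yes. The log shows 9.5 driving hours, under the 11-hour limit."},
        {"role": "user",
         "content": "What if we assign them another 2-hour run?"},
        {"role": "assistant",
         "content": "That would total 11.5 hours and exceed the 11-hour limit; "
                    "the run should not be assigned."},
    ],
}
```

The second user turn is the point of the format: it depends on the first exchange, which is exactly the behavior single-turn pairs never teach.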

What Wastes GPU Hours

Equally important are the patterns that reliably waste compute without improving the final model.

Training on unfiltered synthetic data is worse than not fine-tuning at all. The teacher model, despite its capabilities, generates a meaningful percentage of examples that contain subtle errors: hallucinated regulation numbers, slightly incorrect policy interpretations, or answers that are technically correct but misleadingly framed. Training on these examples does not just fail to help; it actively teaches the student model to reproduce those error patterns with high confidence. Unfiltered synthetic data is a hallucination amplifier.

Volume hits diminishing returns faster than expected. For narrow domain tasks, the quality improvement curve flattens dramatically after 1,000 to 2,000 high-quality examples. We have run experiments where doubling the dataset from 1,500 to 3,000 examples produced no measurable improvement on held-out evaluation sets, while the additional generation and training time was substantial. The right response to a plateau is not more data; it is better data, more diverse data, or data targeting the specific failure modes observed in evaluation.

Format mismatch between training and deployment silently degrades performance. Training on instruction-output pairs when the deployment expects chat-formatted messages or vice versa introduces a distribution gap that the model must bridge at inference time. This does not cause obvious failures. The model still produces outputs. But accuracy drops by 5 to 15 percent compared to format-matched training, and debugging the cause is time-consuming because the symptoms look like a data quality problem rather than a formatting problem.
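Closing that gap is usually a small, mechanical conversion applied to the whole dataset before training. A minimal sketch, assuming a chat-style deployment target with the common role/content message convention:

```python
def to_chat_format(instruction: str, output: str, system: str = "") -> list:
    # Convert an instruction-output pair into the chat message
    # layout the deployed model will actually see at inference time.
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": instruction})
    msgs.append({"role": "assistant", "content": output})
    return msgs
```

Running every training example through the same converter used at inference time is the cheapest insurance against this failure mode.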

Generating examples for knowledge the base model already has is wasted capacity. The base Qwen 3B model already understands general English grammar, common reasoning patterns, and widely known factual information. Generating synthetic examples that teach it things it already knows consumes the dataset budget without adding domain-specific value. We now audit the base model’s existing capabilities on our evaluation set before generating any synthetic data, and we focus generation exclusively on the knowledge gaps: the domain-specific terminology, regulatory details, and enterprise-specific logic that the base model lacks.

Waiting for the full dataset before evaluating is an expensive mistake. In our early projects, we would generate the entire synthetic dataset, run the full fine-tuning job, and only then evaluate the result. If the data had issues, we had wasted the entire pipeline. We now fine-tune on the first 100 examples, run the evaluation suite, and examine failure patterns before scaling up. This incremental approach has caught prompt design flaws, format mismatches, and quality filtering gaps within the first day rather than after a week of generation and training.
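The incremental loop amounts to a gate between the pilot run and the full run. This sketch stubs out training and evaluation behind callables; the pilot size of 100 matches the text, and the score threshold is an illustrative placeholder.

```python
def pilot_run(examples, train_fn, eval_fn, pilot_size=100, min_score=0.6):
    """Fine-tune on a small pilot slice and gate the full run on the result.

    train_fn: callable taking a list of examples, returning a model.
    eval_fn: callable taking a model, returning (score, failure_cases).
    """
    pilot = examples[:pilot_size]
    model = train_fn(pilot)
    score, failures = eval_fn(model)
    # Only proceed to full-scale generation and training if the
    # pilot clears the bar; otherwise surface the failures first.
    return {"proceed": score >= min_score, "score": score, "failures": failures}
```

The failure cases returned by the evaluator are the real payoff: they point at prompt design flaws or filtering gaps before a week of compute is committed.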

The Tools We Use

Our pipeline relies on a small number of well-tested tools rather than a sprawling custom framework.

The instructor library is our primary tool for structured synthetic data generation. It wraps LLM API calls with Pydantic model validation, ensuring that every generated example conforms to a typed schema before it enters the pipeline. When the teacher model produces an output that does not match the expected structure, instructor handles retries and validation automatically. This eliminates an entire class of data quality issues at the generation stage.
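The validate-and-retry pattern that instructor automates (against a Pydantic schema) looks roughly like the following when written out by hand. This is a library-free sketch of the pattern, not instructor's actual internals.

```python
def generate_validated(generate_fn, validate_fn, max_retries=3):
    """Retry generation until the output passes validation.

    generate_fn: callable producing one candidate (e.g. an LLM call).
    validate_fn: callable returning (ok, error_message) for a candidate.
    """
    last_err = None
    for _ in range(max_retries):
        candidate = generate_fn()
        ok, err = validate_fn(candidate)
        if ok:
            return candidate
        last_err = err  # in instructor, this error is fed back to the model
    raise ValueError(f"no valid output after {max_retries} attempts: {last_err}")
```

Because invalid outputs never escape this loop, downstream stages can assume every example conforms to the schema, which is what eliminates the structural class of data quality bugs.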

For larger-scale generation runs, we use distilabel, which provides built-in pipeline orchestration, quality metrics, and batch processing. It handles the infrastructure concerns (rate limiting, checkpointing, parallel generation) so that we can focus on prompt design and quality criteria.

Beyond these libraries, we maintain a set of custom validation scripts for each domain. These are not generic quality checks; they are domain-specific verifications. For HOS compliance data, the scripts verify that referenced regulation numbers actually exist in the CFR. For pharmaceutical quality data, they confirm that cited USP chapter numbers are valid and that process parameter ranges fall within physically plausible bounds.
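A regulation-number check of this kind can be as simple as extracting cited section numbers and testing them against an index of valid ones. The allowlist below is a tiny illustrative stand-in (a real script would index the full 49 CFR Part 395); the regex and function name are assumptions.

```python
import re

# Illustrative allowlist of HOS section numbers; a production script
# would build this set from the actual CFR index.
KNOWN_SECTIONS = {"395.3", "395.8", "395.11"}

def invalid_citations(text: str) -> list:
    # Find every "395.N" citation and return the ones that do not
    # correspond to a real section.
    cited = re.findall(r"\b395\.\d+\b", text)
    return [c for c in cited if c not in KNOWN_SECTIONS]
```

Any example for which this returns a non-empty list is discarded: a hallucinated regulation number is exactly the kind of confident error that filtering exists to catch.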

For high-stakes domains (pharmaceutical quality and regulatory compliance in particular), a sample of the generated data goes through expert human review. Automated validation catches structural and factual errors, but it cannot catch subtle misinterpretations of regulatory intent. Human review on a stratified sample of 5 to 10 percent of the generated data provides the final quality gate.

Closing

Synthetic data generation is an engineering discipline, not a one-time task. The quality of the training data directly determines the quality of the fine-tuned model, and that quality comes from sustained investment in the pipeline: source document curation, generation prompt design, automated validation, aggressive filtering, and incremental evaluation.

It is less glamorous than model architecture research. It does not produce impressive benchmark numbers to share on social media. But in our experience across multiple enterprise domains, the difference between a fine-tuned model that earns client trust and one that gets shelved after a pilot is almost never the model. It is the data.

Build the pipeline first. The model will follow.