MAR 26, 2026 ENGINEERING 11 MIN READ

Fine-Tuning Small Language Models for Domain-Specific NL Interfaces

The end-to-end pipeline for training SLMs that understand enterprise terminology and map user intent to system operations, at a fraction of frontier model costs.

By Wenable Labs

Natural language interfaces for enterprise software, the kind we described in our NL Control Plane post, depend on a model’s ability to understand domain-specific terminology and map user intent to system operations. “Show me all devices that failed compliance in the last 48 hours” is a straightforward sentence in English, but parsing it into a precise API call requires understanding what “compliance” means in a device management context, what “failed” maps to in the system’s status taxonomy, and how “last 48 hours” translates to a timestamp filter.

A general-purpose large language model can do this. Given a sufficiently detailed system prompt, a frontier model will produce correct structured output for most well-phrased queries. But “most” is not good enough for production, and the cost and latency profile of frontier API calls does not support the always-on, sub-second responsiveness that interactive interfaces demand.

A fine-tuned small language model (3B to 8B parameters, trained on thousands of domain-specific examples) handles the specific domain better, at a fraction of the cost, with the latency profile that interactive use requires. This is the approach behind ViVi, our device management agent, and behind similar deployments across fleet operations and pharmaceutical quality management. This post covers the end-to-end pipeline we use to build these models: from the decision to fine-tune, through data preparation and training, to evaluation and the deployment architecture that ties it all together.

Why Fine-Tune Instead of Prompt Engineer?

Prompt engineering is the correct starting point for any NL interface project. We always begin there, and for many simple use cases, a well-crafted system prompt with a frontier model is sufficient. But prompt engineering hits a ceiling, and recognizing that ceiling early saves significant time and cost.

Context window saturation. Enterprise domains are not simple. A device management platform like WeGuard exposes over 200 configuration settings. A fleet management system tracks dozens of HOS regulation parameters, driver risk metrics, and route optimization variables. Describing all of these operation types, their parameters, their constraints, and their output formats in a system prompt produces prompts that run to 8,000 to 15,000 tokens before a single user query arrives. At that scale, the model’s attention is diluted, and performance on edge cases degrades noticeably.

Cost at scale. A frontier API call with a long system prompt costs roughly $0.01 to $0.03 per query when factoring in both input and output tokens. A fine-tuned 3B model running on self-hosted infrastructure costs approximately $0.0001 per query. At 10,000 queries per day (a realistic volume for an enterprise NL interface with several hundred active users), the difference is $100 to $300 per day versus $1 per day. Over a year, the fine-tuned model saves $36,000 to $109,000 in inference costs alone.
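
The arithmetic behind those savings is worth making explicit. A back-of-envelope sketch, using the per-query costs and daily volume quoted above:

```python
# Back-of-envelope check of the cost figures above. Assumptions match the
# text: 10,000 queries/day, $0.01-$0.03 per frontier call, ~$0.0001 per
# self-hosted SLM call.
QUERIES_PER_DAY = 10_000
FRONTIER_COST = (0.01, 0.03)   # dollars per query, low and high estimates
SLM_COST = 0.0001              # dollars per query

daily_frontier = tuple(QUERIES_PER_DAY * c for c in FRONTIER_COST)
daily_slm = QUERIES_PER_DAY * SLM_COST
annual_savings = tuple(round((f - daily_slm) * 365) for f in daily_frontier)

print(f"daily frontier: ${daily_frontier[0]:.0f} to ${daily_frontier[1]:.0f}")
print(f"daily SLM:      ${daily_slm:.2f}")
print(f"annual savings: ${annual_savings[0]:,} to ${annual_savings[1]:,}")
```

Running this reproduces the $36,000-to-$109,000 annual range; the spread comes entirely from the uncertainty in frontier per-query pricing.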

Latency. A frontier API call with a large system prompt takes 500ms to 2,000ms end-to-end. A fine-tuned SLM running locally produces output in 50ms to 200ms. For interactive interfaces where a user issues commands in rapid succession, sub-200ms response times are the difference between a tool that feels instant and one that feels sluggish.

Output consistency. Fine-tuned models produce more consistent output formats because they have internalized the structure from thousands of training examples. A prompted frontier model will occasionally vary its JSON key ordering, include unexpected fields, or wrap its output in markdown code blocks despite explicit instructions not to. These inconsistencies create parsing fragility downstream. A fine-tuned model trained on 2,000 examples in the exact output schema rarely deviates.

Data privacy. Self-hosted SLMs keep enterprise data within the organization’s infrastructure. For regulated industries (healthcare, finance, government), this is not a preference; it is a compliance requirement.

The decision threshold we apply internally: if the task is specific, repeated, involves 500 or more distinct input-output patterns, and requires low latency, fine-tuning is the better path. Below that threshold, prompt engineering with a frontier model is faster to deploy and easier to iterate on.

The Training Data Pipeline

The quality of a fine-tuned model is bounded by the quality of its training data. We have learned this lesson repeatedly, and our data pipeline now receives more engineering effort than the training process itself.

Source Material

Training data starts with the enterprise domain itself. For ViVi, the WeGuard device management agent, sources included the complete WeGuard API documentation, historical device management commands issued by administrators through the existing console, internal runbooks for common IT operations, and the full policy configuration schema with all valid parameter combinations. For our fleet management deployment, sources included HOS regulatory documentation, the fleet system’s query language reference, and anonymized historical queries from dispatchers and safety managers.

Synthetic Generation

Raw source material is not training data. It needs to be transformed into instruction-output pairs that match the format the model will encounter in production. We use teacher models, primarily Claude and GPT-4, to generate these pairs. The process is deliberate: we feed the teacher model a chunk of source documentation along with detailed instructions about the target output format, then generate candidate training examples.

For a device management command like “Lock all Android devices in the sales department,” the training example includes the natural language input, the parsed intent (device_action: lock), the extracted entities (platform: Android, department: sales), and the structured API call that the system should execute. For multi-turn interfaces, we generate full conversation trajectories where follow-up queries reference context from earlier turns.
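
Concretely, one such training pair might be serialized like this. The field names and the endpoint are assumptions for the sketch, not the exact ViVi schema:

```python
import json

# One illustrative training pair for the command discussed above. The keys
# ("intent", "entities", "api_call") and the endpoint are assumed for this
# sketch, not the exact ViVi format.
example = {
    "input": "Lock all Android devices in the sales department",
    "intent": {"device_action": "lock"},
    "entities": {"platform": "Android", "department": "sales"},
    "api_call": {
        "method": "POST",
        "endpoint": "/devices/actions/lock",
        "body": {"filter": {"platform": "Android", "department": "sales"}},
    },
}

# What the model actually trains on: natural language in, serialized
# structured output out.
instruction = example["input"]
target = json.dumps(
    {k: example[k] for k in ("intent", "entities", "api_call")},
    sort_keys=True,
)
```

Serializing the target with sorted keys keeps the output distribution consistent across examples, which matters for the format-compliance behavior discussed later.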

Quality Control

Teacher models introduce errors. They hallucinate parameter values not present in the schema, generate subtly incorrect entity mappings, and produce inconsistent formatting across examples. Our filtering pipeline applies schema validation (does the output conform to the expected structure?), self-consistency checks (generate three responses for the same prompt; keep only examples where at least two agree), source grounding verification (do extracted entities exist in the system’s actual data model?), and deduplication to prevent the model from memorizing specific phrasings.

The survival rate is typically 30 to 40 percent. For every 1,000 generated examples, 300 to 400 pass all quality filters. This is expected, and attempts to shortcut the filtering step have consistently produced inferior models.

The Instruction Taxonomy

Before generating any data, we build a complete taxonomy of user intents for the target domain. For device management, this includes query intents (device status, compliance reports, inventory searches), configuration intents (policy creation, modification, deletion), action intents (lock, wipe, push update, deploy app), and meta intents (help, clarification, confirmation). Each category must be represented in the training data in proportion to its expected production frequency, with deliberate oversampling of edge cases and boundary conditions.

Critically, the training data must include examples where the correct model response is refusal or clarification. “Delete all devices” should produce a safety confirmation, not a deletion command. “Show me the quantum flux capacitor settings” should produce a response indicating that no such capability exists. A model that has never seen these boundaries in training will not respect them in production.
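
Boundary cases like these sit in the training set as ordinary examples whose target output is a confirmation or a capability refusal. The intent labels and response wording below are illustrative, not the exact taxonomy:

```python
# Illustrative boundary-case training examples. Intent labels and response
# wording are assumptions for the sketch, not the production taxonomy.
boundary_examples = [
    {
        "input": "Delete all devices",
        "intent": {"meta": "confirm_destructive"},
        "response": (
            "This will remove every enrolled device from management. "
            "Reply 'confirm' to proceed."
        ),
    },
    {
        "input": "Show me the quantum flux capacitor settings",
        "intent": {"meta": "unsupported_capability"},
        "response": "No such setting exists in this system.",
    },
]
```

Because these examples share the same schema as action and query examples, the model learns refusal and confirmation as first-class intents rather than exceptional behavior.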

LoRA Configuration That Matters

We use LoRA (Low-Rank Adaptation) exclusively for domain-specific fine-tuning. Full fine-tuning is unnecessary for our use cases and introduces overfitting risks with the dataset sizes we typically work with. After dozens of fine-tuning runs across multiple client domains, we have converged on a set of configuration choices that reliably produce good results.

Rank (r): 32 as the default. Rank controls the dimensionality of the adaptation matrices. Lower ranks (8 to 16) work for simple classification tasks, but for the combined intent classification, entity extraction, and structured output generation that NL interfaces demand, we find that rank 32 provides the right balance between adaptation capacity and parameter efficiency. We have tested ranks up to 128 and have not observed meaningful gains above 64 for our task profiles.

Alpha: double the rank. We set alpha to 64 when rank is 32. The alpha-to-rank ratio controls the effective learning rate scaling of the LoRA layers. A 2:1 ratio has been consistent across our experiments; deviating from it has not produced improvements worth the additional hyperparameter search time.

Target modules: attention layers plus MLP projections. We target q_proj, k_proj, v_proj, and o_proj in the attention layers, plus gate_proj, up_proj, and down_proj in the MLP layers. Targeting only attention layers is insufficient for tasks that require the model to learn new output distributions (structured JSON, domain-specific schemas). Including the MLP projections allows deeper adaptation of the model’s representation space.

Learning rate: 2e-4 with cosine scheduling. This is our upper bound for datasets of 1,000 or more examples. For smaller datasets (500 to 1,000), we reduce to 1e-4 to avoid catastrophic forgetting where the model loses general language capabilities it needs to understand varied phrasings of user queries. Cosine scheduling with a brief warmup (5 to 10 percent of total steps) provides smooth convergence.

Epochs: 3 as the default. Two epochs is sometimes insufficient for the model to learn complex output schemas. Four epochs risks overfitting on datasets under 2,000 examples. We monitor validation loss and stop early if it begins diverging from training loss.

What matters less than people think: batch size (anywhere from 4 to 16 with gradient accumulation produces similar results), warmup step count (minimal impact beyond the first few percent of training), and optimizer choice (AdamW with standard betas is reliable; exotic optimizers have not justified their complexity in our experiments).
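
Collected as keyword arguments, the configuration above looks like this. The dictionaries are intended to be passed to peft's LoraConfig and transformers' TrainingArguments respectively; the dropout value is an assumption the post does not specify:

```python
LORA_RANK = 32

# Keyword arguments for peft's LoraConfig (a sketch; adapt to your stack).
lora_kwargs = dict(
    r=LORA_RANK,
    lora_alpha=2 * LORA_RANK,  # 2:1 alpha-to-rank ratio
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,         # assumed value; not specified in the text
    task_type="CAUSAL_LM",
)

# Keyword arguments for transformers' TrainingArguments.
training_kwargs = dict(
    learning_rate=2e-4,              # drop to 1e-4 for datasets under ~1,000
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # 5 percent warmup
    num_train_epochs=3,
    per_device_train_batch_size=4,   # 4 to 16 with accumulation all work
    gradient_accumulation_steps=4,
)
```

Keeping the configuration as plain dictionaries makes hyperparameter sweeps trivial: override one key, construct the objects, launch the run.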

Evaluation: Beyond Perplexity

Perplexity is the default evaluation metric for language model training, and it is misleading for task-specific fine-tuning. A model can achieve excellent perplexity, indicating that it predicts the next token well on average, while producing outputs that fail on the specific metrics that matter for a production NL interface. We have seen models with lower perplexity than their predecessors that were actually worse at the deployed task.

Task-Specific Metrics

We evaluate fine-tuned models on three dimensions that directly predict production performance.

Intent classification accuracy. Given a natural language input, does the model correctly identify the operation type? For ViVi, this means distinguishing between a device query, a policy configuration command, a compliance report request, and an action instruction. We measure this as exact-match accuracy on a held-out test set that includes adversarial phrasings and ambiguous edge cases. Our deployment threshold is 95 percent accuracy.

Entity extraction F1. Does the model correctly identify and extract all relevant parameters from the input? For “Show me Android devices in the sales department that failed compliance since Monday,” the model must extract platform, department, compliance status, and time range with no hallucinated extras and no missing fields. We measure precision, recall, and F1 across all entity types. Our threshold is 0.92 F1.
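
Scoring this is a straightforward set comparison over (entity type, value) pairs; a minimal sketch, with entity names taken from the query above but otherwise illustrative:

```python
def entity_prf(predicted: dict, gold: dict) -> tuple[float, float, float]:
    """Micro precision/recall/F1 over (entity_type, value) pairs.
    Hallucinated extras hurt precision; missing fields hurt recall."""
    pred, ref = set(predicted.items()), set(gold.items())
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# The compliance query from the text; field names are illustrative.
gold = {"platform": "Android", "department": "sales",
        "compliance_status": "failed", "since": "monday"}
pred = {"platform": "Android", "department": "sales",
        "compliance_status": "failed"}  # model missed the time range

precision, recall, f1 = entity_prf(pred, gold)
```

In this example the model extracts three of four entities with no extras, so precision is perfect, recall is 0.75, and the F1 of roughly 0.86 falls below the 0.92 deployment threshold.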

Output format compliance rate. Does the model’s structured output conform to the expected schema 100 percent of the time? A model that produces valid JSON in 98 percent of cases sounds good until you realize that at 10,000 queries per day, 200 queries will return unparseable output that requires error handling, retry logic, or fallback to a different model. We target 99.5 percent schema compliance.

Comparative Evaluation

Every fine-tuned model is tested against three baselines: the base model without fine-tuning (to quantify the fine-tuning gain), a prompted frontier model like Claude or GPT-4 (to determine whether fine-tuning has reached parity with the best available API), and the previous fine-tuned version if one exists (to confirm that the new version is a genuine improvement). The fine-tuned SLM must match or exceed the prompted frontier model on all domain-specific metrics before it replaces the API dependency in production.

Continuous Evaluation

Deployment is not the end of evaluation. We monitor task accuracy on a sampled subset of production traffic, comparing model outputs against human-verified labels on a rolling basis. When accuracy drifts below threshold, typically because user query patterns have shifted or new domain terminology has been introduced, that signals the need for a new fine-tuning cycle with updated training data.

SLM-Default, LLM-Fallback in Practice

A fine-tuned SLM, however well trained, will not handle every query. Users will ask questions outside the model’s training distribution, phrase requests in ways the model has not seen, or combine multiple complex operations in a single instruction. The deployment architecture must account for this gracefully.

We use a routing pattern we call SLM-default, LLM-fallback, which maps to Layer 2 (Model Intelligence) in the 7-Layer Agentic AI Stack. The fine-tuned SLM is the first responder for all incoming queries. A lightweight confidence classifier evaluates the model’s output: if the confidence score exceeds the threshold and the output passes schema validation, it proceeds to execution. If either check fails, the query is routed to a larger model, a frontier API call that handles the query with its broader capabilities.
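
The routing logic itself is small. A sketch under stated assumptions: `slm` and `llm` are callables returning (output, confidence), the threshold value is illustrative, and `log_candidate` stands in for the production logging path:

```python
CONFIDENCE_THRESHOLD = 0.85  # assumed value; tuned per deployment

def route_query(query, slm, llm, schema_valid, log_candidate):
    """SLM-default, LLM-fallback routing. `slm` and `llm` return
    (output_text, confidence); every name here is illustrative."""
    output, confidence = slm(query)
    # Happy path: confident SLM output that parses against the schema
    # proceeds straight to execution.
    if confidence >= CONFIDENCE_THRESHOLD and schema_valid(output):
        return output, "slm"
    # Fallback is also data collection: the logged pair becomes a
    # training candidate for the next fine-tuning cycle.
    fallback_output, _ = llm(query)
    log_candidate(query, fallback_output)
    return fallback_output, "llm"
```

The second return value makes the routing decision observable, which is what the handling-ratio numbers below are computed from.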

The fallback path is not a failure state; it is a data collection opportunity. Every query that routes to the LLM fallback is logged, along with the frontier model’s response. These logged pairs become training candidates for the next SLM fine-tuning cycle, reviewed by our quality pipeline and added to the training set if they pass validation. This creates a flywheel: the SLM improves with each fine-tuning cycle, handles a larger share of queries, and reduces the volume of expensive fallback calls.

In our device management deployment with ViVi, the SLM handled 65 percent of incoming queries after the initial fine-tuning cycle. After the second cycle, which incorporated logged fallback data and additional edge cases identified during production monitoring, the handling ratio improved to 78 percent. After the third cycle, it reached 85 percent, where it has stabilized. Each percentage point improvement translates directly to reduced API costs and lower median latency for end users.

The remaining 15 percent of queries that still route to the LLM fallback are predominantly novel query types, complex multi-step operations, and ambiguous requests that genuinely benefit from a larger model’s reasoning capacity. The SLM does not need to handle everything. It needs to handle the high-frequency, well-understood operations that constitute the bulk of daily usage.

Closing

Fine-tuning small language models for natural language interfaces is not about replacing frontier models. It is about deploying the right model for the right task at the right cost. For domain-specific, high-frequency, latency-sensitive enterprise interfaces, a well-trained 3B parameter model is not a compromise; it is the optimal engineering choice.

The investment is real: building a quality training data pipeline, configuring LoRA parameters, establishing rigorous evaluation suites, and maintaining the SLM-fallback routing architecture all require engineering effort. But that investment pays back with every query that does not need a frontier API call, every sub-200ms response that keeps the interface feeling instant, and every fine-tuning cycle that makes the model more capable.

The models are small. The impact is not.