Every engineering leader I know is looking at Large Language Models and asking the same question: “How do we use this?”
It’s the wrong question. The right question is: “Which model class actually solves our problem?”
The assumption that bigger is better has led more teams to overengineer AI products than any other mistake. They reach for a trillion-parameter model to classify customer support tickets, then spend months fighting latency, cost, and hallucination problems that a Small Language Model would have avoided entirely.
I’ve been in enough of these conversations to know the pattern. Here’s what I’ve learned about when small beats big — and when it doesn’t.
The Oversizing Problem
Teams default to LLMs because LLMs are what they’ve heard about. GPT, Claude, Gemini — these are the names in every headline. When an engineering team decides to “add AI” to a product, the first instinct is to call the OpenAI API and figure out the rest later.
This works for demos. It fails in production.
A general-purpose LLM is optimized for breadth. It can write a poem, summarize a legal document, and debug your code in the same session. That flexibility is remarkable — but it comes at a cost. The model carries the weight of everything it might ever be asked to do, even when your use case is a narrow, well-defined slice of that capability.
What works better: Start by defining the boundary of the task. Is it classification? Extraction? Summarization? Generation? If the output space is constrained — a yes/no decision, a structured data extraction, a fixed set of categories — you almost certainly don’t need a general-purpose model. You need a model that’s purpose-built for that specific job.
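To make that concrete, here is a minimal sketch in Python of what a constrained-output task looks like when you treat it as classification rather than open-ended generation. It assumes the Hugging Face transformers pipeline and a hypothetical fine-tuned checkpoint; substitute whatever small model you actually evaluate.

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; substitute your own. The key property:
# the output space is a fixed label set, not free-form text.
classifier = pipeline(
    "text-classification",
    model="your-org/support-ticket-classifier",  # placeholder name
    device=-1,  # CPU is enough for a small encoder model
)

ticket = "My March invoice was charged twice, please refund one payment."
result = classifier(ticket)[0]
print(result["label"], round(result["score"], 3))
# e.g. BILLING 0.978
```

The output is a label and a confidence score, not a paragraph you then have to parse and validate.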
Speed as a Product Constraint
Latency is the hidden killer of AI product adoption. A model that answers in 500ms feels responsive. A model that takes three seconds feels broken — especially in interactive use cases like chat, voice, or real-time decision support.
LLMs are slow because they’re big. Every token they generate means moving billions of parameters through memory while attending to the full context window. Even with optimization techniques like quantization and speculative decoding, you’re fighting physics.
SLMs change this calculus. A model with 500 million to 7 billion parameters can run inference in tens of milliseconds on modest hardware. It can run on a laptop. It can run on a phone. It can run on an edge device with no internet connection.
What works better: Map your latency requirements before you choose your model. If the user needs a response in under a second — and most users do — benchmark SLMs against your specific workload before assuming you need an LLM. The first time you see a few-hundred-million-parameter model classify a request in 50ms on a CPU, the tradeoff becomes obvious.
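Measuring this doesn’t take much. Here is a rough harness, again assuming a local model behind the transformers pipeline with a placeholder checkpoint name; the p50 and p95 numbers it prints are the ones to hold against your latency budget.

```python
import statistics
import time

from transformers import pipeline

# Placeholder checkpoint: benchmark whichever candidate model you're evaluating.
classifier = pipeline("text-classification", model="your-org/candidate-slm", device=-1)

requests = [
    "Where is my order?",
    "Cancel my subscription immediately.",
    "The app crashes every time I log in.",
] * 100

latencies_ms = []
for text in requests:
    start = time.perf_counter()
    classifier(text)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print(f"p50: {statistics.median(latencies_ms):.1f} ms")
print(f"p95: {latencies_ms[int(0.95 * len(latencies_ms)) - 1]:.1f} ms")
```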
The Cost Reality That Nobody Talks About
LLM costs are visible and easy to track — per-token pricing shows up on a monthly invoice. But the real cost is broader. Every API call to a hosted LLM adds latency, dependency, and data exposure. Every self-hosted LLM requires GPU infrastructure that most teams don’t have and can’t easily provision.
SLMs change the cost structure entirely. A small model can run on existing infrastructure — the same servers you’re already using for your application. No GPU required. No per-token fees. No data leaving your network.
The numbers matter here. A typical SaaS product handling a million inference requests per day would pay thousands of dollars daily for an LLM API. The same workload on a fine-tuned SLM running on your own hardware costs the electricity to keep the server running.
What works better: Build a cost-per-inference model before committing to an architecture. Factor in infrastructure, latency-induced user drop-off, and data egress. For high-volume, narrow-scope tasks, SLMs almost always win on total cost of ownership. For low-volume, open-ended tasks, LLMs still make sense.
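A sketch of that cost model fits in a dozen lines. Every figure below is an assumption for illustration, not a quote from any provider; the point is the shape of the comparison, not the exact numbers.

```python
# Back-of-the-envelope comparison. Every number is an assumption; plug in
# your own traffic, token counts, and pricing.
requests_per_day = 1_000_000
tokens_per_request = 800            # prompt + completion, assumed
api_price_per_1k_tokens = 0.002     # assumed blended $/1K tokens, hosted LLM

api_cost_per_day = requests_per_day * tokens_per_request / 1000 * api_price_per_1k_tokens

# Self-hosted SLM on servers you already run: amortized hardware + power, assumed.
slm_cost_per_day = 40.0

print(f"Hosted LLM API:  ${api_cost_per_day:,.0f}/day")
print(f"Self-hosted SLM: ${slm_cost_per_day:,.0f}/day")
# With these assumptions: $1,600/day vs $40/day. The gap is the point,
# not the exact figures.
```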
Privacy and Compliance as Differentiators
The regulatory landscape is shifting. Financial services, healthcare, defense, and any industry handling personal data are facing increasing scrutiny around where data is processed and who has access to it.
Sending customer data to a third-party API — even with contractual protections — creates exposure that many organizations can’t accept. Data classification in banking, medical record summarization in hospitals, contract analysis in legal teams — these use cases demand that data stay within a controlled environment.
SLMs that run locally or within a private network eliminate the data transfer risk entirely. The model weights are on your infrastructure. The inference happens on your infrastructure. The data never leaves.
What works better: Classify your data sensitivity before choosing your deployment model. Anything involving PII, PHI, financial transactions, or proprietary business logic should trigger a local-SLM conversation. If legal or compliance has veto power over how data is processed, a local SLM removes their objection at the architectural level.
The Fine-Tuning Advantage
LLMs resist fine-tuning. They’re so large and so generally trained that domain-specific fine-tuning requires enormous datasets, specialized expertise, and significant compute. Most teams end up using Retrieval-Augmented Generation (RAG) instead — bolting external context onto a frozen model.
RAG works, but it adds complexity. You need a vector database, an embedding pipeline, a retrieval strategy, and a prompt construction layer. Each component is a failure point.
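Even a stripped-down RAG path makes the point. The sketch below uses sentence-transformers for embeddings and an in-memory index; a production version swaps in a real vector database, but the components, and the failure points, are the same.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# The moving parts: embedding, index, retrieval, prompt construction.
# Each one is code you own and code that can break.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
    "Support hours are 9am-6pm ET, Monday to Friday.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)  # the "vector database"

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)
    scores = doc_vectors @ q[0]                      # cosine similarity
    return [documents[i] for i in np.argsort(-scores)[:k]]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` then goes to the frozen LLM: a fourth component with its own failure modes.
```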
SLMs are designed to be fine-tuned. A 7-billion-parameter model can be adapted to a specific domain with a fraction of the data and compute required for an LLM. The result is a model that doesn’t need RAG — it already knows the domain. It doesn’t need prompt engineering — it was trained on the exact task you’re using it for.
What works better: If your use case is stable and domain-specific — contract clause extraction, medical code classification, network fault diagnosis — fine-tune an SLM. The upfront investment of creating a training dataset pays back in lower operational complexity, faster inference, and more predictable behavior. RAG still makes sense for use cases where the knowledge base changes frequently, but it shouldn’t be the default.
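As a sketch of what that fine-tune can look like, here is a LoRA adaptation of a small classifier using transformers, peft, and datasets. The base checkpoint, labels, and training examples are placeholders; the structure is what matters: a small base model, a lightweight adapter, and a modest labeled dataset.

```python
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder base model and toy data; substitute your own checkpoint and
# labeled examples (e.g. contract clauses mapped to clause types).
base = "your-org/small-base-model"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)

# LoRA keeps the trainable parameter count tiny, so the fine-tune runs on
# modest hardware with a modest dataset. Target module names assume a
# BERT-style encoder; adjust for your architecture.
lora = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                  lora_dropout=0.05, target_modules=["query", "value"])
model = get_peft_model(model, lora)

examples = {
    "text": [
        "Either party may terminate on thirty days' notice.",
        "Fees are payable within forty-five days of invoice.",
        "This agreement is governed by the laws of Delaware.",
    ],
    "label": [0, 1, 2],  # termination, payment, governing-law
}
dataset = Dataset.from_dict(examples).map(
    lambda row: tokenizer(row["text"], truncation=True,
                          padding="max_length", max_length=64)
)

args = TrainingArguments(output_dir="clause-classifier", num_train_epochs=3,
                         per_device_train_batch_size=2, learning_rate=2e-4)
Trainer(model=model, args=args, train_dataset=dataset).train()
```

The adapter weights stay small, the run fits on modest hardware, and the resulting model answers the task directly, with no retrieval layer in front of it.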
When You Still Want an LLM
SLMs aren’t a replacement for LLMs in every scenario. There are use cases where the larger model is the right choice.
Creative generation, open-ended reasoning, tasks requiring broad world knowledge, and applications where the user’s request is genuinely unpredictable — these favor LLMs. If you’re building a coding assistant, a research tool, or a creative writing aid, the general capability of a large model is the right foundation.
The mistake is treating the exception as the rule. Most product use cases don’t need general intelligence. They need reliable, fast, and cost-effective execution of a narrow task.
What I’ve Learned
Three things that have shaped how I think about model selection:
Start small and prove you need bigger. The temptation is to start with the most capable model and optimize down. Reverse that. Start with the smallest model that could possibly work. Benchmark it. If it fails, scale up. You’ll be surprised how often small is enough.
The infrastructure cost of LLMs is higher than the API cost. Even if the per-token pricing looks manageable, factor in latency, error handling, retry logic, data egress, and the engineering time spent optimizing around the model’s limitations. SLMs that run on your existing stack eliminate nearly all of that overhead.
The best AI product is invisible. Users don’t care what model is running. They care that the feature is fast, accurate, and private. SLMs deliver that experience more consistently than LLMs for most product use cases. Optimize for the user’s experience, not the model’s capability.
The next time your team starts an AI project, resist the reflex to call the big-model API. Ask first: what’s the narrowest version of this problem we can solve? The answer will probably point you to a smaller model — and a better product.