Asif Rahman

Small language models (updated June 2025)

A list of small language models.

Last updated: 2025-06-15

Small language models are increasingly capable of performing a wide range of tasks locally on-device and in the web browser. This page lists some interesting small language models. I am classifying small models as those with fewer than 1 billion parameters. This page will be updated regularly as I evaluate new models.

Why? Small language models are ideal for structured problems where reasoning and “thinking” are not necessary. This covers a wide range of use cases, like entity extraction, structured data extraction, summarization, classification, multi-turn conversations, text composition, text revision, and content tagging. Small LMs are also ideal candidates for fine-tuning to learn domain-specific knowledge.

Limitations: By virtue of being small and compressed, small language models have some important limitations. Complex reasoning tasks should be broken down into simpler steps. Small LMs should avoid math and code generation tasks. They also have limited world knowledge, unlike larger models that have “overfit” or memorized large amounts of factual information up to their training cutoff date. As such, small LMs are more likely to hallucinate (provide made-up information) when asked about facts or events. Although fine-tuning is unnecessary for many tasks and comes with its own challenges, such as forgetting world knowledge learned during pretraining, it can still be useful for teaching a small LM domain-specific knowledge.

SmolLM2 from HuggingFace (135M, 360M, 1.7B) - General-purpose. The smallest models can run efficiently in the web browser. Good for entity extraction, summarizing short texts, and structured data extraction.

NuExtract-v1.5 from NuMind (Tiny, Base, Large) - Fine-tuned for structured entity extraction. This model takes a text input and an example JSON output, and returns a JSON string that matches the example schema. This is different from structured output generation through constrained token sampling. NuExtract generates JSON strings directly and is trained to do so with high accuracy. Constrained token sampling, on the other hand, restricts the decoder logits to produce valid JSON that follows a predefined grammar or regex pattern. Constrained sampling is more flexible and works with any base model, but adds overhead to the generation process. NuExtract promises to be both more efficient (by directly producing JSON strings) and more accurate for structured extraction tasks.
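To make the input format concrete, here is a minimal sketch of building a NuExtract-style prompt. The exact marker strings (`<|input|>`, `### Template:`, `### Text:`, `<|output|>`) are my reading of the NuExtract model card and may differ across versions, so check the model card before relying on them:

```python
import json

def build_nuextract_prompt(text: str, template: dict) -> str:
    """Build a NuExtract-style prompt: the model sees an example JSON
    schema plus the source text, then completes the JSON after <|output|>.
    Marker format is an assumption based on the model card."""
    return (
        "<|input|>\n"
        "### Template:\n"
        f"{json.dumps(template, indent=4)}\n"
        "### Text:\n"
        f"{text}\n"
        "<|output|>"
    )

# Example schema with empty values, as NuExtract expects
template = {"name": "", "employer": "", "start_date": ""}
prompt = build_nuextract_prompt(
    "Jane Doe joined Acme Corp in March 2021.", template
)
print(prompt)
```

The model's completion (everything after `<|output|>`) is then parsed directly with `json.loads`.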

Arctic-embed from Snowflake (22M-335M) - Embedding-only, excels at retrieval tasks. I’ve used this model for a few retrieval problems and it works as advertised. I’ve found these smaller embedding models to be a good alternative to BERT for pure retrieval tasks.
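The retrieval workflow with any of these embedding models reduces to embedding the query and documents, then ranking by cosine similarity. A stdlib-only sketch with toy vectors (in practice the vectors would come from Arctic-embed, e.g. via sentence-transformers):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-dim embeddings stand in for real model output.
docs = {
    "invoice": [0.9, 0.1, 0.0],
    "weather": [0.1, 0.8, 0.2],
}
query = [0.85, 0.15, 0.05]

# Rank documents by similarity to the query, best match first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # → ['invoice', 'weather']
```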

Nomic-Embed-text from Nomic - Embedding-only, excels at retrieval tasks. I’ve found Nomic to be highly capable for its size. The model handles sequence lengths from 2048 up to 8192 input tokens. nomic-embed-text-v1.5 was trained with Matryoshka Representation Learning, which means you can choose the output embedding dimension from 64 up to 768. The highest dimension, 768, is most accurate, and accuracy is decent down to 256 dimensions, after which it drops off quickly.
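Using a smaller Matryoshka dimension is just truncation plus renormalization: keep the first k components of the full embedding and L2-normalize so cosine similarity still behaves. A stdlib-only sketch with a dummy vector standing in for real nomic-embed-text-v1.5 output:

```python
import math

def truncate_matryoshka(embedding, dim):
    """Keep the first `dim` components of a Matryoshka-trained embedding
    and L2-renormalize the result."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Dummy 768-dim vector in place of a real model output.
full = [math.sin(i + 1) for i in range(768)]
small = truncate_matryoshka(full, 256)
print(len(small))  # → 256
```

Downstream similarity search then runs on the 256-dim vectors, cutting index size and compute at a modest accuracy cost.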

Qwen 3 provides a 0.6B parameter model with a 32K context length, a 1.5GB size, and an Apache 2.0 license. This model is small but has thinking and reasoning capability. Ollama supports Qwen 3, and thinking can be enabled or disabled using the think parameter or by passing /no_think in the prompt. The Qwen 3 docs provide some guidance and best practices:

For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section.
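Putting the two points above together, here is a sketch of a request body for Ollama's /api/generate endpoint (POSTed to http://localhost:11434/api/generate) that enables thinking and applies the recommended sampling settings. The "think" field and option names reflect my reading of the Ollama docs and may vary across versions:

```python
import json

payload = {
    "model": "qwen3:0.6b",
    "prompt": "Extract the city names from: 'Flights from Oslo to Tokyo.'",
    # Set False (or append /no_think to the prompt) to disable thinking.
    "think": True,
    "options": {
        # Qwen 3's recommended thinking-mode sampling settings.
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0,
    },
    "stream": False,
}
print(json.dumps(payload, indent=2))
```

With greedy decoding deliberately avoided, per the quoted guidance.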

#LLM