Generative AI has changed the way we interact with technology, but its widespread use presents challenges in terms of privacy, governance and resource efficiency. In this article we examine how Small Language Models (SLMs) offer a more controlled and efficient alternative to Large Language Models (LLMs), and how model quantization can further improve their performance.
What are SLMs?
Small Language Models (SLMs) are language models with an architecture similar to that of LLMs but with a much smaller number of parameters, trained for natural language processing, understanding and content generation.
These are their main advantages:
- Low resource consumption (RAM/CPU/GPU): They require less processing power, which makes them easier to deploy on more affordable hardware.
- Greater control: They can be used in both local and private environments, which ensures greater security and governance.
- Increased inference speed: Responses are generated faster and more efficiently because there are fewer parameters.
Examples of known SLMs:
- Mistral 7B
- Phi-2 from Microsoft (~2.7B parameters)
- TinyLLaMA
- Gemma 2B from Google
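As a quick illustration of how lightweight these models are to run, the sketch below loads one of them with the Hugging Face transformers library. The exact checkpoint name is an assumption; any locally available SLM works the same way.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Checkpoint name is illustrative; replace with any small model available locally
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A ~1B-parameter model fits comfortably in the memory of a modest CPU or GPU
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Small Language Models are", max_new_tokens=40)[0]["generated_text"])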
Model Quantization: Reducing Size without Losing Precision
One of the main obstacles to adopting and deploying AI models is their size and the processing power they require. Here, quantization plays a key role. This technique reduces model size by converting high-precision weights (FP32, FP16) into low-precision weights (INT8, INT4) without significantly affecting model performance. Its advantages include the following (a minimal quantization sketch follows this list):
- Reduced memory usage: Allows models to be stored and run on devices with limited capacity.
- Increased GPU/CPU efficiency: Speeds up mathematical operations and reduces compute load.
- Accelerated inference: Models respond more quickly, at the cost of only a slight reduction in calculation precision.
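To make the idea concrete, here is a minimal sketch using PyTorch's dynamic quantization, which converts the linear layers of a toy model from FP32 to INT8 weights. The toy model is illustrative and simply stands in for a language model's layers.

import torch

# A small FP32 model standing in for a language model's linear layers
fp32_model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

# Dynamic quantization: the weights of nn.Linear layers are stored as INT8
int8_model = torch.ao.quantization.quantize_dynamic(
    fp32_model, {torch.nn.Linear}, dtype=torch.qint8
)

# INT8 weights take roughly a quarter of the memory of FP32 weights,
# and the quantized matrix multiplications run faster on most CPUs
x = torch.randn(1, 1024)
print(int8_model(x).shape)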
Comparison of SLMs and LLMs
While LLMs have proven to be powerful tools, SLMs offer important advantages in situations where efficiency and privacy are paramount. Below, we compare the two approaches:
- Size and resources: SLMs have far fewer parameters and can run on affordable hardware, while LLMs typically require large-scale GPU infrastructure.
- Privacy and control: SLMs can be deployed locally with full governance over data and processes, whereas LLMs are usually consumed as third-party cloud services.
- Inference speed and cost: SLMs respond faster and more cheaply thanks to their smaller size, while LLMs offer broader general capabilities at higher latency and cost.
Retrieval Augmented Generation (RAG)
The Retrieval Augmented Generation (RAG) technique is used to improve the accuracy and contextualisation of models. This method enriches responses by retrieving information from additional sources and improving the context prior to text generation. Its structure consists of the following steps (a sketch of the indexing side follows this list):
- Chunking: fragmentation of data into manageable parts.
- Document embeddings: conversion of text into numeric vectors.
- Vector database (VectorDB): a database that stores and retrieves the relevant vectors.
- Information retrieval: given a query, it locates the most relevant pieces of information.
- Response generation: synthesis of the contextualised information to improve the output of the model.
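The first three steps correspond to building the vector index offline. Below is a minimal sketch with LangChain and FAISS; the source file name, chunk sizes and output path are assumptions, and the saved index is the kind of database the API later in this article loads.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Chunking: split the raw documents into manageable fragments
with open("knowledge_base.txt", encoding="utf-8") as f:
    raw_text = f.read()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(raw_text)

# 2. Embeddings: convert each chunk into a numeric vector
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 3. Vector database: index the vectors so they can be retrieved later
db = FAISS.from_texts(chunks, embeddings)
db.save_local("path_vector_db")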
Implementing SLMs + RAG: An Efficient and Secure Model
The combination of SLMs with the RAG strategy enables the creation of highly efficient and controllable generative AI systems. With this architecture, organisations can use optimised models that ensure greater privacy while using fewer resources.
Key Benefits:
- Optimised use of data: The inclusion of information retrieval allows for more accurate and informed responses.
- Full control over the model: avoids the need for third-party services and allows customisation of AI behaviour.
- Execution in restricted environments: Thanks to quantization and their smaller size, SLMs can be deployed on edge devices or local servers.
Basic Architecture with a Quantized Model, LangChain, RAG and FastAPI
The following architecture can be used to create an efficient SLM environment:
- Loading the quantized model: A model previously quantized to INT8 or INT4 is used to optimise performance.
- LangChain for prompt management: LangChain structures and orchestrates the requests sent to the model.
- RAG for enhanced retrieval: Vector databases are used to improve the context of the responses.
- REST API with FastAPI: The model is exposed through a REST API to facilitate integration with other applications.
Example code:
from fastapi import FastAPI, HTTPException
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the quantized model (the checkpoint is assumed to be already quantized to INT8/INT4)
tokenizer = AutoTokenizer.from_pretrained("model-quantified")
model = AutoModelForCausalLM.from_pretrained("model-quantified")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)

# Load the vector database for RAG
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local("path_vector_db", embeddings)
retriever = db.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

app = FastAPI()

@app.post("/generate")
def generate_response(prompt: str, max_length: int = 100):
    try:
        response = qa_chain.run(prompt)
        return {"answer": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Explanation
1. Loading the quantized model:
tokenizer = AutoTokenizer.from_pretrained("model-quantified")
model = AutoModelForCausalLM.from_pretrained("model-quantified")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
- AutoTokenizer.from_pretrained("model-quantified"): Loads the tokeniser associated with the quantized model.
- AutoModelForCausalLM.from_pretrained("model-quantified"): Loads the checkpoint, which is assumed to have been quantized to INT8/INT4 beforehand; its low-precision weights reduce memory usage and speed up inference.
- pipeline("text-generation", model=model, tokenizer=tokenizer): Creates a text generation pipeline based on the quantized model.
- HuggingFacePipeline(pipeline=pipe): Integrates the pipeline into LangChain for later use in the RAG architecture.
2. Retrieval-Augmented Generation (RAG) configuration:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local("path_vector_db", embeddings)
retriever = db.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
- HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"): Uses an embedding model (MiniLM-L6-v2) to convert text into vector representations.
- FAISS.load_local("path_vector_db", embeddings): Loads a FAISS vector database containing previously generated embeddings.
- db.as_retriever(): Turns the database into a retriever that fetches the most relevant information.
- RetrievalQA.from_chain_type(llm=llm, retriever=retriever): Combines the quantized language model with information retrieval to improve response generation.
3. Creating the API with FastAPI:
app = FastAPI()

@app.post("/generate")
def generate_response(prompt: str, max_length: int = 100):
    try:
        response = qa_chain.run(prompt)
        return {"answer": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
- FastAPI(): Creates a REST API to expose the RAG model and its functionality.
- @app.post("/generate"): Defines a /generate endpoint that accepts POST requests with an input prompt.
- qa_chain.run(prompt): Uses the combination of information retrieval (RAG) and text generation to produce the answer.
- Exception handling: If an error occurs, an HTTP 500 code is returned with the error message.
4. Server execution:
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
- uvicorn.run(app, host="0.0.0.0", port=8000): Starts the server on port 8000, allowing access to the API.
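Once the server is running, the endpoint can be exercised with a minimal client like the sketch below. The URL and prompt are illustrative; since the parameters are declared as simple types, FastAPI expects them as query parameters.

import requests

# Hypothetical query against the locally running service
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "What are the advantages of SLMs?", "max_length": 100},
)
print(resp.json()["answer"])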
Conclusion
Companies and developers can now adopt lighter, faster and more private models thanks to SLMs (Small Language Models), a key breakthrough in the evolution of generative AI. Model quantization, which significantly reduces size and computational requirements without compromising basic performance, allows these models to run in on-premise environments or on resource-constrained devices, while maintaining full control over data and processes.
This approach is complemented by an architecture based on RAG (Retrieval-Augmented Generation), together with tools such as FastAPI and LangChain. These strategies enable the deployment of AI solutions that are governable, auditable and tailored to specific requirements, making fully controlled AI generation a realistic and effective choice for demanding sectors such as data analysis, scientific research or customer service.
The combination of quantized SLMs, a modular architecture and autonomous deployment represents one of the most secure and efficient ways to integrate generative AI into your organisation.
Want to see how this translates into a real case?
Access the full paper and find out more.