Generative AI with Total Control: Small Language Models (SLMs) and Quantization

Generative AI has changed the way we interact with technology, but its widespread use raises challenges around privacy, governance and resource efficiency. In this article we examine how Small Language Models (SLMs) offer a more controlled and efficient alternative to Large Language Models (LLMs), and how model quantization can further improve their performance.

What are SLMs?

Small Language Models (SLMs) are language models with an architecture similar to that of LLMs but with far fewer parameters, used for natural language processing, understanding and content generation.

Their main advantages are:

  • Low resource consumption (RAM/CPU/GPU): they require less processing power, which makes them easier to deploy on more affordable hardware.
  • Greater control: they can be run in local and private environments, which ensures better security and governance.
  • Faster inference: responses are generated more quickly and efficiently because there are fewer parameters.

Examples of well-known SLMs (a usage sketch follows the list):

  • Mistral 7B
  • Phi-2 from Microsoft (~2.7B parameters)
  • TinyLLaMA
  • Gemma 2B from Google
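As a minimal sketch of what working with an SLM looks like in practice (assuming the Hugging Face transformers library and enough local memory; the TinyLlama checkpoint name is given only as an illustration), one of these models can be loaded and queried entirely on local hardware:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Illustrative checkpoint: any of the SLMs listed above can be used the same way.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A simple text-generation pipeline that runs without any external service.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("What is a Small Language Model?", max_new_tokens=50)[0]["generated_text"])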

Model Quantization: Reducing Size without Losing Precision

One of the main obstacles to deploying and running AI models is their size and the processing power they require. Here, quantization plays a key role. This technique reduces model size by converting high-precision weights (FP32, FP16) to lower-precision representations (INT8, INT4) without significantly affecting model performance. Advantages include the following (a practical sketch follows the list):

  • Reduced memory usage: allows models to be stored and run on devices with limited capacity.
  • Greater GPU/CPU efficiency: lower-precision arithmetic reduces compute load and speeds up mathematical operations.
  • Faster inference: models respond more quickly, with only a minor reduction in calculation precision.
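As a minimal sketch of how quantization is applied in practice (assuming the transformers and bitsandbytes libraries and a CUDA-capable GPU; the Mistral 7B identifier is used only as an illustration), a model can be loaded directly in 4-bit precision:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization configuration (requires the bitsandbytes package and a GPU).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative: any causal LM from the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
# High-precision weights are converted to 4-bit representations at load time,
# cutting the memory footprint to roughly a quarter of FP16.
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)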

Comparison of SLMs and LLMs

While LLMs have proven to be powerful tools, SLMs offer important advantages in situations where efficiency and privacy are paramount. The main differences are:

  • Size: SLMs have far fewer parameters than LLMs, which keeps the models lightweight.
  • Resources: SLMs run on more affordable hardware, whereas LLMs require substantial RAM/CPU/GPU capacity.
  • Control and privacy: SLMs can be deployed in local or private environments; LLMs are typically consumed as third-party services.
  • Inference speed: with fewer parameters, SLMs generate responses more quickly.

Retrieval-Augmented Generation (RAG)

The Retrieval-Augmented Generation (RAG) technique is used to improve the accuracy and contextualisation of models. This method enriches responses by retrieving information from additional sources and improving the context prior to text generation. Its structure consists of the following steps (a minimal indexing sketch follows the list):

  1. Chunking: fragmentation of the data into manageable pieces.
  2. Document embeddings: conversion of text into numeric vectors.
  3. Vector database (VectorDB): a database that stores and retrieves the relevant vectors.
  4. Information retrieval: given a query, the most relevant pieces of information are located.
  5. Response generation: synthesis of the contextualised information to improve the model's output.
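As a minimal sketch of steps 1 to 3 (assuming LangChain with a FAISS vector store and the same MiniLM embedding model used later in this article; the input file name is illustrative), the index can be built and saved under path_vector_db, the directory the API example below loads:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Chunking: split the source documents into manageable pieces.
docs = TextLoader("knowledge_base.txt").load()  # illustrative file name
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# 2. Embeddings: convert each chunk into a numeric vector.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 3. Vector database: store the vectors in FAISS and persist the index to disk.
db = FAISS.from_documents(chunks, embeddings)
db.save_local("path_vector_db")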

Implementing SLMs + RAG: An Efficient and Secure Model

The combination of SLMs with the RAG strategy enables the creation of highly efficient and controllable generative AI systems. With this architecture, organisations can use optimised models that ensure greater privacy while using fewer resources.

Key Benefits:
  • Optimised use of data: the information-retrieval step allows for more accurate and better-informed responses.
  • Full control over the model: avoids the need for third-party services and allows the AI's behaviour to be customised.
  • Execution in restricted environments: thanks to quantization and their smaller size, SLMs can be deployed on edge devices or local servers.

Basic Architecture with a Quantized Model, LangChain, RAG and FastAPI

The following architecture can be used to create an efficient SLM environment:

  1. Loading the quantized model: a model that has previously been quantized to INT8 or INT4 is used to optimise performance.
  2. LangChain for prompt management: LangChain is used to structure and extend the requests sent to the model.
  3. RAG for enhanced retrieval: vector databases are used to improve the context of the responses.
  4. REST API with FastAPI: the model is exposed through an API to facilitate integration with other applications.
Example code:

from fastapi import FastAPI, HTTPException
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# Load quantized model (8-bit loading requires the bitsandbytes package)
tokenizer = AutoTokenizer.from_pretrained("model-quantified")
model = AutoModelForCausalLM.from_pretrained(
    "model-quantified",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)

# Load vector database for RAG
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local("path_vector_db", embeddings)
retriever = db.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

app = FastAPI()

@app.post("/generate")
def generate_response(prompt: str, max_length: int = 100):
    try:
        response = qa_chain.run(prompt)
        return {"answer": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Explanation

1. Loading the quantized model:

tokenizer = AutoTokenizer.from_pretrained("model-quantified")
model = AutoModelForCausalLM.from_pretrained(
    "model-quantified",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)

  • AutoTokenizer.from_pretrained("model-quantified"): loads the tokenizer of the quantized model.
  • AutoModelForCausalLM.from_pretrained("model-quantified", quantization_config=BitsAndBytesConfig(load_in_8bit=True)): loads the model weights in INT8 precision, which reduces memory usage and speeds up inference.
  • pipeline("text-generation", model=model, tokenizer=tokenizer): creates a text-generation pipeline based on the quantized model.
  • HuggingFacePipeline(pipeline=pipe): integrates the pipeline into LangChain for later use in the RAG architecture.
2. Retrieval-Augmented Generation (RAG) configuration:

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local("path_vector_db", embeddings)
retriever = db.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

  • HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"): uses an embedding model (MiniLM-L6-v2) to convert text into vector representations.
  • FAISS.load_local("path_vector_db", embeddings): loads a FAISS vector database built from previously generated embeddings.
  • db.as_retriever(): turns the database into a retriever that fetches the relevant information for each query.
  • RetrievalQA.from_chain_type(llm=llm, retriever=retriever): combines the quantized language model with information retrieval to improve response generation.
3. Creating the API with FastAPI:

app = FastAPI()

@app.post("/generate")
def generate_response(prompt: str, max_length: int = 100):
    try:
        response = qa_chain.run(prompt)
        return {"answer": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

  • FastAPI(): creates a REST API that exposes the RAG model and its functionality.
  • @app.post("/generate"): defines a /generate endpoint that accepts POST requests with an input prompt.
  • qa_chain.run(prompt): combines information retrieval (RAG) and text generation to produce the response.
  • Exception handling: if an error occurs, an HTTP 500 code is returned with the error message.
4. Server execution:

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

  • uvicorn.run(app, host="0.0.0.0", port=8000): starts the server on port 8000, allowing access to the API.
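Once the server is running, the endpoint can be called from any HTTP client. As a usage sketch (assuming the requests library is installed and the server is running locally on port 8000; the example prompt is illustrative), note that FastAPI treats the prompt argument defined above as a query parameter:

import requests

# The prompt is sent as a query parameter, matching the endpoint signature above.
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "What advantages do SLMs offer over LLMs?"},
)
print(resp.json()["answer"])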

Conclusion

Companies and developers can now adopt lighter, faster and more private models thanks to SLMs (Small Language Models), a key breakthrough in the evolution of generative AI. Model quantization, which significantly reduces size and computational requirements without compromising core performance, allows these models to run in on-premise environments or on resource-constrained devices while maintaining full control over data and processes.

This approach is complemented by an architecture based on RAG (Retrieval-Augmented Generation) together with tools such as FastAPI and LangChain. These strategies enable the deployment of AI solutions that are governable, auditable and tailored to specific requirements, making fully controlled generative AI a realistic and effective choice for demanding sectors such as data analysis, scientific research and customer service.

The combination of quantized SLMs, a modular architecture and autonomous deployment represents one of the most secure and efficient ways to integrate generative AI into your organisation.

Want to see how this translates into a real case?

Access the full paper and find out more.
