Deploying local LLM hosting for free with vLLM

10/03/2025

Deploying Large Language Models (LLMs) locally has become an increasingly viable option, especially with advancements in frameworks like vLLM. In a previous discussion, the process of serving LLMs using Ollama was explored, providing a streamlined approach for deployment. Building on that foundation, this guide delves into hosting LLMs locally with vLLM, offering a flexible and efficient solution for those looking to operate models on their hardware.

Overview

For individuals with a capable GPU and an interest in self-hosted AI models, vLLM offers a powerful and cost-effective alternative to cloud-based APIs, eliminating common constraints such as rate limits, high operational costs, and dependency on third-party services. By leveraging vLLM, users gain greater control over their model deployments, optimizing performance and customization to suit specific needs.

vLLM is an open-source library for fast, memory-efficient LLM inference and serving.

This tutorial provides a comprehensive, step-by-step guide on setting up an LLM locally using vLLM, ensuring accessibility even for those with no prior experience. From installation to optimization, this guide covers essential configurations, allowing users to maximize efficiency while maintaining full ownership of their AI infrastructure.

The accompanying GitHub repository is public at this GitHub repo. You can clone it, follow the instructions, or customize it to suit your own experiments.

Prerequisites

First things first, make sure you have the following:

  • Python 3.8+ 

  • A basic understanding of Python (if you can write a print('Hello, world!'), you’re in)

  • (Optional) A modern GPU (ideally with 16GB+ VRAM, but lower-end ones might work too) 

  • If you want to take advantage of a GPU, you should have NVIDIA CUDA installed (check your version with nvidia-smi)
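
Before moving on, the short script below offers a quick sanity check of these prerequisites (a minimal sketch, not part of vLLM itself):

import shutil
import subprocess
import sys

# Check the interpreter version (vLLM needs Python 3.8+)
print(f"Python: {sys.version.split()[0]}")
assert sys.version_info >= (3, 8), "Python 3.8 or newer is required"

# Check whether the NVIDIA driver tools are available (optional, GPU only)
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found - CPU-only setup (or drivers not installed)")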

Tutorial for deploying local LLM hosting for free with vLLM

Step 1: Set up your environment

LLMs can be served using vLLM with either Docker or Python. For an isolated, hassle-free setup that avoids environment conflicts, Docker is a reliable choice.

1. Docker

Install Docker

Before running vLLM in a container, Docker must be installed on the system. If Docker has not been set up yet, the previous blog post on Ollama provides a step-by-step guide for installation.

Run vLLM in a Docker Container

To launch vLLM in a controlled environment, execute the following command in the terminal:

docker run -itd --rm --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
-e MAX_BATCH_SIZE=16 -e BATCH_TIMEOUT_MS=100 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2-1.5B-Instruct \
--tokenizer-mode auto \
--gpu-memory-utilization 0.8 \
--max-model-len 16384

Now, let's break down the command:

  • docker run -itd --rm

    • -i: Interactive mode (keeps the container open).

    • -t: Allocates a pseudo-TTY (allows you to interact with the container).

    • -d: Runs the container in detached mode (background process).

    • --rm: Automatically removes the container when stopped.

  • --runtime nvidia --gpus all

    • Enables GPU acceleration using NVIDIA's runtime.

    • If you have multiple GPUs but want to use only one, you can specify it with --gpus '"device=0"'.

  • -v ~/.cache/huggingface:/root/.cache/huggingface

    • Mounts your local Hugging Face model cache inside the container.

    • This avoids redownloading models every time you start a new container.

  • -p 8000:8000

    • Maps port 8000 from the container to the host machine, allowing access to the API.

  • -e MAX_BATCH_SIZE=16 -e BATCH_TIMEOUT_MS=100

    • Environment variables to control request batching:

      • MAX_BATCH_SIZE=16: Limits the number of requests processed at once.

      • BATCH_TIMEOUT_MS=100: Waits 100 milliseconds before processing a batch.

  • vllm/vllm-openai:latest

    • The official vLLM OpenAI-compatible API Docker image. If you haven't pulled it before, Docker will download it automatically (around 8.17GB).

  • --model Qwen/Qwen2-1.5B-Instruct

    • Specifies the model to use. You can replace this with another model that better suits your hardware and needs.

  • --tokenizer-mode auto

    • Automatically selects the best tokenizer mode for the model.

  • --gpu-memory-utilization 0.8

    • Allocates 80% of GPU memory to vLLM for model inference, leaving some space for other tasks. Adjust this if you experience out-of-memory (OOM) errors.

  • --max-model-len 16384

    • Sets the maximum sequence length (number of tokens) the model can handle in one request. Higher values increase memory usage.

Check container logs

To verify that the server is running correctly, review the logs for any errors or status updates.

docker logs -f <docker-container-id>

If the logs look like this:

INFO: 	Started server process [1]
INFO: 	Waiting for application startup.
INFO: 	Application startup complete.

Congratulations! The LLM is now running locally via vLLM.

API requests can be sent to http://localhost:8000 to start interacting with the model.
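
As a quick smoke test, a chat completion request can be sent from Python (a minimal sketch; the model name must match the one passed to --model):

import requests

# vLLM exposes an OpenAI-compatible chat completions endpoint
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.2,
    },
)
print(response.json()["choices"][0]["message"]["content"])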

Run vLLM on CPU only (no GPU)

For systems without a GPU, or for those looking to test an LLM on a CPU, vLLM can be built and executed in a CPU-only environment. This approach optimizes the setup specifically for CPU inference while minimizing unnecessary dependencies related to GPU acceleration.

To begin, download the vLLM source code from GitHub and navigate to the project directory.

git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source

Build a CPU-optimized Docker image

Since the default vLLM Docker image is designed for GPU usage, we need to build a CPU-only image. vLLM provides a dedicated Dockerfile.cpu, which creates a smaller image by excluding unnecessary GPU dependencies.

Run the following command to build the CPU-based image:

docker build -t vllm-cpu -f Dockerfile.cpu . 

This process may take a few minutes as it installs all required dependencies.

Run vLLM in a Docker container (CPU mode)

Once the image is built, we can start the container in CPU mode. It is important to ensure that the device is explicitly set to CPU before running the container.

docker run -itd --rm \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -p 8000:8000 \
 -e MAX_BATCH_SIZE=16 \
 vllm-cpu:latest \
 --model Qwen/Qwen2-1.5B-Instruct \
 --trust-remote-code \
 --device cpu \
 --dtype bfloat16 \
 --tokenizer-mode auto

Now, the LLM Qwen/Qwen2-1.5B-Instruct is running on the CPU.
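
To confirm the container is serving the model, the OpenAI-compatible /v1/models endpoint can be queried (a small sketch using requests):

import requests

# Lists the models the running vLLM server has loaded
models = requests.get("http://localhost:8000/v1/models").json()
for m in models.get("data", []):
    print(m["id"])  # should print Qwen/Qwen2-1.5B-Instruct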

2. Python

For those who prefer a more flexible and lightweight approach, vLLM can be run directly with Python. This method offers greater control over the environment and eliminates the overhead of running a Docker container.

Install Python

Ensure that Python 3.8 or higher is installed on the system. If an installation or update is needed, the latest version can be downloaded from the official Python website.

Set up a virtual environment

To maintain a clean and isolated development environment, a virtual environment should be created and activated before proceeding.

python -m venv vllm-env  # you can use "uv venv vllm-env" alternatively, if you have uv installed
source vllm-env/bin/activate  # On Windows use "vllm-env\Scripts\activate"

This setup ensures that installed packages remain isolated from the system-wide Python environment, preventing potential conflicts.

Once the virtual environment is activated, vLLM can be installed along with its required dependencies. For systems utilizing a GPU, the appropriate CUDA drivers must be installed to enable acceleration.

pip install vllm  # use "uv pip install vllm" if your environment was created with uv

Start the vLLM Server

Run the following command to launch the vLLM OpenAI-compatible API server:

python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--dtype half

Breaking down the command:

  • python -m vllm.entrypoints.openai.api_server

    • Starts vLLM as an API server that follows the OpenAI API format.

  • --model Qwen/Qwen2-1.5B-Instruct

    • Loads the Qwen2-1.5B-Instruct model. You can replace it with a different model depending on your hardware and use case.

  • --trust-remote-code

    • Allows execution of custom model code downloaded from Hugging Face (some models require this).

  • --dtype half

    • Uses half-precision (FP16) for faster inference and lower memory usage.

After the server is up and running, interactions with the model can be made by sending requests to http://localhost:8000. This approach functions similarly to the Docker method, providing an API-compatible endpoint for LLM inference.
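
Because the server follows the OpenAI API format, the official openai Python client can also be pointed at it (a sketch, assuming the client is installed with pip install openai; the API key can be any placeholder string when no --api-key is configured on the server):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2-1.5B-Instruct",
    messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
    temperature=0.2,
)
print(completion.choices[0].message.content)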

Step 2: Choose your model

Now that vLLM is set up, the next step is to choose the LLM model to deploy. Hugging Face provides a vast collection of pre-trained models, allowing selection based on specific requirements such as model size, performance, and intended use case.

Find the right model name on Hugging Face

To ensure compatibility, the correct model name must be retrieved from the Hugging Face Model Hub:

  • Visit Hugging Face

  • Search for a model (e.g., Llama 3, Mistral, Qwen).

  • Click on the model you want.

  • Copy the full repository name from the page URL, typically in org_name/model_name format. Example: Qwen/Qwen2-1.5B-Instruct or microsoft/Magma-8B.

Note: Fine-tuned LLMs can also be used, including models trained with LoRA, DeepSpeed, or other optimization techniques. A detailed guide on fine-tuning LLMs will be covered in the LLM Fine-Tuning Series.

Install the transformers library (when using Python)

If Docker is not being used, the model must be downloaded manually. Before proceeding, ensure the transformers library is installed:

pip install transformers

Verify model availability

To check if the model can be downloaded and loaded properly, run:

python -c "from transformers import AutoModel; AutoModel.from_pretrained('Qwen/Qwen2-1.5B-Instruct')"

Replace 'Qwen/Qwen2-1.5B-Instruct' with the actual model name you copied from Hugging Face.
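
If downloading the full model weights just to verify availability is unnecessary, a lighter check is to load only the model configuration (a sketch; AutoConfig fetches just a small config file):

from transformers import AutoConfig

# Downloads only config.json, not the model weights
config = AutoConfig.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
print(config.model_type, config.num_hidden_layers)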

Tip: Some models require --trust-remote-code when running with vLLM. If an error related to "unsafe code execution" appears, enable this flag when starting the vLLM server.

With the model selected, the next step is to run it locally!

Step 3: Integrate the model into applications

With the LLM now running, the next step is to test it by sending a request from Python. vLLM can be integrated into applications using frameworks such as FastAPI or Gradio, enabling the creation of an API endpoint or a simple chat interface, similar to how Ollama operates.

Use FastAPI

import requests
from fastapi import FastAPI, Form

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"


@app.post("/chatbot")
async def chatbot(
    model_alias: str = Form(default="Qwen/Qwen2-1.5B-Instruct"),
    question: str = Form(default=""),
):
    payload = {
        "model": model_alias,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
    }
    response = requests.post(VLLM_URL, json=payload).json()

    if response.get("choices") is None:
        return response["error"].get("message", "Error: No response from vLLM.")

    return response["choices"][0]["message"].get("content", "Error: No response from vLLM.")
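
Assuming the snippet above is saved as app.py and served with uvicorn on a port other than 8000 (which the vLLM server already occupies), for example uvicorn app:app --port 8001, the endpoint can be called like this (a minimal sketch; the port is an assumption):

import requests

# Call the /chatbot endpoint defined above; it expects form-encoded fields
reply = requests.post(
    "http://localhost:8001/chatbot",  # assumes uvicorn was started with --port 8001
    data={"model_alias": "Qwen/Qwen2-1.5B-Instruct", "question": "What is vLLM?"},
).json()
print(reply)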

Use Gradio

import json

import gradio as gr
import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"
HEADERS = {"Content-Type": "application/json"}
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"


def chat_with_llm(message, history):
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if history:
        for user_msg, bot_reply in history:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": bot_reply})

    # Append the latest user message
    messages.append({"role": "user", "content": message})

    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "temperature": 0.2,
    }

    response = requests.post(VLLM_URL, headers=HEADERS, data=json.dumps(payload)).json()
    if response.get("choices") is None:
        return response["error"].get("message", "Error: No response from vLLM.")

    return response["choices"][0]["message"].get("content", "Error: No response from vLLM.")


gr.ChatInterface(fn=chat_with_llm, title="vLLM Chatbot").launch(share=True)

One key difference between vLLM and Ollama is the default API port: vLLM's OpenAI-compatible server listens on port 8000, while Ollama's API listens on port 11434.

Whether to expose the locally running LLM as an API endpoint with FastAPI or as a web-based chatbot with Gradio depends on the specific requirements of the project.

Small tip: Run vLLM with GPU on Google Colab

Don't have a GPU on your local machine? No problem! vLLM can still be tested with GPU acceleration using cloud-based solutions such as Google Colab or Kaggle Notebook, both of which offer free GPU access.

Install vLLM and Gradio

Run the following command in a Colab notebook cell to install vLLM and Gradio:

!pip install -q vllm gradio

Now, define a Python script to interact with the LLM and serve it through a Gradio interface. Run this in a new Colab cell:

import json

import gradio as gr
import requests

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"


def chat_with_llm(message, history):
    headers = {"Content-Type": "application/json"}
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if history:
        for user_msg, bot_reply in history:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": bot_reply})

    # Append the latest user message
    messages.append({"role": "user", "content": message})

    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "temperature": 0.2,
    }

    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    try:
        reply = response.json()["choices"][0]["message"]["content"]
    except (KeyError, IndexError, ValueError):
        reply = "Error: Could not get response from model."

    return reply


# Launch Gradio chat interface
gr.ChatInterface(fn=chat_with_llm, title="vLLM Chatbot").launch(share=True)

Start the vLLM Server on Google Colab

Now, run the third cell to launch vLLM:

!python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--dtype half

Important notes

To ensure a smooth experience when running vLLM with Gradio, certain steps must be followed in the correct order. Keeping the model server active is essential for maintaining chatbot functionality. Below are key points to keep in mind:

  • Order matters!

The correct order of execution is essential for proper functionality. First, the second cell should be executed to start Gradio, which initializes the chatbot UI. Once Gradio is running, the third cell must be executed to start vLLM, the model server responsible for processing LLM requests. Running these in the correct sequence ensures a stable and responsive environment.

  • Keep the third cell running!

For the chatbot to function properly, the vLLM server must remain active at all times. If the third cell, which runs the server, is stopped or interrupted, the chatbot will no longer be able to process responses. Keeping this process running ensures continuous interaction with the model.

  • Accessing the Gradio interface

Once the Gradio interface is successfully launched, a public link will be generated in the second cell. This link, typically formatted as https://xxxxxxxxxxxxxxxxx.gradio.live/, allows direct access to the chatbot in a web-based interface. Clicking the link provides a convenient way to interact with the LLM without additional setup.

This method provides a way to experience vLLM’s GPU acceleration without requiring a high-end local machine.

Ollama vs. vLLM: Why is vLLM preferred for production?

Feature | Ollama | vLLM
--- | --- | ---
Ease of Use | Very easy to set up, CLI-based | More complex setup, requires configuration
Performance | Optimized for local use, but lacks batching | High-performance serving with continuous batching
Scalability | Not designed for handling multiple concurrent requests | Built for multi-user and high-throughput inference
API Support | Custom API, simple usage | OpenAI-compatible API, easier integration with existing applications
GPU Utilization | Limited optimizations | Optimized for GPU memory management and large models
Best For | Personal projects, local experiments | Production-level AI serving, scalable applications
Platform Compatibility | Mac (Apple Silicon optimized), Linux | Works best on Linux with NVIDIA GPUs

While Ollama is well-suited for local experiments and quick testing, vLLM is designed for scalability, efficiency, and production-level AI serving. When deploying LLMs for real-world applications, vLLM offers a more robust and performance-optimized solution.

The advantages and disadvantages

When deploying LLMs, it is essential to evaluate both the advantages and limitations of vLLM.

Advantages

  • High performance 

vLLM is designed for efficiency, utilizing PagedAttention and continuous batching to process multiple requests simultaneously. This approach maximizes GPU utilization, significantly reducing latency and improving response times for large-scale applications.

  • Scalability 

One of the key strengths of vLLM is its ability to handle multiple concurrent requests with minimal latency. Unlike traditional models that may struggle under heavy workloads, vLLM optimizes resource allocation, making it ideal for production environments that require real-time AI inference at scale.

  • OpenAI API compatibility 

vLLM is fully compatible with OpenAI’s GPT API format, allowing developers to integrate it seamlessly into existing AI applications. This compatibility reduces development effort, as applications designed for OpenAI models can transition to vLLM with minimal modifications.

  • Efficient GPU utilization 

By implementing dynamic memory management, vLLM optimizes GPU resources to support larger models and longer sequences without excessive memory consumption. This feature allows AI models to run efficiently, even when processing complex inputs, making it a valuable solution for high-performance machine learning tasks.

Disadvantages

  • More complex setup 

Unlike simpler alternatives, vLLM requires manual configuration for GPU acceleration, memory optimization, and API serving. Setting up the environment involves fine-tuning system parameters, which may pose challenges for those unfamiliar with deep learning infrastructure.

  • Higher hardware requirements 

Although vLLM supports CPU-based inference, it is highly optimized for NVIDIA GPUs and relies on CUDA acceleration for optimal performance. Running vLLM on a CPU significantly reduces processing speed, making GPU access almost essential for real-world deployments.

  • Not as beginner-friendly 

Compared to user-friendly solutions like Ollama, vLLM demands a deeper understanding of LLM deployment and hardware optimization. Users without prior experience in AI model serving may find the learning curve steeper, requiring additional effort to configure and manage the system effectively.

Performance

For reference, here is a benchmark of some LLMs served with vLLM on an NVIDIA T4 GPU:

Model | Memory usage (GB) | Tokens/s
--- | --- | ---
Qwen2-1.5B-Instruct | 3.2 | 44.6
Qwen2-VL-2B-Instruct-GPTQ-Int4 | 1.8 | 28.8
Dolphin3.0-Qwen2.5-0.5B | 1.1 | 62.9
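
Throughput on your own hardware can be estimated against the running server by timing a request and dividing the completion tokens by the elapsed seconds (a rough sketch; actual numbers depend on batch size, context length, and hardware):

import time

import requests

# Rough tokens/s estimate using the usage field returned by the server
payload = {
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Write a short paragraph about GPUs."}],
    "max_tokens": 256,
    "temperature": 0.2,
}
start = time.time()
result = requests.post("http://localhost:8000/v1/chat/completions", json=payload).json()
elapsed = time.time() - start
tokens = result["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")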

Conclusion

Setting up a local LLM server with vLLM unlocks a powerful and flexible way to run AI models efficiently. Whether for experimentation, fine-tuning, or production-scale deployment, vLLM optimizes performance while keeping resource usage in check. Its ability to handle multiple models, integrate seamlessly with existing APIs, and maximize hardware potential makes it a strong alternative to cloud-based solutions. With the right configurations, vLLM can streamline AI workflows and bring scalable, high-performance language models closer to real-world applications.


Duong Tran
Experienced AI Engineer with nearly 4 years of expertise in developing and managing AI services focused on banking and computer vision applications—an enthusiast in LLM, Computer Vision, and AI Automation, driven by innovation and cutting-edge technology.