Deploying Large Language Models (LLMs) locally has become an increasingly viable option, especially with advancements in frameworks like vLLM. In a previous discussion, the process of serving LLMs using Ollama was explored, providing a streamlined approach for deployment. Building on that foundation, this guide delves into hosting LLMs locally with vLLM, offering a flexible and efficient solution for those looking to operate models on their hardware.
For individuals with a capable GPU and an interest in self-hosted AI models, vLLM offers a powerful and cost-effective alternative to cloud-based APIs, eliminating common constraints such as rate limits, high operational costs, and dependency on third-party services. By leveraging vLLM, users gain greater control over their model deployments, optimizing performance and customization to suit specific needs.
vLLM is an open-source library for fast, high-throughput LLM inference and serving.
This tutorial provides a comprehensive, step-by-step guide on setting up an LLM locally using vLLM, ensuring accessibility even for those with no prior experience. From installation to optimization, this guide covers essential configurations, allowing users to maximize efficiency while maintaining full ownership of their AI infrastructure.
The companion GitHub repository is public at this GitHub repo. You can clone it, follow the instructions, and customize it to suit your own experiments.
First things first, make sure you have the following:
Python 3.8+
A basic understanding of Python (if you can write a print('Hello, world!'), you’re in)
(Optional) A modern GPU (ideally with 16GB+ VRAM, but lower-end ones might work too)
If you want to take advantage of a GPU, you should have NVIDIA CUDA installed (check your version with nvidia-smi); a quick environment check is sketched below.
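Before moving on, the environment can be checked quickly from Python. This is a minimal sketch; the CUDA check is optional and only works if PyTorch is already installed.

```python
# Quick environment check: Python version and (optionally) CUDA visibility
import sys

print(sys.version_info)  # should report 3.8 or higher

try:
    import torch  # optional: only if PyTorch is already installed
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet; skipping the CUDA check for now")
```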
LLMs can be served using vLLM with either Docker or Python. For an isolated, hassle-free setup that avoids environment conflicts, Docker is a reliable choice.
Install Docker
Before running vLLM in a container, Docker must be installed on the system. If Docker has not been set up yet, the previous blog post on Ollama provides a step-by-step guide for installation.
Run vLLM in a Docker Container
To launch vLLM in a controlled environment, execute the following command in the terminal:
docker run -itd --rm --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
-e MAX_BATCH_SIZE=16 -e BATCH_TIMEOUT_MS=100 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2-1.5B-Instruct \
--tokenizer-mode auto \
--gpu-memory-utilization 0.8 \
--max-model-len 16384
Now, let's break down the command:
docker run -itd --rm
-i: Interactive mode (keeps the container open).
-t: Allocates a pseudo-TTY (allows you to interact with the container).
-d: Runs the container in detached mode (background process).
--rm: Automatically removes the container when stopped.
--runtime nvidia --gpus all
Enables GPU acceleration using NVIDIA's runtime.
If you have multiple GPUs but want to use only one, you can specify it with --gpus '"device=0"'.
-v ~/.cache/huggingface:/root/.cache/huggingface
Mounts your local Hugging Face model cache inside the container.
This avoids redownloading models every time you start a new container.
-p 8000:8000
Maps port 8000 from the container to the host machine, allowing access to the API.
-e MAX_BATCH_SIZE=16 -e BATCH_TIMEOUT_MS=100
Environment variables to control request batching:
MAX_BATCH_SIZE=16: Limits the number of requests processed at once.
BATCH_TIMEOUT_MS=100: Waits 100 milliseconds before processing a batch.
vllm/vllm-openai:latest
The official vLLM OpenAI-compatible API Docker image. If you haven't pulled it before, Docker will download it automatically (around 8.17GB).
--model Qwen/Qwen2-1.5B-Instruct
Specifies the model to use. You can replace this with another model that better suits your hardware and needs.
--tokenizer-mode auto
Automatically selects the best tokenizer mode for the model.
--gpu-memory-utilization 0.8
Allocates 80% of GPU memory to vLLM for model inference, leaving some space for other tasks. Adjust this if you experience out-of-memory (OOM) errors.
--max-model-len 16384
Sets the maximum sequence length (number of tokens) the model can handle in one request. Higher values increase memory usage.
Check container logs
To verify that the server is running correctly, review the logs for any errors or status updates.
docker logs -f <docker-container-id>
If the logs look like this:
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
Congratulations! The LLM is now running locally via vLLM.
API requests can be sent to http://localhost:8000 to start interacting with the model.
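As a quick sanity check, a request can be sent to the OpenAI-compatible chat completions endpoint. The sketch below uses the requests library and assumes the same model name that was passed to --model:

```python
# Minimal test request against the OpenAI-compatible endpoint exposed by vLLM
import requests

payload = {
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.2,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```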
For systems without a GPU or for those looking to test an LLM on a CPU, vLLM can be built and executed in a CPU-only environment. This approach optimizes the setup specifically for CPU inference while minimizing unnecessary dependencies related to GPU acceleration.
To begin, download the vLLM source code from GitHub and navigate to the project directory.
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
Build a CPU-optimized docker image
Since the default vLLM Docker image is designed for GPU usage, we need to build a CPU-only image. vLLM provides a dedicated Dockerfile.cpu, which creates a smaller image by excluding unnecessary GPU dependencies.
Run the following command to build the CPU-based image:
docker build -t vllm-cpu -f Dockerfile.cpu .
This process may take a few minutes as it installs all required dependencies.
Run vLLM in a docker container (CPU mode)
Once the image is built, we can start the container in CPU mode. It is important to ensure that the device is explicitly set to CPU before running the container.
docker run -itd --rm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
-e MAX_BATCH_SIZE=16 \
vllm-cpu:latest \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--device cpu \
--dtype bfloat16 \
--tokenizer-mode auto
Now, the LLM Qwen/Qwen2-1.5B-Instruct is running on the CPU.
For those who prefer a more flexible and lightweight approach, vLLM can be run directly with Python. This method offers greater control over the environment and eliminates the overhead of running a Docker container.
Install Python
Ensure that Python 3.8 or higher is installed on the system. If an installation or update is needed, the latest version can be downloaded from the official Python website.
Set up a virtual environment
To maintain a clean and isolated development environment, a virtual environment should be created and activated before proceeding.
python -m venv vllm-env # you can use "uv venv vllm-env" alternatively, if you have uv installed
source vllm-env/bin/activate # On Windows use "vllm-env\Scripts\activate"
This setup ensures that installed packages remain isolated from the system-wide Python environment, preventing potential conflicts.
Once the virtual environment is activated, vLLM can be installed along with its required dependencies. For systems utilizing a GPU, the appropriate CUDA drivers must be installed to enable acceleration.
pip install vllm # use "uv pip install vllm" if your environment installed by uv
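As an optional sanity check, the installation can be verified from the activated environment:

```python
# Confirm the vLLM package is importable and print its version
import vllm

print(vllm.__version__)
```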
Start the vLLM Server
Run the following command to launch the vLLM OpenAI-compatible API server:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--dtype half
python -m vllm.entrypoints.openai.api_server
Starts vLLM as an API server that follows the OpenAI API format.
--model Qwen/Qwen2-1.5B-Instruct
Loads the Qwen2-1.5B-Instruct model. You can replace it with a different model depending on your hardware and use case.
--trust-remote-code
Allows execution of custom model code downloaded from Hugging Face (some models require this).
--dtype half
Uses half-precision (FP16) for faster inference and lower memory usage.
After the server is up and running, interactions with the model can be made by sending requests to http://localhost:8000. This approach functions similarly to the Docker method, providing an API-compatible endpoint for LLM inference.
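Because the server speaks the OpenAI API format, the official openai Python client can also be pointed at it. The snippet below is a minimal sketch; it assumes the openai package (version 1.x) is installed and uses a placeholder API key, since vLLM does not require one by default.

```python
# Use the openai client against the local vLLM server instead of the OpenAI cloud API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key
completion = client.chat.completions.create(
    model="Qwen/Qwen2-1.5B-Instruct",
    messages=[{"role": "user", "content": "Give me a one-line summary of vLLM."}],
    temperature=0.2,
)
print(completion.choices[0].message.content)
```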
Now that vLLM is set up, the next step is to choose the LLM model to deploy. Hugging Face provides a vast collection of pre-trained models, allowing selection based on specific requirements such as model size, performance, and intended use case.
To ensure compatibility, the correct model name must be retrieved from the Hugging Face Model Hub:
Visit Hugging Face
Search for a model (e.g., Llama 3, Mistral, Qwen).
Click on the model you want.
Copy the full repository name from the page URL, typically in org_name/model_name format. Example: Qwen/Qwen2-1.5B-Instruct or microsoft/Magma-8B.
Note: Fine-tuned LLMs can also be used, including models trained with LoRA, DeepSpeed, or other optimization techniques. A detailed guide on fine-tuning LLMs will be covered in the LLM Fine-Tuning Series.
If Docker is not being used, the model must be downloaded manually. Before proceeding, ensure the transformers library is installed:
pip install transformers
To check if the model can be downloaded and loaded properly, run:
python -c "from transformers import AutoModel; AutoModel.from_pretrained('Qwen/Qwen2-1.5BB-Instruct')"
Replace 'Qwen/Qwen2-1.5B-Instruct' with the actual model name you copied from Hugging Face.
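Optionally, the model weights can be pre-downloaded into the local Hugging Face cache so the server does not have to fetch them on first start. A minimal sketch using huggingface_hub (installed as a transformers dependency):

```python
# Pre-fetch the model into ~/.cache/huggingface so vLLM can reuse it
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2-1.5B-Instruct")
```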
Tip: Some models require --trust-remote-code when running with vLLM. If an error related to "unsafe code execution" appears, enable this flag when starting the vLLM server.
With the model selected, the next step is to run it locally!
With the LLM now running, the next step is to test it by sending a request from Python. vLLM can be integrated into applications using frameworks such as FastAPI or Gradio, enabling the creation of an API endpoint or a simple chat interface, similar to how Ollama operates.
import requests
from fastapi import FastAPI, Form

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"

@app.post("/chatbot")
async def chatbot(
    model_alias: str = Form(default="Qwen/Qwen2-1.5B-Instruct"),
    question: str = Form(default=""),
):
    # Forward the question to the vLLM OpenAI-compatible endpoint
    payload = {
        "model": model_alias,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
    }
    response = requests.post(VLLM_URL, json=payload).json()
    if response.get("choices") is None:
        return response["error"].get("message", "Error: No response from vLLM.")
    return response["choices"][0]["message"].get("content", "Error: No response from vLLM.")
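Assuming the FastAPI snippet above is saved as app.py (a hypothetical filename) and served with uvicorn app:app --port 8080 (form handling also requires the python-multipart package), the endpoint can be exercised like this:

```python
# Call the /chatbot endpoint defined above; host and port depend on how uvicorn was started
import requests

resp = requests.post(
    "http://localhost:8080/chatbot",
    data={"model_alias": "Qwen/Qwen2-1.5B-Instruct", "question": "What is vLLM?"},
)
print(resp.json())
```

Alternatively, the same chat completions endpoint can be wrapped in a simple web UI with Gradio: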
import json
import gradio as gr
import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"
HEADERS = {"Content-Type": "application/json"}
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"

def chat_with_llm(message, history):
    # Rebuild the conversation history in the OpenAI chat format
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if history:
        for user_msg, bot_reply in history:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": bot_reply})
    # Append the latest user message
    messages.append({"role": "user", "content": message})
    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "temperature": 0.2,
    }
    response = requests.post(VLLM_URL, headers=HEADERS, data=json.dumps(payload)).json()
    if response.get("choices") is None:
        return response["error"].get("message", "Error: No response from vLLM.")
    return response["choices"][0]["message"].get("content", "Error: No response from vLLM.")

gr.ChatInterface(fn=chat_with_llm, title="vLLM Chatbot").launch(share=True)
One key difference between vLLM and Ollama is the default API port:
vLLM runs on port 8000 (http://localhost:8000).
Ollama runs on port 11434 (http://localhost:11434).
Choosing between these two ways of exposing a locally running LLM - an API-based approach with FastAPI or a web-based chatbot with Gradio - depends on the specific requirements of the project.
Don't have a GPU on your local machine? No problem! vLLM can still be tested with GPU acceleration using cloud-based solutions such as Google Colab or Kaggle Notebook, both of which offer free GPU access.
Run the following command in a Colab notebook cell to install vLLM and Gradio:
!pip install -q vllm gradio
Now, define a Python script to interact with the LLM and serve it through a Gradio interface. Run this in a new Colab cell:
import json
import gradio as gr
import requests

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"

def chat_with_llm(message, history):
    headers = {"Content-Type": "application/json"}
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if history:
        for user_msg, bot_reply in history:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": bot_reply})
    # Append the latest user message
    messages.append({"role": "user", "content": message})
    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "temperature": 0.2,
    }
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    try:
        reply = response.json()["choices"][0]["message"]["content"]
    except Exception:
        reply = "Error: Could not get response from model."
    return reply

# Launch Gradio chat interface
gr.ChatInterface(fn=chat_with_llm, title="vLLM Chatbot").launch(share=True)
Now, run the third cell to launch vLLM:
!python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--dtype half
To ensure a smooth experience when running vLLM with Gradio, certain steps must be followed in the correct order. Keeping the model server active is essential for maintaining chatbot functionality. Below are key points to keep in mind:
Order matters!
The correct order of execution is essential for proper functionality. First, the second cell should be executed to start Gradio, which initializes the chatbot UI. Once Gradio is running, the third cell must be executed to start vLLM, the model server responsible for processing LLM requests. Running these in the correct sequence ensures a stable and responsive environment.
Keep the third cell running!
For the chatbot to function properly, the vLLM server must remain active at all times. If the third cell, which runs the server, is stopped or interrupted, the chatbot will no longer be able to process responses. Keeping this process running ensures continuous interaction with the model.
Accessing the Gradio interface
Once the Gradio interface is successfully launched, a public link will be generated in the second cell. This link, typically formatted as https://xxxxxxxxxxxxxxxxx.gradio.live/, allows direct access to the chatbot in a web-based interface. Clicking the link provides a convenient way to interact with the LLM without additional setup.
This method provides a way to experience vLLM's GPU acceleration without requiring a high-end local machine.
| Feature | Ollama | vLLM |
| --- | --- | --- |
| Ease of Use | Very easy to set up, CLI-based | More complex setup, requires configuration |
| Performance | Optimized for local use, but lacks batching | High-performance serving with continuous batching |
| Scalability | Not designed for handling multiple concurrent requests | Built for multi-user and high-throughput inference |
| API Support | Custom API, simple usage | OpenAI-compatible API, easier integration with existing applications |
| GPU Utilization | Limited optimizations | Optimized for GPU memory management and large models |
| Best For | Personal projects, local experiments | Production-level AI serving, scalable applications |
| Platform Compatibility | Mac (Apple Silicon optimized), Linux | Works best on Linux with NVIDIA GPUs |
While Ollama is well-suited for local experiments and quick testing, vLLM is designed for scalability, efficiency, and production-level AI serving. When deploying LLMs for real-world applications, vLLM offers a more robust and performance-optimized solution.
When deploying LLMs, it is essential to evaluate both the advantages and limitations of vLLM.
High performance
vLLM is designed for efficiency, utilizing pipelined execution and continuous batching to process multiple requests simultaneously. This approach maximizes GPU utilization, significantly reducing latency and improving response times for large-scale applications.
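To see continuous batching in action, several requests can be fired at the server at once. The sketch below uses the standard library's ThreadPoolExecutor against the chat completions endpoint shown earlier, and assumes the server from the previous sections is still running on port 8000:

```python
# Illustrative sketch: send several requests concurrently; vLLM's continuous
# batching lets the server interleave them instead of handling them one by one
import concurrent.futures
import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2-1.5B-Instruct"
prompts = [f"Give me one fun fact about GPUs (fact #{i})." for i in range(8)]

def ask(prompt):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return requests.post(URL, json=payload).json()["choices"][0]["message"]["content"]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```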
Scalability
One of the key strengths of vLLM is its ability to handle multiple concurrent requests with minimal latency. Unlike traditional models that may struggle under heavy workloads, vLLM optimizes resource allocation, making it ideal for production environments that require real-time AI inference at scale.
OpenAI API compatibility
vLLM is fully compatible with OpenAI’s GPT API format, allowing developers to integrate it seamlessly into existing AI applications. This compatibility reduces development effort, as applications designed for OpenAI models can transition to vLLM with minimal modifications.
Efficient GPU utilization
By implementing dynamic memory management, vLLM optimizes GPU resources to support larger models and longer sequences without excessive memory consumption. This feature allows AI models to run efficiently, even when processing complex inputs, making it a valuable solution for high-performance machine learning tasks.
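For workloads that do not need an HTTP server at all, the same memory controls are exposed through vLLM's offline Python API. The snippet below is a minimal sketch that mirrors the --gpu-memory-utilization and --max-model-len flags used earlier in this guide:

```python
# Offline inference with vLLM's Python API (no HTTP server involved)
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-1.5B-Instruct",
    gpu_memory_utilization=0.8,  # leave ~20% of VRAM for other processes
    max_model_len=16384,
)
outputs = llm.generate(
    ["Explain continuous batching in one sentence."],
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```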
More complex setup
Unlike simpler alternatives, vLLM requires manual configuration for GPU acceleration, memory optimization, and API serving. Setting up the environment involves fine-tuning system parameters, which may pose challenges for those unfamiliar with deep learning infrastructure.
Higher hardware requirements
Although vLLM supports CPU-based inference, it is highly optimized for NVIDIA GPUs and relies on CUDA acceleration for optimal performance. Running vLLM on a CPU significantly reduces processing speed, making GPU access almost essential for real-world deployments.
Not as beginner-friendly
Compared to user-friendly solutions like Ollama, vLLM demands a deeper understanding of LLM deployment and hardware optimization. Users without prior experience in AI model serving may find the learning curve steeper, requiring additional effort to configure and manage the system effectively.
As for performance, here is a benchmark of several LLMs running on an NVIDIA T4 GPU:
| Model | Memory usage (GB) | Tokens/s |
| --- | --- | --- |
| Qwen2-1.5B-Instruct | 3.2 | 44.6 |
| Qwen2-VL-2B-Instruct-GPTQ-Int4 | 1.8 | 28.8 |
| Dolphin3.0-Qwen2.5-0.5B | 1.1 | 62.9 |
Setting up a local LLM server with vLLM unlocks a powerful and flexible way to run AI models efficiently. Whether for experimentation, fine-tuning, or production-scale deployment, vLLM optimizes performance while keeping resource usage in check. Its ability to handle multiple models, integrate seamlessly with existing APIs, and maximize hardware potential makes it a strong alternative to cloud-based solutions. With the right configurations, vLLM can streamline AI workflows and bring scalable, high-performance language models closer to real-world applications.