Deploying Large Language Models (LLMs) locally has become an increasingly viable option, especially with advancements in frameworks like vLLM. In a previous discussion, the process of serving LLMs using Ollama was explored, providing a streamlined approach for deployment. Building on that foundation, this guide delves into hosting LLMs locally with vLLM, offering a flexible and efficient solution for those looking to operate models on their hardware.
For individuals with a capable GPU and an interest in self-hosted AI models, vLLM offers a powerful and cost-effective alternative to cloud-based APIs, eliminating common constraints such as rate limits, high operational costs, and dependency on third-party services. By leveraging vLLM, users gain greater control over their model deployments, optimizing performance and customization to suit specific needs.
vLLM is an open-source library for fast, high-throughput LLM inference and serving.
This tutorial provides a comprehensive, step-by-step guide on setting up an LLM locally using vLLM, ensuring accessibility even for those with no prior experience. From installation to optimization, this guide covers essential configurations, allowing users to maximize efficiency while maintaining full ownership of their AI infrastructure.
The companion GitHub repository is public at this GitHub repo. You can clone it, follow the instructions, and customize it to suit your own experiments.
First things first, make sure you have the following:
Python 3.8+
A basic understanding of Python (if you can write a print('Hello, world!'), you’re in)
(Optional) A modern GPU (ideally with 16GB+ VRAM, but lower-end ones might work too)
If you want to take advantage of a GPU, you should have NVIDIA CUDA installed (check your version with nvidia-smi); a quick environment check is sketched below.
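Before moving on, the environment can be checked quickly from Python. This is a minimal sketch; the CUDA check is optional and only works if PyTorch is already installed.

```python
# Quick environment check: Python version and (optionally) CUDA visibility
import sys

print(sys.version_info)  # should report 3.8 or higher

try:
    import torch  # optional: only if PyTorch is already installed
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet; skipping the CUDA check for now")
```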
LLMs can be served using vLLM with either Docker or Python. For an isolated, hassle-free setup that avoids environment conflicts, Docker is a reliable choice.
Install Docker
Before running vLLM in a container, Docker must be installed on the system. If Docker has not been set up yet, the previous blog post on Ollama provides a step-by-step guide for installation.
Run vLLM in a Docker Container
To launch vLLM in a controlled environment, execute the following command in the terminal:
docker run -itd --rm --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
-e MAX_BATCH_SIZE=16 -e BATCH_TIMEOUT_MS=100 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2-1.5B-Instruct \
--tokenizer-mode auto \
--gpu-memory-utilization 0.8 \
--max-model-len 16384
Now, let's break down the command:
docker run -itd --rm
-i: Interactive mode (keeps the container open).
-t: Allocates a pseudo-TTY (allows you to interact with the container).
-d: Runs the container in detached mode (background process).
--rm: Automatically removes the container when stopped.
--runtime nvidia --gpus all
Enables GPU acceleration using NVIDIA's runtime.
If you have multiple GPUs but want to use only one, you can specify it with --gpus '"device=0"'.
-v ~/.cache/huggingface:/root/.cache/huggingface
Mounts your local Hugging Face model cache inside the container.
This avoids redownloading models every time you start a new container.
-p 8000:8000
Maps port 8000 from the container to the host machine, allowing access to the API.
-e MAX_BATCH_SIZE=16 -e BATCH_TIMEOUT_MS=100
Environment variables to control request batching:
MAX_BATCH_SIZE=16: Limits the number of requests processed at once.
BATCH_TIMEOUT_MS=100: Waits 100 milliseconds before processing a batch.
vllm/vllm-openai:latest
The official vLLM OpenAI-compatible API Docker image. If you haven't pulled it before, Docker will download it automatically (around 8.17GB).
--model Qwen/Qwen2-1.5B-Instruct
Specifies the model to use. You can replace this with another model that better suits your hardware and needs.
--tokenizer-mode auto
Automatically selects the best tokenizer mode for the model.
--gpu-memory-utilization 0.8
Allocates 80% of GPU memory to vLLM for model inference, leaving some space for other tasks. Adjust this if you experience out-of-memory (OOM) errors.
--max-model-len 16384
Sets the maximum sequence length (number of tokens) the model can handle in one request. Higher values increase memory usage.
Check container logs
To verify that the server is running correctly, review the logs for any errors or status updates.
docker logs -f <docker-container-id>
If the logs look like this:
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
Congratulations! The LLM is now running locally via vLLM.
API requests can be sent to http://localhost:8000 to start interacting with the model.
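As a quick sanity check, a request can be sent to the OpenAI-compatible chat completions endpoint. The sketch below uses the requests library and assumes the same model name that was passed to --model:

```python
# Minimal test request against the OpenAI-compatible endpoint exposed by vLLM
import requests

payload = {
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.2,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```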
For systems without a GPU or for those looking to test an LLM on a CPU, vLLM can be built and executed in a CPU-only environment. This approach optimizes the setup specifically for CPU inference while minimizing unnecessary dependencies related to GPU acceleration.
To begin, download the vLLM source code from GitHub and navigate to the project directory.
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
Build a CPU-optimized docker image
Since the default vLLM Docker image is designed for GPU usage, we need to build a CPU-only image. vLLM provides a dedicated Dockerfile.cpu, which creates a smaller image by excluding unnecessary GPU dependencies.
Run the following command to build the CPU-based image:
docker build -t vllm-cpu -f Dockerfile.cpu .
This process may take a few minutes as it installs all required dependencies.
Run vLLM in a docker container (CPU mode)
Once the image is built, we can start the container in CPU mode. It is important to ensure that the device is explicitly set to CPU before running the container.
docker run -itd --rm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
-e MAX_BATCH_SIZE=16 \
vllm-cpu:latest \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--device cpu \
--dtype bfloat16 \
--tokenizer-mode auto
Now, the LLM Qwen/Qwen2-1.5B-Instruct is running on the CPU.
For those who prefer a more flexible and lightweight approach, vLLM can be run directly with Python. This method offers greater control over the environment and eliminates the overhead of running a Docker container.
Install Python
Ensure that Python 3.8 or higher is installed on the system. If an installation or update is needed, the latest version can be downloaded from the official Python website.
Set up a virtual environment
To maintain a clean and isolated development environment, a virtual environment should be created and activated before proceeding.
python -m venv vllm-env # you can use "uv venv vllm-env" alternatively, if you have uv installed
source vllm-env/bin/activate # On Windows use "vllm-env\Scripts\activate"
This setup ensures that installed packages remain isolated from the system-wide Python environment, preventing potential conflicts.
Once the virtual environment is activated, vLLM can be installed along with its required dependencies. For systems utilizing a GPU, the appropriate CUDA drivers must be installed to enable acceleration.
pip install vllm # use "uv pip install vllm" if your environment installed by uv
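As an optional sanity check, the installation can be verified from the activated environment:

```python
# Confirm the vLLM package is importable and print its version
import vllm

print(vllm.__version__)
```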
Start the vLLM Server
Run the following command to launch the vLLM OpenAI-compatible API server:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--dtype half
python -m vllm.entrypoints.openai.api_server
Starts vLLM as an API server that follows the OpenAI API format.
--model Qwen/Qwen2-1.5B-Instruct
Loads the Qwen2-1.5B-Instruct model. You can replace it with a different model depending on your hardware and use case.
--trust-remote-code
Allows execution of custom model code downloaded from Hugging Face (some models require this).
--dtype half
Uses half-precision (FP16) for faster inference and lower memory usage.
After the server is up and running, interactions with the model can be made by sending requests to http://localhost:8000. This approach functions similarly to the Docker method, providing an API-compatible endpoint for LLM inference.
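Because the server speaks the OpenAI API format, the official openai Python client can also be pointed at it. The snippet below is a minimal sketch; it assumes the openai package (version 1.x) is installed and uses a placeholder API key, since vLLM does not require one by default.

```python
# Use the openai client against the local vLLM server instead of the OpenAI cloud API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key
completion = client.chat.completions.create(
    model="Qwen/Qwen2-1.5B-Instruct",
    messages=[{"role": "user", "content": "Give me a one-line summary of vLLM."}],
    temperature=0.2,
)
print(completion.choices[0].message.content)
```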
Now that vLLM is set up, the next step is to choose the LLM model to deploy. Hugging Face provides a vast collection of pre-trained models, allowing selection based on specific requirements such as model size, performance, and intended use case.
To ensure compatibility, the correct model name must be retrieved from the Hugging Face Model Hub:
Visit Hugging Face
Search for a model (e.g., Llama 3, Mistral, Qwen).
Click on the model you want.
Copy the full repository name from the page URL, typically in org_name/model_name format. Example: Qwen/Qwen2-1.5B-Instruct or microsoft/Magma-8B.
Note: Fine-tuned LLMs can also be used, including models trained with LoRA, DeepSpeed, or other optimization techniques. A detailed guide on fine-tuning LLMs will be covered in the LLM Fine-Tuning Series.
If Docker is not being used, the model must be downloaded manually. Before proceeding, ensure the transformers library is installed:
pip install transformers
To check if the model can be downloaded and loaded properly, run:
python -c "from transformers import AutoModel; AutoModel.from_pretrained('Qwen/Qwen2-1.5BB-Instruct')"
Replace 'Qwen/Qwen2-1.5B-Instruct' with the actual model name you copied from Hugging Face.
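Optionally, the model weights can be pre-downloaded into the local Hugging Face cache so the server does not have to fetch them on first start. A minimal sketch using huggingface_hub (installed as a transformers dependency):

```python
# Pre-fetch the model into ~/.cache/huggingface so vLLM can reuse it
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2-1.5B-Instruct")
```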
Tip: Some models require --trust-remote-code when running with vLLM. If an error related to "unsafe code execution" appears, enable this flag when starting the vLLM server.
With the model selected, the next step is to run it locally!
With the LLM now running, the next step is to test it by sending a request from Python. vLLM can be integrated into applications using frameworks such as FastAPI or Gradio, enabling the creation of an API endpoint or a simple chat interface, similar to how Ollama operates.
import requests
from fastapi import FastAPI, Form

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"

@app.post("/chatbot")
async def chatbot(
    model_alias: str = Form(default="Qwen/Qwen2-1.5B-Instruct"),
    question: str = Form(default=""),
):
    # Forward the question to the vLLM OpenAI-compatible endpoint
    payload = {
        "model": model_alias,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
    }
    response = requests.post(VLLM_URL, json=payload).json()
    if response.get("choices") is None:
        return response["error"].get("message", "Error: No response from vLLM.")
    return response["choices"][0]["message"].get("content", "Error: No response from vLLM.")
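Assuming the FastAPI snippet above is saved as app.py (a hypothetical filename) and served with uvicorn app:app --port 8080 (form handling also requires the python-multipart package), the endpoint can be exercised like this:

```python
# Call the /chatbot endpoint defined above; host and port depend on how uvicorn was started
import requests

resp = requests.post(
    "http://localhost:8080/chatbot",
    data={"model_alias": "Qwen/Qwen2-1.5B-Instruct", "question": "What is vLLM?"},
)
print(resp.json())
```

Alternatively, the same chat completions endpoint can be wrapped in a simple web UI with Gradio: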
import json
import gradio as gr
import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"
HEADERS = {"Content-Type": "application/json"}
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"

def chat_with_llm(message, history):
    # Rebuild the conversation history in the OpenAI chat format
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if history:
        for user_msg, bot_reply in history:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": bot_reply})
    # Append the latest user message
    messages.append({"role": "user", "content": message})
    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "temperature": 0.2,
    }
    response = requests.post(VLLM_URL, headers=HEADERS, data=json.dumps(payload)).json()
    if response.get("choices") is None:
        return response["error"].get("message", "Error: No response from vLLM.")
    return response["choices"][0]["message"].get("content", "Error: No response from vLLM.")

gr.ChatInterface(fn=chat_with_llm, title="vLLM Chatbot").launch(share=True)
One key difference between vLLM and Ollama is the default API port:
vLLM runs on port 8000 (http://localhost:8000).
Ollama runs on port 11434 (http://localhost:11434).
Choosing between these two ways of exposing a locally running LLM - an API-based approach with FastAPI or a web-based chatbot with Gradio - depends on the specific requirements of the project.
Don't have a GPU on your local machine? No problem! vLLM can still be tested with GPU acceleration using cloud-based solutions such as Google Colab or Kaggle Notebook, both of which offer free GPU access.
Run the following command in a Colab notebook cell to install vLLM and Gradio:
!pip install -q vllm gradio
Now, define a Python script to interact with the LLM and serve it through a Gradio interface. Run this in a new Colab cell:
import json
import gradio as gr
import requests

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"

def chat_with_llm(message, history):
    headers = {"Content-Type": "application/json"}
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    if history:
        for user_msg, bot_reply in history:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": bot_reply})
    # Append the latest user message
    messages.append({"role": "user", "content": message})
    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "temperature": 0.2,
    }
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    try:
        reply = response.json()["choices"][0]["message"]["content"]
    except Exception:
        reply = "Error: Could not get response from model."
    return reply

# Launch Gradio chat interface
gr.ChatInterface(fn=chat_with_llm, title="vLLM Chatbot").launch(share=True)
Now, run the third cell to launch vLLM:
!python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--dtype half
To ensure a smooth experience when running vLLM with Gradio, certain steps must be followed in the correct order. Keeping the model server active is essential for maintaining chatbot functionality. Below are key points to keep in mind:
Order matters!
The correct order of execution is essential for proper functionality. First, the second cell should be executed to start Gradio, which initializes the chatbot UI. Once Gradio is running, the third cell must be executed to start vLLM, the model server responsible for processing LLM requests. Running these in the correct sequence ensures a stable and responsive environment.
Keep the third cell running!
For the chatbot to function properly, the vLLM server must remain active at all times. If the third cell, which runs the server, is stopped or interrupted, the chatbot will no longer be able to process responses. Keeping this process running ensures continuous interaction with the model.
Accessing the Gradio interface
Once the Gradio interface is successfully launched, a public link will be generated in the second cell. This link, typically formatted as https://xxxxxxxxxxxxxxxxx.gradio.live/, allows direct access to the chatbot in a web-based interface. Clicking the link provides a convenient way to interact with the LLM without additional setup.
This method provides a way to experience vLLM's GPU acceleration without requiring a high-end local machine.
| Feature | Ollama | vLLM |
| --- | --- | --- |
| Ease of Use | Very easy to set up, CLI-based | More complex setup, requires configuration |
| Performance | Optimized for local use, but lacks batching | High-performance serving with continuous batching |
| Scalability | Not designed for handling multiple concurrent requests | Built for multi-user and high-throughput inference |
| API Support | Custom API, simple usage | OpenAI-compatible API, easier integration with existing applications |
| GPU Utilization | Limited optimizations | Optimized for GPU memory management and large models |
| Best For | Personal projects, local experiments | Production-level AI serving, scalable applications |
| Platform Compatibility | Mac (Apple Silicon optimized), Linux | Works best on Linux with NVIDIA GPUs |
While Ollama is well-suited for local experiments and quick testing, vLLM is designed for scalability, efficiency, and production-level AI serving. When deploying LLMs for real-world applications, vLLM offers a more robust and performance-optimized solution.
When deploying LLMs, it is essential to evaluate both the advantages and limitations of vLLM.
High performance
vLLM is designed for efficiency, utilizing pipelined execution and continuous batching to process multiple requests simultaneously. This approach maximizes GPU utilization, significantly reducing latency and improving response times for large-scale applications.
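To see continuous batching in action, several requests can be fired at the server at once. The sketch below uses the standard library's ThreadPoolExecutor against the chat completions endpoint shown earlier, and assumes the server from the previous sections is still running on port 8000:

```python
# Illustrative sketch: send several requests concurrently; vLLM's continuous
# batching lets the server interleave them instead of handling them one by one
import concurrent.futures
import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2-1.5B-Instruct"
prompts = [f"Give me one fun fact about GPUs (fact #{i})." for i in range(8)]

def ask(prompt):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return requests.post(URL, json=payload).json()["choices"][0]["message"]["content"]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```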
Scalability
One of the key strengths of vLLM is its ability to handle multiple concurrent requests with minimal latency. Unlike traditional models that may struggle under heavy workloads, vLLM optimizes resource allocation, making it ideal for production environments that require real-time AI inference at scale.
OpenAI API compatibility
vLLM is fully compatible with OpenAI’s GPT API format, allowing developers to integrate it seamlessly into existing AI applications. This compatibility reduces development effort, as applications designed for OpenAI models can transition to vLLM with minimal modifications.
Efficient GPU utilization
By implementing dynamic memory management, vLLM optimizes GPU resources to support larger models and longer sequences without excessive memory consumption. This feature allows AI models to run efficiently, even when processing complex inputs, making it a valuable solution for high-performance machine learning tasks.
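For workloads that do not need an HTTP server at all, the same memory controls are exposed through vLLM's offline Python API. The snippet below is a minimal sketch that mirrors the --gpu-memory-utilization and --max-model-len flags used earlier in this guide:

```python
# Offline inference with vLLM's Python API (no HTTP server involved)
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-1.5B-Instruct",
    gpu_memory_utilization=0.8,  # leave ~20% of VRAM for other processes
    max_model_len=16384,
)
outputs = llm.generate(
    ["Explain continuous batching in one sentence."],
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```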
More complex setup
Unlike simpler alternatives, vLLM requires manual configuration for GPU acceleration, memory optimization, and API serving. Setting up the environment involves fine-tuning system parameters, which may pose challenges for those unfamiliar with deep learning infrastructure.
Higher hardware requirements
Although vLLM supports CPU-based inference, it is highly optimized for NVIDIA GPUs and relies on CUDA acceleration for optimal performance. Running vLLM on a CPU significantly reduces processing speed, making GPU access almost essential for real-world deployments.
Not as beginner-friendly
Compared to user-friendly solutions like Ollama, vLLM demands a deeper understanding of LLM deployment and hardware optimization. Users without prior experience in AI model serving may find the learning curve steeper, requiring additional effort to configure and manage the system effectively.
As for performance, here is a benchmark of several LLMs running on an NVIDIA T4 GPU:
| Model | Memory usage (GB) | Tokens/s |
| --- | --- | --- |
| Qwen2-1.5B-Instruct | 3.2 | 44.6 |
| Qwen2-VL-2B-Instruct-GPTQ-Int4 | 1.8 | 28.8 |
| Dolphin3.0-Qwen2.5-0.5B | 1.1 | 62.9 |
Setting up a local LLM server with vLLM unlocks a powerful and flexible way to run AI models efficiently. Whether for experimentation, fine-tuning, or production-scale deployment, vLLM optimizes performance while keeping resource usage in check. Its ability to handle multiple models, integrate seamlessly with existing APIs, and maximize hardware potential makes it a strong alternative to cloud-based solutions. With the right configurations, vLLM can streamline AI workflows and bring scalable, high-performance language models closer to real-world applications.