Fewest scripts, maximum power: Serve LLMs locally with Ollama

03/03/2025

Hey there, fellow coding enthusiasts! Today, we’re diving into the world of Large Language Models (LLMs) and how you can run one on your local machine using Ollama.

The GitHub repository is public at https://github.com/rabiloo/llm-on-local. You can clone it, follow the instructions, and customize it to fit your own workflow.

What’s an LLM, and why should you care?

Large Language Models (LLMs) are AI models trained on massive amounts of text data to understand and generate human-like text. They are used for various applications, including:

  • Chatbots & virtual assistants – AI-powered bots like ChatGPT, customer service assistants, and personal AI companions.

  • Code generation & debugging – AI tools that help developers write code, find bugs, and generate documentation.

  • Content creation – Writing articles, summaries, and even creative stories.

  • Data analysis & research – Extracting insights from large amounts of text data.

  • Automation & workflow optimization – Automating repetitive tasks using natural language commands.

Typically, LLMs run on cloud-based services like OpenAI’s API, but what if you want to run them entirely on your local machine? That’s where Ollama comes in.

Running an LLM locally means you get:

  • More control over privacy – No data is sent to external servers.

  • Customization – Modify the model for your specific use case.

  • Offline capabilities – No need for an internet connection after setup.

Step 1: Check your hardware & set up your environment

Before diving in, make sure your machine is up for the task. LLMs can be resource-intensive, so having the right hardware will make your experience much smoother.

Recommended hardware

| Component | Minimum requirement | Recommended for better performance |
| --- | --- | --- |
| CPU | 4-core processor (Intel i5 / Ryzen 5) | 8-core or higher (Intel i7 / Ryzen 7) |
| RAM | 8GB RAM | 16GB+ RAM (the more, the better!) |
| GPU (optional) | No dedicated GPU is required | NVIDIA GPU with CUDA support for faster processing |
| Storage | At least 20GB free space | SSD is recommended for faster model loading |

If you don’t have a powerful GPU, don’t worry—Ollama can still run LLMs on the CPU, but performance may be slower.
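Not sure what your machine has? On Linux, a few quick commands will tell you (macOS and Windows have their own equivalents):

nproc          # number of CPU cores
free -h        # total and available RAM
df -h .        # free disk space on the current filesystem
nvidia-smi     # GPU model and VRAM (only works if an NVIDIA driver is installed)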

1. Install Docker

Ollama and the tooling around it are often run in containers, which makes Docker well worth having. If you haven’t installed it yet, download and install Docker from the official website.

Once installed, verify that Docker is working by running this command in your terminal:

docker --version

You should see output similar to:

Docker version 24.0.5, build 1234567

2. Install Ollama

Now, let’s install Ollama. Open a terminal and run:

curl -fsSL https://ollama.com/install.sh | sh

Once the installation is complete, confirm that Ollama is installed by running:

ollama --version

You should see a version number like this:

ollama version is 0.5.12
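The install script above sets Ollama up directly on the host. If you would rather keep Ollama itself inside a container, the official ollama/ollama image can be used instead; a typical CPU-only invocation, adapted from the image's documentation, looks like this:

docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

You can then run the CLI inside the container, for example: docker exec -it ollama ollama run llama3.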

Step 2: Choose and get your model

Ollama supports different LLMs, and you can choose the one that best fits your needs.

1. List available models

Run this command to see what models are downloaded:

ollama list

If this is your first time using Ollama, the list may be empty or contain only the llama3:latest model. llama3:latest is an alias for Meta's Llama 3 8B Instruct in the Q4_0 quantization (Q4_0 is the quantization type of the model; we'll cover quantization in another series, so stay tuned).

You can browse available models at Ollama’s Model Library.

Here is a table summarizing some state-of-the-art Large Language Models (LLMs) available on Ollama, along with their descriptions and best use cases:

| Model Name | Description | Best For |
| --- | --- | --- |
| Llama 3.3 70B | A 70-billion-parameter model from Meta, offering performance comparable to the Llama 3.1 405B model but in a more compact size. | Applications requiring high-performance natural language processing and reasoning. |
| DeepSeek-R1 | The first generation of reasoning models from DeepSeek, comprising six dense models distilled from DeepSeek-R1 based on Llama and Qwen, delivering performance comparable to OpenAI-o1. | Complex data analysis and reasoning tasks. |
| Phi-4 14B | A 14-billion-parameter state-of-the-art open model from Microsoft, known for its robust language processing and reasoning capabilities. | Tasks demanding high accuracy in language analysis and logical reasoning. |
| Mistral 7B | A 7-billion-parameter model released by Mistral AI, updated to version 0.3, providing high performance in language tasks with a compact footprint. | Applications needing efficient models with high performance, suitable for deployment on devices with limited resources. |
| Qwen2.5 72B | A new series of models from Alibaba, pretrained on a large-scale dataset of up to 18 trillion tokens, supporting up to 128K tokens of context and offering multilingual support. | Tasks requiring long-context processing and multilingual capabilities. |
| Gemma 2B/7B/9B | A family of lightweight, high-performance models from Google DeepMind, updated to version 1.1, suitable for various tasks with limited resources. | Applications needing compact, efficient models for deployment on personal devices or mobile applications. |
| Dolphin 3.0 Llama 3.1 8B | The next generation of the Dolphin series, fine-tuned from Llama 3.1 8B, designed as an optimal general-purpose local model, enabling coding, math, agentic functions, function calling, and general use cases. | Developers seeking a versatile model for tasks like coding, mathematics, and local agent deployment. |

Considerations for local deployment:

  • Model size and hardware requirements: Larger models typically require more RAM and computational resources. For instance, the Llama 3.3 70B model may necessitate significant memory, whereas smaller models like Mistral 7B or Gemma 2B/7B are more suitable for systems with limited resources (see the rough sizing sketch after this list).

  • Specific use cases: Identify your primary use case (e.g., coding, natural language processing, reasoning) to select a model fine-tuned for that purpose.

  • Updates and support: Check the frequency of updates and community support for the model of interest to ensure timely improvements and bug fixes.
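As a rough back-of-the-envelope check of the first point above, you can estimate how much RAM a quantized model needs before pulling it. The sketch below is an approximation, not an official Ollama figure: it assumes a 4-bit-quantized model takes roughly 0.6 bytes per parameter plus a couple of GB of runtime and KV-cache overhead.

def estimate_ram_gb(params_billions: float,
                    bytes_per_param: float = 0.6,
                    overhead_gb: float = 2.0) -> float:
    """Very rough RAM estimate (in GB) for running a 4-bit-quantized model locally."""
    return params_billions * bytes_per_param + overhead_gb

# Illustrative sizes only; check each model's page for real figures.
for name, size_b in [("Gemma 2B", 2), ("Mistral 7B", 7), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{estimate_ram_gb(size_b):.0f} GB RAM")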

2. Download a Model

Once you've chosen a model, pull it with:

ollama pull <model-name>

For example, to download DeepSeek-R1, run:

ollama pull deepseek-r1

Downloading may take a few minutes, depending on the model size and your internet speed. Once completed, the model is stored locally and ready for use.
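Most library entries also come in several sizes and quantizations, exposed as tags. If your hardware is limited, pulling a smaller variant explicitly is often the better choice (the tag below is an example; check the model's page for the tags that actually exist):

ollama pull deepseek-r1:7b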

3. Delete a Model

If you no longer need a model and want to remove it to free up storage:

ollama rm <model-name>

Step 3: Run the Model Locally

Now that we’ve downloaded a model, let’s get it up and running on our local machine.

1. Start the Model

To serve the model locally, run the following command in your terminal:

ollama run <model-name>

For example, if you downloaded DeepSeek-R1, you would run:

ollama run deepseek-r1

Once you execute the command, Ollama will:

  • Initialize the model – Load it into memory (this may take a few seconds or minutes, depending on your hardware).

  • Start a local API server – By default, the model will be accessible at:

http://localhost:11434/v1

  • Display logs – You’ll see logs showing that the model is running and ready to accept input.

If you just want to serve API requests to the model, without the interactive prompt, you can run Ollama in the background:

ollama serve &>/dev/null &

2. Verify the Model is Running

To check if the model is properly running, open a new terminal window and run:

curl http://localhost:11434/v1/models

If everything is working, you should see a response listing the available models:

{
  "object": "list",
  "data": [
    {
      "id": "deepseek-r1:latest",
      "object": "model",
      "created": 1700000000,
      "owned_by": "library"
    },
    {
      "id": "llama3:latest",
      "object": "model",
      "created": 1700000000,
      "owned_by": "library"
    }
  ]
}

This confirms that the model is loaded and ready to process requests.

3. Send Requests to the Model

Now, let’s interact with the model by sending text prompts.

Using cURL

You can send a basic request using curl:

curl -X POST http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
          "model": "llama3",
          "messages": [{"role": "user", "content": "Explain recursion in simple terms."}],
          "max_tokens": 100
        }'

If your system does not have enough free memory to load the specified model, Ollama will return this response:

{
  "error": {
    "message": "model requires more system memory (X GiB) than is available (Y GiB)",
    "type": "api_error",
    "param": null,
    "code": null
  }
}

Otherwise, the response will look something like this:

{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama3",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Recursion is a way of solving problems by breaking them down into smaller versions of the same problem. A function calls itself repeatedly until it reaches a simple solution. Think of it like a never-ending staircase that leads to the answer: \"To solve this problem, I'll jump down one step and let myself fall again.\""
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 65,
    "total_tokens": 82
  }
}
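The endpoint also supports streaming, so tokens arrive as they are generated instead of all at once at the end. As a minimal sketch, add "stream": true to the same request and the server responds with a stream of OpenAI-style data: chunks:

curl -X POST http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
          "model": "llama3",
          "messages": [{"role": "user", "content": "Explain recursion in simple terms."}],
          "stream": true
        }'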

Using Python

If you prefer Python, here’s how you can interact with the model:

import requests

url = "http://localhost:11434/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "Explain recursion in simple terms."}],
    "max_tokens": 100
}

response = requests.post(url, json=data, headers=headers)
print(response.json())

This script sends a chat request to the model and prints the response.
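Because the API is OpenAI-compatible, you can also point the official openai Python package at the local server instead of calling requests directly. A minimal sketch, assuming the package is installed (pip install openai) and Ollama is running on the default port:

from openai import OpenAI

# The client requires an api_key, but Ollama ignores it; any placeholder string works.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain recursion in simple terms."}],
    max_tokens=100,
)
print(response.choices[0].message.content)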

4. Stop Ollama

When you’re done using the model, you can stop Ollama with:

pkill ollama

Alternatively, you can find the process ID and kill it:

ps aux | grep ollama
kill -9 <process_id>

Step 4: Customize for Your Purpose

Now that you have your LLM running locally, let’s explore how to customize it to better fit your needs. This step covers fine-tuning, adjusting response parameters, integrating into applications, and optimizing performance.

1. Adjust Response Parameters

When sending requests to the model, you can tweak different parameters to control the response style. Here are the key ones:

| Parameter | Description | Example Value |
| --- | --- | --- |
| max_tokens | Limits the response length | 100 (short) / 500 (long) |
| temperature | Controls randomness (higher = more creative, lower = more deterministic) | 0.7 (balanced) / 0.1 (focused) |
| top_p | Limits diversity by narrowing token selection | 0.9 (high diversity) / 0.5 (low diversity) |
| frequency_penalty | Reduces repetition | 0.2 (mild) / 1.0 (strong) |
| presence_penalty | Encourages introducing new topics | 0.0 (neutral) / 0.8 (more diverse) |

Example: Adjust Response Style

To generate a more creative response, set a higher temperature:

curl -X POST http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
          "model": "llama3",
          "messages": [{"role": "user", "content": "Give me a sci-fi story idea!"}],
          "temperature": 1.0,
          "max_tokens": 200
        }'

For a more structured and predictable response, lower the temperature:

...
          "temperature": 0.2
...
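To see the effect for yourself, a small script like the sketch below sends the same prompt at several temperatures against the same local endpoint and prints the replies side by side:

import requests

URL = "http://localhost:11434/v1/chat/completions"
PROMPT = "Give me a sci-fi story idea!"

for temperature in (0.2, 0.7, 1.0):
    payload = {
        "model": "llama3",
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": temperature,
        "max_tokens": 120,
    }
    # Extract just the assistant's reply from the chat completion response
    reply = requests.post(URL, json=payload).json()["choices"][0]["message"]["content"]
    print(f"--- temperature={temperature} ---")
    print(reply)
    print()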

2. Integrate the Model into Applications

Your local LLM can be integrated into various applications:

  • Chatbot with FastAPI (Python)

import requests
from fastapi import FastAPI, Form

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

@app.post("/chatbot")
async def chatbot(
    model_alias: str = Form(default="llama3.2:1b"),
    question: str = Form(default=""),
):
    # Forward the question to the local Ollama server
    payload = {
        "model": model_alias,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
    }
    response = requests.post(OLLAMA_URL, json=payload).json()

    # Surface Ollama's error message if no completion came back
    if response.get("choices") is None:
        return response["error"].get("message", "Error: No response from Ollama.")

    return response["choices"][0]["message"].get("content", "Error: No response from Ollama.")
# Run this with: uvicorn script_name:app --reload

Now, your LLM is accessible via an API endpoint (/chatbot) that any front-end app can call.
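Assuming you saved the script as main.py and started it with uvicorn main:app --reload, you can test the endpoint with a form-encoded request (the field names match the Form parameters above; FastAPI's Form support also needs the python-multipart package installed):

curl -X POST http://localhost:8000/chatbot \
     -F "model_alias=llama3" \
     -F "question=What is recursion?"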

  • Web Interface with Gradio

If you want a quick chatbot UI, install Gradio (pip install gradio) and create a simple chat app:

import gradio as gr
import requests

URL = "http://localhost:11434/v1/chat/completions"

def chat_function(message, history):
    # Send the user's message to the local Ollama server
    payload = {
        "model": "llama3",
        "messages": [{"role": "user", "content": message}],
    }
    response = requests.post(URL, json=payload)
    reply = response.json()["choices"][0]["message"].get("content", "Error: No response from Ollama.")
    history.append((message, reply))
    return "", history

with gr.Blocks() as demo:
    with gr.Tab("Chatbot"):
        gr.Markdown(value="## Local AI Chatbot")
        with gr.Row():
            with gr.Column():
                msg = gr.Textbox(placeholder="Type your message...")
                btn = gr.Button("Send")
            chatbot = gr.Chatbot(height=800)
        btn.click(chat_function, inputs=[msg, chatbot], outputs=[msg, chatbot])

demo.launch(server_name="0.0.0.0", server_port=7860, share=True)

Run the app with:

python chatbot.py

Then open http://0.0.0.0:7860/ in your browser to try your app.

3. Optimize Performance for Local Inference

Running LLMs locally can be demanding. Here are some tips to improve performance:

Use a GPU (If Available)

If you have an NVIDIA GPU, install CUDA and enable GPU acceleration:

export OLLAMA_USE_CUDA=1

Verify that your GPU is detected:

nvidia-smi

Reduce model load time

If your LLM takes too long to start, try:

  • Using a smaller model

  • Running with fewer threads (for better resource management):

OLLAMA_NUM_THREADS=4 ollama run llama3

Increase Response Speed

  • Lower temperature for more predictable outputs.

  • Optimize memory usage by closing background apps if RAM is limited.

Why use Ollama? Advantages & ease of use

Ollama makes running Large Language Models (LLMs) locally simple and efficient. Here are some key reasons why it stands out:

Easy setup & installation: A single command installs Ollama, with no complex dependencies or manual configuration.

Runs entirely on your local machine

  • No need for cloud APIs - everything stays private.

  • Works offline, making it ideal for secure environments.

Built-in model management: Easily download, list, and serve models with simple commands.

Simple API for developers

  • Runs a local API server, allowing easy integration with apps.

  • Works with Python, FastAPI, Flask, Streamlit, and more.

Privacy & security

  • Since everything runs locally, no data is sent to external servers.

  • Ideal for confidential applications (legal, medical, enterprise).

Ollama is one of the easiest ways to run an LLM locally, whether you're a beginner or an experienced developer.

Disadvantages of using Ollama

While Ollama makes it easy to run LLMs locally, it does come with some limitations:

High hardware requirements

  • Running LLMs, especially larger models, requires a powerful CPU and a lot of RAM.

  • GPU acceleration is not always optimized, and some models may still run slowly on consumer GPUs.

Limited model selection

  • Compared to cloud-based APIs (e.g., OpenRouter, MS Azure AI, ...), fewer models are officially supported.

  • Some SOTA proprietary models (GPT-4, Claude, Gemini) are not available due to licensing restrictions.

Large storage consumption

  • Each LLM download takes up a significant amount of disk space.

No fine-tuning built-in

  • While you can modify prompts and adjust parameters, Ollama does not yet support full model fine-tuning.

  • For deeper customization, you might need external tools like Hugging Face, LoRA, or full retraining pipelines.

Performance

  • Slower Speed: Ollama’s inference can be slower than llama.cpp or vLLM, especially on CPUs or with larger models, due to less aggressive optimization.

  • Quantization: Its default quantization (e.g., 4-bit) may not be as efficient as that of other runtimes (such as llama.cpp), costing some speed and memory efficiency.

  • Accuracy: Outputs might be slightly less precise due to differences in model handling or tokenization.

When should you use Ollama?

✔ If you want full privacy and offline capabilities.
✔ If you're experimenting with local AI applications.
✔ If you have a powerful enough machine to handle LLMs.

When might you need something else?

❌ If you need access to proprietary models (GPT-4, Claude, Gemini, ...).
❌ If you need more accurate, faster or real-time AI responses at scale.
❌ If you want fully managed fine-tuning options.

Ollama is great for local AI experimentation, but if you need production serving, scalability, fine-tuning, or cutting-edge commercial models, other solutions might be a better fit. We will discuss them in the related series.

Final thoughts

So, now that you know how to customize your LLM, the possibilities are endless!

  • Adjust responses to match your needs.

  • Fine-tune for specific tasks.

  • Integrate it into real-world frameworks, plugins, and apps.

  • Optimize performance for smooth local inference.

What’s next?

  • Try different models and compare their performance.

  • Experiment with custom prompts to get better responses.

  • Build a small project using your local LLM.

If you run into any issues or have cool ideas to share, drop a comment below! Happy coding!


Duong Tran
Experienced AI Engineer with nearly 4 years of expertise in developing and managing AI services focused on banking and computer vision applications—an enthusiast in LLM, Computer Vision, and AI Automation, driven by innovation and cutting-edge technology.