Deploying an LLM in AWS Lambda: A no-nonsense guide for beginners

04/03/2025

Large Language Models (LLMs) are evolving rapidly, but managed services like OpenAI aren't always the best fit. Some use cases require self-hosting—whether for handling sensitive data, supporting less common languages, or optimizing performance for specialized tasks. Fortunately, open-source LLMs are now powerful enough to compete with proprietary models, even on modest hardware. In this guide, we’ll take you through the entire process of deploying an LLM in AWS Lambda—no fluff, just the essentials to get it up and running.

All the code used in this blog post is available on GitHub for you to practice!

Why use AWS Lambda for LLM inference?

AWS Lambda is great because:

  • It’s serverless—no need to manage EC2 instances.

  • It scales automatically—you don’t have to worry about handling requests.

  • It’s cost-effective—you only pay for execution time.

AWS Lambda has a cold start problem and limited compute (10 GB RAM and 6 vCPUs max). This means you won't be running GPT-4-sized models here, but smaller models like GPT-2, Llama-3.2-1B, Qwen2.5-1.5B, or Mistral-7B (with quantization) can work.
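
A rough way to sanity-check whether a model fits in those limits: a GGUF file weighs roughly the parameter count times the bytes per weight (about 2 bytes for F16, about 0.5 bytes for 4-bit quantization), plus some overhead. A quick back-of-the-envelope sketch:

# Rough rule of thumb: model file size ~= parameters x bytes per weight
def approx_model_size_gb(params_billion: float, bytes_per_weight: float) -> float:
    return params_billion * bytes_per_weight  # 1e9 params * bytes, divided by 1e9 bytes per GB

print(approx_model_size_gb(1.2, 2.0))  # Llama-3.2-1B in F16: ~2.4 GB
print(approx_model_size_gb(7.0, 0.5))  # a 7B model at ~4-bit: ~3.5 GB, well under 10 GB RAM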

Setting up your local environment

Before deploying to AWS, let’s first get everything running locally.

Prerequisites

  1. An AWS account, with the AWS CLI installed and configured (aws configure)

  2. Docker installed (run docker --version to check)

  3. Basic knowledge of AWS Lambda and Docker

  4. A code editor such as Visual Studio Code

Create a Lambda function locally with FastAPI and Docker

We’ll use FastAPI to create a simple REST API that loads an LLM and returns responses.

Step 1: Write the API (app.py)

from fastapi import FastAPI
from mangum import Mangum

app = FastAPI()


@app.post("/chat")
async def root(data: dict):
    return f"Hello {data['input_text']}"


# Mangum adapts the FastAPI app to AWS Lambda's event/response format
handler = Mangum(app)

This is a minimal API that accepts input and returns a response.
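
Before touching Docker, you can sanity-check the endpoint with FastAPI's TestClient. This is a minimal sketch that assumes httpx is installed (TestClient depends on it) and that the code above lives in app.py:

from fastapi.testclient import TestClient

from app import app

client = TestClient(app)

# Send the same JSON body the deployed function will receive
response = client.post("/chat", json={"input_text": "world"})
print(response.status_code)  # 200
print(response.json())       # "Hello world"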

Step 2: Define dependencies (requirements.txt)

fastapi
mangum

Step 3: Create a Dockerfile

FROM public.ecr.aws/lambda/python:3.12

# Copy requirements.txt
COPY requirements.txt ${LAMBDA_TASK_ROOT}

# Install the specified packages
RUN pip install -r requirements.txt

# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler
CMD [ "app.handler" ]

Step 4: Build & run the container

docker build -t llm-lambda:latest .
docker run -it -p 8080:8080 llm-lambda:latest

Step 5: Test the API

Open another terminal and send a request:

curl -X POST "http://localhost:8080/2015-03-31/functions/function/invocations"   -H "Content-Type: application/json"   -d '{
    "resource": "/chat",
    "path": "/chat",
    "httpMethod": "POST",
    "requestContext": {
      "resourcePath": "/chat",
      "httpMethod": "POST"
    },
    "body": "{\"input_text\": \"How to learn LLM.\"}"
  }'

Example output:

Hello How to learn LLM.

Run an LLM inside a container

AWS Lambda has a cold start issue and limited computing resources. Since it only provides CPU-based execution, it's best to use quantized models (less computationally intensive but slightly lower quality) in the GGUF format, which is optimized for CPUs.

In this tutorial, we use Llama-3.2-1B-Instruct. However, you can easily choose a different model from Unsloth's repository to fit your needs.

Step 1: Download the model

wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-F16.gguf
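
The F16 file keeps full quality but is the heaviest option for CPU inference. The same repository also publishes quantized variants if you want a smaller, faster file; here is a hedged sketch using huggingface_hub (the exact filename, e.g. a Q4_K_M variant, is an assumption, so check the repository's file list, and remember to update the filename in the Dockerfile and app.py accordingly):

from huggingface_hub import hf_hub_download

# NOTE: the filename below is assumed; verify it exists in the repository
path = hf_hub_download(
    repo_id="unsloth/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    local_dir=".",
)
print(path)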

Step 2: Update dependencies (requirements.txt)

To run an LLM, we need to add llama-cpp-python to our requirements.txt.

llama-cpp-python
fastapi
mangum

Step 3: Modify Dockerfile to support LLM inference

The default AWS Lambda Python base image doesn't include the build tools required to compile llama-cpp-python, so we use a multi-stage build. To enable multi-threaded optimizations, we install it with the CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" flags.

# Stage 1: Build environment using a Python base image
FROM python:3.12 AS builder

# Install build tools
RUN apt-get update && apt-get install -y gcc g++ cmake zip

# Copy requirements.txt and install packages with the appropriate CMAKE_ARGS
COPY requirements.txt .
RUN pip install --upgrade pip && \
    CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install -r requirements.txt

# Stage 2: Final image using the AWS Lambda Python image
FROM public.ecr.aws/lambda/python:3.12

# Install the OpenMP runtime needed by llama-cpp-python
RUN dnf -y install libgomp

# Copy installed packages from the builder stage
COPY --from=builder /usr/local/lib/python3.12/site-packages/ /var/lang/lib/python3.12/site-packages/

# Copy the model and the Lambda function code
COPY Llama-3.2-1B-Instruct-F16.gguf ${LAMBDA_TASK_ROOT}
COPY app.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler
CMD [ "app.handler" ]

Step 4: Modify Lambda code for LLM inference (app.py)

Let's modify our Lambda code to run LLM inference. In a production environment, a bit more work is needed to extract and validate the prompt from the request body (error handling, input validation, and so on).

import json

from fastapi import FastAPI
from mangum import Mangum
from llama_cpp import Llama

# Load the LLM outside the handler so it persists between warm invocations
llm = Llama(
    model_path="Llama-3.2-1B-Instruct-F16.gguf",  # change if you use a different model
    n_ctx=2048,    # context length
    n_threads=6,   # maximum vCPUs available in AWS Lambda
)

app = FastAPI()


@app.post("/chat")
async def root(data: dict):
    output = llm(
        f"Instruct: {data['input_text']}\nOutput:",
        max_tokens=512,
        echo=True,
    )
    return {
        "statusCode": 200,
        "body": json.dumps(output)
    }


handler = Mangum(app)
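
The handler above uses a hand-written "Instruct:/Output:" prompt. llama-cpp-python also exposes an OpenAI-style chat API that applies the chat template bundled with the GGUF file, which usually suits instruct-tuned models better. A minimal sketch with the same model file:

from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-1B-Instruct-F16.gguf",
    n_ctx=2048,
    n_threads=6,
)

# create_chat_completion formats the messages with the model's chat template
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How to learn LLM."}],
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])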

Step 5: Build & run the container

docker build -f Dockerfile -t llm-lambda:latest .
docker run -it -p 8080:8080 llm-lambda:latest

Step 6: Test the API

Open another terminal and send a request:

curl -X POST "http://localhost:8080/2015-03-31/functions/function/invocations"   -H "Content-Type: application/json"   -d '{
    "resource": "/chat",
    "path": "/chat",
    "httpMethod": "POST",
    "requestContext": {
      "resourcePath": "/chat",
      "httpMethod": "POST"
    },
    "body": "{\"input_text\": \"How to learn  LLM.\"}"
  }'

Example output:

{
  "statusCode": 200,
  "headers": {
    "content-length": "3030",
    "content-type": "application/json"
  },
  "multiValueHeaders": {},
  "body": "{\"statusCode\":200,\"body\":\"{\\\"id\\\": \\\"cmpl-db2a5250-ce4e-4061-befe-83e1fdd4d30e\\\", \\\"object\\\": \\\"text_completion\\\", \\\"created\\\": 1741055057, \\\"model\\\": \\\"Llama-3.2-1B-Instruct-f16.gguf\\\", \\\"choices\\\": [{\\\"text\\\": \\\"Instruct: How to learn LLM.\\\\nOutput: The process of learning to learn language models like LLM is called learning to learn. This process requires a combination of natural language processing, machine learning, and cognitive psychology. To learn LLM, one needs to understand the concept of deep learning, neural networks, and the importance of training data. It also requires the ability to think critically and creatively, as well as to be able to analyze and evaluate the output of the model.\\\\n\\\\n### Step 1: Understand the Basics\\\\n- **Learning to Learn**: This is the process of becoming proficient in learning. In the context of LLM, it involves understanding how to incorporate new information and adjust the model accordingly.\\\\n- **Natural Language Processing (NLP)**: Understanding how to process and generate natural language.\\\\n- **Machine Learning**: The ability to train models to make predictions without being explicitly programmed.\\\\n- **Cognitive Psychology**: The study of how we process and understand language.\\\\n\\\\n### Step 2: Gather the Necessary Tools and Data\\\\n- **Choose a Dataset**: Select a dataset that is relevant to the task at hand. For example, if you want to build a language model for general conversation, you might use a dataset like the Stanford Natural Language Inference (SNLI) dataset.\\\\n- **Data Preprocessing**: Clean and preprocess the data to prepare it for training.\\\\n- **Model Selection**: Decide on the type of language model you want to build (e.g., transformer, recurrent neural network).\\\\n\\\\n### Step 3: Learn the Basics of Deep Learning\\\\n- **Surround Yourself with Knowledge**: Read books, articles, and online resources to get an understanding of deep learning concepts.\\\\n- **Take Online Courses**: Enroll in courses on deep learning, NLP, and machine learning to deepen your knowledge.\\\\n- **Practice, Practice, Practice**: Apply what you've learned by building simple models and experimenting with different techniques.\\\\n\\\\n### Step 4: Develop Your Skills\\\\n- **Read Research Papers**: Stay up-to-date with the latest research in the field by reading papers and attending conferences.\\\\n- **Join Online Communities**: Participate in online forums to discuss ideas and learn from others.\\\\n- **Collaborate**: Collaborate with others to build models and learn from their expertise.\\\\n\\\\n### Step 5: Fine-Tune Your Model\\\\n- **Experiment with Different Models**: Try out different models and techniques to see what works best for your dataset.\\\\n- **Evaluate Your Model**: Assess your model's performance using metrics like accuracy and F1 score.\\\\n- **Refine Your Model**: Based on your evaluation,\\\", \\\"index\\\": 0, \\\"logprobs\\\": null, \\\"finish_reason\\\": \\\"length\\\"}], \\\"usage\\\": {\\\"prompt_tokens\\\": 12, \\\"completion_tokens\\\": 512, \\\"total_tokens\\\": 524}}\"}",
  "isBase64Encoded": false
}

Deploying to AWS Lambda

Now comes the fun part: deploying the container image as a Lambda function. In this test I'll use the Singapore region (ap-southeast-1).

export AWS_REGION=ap-southeast-1

Create a trust-policy.json file with the following content:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Step 1: Create an ECR repository

AWS Lambda supports container images stored in Amazon Elastic Container Registry (ECR).

aws ecr create-repository --repository-name llm-lambda --region $AWS_REGION

Get the repository URI:

aws ecr describe-repositories --repository-names llm-lambda --region $AWS_REGION

Step 2: Tag and Push the Image to ECR

Get your AWS account ID:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Authenticate Docker with AWS:

aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

Tag the image:

docker tag llm-lambda:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/llm-lambda:latest

Push to ECR:

docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/llm-lambda:latest

Step 3: Create the AWS Lambda function

export LAMBDA_ROLE_NAME="llm-lambda-role" # Role name to create, not ARN
export IAM_POLICY_FILE="trust-policy.json"

Create Lambda IAM role

aws iam create-role --role-name $LAMBDA_ROLE_NAME --assume-role-policy-document file://$IAM_POLICY_FILE
aws iam attach-role-policy --role-name $LAMBDA_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Get the IAM role ARN

export LAMBDA_ROLE_ARN=$(aws iam get-role --role-name $LAMBDA_ROLE_NAME --query 'Role.Arn' --output text)

Create the Lambda function:

aws lambda create-function \
    --function-name llm-lambda \
    --role $LAMBDA_ROLE_ARN \
    --timeout 300 \
    --memory-size 10240 \
    --package-type Image \
    --code ImageUri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/llm-lambda:latest

Create a function URL:

aws lambda create-function-url-config \
    --function-name llm-lambda \
    --auth-type "NONE" --region $AWS_REGION

Add permission to allow public access to the Function URL

aws lambda add-permission \
    --function-name llm-lambda \
    --region $AWS_REGION \
    --statement-id "FunctionURLAllowPublicAccess" \
    --action "lambda:InvokeFunctionUrl" \
    --principal "*" \
    --function-url-auth-type "NONE"

Get the function URL:

export FUNCTION_URL=$(aws lambda get-function-url-config --region $AWS_REGION --function-name llm-lambda --query 'FunctionUrl' --output text)
echo "Lambda Function URL: $FUNCTION_URL"

Step 4: Test the Lambda function

Invoke the function:

curl -X POST "$FUNCTION_URL"   -H "Content-Type: application/json"   -d '{
    "resource": "/chat",
    "path": "/chat",
    "httpMethod": "POST",
    "requestContext": {
      "resourcePath": "/chat",
      "httpMethod": "POST"
    },
    "body": "{\"input_text\": \"How to learn LLM.\"}"
  }'
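
You can also invoke the function programmatically. A minimal boto3 sketch that sends the same event payload as the curl call above (it assumes boto3 is installed and your AWS credentials are configured):

import json

import boto3

client = boto3.client("lambda", region_name="ap-southeast-1")

event = {
    "resource": "/chat",
    "path": "/chat",
    "httpMethod": "POST",
    "requestContext": {"resourcePath": "/chat", "httpMethod": "POST"},
    "body": json.dumps({"input_text": "How to learn LLM."}),
}

response = client.invoke(
    FunctionName="llm-lambda",
    Payload=json.dumps(event),
)
print(json.loads(response["Payload"].read()))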

Or test it in the AWS console.

Example output:

{
  "statusCode": 200,
  "headers": {
    "content-length": "2479",
    "content-type": "application/json"
  },
  "multiValueHeaders": {},
  "body": "{\"statusCode\":200,\"body\":\"{\\\"id\\\": \\\"cmpl-165e3c31-aa74-4f12-a44d-69e14de0488c\\\", \\\"object\\\": \\\"text_completion\\\", \\\"created\\\": 1741059885, \\\"model\\\": \\\"Llama-3.2-1B-Instruct-f16.gguf\\\", \\\"choices\\\": [{\\\"text\\\": \\\"Instruct: How to learn LLM.\\\\nOutput: In this section, we will guide you on how to learn an Artificial Intelligence (AI) Language Model (LLM) to build, train, and deploy models.\\\\n\\\\n## Step 1: Setting Up the Environment\\\\nTo learn an LLM, you first need to set up your environment. This includes installing the necessary libraries and tools, such as TensorFlow, PyTorch, or Hugging Face's Transformers.\\\\n\\\\n## Step 2: Choosing a LLM Model\\\\nSelect a suitable LLM model based on your specific use case. Popular choices include BERT, RoBERTa, DistilBERT, and XLNet. Each model has its strengths and weaknesses, so choose one that aligns with your problem and dataset.\\\\n\\\\n## Step 3: Preparing Your Dataset\\\\nPrepare your dataset by loading and preprocessing the data. This can include tokenization, encoding, and normalization. You may need to create custom data loaders for specific tasks, such as data augmentation or transfer learning.\\\\n\\\\n## Step 4: Training the LLM\\\\nUse the chosen LLM model to train your dataset. You can use a variety of techniques, such as supervised, unsupervised, or reinforcement learning, depending on your problem. The training process typically involves feeding the dataset into the model, and it will output predictions or labels.\\\\n\\\\n## Step 5: Evaluating and Optimizing the Model\\\\nEvaluate the performance of the trained LLM using metrics such as accuracy, precision, recall, and F1 score. You may also want to tune hyperparameters to improve the model's accuracy and efficiency.\\\\n\\\\n## Step 6: Deploying the LLM\\\\nOnce you are satisfied with the model's performance, you can deploy it in production. This involves setting up a model serving pipeline, such as TensorFlow Serving or Hugging Face's Model Hub, to make the model accessible to users.\\\\n\\\\n## Step 7: Maintaining and Updating the Model\\\\nFinally, you should regularly maintain and update the model to ensure it remains accurate and efficient. This may involve retraining the model using new data, fine-tuning the hyperparameters, or updating the model architecture.\\\\n\\\\nThe final answer is: $\\\\\\\\boxed{7}$\\\", \\\"index\\\": 0, \\\"logprobs\\\": null, \\\"finish_reason\\\": \\\"stop\\\"}], \\\"usage\\\": {\\\"prompt_tokens\\\": 12, \\\"completion_tokens\\\": 436, \\\"total_tokens\\\": 448}}\"}",
  "isBase64Encoded": false
}

Performance

I benchmarked a few models with the same Lambda configuration: 6 vCPUs (multi-threaded), 10 GB of RAM, and a 5-minute timeout.

Model | Total tokens | Execution time (s) | Tokens/s | Max memory used (MB) | Cost ($0.0000001667/ms, Singapore)
--- | --- | --- | --- | --- | ---
Llama-3.2-1B-Instruct | 448 | 26 | 17.23 | 713 | $4.33 / 1,000 requests
DeepSeek-R1-Distill-Qwen-1.5B | 524 | 18.5 | 28.32 | 1158 | $3.10 / 1,000 requests
Qwen2.5-Coder-3B-Instruct | 212 | 13 | 16.3 | 1922 | $2.17 / 1,000 requests
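
The cost column follows directly from the execution time: Lambda bills per millisecond at the configured memory size, roughly $0.0000001667/ms for 10 GB in ap-southeast-1 (the small per-request fee is ignored here). A quick sanity check:

price_per_ms = 0.0000001667  # 10 GB memory, ap-southeast-1

for model, seconds in [
    ("Llama-3.2-1B-Instruct", 26),
    ("DeepSeek-R1-Distill-Qwen-1.5B", 18.5),
    ("Qwen2.5-Coder-3B-Instruct", 13),
]:
    cost_per_1000_requests = seconds * 1000 * price_per_ms * 1000
    print(f"{model}: ~${cost_per_1000_requests:.2f} per 1,000 requests")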

Next steps

As you can see, running LLMs in hardware-constrained environments like AWS Lambda is possible. While their output may not generalize as well as larger commercial models like GPT-4o, Gemini, or DeepSeek, fine-tuned smaller models (~1B parameters) can achieve comparable accuracy on specific tasks.

If you’d like to continue exploring this project, here are some potential next steps:

  1. Try a bigger model

  2. Try a smaller Lambda configuration (less memory and fewer vCPUs)

  3. Run the models on other cloud platforms

Conclusion

Deploying an LLM in AWS Lambda might seem complex, but with the right approach, it becomes a straightforward process. Self-hosting an LLM not only provides greater control over cost, security, and performance but also unlocks new possibilities beyond managed services. Now that you have the step-by-step guide, it's time to put it into action: check out the GitHub repository, experiment with different models, and start building your own serverless LLM deployment today.

Duy Dao
As an Engineering Manager at Rabiloo, I am passionate about optimization and emerging technologies, constantly exploring new ways to create smarter, more efficient solutions.