Fine-tuning a reasoning model with GRPO for passport data extraction

07/03/2025

Extracting structured data from passports isn’t just about OCR—it’s about reasoning. Traditional OCR methods struggle with formatting inconsistencies, multilingual text, and real-world variations, but fine-tuning a model with GRPO enhances contextual understanding, improving accuracy and adaptability. This blog helps developers fine-tune a reasoning model with GRPO to optimize passport data extraction, covering key challenges, implementation techniques, and best practices.

Overview

Fine-tuning a language model isn’t just about feeding it data and hoping for the best. If you’re extracting structured data—like passport details—you need a model that reasons through the problem, not one that just memorizes patterns. That’s where Group Relative Policy Optimization (GRPO) comes in.

In this post, we’ll walk through fine-tuning a reasoning model for passport data extraction using GRPO. We’ll start with Supervised Fine-Tuning (SFT) and then refine it using reinforcement learning (RL) to improve accuracy and reasoning.

We’ll use:

  • Base Model: Qwen/Qwen2.5-1.5B-Instruct

  • Dataset: Custom Passport EN dataset

  • Training Method: SFT + GRPO

All code is available on GitHub.

Why GRPO?

Supervised fine-tuning (SFT) is effective for training a baseline model, but it struggles with generalization. When extracting structured data, slight variations in input format can lead to errors. Standard SFT lacks the adaptive reasoning needed to handle these cases effectively.

This is where GRPO improves the model. The DeepSeekMath paper introduces GRPO as an RL post-training technique designed to enhance reasoning skills in large language models (LLMs). Unlike traditional heuristic-based search methods, GRPO relies solely on RL for optimization, helping the model generalize better to unseen input variations.

GRPO has been used in DeepSeek-R1, and its training approach appears similar to the methods used in OpenAI o1 and o3 models, though exact details are unconfirmed. The Hugging Face Science team is working to reproduce the DeepSeek-R1 training process in their Open-R1 project, which is worth exploring for more insights.

We’ll implement GRPO using the TRL (Transformer Reinforcement Learning) library and focus on improving structured data extraction from passports.

What is the GRPO reinforcement learning algorithm?

To understand GRPO, let’s break it down with an example before diving into the technical details.

Intuition Behind GRPO

GRPO helps a model learn by comparing different actions in groups and making controlled updates. Instead of updating the model after every single observation, it collects multiple observations before adjusting its strategy—similar to mini-batch gradient updates in deep learning.

Example: A robot navigating a maze

1. Sampling different paths: The robot tries out each path multiple times and records the results:

  • Path A: Reaches the goal 2 out of 3 times.

  • Path B: Reaches the goal 1 out of 3 times.

  • Path C: Reaches the goal 3 out of 3 times.

2. Evaluating performance: It calculates the success rate:

  • Path A → 66.67% success

  • Path B → 33.33% success

  • Path C → 100% success

3. Comparing paths: It identifies Path C as the best option but doesn’t ignore the other paths completely.

4. Adjusting strategy: The robot increases the probability of choosing Path C, but it occasionally tries A and B to avoid overfitting to one solution.

5. Controlled updates: Instead of jumping to a 100% preference for Path C, it gradually shifts probabilities while maintaining exploration.
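
The same group-relative idea fits in a few lines of code. The toy sketch below (ours, not from the post's repository) scores a group of sampled paths, converts the rewards into advantages relative to the group mean, and nudges a softmax policy toward the better-than-average paths:

import numpy as np

rng = np.random.default_rng(0)

# Policy over three paths (A, B, C), parameterized by logits.
logits = np.zeros(3)
success_prob = np.array([2 / 3, 1 / 3, 1.0])  # how often each path reaches the goal

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(200):
    probs = softmax(logits)
    # 1. Group sampling: draw a group of paths from the current policy.
    group = rng.choice(3, size=8, p=probs)
    # 2. Reward scoring: 1 if the sampled path reaches the goal, else 0.
    rewards = (rng.random(8) < success_prob[group]).astype(float)
    # 3. Group-relative advantage: reward minus the group mean.
    advantages = rewards - rewards.mean()
    # 4. Controlled update: a small policy-gradient step on the logits.
    grad = np.zeros(3)
    for a, adv in zip(group, advantages):
        grad += adv * (np.eye(3)[a] - probs)  # gradient of log-probability under softmax
    logits += 0.1 * grad / len(group)

print(np.round(softmax(logits), 2))  # probability mass shifts toward path C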

GRPO algorithm mathematics

Now, let's walk through the mathematics behind GRPO.

1. Policy and Actions

The policy, denoted as πθ (where θ represents the policy’s trainable parameters), defines the decision-making strategy. For a given state ( s ), the policy outputs a probability distribution over possible actions, written as πθ(a∣s). This represents the likelihood of selecting action ( a ) in state ( s ).

The objective is to optimize the policy parameters to maximize the expected cumulative reward J(θ), defined as:

J(θ) = E_{τ∼πθ} [ Σ_t r(s_t, a_t) ]

where τ = (s_0, a_0, s_1, a_1, …) is a trajectory (a sequence of states and actions), and r(s_t, a_t) is the reward received at time ( t ).

2. Group Sampling

For a given state ( s ), GRPO samples a group of ( N ) actions, {a_1, a_2, …, a_N}, using the current policy πθ. Each action a_i is independently drawn from the policy distribution:

a_i ∼ πθ(a ∣ s),  i = 1, …, N

This creates a set of candidate actions to evaluate.

3. Reward Scoring

Each sampled action a_i is assessed using a reward function R(ai), which quantifies the action’s quality. This function might represent the immediate reward r(s,a_i) or a discounted sum of future rewards starting from state ( s ) and action a_i, depending on the problem setup.

4. Advantage Calculation

The advantage A(a_i) measures how much better or worse action a_i performs compared to the average performance across the sampled group. It is typically calculated as:

A(a_i) = R(a_i) − (1/N) Σ_{j=1}^{N} R(a_j)

where the second term is the average reward of all ( N ) actions. A positive advantage indicates an above-average action, while a negative advantage indicates a below-average one.

5. Policy Update

GRPO adjusts the policy parameters θ to favor actions with positive advantages (increasing their probability) and disfavour actions with negative advantages (decreasing their probability). This is done by optimizing the policy based on the advantage values, typically via gradient-based methods.

6. KL Divergence Constraint

To prevent the updated policy πθ from straying too far from the previous policy, GRPO imposes a Kullback-Leibler (KL) divergence constraint:

D_KL( πθ_old(· ∣ s) ∥ πθ(· ∣ s) ) ≤ ε

where πθ_old is the policy before the update, and ε is a small threshold. This ensures stable and gradual policy improvements.

7. GRPO Objective

The overarching goal of GRPO is to maximize the expected cumulative reward J(θ) while maintaining stability in policy updates. This is achieved by balancing the reward improvement (guided by the advantage) with the KL divergence constraint, resulting in an optimization problem of the form:

max_θ E_{a_i ∼ πθ_old} [ (πθ(a_i ∣ s) / πθ_old(a_i ∣ s)) · A(a_i) ]   subject to   D_KL( πθ_old ∥ πθ ) ≤ ε
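
For reference, the practical GRPO objective introduced in the DeepSeekMath paper is a clipped importance-ratio surrogate (as in PPO) combined with a KL penalty toward a reference policy. A simplified, per-output LaTeX rendering (the paper additionally averages over the tokens of each sampled output) is:

J_{GRPO}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\ \operatorname{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) - \beta\, D_{KL}\left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right) \right) \right]

where G is the group size, q the prompt, o_i the i-th sampled output, and A_i its group-relative advantage.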

Differences Between PPO and GRPO

The main practical difference is how the advantage is estimated. PPO trains a separate value network (critic) and uses it to compute advantages, which adds memory and compute overhead. GRPO drops the critic entirely: it samples a group of outputs for the same prompt and scores each one relative to the group average, as described above. Both methods keep updates stable by constraining how far the new policy can move from the previous (or reference) policy.

Applying GRPO to passport data extraction

Just like the robot learns from multiple trials, GRPO enables an LLM to refine its structured data extraction ability by:

  • Evaluating different extraction strategies for variations in passport formats.

  • Adjusting model predictions based on relative success rather than absolute correctness.

  • Ensuring controlled updates, allowing for better generalization.

Training

Creating the passport dataset

We needed a diverse and extensive collection of passport samples to build a high-quality dataset for passport information extraction. We designed a structured framework to systematically extract key details while ensuring accuracy and consistency. Raw passport images and text had to be transformed into a structured format that machine learning models could efficiently process. To achieve this, we developed a token-based representation system that encodes essential details in a machine-readable format.

However, we needed to go beyond simple structuring and convert this format into a more intuitive, reasoning-based representation—essentially translating structured data into a "language" the model can process effectively. This system allows us to generate step-by-step navigation sequences, making the representation highly suitable for training models to reason through information extraction tasks.

Inspired by Camel-AI, a multi-agent framework for data generation, we used this approach with the Gemini API to generate the reasoning data for the dataset (as reflected in the accompanying code). The result is a dataset of 400 examples, available here: dataset_public grpo.

Here's how we structured the dataset:

1. Field extraction: We systematically extract essential passport information, including:

  • Passport Number

  • Full Name

  • Gender

  • Nationality

  • Date of Birth

  • Place of Birth

  • Date of Issue

  • Date of Expiry

  • Place of Issue (if available)

  • Machine Readable Zone (MRZ)

2. Token-based representation: Instead of relying solely on free-text extraction, we encode passport data into structured tokens, ensuring uniformity across all entries. This transformation makes it easier for models to learn patterns and generalize across different passports.

  • Entity tokens: Each extracted field is wrapped with a special token (e.g., “passport_number”: “200858064”), making it easier for models to identify and process key details.

  • Date formatting: Dates are standardized to a single format (DD-MM-YYYY) across all samples.

  • MRZ encoding: The Machine Readable Zone (MRZ) is preserved as a key feature, allowing the model to cross-validate extracted details.

3. Dataset structure: Each passport entry in the dataset follows this structured format:

{
	"passport_number": "200858064",
	"full_name": "Daniel Warren",
	"gender": "M",
	"nationality": "United Kingdom",
	"dob": "06-05-1945",
	"place_of_birth": "Gloucestershire",
	"place_of_issue": null,
	"issue_date": "12-12-2020",
	"expire_date": "06-05-2032",
	"mrz": "PAGBRDANIEL<<WARREN<<<<<<<<<<<<<<<<<<<<<<<<<\nB1137484W7GBR4505064M3205068200858064<<<<<50"
}

4. Step-by-step extraction process:

  • The system scans the passport document and detects predefined fields.

  • Each field is extracted with high-confidence OCR and natural language processing techniques.

  • The extracted information is formatted into structured JSON, ensuring consistency and completeness.

  • A verification step cross-checks the MRZ data with the extracted fields to validate correctness (a sketch of such a check follows this list).

5. Scalability & diversity: Our dataset includes a variety of passports with different layouts, fonts, and security features, ensuring robustness against variations in real-world documents. By leveraging automated pipelines, we generate thousands of high-quality labeled passport samples for model training.
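
Step 4 mentions cross-checking the MRZ against the extracted fields. Below is a minimal sketch of what such a check could look like for a TD3 (passport-booklet) MRZ, assuming the standard ICAO field positions in the second MRZ line; the helper names are ours, not taken from the project code.

from datetime import datetime

def parse_mrz_line2(mrz: str) -> dict:
    """Parse the second line of a TD3 MRZ using the standard field positions."""
    line2 = mrz.split("\n")[1]
    return {
        "document_number": line2[0:9].replace("<", ""),
        "nationality": line2[10:13],
        "birth_date": line2[13:19],   # YYMMDD
        "sex": line2[20],
        "expiry_date": line2[21:27],  # YYMMDD
    }

def same_date(yymmdd: str, ddmmyyyy: str) -> bool:
    """Compare an MRZ date (YYMMDD) with a dataset date (DD-MM-YYYY), ignoring the century."""
    return yymmdd == datetime.strptime(ddmmyyyy, "%d-%m-%Y").strftime("%y%m%d")

def cross_check(record: dict) -> dict:
    """Cross-check extracted fields against the MRZ; returns a per-field pass/fail map."""
    mrz = parse_mrz_line2(record["mrz"])
    return {
        "gender": mrz["sex"] == record["gender"],
        "dob": same_date(mrz["birth_date"], record["dob"]),
        "expire_date": same_date(mrz["expiry_date"], record["expire_date"]),
    }

# With the sample record shown above, all three checks pass.
record = {
    "gender": "M",
    "dob": "06-05-1945",
    "expire_date": "06-05-2032",
    "mrz": "PAGBRDANIEL<<WARREN<<<<<<<<<<<<<<<<<<<<<<<<<\nB1137484W7GBR4505064M3205068200858064<<<<<50",
}
print(cross_check(record))  # {'gender': True, 'dob': True, 'expire_date': True}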

Stage 1: SFT for passport data extraction

Goals

  • Train the model to accurately extract structured passport information from raw OCR text.

  • Establish a strong baseline for later stages of reinforcement learning or additional finetuning.

Methodology

We explored two training strategies:

  • Direct extraction model: The model is trained to predict all passport fields directly from the raw OCR text.

  • Step-by-step extraction model: The model is trained to extract each field sequentially, incorporating intermediate reasoning to verify information consistency (e.g., cross-checking MRZ data with extracted fields).
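
To make the two strategies concrete, here is an illustrative (hypothetical) pair of training targets for the same passport. The exact prompt and reasoning wording in the real dataset may differ, but the direct target is just the JSON, while the step-by-step target wraps intermediate reasoning in <think> tags before the <answer>, matching the output format used later by the GRPO format reward.

# Direct extraction target: the structured JSON alone.
direct_target = '{"passport_number": "200858064", "full_name": "Daniel Warren", "gender": "M", ...}'

# Step-by-step target: reasoning first, then the structured answer.
step_by_step_target = (
    "<think>The MRZ second line encodes sex 'M', birth date 450506 (06-05-1945) and expiry "
    "320506 (06-05-2032), which match the printed fields, so the extraction is consistent.</think>\n"
    '<answer>{"passport_number": "200858064", "full_name": "Daniel Warren", "gender": "M", ...}</answer>'
)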

SFT Hyperparameters:

Parameter                        Value
model_name_or_path               Qwen/Qwen2.5-1.5B-Instruct
disable_gradient_checkpointing   true
finetuning_type                  lora
deepspeed                        ds0
cutoff_len                       4096
train_on_prompt                  true
per_device_train_batch_size      32
gradient_accumulation_steps      1
learning_rate                    1.0e-5
num_train_epochs                 1.0
lr_scheduler_type                cosine
warmup_ratio                     0.1
bf16                             true

This structured approach ensures that our passport dataset is optimized for training AI models in document analysis, identity verification, and automated data extraction, making it highly reliable and scalable for real-world applications.

We use llama-factory to run SFT on the model.

Stage 2: Reinforcement learning with GRPO for passport extraction

After completing SFT, we applied GRPO to refine the model’s extraction capabilities. GRPO helps the model learn from its own generated extractions, iteratively improving accuracy and robustness.

Constructing the GRPO reward function

A well-designed reward function is essential for reinforcement learning. We developed a composite reward function that evaluates multiple aspects of accurate passport extraction:

  • Format Reward

    • Purpose: Ensures the model outputs responses in a structured format.

    • Mechanism: Uses regex to verify that the output follows a predefined structure (e.g., <think>...</think> <answer>...</answer>).

    • Reward: +1.0 for correctly formatted output, 0.0 otherwise.

  • Accuracy Reward

    • Purpose: Checks if the model's output matches the ground truth.

    • Mechanism:

      i. Compares the model’s JSON output with the labeled data.

      ii. If the extracted fields perfectly match the ground truth, the model receives the full reward.

      iii. If only some fields match, the reward is proportional to the number of correctly extracted fields.

    • Reward: +1.0 for a perfect match, a partial score for partial correctness, 0.0 for invalid JSON.

Total reward calculation

  • Final reward = Correctness + Field Validity + MRZ Consistency + Structured Formatting

  • Maximum possible reward: 3.5
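
The total above mentions Field Validity and MRZ Consistency components whose code is not reproduced in this post. Purely as an illustration (not the project's actual implementation), a field-validity check could look like the sketch below: it verifies that the required keys are present and that dates follow the DD-MM-YYYY convention used in the dataset.

import json
from datetime import datetime

# Required keys and date format are assumptions based on the dataset structure shown earlier.
REQUIRED_KEYS = [
    "passport_number", "full_name", "gender", "nationality",
    "dob", "place_of_birth", "issue_date", "expire_date", "mrz",
]

def field_validity_reward(model_output: str) -> float:
    """Hypothetical field-validity score: fraction of required keys that are present and well-formed."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0

    def valid(key: str) -> bool:
        value = data.get(key)
        if value in (None, ""):
            return False
        if key in ("dob", "issue_date", "expire_date"):
            try:
                datetime.strptime(value, "%d-%m-%Y")  # dataset dates use DD-MM-YYYY
            except ValueError:
                return False
        return True

    return sum(valid(k) for k in REQUIRED_KEYS) / len(REQUIRED_KEYS)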

GRPO training hyperparameters

Parameter                      Value
learning_rate                  1e-6
alpha                          128
r                              128
weight_decay                   0.1
warmup_ratio                   0.1
lr_scheduler_type              cosine
optim                          paged_adamw_8bit
per_device_train_batch_size    8
gradient_accumulation_steps    1
num_generations                4
max_prompt_length              612
max_completion_length          4096
max_grad_norm                  0.1

Data preparation

Before training, we need to format our dataset properly. Each data sample consists of a user query (passport data) and an assistant response (extracted structured details). We structure the conversation for the model using a helper function:

def make_conversation(example):
    """Prepare conversation data for training."""
    conversation = example["conversations"]

    user_content = next((msg["content"] for msg in conversation if msg["role"] == "user"), "")
    assistant_content = next((msg["content"] for msg in conversation if msg["role"] == "assistant"), "")

    SYSTEM_PROMPT = (
        "A conversation between User and Assistant. The user provides passport data, "
        "and the Assistant extracts relevant details. The Assistant first reasons through "
        "the problem before providing structured output."
    )

    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        "response": assistant_content
    }

We apply this function to the dataset to structure the input for training.
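
As a quick illustration, assuming the dataset is a JSON file with a "conversations" column (the same assumption the training script below makes), the mapping is applied like this; the file name here is hypothetical:

from datasets import load_dataset

dataset = load_dataset("json", data_files="passport_en_grpo.json", split="train")  # hypothetical path
dataset = dataset.map(make_conversation)
print(dataset[0]["prompt"][0]["role"])  # "system"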

Reward functions

GRPO optimizes the model using reward signals. We define two reward functions:

  • Format reward: Ensures the model outputs responses in a structured format.

  • Accuracy reward: Compares the model output with ground truth data.

Format-based reward function

import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets ".*?" span multi-line reasoning inside the <think> block.
    return [1.0 if re.match(pattern, content, re.DOTALL) else 0.0 for content in completion_contents]
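
A quick sanity check of the format reward, with a toy completion structured the way TRL passes conversational completions to reward functions (a list of message lists):

completions = [[{
    "role": "assistant",
    "content": "<think>The MRZ dates match the printed fields.</think>\n"
               '<answer>{"passport_number": "200858064"}</answer>',
}]]
print(format_reward(completions))  # [1.0]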

Accuracy-based reward function

import json

def accuracy_reward(prompts, completions, **kwargs):
    """Reward function that checks if the model output matches the ground truth."""
    rewards = []
    conversations_list = kwargs.get("conversations", [])

    for completion, conversations in zip(completions, conversations_list):
        try:
            # Extract ground truth from assistant responses
            assistant_response = next(
                (msg["content"] for msg in conversations if msg["role"] == "assistant"), ""
            )
            # Extract model output
            model_output = completion[0]["content"]

            # Parse as JSON for comparison
            parsed_solution = json.loads(assistant_response)
            parsed_output = json.loads(model_output)

            # Compare exact match or calculate matching fields ratio
            if parsed_solution == parsed_output:
                rewards.append(1.0)
            else:
                matching_keys = sum(
                    1 for key in parsed_solution if key in parsed_output and parsed_solution[key] == parsed_output[key]
                )
                total_keys = len(parsed_solution)
                rewards.append(matching_keys / total_keys)
        except Exception:
            rewards.append(0.0)  # If JSON parsing fails, return 0
    return rewards
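
And a quick check of the partial-credit behaviour: with one of two fields matching, the reward is 0.5 (the example data below is ours, for illustration only):

conversations = [[
    {"role": "user", "content": "raw passport OCR text ..."},
    {"role": "assistant", "content": '{"passport_number": "200858064", "gender": "M"}'},
]]
completions = [[{"role": "assistant", "content": '{"passport_number": "200858064", "gender": "F"}'}]]
print(accuracy_reward(None, completions, conversations=conversations))  # [0.5]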

Training with GRPO

With the dataset and reward functions ready, we now train the model using GRPO with LoRA fine-tuning.

  • Hardware Setup

The training experiments were conducted on a server equipped with an NVIDIA RTX 3090 GPU. This setup allowed us to fine-tune the model efficiently while managing memory constraints, particularly when using LoRA to reduce VRAM consumption.

  • Training Script

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
import wandb
import torch

def train_model(args):
    """Train the model using GRPO."""
    # Initialize wandb
    wandb.init(project="grpo")

    # Load and preprocess dataset
    dataset = load_dataset("json", data_files=args.dataset, split="train")
    dataset = dataset.map(make_conversation)

    # Split dataset
    train_test_split = dataset.train_test_split(test_size=0.1)
    train_dataset = train_test_split["train"]
    test_dataset = train_test_split["test"]

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id,
        torch_dtype="auto",
        device_map="auto",
    )

    # Configure LoRA
    lora_config = LoraConfig(
        task_type="CAUSAL_LM",
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Configure training arguments
    training_args = GRPOConfig(
        output_dir=args.output_dir,
        learning_rate=args.learning_rate,
        remove_unused_columns=False,
        gradient_accumulation_steps=args.batch_size,
        num_train_epochs=args.epochs,
        bf16=True,
        max_completion_length=64,
        num_generations=4,
        max_prompt_length=128,
        report_to=["tensorboard"],
        logging_steps=10,
        push_to_hub=True,
        save_strategy="steps",
        save_steps=10,
    )

    # Initialize and train GRPO Trainer
    trainer = GRPOTrainer(
        model=model,
        reward_funcs=[format_reward, accuracy_reward],
        args=training_args,
        train_dataset=train_dataset
    )

    trainer.train()
    trainer.save_model(training_args.output_dir)
    trainer.push_to_hub(dataset_name="passport_en_grpo")
This script:

  • Loads and processes the dataset

  • Applies LoRA for efficient fine-tuning

  • Trains the model using GRPO with the defined reward functions

  • Saves and uploads the trained model
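
For completeness, a minimal entry point for calling train_model could look like the sketch below; the argument names mirror the attributes the script reads (model_id, dataset, output_dir, learning_rate, batch_size, epochs), but the actual CLI in the repository may differ.

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="GRPO fine-tuning for passport extraction")
    parser.add_argument("--model_id", default="Qwen/Qwen2.5-1.5B-Instruct")
    parser.add_argument("--dataset", default="passport_en_grpo.json")  # hypothetical dataset path
    parser.add_argument("--output_dir", default="qwen2.5-1.5b-grpo-passport")
    parser.add_argument("--learning_rate", type=float, default=1e-6)
    parser.add_argument("--batch_size", type=int, default=1)  # used as gradient_accumulation_steps above
    parser.add_argument("--epochs", type=int, default=1)
    args = parser.parse_args()

    train_model(args)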

Results

Supervised Fine-Tuning

We used llama-factory to fine-tune the model with SFT. Qwen2.5-1.5B_SFT_Lora significantly outperformed the base instruct model, improving accuracy from 66.58% to 87.62%. For comparison, the 14B instruct model reached only 70% accuracy, indicating that fine-tuning on the target task yields a larger gain than simply scaling up model size. Certain structured fields, particularly Machine Readable Zone (MRZ) data, remained prone to extraction errors.

GRPO Post-Training

Applying GRPO after SFT further improved accuracy to 92.38%. The reinforcement learning approach helped the model handle structured fields more effectively. Notably, fields like MRZ, which previously had errors in the SFT_Lora model, were extracted with higher accuracy in the GRPO_SFT_Lora variant.

These results highlight the benefit of reinforcement learning in structured data extraction, especially for fields with strict formatting rules like MRZ.

Results of supervised fine-tuning and GRPO post-training

However, despite these improvements, some extraction errors still persist. One possible reason is the quality of the "think" data generated using the free Gemini API, which may not always provide highly accurate or structured reasoning sequences. A potential approach to further enhance performance is to improve the quality of the "think" data by using a more reliable data-generation method or a higher-quality reasoning API.

Key takeaways

  • GRPO without a cold start is challenging: Starting GRPO training from a base model (without SFT) proved difficult. The model struggled to learn the task and exhibited "overthinking" behavior, exceeding the context length.

  • SFT provides a strong foundation: Initializing with an SFT model significantly accelerated learning and improved the effectiveness of GRPO. The model quickly converged and produced usable results.

  • LoRA works well: Using LoRA was sufficient to achieve good results on this task, demonstrating its potential for resource-constrained scenarios.

  • Reward function design is crucial: The carefully designed reward function played a critical role in shaping the model's behavior, guiding it toward correct solutions and proper formatting.

Conclusion

We fine-tuned a reasoning model for passport data extraction using a two-stage approach:

  • SFT for baseline accuracy.

  • GRPO for reasoning and structured output.

By designing a structured dataset and a reward-based GRPO system, we improved accuracy, consistency, and reasoning abilities. The GRPO-enhanced model demonstrated better handling of structured fields, particularly MRZ, compared to SFT alone.

However, some fields still contain extraction errors, even after GRPO fine-tuning. To further enhance performance, we need to refine the reward function and explore more advanced optimization techniques.

Phong Ngo
I'm an AI Engineer with 5 years of experience working on everything from machine learning to natural language processing. I love tackling complex problems and finding ways to apply AI in real-world scenarios. I'm always looking for new challenges and enjoy sharing what I’ve learned along the way.