Extracting structured data from passports isn’t just about OCR—it’s about reasoning. Traditional OCR methods struggle with formatting inconsistencies, multilingual text, and real-world variations, but fine-tuning a model with GRPO enhances contextual understanding, improving accuracy and adaptability. This blog helps developers fine-tune a reasoning model with GRPO to optimize passport data extraction, covering key challenges, implementation techniques, and best practices.
Fine-tuning a language model isn’t just about feeding it data and hoping for the best. If you’re extracting structured data—like passport details—you need a model that reasons through the problem, not one that just memorizes patterns. That’s where Group Relative Policy Optimization (GRPO) comes in.
In this post, we’ll walk through fine-tuning a reasoning model for passport data extraction using GRPO. We’ll start with Supervised Fine-Tuning (SFT) and then refine it using reinforcement learning (RL) to improve accuracy and reasoning.
We’ll use:
Base Model: Qwen/Qwen2.5-1.5B-Instruct
Dataset: Custom Passport EN dataset
Training Method: SFT + GRPO
All code is available on GitHub.
Supervised fine-tuning (SFT) is effective for training a baseline model, but it struggles with generalization. When extracting structured data, slight variations in input format can lead to errors. Standard SFT lacks the adaptive reasoning needed to handle these cases effectively.
This is where GRPO improves the model. The DeepSeekMath paper introduces GRPO as an RL post-training technique designed to enhance reasoning skills in large language models (LLMs). Unlike traditional heuristic-based search methods, GRPO relies solely on RL for optimization, helping the model generalize better to unseen input variations.
GRPO has been used in DeepSeek-R1, and its training approach appears similar to the methods used in OpenAI o1 and o3 models, though exact details are unconfirmed. The Hugging Face Science team is working to reproduce the DeepSeek-R1 training process in their Open-R1 project, which is worth exploring for more insights.
We’ll implement GRPO using the TRL (Transformer Reinforcement Learning) library and focus on improving the extraction of structured data from passports.
To understand GRPO, let’s break it down with an example before diving into the technical details.
GRPO helps a model learn by comparing different actions in groups and making controlled updates. Instead of updating the model after every single observation, it collects multiple observations before adjusting its strategy, similar to mini-batch gradient updates in deep learning. As a concrete example, imagine a robot that must reach a goal and can choose between three paths: A, B, and C. GRPO-style learning proceeds as follows:
1. Sampling different paths: The robot tries out each path multiple times and records the results:
Path A: Reaches the goal 2 out of 3 times.
Path B: Reaches the goal 1 out of 3 times.
Path C: Reaches the goal 3 out of 3 times.
2. Evaluating performance: It calculates the success rate:
Path A → 66.67% success
Path B → 33.33% success
Path C → 100% success
3. Comparing paths: It identifies Path C as the best option but doesn’t ignore the other paths completely.
4. Adjusting strategy: The robot increases the probability of choosing Path C, but it occasionally tries A and B to avoid overfitting to one solution.
5. Controlled updates: Instead of jumping to a 100% preference for Path C, it gradually shifts probabilities while maintaining exploration.
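The arithmetic behind these steps fits in a few lines. The sketch below is illustrative only (it reuses the success rates from the example above) and shows how a group of sampled rewards is turned into group-relative advantages:

# Success rates observed for the sampled group of paths (A, B, C)
group_rewards = [2 / 3, 1 / 3, 3 / 3]

# Group-relative advantage: how much better each sample is than the group average
baseline = sum(group_rewards) / len(group_rewards)
advantages = [r - baseline for r in group_rewards]

print(advantages)  # approximately [0.0, -0.33, 0.33]: C is reinforced, B is discouraged

In practice, GRPO implementations usually also divide by the group's standard deviation to normalize the advantages, but the idea is the same: each sample is scored relative to its own group rather than against an absolute target.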
Now, let’s go through the mathematics behind the GRPO algorithm.
1. Policy and Actions
The policy, denoted as πθ (where θ represents the policy’s trainable parameters), defines the decision-making strategy. For a given state s, the policy outputs a probability distribution over possible actions, written as πθ(a∣s). This represents the likelihood of selecting action a in state s.
The objective is to optimize the policy parameters to maximize the expected cumulative reward J(θ), defined as:

J(θ) = E_τ∼πθ [ Σ_t r(s_t, a_t) ]

where τ = (s_0, a_0, s_1, a_1, …) is a trajectory (a sequence of states and actions), and r(s_t, a_t) is the reward received at time t.
2. Group Sampling
For a given state s, GRPO samples a group of N actions {a_1, a_2, …, a_N} using the current policy πθ. Each action a_i is independently drawn from the policy distribution:

a_i ∼ πθ(⋅ ∣ s), for i = 1, …, N
This creates a set of candidate actions to evaluate.
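In the LLM setting, the "state" is the prompt and an "action" is a whole completion, so group sampling simply means drawing N completions for the same prompt. A minimal sketch with transformers (the prompt is a placeholder; during training, TRL's GRPOTrainer performs this sampling for you via num_generations):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Extract the passport fields from the following OCR text: ..."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a group of N = 4 candidate completions from the current policy
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=256,
    num_return_sequences=4,
)
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)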
3. Reward Scoring
Each sampled action a_i is assessed using a reward function R(a_i), which quantifies the action’s quality. This function might represent the immediate reward r(s, a_i) or a discounted sum of future rewards starting from state s and action a_i, depending on the problem setup.
4. Advantage Calculation
The advantage A(a_i) measures how much better or worse action a_i performs compared to the average performance across the sampled group. It is typically calculated as:

A(a_i) = R(a_i) − (1/N) Σ_{j=1}^{N} R(a_j)

where the second term is the average reward of all N actions. A positive advantage indicates an above-average action, while a negative advantage indicates a below-average one.
5. Policy Update
GRPO adjusts the policy parameters θ to favor actions with positive advantages (increasing their probability) and disfavour actions with negative advantages (decreasing their probability). This is done by optimizing the policy based on the advantage values, typically via gradient-based methods.
6. KL Divergence Constraint
To prevent the updated policy πθ from straying too far from the previous policy, GRPO imposes a Kullback-Leibler (KL) divergence constraint:

D_KL( πθ(⋅ ∣ s) ∥ πθ_old(⋅ ∣ s) ) ≤ ε

where πθ_old is the policy before the update, and ε is a small threshold. This ensures stable and gradual policy improvements.
7. GRPO Objective
The overarching goal of GRPO is to maximize the expected cumulative reward J(θ) while maintaining stability in policy updates. This is achieved by balancing the reward improvement (guided by the advantage) with the KL divergence constraint, resulting in an optimization problem of the form:

maximize over θ:  E[ A(a_i) · log πθ(a_i ∣ s) ]  subject to  D_KL( πθ ∥ πθ_old ) ≤ ε
Just like the robot learns from multiple trials, GRPO enables an LLM to refine its structured data extraction ability by:
Evaluating different extraction strategies for variations in passport formats.
Adjusting model predictions based on relative success rather than absolute correctness.
Ensuring controlled updates, allowing for better generalization.
We needed a diverse and extensive collection of passport samples to build a high-quality dataset for passport information extraction. We designed a structured framework to systematically extract key details while ensuring accuracy and consistency. Raw passport images and text had to be transformed into a structured format that machine learning models could efficiently process. To achieve this, we developed a token-based representation system that encodes essential details in a machine-readable format.
However, we needed to go beyond simple structuring and convert this format into a more intuitive, reasoning-based representation—essentially translating structured data into a "language" the model can process effectively. This system allows us to generate step-by-step navigation sequences, making the representation highly suitable for training models to reason through information extraction tasks.
Inspired by Camel-AI, a multi-agent framework for data generation, we used this approach with the Gemini API to generate the reasoning traces in our dataset (as reflected in our code). The result is a dataset of 400 examples, available here: `dataset_public grpo`
Here's how we structured the dataset:
1. Field extraction: We systematically extract essential passport information, including:
Passport Number
Full Name
Gender
Nationality
Date of Birth
Place of Birth
Date of Issue
Date of Expiry
Place of Issue (if available)
Machine Readable Zone (MRZ)
2. Token-based representation: Instead of relying solely on free-text extraction, we encode passport data into structured tokens, ensuring uniformity across all entries. This transformation makes it easier for models to learn patterns and generalize across different passports.
Entity tokens: Each extracted field is wrapped with a special token (e.g., “passport_number”: “200858064”), making it easier for models to identify and process key details.
Date formatting: Dates are standardized to a single format (DD-MM-YYYY) across all samples.
MRZ encoding: The Machine Readable Zone (MRZ) is preserved as a key feature, allowing the model to cross-validate extracted details.
3. Dataset structure: Each passport entry in the dataset follows this structured format:
{
  "passport_number": "200858064",
  "full_name": "Daniel Warren",
  "gender": "M",
  "nationality": "United Kingdom",
  "dob": "06-05-1945",
  "place_of_birth": "Gloucestershire",
  "place_of_issue": null,
  "issue_date": "12-12-2020",
  "expire_date": "06-05-2032",
  "mrz": "PAGBRDANIEL<<WARREN<<<<<<<<<<<<<<<<<<<<<<<<<\nB1137484W7GBR4505064M3205068200858064<<<<<50"
}
4. Step-by-step extraction process:
The system scans the passport document and detects predefined fields.
Each field is extracted with high-confidence OCR and natural language processing techniques.
The extracted information is formatted into structured JSON, ensuring consistency and completeness.
A verification step cross-checks the MRZ data with the extracted fields to validate correctness (a sketch of this check appears right after this list).
5. Scalability & diversity: Our dataset includes a variety of passports with different layouts, fonts, and security features, ensuring robustness against variations in real-world documents. By leveraging automated pipelines, we generate thousands of high-quality labeled passport samples for model training.
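As a concrete illustration of the verification step from point 4 above, here is a minimal sketch of an MRZ cross-check. In the TD3 (passport booklet) layout defined by ICAO 9303, the second MRZ line carries the birth date at positions 14-19 and the expiry date at positions 22-27 in YYMMDD form. The helper below is a simplified illustration (the century handling in particular is naive), not the pipeline's actual code:

def mrz_date(yymmdd, century):
    """Convert a YYMMDD MRZ date to the dataset's DD-MM-YYYY format.
    The century is supplied by the caller; real MRZ parsing needs a proper century rule."""
    yy, mm, dd = yymmdd[:2], yymmdd[2:4], yymmdd[4:6]
    return f"{dd}-{mm}-{century}{yy}"

def check_mrz_consistency(record):
    """Cross-check extracted dates against the second MRZ line (ICAO 9303 TD3 layout)."""
    line2 = record["mrz"].split("\n")[1]
    birth_ok = mrz_date(line2[13:19], century="19") == record["dob"]           # 450506 -> 06-05-1945
    expiry_ok = mrz_date(line2[21:27], century="20") == record["expire_date"]  # 320506 -> 06-05-2032
    return birth_ok and expiry_ok

For the example record shown above, both checks pass; a production version would also verify the MRZ check digits.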
Goals
Train the model to accurately extract structured passport information from raw OCR text.
Establish a strong baseline for later stages of reinforcement learning or additional fine-tuning.
Methodology
We explored two training strategies:
Direct extraction model: The model is trained to predict all passport fields directly from the raw OCR text.
Step-by-step extraction model: The model is trained to extract each field sequentially, incorporating intermediate reasoning to verify information consistency (e.g., cross-checking MRZ data with extracted fields).
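To make the difference concrete, here is a sketch of what the training target looks like under each strategy. The records are illustrative (a subset of fields from the example above), using the <think>/<answer> convention adopted later for GRPO:

# Strategy 1: direct extraction - the target is the final JSON only
direct_target = '{"passport_number": "200858064", "full_name": "Daniel Warren", "dob": "06-05-1945"}'

# Strategy 2: step-by-step extraction - the target interleaves reasoning with the final answer
step_by_step_target = (
    "<think>The MRZ second line encodes birth date 450506, which matches the printed "
    "date of birth 06-05-1945, so the fields are consistent.</think> "
    '<answer>{"passport_number": "200858064", "full_name": "Daniel Warren", "dob": "06-05-1945"}</answer>'
)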
SFT Hyperparameters:
| Parameter | Value |
| --- | --- |
| model_name_or_path | Qwen/Qwen2.5-1.5B-Instruct |
| disable_gradient_checkpointing | true |
| finetuning_type | lora |
| deepspeed | ds0 |
| cutoff_len | 4096 |
| train_on_prompt | true |
| per_device_train_batch_size | 32 |
| gradient_accumulation_steps | 1 |
| learning_rate | 1.0e-5 |
| num_train_epochs | 1.0 |
| lr_scheduler_type | cosine |
| warmup_ratio | 0.1 |
| bf16 | true |
This structured approach ensures that our passport dataset is optimized for training AI models in document analysis, identity verification, and automated data extraction, making it highly reliable and scalable for real-world applications.
We use llama-factory to run the SFT stage.
After completing SFT, we applied GRPO to refine the model’s extraction capabilities. GRPO helps the model learn from its own generated extractions, iteratively improving accuracy and robustness.
Constructing the GRPO reward function
A well-designed reward function is essential for reinforcement learning. We developed a composite reward function that evaluates multiple aspects of accurate passport extraction:
Format Reward
Purpose: Ensures the model outputs responses in a structured format.
Mechanism: Uses regex to verify that the output follows a predefined structure (e.g., <think>...</think> <answer>...</answer>).
Reward: +1.0 for correctly formatted output, 0.0 otherwise.
Accuracy Reward:
Purpose: Checks if the model's output matches the ground truth.
Mechanism:
i. Compares the model’s JSON output with the labeled data.
ii. If the extracted fields perfectly match the ground truth, the model receives full reward.
iii. If only some fields match, the reward is proportional to the number of correctly extracted fields.
Reward: +1.0 for a perfect match, partial score for partial correctness, 0.0 for invalid JSON.
Total reward calculation
Final reward = Correctness + Field Validity + MRZ Consistency + Structured Formatting
Maximum possible reward: 3.5
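The field-validity and MRZ-consistency components are not shown in the code that follows; as an illustration only (not the function used in training), an MRZ-consistency term could reuse check_mrz_consistency from the dataset section and parse the JSON inside the <answer> tags:

import json
import re

def mrz_consistency_reward(completions, **kwargs):
    """Sketch of an MRZ-consistency reward; the weight given to this component is a design choice."""
    rewards = []
    for completion in completions:
        try:
            content = completion[0]["content"]
            match = re.search(r"<answer>(.*?)</answer>", content, re.DOTALL)
            record = json.loads(match.group(1) if match else content)
            rewards.append(1.0 if check_mrz_consistency(record) else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards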
GRPO training hyperparameters
| Parameter | Value |
| --- | --- |
| learning_rate | 1e-6 |
| alpha | 128 |
| r | 128 |
| weight_decay | 0.1 |
| warmup_ratio | 0.1 |
| lr_scheduler_type | cosine |
| optim | paged_adamw_8bit |
| per_device_train_batch_size | 8 |
| gradient_accumulation_steps | 1 |
| num_generations | 4 |
| max_prompt_length | 612 |
| max_completion_length | 4096 |
| max_grad_norm | 0.1 |
Data preparation
Before training, we need to format our dataset properly. Each data sample consists of a user query (passport data) and an assistant response (extracted structured details). We structure the conversation for the model using a helper function:
def make_conversation(example):
    """Prepare conversation data for training."""
    conversation = example["conversations"]
    user_content = next((msg["content"] for msg in conversation if msg["role"] == "user"), "")
    assistant_content = next((msg["content"] for msg in conversation if msg["role"] == "assistant"), "")

    SYSTEM_PROMPT = (
        "A conversation between User and Assistant. The user provides passport data, "
        "and the Assistant extracts relevant details. The Assistant first reasons through "
        "the problem before providing structured output."
    )

    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        "response": assistant_content,
    }
We apply this function to the dataset to structure the input for training.
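As a quick sanity check (the record and file path below are placeholders; each dataset row is assumed to store a "conversations" list of role/content messages, which is what the function above expects):

from datasets import load_dataset

example = {
    "conversations": [
        {"role": "user", "content": "Passport OCR text ..."},
        {"role": "assistant", "content": '{"passport_number": "200858064"}'},
    ]
}
print(make_conversation(example)["prompt"][1]["content"])  # -> "Passport OCR text ..."

# Applied to the full dataset before training (file path is a placeholder)
dataset = load_dataset("json", data_files="passport_en.json", split="train")
dataset = dataset.map(make_conversation)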
Reward functions
GRPO optimizes the model using reward signals. We define two reward functions:
Format reward: Ensures the model outputs responses in a structured format.
Accuracy reward: Compares the model output with ground truth data.
Format-based reward function
import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, content) else 0.0 for content in completion_contents]
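A quick check on two hypothetical completions (the nested list-of-messages structure mirrors what the reward functions above expect to receive):

good = [[{"content": '<think>Check the MRZ against the printed fields.</think> <answer>{"passport_number": "200858064"}</answer>'}]]
bad = [[{"content": '{"passport_number": "200858064"}'}]]

print(format_reward(good))  # [1.0] - matches <think>...</think> <answer>...</answer>
print(format_reward(bad))   # [0.0] - missing the required tags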
Accuracy-based reward function
import json

def accuracy_reward(prompts, completions, **kwargs):
    """Reward function that checks if the model output matches the ground truth."""
    rewards = []
    conversations_list = kwargs.get("conversations", [])
    for completion, conversations in zip(completions, conversations_list):
        try:
            # Extract ground truth from assistant responses
            assistant_response = next(
                (msg["content"] for msg in conversations if msg["role"] == "assistant"), ""
            )
            # Extract model output
            model_output = completion[0]["content"]
            # Parse as JSON for comparison
            parsed_solution = json.loads(assistant_response)
            parsed_output = json.loads(model_output)
            # Compare exact match or calculate matching fields ratio
            if parsed_solution == parsed_output:
                rewards.append(1.0)
            else:
                matching_keys = sum(
                    1 for key in parsed_solution
                    if key in parsed_output and parsed_solution[key] == parsed_output[key]
                )
                total_keys = len(parsed_solution)
                rewards.append(matching_keys / total_keys)
        except Exception:
            rewards.append(0.0)  # If JSON parsing fails, return 0
    return rewards
Training with GRPO
With the dataset and reward functions ready, we now train the model using GRPO with LoRA fine-tuning.
Hardware Setup
The training experiments were conducted on a server equipped with an NVIDIA RTX 3090 GPU. This setup allowed us to fine-tune the model efficiently while managing memory constraints, particularly when using LoRA to reduce VRAM consumption.
Training Script
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
import wandb
import torch

def train_model(args):
    """Train the model using GRPO."""
    # Initialize wandb
    wandb.init(project="grpo")

    # Load and preprocess dataset
    dataset = load_dataset("json", data_files=args.dataset, split="train")
    dataset = dataset.map(make_conversation)

    # Split dataset
    train_test_split = dataset.train_test_split(test_size=0.1)
    train_dataset = train_test_split["train"]
    test_dataset = train_test_split["test"]

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id,
        torch_dtype="auto",
        device_map="auto",
    )

    # Configure LoRA
    lora_config = LoraConfig(
        task_type="CAUSAL_LM",
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Configure training arguments
    training_args = GRPOConfig(
        output_dir=args.output_dir,
        learning_rate=args.learning_rate,
        remove_unused_columns=False,
        gradient_accumulation_steps=args.batch_size,
        num_train_epochs=args.epochs,
        bf16=True,
        max_completion_length=64,
        num_generations=4,
        max_prompt_length=128,
        report_to=["tensorboard"],
        logging_steps=10,
        push_to_hub=True,
        save_strategy="steps",
        save_steps=10,
    )

    # Initialize and train GRPO Trainer
    trainer = GRPOTrainer(
        model=model,
        reward_funcs=[format_reward, accuracy_reward],
        args=training_args,
        train_dataset=train_dataset,
    )
    trainer.train()
    trainer.save_model(training_args.output_dir)
    trainer.push_to_hub(dataset_name="passport_en_grpo")
This script:
Loads and processes the dataset
Applies LoRA for efficient fine-tuning
Trains the model using GRPO with the defined reward functions
Saves and uploads the trained model
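After training, the adapter can be sanity-checked with a short inference sketch. Paths and the OCR snippet below are placeholders; the system prompt is the one defined in make_conversation:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_dir = "outputs/grpo-passport"  # placeholder: the output_dir used during training

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_dir)

messages = [
    {"role": "system", "content": (
        "A conversation between User and Assistant. The user provides passport data, "
        "and the Assistant extracts relevant details. The Assistant first reasons through "
        "the problem before providing structured output."
    )},
    {"role": "user", "content": "PASSPORT ... DANIEL WARREN ... GBR ..."},  # placeholder OCR text
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))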
We used llama-factory to fine-tune the model with SFT. The results show that Qwen2.5-1.5B_SFT_Lora significantly outperformed the base instruct model, improving accuracy from 66.58% to 87.62%. For comparison, the much larger 14B instruct model reached only 70% accuracy, indicating that fine-tuning on the target task provides a more substantial performance boost than simply scaling up the base model. Certain structured fields, particularly Machine Readable Zone (MRZ) data, remained prone to extraction errors.
Applying GRPO after SFT further improved accuracy to 92.38%. The reinforcement learning approach helped the model handle structured fields more effectively. Notably, fields like MRZ, which previously had errors in the SFT_Lora model, were extracted with higher accuracy in the GRPO_SFT_Lora variant.
These results highlight the benefit of reinforcement learning in structured data extraction, especially for fields with strict formatting rules like MRZ.
Results of supervised fine-tuning and GRPO post-training
However, despite these improvements, some extraction errors still persist. One possible reason is the quality of the "think" data generated using the free Gemini API, which may not always provide highly accurate or structured reasoning sequences. A potential approach to further enhance performance is to improve the quality of the "think" data by using a more reliable data-generation method or a higher-quality reasoning API.
GRPO without a cold start is challenging: Starting GRPO training from a base model (without SFT) proved difficult. The model struggled to learn the task and exhibited "overthinking" behavior, exceeding the context length.
SFT provides a strong foundation: Initializing with an SFT model significantly accelerated learning and improved the effectiveness of GRPO. The model quickly converged and produced usable results.
LoRA works well: Using LoRA was sufficient to achieve good results on this task, demonstrating its potential for resource-constrained scenarios.
Reward function design is crucial: The carefully designed reward function played a critical role in shaping the model's behavior, guiding it toward correct solutions and proper formatting.
We fine-tuned a reasoning model for passport data extraction using a two-stage approach:
SFT for baseline accuracy.
GRPO for reasoning and structured output.
By designing a structured dataset and a reward-based GRPO system, we improved accuracy, consistency, and reasoning abilities. The GRPO-enhanced model demonstrated better handling of structured fields, particularly MRZ, compared to SFT alone.
However, some fields still contain extraction errors, even after GRPO fine-tuning. To further enhance performance, we need to refine the reward function and explore more advanced optimization techniques.