OCR with Driver's Licenses
Prompt
You are given an image of a driver's license.
Your task is to extract structured data that is clearly visible in the image.
Return ONLY valid JSON.
Do not include explanation.
Do not include markdown.
Do not include commentary.
---------------------------------------
INSTRUCTIONS:
1. Extract only text that is clearly readable.
2. Do NOT guess or infer missing values.
3. If a field is not visible or unreadable, return null.
4. Preserve capitalization exactly as shown.
5. Preserve spacing inside fields where applicable.
6. Do NOT reformat dates unless the format is explicitly clear.
7. Do NOT fabricate data.
8. If the image is not a driver's license, return:
{
  "document_type": "unknown"
}
---------------------------------------
RETURN THIS EXACT JSON STRUCTURE:
{
  "document_type": "drivers_license",
  "issuing_state_or_country": null,
  "license_number": null,
  "first_name": null,
  "middle_name": null,
  "last_name": null,
  "date_of_birth": null,
  "issue_date": null,
  "expiration_date": null,
  "address": {
    "street": null,
    "city": null,
    "state": null,
    "postal_code": null
  },
  "sex": null,
  "height": null,
  "weight": null,
  "eye_color": null,
  "hair_color": null,
  "restrictions": null,
  "endorsements": null,
  "class": null
}
---------------------------------------
If multiple values appear for a field, return the most prominent official value.
Return only the JSON object.
Description
Verifying user identity is an important capability for many businesses, but manually reviewing and typing out customer information from an ID card is slow, expensive, and highly prone to human error. Accurately and automatically extracting this information is therefore vital for a smooth user experience and for business productivity. The Structured OCR task on the Abilities tab acts as a powerful Optical Character Recognition (OCR) tool, reading the text on a document and returning it in a clean, structured format.
For example, if a user uploads a photo of a standard California Driver's License, the model should examine the image and sort the data into specific fields: it will pull "Jane" for First Name, "Sample" for Last Name, and "05/15/1985" for Date of Birth, and isolate the specific DL Number.

In contrast, if a user uploads an image that is severely blurred, cut off, or obscured by harsh glare, the model should flag that the necessary text cannot be confidently extracted, prompting the user to retake the photo.
The expected inputs are images of identification documents, and the expected output is structured text (a JSON object in this example) containing the extracted values for the target fields.
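The retake flow above can be sketched as a small validation layer around the model's reply: parse the JSON, count which required fields came back null, and decide whether to ask the user for a better photo. The choice of required fields and the `max_missing` threshold are assumptions for illustration, not part of the EyePop API.

```python
import json

# Top-level fields from the prompt's schema that we treat as required.
# (Which fields count as "required" is an assumption for this sketch.)
REQUIRED_FIELDS = [
    "license_number", "first_name", "last_name",
    "date_of_birth", "issue_date", "expiration_date",
]

def validate_extraction(raw: str, max_missing: int = 2):
    """Parse the model's JSON reply; return (data, missing_fields, needs_retake)."""
    data = json.loads(raw)
    # Per the prompt, a non-license image comes back as {"document_type": "unknown"}.
    if data.get("document_type") != "drivers_license":
        return data, [], True
    missing = [f for f in REQUIRED_FIELDS if data.get(f) is None]
    # `max_missing` is an illustrative threshold: too many nulls -> ask for a retake.
    return data, missing, len(missing) > max_missing

# Example: a blurry photo where only the name fields were readable.
blurry = ('{"document_type": "drivers_license", "first_name": "Jane", '
          '"last_name": "Sample", "license_number": null, "date_of_birth": null, '
          '"issue_date": null, "expiration_date": null}')
data, missing, retake = validate_extraction(blurry)
if retake:
    print(f"Please retake the photo; unreadable fields: {missing}")
```

Because the prompt instructs the model to return null rather than guess, counting nulls is a cheap proxy for extraction confidence without any extra model calls.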
SDK Tutorial
First, let’s define the ability. Get early access to Abilities here >
# `license_prompt` holds the prompt string shown in the Prompt section above.
ability_prototypes = [
    VlmAbilityCreate(
        name=f"{NAMESPACE_PREFIX}.structured-ocr.OCR-Read-Drivers-License",
        description="Extract relevant information from drivers license",
        worker_release="qwen3-instruct",
        text_prompt=license_prompt,
        transform_into=TransformInto(),
        config=InferRuntimeConfig(
            max_new_tokens=250,
            image_size=512
        ),
        is_public=False
    )
]
The prompt we can use here, assigned to the `license_prompt` variable, is the one shown in full in the Prompt section above.
Next, we create the ability with the following code:
with EyePopSdk.dataEndpoint(api_key=EYEPOP_API_KEY, account_id=EYEPOP_ACCOUNT_ID) as endpoint:
    for ability_prototype in ability_prototypes:
        # Create a group to hold versions of this ability.
        ability_group = endpoint.create_vlm_ability_group(VlmAbilityGroupCreate(
            name=ability_prototype.name,
            description=ability_prototype.description,
            default_alias_name=ability_prototype.name,
        ))
        # Create the ability inside the group.
        ability = endpoint.create_vlm_ability(
            create=ability_prototype,
            vlm_ability_group_uuid=ability_group.uuid,
        )
        # Publish it under its alias name.
        ability = endpoint.publish_vlm_ability(
            vlm_ability_uuid=ability.uuid,
            alias_name=ability_prototype.name,
        )
        # Tag the published version as "latest" so it can be referenced as name:latest.
        ability = endpoint.add_vlm_ability_alias(
            vlm_ability_uuid=ability.uuid,
            alias_name=ability_prototype.name,
            tag_name="latest"
        )
        print(f"created ability {ability.uuid} with alias entries {ability.alias_entries}")
That’s it! To run the prompt against an image, here is some sample evaluation code:
import json
from pathlib import Path

# Compose a Pop that runs the ability we just published, via its "latest" tag.
pop = Pop(components=[
    InferenceComponent(
        ability=f"{NAMESPACE_PREFIX}.structured-ocr.OCR-Read-Drivers-License:latest"
    )
])

with EyePopSdk.workerEndpoint(api_key=EYEPOP_API_KEY) as endpoint:
    endpoint.set_pop(pop)
    sample_img_path = Path("/content/sample_img.jpg")
    job = endpoint.upload(sample_img_path)
    # Drain the job's predictions and pretty-print each JSON result.
    while result := job.predict():
        print(json.dumps(result, indent=2))
    print("Done")
After running the evaluation, you can compare what the model extracted against your source of truth. With this feedback loop, you can refine your prompts and improve accuracy. Get early access to Abilities here >
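That comparison step can be sketched as a field-level accuracy check. The `ground_truth` values and the `field_accuracy` helper below are illustrative, not part of the SDK; in practice the extracted dict would come from the Pop's JSON result.

```python
# Hand-labeled ground truth for one sample image (illustrative values,
# using the flat top-level keys from the prompt's schema).
ground_truth = {
    "first_name": "Jane",
    "last_name": "Sample",
    "date_of_birth": "05/15/1985",
}

# What the model returned for the same image (note the misspelled last name).
extracted = {
    "first_name": "Jane",
    "last_name": "Sampel",
    "date_of_birth": "05/15/1985",
}

def field_accuracy(truth: dict, pred: dict) -> float:
    """Fraction of ground-truth fields the model reproduced exactly."""
    correct = sum(1 for key, value in truth.items() if pred.get(key) == value)
    return correct / len(truth)

print(f"accuracy: {field_accuracy(ground_truth, extracted):.2f}")  # 2 of 3 fields match
```

Exact string matching is deliberately strict: because the prompt forbids reformatting dates or capitalization, any mismatch points at either a model error or an ambiguity worth addressing in the prompt.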