AWS Rekognition & Textract: Vision AI Services
What Problems AWS Rekognition & Textract Solve
AWS Rekognition and Textract eliminate the complexity of building custom computer vision models by providing pre-trained APIs for common vision tasks—image analysis, video analysis, and document extraction.
Traditional computer vision challenges:
- Building custom ML models for face detection, object recognition, or text extraction requires months of data labeling and model training
- Maintaining accuracy as new object types or document formats appear requires continuous retraining
- Scaling inference infrastructure to handle millions of images/videos requires managing GPU clusters
- Integrating vision capabilities into applications requires ML expertise that most development teams lack
Concrete scenario: Your e-commerce platform needs to moderate user-uploaded product images (detect inappropriate content), extract product attributes from photos (brand logos, colors, text), and verify seller identity documents (driver’s licenses, passports). Building custom models would require hiring ML engineers, labeling 100,000+ images, training models on GPU infrastructure, and maintaining inference servers. Estimated cost: $500,000 first year (team + infrastructure). Estimated timeline: 6-12 months to production.
What Rekognition & Textract provide: Pre-trained APIs you call with one line of code. Rekognition detects objects, faces, text, moderation labels, and celebrities in images/videos. Textract extracts text, forms, and tables from documents with understanding of layout and relationships. No ML expertise required. Pay per image/document processed.
Real-world impact: After adopting Rekognition and Textract, image moderation became automated (block images flagged as explicit/suggestive). Product attributes extracted automatically (brand logos detected at 95% accuracy). Identity verification reduced from 2-day manual review to 5-minute automated extraction + human verification of extracted data. Total cost: $800/month (processing 100,000 images + 5,000 documents). Time to production: 2 weeks of integration work.
AWS Rekognition
What it is: Managed computer vision service that analyzes images and videos to detect objects, faces, text, activities, and inappropriate content.
Core Capabilities
1. Object and Scene Detection
Identify objects, scenes, activities, and concepts in images.
Example use cases:
- E-commerce: Detect product types in user uploads (“this is a shoe / sneaker / footwear”; brand- or SKU-specific detection requires Rekognition Custom Labels)
- Content moderation: Flag images containing weapons, alcohol, or drugs
- Asset management: Automatically tag photos by content (beach, sunset, people, cars)
API call:
import boto3
rekognition = boto3.client('rekognition')
response = rekognition.detect_labels(
Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'product.jpg'}},
MaxLabels=10,
MinConfidence=90
)
for label in response['Labels']:
print(f"{label['Name']}: {label['Confidence']:.2f}%")
# Output: Shoe: 98.5%, Sneaker: 97.2%, Footwear: 99.1%
Confidence scores: Rekognition returns confidence percentage for each label. Use thresholds to filter low-confidence results (recommended: 90%+ for production use).
2. Face Detection and Analysis
Detect faces and analyze attributes (age range, gender, emotions, facial features).
Example use cases:
- Photo apps: Suggest photo tags based on detected faces
- Security: Count people entering building via camera feed
- Marketing: Analyze customer demographics at retail locations
API call:
response = rekognition.detect_faces(
Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'crowd.jpg'}},
Attributes=['ALL'] # Include age, gender, emotions, quality
)
for face in response['FaceDetails']:
print(f"Age: {face['AgeRange']['Low']}-{face['AgeRange']['High']}")
print(f"Gender: {face['Gender']['Value']} ({face['Gender']['Confidence']:.1f}%)")
print(f"Emotions: {face['Emotions'][0]['Type']} ({face['Emotions'][0]['Confidence']:.1f}%)")
# Output: Age: 25-35, Gender: Male (98.2%), Emotions: HAPPY (95.7%)
3. Face Comparison and Search
Compare faces to verify identity or search for specific person across image collection.
Example use cases:
- Identity verification: Compare selfie to government ID photo
- Security: Search for person of interest across surveillance footage
- Social media: Find all photos containing specific person
Face comparison:
response = rekognition.compare_faces(
SourceImage={'S3Object': {'Bucket': 'my-bucket', 'Name': 'selfie.jpg'}},
TargetImage={'S3Object': {'Bucket': 'my-bucket', 'Name': 'id-photo.jpg'}},
SimilarityThreshold=90
)
if response['FaceMatches']:
similarity = response['FaceMatches'][0]['Similarity']
print(f"Faces match with {similarity:.2f}% confidence")
else:
print("Faces do not match")
Face collection (index faces for search):
# Create face collection
rekognition.create_collection(CollectionId='employees')
# Index face
rekognition.index_faces(
CollectionId='employees',
Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'employee-123.jpg'}},
ExternalImageId='employee-123',
MaxFaces=1
)
# Search for face in collection
response = rekognition.search_faces_by_image(
CollectionId='employees',
Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'camera-feed.jpg'}},
MaxFaces=1,
FaceMatchThreshold=90
)
if response['FaceMatches']:
match = response['FaceMatches'][0]
print(f"Match found: {match['Face']['ExternalImageId']} ({match['Similarity']:.2f}%)")
4. Text Detection (OCR)
Extract text from images (street signs, product labels, license plates).
Example use cases:
- Inventory management: Read product serial numbers from photos
- License plate recognition: Extract plate numbers from parking lot cameras
- Document digitization: Extract text from scanned forms
API call:
response = rekognition.detect_text(
Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'sign.jpg'}}
)
for text in response['TextDetections']:
if text['Type'] == 'LINE': # Get full lines, not individual words
print(f"{text['DetectedText']} ({text['Confidence']:.2f}%)")
# Output: STOP (99.8%), ONE WAY (98.5%)
5. Content Moderation
Detect inappropriate content (explicit, suggestive, violent, visually disturbing).
Example use cases:
- Social media: Auto-flag user uploads for review
- Marketplace: Block listings with inappropriate product images
- Dating apps: Filter profile photos containing nudity
API call:
response = rekognition.detect_moderation_labels(
Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'user-upload.jpg'}},
MinConfidence=75
)
if response['ModerationLabels']:
print("Image flagged for moderation:")
for label in response['ModerationLabels']:
print(f"- {label['Name']} ({label['Confidence']:.2f}%)")
# Output: Explicit Nudity (92.5%), Graphic Violence (88.3%)
else:
print("Image passed moderation")
Moderation categories: Explicit Nudity, Suggestive, Violence, Visually Disturbing, Rude Gestures, Drugs, Tobacco, Alcohol, Gambling, Hate Symbols.
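Each returned label also carries a ParentName that groups it under its top-level category, which makes policy enforcement straightforward. A minimal sketch (the blocklist and review list here are illustrative policy choices, not AWS-defined):
BLOCKED = {'Explicit Nudity', 'Violence', 'Hate Symbols'}
REVIEW = {'Suggestive', 'Alcohol', 'Gambling'}
def moderate(labels):
    # Map each label to its top-level category (ParentName is empty for top-level labels)
    categories = {label.get('ParentName') or label['Name'] for label in labels}
    if categories & BLOCKED:
        return 'block'
    if categories & REVIEW:
        return 'review'
    return 'allow'
action = moderate(response['ModerationLabels'])  # response from the call above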
6. Video Analysis
Analyze videos to detect objects, faces, text, activities, and moderation labels over time.
Example use cases:
- Security: Detect people entering restricted areas
- Sports analytics: Track ball movement and player positions
- Content moderation: Flag inappropriate segments in uploaded videos
Start video analysis job:
response = rekognition.start_label_detection(
Video={'S3Object': {'Bucket': 'my-bucket', 'Name': 'security-footage.mp4'}},
MinConfidence=90
)
job_id = response['JobId']
Poll for results:
import time
while True:
response = rekognition.get_label_detection(JobId=job_id)
status = response['JobStatus']
if status == 'SUCCEEDED':
        # Results may be paginated; follow NextToken to collect all labels for long videos
        for label in response['Labels']:
timestamp = label['Timestamp'] # Milliseconds from video start
print(f"At {timestamp}ms: {label['Label']['Name']} ({label['Label']['Confidence']:.2f}%)")
break
elif status == 'FAILED':
print(f"Job failed: {response['StatusMessage']}")
break
time.sleep(5) # Check every 5 seconds
Video analysis is asynchronous: Jobs can take minutes to hours depending on video length. Use SNS notifications instead of polling for production use.
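Routing completion events through SNS removes the polling loop entirely. A sketch of the notification setup (the topic and role ARNs are placeholders; the IAM role must allow Rekognition to publish to the topic):
response = rekognition.start_label_detection(
    Video={'S3Object': {'Bucket': 'my-bucket', 'Name': 'security-footage.mp4'}},
    MinConfidence=90,
    NotificationChannel={
        'SNSTopicArn': 'arn:aws:sns:us-east-1:123456789012:video-analysis-done',
        'RoleArn': 'arn:aws:iam::123456789012:role/rekognition-sns-publish'
    }
)
# A Lambda subscribed to the topic receives the JobId and status in the message,
# then calls get_label_detection(JobId=...) to fetch the results.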
AWS Textract
What it is: Managed OCR service that extracts text, forms, and tables from documents with understanding of layout and relationships.
Core Capabilities
1. Text Detection
Extract raw text from documents (similar to Rekognition OCR but optimized for documents).
Example use cases:
- Invoice processing: Extract invoice number, date, amount
- Contract analysis: Extract key terms and clauses
- Form digitization: Convert paper forms to digital text
API call:
import boto3
textract = boto3.client('textract')
response = textract.detect_document_text(
Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'invoice.pdf'}}
)
for block in response['Blocks']:
if block['BlockType'] == 'LINE':
print(block['Text'])
# Output: Invoice #12345, Date: 2024-11-15, Amount: $1,234.56
2. Forms Extraction (Key-Value Pairs)
Extract form fields and their values (e.g., “Name: John Doe”, “Date: 2024-11-15”).
Example use cases:
- Government forms: Extract data from tax forms, applications, permits
- Medical records: Extract patient information from intake forms
- Loan applications: Extract applicant details from mortgage forms
API call:
response = textract.analyze_document(
Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'form.pdf'}},
FeatureTypes=['FORMS']
)
# Extract key-value pairs
key_map = {}
value_map = {}
block_map = {}
for block in response['Blocks']:
block_map[block['Id']] = block
if block['BlockType'] == 'KEY_VALUE_SET':
if 'KEY' in block['EntityTypes']:
key_map[block['Id']] = block
else:
value_map[block['Id']] = block
# Get key-value relationships
for key_id, key_block in key_map.items():
value_block = find_value(key_block, value_map, block_map)
key_text = get_text(key_block, block_map)
value_text = get_text(value_block, block_map) if value_block else ""
print(f"{key_text}: {value_text}")
# Output: Name: John Doe, SSN: ***-**-1234, Date of Birth: 01/15/1990
Note: Helper functions find_value() and get_text() traverse Textract’s relationship graph to extract text.
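A minimal sketch of those helpers, following the block relationships Textract returns (KEY_VALUE_SET blocks point to their values via VALUE relationships and to their words via CHILD relationships):
def find_value(key_block, value_map, block_map):
    # Follow the key block's VALUE relationship to its value block
    for rel in key_block.get('Relationships', []):
        if rel['Type'] == 'VALUE':
            for value_id in rel['Ids']:
                return value_map.get(value_id)
    return None
def get_text(block, block_map):
    # Concatenate a block's WORD children (and selected checkboxes) into a string
    if block is None:
        return ''
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = block_map[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
                elif child['BlockType'] == 'SELECTION_ELEMENT' and child.get('SelectionStatus') == 'SELECTED':
                    words.append('X')
    return ' '.join(words)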
3. Tables Extraction
Extract tables with rows, columns, and cell values preserved.
Example use cases:
- Financial statements: Extract line items from balance sheets
- Purchase orders: Extract product SKUs, quantities, prices
- Lab results: Extract test names and values from medical reports
API call:
response = textract.analyze_document(
Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'statement.pdf'}},
FeatureTypes=['TABLES']
)
tables = []
for block in response['Blocks']:
if block['BlockType'] == 'TABLE':
table = extract_table(block, response['Blocks'])
tables.append(table)
# Example extracted table:
# [
# ['SKU', 'Product', 'Quantity', 'Price'],
# ['ABC123', 'Widget', '10', '$25.00'],
# ['DEF456', 'Gadget', '5', '$50.00']
# ]
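The extract_table() helper above is not part of the API; one way to implement it is to walk the TABLE block's CELL children and place each cell by its RowIndex/ColumnIndex (this sketch reuses get_text() from the forms example):
def extract_table(table_block, blocks):
    block_map = {b['Id']: b for b in blocks}
    cells = {}
    for rel in table_block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for cell_id in rel['Ids']:
                cell = block_map[cell_id]
                if cell['BlockType'] == 'CELL':
                    # RowIndex/ColumnIndex are 1-based
                    cells[(cell['RowIndex'], cell['ColumnIndex'])] = get_text(cell, block_map)
    if not cells:
        return []
    max_row = max(r for r, _ in cells)
    max_col = max(c for _, c in cells)
    return [[cells.get((r, c), '') for c in range(1, max_col + 1)]
            for r in range(1, max_row + 1)]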
4. Queries (Textract Queries)
Ask natural language questions about document content.
Example use cases:
- Invoices: “What is the total amount due?”
- Contracts: “What is the contract end date?”
- Receipts: “What is the vendor name?”
API call:
response = textract.analyze_document(
Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'invoice.pdf'}},
FeatureTypes=['QUERIES'],
QueriesConfig={
'Queries': [
{'Text': 'What is the invoice number?'},
{'Text': 'What is the total amount?'},
{'Text': 'What is the due date?'}
]
}
)
for block in response['Blocks']:
if block['BlockType'] == 'QUERY_RESULT':
query_text = block['Query']['Text']
answer_text = block['Text']
confidence = block['Confidence']
print(f"Q: {query_text}")
print(f"A: {answer_text} ({confidence:.2f}%)")
# Q: What is the invoice number?
# A: INV-2024-11-15-001 (98.5%)
5. Identity Documents (AnalyzeID)
Extract data from government IDs (driver’s licenses, passports).
Example use cases:
- KYC verification: Extract customer identity information
- Age verification: Confirm date of birth from ID
- Address verification: Extract residential address
API call:
response = textract.analyze_id(
DocumentPages=[{
'S3Object': {'Bucket': 'my-bucket', 'Name': 'drivers-license.jpg'}
}]
)
for doc in response['IdentityDocuments']:
for field in doc['IdentityDocumentFields']:
print(f"{field['Type']['Text']}: {field['ValueDetection']['Text']}")
# Output: FIRST_NAME: John, LAST_NAME: Doe, DATE_OF_BIRTH: 01/15/1990,
# DOCUMENT_NUMBER: D1234567, EXPIRATION_DATE: 01/15/2028
Pricing and Cost Optimization
Note: the figures below are representative list prices; AWS pricing varies by region and changes over time, so verify current rates before budgeting.
Rekognition Pricing
Image analysis: $1.00 per 1,000 images (first 1 million/month), then $0.80 per 1,000
Example costs:
- 100,000 images/month: 100 × $1.00 = $100/month
- 5 million images/month: 1,000 × $1.00 + 4,000 × $0.80 = $1,000 + $3,200 = $4,200/month
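As a sanity check on the tiered math, a throwaway estimator using the rates listed above:
def rekognition_image_cost(images_per_month):
    # First 1M images at $1.00 per 1,000, remainder at $0.80 per 1,000
    first_tier = min(images_per_month, 1_000_000)
    remainder = max(images_per_month - 1_000_000, 0)
    return first_tier / 1000 * 1.00 + remainder / 1000 * 0.80
print(rekognition_image_cost(5_000_000))  # 4200.0, matching the example above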
Video analysis: $0.10 per minute of video processed
Example: 1,000 hours video/month = 60,000 minutes × $0.10 = $6,000/month
Face collections: $0.01 per 1,000 face vectors stored per month
Example: Store 1 million faces = 1,000 × $0.01 = $10/month
Textract Pricing
Text detection: $1.50 per 1,000 pages (first 1 million/month)
Forms/tables extraction: $50.00 per 1,000 pages (first 1 million/month)
Queries: $1.00 per 1,000 document pages + $1.00 per 1,000 query pages
Example costs:
- 10,000 pages text detection only: 10 × $1.50 = $15/month
- 10,000 pages forms extraction: 10 × $50.00 = $500/month
- 10,000 pages with 3 queries each: 10 × $50 + 30 × $1.00 = $500 + $30 = $530/month
Cost Optimization Strategies
1. Batch processing instead of real-time
Process documents/images in batches overnight instead of on-demand to reduce API call volume.
2. Cache results
Store extraction results in DynamoDB/S3 to avoid reprocessing the same documents (see the caching sketch after this list).
3. Use appropriate feature detection
Don’t request FORMS and TABLES extraction if you only need text. Text detection costs $1.50 per 1,000 pages vs $50 for forms/tables.
4. Implement confidence thresholds
Filter low-confidence results client-side to avoid manual review costs.
5. Pre-filter images
Use image metadata (size, format, EXIF) to skip processing of irrelevant images.
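A sketch of the caching strategy from item 2, keyed on a content hash so identical images are processed only once (the vision-cache table name is hypothetical; its partition key is content_hash):
import hashlib
import json
import boto3
dynamodb = boto3.resource('dynamodb')
cache_table = dynamodb.Table('vision-cache')
def detect_labels_cached(image_bytes):
    content_hash = hashlib.sha256(image_bytes).hexdigest()
    cached = cache_table.get_item(Key={'content_hash': content_hash})
    if 'Item' in cached:
        return json.loads(cached['Item']['labels'])  # cache hit: no API call
    rekognition = boto3.client('rekognition')
    response = rekognition.detect_labels(Image={'Bytes': image_bytes}, MinConfidence=90)
    cache_table.put_item(Item={
        'content_hash': content_hash,
        'labels': json.dumps(response['Labels'])  # store as a JSON string to avoid Decimal conversion
    })
    return response['Labels']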
Integration Patterns
Pattern 1: Serverless Document Processing Pipeline
Use case: Process uploaded documents asynchronously.
Architecture:
S3 Upload → S3 Event → Lambda (Textract) → DynamoDB (Results) → SNS (Notification)
Lambda function:
import boto3
import json
import time
s3 = boto3.client('s3')
textract = boto3.client('textract')
dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')
def lambda_handler(event, context):
# Get S3 object from event
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
    # Process with Textract (synchronous; single-page documents only. Use start_document_analysis for multipage PDFs)
response = textract.analyze_document(
Document={'S3Object': {'Bucket': bucket, 'Name': key}},
FeatureTypes=['FORMS', 'TABLES']
)
# Extract data (simplified)
extracted_data = parse_textract_response(response)
# Store in DynamoDB
table = dynamodb.Table('document-results')
table.put_item(Item={
'document_id': key,
'extracted_data': extracted_data,
'timestamp': int(time.time())
})
# Notify completion
sns.publish(
TopicArn='arn:aws:sns:us-east-1:123456789012:document-processed',
Message=json.dumps({'document_id': key, 'status': 'complete'})
)
return {'statusCode': 200}
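The parse_textract_response() helper is application-specific; a minimal placeholder might just collect the detected lines, with the key-value and table traversal from earlier sections layered on as needed:
def parse_textract_response(response):
    # Minimal sketch: return detected text lines; extend with the forms/tables helpers above
    return [b['Text'] for b in response['Blocks'] if b['BlockType'] == 'LINE']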
Pattern 2: Real-Time Image Moderation
Use case: Block inappropriate user uploads immediately.
Architecture:
Upload → API Gateway → Lambda (Rekognition) → S3 (if approved) / Reject
Lambda function:
import boto3
import base64
import json
import uuid
rekognition = boto3.client('rekognition')
s3 = boto3.client('s3')
def lambda_handler(event, context):
# Get image from API Gateway request
image_data = base64.b64decode(event['body'])
# Check content moderation
response = rekognition.detect_moderation_labels(
Image={'Bytes': image_data},
MinConfidence=75
)
# Block if inappropriate content detected
if response['ModerationLabels']:
return {
'statusCode': 400,
'body': json.dumps({
'error': 'Image contains inappropriate content',
'labels': [label['Name'] for label in response['ModerationLabels']]
})
}
# Upload to S3 if approved
image_id = str(uuid.uuid4())
s3.put_object(
Bucket='user-uploads',
Key=f'images/{image_id}.jpg',
Body=image_data
)
return {
'statusCode': 200,
'body': json.dumps({'image_id': image_id})
}
Pattern 3: Identity Verification Workflow
Use case: Verify customer identity with government ID.
Workflow:
- User uploads ID photo
- Textract AnalyzeID extracts name, DOB, address
- Compare extracted data to user-provided registration data
- Flag mismatches for manual review
Implementation:
def verify_identity(id_image_s3_key, user_data):
textract = boto3.client('textract')
# Extract ID data
response = textract.analyze_id(
DocumentPages=[{
'S3Object': {'Bucket': 'id-uploads', 'Name': id_image_s3_key}
}]
)
# Parse extracted data
extracted = {}
for doc in response['IdentityDocuments']:
for field in doc['IdentityDocumentFields']:
field_type = field['Type']['Text']
field_value = field['ValueDetection']['Text']
extracted[field_type] = field_value
    # Compare with user-provided data (exact string match for brevity; normalize case and date formats in production)
matches = {
'name': extracted.get('FIRST_NAME') == user_data['first_name'],
'dob': extracted.get('DATE_OF_BIRTH') == user_data['dob'],
'address': extracted.get('ADDRESS') == user_data['address']
}
# Return verification result
if all(matches.values()):
return {'status': 'verified', 'confidence': 'high'}
elif any(matches.values()):
return {'status': 'partial', 'mismatches': [k for k, v in matches.items() if not v]}
else:
return {'status': 'failed', 'reason': 'no_matches'}
When to Use Rekognition & Textract
Use Rekognition when:
- ✅ Need pre-built computer vision (object detection, face recognition, moderation)
- ✅ Want to avoid custom ML model development (months of work, ML expertise required)
- ✅ Processing images/videos at scale (thousands to millions per day)
- ✅ Integration simplicity prioritized (REST API vs managing inference infrastructure)
Use Textract when:
- ✅ Extracting text from documents (PDFs, scans, photos of documents)
- ✅ Need structured extraction (forms, tables, key-value pairs)
- ✅ Processing government IDs, invoices, receipts, contracts
- ✅ Want higher accuracy than open-source OCR (Tesseract) without training
Consider alternatives when:
- ❌ Need custom models for domain-specific objects → SageMaker for training custom models
- ❌ Extremely high volume, cost-sensitive → Self-hosted open-source (Tesseract OCR, OpenCV) if you can manage infrastructure
- ❌ Real-time video processing at edge → AWS Panorama or edge ML (Greengrass + local models)
- ❌ Simple text extraction from clean PDFs → Open-source PDF libraries (PyPDF2, pdfplumber) much cheaper
Common Pitfalls
Processing High-Volume Images Without Caching
Symptom: Processing same product images repeatedly (e.g., thumbnail generation triggers Rekognition on every page load).
Cost impact: 1 million image loads/month × $1/1,000 = $1,000/month wasted.
Solution: Cache Rekognition results in DynamoDB or S3. Check cache before calling API.
Not Filtering by Confidence Score
Symptom: Low-confidence labels cause incorrect application logic (detecting “dog” at 45% confidence when image is actually a cat).
Solution: Set minimum confidence threshold (90%+ for production use).
labels = [l for l in response['Labels'] if l['Confidence'] >= 90]
Using Textract Forms Extraction for Simple Text
Symptom: Paying $50/1,000 pages for forms extraction when only need plain text ($1.50/1,000 pages).
Solution: Use detect_document_text() instead of analyze_document() with FORMS feature if you don’t need key-value extraction.
Synchronous Processing of Large Videos
Symptom: Lambda timeout (15 minutes) when processing hour-long videos synchronously.
Solution: Use asynchronous video analysis APIs with SNS notifications. Don’t poll in Lambda.
Key Takeaways
Rekognition and Textract eliminate months of custom ML development by providing pre-trained APIs for common computer vision tasks. Call the API with an image or document, get structured results in seconds without managing ML infrastructure.
Cost scales with usage volume. Rekognition costs $1 per 1,000 images, Textract costs $1.50-$50 per 1,000 pages depending on features. Cache results to avoid reprocessing, filter by confidence scores to reduce downstream manual review costs.
Use Rekognition for images/videos (object detection, face recognition, moderation, text in images). Use Textract for documents (PDFs, scans, forms, tables, IDs). Don’t use Rekognition for document OCR—Textract is optimized for documents and provides structured extraction.
Integration patterns are serverless-first. S3 triggers Lambda on upload, Lambda calls Rekognition/Textract, stores results in DynamoDB, notifies via SNS. For real-time use cases (image moderation), call APIs synchronously from API Gateway + Lambda.
Set confidence thresholds (90%+) to filter low-quality results. Don’t assume 100% accuracy—use confidence scores to route uncertain cases to human review.
Choose between Rekognition/Textract vs SageMaker based on customization needs. Use Rekognition/Textract for standard use cases (face detection works out-of-box). Use SageMaker when you need custom models (detect specific product defects unique to your manufacturing process).
Video analysis is asynchronous and can take minutes to hours. Don’t poll synchronously—use SNS notifications to trigger downstream processing when video analysis completes.
Textract Queries provide natural language interface to document extraction. Instead of parsing complex JSON responses to find specific fields, ask “What is the invoice total?” and get the answer directly with confidence score.