AWS Transcribe & Polly
What Problems AWS Transcribe & Polly Solve
The Speech AI Challenge: Organizations need to convert speech to text for accessibility, searchability, and analytics. They also need to generate natural-sounding speech from text for voice assistants, content narration, and accessibility features. Building custom speech recognition and synthesis systems requires deep expertise in signal processing, linguistics, and machine learning.
What AWS Provides: AWS offers two complementary services for speech AI:
- AWS Transcribe: Automatic speech recognition (ASR) that converts audio to text
- AWS Polly: Neural text-to-speech (TTS) that converts text to lifelike audio
Both services use deep learning models trained on massive datasets, eliminating the need to build and maintain custom speech models.
AWS Transcribe: Speech-to-Text
Core Capabilities
1. Standard Transcription (Batch Processing)
Process pre-recorded audio files stored in S3:
import boto3
transcribe = boto3.client('transcribe')
# Start transcription job
response = transcribe.start_transcription_job(
TranscriptionJobName='customer-call-123',
LanguageCode='en-US',
MediaFormat='wav',
Media={
'MediaFileUri': 's3://my-bucket/customer-calls/call-123.wav'
},
OutputBucketName='my-transcription-output',
Settings={
'ShowSpeakerLabels': True,
'MaxSpeakerLabels': 2,
'ChannelIdentification': True
}
)
# Check job status
job = transcribe.get_transcription_job(
TranscriptionJobName='customer-call-123'
)
print(f"Status: {job['TranscriptionJob']['TranscriptionJobStatus']}")
# Output location: s3://my-transcription-output/customer-call-123.json
Output format (JSON):
{
"jobName": "customer-call-123",
"results": {
"transcripts": [{
"transcript": "Hello, I need help with my account. Sure, I can help you with that."
}],
"speaker_labels": {
"speakers": 2,
"segments": [
{
"start_time": "0.0",
"end_time": "3.5",
"speaker_label": "spk_0",
"items": [/* word-level timestamps */]
}
]
},
"items": [
{
"start_time": "0.0",
"end_time": "0.5",
"alternatives": [{
"confidence": "0.99",
"content": "Hello"
}],
"type": "pronunciation"
}
]
}
}
2. Real-Time Transcription (Streaming)
Transcribe audio as it’s spoken for live captioning or voice assistants:
import asyncio
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent
class MyEventHandler(TranscriptResultStreamHandler):
async def handle_transcript_event(self, transcript_event: TranscriptEvent):
results = transcript_event.transcript.results
for result in results:
if not result.is_partial:
for alt in result.alternatives:
print(f"Final transcript: {alt.transcript}")
async def transcribe_stream(audio_stream):
client = TranscribeStreamingClient(region="us-east-1")
stream = await client.start_stream_transcription(
language_code="en-US",
media_sample_rate_hz=16000,
media_encoding="pcm",
)
async def write_chunks():
async for chunk in audio_stream:
await stream.input_stream.send_audio_event(audio_chunk=chunk)
await stream.input_stream.end_stream()
handler = MyEventHandler(stream.output_stream)
await asyncio.gather(write_chunks(), handler.handle_events())
# Usage with microphone input
# asyncio.run(transcribe_stream(get_microphone_audio()))
3. Custom Vocabulary
Improve accuracy for domain-specific terms:
# Create custom vocabulary for medical terminology
transcribe.create_vocabulary(
VocabularyName='medical-terms',
LanguageCode='en-US',
    Phrases=[
        'tachycardia',
        'myocardial-infarction',        # multi-word phrases are hyphenated in custom vocabularies
        'cerebrovascular-accident',
        'electrocardiogram'
    ]
)
# Use in transcription job
transcribe.start_transcription_job(
TranscriptionJobName='medical-consult-456',
LanguageCode='en-US',
Media={'MediaFileUri': 's3://my-bucket/consults/456.mp3'},
Settings={
'VocabularyName': 'medical-terms'
}
)
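Note that a newly created vocabulary starts in a PENDING state and can't be referenced until it becomes READY. A minimal polling sketch, reusing the transcribe client from above:
import time

def wait_for_vocabulary(name, max_wait=300):
    """Poll until a custom vocabulary is READY (or fails)."""
    elapsed = 0
    while elapsed < max_wait:
        vocab = transcribe.get_vocabulary(VocabularyName=name)
        state = vocab['VocabularyState']
        if state == 'READY':
            return
        if state == 'FAILED':
            raise RuntimeError(f"Vocabulary failed: {vocab.get('FailureReason', 'Unknown')}")
        time.sleep(10)
        elapsed += 10
    raise TimeoutError(f"Vocabulary '{name}' not ready after {max_wait}s")

wait_for_vocabulary('medical-terms')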
4. Speaker Diarization
Identify and separate different speakers:
transcribe.start_transcription_job(
TranscriptionJobName='meeting-recording',
LanguageCode='en-US',
Media={'MediaFileUri': 's3://my-bucket/meetings/team-sync.mp3'},
Settings={
'ShowSpeakerLabels': True,
'MaxSpeakerLabels': 5 # Up to 10 speakers supported
}
)
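The job above only produces raw output; turning it into readable per-speaker turns means joining the speaker_labels segments with the word-level items by timestamp. A rough sketch, based on the output JSON structure shown earlier:
def transcript_by_speaker(transcript_json):
    """Group Transcribe output into (speaker, text) turns using segment time ranges."""
    segments = transcript_json['results']['speaker_labels']['segments']
    items = transcript_json['results']['items']
    turns = []
    for segment in segments:
        start = float(segment['start_time'])
        end = float(segment['end_time'])
        # Punctuation items carry no timestamps, so only pronunciation items are matched
        words = [
            item['alternatives'][0]['content']
            for item in items
            if item['type'] == 'pronunciation'
            and start <= float(item['start_time']) <= end
        ]
        if words:
            turns.append((segment['speaker_label'], ' '.join(words)))
    return turns

# Example result: [('spk_0', 'Hello I need help with my account'), ('spk_1', 'Sure I can help you with that')]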
5. Channel Identification
Separate audio channels (e.g., customer vs agent in call center):
transcribe.start_transcription_job(
TranscriptionJobName='support-call',
LanguageCode='en-US',
Media={'MediaFileUri': 's3://my-bucket/calls/stereo-call.wav'},
Settings={
'ChannelIdentification': True
}
)
6. Content Redaction (PII Removal)
Automatically redact sensitive information:
transcribe.start_transcription_job(
TranscriptionJobName='compliant-transcription',
LanguageCode='en-US',
Media={'MediaFileUri': 's3://my-bucket/calls/sensitive.mp3'},
ContentRedaction={
'RedactionType': 'PII',
'RedactionOutput': 'redacted', # 'redacted' or 'redacted_and_unredacted'
'PiiEntityTypes': [
'CREDIT_DEBIT_NUMBER',
'SSN',
'EMAIL',
'PHONE',
'NAME',
'ADDRESS'
]
}
)
Output: “My credit card number is [PII] and my SSN is [PII]”
7. Language Identification
Automatically detect the spoken language:
transcribe.start_transcription_job(
TranscriptionJobName='multilingual-audio',
IdentifyLanguage=True,
LanguageOptions=['en-US', 'es-US', 'fr-FR', 'de-DE'],
Media={'MediaFileUri': 's3://my-bucket/audio/unknown-language.mp3'}
)
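Once the job completes, the detected language is reported on the job itself. A small sketch (field names per the GetTranscriptionJob response; worth verifying against current docs):
job = transcribe.get_transcription_job(TranscriptionJobName='multilingual-audio')
result = job['TranscriptionJob']
print(f"Detected language: {result.get('LanguageCode')}")
print(f"Identification confidence: {result.get('IdentifiedLanguageScore')}")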
8. Medical and Call Analytics Specializations
Transcribe Medical (HIPAA-eligible):
transcribe_medical = boto3.client('transcribe')
transcribe_medical.start_medical_transcription_job(
MedicalTranscriptionJobName='patient-visit-789',
LanguageCode='en-US',
MediaFormat='mp3',
Media={'MediaFileUri': 's3://my-bucket/medical/visit-789.mp3'},
OutputBucketName='my-medical-transcripts',
Specialty='PRIMARYCARE', # PRIMARYCARE, CARDIOLOGY, NEUROLOGY, ONCOLOGY, RADIOLOGY, UROLOGY
Type='CONVERSATION' # CONVERSATION or DICTATION
)
Call Analytics (sentiment, talk time, interruptions):
transcribe.start_call_analytics_job(
CallAnalyticsJobName='support-call-analytics',
Media={'MediaFileUri': 's3://my-bucket/calls/support-123.mp3'},
ChannelDefinitions=[
{'ChannelId': 0, 'ParticipantRole': 'AGENT'},
{'ChannelId': 1, 'ParticipantRole': 'CUSTOMER'}
]
)
Output includes:
- Sentiment analysis (positive, negative, neutral) per speaker
- Talk time percentage per speaker
- Interruptions and overtalk
- Non-talk time (silence)
- Loudness scores
- Issue detection and categorization
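As a rough illustration of consuming that output, the sketch below pulls a few conversation-level fields out of the Call Analytics results file. The field names (ConversationCharacteristics, Sentiment, TalkTime, Interruptions) are assumptions about the output schema; check them against an actual results file before relying on them.
import json

def summarize_call_analytics(output_json_path):
    """Read a few conversation-level metrics from a Call Analytics output file.

    Field names below are assumed; verify against a real output file.
    """
    with open(output_json_path) as f:
        result = json.load(f)
    characteristics = result.get('ConversationCharacteristics', {})
    print("Overall sentiment:", characteristics.get('Sentiment', {}).get('OverallSentiment'))
    print("Talk time by participant:", characteristics.get('TalkTime', {}).get('DetailsByParticipant'))
    print("Interruptions:", characteristics.get('Interruptions', {}).get('TotalCount'))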
AWS Polly: Text-to-Speech
Core Capabilities
1. Standard Text-to-Speech
Convert text to natural-sounding audio:
import boto3
polly = boto3.client('polly')
# Synthesize speech
response = polly.synthesize_speech(
Text='Hello! Welcome to our customer support. How can I help you today?',
OutputFormat='mp3', # mp3, ogg_vorbis, pcm, json (for speech marks)
VoiceId='Joanna', # US English female voice
Engine='neural' # 'neural' or 'standard'
)
# Save audio to file
with open('greeting.mp3', 'wb') as f:
f.write(response['AudioStream'].read())
print(f"Characters: {response['RequestCharacters']}")
print(f"Content-Type: {response['ContentType']}")
2. Available Voices
Polly offers multiple voices across languages and genders:
# List available voices
voices = polly.describe_voices(LanguageCode='en-US')
for voice in voices['Voices']:
print(f"{voice['Name']} ({voice['Gender']}) - {voice['LanguageCode']}")
print(f" Engine: {voice['SupportedEngines']}")
Popular voices:
- US English: Joanna (F, neural), Matthew (M, neural), Salli (F, neural)
- British English: Amy (F, neural), Brian (M, neural), Emma (F, neural)
- Australian English: Olivia (F, neural)
- Spanish: Lupe (F, neural, US), Lucia (F, neural, Spain)
- French: Léa (F, neural), Mathieu (M, neural)
Neural voices (higher quality, more natural):
- Supported for major languages (English, Spanish, French, German, Italian, Portuguese, Japanese, Korean)
- More expensive but significantly better quality
- Support for newscaster speaking style
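To see exactly which voices support the neural engine in your region, you can filter the describe_voices output; a minimal sketch:
import boto3

polly = boto3.client('polly')

# Keep only voices that support the neural engine
# (describe_voices may paginate via NextToken for long result sets)
response = polly.describe_voices()
neural_voices = [v for v in response['Voices'] if 'neural' in v.get('SupportedEngines', [])]
for voice in sorted(neural_voices, key=lambda v: v['LanguageCode']):
    print(f"{voice['LanguageCode']}: {voice['Name']} ({voice['Gender']})")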
3. Speech Synthesis Markup Language (SSML)
Control pronunciation, pace, pitch, and pauses:
ssml_text = """
<speak>
<prosody rate="medium" pitch="medium">
Welcome to <emphasis level="strong">Acme Corporation</emphasis>.
</prosody>
<break time="500ms"/>
<prosody rate="slow">
Please listen carefully to the following options.
</prosody>
<break time="300ms"/>
<say-as interpret-as="telephone">1-800-555-1234</say-as>
<break time="500ms"/>
Your account balance is <say-as interpret-as="currency">$1,234.56</say-as>
<break time="300ms"/>
<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
</speak>
"""
response = polly.synthesize_speech(
Text=ssml_text,
TextType='ssml', # 'text' or 'ssml'
OutputFormat='mp3',
VoiceId='Joanna',
Engine='neural'
)
SSML features:
- <break>: Insert pauses
- <emphasis>: Add emphasis (strong, moderate, reduced)
- <prosody>: Control rate, pitch, volume
- <say-as>: Interpret as date, time, phone, currency, etc.
- <phoneme>: Specify phonetic pronunciation
- <sub>: Substitute pronunciation (alias)
- <amazon:domain>: Use specialized voices (news, conversational)
4. Neural Newscaster Style
Use newscaster speaking style for announcements:
ssml_news = """
<speak>
<amazon:domain name="news">
Breaking news: The company has announced record earnings for the quarter,
exceeding analyst expectations by 15 percent.
</amazon:domain>
</speak>
"""
response = polly.synthesize_speech(
Text=ssml_news,
TextType='ssml',
OutputFormat='mp3',
VoiceId='Matthew', # Newscaster style only works with specific voices
Engine='neural'
)
5. Speech Marks (Metadata)
Get timing information for lip-sync or highlighting:
import json

response = polly.synthesize_speech(
Text='Hello, how are you?',
OutputFormat='json', # Returns speech marks, not audio
VoiceId='Joanna',
Engine='neural',
SpeechMarkTypes=['word', 'sentence', 'ssml', 'viseme']
)
# Parse speech marks (newline-delimited JSON)
marks = response['AudioStream'].read().decode('utf-8')
for line in marks.strip().split('\n'):
mark = json.loads(line)
print(f"{mark['type']}: {mark['value']} at {mark['time']}ms")
Output:
{"time":0,"type":"sentence","start":0,"end":18,"value":"Hello, how are you?"}
{"time":0,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":417,"type":"word","start":7,"end":10,"value":"how"}
{"time":583,"type":"word","start":11,"end":14,"value":"are"}
{"time":750,"type":"word","start":15,"end":18,"value":"you"}
Use cases:
- Synchronized captions
- Animated avatars with lip-sync
- Highlighting text as it’s spoken
- Visual feedback in learning applications
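For the highlighting use case, the word marks parsed above translate directly into a cue list a UI can step through; a minimal sketch:
import json

def word_cues(speech_marks_text):
    """Build (time_ms, char_start, char_end, word) cues from word-level speech marks."""
    cues = []
    for line in speech_marks_text.strip().split('\n'):
        mark = json.loads(line)
        if mark['type'] == 'word':
            cues.append((mark['time'], mark['start'], mark['end'], mark['value']))
    return cues

# e.g. [(0, 0, 5, 'Hello'), (417, 7, 10, 'how'), ...] -> highlight text[start:end] at time_ms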
6. Long-Form Content with StartSpeechSynthesisTask
For content longer than 3,000 characters or asynchronous processing:
# Start asynchronous synthesis task
response = polly.start_speech_synthesis_task(
Text=long_article_text, # Can be up to 200,000 characters
OutputFormat='mp3',
OutputS3BucketName='my-polly-output',
OutputS3KeyPrefix='articles/',
VoiceId='Joanna',
Engine='neural'
)
task_id = response['SynthesisTask']['TaskId']
# Check task status
task = polly.get_speech_synthesis_task(TaskId=task_id)
print(f"Status: {task['SynthesisTask']['TaskStatus']}")
print(f"Output: {task['SynthesisTask']['OutputUri']}")
7. Lexicons (Custom Pronunciations)
Define custom pronunciations for acronyms, brand names, or technical terms:
# Create pronunciation lexicon (PLS format)
lexicon_content = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa"
xml:lang="en-US">
<lexeme>
<grapheme>AWS</grapheme>
<alias>Amazon Web Services</alias>
</lexeme>
<lexeme>
<grapheme>SQL</grapheme>
<phoneme>ˈɛs kjuː ˈɛl</phoneme>
</lexeme>
</lexicon>
"""
# Upload lexicon
polly.put_lexicon(
Name='tech-terms',
Content=lexicon_content
)
# Use lexicon in synthesis
response = polly.synthesize_speech(
Text='AWS provides SQL database services.',
OutputFormat='mp3',
VoiceId='Joanna',
Engine='neural',
LexiconNames=['tech-terms']
)
Integration Patterns
Pattern 1: Serverless Call Transcription Pipeline
Architecture:
Phone Call → S3 (audio) → Lambda (trigger) → Transcribe Job
↓
Transcribe Complete → EventBridge → Lambda → DynamoDB (transcript)
→ Comprehend (sentiment)
→ SNS (alerts for negative sentiment)
Implementation:
import boto3
import json
transcribe = boto3.client('transcribe')
comprehend = boto3.client('comprehend')
dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')
def lambda_start_transcription(event, context):
"""Triggered when audio file lands in S3"""
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
job_name = f"call-{key.replace('/', '-').replace('.', '-')}"
transcribe.start_transcription_job(
TranscriptionJobName=job_name,
LanguageCode='en-US',
MediaFormat='wav',
Media={'MediaFileUri': f's3://{bucket}/{key}'},
OutputBucketName='my-transcription-output',
Settings={
'ShowSpeakerLabels': True,
'MaxSpeakerLabels': 2,
'ChannelIdentification': True
}
)
return {'statusCode': 200, 'body': f'Started job: {job_name}'}
def lambda_process_transcript(event, context):
"""Triggered by EventBridge when transcription completes"""
detail = event['detail']
if detail['TranscriptionJobStatus'] != 'COMPLETED':
return {'statusCode': 200, 'body': 'Job not completed'}
    # The EventBridge event carries only the job name and status, so look up the
    # job to get the transcript location
    job_name = detail['TranscriptionJobName']
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    transcript_uri = job['TranscriptionJob']['Transcript']['TranscriptFileUri']
    # Download and parse transcript from S3
    # (S3 download logic omitted for brevity)
    transcript_text = "..."  # Full transcript text
# Analyze sentiment
sentiment = comprehend.detect_sentiment(
Text=transcript_text[:5000], # First 5000 chars
LanguageCode='en'
)
# Store in DynamoDB
table = dynamodb.Table('call-transcripts')
table.put_item(Item={
'job_name': job_name,
'transcript': transcript_text,
'sentiment': sentiment['Sentiment'],
        # DynamoDB doesn't accept Python floats, so store the score map as a JSON string
        'sentiment_score': json.dumps(sentiment['SentimentScore']),
        'timestamp': str(job['TranscriptionJob']['CreationTime'])
})
# Alert on negative sentiment
if sentiment['Sentiment'] == 'NEGATIVE':
sns.publish(
TopicArn='arn:aws:sns:us-east-1:123456789012:negative-call-alerts',
Subject=f'Negative sentiment detected: {job_name}',
Message=f'Sentiment: {sentiment["Sentiment"]}\nScore: {sentiment["SentimentScore"]}'
)
return {'statusCode': 200, 'body': 'Processed transcript'}
EventBridge rule (capture Transcribe completion):
{
"source": ["aws.transcribe"],
"detail-type": ["Transcribe Job State Change"],
"detail": {
"TranscriptionJobStatus": ["COMPLETED", "FAILED"]
}
}
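If you prefer to create the rule in code rather than the console, a minimal boto3 sketch (the Lambda ARN is a placeholder, and the function also needs a resource-based permission allowing events.amazonaws.com to invoke it):
import json
import boto3

events = boto3.client('events')

# Rule matching Transcribe job completion/failure (same pattern as above)
events.put_rule(
    Name='transcribe-job-state-change',
    EventPattern=json.dumps({
        "source": ["aws.transcribe"],
        "detail-type": ["Transcribe Job State Change"],
        "detail": {"TranscriptionJobStatus": ["COMPLETED", "FAILED"]}
    })
)

# Route matching events to the processing Lambda
events.put_targets(
    Rule='transcribe-job-state-change',
    Targets=[{
        'Id': 'process-transcript-lambda',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:process-transcript'
    }]
)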
Pattern 2: Real-Time Voice Assistant with Polly
Use case: Convert chatbot text responses to speech for voice interfaces.
import base64
import io

import boto3
from flask import Flask, request, send_file
app = Flask(__name__)
polly = boto3.client('polly')
@app.route('/speak', methods=['POST'])
def text_to_speech():
"""API endpoint: Convert text to speech"""
data = request.json
text = data.get('text', '')
voice_id = data.get('voice', 'Joanna')
# Synthesize speech
response = polly.synthesize_speech(
Text=text,
OutputFormat='mp3',
VoiceId=voice_id,
Engine='neural'
)
# Stream audio back to client
audio_stream = response['AudioStream'].read()
return send_file(
io.BytesIO(audio_stream),
mimetype='audio/mpeg',
as_attachment=True,
download_name='response.mp3'
)
@app.route('/chatbot', methods=['POST'])
def chatbot_with_voice():
"""Chatbot endpoint with text and voice response"""
user_input = request.json.get('message', '')
# Generate chatbot response (simplified)
bot_response = generate_chatbot_response(user_input)
# Convert to speech
audio_response = polly.synthesize_speech(
Text=bot_response,
OutputFormat='mp3',
VoiceId='Joanna',
Engine='neural'
)
audio_bytes = audio_response['AudioStream'].read()
return {
'text': bot_response,
'audio': base64.b64encode(audio_bytes).decode('utf-8'),
'content_type': 'audio/mpeg'
}
Pattern 3: Podcast/Video Auto-Captioning
Architecture:
Video Upload (S3) → Lambda → Extract Audio (FFmpeg)
→ Transcribe Job
→ Generate SRT/VTT captions
→ Store in S3
→ Trigger video processing pipeline
Generate captions from Transcribe output:
import json
def generate_srt_from_transcript(transcript_json):
"""Convert Transcribe JSON to SRT subtitle format"""
items = transcript_json['results']['items']
captions = []
current_caption = {'start': None, 'end': None, 'text': ''}
for item in items:
if item['type'] == 'pronunciation':
word = item['alternatives'][0]['content']
start_time = float(item['start_time'])
end_time = float(item['end_time'])
# Start new caption if needed
if current_caption['start'] is None:
current_caption['start'] = start_time
current_caption['text'] += word + ' '
current_caption['end'] = end_time
# Create caption every 10 words or 5 seconds
if len(current_caption['text'].split()) >= 10 or \
(end_time - current_caption['start']) >= 5:
captions.append({
'start': current_caption['start'],
'end': current_caption['end'],
'text': current_caption['text'].strip()
})
                current_caption = {'start': None, 'end': None, 'text': ''}

    # Flush any trailing words into a final caption
    if current_caption['text']:
        captions.append({
            'start': current_caption['start'],
            'end': current_caption['end'],
            'text': current_caption['text'].strip()
        })

    # Convert to SRT format
srt_content = ""
for i, caption in enumerate(captions, 1):
start = format_timestamp(caption['start'])
end = format_timestamp(caption['end'])
srt_content += f"{i}\n{start} --> {end}\n{caption['text']}\n\n"
return srt_content
def format_timestamp(seconds):
"""Convert seconds to SRT timestamp format (HH:MM:SS,mmm)"""
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
secs = int(seconds % 60)
millis = int((seconds % 1) * 1000)
return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
# Usage
with open('transcript.json', 'r') as f:
transcript = json.load(f)
srt_output = generate_srt_from_transcript(transcript)
with open('captions.srt', 'w') as f:
f.write(srt_output)
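The architecture above also mentions VTT; the same caption data can be rendered as WebVTT with a near-identical formatter. A small sketch, assuming the caption-building loop is factored out to return the list of {'start', 'end', 'text'} dicts and reusing format_timestamp from above:
def generate_vtt_from_captions(captions):
    """Render a caption list as WebVTT (timestamps use '.' instead of ',')."""
    vtt = "WEBVTT\n\n"
    for caption in captions:
        start = format_timestamp(caption['start']).replace(',', '.')
        end = format_timestamp(caption['end']).replace(',', '.')
        vtt += f"{start} --> {end}\n{caption['text']}\n\n"
    return vtt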
Pattern 4: Accessible Content Generation
Use case: Generate audio versions of articles for accessibility.
import boto3
from bs4 import BeautifulSoup
from xml.sax.saxutils import escape
polly = boto3.client('polly')
s3 = boto3.client('s3')
def html_article_to_audio(article_html, article_id):
"""Convert HTML article to audio with SSML formatting"""
soup = BeautifulSoup(article_html, 'html.parser')
# Extract title and content
title = soup.find('h1').get_text()
paragraphs = [p.get_text() for p in soup.find_all('p')]
    # Build SSML (escape the text so characters like & and < don't break the XML)
    ssml = '<speak>\n'
    ssml += '<prosody rate="medium" pitch="medium">\n'
    ssml += f'<emphasis level="strong">{escape(title)}</emphasis>\n'
    ssml += '<break time="1s"/>\n'
    for para in paragraphs:
        ssml += f'{escape(para)}\n'
        ssml += '<break time="500ms"/>\n'
ssml += '</prosody>\n</speak>'
# Generate audio (async for long content)
response = polly.start_speech_synthesis_task(
Text=ssml,
TextType='ssml',
OutputFormat='mp3',
VoiceId='Joanna',
Engine='neural',
OutputS3BucketName='my-article-audio',
OutputS3KeyPrefix=f'articles/{article_id}/'
)
return response['SynthesisTask']['TaskId']
# Usage
with open('article.html', 'r') as f:
html_content = f.read()
task_id = html_article_to_audio(html_content, 'article-123')
print(f"Audio generation task started: {task_id}")
Pricing
AWS Transcribe Pricing
Standard Transcription (batch):
- First 250 million seconds/month: $0.024 per minute ($1.44 per hour)
- Over 250 million seconds: $0.0144 per minute ($0.864 per hour)
Streaming Transcription:
- $0.0275 per minute ($1.65 per hour)
Transcribe Medical:
- Batch: $0.036 per minute ($2.16 per hour)
- Streaming: $0.045 per minute ($2.70 per hour)
Call Analytics:
- $0.04 per minute ($2.40 per hour)
Example costs:
| Use Case | Volume | Monthly Cost |
|---|---|---|
| Customer calls (1,000 calls × 5 min avg) | 5,000 minutes | $120 |
| Podcast transcription (100 episodes × 45 min) | 4,500 minutes | $108 |
| Live event streaming (100 hours) | 6,000 minutes | $165 |
| Medical dictation (500 visits × 10 min) | 5,000 minutes | $180 |
AWS Polly Pricing
Standard Voices:
- First 1 million characters/month: Free (12 months from first use)
- After free tier: $4.00 per 1 million characters
Neural Voices:
- $16.00 per 1 million characters
Speech Marks:
- $4.00 per 1 million characters (standard)
- $16.00 per 1 million characters (neural)
Example costs:
| Use Case | Characters/Month | Engine | Monthly Cost |
|---|---|---|---|
| Voice notifications (100K messages × 100 chars) | 10 million | Standard | $40 |
| Audiobook narration (50 books × 300 pages × 2,000 chars) | 30 million | Neural | $480 |
| IVR system (500K calls × 200 chars avg) | 100 million | Standard | $400 |
| Podcast intro/outro (1,000 episodes × 500 chars) | 500K | Neural | $8 |
Cost comparison:
For a customer support system handling 10,000 calls/month (5 min avg):
- Transcribe: 50,000 minutes = $1,200/month
- Polly (IVR responses, 200 chars avg): 2M characters = $8/month (standard) or $32/month (neural)
- Total: ~$1,210/month with standard voices, ~$1,230/month with neural
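The arithmetic behind these estimates is easy to script; a small sketch using the list prices quoted above (verify against current AWS pricing before budgeting):
def monthly_speech_cost(transcribe_minutes, polly_characters, neural=False):
    """Estimate monthly spend from the per-minute and per-million-character rates above."""
    transcribe_cost = transcribe_minutes * 0.024          # batch transcription, $/minute
    polly_rate = 16.00 if neural else 4.00                # $/million characters
    polly_cost = (polly_characters / 1_000_000) * polly_rate
    return transcribe_cost + polly_cost

# 10,000 calls x 5 min, 2M IVR characters with standard voices -> 1208.0
print(monthly_speech_cost(50_000, 2_000_000))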
When to Use Transcribe & Polly
Use Transcribe When You Need To:
1. Make audio content searchable
- Podcast libraries with full-text search
- Video platforms with searchable captions
- Legal/compliance recording archives
- Meeting transcripts for knowledge management
2. Analyze conversations
- Customer support quality monitoring
- Sales call analysis (objections, keywords, sentiment)
- Healthcare visit documentation
- Market research interview analysis
3. Generate accessibility features
- Real-time captions for live streams
- Subtitles for video content
- Text versions of audio content for deaf/hard-of-hearing users
4. Extract insights from audio
- Speaker identification in multi-party calls
- Sentiment analysis of customer interactions
- Topic extraction from podcasts or webinars
- PII detection and redaction for compliance
Use Polly When You Need To:
1. Create voice interfaces
- Voice responses for chatbots and virtual assistants
- IVR (Interactive Voice Response) systems
- Voice-enabled applications and devices
- Smart speaker skills (Alexa, Google Home)
2. Generate content narration
- Audiobook production from ebooks
- News article audio versions
- Blog post podcasts
- Educational content narration
3. Improve accessibility
- Screen reader enhancements with natural voices
- Audio descriptions for visual content
- Voice guidance in applications
- Multi-language audio content
4. Scale voice production
- Automated voice announcements (flight info, public transit)
- Dynamic voice notifications (order updates, alerts)
- Personalized voice messages at scale
- Voiceover for automated video generation
Transcribe + Polly Combined Use Cases
1. Voice translation pipeline
- Transcribe audio (detect language)
- Translate text (AWS Translate)
- Synthesize in target language (Polly)
- Example: Real-time multilingual customer support (a condensed sketch follows this list)
2. Content repurposing
- Transcribe podcast to text
- Summarize with AI (Comprehend/Bedrock)
- Generate summary audio (Polly)
- Example: Podcast show notes with audio summaries
3. Voice bot with conversation memory
- User speaks (real-time Transcribe)
- Bot processes intent (Lex/Lambda)
- Generate response (chatbot logic)
- Speak response (Polly)
- Log full conversation (DynamoDB)
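A condensed sketch of the translation pipeline from use case 1, assuming the transcript text has already been produced (voice and language codes are illustrative):
import boto3

translate = boto3.client('translate')
polly = boto3.client('polly')

def translate_and_speak(transcript_text, source_lang='en', target_lang='es', voice='Lupe'):
    """Translate already-transcribed text, then synthesize it in the target language."""
    translated = translate.translate_text(
        Text=transcript_text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang
    )['TranslatedText']
    audio = polly.synthesize_speech(
        Text=translated,
        OutputFormat='mp3',
        VoiceId=voice,        # choose a voice that matches the target language
        Engine='neural'
    )
    return translated, audio['AudioStream'].read()

# transcript_text would come from a completed Transcribe job (see Pattern 1)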
When NOT to Use Transcribe & Polly
Transcribe Alternatives:
Use custom speech models when:
- You need higher accuracy than 90-95% for specialized domains
- You have extensive training data and ML expertise
- You need low-latency on-device transcription (offline)
- Cost per hour is prohibitive for your volume (>1M hours/month)
Use third-party services (Google Speech-to-Text, Azure Speech) when:
- You need specific features (e.g., longer audio context windows)
- You’re already locked into another cloud ecosystem
- Pricing is significantly better for your use case
Don’t use Transcribe for:
- Real-time voice commands requiring <100ms latency (use local models)
- Offline transcription (no internet connectivity)
- Extremely high-volume batch processing where cost becomes prohibitive
Polly Alternatives:
Use custom TTS models when:
- You need a unique brand voice (custom voice cloning)
- You require ultra-low latency (<50ms)
- You need offline voice synthesis
- You want IP ownership of the voice
Use third-party services (Google Text-to-Speech, Azure Neural TTS) when:
- You need specific voice styles not available in Polly
- You require more advanced prosody control
- You’re already using another cloud provider
Don’t use Polly for:
- Real-time conversational AI requiring <100ms response (consider streaming)
- Extremely high-volume synthesis where cost is prohibitive (>1B chars/month)
- Creative voice work requiring human inflection and emotion (hire voice actors)
Common Pitfalls
Transcribe Pitfalls:
1. Not using custom vocabularies for domain-specific terms
Problem: Generic models misinterpret technical terms, brand names, acronyms.
Solution:
# Always create a custom vocabulary for your domain (note: this is a Transcribe API, not Polly)
transcribe.create_vocabulary(
    VocabularyName='company-terms',
    LanguageCode='en-US',
    Phrases=[
        'Kubernetes',
        'PostgreSQL',
        'OAuth2',
        'HIPAA',
        'Acme-Corporation'   # multi-word phrases are hyphenated
    ]
)
2. Ignoring audio quality requirements
Problem: Poor audio quality (background noise, low bitrate) results in terrible transcription.
Best practices:
- Use at least 16 kHz sample rate (8 kHz minimum)
- Mono or stereo (don’t use 5.1 surround)
- MP3, WAV, FLAC, or MP4 formats
- Minimize background noise (use noise reduction preprocessing)
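One way to enforce these input requirements is a quick ffmpeg pass before uploading to S3; a hedged sketch (assumes ffmpeg is installed and on the PATH):
import subprocess

def normalize_audio(input_path, output_path):
    """Re-encode to 16 kHz mono WAV, which transcribes reliably."""
    subprocess.run(
        ['ffmpeg', '-y', '-i', input_path, '-ar', '16000', '-ac', '1', output_path],
        check=True
    )

normalize_audio('raw-call.mp3', 'call-16k-mono.wav')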
3. Not handling job failures
Problem: Transcription jobs fail silently; no retry logic.
Solution:
# Poll for job completion with error handling
import time
def wait_for_transcription(job_name, max_wait=600):
elapsed = 0
while elapsed < max_wait:
job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
status = job['TranscriptionJob']['TranscriptionJobStatus']
if status == 'COMPLETED':
return job['TranscriptionJob']['Transcript']['TranscriptFileUri']
elif status == 'FAILED':
reason = job['TranscriptionJob'].get('FailureReason', 'Unknown')
raise Exception(f"Transcription failed: {reason}")
time.sleep(10)
elapsed += 10
raise TimeoutError(f"Transcription exceeded {max_wait}s")
4. Over-relying on speaker diarization accuracy
Problem: Speaker labels are not 100% accurate, especially with similar voices or overlapping speech.
Mitigation:
- Use channel identification for structured conversations (phone calls)
- Limit to 5 or fewer speakers for best accuracy
- Don’t rely on speaker labels for legally binding attribution
- Provide human review for critical applications
5. Not redacting PII in compliance-sensitive contexts
Problem: Storing unredacted transcripts with SSN, credit cards, or health data violates compliance.
Solution: Always enable content redaction for sensitive data:
ContentRedaction={
'RedactionType': 'PII',
'RedactionOutput': 'redacted_and_unredacted', # Keep both for compliance
'PiiEntityTypes': ['SSN', 'CREDIT_DEBIT_NUMBER', 'EMAIL', 'PHONE', 'NAME']
}
Polly Pitfalls:
1. Using standard voices when neural voices are needed
Problem: Standard voices sound robotic and unnatural, especially for customer-facing applications.
Solution: Always use neural voices for production applications:
response = polly.synthesize_speech(
Text=text,
VoiceId='Joanna',
Engine='neural', # Not 'standard'
OutputFormat='mp3'
)
Cost difference: Neural is 4x more expensive, but quality difference is worth it.
2. Not using SSML for natural-sounding speech
Problem: Plain text synthesis lacks pauses, emphasis, and natural phrasing.
Solution: Use SSML for production-quality audio:
ssml = """
<speak>
<prosody rate="95%">
Welcome to our service.
<break time="300ms"/>
How can I help you today?
</prosody>
</speak>
"""
3. Hitting the 3,000 character limit for synchronous synthesis
Problem: Long-form content fails with character limit errors.
Solution: Use start_speech_synthesis_task for content >3,000 characters:
polly.start_speech_synthesis_task(
Text=long_text, # Up to 200,000 characters
OutputS3BucketName='my-bucket',
VoiceId='Joanna',
Engine='neural'
)
4. Not caching frequently used audio
Problem: Synthesizing the same text repeatedly wastes money and time.
Solution: Cache generated audio in S3 with content-based keys:
import hashlib

import boto3

polly = boto3.client('polly')
s3 = boto3.client('s3')
def get_or_synthesize_speech(text, voice_id='Joanna'):
# Generate cache key
cache_key = hashlib.sha256(f"{text}-{voice_id}".encode()).hexdigest()
s3_key = f"polly-cache/{cache_key}.mp3"
# Check cache
try:
response = s3.get_object(Bucket='my-audio-cache', Key=s3_key)
return response['Body'].read()
except s3.exceptions.NoSuchKey:
pass
# Synthesize and cache
response = polly.synthesize_speech(
Text=text,
VoiceId=voice_id,
Engine='neural',
OutputFormat='mp3'
)
audio = response['AudioStream'].read()
s3.put_object(Bucket='my-audio-cache', Key=s3_key, Body=audio)
return audio
5. Not handling lexicon limits
Problem: Lexicons have limits (5 per request, 4,000 characters total).
Solution: Organize lexicons by domain and select relevant ones per request:
# Separate lexicons
polly.put_lexicon(Name='medical-terms', Content=medical_lexicon)
polly.put_lexicon(Name='tech-terms', Content=tech_lexicon)
polly.put_lexicon(Name='brand-names', Content=brand_lexicon)
# Use only relevant lexicons per synthesis
polly.synthesize_speech(
Text=medical_text,
LexiconNames=['medical-terms', 'brand-names'], # Max 5
VoiceId='Joanna',
Engine='neural'
)
Security Best Practices
1. IAM Permissions
Principle of least privilege:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"transcribe:StartTranscriptionJob",
"transcribe:GetTranscriptionJob"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::my-audio-bucket/*"
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-transcription-output/*"
}
]
}
2. Encrypt Sensitive Audio and Transcripts
Server-side encryption for S3:
s3.put_object(
Bucket='my-audio-bucket',
Key='sensitive-call.wav',
Body=audio_data,
ServerSideEncryption='aws:kms',
SSEKMSKeyId='arn:aws:kms:us-east-1:123456789012:key/abc123'
)
Transcribe with KMS encryption:
transcribe.start_transcription_job(
    TranscriptionJobName='encrypted-job',
    LanguageCode='en-US',
    MediaFormat='wav',
    Media={'MediaFileUri': 's3://my-bucket/call.wav'},
    OutputBucketName='my-output',
    OutputEncryptionKMSKeyId='arn:aws:kms:us-east-1:123456789012:key/abc123'
)
3. Enable Content Redaction for PII
Always redact PII in compliance-sensitive contexts:
ContentRedaction={
'RedactionType': 'PII',
'RedactionOutput': 'redacted',
'PiiEntityTypes': ['SSN', 'CREDIT_DEBIT_NUMBER', 'NAME', 'ADDRESS', 'EMAIL', 'PHONE']
}
4. Use VPC Endpoints
Keep traffic private:
# Create an interface VPC endpoint for Transcribe (interface endpoints use subnets
# and security groups; route tables only apply to gateway endpoints such as S3)
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-abc123 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.transcribe \
    --subnet-ids subnet-abc123 \
    --security-group-ids sg-abc123
5. Implement Data Retention Policies
Auto-delete sensitive transcripts:
# S3 lifecycle policy for transcripts
s3.put_bucket_lifecycle_configuration(
Bucket='my-transcription-output',
LifecycleConfiguration={
'Rules': [{
'Id': 'delete-transcripts-after-90-days',
'Status': 'Enabled',
'Expiration': {'Days': 90},
'Filter': {'Prefix': 'transcripts/'}
}]
}
)
Key Takeaways
AWS Transcribe:
- Use custom vocabularies for domain-specific terminology to improve accuracy
- Enable speaker diarization for multi-party conversations, but verify accuracy
- Use Call Analytics for customer support quality monitoring and sentiment analysis
- Always enable PII redaction for compliance-sensitive applications
- Use streaming transcription for real-time captions; batch for archived content
AWS Polly:
- Always use neural voices for production applications (about 4x the cost, noticeably better quality)
- Use SSML for natural-sounding speech with pauses, emphasis, and pronunciation control
- Cache frequently synthesized audio to reduce costs and latency
- Use speech marks for lip-sync, highlighting, and synchronized captions
- Use lexicons for consistent pronunciation of brand names, acronyms, and technical terms
Integration & Cost:
- Combine Transcribe + Translate + Polly for multilingual voice applications
- Use asynchronous jobs for long-form content (>3,000 chars for Polly, batch for Transcribe)
- Monitor costs carefully for high-volume applications (especially neural Polly)
- Implement serverless pipelines (S3 + Lambda + EventBridge) for automated processing
Quality & Compliance:
- Pre-process audio for best transcription results (16+ kHz, noise reduction, clear channels)
- Use Transcribe Medical for HIPAA-compliant healthcare transcription
- Enable encryption at rest and in transit for sensitive audio and transcripts
- Implement data retention policies to auto-delete transcripts after required period
AWS Transcribe and Polly eliminate the complexity of building custom speech AI systems while providing enterprise-grade accuracy, scalability, and compliance features for real-world voice applications.