When Architecture Patterns Don't Match the Problem
Lessons from three attempts to build a distributed event processing platform
Executive Summary
Over six years, a fintech company built three successive solutions to process millions of financial transaction events for real-time customer alerts. Each solution lasted approximately two years before stakeholders demanded change. Each failed for different technical reasons, but a common organizational pattern persisted throughout: development teams remained out of alignment with stakeholder needs, and that gap was never closed.
This case study examines what went wrong technically and organizationally, why the third solution's architecture was mismatched to the problem, and what should have been built instead.
Key lessons:
- Distribution is an optimization, not a starting point
- Don't abandon good designs because of bad implementations
- Pattern selection requires trade-off analysis against actual requirements
- Observability isn't optional for financial systems
- Organizational alignment problems can't be solved with architecture changes alone
The Business Context
The company operated as an intermediary layer between banks, credit unions, and end users. Rather than requiring smaller financial institutions to build their own user interfaces and complex integrations, the platform provided a rich UI experience backed by transaction processing infrastructure.
The primary feature under development was real-time transaction alerts: customers subscribe to events on their accounts and receive email or SMS notifications when transactions occur. Processing happened primarily after transactions had been accepted by the financial institutions.
Scale and constraints:
- Millions of financial transaction events to process
- 8+ development teams needing to plug processors and workflows into the platform
- Financial compliance requirements demanding audit trails and reliability
- Customer SLAs with real monetary penalties for failures
Solution 1: The Monolith
The first attempt was a monolithic system built around a legacy scheduler. The architecture was poorly documented and poorly understood by the teams who inherited it.
Why it failed:
- Teams could not plug their processors and workflows into the existing infrastructure
- The system's behavior was opaque, making modifications risky
- No clear extension points existed for new functionality
- Scaling required manual configuration and human intervention rather than automated elasticity
- Non-containerized infrastructure made scaling expensive and slow
After approximately two years, stakeholders demanded change. The diagnosis was correct: the monolith couldn't support multi-team extensibility or cost-effective scaling. The prescription was a move to microservices.
Solution 2: The Coordinator
The second solution introduced a microservices architecture with a central coordinator pattern:
┌───────────────────────────────────────────────────────────────┐
│                      Central Coordinator                       │
│              (Queue management, event brokering)               │
└──────────┬──────────────────────┬──────────────────────┬──────┘
           │                      │                      │
           ▼                      ▼                      ▼
┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐
│  Domain Processor  │ │  Domain Processor  │ │  Domain Processor  │
│      (Team A)      │ │      (Team B)      │ │      (Team C)      │
│                    │ │                    │ │                    │
│ Registers metadata │ │ Registers metadata │ │ Registers metadata │
│    for alert UI    │ │    for alert UI    │ │    for alert UI    │
└────────────────────┘ └────────────────────┘ └────────────────────┘
Each processor registered its metadata with the coordinator, which the alert management UI used to show customers what alerts were available.
What worked:
- Central location for audit and retry logic
- Each processor was a holistic domain processor owned by a single team
- Simple to understand: clear data flow, clear ownership
- Easy to reason about scaling at the coordinator level
Why it was abandoned:
- The implementation was poorly coded, leading to connection deadlocks
- The service wasn't designed to scale horizontally
- Queues were not used properly, creating resource exhaustion
- A refactor was deemed too expensive
Why Solution 2 could have been redeemed:
The value of Solution 2 wasn't that it was "good" but that it started with fewer components and consistent abstraction layers. This created options:
- When you wanted to bypass one layer in favor of a new one, or go straight to a queue, you could do so incrementally
- You could run two versions of the same platform in parallel, enabling incremental migration without breaking changes
- The architecture supported gradual evolution rather than requiring wholesale replacement
- A lightweight adapter or API gateway could have been used to "two-step" the migration to a new solution while preserving the single abstraction layer
Solution 2 did have an event store in the central coordinator, but it was implemented as a simple queue rather than as durable storage with proper state management. It didn't support rollbacks or saga orchestration. Moving event persistence to the processor level (as proposed later in this case study) would have fixed that limitation while preserving the coordinator's architectural strengths.
The critical mistake: Rather than investing in fixing the implementation while preserving these architectural strengths, the organization chose to start over with a completely different architecture. This pattern would prove costly.
Solution 3: Pipes & Filters
The third solution adopted a distributed pipes and filters architecture. Each actor had complete autonomy: its own authentication, its own AWS SQS access, and its own scaling behavior via Akka. A single central service handled subscriptions, metadata, and workflow registration, but communication between actors was distributed via SQS rather than synchronous API calls.
┌───────────────────────────────────────────────────────────────────────────────┐
│                         Central Registration Service                           │
│                                                                                 │
│       Subscriptions   │   Metadata (UI)   │   Workflow (Parent/Child)           │
│                                                                                 │
│                          All in-memory registration                             │
└───────────────────────────────────────────────────────────────────────────────┘
                                        ▲
                  Actors register on startup (concurrency bugs)
                                        │
                             ┌──────────┴──────────┐
                             │     Event Source    │
                             └──────────┬──────────┘
                                        │
                                       SQS
                                        │
                ┌───────────────────────┼───────────────────────┐
                │                       │                       │
                ▼                       ▼                       ▼
        ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
        │    Actor A    │──SQS──│    Actor B    │──SQS──│    Actor C    │
        │  (SQS+Auth)   │       │  (SQS+Auth)   │       │  (SQS+Auth)   │
        │   DynamoDB    │       │   DynamoDB    │       │   DynamoDB    │
        │  Akka scale   │       │  Akka scale   │       │  Akka scale   │
        └───────┬───────┘       └───────┬───────┘       └───────┬───────┘
                │                       │                       │
               SQS                     SQS                     SQS
                │                       │                       │
                ▼                       ▼                       ▼
        ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
        │    Actor D    │       │    Actor E    │       │    Actor F    │
        │  (SQS+Auth)   │       │  (SQS+Auth)   │       │  (SQS+Auth)   │
        │   DynamoDB    │       │   DynamoDB    │       │   DynamoDB    │
        └───────┬───────┘       └───────┬───────┘       └───────┬───────┘
                │                       │                       │
                └───────────────────────┼───────────────────────┘
                                        │
                                       SQS
                                        │
                                        ▼
                          ┌─────────────────────────────┐
                          │    Notification Service     │
                          └─────────────────────────────┘
Instead of using AWS EventBridge for fan-out, a custom solution was built to register parent/child processor relationships via the central registration service. Actors registered themselves with this single service on startup, which introduced concurrency bugs and created a single point of failure for subscriptions, metadata, and workflow routing.
What worked:
- Clean, well-structured code with good unit tests
- Queue-based communication throughout
- Testable locally using LocalStack
- DynamoDB for ephemeral workflow data
What failed:
Architecture Problems
| Problem | Description | Impact |
|---|---|---|
| Message bloat | Actors didn't transform messages as intended; they added entire new context to payloads | Enormous network costs, bandwidth exhaustion |
| No centralized control | Each actor operated independently with no coordination | Impossible to audit, conflicted with financial compliance needs |
| Custom fan-out | Built custom parent/child registration instead of using EventBridge | Added complexity, introduced concurrency bugs during startup |
| Shared infrastructure | Actors ran on the same tier as UI APIs | Processing spikes caused UI unavailability |
Operational Problems
| Problem | Description | Impact |
|---|---|---|
| DLQs unused | Dead letter queues were added but never monitored or processed | Lost events in a financial system |
| No observability | No centralized monitoring or alerting | Weeks passed before critical problems were detected |
| Per-actor scaling | Each actor had its own Akka scaling behavior | Impossible to identify bottlenecks or predict capacity |
| In-memory metadata | Central registration for processors used in-memory storage | Constant startup failures, single point of failure |
Data Problems
| Problem | Description | Impact |
|---|---|---|
| Stale DynamoDB data | Decision-gate actors had configs that fell out of sync | Incorrect routing, inconsistent behavior |
| Noisy neighbors | One tenant could consume most of an actor's processing capacity | Unfair resource allocation, SLA violations |
| SQS visibility timeout games | Constant tuning to account for processing latency and tenant partitioning | Fragile configuration, message reprocessing |
| Component-specific config | Configurations tied to components, not domains | Nightmare for support teams when components changed |
Developer Experience Problems
| Problem | Description | Impact |
|---|---|---|
| Coupled deployments | Any change required updating every actor simultaneously | Massive deployment cost, developer burnout |
| API sprawl | Teams created new APIs for every change | Fragmentation, inconsistency, maintenance burden |
| No rollback capability | Pipes and filters provides no saga pattern | Failed workflows left partial state with no compensation |
The result: After approximately two years, the accumulated failures led to SLA violations and significant customer payouts. The organization discovered problems weeks after they began because there was no observability, and by then the damage was done.
Pattern Analysis: Why Pipes & Filters Was Wrong
The pipes and filters pattern makes specific assumptions about how data flows through a system:
| Pattern Assumption | Reality in This System |
|---|---|
| Each filter performs a stateless transformation | Actors added unrelated context, didn't transform |
| Filters are independent and composable | Actors required access to shared concepts and integrations |
| Scaling is per-filter based on throughput | Per-actor Akka scaling made bottlenecks invisible |
| Failure handling is per-filter | No saga support meant partial failures couldnβt roll back |
The pattern was selected without formal trade-off analysis. Development teams weren't allowed to see the proposal, and no documentation existed explaining why this architecture was chosen over alternatives or what trade-offs were accepted.
What should have triggered concern:
- Financial systems require audit trails; pipes and filters distributes control
- The use case wasn't transformation; it was enrichment and routing
- Multi-tenant systems need fair resource allocation; per-actor scaling can't provide this
- Fan-out complexity suggested EventBridge, not custom registration
The Organizational Pattern That Never Changed
Across all three solutions, the same organizational dynamic persisted:
Development teams remained out of alignment with stakeholder needs. Stakeholders demanded "change" every two years, but the actual requirements were never crystallized in a way that could be validated. Each solution was a technical response to stakeholder frustration rather than a deliberate answer to clearly defined needs.
Architecture decisions were made without trade-off analysis. Solution 3 was proposed without documented rationale, and the development teams building it weren't even allowed to see the proposal that mandated it.
No architecture can fix an alignment problem, and the pattern persisted because the organizational issue was never addressed.
What Should Have Been Built
Based on the actual requirements and constraints, here's an architecture that would have addressed the real problems:
Principle 1: Separate Concerns Cleanly
┌───────────────────────────────────────┐
│          Alert Management UI          │
│     (Customer subscription mgmt)      │
└───────────────────┬───────────────────┘
                    │ reads
                    ▼
┌───────────────────────────────────────┐        ┌───────────────────────────┐
│       Alert Feature Metadata DB       │◄───────│    Versioned Migration    │
│      (Available alert types,          │        │           Tasks           │
│       subscription options)           │        │   (No code deployment)    │
└───────────────────────────────────────┘        └───────────────────────────┘

        (No coupling - UI metadata is completely
         separate from event processing)

┌───────────────────────────────────────┐
│          Transaction Events           │
│       (From financial systems)        │
└───────────────────┬───────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                             AWS EventBridge                              │
│               (Fan-out routing, no custom registration)                  │
└──────────┬───────────────────────┬───────────────────────┬──────────────┘
           │                       │                       │
           ▼                       ▼                       ▼
┌────────────────────┐  ┌────────────────────┐  ┌────────────────────┐
│  Domain Processor  │  │  Domain Processor  │  │  Domain Processor  │
│      (Team A)      │  │      (Team B)      │  │      (Team C)      │
│                    │  │                    │  │                    │
│  Just processes.   │  │  Just processes.   │  │  Just processes.   │
│  No registration.  │  │  No registration.  │  │  No registration.  │
└────────────────────┘  └────────────────────┘  └────────────────────┘
- Alert feature metadata and subscriptions live in a versioned database, updated via separate migration tasks. No deployed code changes needed for metadata updates. The UI reads what alerts are available, customers subscribe through this service, and processors query it to determine which customers to notify for a given event. The subscription service is a stable API that processors consume, not something processors register with.
- Event processing is handled by async processors that just process. EventBridge routing rules are managed through infrastructure-as-code, tied to component versions rather than runtime registration. Processors don't self-register with internal coordinators or maintain UI metadata. When a processor receives an event, it queries the subscription service to find matching customers.
- Fan-out routing uses AWS EventBridge's native capabilities, eliminating custom parent/child registration logic and the concurrency bugs that came with it (a minimal routing-rule sketch follows this list).
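As a rough sketch of what routing-as-code could look like, the boto3 calls below create one rule and one target. The bus, rule, and queue names are hypothetical, and in practice the rule would live in CDK or Terraform alongside the component version rather than in application code.

```python
# Sketch only: bus, rule, and queue names below are hypothetical.
import json
import boto3

events = boto3.client("events")

# Route card-transaction events to Team A's processor queue.
events.put_rule(
    Name="team-a-card-transactions",
    EventBusName="transaction-events",
    EventPattern=json.dumps({
        "source": ["core.transactions"],
        "detail-type": ["TransactionPosted"],
        "detail": {"channel": ["card"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="team-a-card-transactions",
    EventBusName="transaction-events",
    Targets=[{
        "Id": "team-a-processor",
        # The SQS queue needs a resource policy allowing EventBridge to send to it.
        "Arn": "arn:aws:sqs:us-east-1:123456789012:team-a-processor-queue",
    }],
)
```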
Principle 2: Start with Domain Processors, Distribute When Proven
Solution 2 got this right: each domain/feature team owns its own processor. Distribution is an optimization in response to a proven need, and there was never a proven need to distribute work beyond single domain processors.
- Use share-nothing architecture: domains handle themselves
- No central coordinator needed
- Add Kinesis or other streaming only when measurement proves necessity
When workflows span multiple domains, use eventful choreography rather than orchestration. Each domain publishes events to EventBridge when its work completes, and other domains subscribe to events they care about. This adds documentation complexity (you need to track which domains produce and consume which events), but it's the correct trade-off for a high-agility, multi-team environment where teams need to evolve independently.
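A minimal sketch of that choreography, with hypothetical domain and event names (none taken from the original system): the producing domain publishes a completion event and downstream domains attach their own EventBridge rules to it, with no orchestrator in between.

```python
# Sketch only: bus, source, and detail-type names are hypothetical.
import json
import boto3

events = boto3.client("events")

# The fraud-check domain announces that its work is done; it neither knows nor
# cares which downstream domains (notifications, audit, ...) subscribe to this.
events.put_events(Entries=[{
    "EventBusName": "transaction-events",
    "Source": "fraud-check.domain",
    "DetailType": "FraudCheckCompleted",
    "Detail": json.dumps({
        "tenantId": "tenant-123",
        "transactionId": "txn-456",
        "result": "clear",
    }),
}])
```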
Principle 3: Queuing Strategy Depends on Ordering Requirements
The right queuing approach depends on whether strict ordering matters for your domain:
When ordering doesn't matter: SQS Fair Queues
For domains where events can be processed in any order, SQS Fair Queues provide automatic noisy-neighbor mitigation. By setting MessageGroupId on messages (typically to the tenant ID), SQS automatically ensures fair resource allocation across tenants without custom code. This would have solved the noisy-neighbor problem in Solution 3 with zero implementation effort.
Fair Queues work because they're Standard queues with fairness semantics: virtually unlimited throughput, no in-flight message limits per tenant, and automatic dwell-time fairness. The trade-off is no ordering guarantees.
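On the producer side, the only change from ordinary SQS usage is supplying the tenant ID as the MessageGroupId. A minimal sketch, assuming a Standard queue with fair queues enabled (the queue URL and event shape are hypothetical):

```python
# Sketch only: queue URL and event shape are hypothetical.
import json
import boto3

sqs = boto3.client("sqs")

def publish_event(queue_url: str, tenant_id: str, event: dict) -> None:
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(event),
        MessageGroupId=tenant_id,  # fairness is applied per message group, i.e. per tenant
    )

publish_event(
    "https://sqs.us-east-1.amazonaws.com/123456789012/alert-events",
    "tenant-123",
    {"type": "TransactionPosted", "amountCents": 4250},
)
```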
When ordering matters: Persist-then-consume pattern
For domains requiring strict ordering (e.g., transaction sequences where order affects balance calculations), SQS FIFO has throughput constraints that create back-pressure during traffic spikes. The alternative is to persist events first and consume based on configurable rate and priority:
┌─────────────────────────────────────────────────────────────────────┐
│                           Domain Processor                           │
├─────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐     ┌───────────────────────────────────────┐  │
│  │   Event Store   │────▶│  Processing Logic                     │  │
│  │   (DynamoDB)    │     │  • Partitioned by tenant              │  │
│  │                 │     │  • Configurable consumption rate      │  │
│  │  PK: TenantID   │     │  • Priority-based processing          │  │
│  │  SK: Timestamp  │     │  • Claim Check for large payloads     │  │
│  └────────┬────────┘     └───────────────────┬───────────────────┘  │
│           │                                  │                      │
│           ▼                                  ▼                      │
│  ┌─────────────────┐     ┌───────────────────────────────────────┐  │
│  │   Audit Trail   │     │          Async Observability          │  │
│  │  (per domain)   │     │        (aggregated centrally)         │  │
│  └─────────────────┘     └───────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
Ordering in DynamoDB: DynamoDB maintains sort key order within a partition. By using TenantID as the partition key and a timestamp or sequence number as the sort key, you get ordered reads per tenant for free. The processor queries each tenant's partition in order, processes at a configurable rate, and marks items as processed. This gives you FIFO semantics per tenant without SQS FIFO's throughput constraints.
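A minimal sketch of those ordered reads, using the table layout from the diagram; the table name is hypothetical, and a per-tenant checkpoint is used here instead of a processed flag (one possible variant):

```python
# Sketch only: table and attribute names follow the diagram, not a real schema.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("alert-event-store")

def next_events(tenant_id: str, after_ts: str, batch_size: int = 25) -> list[dict]:
    """Oldest events for one tenant after its checkpoint, in sort-key order."""
    resp = table.query(
        KeyConditionExpression=Key("TenantID").eq(tenant_id) & Key("Timestamp").gt(after_ts),
        ScanIndexForward=True,  # ascending sort-key order = FIFO per tenant
        Limit=batch_size,
    )
    return resp["Items"]
```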
Fair consumption across tenants: The processor controls the consumption rate per tenant, ensuring no single tenant monopolizes processing capacity. This is the configurable priority queue the system needed but never had. You can implement round-robin across tenants, weighted priority based on SLA tier, or burst allowances with rate limiting. All of this is controlled by your code rather than queue configuration.
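One way that fair, configurable consumption could look: a round-robin loop with a per-tenant budget per cycle, sketched against the hypothetical `next_events` helper above. Weighted SLA tiers or burst allowances would slot into the same loop by varying the budget.

```python
# Sketch only: the fetch and handle callables are supplied by the domain.
import time
from typing import Callable

def consume(
    tenants: list[str],
    checkpoints: dict[str, str],
    fetch: Callable[[str, str, int], list[dict]],  # e.g. next_events from the sketch above
    handle: Callable[[dict], None],                # domain-specific event handler
    per_tenant_budget: int = 25,
) -> None:
    """Round-robin over tenants, capping work per tenant per cycle."""
    while True:
        for tenant_id in tenants:
            events = fetch(tenant_id, checkpoints[tenant_id], per_tenant_budget)
            for event in events:
                handle(event)
            if events:
                # Advance the checkpoint to the last processed sort key;
                # in practice this would be persisted, not held in memory.
                checkpoints[tenant_id] = events[-1]["Timestamp"]
        time.sleep(1)  # pacing between cycles: the consumption-rate knob
```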
| SQS-Only Approach | Persist-then-Consume Approach |
|---|---|
| Visibility timeouts constantly tuned | Persistent state eliminates timeout games |
| Consumer memory pressure | Read what you need, when you need it |
| FIFO throughput constraints (300-3000 msg/s per group) | DynamoDB scales to your write capacity |
| Ordering only within message groups | Ordering within partitions with tenant isolation |
| Canβt see traffic patterns until consumed | Visible backlog enables predictive scaling |
| "How many nodes?" = guess | "How many nodes?" = data-driven from backlog metrics |
Note on saga patterns: The original Solution 3 had no rollback capability when workflows failed mid-stream. Saga orchestration addresses this, but it's orthogonal to the queuing strategy. Whether you use SQS or persist-then-consume, compensating transactions need to be designed into the workflow. The persist-then-consume approach makes saga state easier to manage because the event store already has the workflow history.
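To make the compensation idea concrete, here is a deliberately small sketch (not the original system's design): record each completed step and, on failure, run the registered compensations in reverse. In the persist-then-consume model, the completed-step record would live in the event store rather than in memory.

```python
# Sketch only: step handlers are hypothetical, domain-specific callables.
from typing import Callable

# step name -> (do, compensate)
Step = tuple[Callable[[dict], None], Callable[[dict], None]]

def run_workflow(ctx: dict, steps: dict[str, Step]) -> bool:
    completed: list[str] = []
    try:
        for name, (do, _) in steps.items():
            do(ctx)
            completed.append(name)        # persisted alongside the workflow's events
        return True
    except Exception:
        for name in reversed(completed):  # compensate in reverse order
            steps[name][1](ctx)
        return False
```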
Principle 4: Domain-Centric Configuration
| Component-Specific Config (What Was Built) | Domain-Specific Config (What Was Needed) |
|---|---|
| Config tied to component deployment | Config tied to business domain concepts |
| Component changes break config | Component architecture can change freely |
| Support teams must understand topology | Support teams work with domain concepts |
| SDK consumers coupled to implementation | SDK consumers work with stable abstractions |
Principle 5: Governance Through Async Observability
- Each domain processor audits all activity properly and asynchronously
- Governance ensures compliance without blocking processing
- Support staff sees problems and traffic summaries as they occur
- No more weeks-long blind spots
The specifics of governance depend on team size and domain nature, but Architecture Decision Records (ADRs) are always a good starting point. This was something the organization never did across any of the three solutions. Documenting why decisions were made creates accountability and helps future teams understand constraints.
Principle 6: Claim Check for Actual Data Flow
The processors weren't mutating a single payload through stages. Often they weren't even looking at the same data. The Claim Check pattern acknowledges this reality: store large or context-specific data separately and pass references.
When processors need shared data, they query stable domain APIs (like the subscription service) rather than passing bloated payloads between stages. Each workflow is isolated, and all processors must be idempotent. This was another problem in Solution 3: SQS locking and visibility timeout issues meant the same work was sometimes processed more than once, causing duplicate notifications and inconsistent state. With a persisted event store at the processor level, idempotency is enforced by design. You can track which events have been processed and skip duplicates.
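A minimal sketch of that by-design idempotency: claim each event ID with a DynamoDB conditional write before doing any work, and skip duplicates that fail the condition. Table and attribute names are assumptions, not the original schema.

```python
# Sketch only: table and attribute names are hypothetical.
import boto3

table = boto3.resource("dynamodb").Table("processed-events")

def claim_event(event_id: str) -> bool:
    """Return True only the first time this event ID is seen."""
    try:
        table.put_item(
            Item={"EventID": event_id},
            ConditionExpression="attribute_not_exists(EventID)",
        )
        return True
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # duplicate delivery; skip it

def handle(event: dict) -> None:
    if claim_event(event["id"]):
        ...  # process and notify exactly once
```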
Principle 7: Tenant Isolation at the Processor Level
EventBridge doesn't need to worry about tenants; that's the processor's responsibility. With a persisted event store, processors can implement custom priority and rate-limiting patterns to ensure fair resource allocation across tenants (as described in Principle 3). If traffic patterns demand it, the architecture can evolve: EventBridge rules can point to Kinesis, which manages tenant isolation through shards. The key is that tenant isolation is a processor concern with clear options for scaling, not a system-wide problem with no good answers.
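If measurement ever justified that evolution, the mapping is easy to sketch: the tenant ID becomes the Kinesis partition key, so each tenant's events land on a shard in order (the stream name below is hypothetical).

```python
# Sketch only: stream name and event shape are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis")

def publish(tenant_id: str, event: dict) -> None:
    kinesis.put_record(
        StreamName="alert-events",
        PartitionKey=tenant_id,  # per-tenant shard assignment and ordering
        Data=json.dumps(event).encode(),
    )
```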
Key Lessons
1. Distribution is an optimization, not a starting point
Solution 3 assumed distribution was necessary from day one, but there was never a proven need for distributing work beyond single domain processors. The better approach: start simple, measure actual bottlenecks, then optimize.
2. Architectural flexibility matters more than initial correctness
Solution 2 had significant problems, but it hadn't overextended itself. The simpler architecture left room to evolve: you could fix the implementation, add new layers incrementally, or migrate components without wholesale replacement. Solution 3's distributed complexity closed off those options.
3. Pattern selection requires trade-off analysis
Pipes and filters has well-known trade-offs, and those trade-offs conflicted directly with the requirements of this system. No formal analysis was performed, no documentation explained the choice, and development teams weren't even allowed to see the proposal.
4. Observability isnβt optional for financial systems
Solution 3 had no centralized observability, so problems went undetected for weeks. By the time issues were discovered, SLA violations had already triggered significant customer payouts. In financial systems, you must know what's happening in real time.
5. Organizational alignment problems canβt be solved with architecture changes
The same misalignment between development teams and stakeholders persisted across all three solutions. Each architecture change was a technical response to an organizational problem. Until the alignment issue was addressed directly, no architecture would succeed.
Conclusion
The cycle of build-fail-replace could have been broken at Solution 2. Not because Solution 2 was good, but because it hadn't painted itself into a corner. The architecture was simple enough to fix, extend, or partially replace. Instead, the organization invested in a more complex architecture that closed off those options.
Technical excellence matters, and Solution 3 had clean code and good tests. But technical excellence in service of the wrong pattern still fails. A flawed but flexible architecture can be iteratively improved. An overextended architecture requires starting over.
The path forward was always available: start with simple domain processors, measure actual needs, distribute only when proven necessary, and ensure observability from day one. Most importantly, address the organizational alignment that no architecture can fix.