When Real-Time Push Needs Product Context
A proof-of-concept for routing WebSocket messages through product entitlements, and what it taught about validating requirements before building infrastructure
The Problem
The web application was data-hungry. It served authenticated users with paid content products, and those users expected near-real-time updates: sector alerts, trade alerts, application news, and scanner notifications. The existing pattern of polling APIs on an interval was getting slower as the product surface grew, and the team knew that adding more polling endpoints would only make the responsiveness problem worse.
The straightforward solution would have been WebSocket-based push notifications. Subscribe to a topic, receive messages when they arrive. But the platform had a complicating factor that made generic pub/sub insufficient: users had different product entitlements, and the same topic needed to deliver different content to different users based on what they had paid for.
A user subscribed to product "CHLT" who asked for "News" should only receive news relevant to that product. A user with "FullAccess" should receive everything. A message published to the "News" topic with scope limited to "CHLT" and "FullAccess" should reach both of those users but not a user who only subscribed to "Sfc." The routing logic lived entirely on the server side; the client should never need to know about product-scoped channels or entitlement resolution. From the client's perspective, subscribing to "News" should just work.
The team scoped this as a proof-of-concept: build a working solution, validate the architecture at moderate scale, and determine whether the investment was justified before committing to production infrastructure. That framing matters for understanding the decisions that followed.
Why Not Something Off-the-Shelf?
The team evaluated alternatives before building custom, but in retrospect, the evaluation was narrower than it should have been. The AWS primitives were legitimately poor fits, but the team stopped there without seriously evaluating managed real-time services that were better suited to the problem.
AWS Primitives: Fair Rejections
API Gateway WebSockets provide a managed connection layer, but the routing model is flat. Each route key selects a Lambda handler that must implement all business logic for determining who receives what, including entitlement lookups, channel resolution, and group membership in an external store like DynamoDB. At that point, API Gateway is just a WebSocket transport layer and all the context-driven routing lives in custom code anyway.
SNS filtering requires the subscriber to declare upfront which attributes to filter on. The design goal was the opposite: clients subscribe to a human-readable topic like "News" and the server resolves the actual filtering based on server-side context the client has no knowledge of. SNS doesn't support dynamic fan-out where the channel name itself is derived from user context.
AppSync subscriptions require the client to specify what it's subscribing to with enough specificity for the filter to work. The client would need to know its own product entitlements, which defeats the purpose of server-side resolution.
These were reasonable rejections. None of these services are designed for the "subscribe to a topic, let the server figure out the rest" pattern.
Managed Real-Time Services: The Gap in the Evaluation
Where the evaluation fell short was in not seriously considering purpose-built real-time messaging services like Pusher or Ably. These services support exactly the pattern the team was trying to build.
Pusher's private channels use an authorization callback where the server decides which channels a connection is allowed to join. The flow would have been: client subscribes to "News," Pusher calls the authorization endpoint, the endpoint looks up the user's product entitlements and approves subscription to private-news-CHLT and private-news-Sfc. That authorization callback is maybe 50 lines of code sitting behind an endpoint the team already knew how to build.
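To make that concrete, here is a minimal sketch of what such a callback could look like as an ASP.NET Core endpoint, using Pusher's documented private-channel signing scheme (an HMAC-SHA256 of "socket_id:channel_name" signed with the app secret). The IEntitlementService, the channel naming convention, and the key handling are illustrative assumptions, not the team's actual code.

// Hypothetical Pusher channel-authorization endpoint; names and key handling are assumptions.
using System.Security.Claims;
using System.Security.Cryptography;
using System.Text;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.Mvc;

public interface IEntitlementService
{
    Task<IReadOnlyCollection<string>> GetProductsAsync(ClaimsPrincipal user);
}

[ApiController]
[Route("pusher/auth")]
public class PusherAuthController : ControllerBase
{
    private const string AppKey = "app-key";       // from configuration in practice
    private const string AppSecret = "app-secret"; // from configuration in practice
    private readonly IEntitlementService _entitlements;

    public PusherAuthController(IEntitlementService entitlements) => _entitlements = entitlements;

    [HttpPost]
    [Authorize]
    public async Task<IActionResult> AuthorizeChannel([FromForm] string socket_id, [FromForm] string channel_name)
    {
        // "private-news-CHLT" -> product code "CHLT"
        var productCode = channel_name.Split('-').Last();
        var products = await _entitlements.GetProductsAsync(User);

        if (!products.Contains(productCode) && !products.Contains("FullAccess"))
            return Forbid();

        // Pusher's documented auth signature: HMAC-SHA256 of "socket_id:channel_name"
        // with the app secret, returned as "key:signature".
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(AppSecret));
        var signature = Convert.ToHexString(
            hmac.ComputeHash(Encoding.UTF8.GetBytes($"{socket_id}:{channel_name}"))).ToLowerInvariant();

        return Ok(new { auth = $"{AppKey}:{signature}" });
    }
}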
Ably provides a similar token-based authorization model, and both services offer message history (comparable to the lookback pattern) as a built-in feature. The per-message TTL, deduplication, and replay that required a custom DynamoDB table and fire-and-forget replay logic in the custom solution are standard capabilities in these platforms.
AWS IoT Core also supports per-client topic policies based on identity claims, with MQTT topics that could map naturally to the product-scoped pattern. A topic structure like news/CHLT and news/Sfc, with IAM-style policies restricting which topics each user can subscribe to, would achieve the same result.
The honest reason these weren't evaluated seriously was that the team defaulted to building within the technology stack it already operated. The organization ran .NET services on ECS, had SignalR expertise, and had authorization infrastructure ready to use. The reflex was to compose from familiar components rather than evaluate whether a managed service could address the same requirements.
That said, managed services come with their own tradeoffs that the evaluation would have surfaced. Pusher and Ably charge per connection and per message. At moderate scale those costs can exceed what AWS infrastructure costs for the same workload, especially when the AWS components use PAY_PER_REQUEST billing that scales to near-zero during quiet periods. The authorization callback model also introduces latency because Pusher must call the team's endpoint, the endpoint must look up entitlements, and the response must travel back before the subscription completes. In the custom solution, that authorization lookup happened in-process with no network round trip. For a platform that highly valued client-API responsiveness, that latency difference mattered.
The point isn't that managed services were clearly better. It's that they should have been part of the evaluation so the tradeoffs could be weighed deliberately rather than defaulted past.
Solution Architecture
The system had two primary flows: subscribing to topics and publishing messages to subscribers. Both flows passed through the same context resolution layer that translated between human-readable topics and product-scoped SignalR groups.
Subscribe Flow
Client App
  Subscribe(["News", "Alerts"])
        |  WebSocket
        v
ContentHub (SignalR)
  - Extracts authenticated Principal
  - Delegates to PubSubAdapter
        |
        v
PubSubAdapter.Subscribe
  1. Looks up HubConfig from Parameter Store
  2. Calls AuthorizationService
  3. Iterates eligible channels
  4. Adds connection to SignalR groups
  5. Fire-and-forget lookback replay
        |
        v
AuthorizationService
  1. Gets user's product entitlements
     (e.g., Products: ["CHLT", "Sfc"])
  2. Converts to MessageScopes
  3. For each matching ChannelConfig:
     BuildEligibleChannelNames(scopes)
  4. Returns: "News" -> ["News-CHLT", "News-Sfc"]
Publish Flow
Domain Service (publishes event)
        |
        v
EventBridge (routes by event type)
        |
        v
SQS Queue (buffered)
        |
        v
QueuedMessagesWorker (BackgroundService)
  Polls every 30 seconds
        |
        v
PubSubAdapter.Publish
  1. Resolve message scopes to channels
  2. Save lookback copy to DynamoDB (if enabled)
  3. Send to each SignalR group
The Smart Channel Concept
The core innovation was an indirection layer between client-facing topics and internal SignalR groups. Clients subscribed to topics. The server resolved those topics into scoped channels based on configuration, user entitlements, and message content. Neither subscribers nor publishers needed to know about this resolution; it happened transparently in the adapter layer.
Configuration-Driven Channel Topology
The entire hub, channel, and filter structure was externalized to AWS Parameter Store as JSON, loaded at service startup. Adding a new topic or changing how channels mapped to products required a configuration change, not a code deployment.
A simplified example of the configuration structure:
{
  "Hubs": [
    {
      "Name": "Content",
      "Type": "PubSub",
      "Channels": [
        {
          "Name": "News-[ScopeValue]",
          "Topic": "News",
          "IsPublishableByUser": false,
          "Filters": [
            { "Type": "Product Codes", "Values": ["*"] }
          ],
          "LookBack": {
            "IsEnabled": true,
            "TtlPeriodType": "CalendarDay",
            "TtlValue": 1
          }
        },
        {
          "Name": "Site Visit Count",
          "Topic": "SiteVisits",
          "IsPublishableByUser": false,
          "Filters": []
        }
      ],
      "FilterCollections": [
        {
          "Name": "AllProducts",
          "Filters": [
            { "Type": "Product Codes", "Values": ["*"] }
          ]
        }
      ]
    }
  ]
}
Two patterns are visible here. The "News" channel uses a [ScopeValue] placeholder in its name and a wildcard product filter. At runtime, this single configuration entry produces N concrete SignalR groups, one per product code in the subscriber's entitlements or the message's scopes. The "Site Visit Count" channel has no filters, which means it resolves to a single static group that all subscribers and all messages use.
Filter collections provided DRY reuse: a common set of product filters defined once could be referenced by name from any number of channel configurations.
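As a rough illustration of what loading that configuration could look like, the sketch below models the JSON with C# records and pulls it from Parameter Store through the AWS SDK's GetParameter call. The type names and the parameter path are assumptions; only the JSON shape above comes from the POC.

// Illustrative configuration model mirroring the JSON above; names are assumptions.
using System.Text.Json;
using Amazon.SimpleSystemsManagement;
using Amazon.SimpleSystemsManagement.Model;

public record HubConfigRoot(List<HubConfig> Hubs);
public record HubConfig(string Name, string Type, List<ChannelConfig> Channels, List<FilterCollection> FilterCollections);
public record ChannelConfig(string Name, string Topic, bool IsPublishableByUser, List<FilterConfig> Filters, LookBackConfig? LookBack);
public record FilterConfig(string Type, List<string> Values);
public record FilterCollection(string Name, List<FilterConfig> Filters);
public record LookBackConfig(bool IsEnabled, string TtlPeriodType, int TtlValue);

public static class HubConfigLoader
{
    // Loads the hub/channel topology once at startup; the parameter name is hypothetical.
    public static async Task<HubConfigRoot> LoadAsync(
        IAmazonSimpleSystemsManagement ssm, string parameterName = "/pubsub/hub-config")
    {
        var response = await ssm.GetParameterAsync(new GetParameterRequest { Name = parameterName });
        return JsonSerializer.Deserialize<HubConfigRoot>(response.Parameter.Value)!;
    }
}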
How Channel Resolution Works
The channel resolution algorithm takes a list of scopes (derived from either the user's entitlements or the message's declared audience) and returns the concrete channel names that apply. The logic follows a priority chain:
BuildEligibleChannelNames(scopes):

  Channel has no filters?
    YES -> Return static channel name (replace [ScopeValue] with "*").
           Everyone gets the same channel.

  Scopes contain Product Codes?
    YES -> Check the channel's product filter:
           Filter is wildcard ("*")?
             -> Every product code in scope becomes its own channel
                e.g., ["CHLT", "Sfc"] -> ["News-CHLT", "News-Sfc"]
           Filter has specific values?
             -> Only matching product codes become channels
                e.g., filter ["CHLT"], scope ["CHLT", "Sfc"] -> ["News-CHLT"]

  No product match? Fall through to Feature Names:
    Same matching logic against feature name filters.
    Return matching feature-based channel names.
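A C# sketch of that chain, reusing the ChannelConfig and FilterConfig shapes from the configuration sketch above (the method shape and the scope-type strings are assumptions based on the description, not the POC's exact code):

// Illustrative resolution logic mirroring the priority chain above.
public static class ChannelResolver
{
    public static IReadOnlyList<string> BuildEligibleChannelNames(
        ChannelConfig channel, IReadOnlyList<FilterConfig> scopes)
    {
        // No filters: a single static channel shared by every subscriber and message.
        if (channel.Filters is null || channel.Filters.Count == 0)
            return new[] { channel.Name.Replace("[ScopeValue]", "*") };

        // Product codes first, then fall through to feature names with the same logic.
        return MatchScopeType(channel, scopes, "Product Codes")
            ?? MatchScopeType(channel, scopes, "Feature Names")
            ?? Array.Empty<string>();
    }

    private static IReadOnlyList<string>? MatchScopeType(
        ChannelConfig channel, IReadOnlyList<FilterConfig> scopes, string scopeType)
    {
        var scopeValues = scopes.FirstOrDefault(s => s.Type == scopeType)?.Values;
        var filter = channel.Filters.FirstOrDefault(f => f.Type == scopeType);
        if (scopeValues is null || filter is null) return null;

        // Wildcard filter: every scope value becomes its own channel,
        // e.g., ["CHLT", "Sfc"] -> ["News-CHLT", "News-Sfc"].
        var matching = filter.Values.Contains("*")
            ? scopeValues
            : scopeValues.Where(v => filter.Values.Contains(v)).ToList();

        var names = matching.Select(v => channel.Name.Replace("[ScopeValue]", v)).ToList();
        return names.Count > 0 ? names : null;
    }
}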
A Concrete Example
Consider two users and one published message:
User A has products ["CHLT", "Sfc"]. When User A subscribes to "News," the authorization service resolves their entitlements into scopes, and the channel config produces ["News-CHLT", "News-Sfc"]. User A's connection joins both SignalR groups.
User B has products ["FullAccess"]. The same subscription request produces ["News-FullAccess"]. User B joins one group.
A message arrives with topic "News" and scopes [{ Type: "Product Codes", Values: ["CHLT", "FullAccess"] }]. The publish path resolves this to channel names ["News-CHLT", "News-FullAccess"] and sends the message to both groups.
User A receives the message because they are in "News-CHLT". User B receives it because they are in "News-FullAccess". If a third user had only product "Sfc," they would not receive this message because "Sfc" is not in the message's scopes. The client never needed to know about any of this filtering. It subscribed to "News" and received relevant news.
Subscribe Resolution:

  User A (CHLT, Sfc) subscribes to "News"   -> joins News-CHLT and News-Sfc
  User B (FullAccess) subscribes to "News"  -> joins News-FullAccess

Publish Resolution:

  Message: Topic="News", Scopes: ["CHLT", "FullAccess"]
    -> News-CHLT        : User A receives it
    -> News-FullAccess  : User B receives it
       News-Sfc         : not targeted (Sfc not in message scopes)
Symmetry Between Subscribe and Publish
Both paths used the same channel resolution logic, which was a deliberate design choice. On subscribe, the user's entitlements became the scopes. On publish, the message's declared audience became the scopes. The same resolution applied in both directions, which meant the system was self-consistent: any channel a user was placed into could also be targeted by a message, and any channel a message was sent to would only contain users who were entitled to receive it.
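A short usage sketch of that symmetry, assuming the ChannelResolver and configuration types sketched earlier; the values match the concrete example above:

// The "News-[ScopeValue]" channel from the configuration example.
var newsChannel = new ChannelConfig(
    Name: "News-[ScopeValue]", Topic: "News", IsPublishableByUser: false,
    Filters: new List<FilterConfig> { new("Product Codes", new List<string> { "*" }) },
    LookBack: null);

// Subscribe: the user's entitlements become the scopes.
var groupsToJoin = ChannelResolver.BuildEligibleChannelNames(
    newsChannel, new List<FilterConfig> { new("Product Codes", new List<string> { "CHLT", "Sfc" }) });
// -> ["News-CHLT", "News-Sfc"]

// Publish: the message's declared audience becomes the scopes.
var groupsToTarget = ChannelResolver.BuildEligibleChannelNames(
    newsChannel, new List<FilterConfig> { new("Product Codes", new List<string> { "CHLT", "FullAccess" }) });
// -> ["News-CHLT", "News-FullAccess"]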
The Lookback Pattern
WebSocket connections are ephemeral. A user might connect to the application after a message was published, or their connection might drop and reconnect. Without a catch-up mechanism, those users would miss messages that arrived while they were disconnected.
The lookback pattern solved this by persisting recent messages to DynamoDB with configurable TTLs and replaying them to new connections during the subscribe handshake.
How Lookback Works
Each channel configuration had an optional lookback setting controlling whether lookback was enabled and how long messages persisted. TTLs could be calendar-day-based (expire at midnight) or minute-based, with DynamoDB's native TTL feature handling cleanup automatically.
Lookback Save (on publish):

  Message published to lookback-enabled channel
        |
        v
  Is lookback enabled for this channel?

    YES -> Stamp message with eligible channels
           Compute TTL from config:
             CalendarDay -> midnight + N days
             Minutes     -> now + N minutes
           Save to DynamoDB with TTL

    NO  -> Skip (message is ephemeral)
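A sketch of that save path under the same assumptions (the LookBackConfig shape from the configuration sketch, a hypothetical "PubSubLookback" table, and attribute names chosen for illustration):

// Illustrative lookback save; table and attribute names are assumptions.
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class LookbackStore
{
    public static DateTimeOffset ComputeExpiry(LookBackConfig config, DateTimeOffset now) =>
        config.TtlPeriodType == "CalendarDay"
            ? new DateTimeOffset(now.UtcDateTime.Date.AddDays(config.TtlValue), TimeSpan.Zero) // midnight + N days
            : now.AddMinutes(config.TtlValue);                                                 // now + N minutes

    public static Task SaveAsync(IAmazonDynamoDB dynamo, string topic, string payloadJson,
        IEnumerable<string> eligibleChannels, LookBackConfig config)
    {
        var now = DateTimeOffset.UtcNow;
        return dynamo.PutItemAsync(new PutItemRequest
        {
            TableName = "PubSubLookback",
            Item = new Dictionary<string, AttributeValue>
            {
                ["Topic"] = new AttributeValue { S = topic },                 // hash key
                ["CreatedAt"] = new AttributeValue { S = now.ToString("o") }, // range key
                ["MessageId"] = new AttributeValue { S = Guid.NewGuid().ToString() },
                ["EligibleChannels"] = new AttributeValue { SS = eligibleChannels.ToList() },
                ["Payload"] = new AttributeValue { S = payloadJson },
                // DynamoDB's native TTL expects epoch seconds in a numeric attribute.
                ["Ttl"] = new AttributeValue { N = ComputeExpiry(config, now).ToUnixTimeSeconds().ToString() }
            }
        });
    }
}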
Replay on Subscribe
After a user's connection was added to its eligible SignalR groups, the adapter fired a lookback replay as a fire-and-forget operation so it wouldn't block the subscribe response.
Lookback Replay (on subscribe):

  For each lookback-enabled channel the user subscribed to:
        |
        v
  Query DynamoDB for non-expired messages by topic

  Track sent message UUIDs (deduplication set)

  For each of the user's eligible channel names:
    Filter saved messages where:
      - the message's eligible channels include this one
      - the message UUID has not already been sent

    Send each matching message directly to
    this specific connection (not the group)

    Add the UUID to the deduplication set
The deduplication set prevented duplicate delivery. If a message was eligible for multiple channels that the user subscribed to (for example, a message scoped to both "CHLT" and "Sfc" for a user with both products), the UUID check ensured the user received it only once.
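A sketch of that replay loop, assuming the table and attribute names used in the save sketch and SignalR's IHubClients for direct-to-connection delivery (the "ReceiveMessage" client method name is an assumption):

// Illustrative replay-on-subscribe with UUID deduplication; names are assumptions.
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using Microsoft.AspNetCore.SignalR;

public static class LookbackReplay
{
    public static async Task ReplayAsync(IAmazonDynamoDB dynamo, IHubClients clients,
        string connectionId, string topic, IReadOnlyList<string> eligibleChannels)
    {
        // Query by topic and filter out items whose TTL has passed but not yet been purged.
        var response = await dynamo.QueryAsync(new QueryRequest
        {
            TableName = "PubSubLookback",
            KeyConditionExpression = "#topic = :topic",
            FilterExpression = "#ttl > :now",
            ExpressionAttributeNames = new Dictionary<string, string>
            {
                ["#topic"] = "Topic",
                ["#ttl"] = "Ttl"
            },
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                [":topic"] = new AttributeValue { S = topic },
                [":now"] = new AttributeValue { N = DateTimeOffset.UtcNow.ToUnixTimeSeconds().ToString() }
            }
        });

        var sent = new HashSet<string>(); // deduplication set keyed by message UUID

        foreach (var channel in eligibleChannels)
        {
            foreach (var item in response.Items)
            {
                if (!item["EligibleChannels"].SS.Contains(channel) || !sent.Add(item["MessageId"].S))
                    continue; // not targeted at this channel, or already replayed once

                // Deliver directly to this connection, not to the whole group.
                await clients.Client(connectionId).SendAsync("ReceiveMessage", item["Payload"].S);
            }
        }
    }
}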
DynamoDB as the Lookback Store
The lookback table used Topic as the hash key and CreatedAt as the range key, with PAY_PER_REQUEST billing. Lookback queries always filtered by topic, and the range key provided chronological ordering within each partition. PAY_PER_REQUEST meant the store cost nearly nothing during quiet periods and scaled automatically during spikes, and TTL-based expiration meant no cleanup jobs were needed.
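For reference, a sketch of that table definition through the AWS SDK, carrying over the hypothetical table and attribute names from the earlier sketches:

// Illustrative table definition matching the key design described above.
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class LookbackTable
{
    public static async Task CreateAsync(IAmazonDynamoDB dynamo)
    {
        await dynamo.CreateTableAsync(new CreateTableRequest
        {
            TableName = "PubSubLookback",
            BillingMode = BillingMode.PAY_PER_REQUEST,
            AttributeDefinitions = new List<AttributeDefinition>
            {
                new("Topic", ScalarAttributeType.S),
                new("CreatedAt", ScalarAttributeType.S)
            },
            KeySchema = new List<KeySchemaElement>
            {
                new("Topic", KeyType.HASH),      // partition per topic
                new("CreatedAt", KeyType.RANGE)  // chronological ordering within a topic
            }
        });

        // Native TTL on the epoch-seconds attribute; no cleanup jobs needed.
        await dynamo.UpdateTimeToLiveAsync(new UpdateTimeToLiveRequest
        {
            TableName = "PubSubLookback",
            TimeToLiveSpecification = new TimeToLiveSpecification { AttributeName = "Ttl", Enabled = true }
        });
    }
}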
Message Ingestion Pipeline
Messages reached the SignalR service through an event-driven pipeline. Internal domain services published events to AWS EventBridge, which routed them by event type to an SQS queue. A background worker within the SignalR service consumed the queue and passed messages through the adapter layer.
The Worker
The worker ran as a .NET BackgroundService co-located with the API in the same ECS container. It started two consumers in parallel: one for the main queue polling every 30 seconds, and one for the dead letter queue polling every 30 minutes. Both consumers used the same message processor; the only difference was the polling interval and queue name.
QueuedMessagesWorker (BackgroundService)

  Main Consumer                        DLQ Consumer
    Poll:  30 seconds                    Poll:  30 minutes
    Queue: pubsub-messagepublished       Queue: pubsub-messagepublished-dlq
        |                                    |
        +----------------+-------------------+
                         v
                 Message Processor
                   - Deserialize event
                   - Extract message
                   - PubSubAdapter.Publish
SQS handled retries automatically: up to 3 attempts with a 30-second visibility timeout, then dead-lettering to the DLQ where the slower consumer served as a safety net for transient failures.
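A sketch of the worker's shape under those assumptions; the queue URLs, the processor delegate, and the polling loop details are illustrative rather than the POC's exact code:

// Illustrative BackgroundService: a fast main-queue consumer and a slow DLQ
// consumer sharing one processor. Queue URLs and the delegate are assumptions.
using Amazon.SQS;
using Amazon.SQS.Model;
using Microsoft.Extensions.Hosting;

public class QueuedMessagesWorker : BackgroundService
{
    private readonly IAmazonSQS _sqs;
    private readonly Func<string, Task> _processMessage; // deserialize + PubSubAdapter.Publish

    public QueuedMessagesWorker(IAmazonSQS sqs, Func<string, Task> processMessage)
    {
        _sqs = sqs;
        _processMessage = processMessage;
    }

    protected override Task ExecuteAsync(CancellationToken stoppingToken) =>
        Task.WhenAll(
            ConsumeAsync("https://sqs.example.amazonaws.com/pubsub-messagepublished", TimeSpan.FromSeconds(30), stoppingToken),
            ConsumeAsync("https://sqs.example.amazonaws.com/pubsub-messagepublished-dlq", TimeSpan.FromMinutes(30), stoppingToken));

    private async Task ConsumeAsync(string queueUrl, TimeSpan pollInterval, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var response = await _sqs.ReceiveMessageAsync(
                new ReceiveMessageRequest { QueueUrl = queueUrl, MaxNumberOfMessages = 10 }, ct);

            foreach (var message in response.Messages)
            {
                // Delete only after successful processing; failures are retried by SQS
                // and eventually dead-lettered per the queue's redrive policy.
                await _processMessage(message.Body);
                await _sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle, ct);
            }

            await Task.Delay(pollInterval, ct);
        }
    }
}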
Co-located Processing
The worker ran inside the same ECS container as the API, not as a separate service. The API was lightweight enough that processing overhead posed no measurable contention, so the team avoided the operational cost of a separate deployment. The organization had a proven pattern for extracting workers into dedicated nodes if processing ever needed its own scaling profile.
Infrastructure Decisions
WebSocket connections impose infrastructure requirements that standard HTTP APIs don't. The team made several deliberate choices to support persistent connections reliably.
Dedicated Load Balancer and Cluster
WebSocket connections require sticky sessions at the load balancer level, so the SignalR service couldn't share an ALB with other HTTP services that used round-robin routing. Each environment got its own ALB and ECS cluster dedicated to the SignalR service. This is a fundamental requirement of any WebSocket infrastructure, not a cost specific to the custom approach.
Transport and Protocol
The client connected with WebSocket-only transport, bypassing SignalR's default HTTP negotiation handshake. This eliminated an extra round trip on connection establishment, at the cost of losing the ability to fall back to long polling if WebSockets were unavailable. Since the dedicated ALB was configured specifically for WebSocket support, the fallback scenario was not a concern.
Messages used MessagePack binary serialization over the WebSocket connection. Compared to JSON, MessagePack produces smaller payloads and faster serialization, which mattered for a connection that might deliver dozens of messages per minute for active topics.
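Client-side, the connection setup could look roughly like this; the hub URL and token factory are placeholders, while the WebSockets-only transport, skipped negotiation, and MessagePack protocol are the pieces described above:

// Illustrative client setup; the hub URL and token factory are placeholders.
using Microsoft.AspNetCore.Http.Connections;
using Microsoft.AspNetCore.SignalR.Client;
using Microsoft.Extensions.DependencyInjection;

static Task<string> GetJwtAsync() => Task.FromResult("<jwt from the app's auth flow>");

var connection = new HubConnectionBuilder()
    .WithUrl("https://example.com/hubs/content", options =>
    {
        options.Transports = HttpTransportType.WebSockets; // WebSocket-only transport
        options.SkipNegotiation = true;                    // no HTTP negotiate round trip
        options.AccessTokenProvider = GetJwtAsync;         // JWT on connection establishment
    })
    .AddMessagePackProtocol()                              // binary framing instead of JSON
    .Build();

await connection.StartAsync();
await connection.InvokeAsync("Subscribe", new[] { "News", "Alerts" });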
Client Resilience
The client implemented a custom retry policy with linear backoff: 1 second for the first retry, increasing by 1 second each attempt, capped at 30 seconds. The policy retried indefinitely, never giving up on reconnection. On successful reconnection, the client automatically re-invoked the subscribe method to rejoin SignalR groups and trigger lookback replay. This made connection drops transparent to the user: the client reconnected, resubscribed, received any missed messages through lookback, and continued receiving live updates.
Connection Resilience:

  Connection drops
        |
        v
  Retry with linear backoff (1s, 2s, 3s, ... 30s cap)
        |
        v
  Reconnected
        |
        v
  Auto-resubscribe to all topics
    -> Rejoin SignalR groups
    -> Lookback replays missed messages
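A sketch of that retry policy and the resubscribe hook, assuming the connection and topics from the setup sketch above:

// Illustrative retry policy: 1s, 2s, 3s, ... capped at 30s, retrying forever.
using Microsoft.AspNetCore.SignalR.Client;

public class LinearBackoffRetryPolicy : IRetryPolicy
{
    public TimeSpan? NextRetryDelay(RetryContext retryContext) =>
        TimeSpan.FromSeconds(Math.Min(retryContext.PreviousRetryCount + 1, 30)); // never null, never give up
}

// Wire-up (connection builder and topics array as in the setup sketch above):
//   .WithAutomaticReconnect(new LinearBackoffRetryPolicy())
//
//   connection.Reconnected += async _ =>
//       await connection.InvokeAsync("Subscribe", topics); // rejoin groups, trigger lookback replay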
Authentication
JWT tokens were passed on the initial WebSocket connection via an access token factory callback. The SignalR hub's subscribe method was protected by an authorization policy ensuring only authenticated users with the correct role could subscribe to topics. The authorization check happened once per hub method invocation, not per message delivery, so it didn't add latency to the real-time push path.
Design Decisions Worth Noting
Thin Hub, Fat Adapter
The SignalR hub itself contained exactly one method: Subscribe. It extracted the authenticated principal from the connection context, forwarded the requested topics and connection ID to the adapter, and returned. All business logic (channel resolution, lookback, authorization) lived in the PubSubAdapter. This separation meant the adapter could be tested independently of SignalR's connection infrastructure, and the same adapter logic could serve both the hub's subscribe path and the SQS worker's publish path.
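In sketch form, with the adapter interface and policy name as assumptions (Context and ConnectionId are standard SignalR hub members):

// Illustrative thin hub: one method, no business logic.
using System.Security.Claims;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.SignalR;

public interface IPubSubAdapter // assumed adapter contract
{
    Task Subscribe(ClaimsPrincipal? user, string connectionId, string[] topics);
}

[Authorize(Policy = "AuthenticatedSubscriber")] // policy name is a placeholder
public class ContentHub : Hub
{
    private readonly IPubSubAdapter _adapter;

    public ContentHub(IPubSubAdapter adapter) => _adapter = adapter;

    public Task Subscribe(string[] topics) =>
        _adapter.Subscribe(Context.User, Context.ConnectionId, topics);
}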
HubFacade Abstraction
The adapter didn't interact with SignalR's IHubContext directly. Instead, a facade wrapped hub contexts in a dictionary keyed by hub name, providing methods like PublishToChannel, PublishToConnection, and Subscribe. This indirection made the adapter hub-agnostic and testable without SignalR's connection infrastructure.
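A sketch of that facade, assuming it is populated at startup with each hub's Clients and Groups; the three method names match those described above, everything else is illustrative:

// Illustrative facade keyed by hub name, so the adapter never touches IHubContext at call sites.
using Microsoft.AspNetCore.SignalR;

public interface IHubFacade
{
    Task PublishToChannel(string hubName, string channel, object message);
    Task PublishToConnection(string hubName, string connectionId, object message);
    Task Subscribe(string hubName, string connectionId, string channel);
}

public class HubFacade : IHubFacade
{
    // Populated at startup, e.g. ["Content"] = (contentHubContext.Clients, contentHubContext.Groups).
    private readonly IReadOnlyDictionary<string, (IHubClients Clients, IGroupManager Groups)> _hubs;

    public HubFacade(IReadOnlyDictionary<string, (IHubClients Clients, IGroupManager Groups)> hubs) => _hubs = hubs;

    public Task PublishToChannel(string hubName, string channel, object message) =>
        _hubs[hubName].Clients.Group(channel).SendAsync("ReceiveMessage", message);

    public Task PublishToConnection(string hubName, string connectionId, object message) =>
        _hubs[hubName].Clients.Client(connectionId).SendAsync("ReceiveMessage", message);

    public Task Subscribe(string hubName, string connectionId, string channel) =>
        _hubs[hubName].Groups.AddToGroupAsync(connectionId, channel);
}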
IsPublishableByUser Gate
Each channel configuration included an IsPublishableByUser flag. When a message came from a user through the authenticated publish path, the authorization service filtered out channels where this flag was false. When a message came from the SQS worker as an internal system message, the flag was ignored. This created a clean separation between channels that accepted user-generated content and channels that were strictly server-to-client.
What Actually Happened
The POC worked. The architecture was sound, the channel resolution performed well at moderate scale, the lookback pattern handled reconnection gracefully, and the configuration-driven topology made it easy to add new topics without code changes. As a proof-of-concept, it validated that the pattern was viable.
It never shipped to production.
While the POC was being built, the team found ways to improve the web application's data loading patterns through better lazy loading and caching strategies. The content that had been slow to load (and that motivated the push for real-time updates) became fast enough through smarter client-side data fetching that the urgency for WebSocket push evaporated. The polling-based approach, combined with better caching, delivered an acceptable user experience without the infrastructure complexity that WebSockets would have introduced.
The real-time push feature was deprioritized, and the POC sat on the shelf.
Honest Retrospective
Was the architecture good?
Yes. The smart channel concept, where client-facing topics resolve to product-scoped SignalR groups through configuration-driven filters, is a clean pattern. The symmetry between subscribe and publish resolution, where both paths use the same scope-matching logic, made the system self-consistent and easy to reason about. The lookback pattern solved a real problem with WebSocket ephemerality. The thin hub, fat adapter separation made the business logic testable independent of SignalR's connection infrastructure. These are patterns worth documenting and reusing.
Was building it the right call?
The infrastructure itself was not the problem. The real question is whether the feature justified any investment at all, custom or managed, given that the underlying problem turned out to be solvable without real-time push.
The context-driven routing logic is about 50 lines of code that would exist regardless of whether it lives in a SignalR adapter, a Pusher authorization callback, or an Ably token request handler. Where the approaches diverge is in the surrounding infrastructure: connection lifecycle management, message persistence, retry handling, and monitoring. Either approach could have been defensible. The less defensible part was investing in either one before validating that users actually needed real-time push.
What should have happened instead?
Two things were missing from the process.
First, the requirement should have been validated before the infrastructure was built. The question wasn't "can we build real-time push with product-scoped routing?" It was "do our users actually need real-time push, or is the problem solvable with better data loading?" If the team had invested a sprint in lazy loading improvements first, the real-time push work might never have started.
Second, the buy-vs-build evaluation should have included managed real-time services, not just AWS primitives. Pusher, Ably, and AWS IoT Core are purpose-built for this pattern. The team might have still chosen to build custom for legitimate reasons like in-process authorization, PAY_PER_REQUEST cost scaling, and full operational control. But that should have been a deliberate tradeoff decision with numbers attached, not a default.
What made it a worthwhile exercise anyway?
The POC validated architectural patterns that apply beyond this specific use case. The configuration-driven channel topology, the scope-based resolution algorithm, and the lookback replay pattern are reusable ideas. Building the POC also forced the team to think carefully about WebSocket infrastructure requirements like sticky sessions, reconnection handling, and connection-scoped state, knowledge that informed later infrastructure decisions even though this particular service didn't ship.
It was a well-built solution to a problem that didn't need solving yet. The build-vs-buy question is genuinely close for this use case, with real advantages on both sides. But the more important question was never asked: does the problem exist? Good engineering applied to an unvalidated requirement is still wasted effort, regardless of whether the implementation is custom or off-the-shelf.