Designed and implemented a dual-pipeline architecture that routes ML inference workloads to GPU or CPU based on batch size, achieving a 75% cost reduction on large batches and an 80% latency improvement on small requests.
The Challenge
A high-volume ML classification platform processed millions of text documents through multiple NLP models. The existing single-pipeline architecture, built on SageMaker Serverless, couldn't efficiently serve both real-time demo requests and large enterprise batch jobs.
Our Approach
Designed a dual-pipeline system with intelligent routing based on batch size and account type: the GPU pipeline handles large batches with scale-to-zero economics, while the CPU pipeline provides zero-cold-start responses for small requests and demos.
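In outline, the routing decision reduces to a threshold check plus an account-type fast path. The sketch below is a hypothetical reconstruction: the threshold value, account-type labels, and function name are illustrative placeholders, not the production values.

```python
# Hypothetical routing sketch; the threshold and account-type labels are
# illustrative placeholders, not the production values.
BATCH_THRESHOLD = 100  # documents; the real cutoff came out of workload analysis

def route_request(batch_size: int, account_type: str) -> str:
    """Choose the inference pipeline for a single request."""
    # Demo and trial accounts always take the zero-cold-start CPU path,
    # regardless of batch size.
    if account_type in ("demo", "trial"):
        return "cpu"
    # Large batches justify GPU spin-up; everything else stays on CPU.
    return "gpu" if batch_size >= BATCH_THRESHOLD else "cpu"
```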
Implementation Timeline
Total Duration: 8 weeks end to end, across four 2-week phases
Workload Analysis
2 weeks
- Traffic pattern analysis revealing bimodal distribution
- Cost modeling for GPU vs CPU at different batch sizes
- Latency requirements mapping by customer segment
- Threshold determination for pipeline routing (see the cost-model sketch after this list)
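To make the break-even analysis concrete, here is a toy version of the GPU-vs-CPU cost comparison. Every figure below (instance rates, throughputs, cold-start penalty) is an assumed placeholder rather than a measured value; the point is the shape of the calculation, not the numbers.

```python
# Toy cost model. All rates, throughputs, and the cold-start penalty are
# assumed placeholders, not measured values.
GPU_RATE_HR, GPU_DOCS_PER_SEC = 1.21, 400    # $/hr and docs/sec: assumptions
CPU_RATE_HR, CPU_DOCS_PER_SEC = 0.10, 12
GPU_SPINUP_SEC = 90                           # amortized scale-from-zero cost per batch

def batch_cost(n_docs, rate_hr, docs_per_sec, overhead_sec=0.0):
    """Dollar cost to process one batch, including fixed overhead."""
    return rate_hr / 3600 * (n_docs / docs_per_sec + overhead_sec)

# Break-even: the smallest batch where GPU (including spin-up) undercuts CPU.
threshold = next(
    n for n in range(1, 1_000_000)
    if batch_cost(n, GPU_RATE_HR, GPU_DOCS_PER_SEC, GPU_SPINUP_SEC)
    < batch_cost(n, CPU_RATE_HR, CPU_DOCS_PER_SEC)
)
print(f"Route batches of {threshold}+ documents to GPU")
```

The threshold is sensitive to the spin-up overhead, which is why idle detection and fast scale-up in the GPU pipeline directly affect where the routing cutoff should sit.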
GPU Pipeline Development
2 weeks
- SageMaker Async Inference endpoint configuration
- S3-based payload handling for large batches (see the invocation sketch after this list)
- CloudWatch alarm-triggered GPU scaling
- EventBridge cron jobs for idle detection and scale-down
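The large-batch path stages its payload in S3 and calls the endpoint through SageMaker's async inference API. A minimal boto3 sketch follows; the endpoint name, bucket, and object key are placeholders:

```python
import boto3

# Minimal async invocation sketch; endpoint name and S3 locations are placeholders.
smr = boto3.client("sagemaker-runtime")

resp = smr.invoke_endpoint_async(
    EndpointName="classifier-gpu-async",                           # assumed name
    InputLocation="s3://inference-payloads/batches/job-123.jsonl",
    ContentType="application/jsonlines",
    InvocationTimeoutSeconds=3600,         # generous ceiling for large batches
)
# The call returns immediately; SageMaker writes results to S3 when done.
print(resp["InferenceId"], resp["OutputLocation"])
```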
CPU Pipeline Development
2 weeks
- SageMaker Serverless endpoint deployment
- Synchronous inference Lambda functions (see the handler sketch after this list)
- Demo and trial account fast-path routing
- Concurrency controls for downstream protection
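The small-request path is a synchronous Lambda in front of the serverless endpoint. A minimal handler sketch, assuming a hypothetical endpoint name and payload shape:

```python
import json
import boto3

# Synchronous fast-path sketch; endpoint name and payload shape are assumptions.
smr = boto3.client("sagemaker-runtime")

def handler(event, context):
    # Small payloads go straight to the serverless endpoint and return inline.
    resp = smr.invoke_endpoint(
        EndpointName="classifier-cpu-serverless",   # assumed name
        ContentType="application/json",
        Body=json.dumps({"documents": event["documents"]}),
    )
    return json.loads(resp["Body"].read())
```

Reserved concurrency on this Lambda is one straightforward way to implement the downstream concurrency controls listed above.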
Shared Infrastructure
2 weeks
- DynamoDB migration for high-throughput writes
- Aurora RDS setup for analytics with eventual consistency
- SQS queue architecture with dead letter queues (see the wiring sketch after this list)
- CloudWatch dashboards and Slack alerting
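For the queue wiring, each work queue gets a dead-letter queue attached via a redrive policy. A boto3 sketch, with assumed queue names, retry count, and visibility timeout:

```python
import json
import boto3

# Queue wiring sketch; queue names, maxReceiveCount, and timeout are assumptions.
sqs = boto3.client("sqs")

dlq = sqs.create_queue(QueueName="inference-jobs-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Messages that fail processing five times land in the DLQ for inspection.
sqs.create_queue(
    QueueName="inference-jobs",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
        "VisibilityTimeout": "900",  # long enough to cover one batch invocation
    },
)
```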
Technical Architecture
Dual-pipeline ML inference architecture with intelligent routing, GPU scale-to-zero, and shared services for cost-optimized high-volume processing.
Results & Impact
Business Benefits
“The dual-pipeline architecture solved what seemed like an impossible problem. We now deliver sub-second responses for demos while processing enterprise batches at a fraction of the previous cost. The GPU scale-to-zero alone saves us thousands monthly.”