Monolithic to Cloud Native:
Lessons from Migrating Heroku to EKS at Scale
We didn't migrate systems. We migrated assumptions.
The headline numbers
| Metric | Heroku | EKS | Change |
|---|---|---|---|
| API Latency p99 | 700ms | 70ms | DOWN 90% |
| Deploy Time | 45 min | 4 min | DOWN 91% |
| Monthly Incidents | 12 | 2 | DOWN 83% |
| Deploy Frequency | 2/wk | 15/day | UP ~50x |
| Infra Cost | — | — | DOWN 60%+ (right-sized + spot + Karpenter) |
Measurement basis: the 30 days before cutover vs the 30 days after, counting production deploys, pager-triggering incidents, and infrastructure-only cost. Engineering time absorbed: 2 platform engineers full-time for 5 months, plus 8 application engineers at roughly 30% of their time during their respective service migrations. Cost figures are withheld for customer confidentiality; the 60%+ reduction is confirmed at the monthly-bill level.
Two years on
Same OSS stack, two years later. The leverage came from the tooling, not the headcount.
The three failures
The Invisible Throttle (CFS)
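Node.js is not single-threaded: with libuv it spawns 6 threads. A CPU limit of 500m gives the container a CFS quota of 50ms of CPU time per 100ms period, shared across all of those threads. When several threads run at once, the quota is spent in a fraction of the period and the container is throttled until the next period begins. A crypto call that needs 15ms of CPU stretched to 200ms of wall-clock time. Dashboards still show 35% CPU, because averaged utilization hides per-period throttling. Metric to watch: container_cpu_cfs_throttled_periods_total.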
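The quota arithmetic can be sketched in a few lines. This is a rough model, not a measurement: it assumes the default 100ms CFS period, that each busy thread has its own core, and that the threads split the quota evenly. The function names are illustrative. The throttle ratio divides container_cpu_cfs_throttled_periods_total by its companion cAdvisor counter, container_cpu_cfs_periods_total.

```python
import math

CFS_PERIOD_MS = 100.0  # kernel default period; assumed unchanged


def cfs_quota_ms(limit_millicores, period_ms=CFS_PERIOD_MS):
    """CPU time the whole container may use in one CFS period."""
    return period_ms * limit_millicores / 1000.0


def wall_time_ms(cpu_work_ms, limit_millicores, busy_threads,
                 period_ms=CFS_PERIOD_MS):
    """Wall-clock time for one thread to finish cpu_work_ms of CPU work
    while busy_threads threads split the quota evenly (simplified model)."""
    if cpu_work_ms <= 0:
        return 0.0
    per_thread = cfs_quota_ms(limit_millicores, period_ms) / busy_threads
    if per_thread >= period_ms:
        return cpu_work_ms  # the quota never runs out, so no throttling
    full_periods = math.floor(cpu_work_ms / per_thread)
    remainder = cpu_work_ms - full_periods * per_thread
    if remainder == 0:
        # The work ends exactly as the last period's share is used up.
        return (full_periods - 1) * period_ms + per_thread
    # Each exhausted period costs a full period of wall-clock time.
    return full_periods * period_ms + remainder


def throttle_ratio(throttled_periods_delta, periods_delta):
    """Share of CFS periods in which the container was throttled,
    computed from the increase of both counters over the same window."""
    if periods_delta == 0:
        return 0.0
    return throttled_periods_delta / periods_delta


print(cfs_quota_ms(500))                     # 50.0
print(round(wall_time_ms(15, 500, 6), 1))    # 106.7
```

In this model, 15ms of work on a 500m limit with 6 busy threads already takes about 107ms. Work that spills into another period adds roughly 100ms more, which is the shape of the 15ms to 200ms stretch described above.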
The DNS Amplification Tax
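Kubernetes defaults pods to ndots:5. api.stripe.com has only 2 dots, so the resolver walks every search domain before it tries the name as written, and one lookup becomes 10 DNS packets. At 150K Stripe calls/day, that is 1.5M DNS queries. Heroku defaulted to ndots:1. Three lines of pod dnsConfig fixed it.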
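The expansion can be reproduced with a small model of glibc-style search-list handling. It is a sketch under assumptions: the search list below is a hypothetical example of what a pod in the default namespace on EKS might see, each attempt is counted as two packets (one A and one AAAA query), and the external name is assumed to resolve only in its absolute form.

```python
# Hypothetical pod search list; real values depend on namespace and region.
EKS_SEARCH = (
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ca-central-1.compute.internal",
)


def lookup_attempts(name, ndots, search):
    """Names the resolver tries, in order, until the external host resolves."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute]  # tried as written first, and it succeeds
    # Too few dots: every search domain is tried (and fails) first.
    return [f"{name}.{domain}." for domain in search] + [absolute]


def packets_per_lookup(name, ndots, search, record_types=2):
    """Query packets per lookup, assuming one A and one AAAA query per name."""
    return len(lookup_attempts(name, ndots, search)) * record_types


print(packets_per_lookup("api.stripe.com", 5, EKS_SEARCH))            # 10
print(packets_per_lookup("api.stripe.com", 1, EKS_SEARCH))            # 2
print(150_000 * packets_per_lookup("api.stripe.com", 5, EKS_SEARCH))  # 1500000
```

Lowering ndots through the pod's dnsConfig options removes the wasted attempts for every external name. Writing a hostname with a trailing dot has the same effect for that one name.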
The Connection Pool Death Spiral
15 pods x 3 replicas x 10 conns = 450 connections against a 400-connection limit. CMD npm start (shell form) makes /bin/sh PID 1, and the shell does not forward SIGTERM to the Node process. Pods are then killed without closing their connections, so connections leak, health checks fail, and the pods enter a restart loop. Heroku's connections were exhausted too, because both environments shared the same database. PgBouncer, exec-form CMD, and a SIGTERM handler fixed it.
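The connection budget is worth checking before the first deploy rather than after the first outage. A minimal sketch, assuming the "15 x 3" figure means 15 workloads with 3 replicas each; the `reserved` parameter is a hypothetical allowance for other clients of the same database, such as the old platform during dual-running.

```python
def total_connections(workloads, replicas, pool_size):
    """Worst-case direct connections if every pool fills up."""
    return workloads * replicas * pool_size


def max_pool_size(connection_limit, workloads, replicas, reserved=0):
    """Largest per-process pool that stays within the database limit,
    after setting aside `reserved` connections for other clients."""
    clients = workloads * replicas
    return max((connection_limit - reserved) // clients, 0)


demand = total_connections(workloads=15, replicas=3, pool_size=10)
print(demand, demand > 400)                     # 450 True
print(max_pool_size(400, 15, 3))                # 8
print(max_pool_size(400, 15, 3, reserved=100))  # 6
```

A pooler such as PgBouncer changes the equation: application pools connect to the pooler, and only the pooler's server-side pool counts against the database limit.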
The OSS stack
- Orchestration: Kubernetes (EKS), Helm, Istio
- Delivery: GitHub Actions, Flux, Atlantis, Terraform + Terragrunt
- Observability: Prometheus + Thanos, Grafana, Elastic Stack
- Developer Experience: Backstage
- Networking + State: cert-manager, external-dns, PgBouncer, Redis, SQS
Security guardrails: IRSA (workload identity), external-secrets, Trivy (image scanning), NetworkPolicy, admission webhooks. Adopted while OSS and kept after license changes: Terraform (BSL since 2023-08), Redis (RSALv2/SSPL since 2024-03), Elastic (SSPL since 2021-02). Not OSS by design (managed services): GitHub Actions, AWS SQS, ElastiCache, RDS. If we were starting in 2026, our defaults would be OpenTofu, Valkey, and OpenSearch.
Compliance & DR
PCI-DSS: Scope re-mapped from Heroku Postgres to RDS over the cutover. IRSA for workload identity, external-secrets with sealed secrets in git, Trivy scanning in CI from day one. Re-certified post-migration without a finding.
Disaster recovery: Active-passive across two AWS regions (ca-central-1 primary, us-east-1 warm standby). RPO 15 minutes via cross-region Postgres replicas. RTO 30 minutes for the application tier. Quarterly failover exercises.
Still working on: Backstage cost-attribution dashboards, Istio Ambient mode evaluation. Migration is not over. It's a beginning.
The migration checklist
1. Instrument before you migrate. Two weeks of baseline metrics on the platform you are leaving.
2. Pick your least-critical service. Use Istio, Linkerd, or NGINX weighted upstream for traffic shifting.
3. Make every infrastructure change a pull request. Atlantis or Digger if you are starting fresh.
4. Set CPU limits with profiling, not guessing. Add container_cpu_cfs_throttled_periods_total to every default dashboard.
5. Audit every Dockerfile for shell-form CMD. Switch to exec-form. Add SIGTERM handlers in your apps.
6. Override pod ndots before you migrate any service that talks to external APIs. Most Heroku-style apps assume ndots:1.
7. Deploy PgBouncer alongside Postgres in the new environment, even if you didn't have it on the old one.
8. Plan dual-running with shared databases carefully. Both environments share the same blast radius until you cut over.
9. Build a developer portal in parallel. Backstage Scaffolder is the highest-ROI feature. Catalog comes later.
10. If you still meet your SLAs on your current PaaS, do not migrate. Migration is a means, not a goal.
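The Dockerfile audit in the checklist is easy to automate. The following is a minimal sketch, not the tooling used in the migration: it flags CMD and ENTRYPOINT instructions whose argument is not a valid JSON array of strings, which Docker treats as shell form. It inspects single lines only and does not follow backslash continuations; the function names are illustrative.

```python
import json
import re

# Dockerfile instructions are case-insensitive; comment lines never match.
INSTRUCTION = re.compile(r"^\s*(CMD|ENTRYPOINT)\s+(.+?)\s*$", re.IGNORECASE)


def is_exec_form(argument):
    """Exec form is a JSON array of strings, e.g. ["node", "server.js"]."""
    try:
        parsed = json.loads(argument)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, list) and all(isinstance(p, str) for p in parsed)


def shell_form_lines(dockerfile_text):
    """Return (line_number, line) for each shell-form CMD or ENTRYPOINT."""
    findings = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = INSTRUCTION.match(line)
        if match and not is_exec_form(match.group(2)):
            findings.append((number, line.strip()))
    return findings


risky = "FROM node:20\nCMD npm start\n"
fixed = 'FROM node:20\nCMD ["node", "server.js"]\n'
print(shell_form_lines(risky))  # [(2, 'CMD npm start')]
print(shell_form_lines(fixed))  # []
```

Run against every Dockerfile in a repository, a non-empty result is a reasonable CI failure condition. Exec form only fixes signal delivery; the application still needs its own SIGTERM handler to close connections.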
And if your current PaaS meets your needs?
That's a valid answer too. Not every team needs Kubernetes. git push heroku main is still the best deploy UX I've ever used. If your team isn't ready for the highest-risk, highest-ceiling option, that's not a failure. That's a correct read of your situation.
Slides & recording
The slide deck (16:9 PDF, 25 slides) is available below. Recording will be posted to the Linux Foundation YouTube channel within ~2 weeks of the event.
Mateen Ali Anjum — Staff DevOps Engineer, PhonoTech (Phono Technologies Inc.)