Monolithic to Cloud Native: Lessons from Migrating Heroku to EKS at Scale
We didn't migrate systems. We migrated assumptions.
The headline numbers
| Metric | Heroku | EKS | Change |
|---|---|---|---|
| API Latency p99 | 700ms | 70ms | DOWN 90% |
| Deploy Time | 45 min | 4 min | DOWN 91% |
| Monthly Incidents | 12 | 2 | DOWN 83% |
| Deploy Frequency | 2/wk | 15/day | UP ~50x |
| Infra Cost | baseline | 60%+ lower | DOWN 60%+ (right-sizing, spot, Karpenter) |
Engineering time absorbed: 2 platform engineers full-time for 5 months, plus 8 application engineers at ~30% of their time during their respective service migrations.
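The Change column can be re-derived from the two value columns. The sketch below is a minimal Python check; the 7-day deploy week is an assumption that is not stated above, and a 5-day week gives a lower frequency multiple.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


def frequency_multiple(per_week_before, per_day_after, days_per_week=7):
    """How many times more often deploys happen.

    days_per_week is an assumption; the table does not say which week is meant.
    """
    return per_day_after * days_per_week / per_week_before


print(round(percent_reduction(700, 70)))   # API latency p99, ms
print(round(percent_reduction(45, 4)))     # deploy time, minutes
print(round(percent_reduction(12, 2)))     # monthly incidents
print(frequency_multiple(2, 15))           # deploys, 7-day week
print(frequency_multiple(2, 15, 5))        # deploys, 5-day week
```

The three reductions round to 90, 91, and 83 percent. The frequency multiple is 52.5 or 37.5 depending on the week length, which is why the table figure is best read as approximate.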
The three failures
The Invisible Throttle (CFS)
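Node.js libuv spawns 6 threads. A 500m CPU limit gives the container a CFS quota of 50ms of CPU time per 100ms period, shared across all of those threads. Once the quota is spent, the container is paused until the next period, so a 15ms crypto operation stretches to 200ms. Dashboards still show 35% CPU because averaged usage hides the bursts. Metric to watch: container_cpu_cfs_throttled_periods_total.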
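The quota arithmetic can be sketched in a few lines of Python. This is a deliberately simplified model, not a reproduction of the 200ms figure: it assumes every busy thread stays runnable on its own core and that the threads split the quota evenly. The function names are illustrative. `throttle_ratio` assumes you also scrape container_cpu_cfs_periods_total, the cAdvisor counter of elapsed periods.

```python
def cfs_quota_ms(limit_millicores, period_ms=100.0):
    """CPU time the container may use per CFS period, summed over all threads."""
    return period_ms * limit_millicores / 1000.0


def throttled_wall_ms(cpu_ms, busy_threads, limit_millicores=500, period_ms=100.0):
    """Wall-clock time for one thread to finish cpu_ms of CPU work.

    Simplified model (an assumption): busy_threads threads run in parallel,
    split the quota evenly, and are all paused until the next period starts.
    """
    if cpu_ms <= 0:
        return 0.0
    quota = cfs_quota_ms(limit_millicores, period_ms)
    share_ms = min(quota / busy_threads, period_ms)  # CPU per thread per period
    full_periods, remainder = divmod(cpu_ms, share_ms)
    if remainder == 0:
        # The work ends exactly when the final share is used up.
        return (full_periods - 1) * period_ms + share_ms
    return full_periods * period_ms + remainder


def throttle_ratio(throttled_periods_delta, periods_delta):
    """Fraction of CFS periods in which the container was throttled."""
    return throttled_periods_delta / periods_delta if periods_delta else 0.0


print(cfs_quota_ms(500))                    # quota per 100ms period
print(throttled_wall_ms(15, 1))             # one busy thread
print(round(throttled_wall_ms(15, 6), 1))   # six busy threads
```

With one busy thread the 15ms of work fits inside the 50ms quota. With six, each thread gets roughly 8ms per period, so the same work spills into a second period and takes about 107ms of wall-clock time in this model, even though average CPU usage stays well below the limit.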
The DNS Amplification Tax
K8s defaults to ndots:5. api.stripe.com has only 2 dots, so the resolver tries every search domain before the real name, and one lookup becomes 10 DNS packets. 150K Stripe calls/day = 1.5M DNS queries. Heroku defaulted to ndots:1. Three lines of pod dnsConfig fixed it.
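A rough Python model of the search-list expansion shows where the 10 packets come from. The search list below is an assumption (a pod in the `default` namespace on an EC2 node that contributes `ec2.internal`); the real list comes from the pod's /etc/resolv.conf. The model also assumes that the external name resolves on its absolute form and that each candidate is queried for both A and AAAA records.

```python
# Assumed search list; check /etc/resolv.conf inside a pod for the real one.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]


def lookup_attempts(name, ndots, search=SEARCH, record_types=("A", "AAAA")):
    """Queries a stub resolver sends before an external name resolves."""
    if name.endswith("."):
        candidates = [name]                 # already fully qualified
    elif name.count(".") >= ndots:
        candidates = [name + "."]           # absolute name first; assumed to resolve
    else:
        # Too few dots: every search domain is tried before the absolute name.
        candidates = [f"{name}.{domain}." for domain in search] + [name + "."]
    return [(candidate, rtype) for candidate in candidates for rtype in record_types]


per_lookup_k8s = len(lookup_attempts("api.stripe.com", ndots=5))
per_lookup_low = len(lookup_attempts("api.stripe.com", ndots=1))
print(per_lookup_k8s, per_lookup_low)
print(150_000 * per_lookup_k8s)             # daily queries at ndots:5
```

Under these assumptions, ndots:5 produces 10 queries per lookup and 1.5M per day, while ndots:1 produces 2 per lookup. A trailing dot on the hostname has the same effect for a single name.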
The Connection Pool Death Spiral
15 pods x 3 replicas x 10 conns = 450 connections against a 400-connection limit. CMD npm start (shell form) makes /bin/sh PID 1, and the shell does not forward SIGTERM to the app. Connections leak, health checks fail, and pods enter a restart loop. Heroku lost its connections too, because both environments shared the database. PgBouncer + exec-form CMD + SIGTERM handler fixed it.
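The connection arithmetic is worth running before the first deploy rather than after. A minimal Python sketch follows; the function names and the `reserved` parameter (connections held by the old platform during dual-running) are illustrative assumptions, not part of the original setup.

```python
def connection_budget(pods, replicas, pool_size, db_limit):
    """Return (demand, headroom); negative headroom means the limit is exceeded."""
    demand = pods * replicas * pool_size
    return demand, db_limit - demand


def max_pool_size(pods, replicas, db_limit, reserved=0):
    """Largest per-process pool that fits under the limit.

    reserved is a hypothetical allowance for connections held elsewhere,
    for example by the old platform while both environments share the database.
    """
    return (db_limit - reserved) // (pods * replicas)


print(connection_budget(15, 3, 10, 400))        # demand vs. limit from the talk
print(max_pool_size(15, 3, 400))                # pool size that would have fit
print(max_pool_size(15, 3, 400, reserved=100))  # with 100 connections held elsewhere
```

With the numbers above, demand is 450 against a limit of 400, so the budget is 50 connections short before any leak. A pool of 8 per process would have fit on paper, which is the kind of gap a pooler such as PgBouncer absorbs.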
The OSS stack
- Orchestration: Kubernetes (EKS), Helm, Istio
- Delivery: GitHub Actions, Flux, Atlantis, Terraform + Terragrunt
- Observability: Prometheus + Thanos, Grafana, Elastic Stack
- Developer Experience: Backstage
- Networking + State: cert-manager, external-dns, PgBouncer, Redis, SQS
Adopted-when-OSS, kept after license shifts: Terraform (BSL since 2023-08), Redis (RSAL/SSPL since 2024-03), Elastic (SSPL since 2021-01). Not OSS by design (managed services): GitHub Actions, AWS SQS, ElastiCache, RDS. If we were starting in 2026: OpenTofu, Valkey, OpenSearch as defaults.
The migration checklist
1. Instrument before you migrate. Two weeks of baseline metrics on the platform you are leaving.
2. Pick your least-critical service. Use Istio, Linkerd, or NGINX weighted upstream for traffic shifting.
3. Make every infrastructure change a pull request. Atlantis or Digger if you are starting fresh.
4. Set CPU limits with profiling, not guessing. Add container_cpu_cfs_throttled_periods_total to every default dashboard.
5. Audit every Dockerfile for shell-form CMD. Switch to exec-form. Add SIGTERM handlers in your apps.
6. Override pod ndots before you migrate any service that talks to external APIs. Most Heroku-style apps assume ndots:1.
7. Deploy PgBouncer alongside Postgres in the new environment, even if you didn't have it on the old one.
8. Plan dual-running with shared databases carefully. Both environments share the same blast radius until you cut over.
9. Build a developer portal in parallel. Backstage Scaffolder is the highest-ROI feature. Catalog comes later.
10. If you still meet your SLAs on your current PaaS, do not migrate. Migration is a means, not a goal.
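The Dockerfile audit in the checklist above can be partly automated. The Python sketch below flags CMD and ENTRYPOINT instructions that are not written as a JSON array. It is a rough heuristic under stated assumptions: one instruction per line, no handling of line continuations, and `server.js` is a hypothetical entry point used only for the example.

```python
import re

# Matches a CMD or ENTRYPOINT instruction and captures its arguments.
INSTRUCTION = re.compile(r"^\s*(CMD|ENTRYPOINT)\s+(.*)$", re.IGNORECASE)


def shell_form_instructions(dockerfile_text):
    """Return (line_number, line) for each shell-form CMD or ENTRYPOINT."""
    findings = []
    for lineno, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = INSTRUCTION.match(line)
        # Exec form is a JSON array, so its arguments start with "[".
        if match and not match.group(2).lstrip().startswith("["):
            findings.append((lineno, line.strip()))
    return findings


shell_form = "FROM node:20\nCMD npm start\n"
exec_form = 'FROM node:20\nCMD ["node", "server.js"]\n'

print(shell_form_instructions(shell_form))
print(shell_form_instructions(exec_form))
```

The first Dockerfile is flagged on line 2 and the second produces no findings. A script like this only finds candidates; each service still needs its own SIGTERM handler to close connections cleanly.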
And if your current PaaS meets your needs?
That's a valid answer too. Not every team needs Kubernetes. git push heroku main is still the best deploy UX I've ever used. If your team isn't ready for the highest-risk, highest-ceiling option, that's not a failure. That's a correct read of your situation.
Slides & recording
Slide deck (16:9 PDF) will be uploaded after the session. Recording will be posted to the Linux Foundation YouTube channel within ~2 weeks of the event.
Mateen Ali Anjum — Staff DevOps Engineer, PhonoTech (Phono Technologies Inc.)