OSS NA 2026

Monolithic to Cloud Native:
Lessons from Migrating Heroku to EKS at Scale

We didn't migrate systems. We migrated assumptions.

Date: Monday, May 18, 2026
Time: 5:25-6:05pm CDT
Room: 200F, Minneapolis Convention Center

The headline numbers

Metric             Heroku     EKS        Change
API Latency p99    700ms      70ms       DOWN 90%
Deploy Time        45 min     4 min      DOWN 91%
Monthly Incidents  12         2          DOWN 83%
Deploy Frequency   2/wk       15/day     UP 50x
Infra Cost         baseline   DOWN 60%+  right-sized + spot + Karpenter

Engineering time absorbed: 2 platform engineers x 5 months full-time, plus 8 application engineers at ~30% of their time during their respective service migrations.

The three failures

1. The Invisible Throttle (CFS)

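Node.js libuv spawns 6 threads. CFS at 500m CPU gives you 50ms of CPU time per 100ms period, shared across all of them. When the threads run in parallel they burn through the quota early and the container sits frozen until the next period, so a crypto operation stretches from 15ms to 200ms. Dashboards still show 35% CPU because they average over the frozen time. Metric to watch: container_cpu_cfs_throttled_periods_total.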
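To make the quota arithmetic concrete, here is a toy model in Python. It is a sketch under loud assumptions: every busy thread gets its own core, all threads draw on the quota equally, and the function names are made up for illustration. It will not reproduce the exact 200ms figure above, which depends on the real workload, but it shows how a 15ms operation can spill into the next period.

```python
import math


def wall_clock_ms(cpu_ms_needed, busy_threads, quota_ms=50.0, period_ms=100.0):
    """Wall-clock time for one thread to receive cpu_ms_needed of CPU.

    Toy model (assumption): busy_threads threads run in parallel and
    split the per-period quota evenly; once the quota is spent, every
    thread is frozen until the next period starts.
    """
    # CPU time our thread receives in each period before the freeze.
    share = min(quota_ms / busy_threads, period_ms)
    # Periods that end in a freeze before the work is finished.
    throttled_periods = math.ceil(cpu_ms_needed / share) - 1
    remaining = cpu_ms_needed - throttled_periods * share
    return throttled_periods * period_ms + remaining


def throttle_ratio(throttled_periods_total, periods_total):
    """Fraction of CFS periods in which the container was throttled.

    The inputs correspond to increases of
    container_cpu_cfs_throttled_periods_total and
    container_cpu_cfs_periods_total over the same time window.
    """
    if periods_total == 0:
        return 0.0
    return throttled_periods_total / periods_total


alone = wall_clock_ms(15, busy_threads=1)      # fits inside one period
contended = wall_clock_ms(15, busy_threads=6)  # spills into the second period
print(f"1 thread: {alone:.1f}ms, 6 threads: {contended:.1f}ms")
print(f"throttled in {throttle_ratio(30, 100):.0%} of periods")
```

The ratio is the more useful number for a dashboard: average CPU can look healthy while a large share of periods end in a freeze.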

2. The DNS Amplification Tax

K8s defaults to ndots:5. api.stripe.com has only 2 dots, so the resolver walks the pod's search domains before trying the name as-is: 10 DNS packets per lookup. 150K Stripe calls/day = 1.5M DNS queries. Heroku defaulted to ndots:1. Three lines of pod dnsConfig fixed it.
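The amplification can be sketched in a few lines of Python. This is an illustration, not the cluster's actual resolver: the search list below is a hypothetical four-domain example, the function names are invented, and the model assumes one A and one AAAA query per candidate name and that the external name only resolves in its absolute form.

```python
# Hypothetical search list; a real pod's /etc/resolv.conf will differ.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]


def candidates(name, search, ndots):
    """Names a stub resolver tries, in order, for a given ndots value."""
    if name.endswith("."):
        return [name]  # already fully qualified, search list is skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # try the name as-is first
    return expanded + [absolute]      # walk the search list first


def queries_until_hit(name, search, ndots, record_types=2):
    """Queries sent before the absolute name is tried (A + AAAA each)."""
    names = candidates(name, search, ndots)
    absolute = name if name.endswith(".") else name + "."
    return (names.index(absolute) + 1) * record_types


per_lookup_default = queries_until_hit("api.stripe.com", SEARCH, ndots=5)
per_lookup_tuned = queries_until_hit("api.stripe.com", SEARCH, ndots=1)
print(per_lookup_default, per_lookup_tuned)
print(150_000 * per_lookup_default, "queries/day at ndots:5")
```

In a pod spec the fix lives under dnsConfig.options, as an option named ndots with a lower value. A trailing dot on the hostname skips the search list as well, as the first branch of candidates shows.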

3. The Connection Pool Death Spiral

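15 pods x 3 replicas x 10 conns = 450 connections vs a 400-connection limit. CMD npm start (shell form) makes /bin/sh PID 1, and the shell swallows SIGTERM instead of passing it to the app, so the app never closes its connections cleanly. Connections leak. Health checks fail. Restart loop. The Heroku apps lost their connections too, because both environments shared the database. PgBouncer + exec-form CMD + SIGTERM handler fixed it.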
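The connection budget is simple arithmetic, and worth running before the first pod starts. A minimal Python sketch, using the numbers above; the function names and the reserved parameter are illustrative assumptions, not part of the original setup.

```python
def total_connections(pods, replicas, pool_size):
    """Worst-case connections if every pool fills up."""
    return pods * replicas * pool_size


def max_pool_size(pods, replicas, db_limit, reserved=0):
    """Largest per-process pool that stays under the database limit.

    reserved (assumption) is headroom for anything else sharing the
    database, such as the old platform during dual-running.
    """
    return (db_limit - reserved) // (pods * replicas)


demand = total_connections(pods=15, replicas=3, pool_size=10)
print(f"demand: {demand}, over limit by {demand - 400}")
print("largest safe pool size:", max_pool_size(15, 3, db_limit=400))
```

A pooler such as PgBouncer changes the equation: application pools connect to the pooler, and only the pooler's much smaller server-side pool counts against the database limit.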

The OSS stack

Orchestration: Kubernetes (EKS), Helm, Istio
Delivery: GitHub Actions, Flux, Atlantis, Terraform + Terragrunt
Observability: Prometheus + Thanos, Grafana, Elastic Stack
Developer Experience: Backstage
Networking + State: cert-manager, external-dns, PgBouncer, Redis, SQS

Adopted-when-OSS, kept after license shifts: Terraform (BSL since 2023-08), Redis (RSAL/SSPL since 2024-03), Elastic (SSPL since 2021-01). Not OSS by design (managed services): GitHub Actions, AWS SQS, ElastiCache, RDS. If we were starting in 2026: OpenTofu, Valkey, OpenSearch as defaults.

The migration checklist

  1. Instrument before you migrate. Two weeks of baseline metrics on the platform you are leaving.
  2. Pick your least-critical service. Use Istio, Linkerd, or NGINX weighted upstream for traffic shifting.
  3. Make every infrastructure change a pull request. Atlantis or Digger if you are starting fresh.
  4. Set CPU limits with profiling, not guessing. Add container_cpu_cfs_throttled_periods_total to every default dashboard.
  5. Audit every Dockerfile for shell-form CMD. Switch to exec-form. Add SIGTERM handlers in your apps.
  6. Override pod ndots before you migrate any service that talks to external APIs. Most Heroku-style apps assume ndots:1.
  7. Deploy PgBouncer alongside Postgres in the new environment, even if you didn't have it on the old one.
  8. Plan dual-running with shared databases carefully. Both environments share the same blast radius until you cut over.
  9. Build a developer portal in parallel. Backstage Scaffolder is the highest-ROI feature. Catalog comes later.
  10. If you still meet your SLAs on your current PaaS, do not migrate. Migration is a means, not a goal.
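For the SIGTERM handler in item 5, the shape is the same in any runtime: catch the signal, stop taking work, release connections, exit. The services in this talk are Node.js; the sketch below shows the pattern in Python purely as an illustration. FakePool is a hypothetical stand-in for a real database pool, and serve is an invented name.

```python
import signal
import threading

shutdown_requested = threading.Event()


class FakePool:
    """Hypothetical stand-in for a database connection pool."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def handle_sigterm(signum, frame):
    # Only set a flag here; the main loop does the actual cleanup.
    shutdown_requested.set()


def serve(pool, poll_seconds=0.1):
    """Run until SIGTERM arrives, then release connections before exiting."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    while not shutdown_requested.is_set():
        # Real request handling would go here.
        shutdown_requested.wait(poll_seconds)
    pool.close()
```

The handler only helps if the signal reaches the process, which is why it pairs with the exec-form CMD: with the app as PID 1 there is no shell in between to swallow SIGTERM.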

And if your current PaaS meets your needs?

That's a valid answer too. Not every team needs Kubernetes. git push heroku main is still the best deploy UX I've ever used. If your team isn't ready for the highest-risk, highest-ceiling option, that's not a failure. That's a correct read of your situation.

Slides & recording

Slide deck (16:9 PDF) will be uploaded after the session. Recording will be posted to the Linux Foundation YouTube channel within ~2 weeks of the event.

Slides PDF (after May 18) · Recording link (after ~Jun 1)

Mateen Ali Anjum — Staff DevOps Engineer, PhonoTech (Phono Technologies Inc.)

phonotech.ca