OSS NA 2026

Monolithic to Cloud Native:
Lessons from Migrating Heroku to EKS at Scale

We didn't migrate systems. We migrated assumptions.

Date: Monday, May 18, 2026
Time: 5:25-6:05pm CDT
Room: 200F, Minneapolis Convention Center

The headline numbers

Metric             Heroku      EKS         Change
API Latency p99    700ms       70ms        DOWN 90%
Deploy Time        45 min      4 min       DOWN 91%
Monthly Incidents  12          2           DOWN 83%
Deploy Frequency   2/wk        15/day      UP 50x
Infra Cost         (withheld)  (withheld)  DOWN 60%+ (right-sized + spot + Karpenter)

Measurement basis: 30 days before vs 30 days after the cutover, counting production deploys, pager-triggering incidents, and infra-only cost. Engineering time absorbed: 2 platform engineers x 5 months full-time, plus 8 application engineers at ~30% of their time during their respective service migrations. Cost numbers are withheld for customer confidentiality; the 60%+ reduction is confirmed at the monthly bill level.

Two years on

Same OSS stack, two years later. The leverage came from the tooling, not the headcount.

Engineers on platform: 10 → 100
Services: 47 → 100
Platform engineers: 2 → 2

The three failures

1. The Invisible Throttle (CFS)

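Node.js libuv spawns 6 threads. CFS at 500m CPU gives you 50ms of CPU time per 100ms period, shared across all of them. Once the quota is spent, every thread waits for the next period, so a 15ms crypto call stretches to 200ms. Dashboards still show 35% CPU. Metric to watch: container_cpu_cfs_throttled_periods_total.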
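The throttling arithmetic can be sketched in a few lines of Python. This is a deliberately simplified model, not a kernel-accurate one: it assumes the busy threads run in parallel on separate cores and split the quota evenly, and it will not reproduce the exact 200ms figure above. The function names are hypothetical; container_cpu_cfs_periods_total is the cAdvisor counter that pairs with the throttled-periods metric.

```python
import math


def wall_clock_ms(cpu_work_ms, quota_ms=50.0, period_ms=100.0, busy_threads=1):
    """Estimate wall-clock time for one thread to finish cpu_work_ms of CPU work.

    Assumption: busy_threads threads split the CFS quota evenly, so each one
    gets quota_ms / busy_threads of CPU time per period and is then throttled
    until the next period begins. Defaults match a 500m CPU limit.
    """
    if cpu_work_ms <= 0:
        return 0.0
    share_ms = quota_ms / busy_threads
    if share_ms >= period_ms:
        return float(cpu_work_ms)  # quota never runs out within a period
    # Periods in which this thread used its whole share and was throttled.
    exhausted_periods = math.ceil(cpu_work_ms / share_ms) - 1
    remaining_ms = cpu_work_ms - exhausted_periods * share_ms
    return exhausted_periods * period_ms + remaining_ms


def throttle_ratio(throttled_periods, total_periods):
    """Fraction of CFS periods in which the container was throttled.

    Feed it deltas of container_cpu_cfs_throttled_periods_total and
    container_cpu_cfs_periods_total over the same time window.
    """
    return throttled_periods / total_periods if total_periods else 0.0


if __name__ == "__main__":
    print(f"1 busy thread:  {wall_clock_ms(15, busy_threads=1):.1f} ms")
    print(f"6 busy threads: {wall_clock_ms(15, busy_threads=6):.1f} ms")
```

Even in this toy model, the same 15ms of work crosses a period boundary as soon as six threads compete for the 50ms quota, which is why average CPU can look healthy while latency does not.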

2. The DNS Amplification Tax

K8s defaults to ndots:5. api.stripe.com has only 2 dots, so the resolver tries every search domain before the real name and each lookup becomes 10 DNS packets. 150K Stripe calls/day = 1.5M DNS queries. Heroku defaulted to ndots:1. Three lines of pod dnsConfig fixed it.
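A rough Python sketch of where the 10 packets come from, modelling a glibc-style resolver. The search list is an assumption: three cluster domains for a hypothetical "payments" namespace plus a node-level domain, which would give the five candidate names implied above. The ndots value in the dnsConfig fragment is likewise assumed, since the exact override is not stated here.

```python
# Assumed search list: hypothetical "payments" namespace plus a node domain.
SEARCH_DOMAINS = [
    "payments.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ca-central-1.compute.internal",
]


def candidate_names(name, search_domains, ndots):
    """Return the names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: the search list is skipped
    absolute = name + "."
    with_search = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [absolute] + with_search  # enough dots: try it as-is first
    return with_search + [absolute]  # too few dots: search list first


def queries_until_hit(name, search_domains, ndots, record_types=("A", "AAAA")):
    """Count queries sent, assuming only the absolute name resolves."""
    tried = candidate_names(name, search_domains, ndots)
    names_tried = tried.index(name.rstrip(".") + ".") + 1
    return names_tried * len(record_types)


# Pod spec fragment expressed as a dict; the value "1" is an assumption.
POD_DNS_CONFIG = {"dnsConfig": {"options": [{"name": "ndots", "value": "1"}]}}

if __name__ == "__main__":
    per_lookup = queries_until_hit("api.stripe.com", SEARCH_DOMAINS, ndots=5)
    print("queries per lookup at ndots:5 =", per_lookup)
    print("queries per day at 150K calls =", per_lookup * 150_000)
    print("queries per lookup at ndots:1 =",
          queries_until_hit("api.stripe.com", SEARCH_DOMAINS, ndots=1))
```

Appending a trailing dot to the hostname would have the same effect for a single call site, but a pod-level override covers every library in the process.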

3. The Connection Pool Death Spiral

15 pods x 3 replicas x 10 conns = 450 connections against a 400-connection limit. CMD npm start (shell form) makes /bin/sh PID 1, and /bin/sh swallows SIGTERM, so the app never gets a chance to shut down cleanly. Connections leak. Health checks fail. Restart loop. Heroku's connections were exhausted too, because both environments shared the same database. PgBouncer + exec-form CMD + SIGTERM handler fixed it.
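The connection budget and the shutdown path can both be sketched briefly. The apps in this talk are Node.js; the Python below only illustrates the pattern, and the function and class names are hypothetical. The handler only helps if SIGTERM actually reaches the process, which is what the exec-form CMD is for.

```python
import signal


def connection_demand(workloads, replicas, conns_per_replica):
    """Worst-case connections if every pool fills up at the same time."""
    return workloads * replicas * conns_per_replica


class GracefulShutdown:
    """Run registered cleanup callbacks when SIGTERM arrives."""

    def __init__(self):
        self.stopping = False
        self._closers = []

    def register(self, closer):
        """Add a zero-argument callback, e.g. one that closes a DB pool."""
        self._closers.append(closer)

    def install(self):
        """Attach the handler; call this from the main thread at startup."""
        signal.signal(signal.SIGTERM, self.handle)

    def handle(self, signum, frame):
        """Mark the process as stopping and release resources in reverse order."""
        self.stopping = True
        for closer in reversed(self._closers):
            closer()


if __name__ == "__main__":
    demand = connection_demand(workloads=15, replicas=3, conns_per_replica=10)
    print(f"demand={demand}, limit=400, over budget by {demand - 400}")
```

A real service would also stop accepting new requests and drain in-flight ones before closing the pool; that part is omitted here.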

The OSS stack

Orchestration: Kubernetes (EKS), Helm, Istio
Delivery: GitHub Actions, Flux, Atlantis, Terraform + Terragrunt
Observability: Prometheus + Thanos, Grafana, Elastic Stack
Developer Experience: Backstage
Networking + State: cert-manager, external-dns, PgBouncer, Redis, SQS

Security guardrails: IRSA (workload identity), external-secrets, Trivy (image scanning), NetworkPolicy, admission webhooks.

Adopted-when-OSS, kept after license shifts: Terraform (BSL since 2023-08), Redis (RSAL/SSPL since 2024-03), Elastic (SSPL since 2021-03).

Not OSS by design (managed services): GitHub Actions, AWS SQS, ElastiCache, RDS.

If we were starting in 2026: OpenTofu, Valkey, OpenSearch as defaults.

Compliance & DR

PCI-DSS: Scope re-mapped from Heroku Postgres to RDS over the cutover. IRSA for workload identity, external-secrets with sealed secrets in git, Trivy scanning in CI from day one. Re-certified post-migration without a finding.

Disaster recovery: Active-passive across two AWS regions (ca-central-1 primary, us-east-1 warm standby). RPO 15 minutes via cross-region Postgres replicas. RTO 30 minutes for the application tier. Quarterly failover exercises.

Still working on: Backstage cost-attribution dashboards, Istio Ambient mode evaluation. Migration is not over. It's a beginning.

The migration checklist

  1. Instrument before you migrate. Two weeks of baseline metrics on the platform you are leaving.
  2. Pick your least-critical service. Use Istio, Linkerd, or NGINX weighted upstream for traffic shifting.
  3. Make every infrastructure change a pull request. Atlantis or Digger if you are starting fresh.
  4. Set CPU limits with profiling, not guessing. Add container_cpu_cfs_throttled_periods_total to every default dashboard.
  5. Audit every Dockerfile for shell-form CMD. Switch to exec-form. Add SIGTERM handlers in your apps.
  6. Override pod ndots before you migrate any service that talks to external APIs. Most Heroku-style apps assume ndots:1.
  7. Deploy PgBouncer alongside Postgres in the new environment, even if you didn't have it on the old one.
  8. Plan dual-running with shared databases carefully. Both environments share the same blast radius until you cut over.
  9. Build a developer portal in parallel. Backstage Scaffolder is the highest-ROI feature. Catalog comes later.
  10. If you still meet your SLAs on your current PaaS, do not migrate. Migration is a means, not a goal.
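The Dockerfile audit in step 5 is easy to automate. A minimal sketch, assuming one instruction per line (it does not follow backslash continuations) and treating anything that is not a JSON array as shell form; the function name and the sample Dockerfile contents are hypothetical.

```python
import re

# Matches lines that begin with CMD or ENTRYPOINT. HEALTHCHECK CMD is not
# matched, because that line begins with HEALTHCHECK.
_INSTRUCTION = re.compile(r"^\s*(CMD|ENTRYPOINT)\s+(.*)$", re.IGNORECASE)


def shell_form_instructions(dockerfile_text):
    """Return (line_number, line) for CMD/ENTRYPOINT lines not in exec form."""
    findings = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = _INSTRUCTION.match(line)
        if match and not match.group(2).lstrip().startswith("["):
            findings.append((number, line.strip()))
    return findings


if __name__ == "__main__":
    shell_form = "FROM node:20\nCMD npm start\n"
    exec_form = 'FROM node:20\nCMD ["node", "server.js"]\n'
    print(shell_form_instructions(shell_form))
    print(shell_form_instructions(exec_form))
```

Running something like this over every repository before the first cutover gives a list of images whose PID 1 would be a shell.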

And if your current PaaS meets your needs?

That's a valid answer too. Not every team needs Kubernetes. git push heroku main is still the best deploy UX I've ever used. If your team isn't ready for the highest-risk, highest-ceiling option, that's not a failure. That's a correct read of your situation.

Slides & recording

The slide deck (16:9 PDF, 25 slides) is available below. Recording will be posted to the Linux Foundation YouTube channel within ~2 weeks of the event.

Slides PDF · Recording link (after ~Jun 1)

Mateen Ali Anjum — Staff DevOps Engineer, PhonoTech (Phono Technologies Inc.)

phonotech.ca · Contact PhonoTech · More case studies