OSS NA 2026

Monolithic to Cloud Native:
Lessons from Migrating Heroku to EKS at Scale

We didn't migrate systems. We migrated assumptions.

Date: Monday, May 18, 2026
Time: 5:25-6:05pm CDT
Room: 200F, Minneapolis Convention Center

Migration checklist · 22 merged OSS PRs · Sched session page

The headline numbers

Metric	Heroku	EKS	Change
API Latency p99	700ms	70ms	DOWN 90%
Deploy Time	45 min	4 min	DOWN 91%
Monthly Incidents	12	2	DOWN 83%
Deploy Frequency	2/wk	15/day	UP 50x
Infra Cost	—	—	DOWN 60%+ (right-sized + spot + Karpenter)

How these were measured: the 30 days before the migration against the 30 days after, counting production deploys only, pager-triggering incidents only, and infrastructure-only cost. Engineering time absorbed: 2 platform engineers x 5 months full-time, plus 8 application engineers at ~30% of their time during their respective service migrations. Cost numbers are withheld for customer confidentiality; the 60%+ reduction is confirmed at the monthly bill level.
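The Change column can be re-derived from the before and after values in the table. The sketch below does that arithmetic; the one assumption is a 7-day deploy week for the frequency comparison, which the page does not state.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100


# Values copied from the table above.
latency = percent_reduction(700, 70)    # API latency p99, ms
deploy_time = percent_reduction(45, 4)  # deploy time, minutes
incidents = percent_reduction(12, 2)    # monthly incidents

# Deploy frequency: 2 per week on Heroku vs 15 per day on EKS.
# Assumption: deploys happen 7 days a week. With a 5-day week the
# multiple would be 37.5x instead.
frequency_multiple = (15 * 7) / 2

print(f"API latency p99 down {latency:.0f}%")
print(f"Deploy time down {deploy_time:.0f}%")
print(f"Monthly incidents down {incidents:.0f}%")
print(f"Deploy frequency up {frequency_multiple:.1f}x")
```

Under the 7-day assumption the multiple comes out slightly above the rounded 50x shown in the table.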

Two years on

Same OSS stack, two years later. The leverage came from the tooling, not the headcount.

Engineers on platform: 10→100
Services: 47→100
Platform engineers: 2→2

The three failures

The Invisible Throttle (CFS)

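Node.js with libuv spawns 6 threads. CFS at 500m CPU gives the whole container 50ms of CPU time per 100ms period, and every thread draws on that one quota. Once the quota is spent, the container is throttled until the next period starts, so a 15ms crypto operation stretches to 200ms. Dashboards average over much longer windows and still show 35% CPU. Metric to watch: container_cpu_cfs_throttled_periods_total.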
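A rough way to see why a short task stretches under a CFS quota is to model it. The sketch below is a simplified model, not the kernel's scheduler: it assumes every busy thread runs in parallel from the start of each period and that the threads split the quota evenly. The function name and the thread counts in the example are illustrative, and the model is not tuned to reproduce the 200ms figure from the talk.

```python
import math


def wall_time_ms(cpu_needed_ms: float, busy_threads: int,
                 quota_ms: float = 50.0, period_ms: float = 100.0) -> float:
    """Wall-clock time for one thread to receive `cpu_needed_ms` of CPU.

    Simplifying assumption: all `busy_threads` burn CPU in parallel and
    share the quota evenly, so each gets quota_ms / busy_threads of CPU
    per period and is then throttled until the next period begins.
    """
    if busy_threads * period_ms <= quota_ms:
        # The quota can never be exhausted, so there is no throttling.
        return cpu_needed_ms
    share = quota_ms / busy_threads  # CPU ms per thread per period
    full_periods = math.ceil(cpu_needed_ms / share) - 1
    remainder = cpu_needed_ms - full_periods * share
    return full_periods * period_ms + remainder


# 500m CPU limit = 50ms quota per 100ms period.
for threads in (1, 6, 12):
    print(f"{threads:>2} busy threads: 15ms of work takes "
          f"{wall_time_ms(15, threads):.1f}ms of wall time")
```

In this model a single busy thread finishes 15ms of work in 15ms, while six busy threads spend the shared quota in about 8ms and the same work spills into the next period. Averaged over a minute, the container still looks lightly loaded.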

The DNS Amplification Tax

K8s defaults to ndots:5. api.stripe.com has only 2 dots, so the resolver tries every search domain before the real name, and one lookup becomes 10 DNS packets. 150K Stripe calls/day = 1.5M DNS queries. Heroku defaulted to ndots:1. Three lines of pod dnsConfig fixed it.
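The 10-packet figure follows from how a glibc-style resolver applies ndots and the search list. The sketch below reproduces the count under stated assumptions: the four search domains are hypothetical stand-ins for a pod's resolv.conf, only the absolute name resolves, and each name is queried for both A and AAAA records.

```python
# Hypothetical search list; a real pod's /etc/resolv.conf will differ.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ca-central-1.compute.internal",
]


def candidate_names(name: str, search: list[str], ndots: int) -> list[str]:
    """Names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: search list is skipped
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + searched  # enough dots: try as-is first
    return searched + [absolute]      # too few dots: search list first


def queries_sent(name: str, search: list[str], ndots: int,
                 record_types: int = 2) -> int:
    """Queries sent, assuming only the absolute name resolves."""
    names = candidate_names(name, search, ndots)
    attempts = names.index(name.rstrip(".") + ".") + 1
    return attempts * record_types


per_lookup = queries_sent("api.stripe.com", SEARCH, ndots=5)
print("ndots:5 ->", per_lookup, "queries per lookup")
print("ndots:1 ->", queries_sent("api.stripe.com", SEARCH, ndots=1))
print("150K calls/day ->", 150_000 * per_lookup, "queries/day")
```

With ndots:1 the same name has enough dots to be tried as-is first, so the search-domain misses disappear.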

The Connection Pool Death Spiral

15 pods x 3 replicas x 10 conns = 450 connections against a 400-connection limit. CMD npm start (shell form) makes /bin/sh PID 1, and it swallows SIGTERM, so the app never shuts down cleanly. Connections leak, health checks fail, and pods enter a restart loop. Heroku ran out of connections too, because both environments shared the same database. PgBouncer + exec-form CMD + SIGTERM handler fixed it.
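The connection arithmetic is worth doing before cutover rather than after. The sketch below checks the direct-connection count from the talk against the limit, then shows the effect of a pooler. The PgBouncer instance count and pool size are hypothetical values chosen for illustration, not the configuration used in the migration.

```python
def pool_connections(pods: int, replicas: int, conns_per_replica: int) -> int:
    """Database connections opened when every replica holds its own pool."""
    return pods * replicas * conns_per_replica


DB_LIMIT = 400  # connection limit from the talk

# Direct connections: every replica talks to Postgres itself.
direct = pool_connections(pods=15, replicas=3, conns_per_replica=10)

# Hypothetical pooled setup: apps connect to PgBouncer, and only
# PgBouncer's server-side pool counts against the database limit.
PGBOUNCER_INSTANCES = 2
PGBOUNCER_POOL_SIZE = 50
pooled = PGBOUNCER_INSTANCES * PGBOUNCER_POOL_SIZE

print(f"direct: {direct} connections, headroom {DB_LIMIT - direct}")
print(f"pooled: {pooled} connections, headroom {DB_LIMIT - pooled}")
```

While both environments share one database, the old platform's connections come out of the same headroom, so the budget has to cover both sides.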

The OSS stack

Orchestration: Kubernetes (EKS), Helm, Istio
Delivery: GitHub Actions, Flux, Atlantis, Terraform + Terragrunt
Observability: Prometheus + Thanos, Grafana, Elastic Stack
Developer Experience: Backstage
Networking + State: cert-manager, external-dns, PgBouncer, Redis, SQS

Security guardrails: IRSA (workload identity), external-secrets, Trivy (image scanning), NetworkPolicy, admission webhooks
Adopted-when-OSS, kept after license shifts: Terraform (BSL since 2023-08), Redis (RSAL/SSPL since 2024-03), Elastic (SSPL since 2021-03)
Not OSS by design (managed services): GitHub Actions, AWS SQS, ElastiCache, RDS
If we were starting in 2026: OpenTofu, Valkey, OpenSearch as defaults

Compliance & DR

PCI-DSS: Scope re-mapped from Heroku Postgres to RDS over the cutover. IRSA for workload identity, external-secrets with sealed secrets in git, Trivy scanning in CI from day one. Re-certified post-migration without a finding.

Disaster recovery: Active-passive across two AWS regions (ca-central-1 primary, us-east-1 warm standby). RPO 15 minutes via cross-region Postgres replicas. RTO 30 minutes for the application tier. Quarterly failover exercises.

Still working on: Backstage cost-attribution dashboards, Istio Ambient mode evaluation. Migration is not over. It's a beginning.

The migration checklist

1. Instrument before you migrate. Two weeks of baseline metrics on the platform you are leaving.
2. Pick your least-critical service. Use Istio, Linkerd, or NGINX weighted upstream for traffic shifting.
3. Make every infrastructure change a pull request. Atlantis or Digger if you are starting fresh.
4. Set CPU limits with profiling, not guessing. Add container_cpu_cfs_throttled_periods_total to every default dashboard.
5. Audit every Dockerfile for shell-form CMD. Switch to exec-form. Add SIGTERM handlers in your apps.
6. Override pod ndots before you migrate any service that talks to external APIs. Most Heroku-style apps assume ndots:1.
7. Deploy PgBouncer alongside Postgres in the new environment, even if you didn't have it on the old one.
8. Plan dual-running with shared databases carefully. Both environments share the same blast radius until you cut over.
9. Build a developer portal in parallel. Backstage Scaffolder is the highest-ROI feature. Catalog comes later.
10. If you still meet your SLAs on your current PaaS, do not migrate. Migration is a means, not a goal.
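The Dockerfile audit in the checklist is easy to script. The sketch below is one possible approach, not the tooling used in the migration: it flags any CMD or ENTRYPOINT whose argument is not a JSON array of strings. It does not handle backslash line continuations, and the sample Dockerfile and server.js are hypothetical.

```python
import json
import re

INSTRUCTION = re.compile(r"^\s*(CMD|ENTRYPOINT)\s+(.*)$", re.IGNORECASE)


def is_exec_form(argument: str) -> bool:
    """Exec form is a JSON array of strings, e.g. ["node", "server.js"]."""
    try:
        parsed = json.loads(argument)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, list) and all(isinstance(p, str) for p in parsed)


def shell_form_lines(dockerfile_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) for each shell-form CMD or ENTRYPOINT."""
    findings = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = INSTRUCTION.match(line)
        if match and not is_exec_form(match.group(2).strip()):
            findings.append((number, line.strip()))
    return findings


# Hypothetical Dockerfile for illustration.
SAMPLE = """FROM node:20
WORKDIR /app
COPY . .
CMD npm start
"""

for number, line in shell_form_lines(SAMPLE):
    print(f"line {number}: shell form, PID 1 will be /bin/sh: {line}")
```

A line such as CMD ["node", "server.js"] passes the check, because the process then runs as PID 1 and receives SIGTERM directly.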

And if your current PaaS meets your needs?

That's a valid answer too. Not every team needs Kubernetes. git push heroku main is still the best deploy UX I've ever used. If your team isn't ready for the highest-risk highest-ceiling option, that's not a failure. That's a correct read of your situation.

Slides & recording

The slide deck (16:9 PDF, 25 slides) is available below. Recording will be posted to the Linux Foundation YouTube channel within ~2 weeks of the event.

Slides PDF · Recording link (after ~Jun 1)

Mateen Ali Anjum — Staff DevOps Engineer, PhonoTech (Phono Technologies Inc.)

phonotech.ca · Contact PhonoTech · More case studies
