Leading Through a Platform Rewrite

The art and science of re-architecting core systems without killing momentum or morale.

Why rewrite at all?

Rewrites are justified when foundational constraints make incremental change slower and riskier than replacement: chronic reliability issues, data models that cannot scale, gaps in security posture, or a platform that blocks new business models. If a refactor can restore delivery speed within one to two quarters, refactor. If not, plan a rewrite, but keep running the business while you do it.

Guiding principles

Two speeds, one roadmap: keep shipping customer value while carving the new core in parallel.
Contracts before code: stabilize APIs, events, and schemas so teams can move independently.
Thin vertical slices: migrate capability-by-capability, not layer-by-layer.
SLOs as guardrails: reliability targets and error budgets decide release pace and cutover timing.
Evidence over opinions: instrument everything—decisions ride on data.
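
One way to make "SLOs as guardrails" concrete is to compute how much of the error budget remains and let that number set the release pace. The sketch below is illustrative only: the names `ErrorBudget` and `release_pace`, the 25% slow-down threshold, and the three pace labels are assumptions, not something prescribed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched budget, 0.0 = exhausted, negative = overspent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_pace(budget: ErrorBudget, slow_below: float = 0.25) -> str:
    """Map remaining error budget to a release pace (thresholds are assumed)."""
    remaining = budget.remaining_fraction
    if remaining <= 0.0:
        return "freeze"   # budget spent: stop increasing exposure
    if remaining < slow_below:
        return "slow"     # budget nearly spent: smaller, slower steps
    return "normal"
```

With a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed; once 900 have been spent, about 10% of the budget remains and this sketch returns "slow".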

Execution blueprint: the Strangler Fig in five moves

Find the seams: map domains, user journeys, and shared data. Choose a first slice with high pain and clear boundaries.
Edge adapters: put a gateway/proxy in front of the old core; route a small cohort to the new path.
Dual-run safely: mirror traffic, compare behavior, and measure SLO deltas before increasing exposure.
Iterative migration: move one capability at a time; deprecate old endpoints as contracts stabilize.
Decommission with ceremony: remove unused code, revoke credentials, and celebrate the entropy reduction.
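
The edge-adapter and dual-run moves above can be sketched as a small routing function. This is a minimal illustration under stated assumptions: `in_cohort`, `handle`, and the `salt` value are hypothetical names, and `old_core` / `new_core` are plain callables standing in for the two backends. It also assumes the shadowed requests are read-only; mirroring writes needs extra care, and a real gateway would normally issue the shadow call asynchronously so it adds no latency.

```python
import hashlib
from typing import Callable, List


def in_cohort(account_id: str, percent: int, salt: str = "slice-1") -> bool:
    """Deterministically place an account into one of 100 buckets.

    The same account always lands in the same bucket, so raising `percent`
    only adds accounts to the cohort and never removes them.
    """
    digest = hashlib.sha256(f"{salt}:{account_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def handle(
    request: dict,
    old_core: Callable[[dict], dict],
    new_core: Callable[[dict], dict],
    percent: int,
    mismatches: List[dict],
) -> dict:
    """Serve the cohort from the new path; shadow-compare everyone else."""
    if in_cohort(request["account_id"], percent):
        return new_core(request)

    old_response = old_core(request)
    try:
        shadow = new_core(request)  # result is compared, never returned
        if shadow != old_response:
            mismatches.append({"request": request, "old": old_response, "new": shadow})
    except Exception as exc:  # a failing shadow call must not hurt the user
        mismatches.append({"request": request, "old": old_response, "error": repr(exc)})
    return old_response
```

The `mismatches` list is a stand-in for whatever diff log or dashboard feeds the decision to increase exposure.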

Data strategy that won’t bite later

Contracts: publish versioned schemas (OpenAPI/JSON Schema/Avro) and change policies.
Movement: use change data capture (CDC) or event ingestion for backfills; during cutover, prefer idempotent dual writes so that a retried write is applied only once.
Reconciliation: build diff tools and dashboards; treat data drift as a P1 during migration.
Ownership: domain teams own both code and data; central platform provides the rails.
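
To illustrate the movement and reconciliation points, here is a minimal in-memory sketch of idempotent dual writes plus a drift report. Everything in it is an assumption made for illustration: `Store` stands in for a real datastore, and `dual_write`, `drift_report`, and the idempotency-key scheme are hypothetical. Real dual writes are not atomic across two systems, so the caller is expected to retry with the same key until both sides accept the write.

```python
from typing import Dict, Set


class Store:
    """In-memory stand-in for a datastore that applies each write at most once."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.applied: Set[str] = set()

    def apply(self, idempotency_key: str, row_id: str, row: dict) -> bool:
        # A key that was already applied makes the retry a no-op.
        if idempotency_key in self.applied:
            return False
        self.rows[row_id] = dict(row)
        self.applied.add(idempotency_key)
        return True


def dual_write(old: Store, new: Store, idempotency_key: str, row_id: str, row: dict) -> None:
    """Write to both stores; safe to call again with the same key after a failure."""
    old.apply(idempotency_key, row_id, row)
    new.apply(idempotency_key, row_id, row)


def drift_report(old: Store, new: Store) -> dict:
    """Compare the two stores row by row and summarize the differences."""
    old_ids, new_ids = set(old.rows), set(new.rows)
    missing = sorted(old_ids - new_ids)   # present in old, absent in new
    extra = sorted(new_ids - old_ids)     # present in new, absent in old
    changed = sorted(i for i in old_ids & new_ids if old.rows[i] != new.rows[i])
    total = len(old_ids | new_ids)
    drifted = len(missing) + len(extra) + len(changed)
    return {
        "missing": missing,
        "extra": extra,
        "changed": changed,
        "drift_rate": drifted / total if total else 0.0,
    }
```

The `drift_rate` value is the kind of number a reconciliation dashboard could track against an agreed threshold.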

Team topology & roles

Platform Core: runtime, CI/CD, observability, developer platform.
Enablement Guild: patterns, templates, reviews—turn decisions into paved roads.
Product Pods: ship features on top of new contracts; they are the canaries.
Migration SWAT: adapters, dual-run, backfills, and decommissioning.
Change Office: comms, stakeholder management, risk register, and OKR tracking.

Safety & governance

Progressive delivery: feature flags, canaries, and automatic rollbacks.
Blast radius control: cohorting by region/account/tier with fast disable paths.
Security-by-default: least privilege, secret rotation, and audit trails before go-live.
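
Progressive delivery with automatic rollback can be reduced to a small decision rule. The sketch below is one possible shape, not a reference implementation: the `Window` type, the exposure steps, the minimum sample size, and the allowed error-rate increase are all assumed values chosen for illustration.

```python
from dataclasses import dataclass

# Assumed exposure ladder, in percent of traffic on the new path.
EXPOSURE_STEPS = [1, 5, 25, 50, 100]


@dataclass(frozen=True)
class Window:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Window,
    canary: Window,
    min_requests: int = 500,
    max_abs_increase: float = 0.005,
) -> str:
    """Return "hold", "rollback", or "promote" for the current canary step."""
    if canary.requests < min_requests:
        return "hold"      # not enough traffic to judge either way
    if canary.error_rate > baseline.error_rate + max_abs_increase:
        return "rollback"  # canary is measurably worse than the old path
    return "promote"


def next_exposure(current: int, decision: str) -> int:
    """Apply a decision to the current exposure percentage."""
    if decision == "rollback":
        return 0           # fast disable path: all traffic back to the old core
    if decision == "hold":
        return current
    for step in EXPOSURE_STEPS:
        if step > current:
            return step
    return EXPOSURE_STEPS[-1]
```

A rollback drops exposure to zero rather than stepping down one level, which keeps the disable path simple and fast.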

Communication that keeps morale high

Narrative: explain the why in business terms (speed, reliability, new markets).
Cadence: a public migration board and biweekly demos beat status emails.
Recognition: celebrate deprecations, not just launches—deleting code is creating capacity.

Metrics that matter

DORA: deployment frequency, lead time for changes, change failure rate, time to restore service (MTTR).
SLOs: p95 latency, availability, and error budgets on critical journeys.
Migration burn-down: % of traffic on new core, number of endpoints decommissioned, data drift rate.
DevEx: time-to-first-PR on new paved road; scaffold-to-prod time.
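
As a rough illustration of how the four DORA numbers can be derived from deployment records, consider the sketch below. The `Deploy` record and `dora_summary` function are hypothetical; where the timestamps come from (CI system, incident tracker) and how a "failed" deploy is defined are assumptions each team has to settle for itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass(frozen=True)
class Deploy:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: List[Deploy], window_days: int) -> dict:
    """Summarize the four DORA metrics over a reporting window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")

    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]

    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        # None when no failed deploy has a recorded restore time.
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```

The median is used for lead time here because a few very slow changes would otherwise dominate the figure; a mean is an equally valid choice if the team prefers it.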

30 / 60 / 90

30 days: domain map, target slice, contracts drafted; observability baseline; cutover plan v1.
60 days: gateway live, dual-run in shadow; first slice canary; backfill pipeline operating.
90 days: 30–50% of traffic on the new path for the first slice; old endpoints that no longer receive traffic decommissioned; next two slices queued.

Definition of Done (for each migrated slice)

Contracts versioned and documented; SDKs/docs generated.
SLOs met under production load; rollback tested.
Data reconciled, with drift below the agreed threshold.
Old endpoints removed; credentials revoked; runbooks updated.
Postmortem + learnings shared; templates updated.

Anti-patterns to avoid

Big-bang cutovers: they maximize risk and minimize learning.
Layer-first rewrites: migrate a capability end-to-end instead.
Invisible work: if the business can’t see progress, it will cancel the project.

Great rewrites are leadership problems disguised as architecture problems. Protect momentum, make safety visible, and ship value every sprint. That’s how you change the engine while the plane is flying—and land smoother than you took off.