

TL;DR: The Cryptorefills engineering team migrated our entire production infrastructure (over 20 services, live databases, blockchain nodes, and more than thirty partner integrations) from one AWS account to a brand new one, without stopping the business. The hardest problems were a live database failover we could not afford to get wrong, a payment processor that had not yet whitelisted our new IP (solved with a creative proxy), and DNS cutovers where a single command redirected all customer traffic. We used Claude Code throughout, and it materially changed what a small team could accomplish.
Sometime in early 2026, the Cryptorefills engineering team found ourselves staring at a problem no one wants to face: move our entire production infrastructure from one AWS account to a brand new one. Not a single app. Not one service. The whole thing: every database, every blockchain node, every external integration, every secret.
This is the story of how a small team pulled it off, what broke along the way, and why the most honest thing we can do is tell you about both.
Internal organizational decisions required us to operate under a fresh AWS account. That was the forcing function, and it was not optional.
But once we accepted we had to move, we saw the opportunity. Like many fast-growing startups, our infrastructure had accumulated decisions that made sense at the time but no longer met our standards. All compute sat in a single availability zone (one of Amazon's independently powered data centers), meaning one facility-level incident could theoretically take everything offline. It never happened, but it was a risk we carried every day. Our outbound internet traffic routed through one tiny, manually managed server that served as a single point of failure. And too much critical knowledge lived in people's heads rather than in code.
These were architectural shortcomings, not active incidents (customer data remained protected throughout), but they represented a level of risk we were no longer comfortable carrying. Rebuilding gave us the chance to do things right. The question was whether we could pull that off without breaking the platform thousands of customers relied on daily.
To understand why this migration was hard, you need to understand what Cryptorefills looks like under the hood.
From the outside: a platform where you buy gift cards, mobile top-ups, flights, and stays with cryptocurrency. Underneath: an interconnected ecosystem of over 20 services spanning multiple programming languages and connections to dozens of external partners.
Applications. A Next.js frontend, a Java/Scala backend API handling core business logic, a content management system, and several supporting web applications, each with its own build pipeline, deployment configuration, and set of environment variables.
Data. A MongoDB replica set (three coordinated database servers that mirror each other, so if one fails the others keep running) holding the entire business state: orders, users, products, payment records. An Elasticsearch cluster powering product search. WordPress and CMS databases.
Blockchain infrastructure. Full blockchain nodes for the networks we support, each with hundreds of gigabytes of chain data that takes days to sync from scratch. A payment server managing cryptocurrency wallets, sitting on four terabytes of storage.
External integrations. Connections to over thirty partners: gift card suppliers, mobile top-up providers, payment processors, cryptocurrency exchanges, identity verification services. Many authenticate our requests partly by checking our server's IP address, meaning our new infrastructure would need to announce itself and wait for each partner's approval before those integrations worked.
Monitoring and observability. Technical monitoring tracking CPU, memory, disk, pod health, and network errors across every service with automated alerts when thresholds are breached. Business monitoring dashboards for operational visibility. A full logging pipeline: application logs collected, ingested through Logstash, indexed in Elasticsearch, and visualized in Kibana for troubleshooting and analysis. All of this had to be rebuilt from scratch in the new environment.
Backup and data protection. Automated backup plans for database volumes, application data, and cryptocurrency wallet state with daily, weekly, and monthly retention schedules. Wallet channel backups archived to separate storage. Every backup job monitored and alerted on failure.
Operations. Twenty-one CI/CD pipelines. Container registries. Three CDN distributions. Message queues handling order processing in the background so customers do not wait while the system talks to suppliers. DNS records controlling where every customer request lands. And over a hundred secrets (API keys, database credentials, cryptographic keys, and certificates) organized across dozens of categories in secrets management.
Think of replacing a major highway interchange while thousands of cars keep driving through it. Every lane, every ramp, every sign has to be swapped out, and drivers should barely notice the construction.
Our old setup worked. It had served us well for years, running on Amazon ECS with four large servers (over 250 GB of RAM total) managing 23 services with automated deployments via CI/CD pipelines. But it had reached the limits of what it was designed for. All compute in a single availability zone meant we carried concentration risk. The management layer lacked the isolation and access controls that a mature payment platform demands. And while deployments were automated, the underlying infrastructure itself was not fully codified; making changes required more institutional knowledge than documentation.
The new world would be fundamentally different.
We chose Kubernetes (Amazon EKS): think of it as a traffic controller that automatically manages where each application runs, restarts it if it crashes, and scales it up if demand spikes. Everything defined in Terraform modules and Kubernetes manifests: version-controlled, reproducible, auditable. The network split into clearly separated zones (one for public-facing traffic, one for internal applications, one for databases, one for engineering access) so a breach in one zone cannot cascade into others. A proper managed NAT gateway (the server that handles outbound internet traffic) with a static IP for partner whitelisting. A VPN (an encrypted private tunnel) required for all administrative access. Each service carrying exactly the permissions it needs, nothing more.
SSL certificates were provisioned and validated in the new account. Secrets flow from AWS Secrets Manager into applications automatically. This was not just a migration. It was an architectural leap forward. And that made it simultaneously exciting and terrifying.
With the new environment designed and provisioned, the real work began: moving live services without breaking them. Most of the migration was methodical and uneventful, but three problems stood out.
The MongoDB migration required careful planning and precise execution. This database held everything: customer accounts, order history, payment records. Losing data was not an option. Extended downtime was not an option.
We chose replica set expansion, the approach MongoDB recommends for live migrations. Instead of copying the entire database in one shot (which requires downtime), you add new servers to the existing cluster and let the database replicate itself across, like onboarding new team members without stopping the work.
Our existing cluster in the old environment ran three nodes: a primary, a secondary, and an arbiter. We deployed new MongoDB nodes in the Kubernetes environment, backed by faster, larger encrypted storage. We gave the new servers the authentication credentials to join the existing cluster, then added them one at a time as silent observers: they received a copy of all the data but had no say in which server was in charge. We watched replication lag metrics for hours, waiting for each new node to fall under one second of delay.
With the new nodes healthy and fully synced, we updated application connection strings to include all members. Then came the critical moment: we granted the new nodes voting rights and higher priority, and issued a stepdown command to the old primary. For somewhere between five and twelve seconds, the database cluster went through a leadership handover. During those seconds, any attempt to save new data was held in a queue, frozen. Then the new node won the election, became primary, and the queue flushed. Applications reconnected automatically.
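The shape of those reconfiguration steps can be sketched as pure configuration manipulation. This is a simplified model with hypothetical helper names; on the real cluster, the equivalent changes are applied with `rs.add()`, `rs.reconfig()`, and `rs.stepDown()` in the mongo shell:

```python
from copy import deepcopy

def add_silent_members(rs_config, new_hosts):
    """Return a new replica set config with new_hosts added as
    non-voting, priority-0 members: full copies of the data, but
    no say in which node is primary."""
    config = deepcopy(rs_config)
    next_id = max(m["_id"] for m in config["members"]) + 1
    for i, host in enumerate(new_hosts):
        config["members"].append(
            {"_id": next_id + i, "host": host, "priority": 0, "votes": 0}
        )
    config["version"] += 1  # every reconfig bumps the config version
    return config

def promote_members(rs_config, hosts, priority=2):
    """Once replication lag is acceptable, grant the new nodes voting
    rights and a higher election priority than the old primary."""
    config = deepcopy(rs_config)
    for member in config["members"]:
        if member["host"] in hosts:
            member["priority"] = priority
            member["votes"] = 1
    config["version"] += 1
    return config
```

After `promote_members`, stepping down the old primary triggers an election that the higher-priority new node is positioned to win.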
After the failover, we compared document counts across key collections and ran application-level consistency checks before moving on. Then we carefully removed the old nodes from the cluster lowering their priority first, stripping voting rights, then removing them one by one, observing cluster health after each step.
The entire operation took about an hour. The customer-facing impact was a brief pause in write operations during the election: a matter of seconds.
Here is something that rarely appears in architecture diagrams: dozens of external partners authenticate our API requests partly by checking the source IP address. New infrastructure means a new outbound IP. Every partner needs to approve the new one.
We sent whitelist requests to over thirty partners weeks before cutover. Most responded within days. But processing these requests takes time on both sides, and one critical payment provider's approval was still pending when we needed to complete the migration. After the switch, their API began rejecting every request from our new IP. Payments through that provider stopped.
We needed a workaround, and we needed it fast.
The solution was creative and admittedly hacky. We set up a simple forwarding service on the old infrastructure's server: the same tiny machine that used to route all our traffic, still running, still carrying the whitelisted IP address. Then we essentially told our new systems to lie about where the partner's service lived: instead of sending requests directly to the partner, we directed them to the old server first, which forwarded them onward using the still-trusted IP address. Requests flowed through the private bridge between accounts, hit the forwarder, and exited to the internet with the right return address.
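The forwarder itself is conceptually tiny. A minimal sketch in Python (illustrative only: our production forwarder, hostnames, and ports differ, and because the relay copies raw bytes, TLS passes through untouched and terminates at the partner, not at the old server):

```python
import socket
import threading

def _pipe(src, dst):
    # Copy bytes one way until the source closes, then half-close the sink.
    try:
        while chunk := src.recv(4096):
            dst.sendall(chunk)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def start_forwarder(target_host, target_port, listen_port=0):
    """Accept local connections and relay each one, byte for byte, to
    (target_host, target_port). Returns the port actually bound, so an
    ephemeral port (listen_port=0) can be used for testing."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", listen_port))
    server.listen()

    def accept_loop():
        while True:
            client, _ = server.accept()
            upstream = socket.create_connection((target_host, target_port))
            threading.Thread(target=_pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=_pipe, args=(upstream, client), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server.getsockname()[1]
```

With the new environment resolving the partner's hostname to the old server's private address, requests crossed the account bridge, hit this listener, and left for the internet with the whitelisted IP.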
One more wrinkle: our backend software stubbornly remembered address mappings and would not check for updates. We had to flip a configuration switch to make it refresh its memory every sixty seconds, ensuring the workaround would actually take effect and could be cleaned up later.
It was a hack. It added an extra network hop and a dependency on infrastructure we were trying to retire. But it kept payments flowing while we waited for the partner to confirm the new IP. Sometimes the right engineering decision is the one that keeps the business running.
DNS is the internet's address book: the invisible layer that routes every customer request to the right server. It is also the most nerve-wracking thing to change in production, because a mistake sends customers into the void.
Twenty-four hours before each service cutover, we lowered the DNS TTL (how long address records are cached) to 60 seconds. This meant that when we made the switch, most resolvers would reflect the change within about a minute and a rollback would propagate just as quickly.
The actual cutover was a single command: update one record in the DNS system, swapping the old server address for the new one. We had the undo command ready in a separate terminal window, one keystroke from execution. We had prepared and validated rollback procedures so the old environment could resume serving traffic if anything went wrong. Then we executed.
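Assuming Route 53 (record names and addresses here are placeholders), the cutover and its rollback are the same small payload with old and new values swapped, which is what made a one-keystroke undo possible:

```python
def dns_change(record_name, target_ip, ttl=60):
    """Build a Route 53-style UPSERT change batch pointing record_name
    at target_ip. A sketch: the real change was submitted through the
    AWS API, with the rollback version staged in a second terminal."""
    return {
        "Comment": f"cutover {record_name}",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target_ip}],
            },
        }],
    }

cutover = dns_change("api.example.com", "203.0.113.20")    # new environment
rollback = dns_change("api.example.com", "198.51.100.10")  # old environment
```

Because the two payloads are symmetric, rolling back is the same operation as cutting over, just pointed the other way.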
And then we watched. Dashboards. Logs. Health checks. The new server's request count climbing. Error rates holding at zero. Service health staying green across the board.
We monitored closely for forty-five minutes. When dashboards held green and error rates stayed flat, we moved on to the next service.
We did this multiple times: once for the frontend, once for the backend API, once for the CMS. Each time with the same ritual: lower the TTL, prepare the rollback, execute, watch, breathe.
We should be honest: not everything went smoothly.
Elasticsearch was the roughest edge. Migrating search infrastructure between environments involves moving large chunks of indexed data across a network bridge. The cross-environment link added enough latency to make this significantly slower than expected. During the transition window, search was degraded (queries were slower and some results were incomplete) while data was still being relocated.
Some lower-volume payment options had brief interruptions as we worked through IP whitelisting with various partners. Each provider had their own timeline, and a few lagged behind our migration schedule. These interruptions affected the availability of specific payment methods in the checkout flow, not the processing of transactions already in progress; no customer payments were lost or corrupted during the migration.
We aimed for minimal disruption and achieved it for the vast majority of features. But calling this a "zero-downtime migration" would be dishonest. It was a migration with brief, managed degradation windows, executed with rollback plans ready for each one.
There is a thread running through this entire project that we want to be transparent about: we used AI extensively. Specifically, Claude Code (Anthropic's command-line AI coding tool) was involved in nearly every phase.
This was not a gimmick. It was born from necessity. A migration of this scope would typically require dedicated platform engineers, database administrators, networking specialists, and security reviewers. We are a small team. We needed all of those skill sets simultaneously, available at any hour.
Where it proved most valuable was operational debugging during the live migration. When the system monitoring whether our API was running kept reporting errors, we diagnosed a simple misconfiguration (the health monitor was knocking on the wrong door) in minutes rather than hours. When the payment processor rejected requests from our new IP, we designed and implemented the proxy workaround in a single session.
A significant portion of the Terraform modules, Kubernetes manifests, and migration scripts started as drafts produced collaboratively with Claude Code, then went through the same review and testing process we apply to any infrastructure change. The cutover runbook, with its pre-flight checklists, step-by-step procedures, and rollback triggers, was drafted with AI assistance and refined with operational experience.
Claude Code also helped evaluate trade-offs between migration approaches: should we use replica set expansion or dump-and-restore for MongoDB? Single NAT gateway for predictable IP addressing or multi-NAT for resilience? We talked through these decisions in detail, and the AI provided analysis grounded in real documentation and migration patterns.
To be clear about what AI did not do: it did not make the judgment calls. It did not decide when we were ready to cut over. It did not hold the rollback button at midnight. It did not know which partners were critical and which could tolerate a brief interruption. Those decisions required human context, business understanding, and the willingness to be accountable. But as a force multiplier for a small team (an always-available collaborator who could shift between Terraform syntax, Kubernetes networking, MongoDB operations, and Java configuration in a single conversation), the effect was significant. Tasks that would normally take a day of research often took hours. Across dozens of such tasks, the cumulative impact on our timeline was substantial.
Single exit point, one static IP. We deliberately routed all internet-bound traffic through a single managed gateway: one consistent IP address that partners can whitelist. If that gateway's availability zone goes down, outbound connectivity fails. We accepted this because the old setup had the same single point of failure with worse reliability.
Private control plane. Our infrastructure management interface is completely hidden from the public internet. Every engineer who wants to manage a server must first connect through an encrypted tunnel. This is the right security posture, but it was an adjustment for a team accustomed to working over the open internet.
Spot instances for non-critical workloads. Supporting services run on spare cloud capacity that costs significantly less but can be reclaimed with two minutes' notice. We sized our always-on capacity to cover the critical path and use discounted capacity for everything else.
Infrastructure as code, no exceptions. Every resource defined in Terraform or Kubernetes manifests. Nothing created by clicking through a console.
Your hardest problems will be at the edges, not the core. Terraform and Kubernetes are well-documented. The real challenges were external: partners who whitelist your IP, DNS records in the wrong account, application-level caching behavior, health check paths that differ between environments. Budget extra time for boundary issues.
Prepare rollback plans before you need them. For every cutover, we prepared and validated rollback procedures. We kept old infrastructure running for days after each migration. We never had to execute a full rollback but having the option gave us the confidence to move forward.
If your database supports native replication, use it for migration. MongoDB's replica set expansion gave us a near-zero-downtime path that avoided the risk of a full data export and import. Expand the set, sync the data, fail over, contract. The drivers handle the rest.
AI is genuinely useful for infrastructure work. Not because it sounds innovative because it materially changed how fast we could move. The combination of broad technical knowledge and instant context-switching made Claude Code function like an experienced infrastructure engineer available around the clock.
Do not claim zero downtime if it is not true. Brief windows of degraded functionality are far more manageable than a reputation for dishonesty. Acknowledge them upfront, have communication plans ready, and build more trust than you would by pretending everything was invisible.
This migration took approximately four weeks from initial planning to the final service cutover. We moved over 20 services, multiple databases, blockchain nodes with terabytes of data, and dozens of external integrations to a completely new AWS account. Throughout the process, all personal data remained within AWS infrastructure in the same region, encrypted at rest and in transit.
The new infrastructure is more secure, more resilient, and fully defined in code. It is not perfect (infrastructure never is). But it is a foundation we can build on with confidence.
And if we ever have to move again, at least this time we will have the Terraform modules ready.
We are Cryptorefills, a small engineering team building payment infrastructure for the global crypto economy. If you want to learn more about how we operate, explore career opportunities, or just talk shop about infrastructure, we would like to hear from you.