Feb 01, 2024
About 4 years ago, we were asked to prepare an infrastructure for a new e-commerce platform. There was only one requirement: it had to be failure-resistant. We’re not ones to back away from a challenge, so after some internal brainstorming, we came up with a solution.
We purchased production bare metal machines from independent suppliers and created an internal network. Then, we configured these machines to work in a synchronised manner with the following services:
- MariaDB, a SQL database replicated across 3 nodes,
- GlusterFS, as a shared file system space,
- RabbitMQ, as a message broker,
- Redis, an in-memory data structure store, used as cache and session data storage.
Our application was built in PHP with the Symfony framework. For CDN, WAF, and load balancing, we used Cloudflare.
In the beginning, everything worked well. Our reliability goal was met: even when problems with some suppliers appeared, the service continued to operate thanks to our approach.
However, over time, our client’s business needs began to grow. We added some internal applications, a blog, landing pages, and more, which meant running more virtual machines. We regularly updated all the machines and services, but doing so slowly became problematic and time-consuming. We came to a conclusion: the current solution was no longer working. That’s when we decided to move to the cloud.
Once that decision was made, we needed to define our goals. We discussed them in internal meetings until we agreed on three main ones:
- creating and maintaining Infrastructure as Code (IaC),
- eliminating configuration drift,
- improving the deployment process.
After some research, we chose Amazon Web Services (AWS) as our cloud provider and Terraform as our IaC tool for provisioning and managing the infrastructure. We also used Kubernetes to configure, automate, and scale our applications.
Migration became a separate project. We set up a dedicated Jira board and met regularly to review our progress. We divided the tasks into two streams: the migration of the infrastructure and the migration of the applications. We also committed to preparing everything around them, such as CI/CD, which we based on Bitbucket Pipelines.
When it came to infrastructure migration, we wanted to keep costs down, so we first decided which AWS services were essential and which could be replaced by open-source equivalents. Once we had the list of AWS services we would use, we created the corresponding tasks to update the application code, for example, replacing GlusterFS with S3 or using SQS instead of RabbitMQ.
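A swap like GlusterFS to S3 tends to be easier when application code talks to storage through a small interface. The sketch below is only an illustration of that idea, written in Python rather than the project's PHP. `FileStorage`, `LocalDiskStorage`, `InMemoryObjectStorage`, and `save_invoice` are hypothetical names, and the in-memory class merely stands in for an S3-backed implementation.

```python
from pathlib import Path
from typing import Protocol


class FileStorage(Protocol):
    """Minimal storage contract the application depends on (hypothetical)."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class LocalDiskStorage:
    """Reads and writes under a directory, e.g. a shared file system mount."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryObjectStorage:
    """Stand-in for an object store; a real version would wrap an S3 client."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def save_invoice(storage: FileStorage, order_id: int, pdf: bytes) -> str:
    """Application code sees only the interface, never the backend."""
    key = f"invoices/{order_id}.pdf"
    storage.put(key, pdf)
    return key
```

With this shape, switching backends means changing which class is constructed at start-up, while functions like `save_invoice` stay untouched.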
Preparing a test environment and getting the applications running took several weeks. None of us sat idle: during this time, our QA team prepared test scenarios and marked the ones that passed. We also prepared a load test with Grafana k6 to check whether our auto-scaling configuration worked properly.
As soon as all crucial tasks were marked as done and all test cases were glowing with a positive green light, we started preparing the production environment and set the migration date with the client.
Believe it or not, the entire operation took a single night, with everyone focused on the task at hand. How was that possible? We prepared a document outlining all 20 tasks, each with an estimated completion time and a person assigned to it. We stuck to it and achieved what seemed impossible.
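A task document of this kind can be modelled as a simple list of tasks with owners and estimates. The following is a minimal sketch, not the actual plan: the task names, owners, durations, and start time are placeholders, and it assumes the tasks run strictly one after another.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Task:
    name: str
    owner: str
    estimate: timedelta


def schedule(tasks: list[Task], start: datetime) -> list[tuple[datetime, Task]]:
    """Give each task a planned start time, assuming sequential execution."""
    planned = []
    cursor = start
    for task in tasks:
        planned.append((cursor, task))
        cursor += task.estimate
    return planned


# Placeholder entries only; the real document listed 20 tasks.
runbook = [
    Task("Stop the shop", "Alice", timedelta(minutes=5)),
    Task("Transfer database data", "Bob", timedelta(minutes=60)),
    Task("Start production behind VPN", "Alice", timedelta(minutes=20)),
    Task("Run critical QA scenarios", "Carol", timedelta(minutes=45)),
    Task("Open public access", "Bob", timedelta(minutes=10)),
]

start = datetime(2024, 1, 1, 23, 0)  # arbitrary example start time
for planned_start, task in schedule(runbook, start):
    print(f"{planned_start:%H:%M}  {task.owner:<6} {task.name}")

# Sum of all estimates gives the planned length of the maintenance window.
total = sum((t.estimate for t in runbook), timedelta())
print("Planned duration:", total)
```

Keeping owner and estimate next to each task makes it easy to see, mid-operation, whether the team is ahead of or behind the plan.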
The most crucial part of this undertaking was stopping the ongoing e-commerce operations for a moment. There were several steps we needed to take before putting the shop back online. The first was transferring the data from the databases. Then, we brought the production service up but blocked access from outside our VPN. In the third step, we let our QA engineers in so they could test the most critical scenarios from the previously prepared paths.
Once we had ensured that all the essential parts of the application were working properly, we opened public access to the website. We successfully moved everything to the cloud in one night. We delivered on our promise, and our client lost very little time to this improvement.
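The sequence above (stop, transfer data, start behind the VPN, test, open to the public) can be sketched as an ordered list of gated steps, where a failed step triggers a fallback to the previous environment. This is an illustrative outline, not the tooling we used; `run_cutover` and the stub actions are hypothetical.

```python
from typing import Callable

# A step is a label plus an action that reports success or failure.
Step = tuple[str, Callable[[], bool]]


def run_cutover(steps: list[Step], rollback: Callable[[], None]) -> bool:
    """Run steps in order; on the first failure, call rollback and stop."""
    for name, action in steps:
        print(f"-> {name}")
        if not action():
            print(f"!! {name} failed, rolling back")
            rollback()
            return False
    return True


def always_ok() -> bool:
    """Stub action; a real step would call deployment or test tooling."""
    return True


steps: list[Step] = [
    ("Stop e-commerce operations", always_ok),
    ("Transfer database data", always_ok),
    ("Start production, VPN-only access", always_ok),
    ("QA runs critical scenarios", always_ok),
    ("Open public access", always_ok),
]

succeeded = run_cutover(steps, rollback=lambda: print("Back to old environment"))
print("Cutover succeeded:", succeeded)
```

The point of the gate is that public access is the last step, so it is only reached if every check before it has passed.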
Lessons & results
A project like that is a challenge, and each challenge brings new lessons. In this case, we didn’t just learn more about technologies such as Terraform and Kubernetes; we also gained more experience in planning large operations, dividing the workload, and operating under time pressure.
Lesson no. 1
In digital projects, preparing sandbox environments and performing the necessary operations in them is surprisingly easy and cost-effective. For instance, we could prepare and test the whole migration operation without disrupting the production environment.
Lesson no. 2
We estimated that the operation would take around three hours, and it took exactly three hours. We’ve learned that planning is the key to success. Our team felt more confident knowing exactly which tasks they needed to perform. We also had a plan B: an emergency procedure to return to the previous environment quickly.
Our results are, in fact, our achievements:
- we set up safe, up-to-date test and production environments,
- we defined the infrastructure as easy-to-manage code,
- we built infrastructure ready for rapid business changes,
- the average application response time decreased to around 100 ms,
- thanks to several improvements, the CDN now handles 92% of the traffic (up from 80%).
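To put the CDN figure in perspective, here is a quick back-of-the-envelope calculation. It assumes that both percentages measure the same kind of traffic and that the total volume is unchanged; `origin_share` is a hypothetical helper.

```python
def origin_share(cdn_percent: int) -> int:
    """Percentage of traffic that still reaches our own servers."""
    return 100 - cdn_percent


before = origin_share(80)  # origin share when the CDN handled 80%
after = origin_share(92)   # origin share when the CDN handles 92%

# Relative drop in traffic reaching the origin servers.
reduction = (before - after) / before
print(f"Origin traffic: {before}% -> {after}% ({reduction:.0%} less)")
```

Under those assumptions, a 12-point gain at the CDN means our own servers see well under half of the traffic they handled before.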
It goes without saying that we’re proud of what we’ve accomplished. However, there’s still more ahead:
- after a whole month of production operations, we’ll analyse the use of the service and go through the cost optimisation process,
- we’ll implement OpenTelemetry on the application side,
- we’ll monitor and trace using Grafana and Grafana Mimir.
Migrating to the cloud was a great decision. The current solution is flexible and allows us to benefit from all the products made available by the cloud providers, making it even easier to implement the most demanding applications.
But there’s more to a project like that. Such big challenges mean combining all our efforts, learning new things, and improving the inner harmony of our team. We’re all grateful that we have customers who have been with us for years and trust us with their technological solutions. It’s through teamwork that we can deliver the highest level of results.