Reach High Availability with a Multiple Cloud Deployment
Wednesday, December 3, 2014
This article is written by guest author Eugene Olshenbaum. Eugene is the Head of Media Platform at Wix, a cloud-based web development platform that makes it easy for everyone to create beautiful websites.
While some people are still debating whether to use a cloud service, we at Wix are debating how many to use. The more services we use, the more assurance we have that we can handle any failures. Multi-cloud environments help ensure business continuity by freeing developers from the constraints of a single provider, and they are becoming the next evolution in cloud platform architecture.
Dimensional Research recently interviewed 659 IT decision makers with cloud responsibilities in Australia, Brazil, Canada, Germany, the UK, US, and Singapore, and 77% of respondents said they either already have or plan to implement a multi-cloud infrastructure in the coming year. Only 8% are not planning to do so.
As a result of this growing trend, we thought it was time to revisit a recent blog post describing Wix’s disaster recovery strategy, as well as discuss our multi-cloud implementation at Wix.
At Wix.com, we provide a cloud-based web development platform that allows users to create HTML5 websites and mobile sites through the use of our online drag-and-drop tools. Wix Media Platform is one of the most important pieces of our infrastructure, supporting the 55 million websites running on Wix.com.
While providing tools for building functional websites like an eCommerce shop, hotel, or restaurant, we quickly realized that our customers care about only one thing: they want their site to always be online. And because we know that things fail no matter what, using multiple cloud providers is our solution to:
- Achieve at least five nines (99.999%) uptime
- Stay on top of the competition
- Eliminate the risks associated with the business continuity of the infrastructure provider, as well as risks related to electricity suppliers, networking providers, and other "data center" issues (since each cloud provider will usually operate separately).
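To make the uptime goal concrete, here is a minimal Python sketch of the arithmetic behind it. It shows the yearly downtime budget implied by an availability target, and how redundancy improves availability under two simplifying assumptions: providers fail independently, and any single healthy provider can carry all traffic. The function names and the 99.9% figure are illustrative assumptions, not Wix's measurements or any provider's SLA.

```python
from typing import Sequence

MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability target (0.0 to 1.0)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(availabilities: Sequence[float]) -> float:
    """Availability of redundant locations.

    Assumes failures are independent and that one healthy location is
    enough to serve all traffic, so the service is down only when every
    location is down at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


if __name__ == "__main__":
    # Five nines leaves roughly 5.26 minutes of downtime per year.
    print(round(downtime_minutes_per_year(0.99999), 2))
    # Two hypothetical 99.9% providers, if truly independent.
    print(combined_availability([0.999, 0.999]))
```

Real providers can share failure causes, so the independence assumption gives an upper bound rather than a forecast.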
Wix Media Platform High-Level Architecture
The new multi-cloud configuration of Wix Media Platform provides an active/active, strongly consistent setup across:
- Google Cloud Platform (primary)
- Amazon Web Services
- Wix-managed data centers
Wix’s platform relies on several subsystems, each of which provides its own service-level agreement (SLA). One of the key design guidelines is to keep each subsystem fully backed up by an independent equivalent in another location.
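The guideline of backing each subsystem with an independent equivalent can be sketched as a simple failover read. This is a minimal Python illustration, not Wix's code: `fetch_with_failover`, the backend names, and the two stand-in fetch functions are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Fetch = Callable[[str], bytes]


def fetch_with_failover(
    key: str, backends: Sequence[Tuple[str, Fetch]]
) -> Tuple[str, bytes]:
    """Try each independent backend in order and return the first success.

    Returns the name of the backend that answered together with the data.
    Raises RuntimeError only if every backend fails.
    """
    errors: List[str] = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure moves us on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Hypothetical stand-ins for two independent storage locations.
def broken_backend(key: str) -> bytes:
    raise ConnectionError("unreachable")


def healthy_backend(key: str) -> bytes:
    return b"media:" + key.encode()


if __name__ == "__main__":
    served_by, data = fetch_with_failover(
        "logo.png",
        [("location-a", broken_backend), ("location-b", healthy_backend)],
    )
    print(served_by, data)
```

The caller never needs to know why the first location failed. It gets an answer from the next one, and the failure can be investigated afterwards.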
The Challenge
We want to provide close to 100% uptime for data serving while protecting users’ data against loss. We originally ran our service in one managed hosting environment. To improve data disaster recovery, we added a second one, running both services in active/active mode. Later, we added a third data center to run our services in 3x active/active mode.
As we explained in our previous blog post, we learned that maintaining three cross-data-center replicas was much more complex than managing two, especially with the data centers owned by different ISPs for ISP redundancy. One of the challenges in 3x active/active mode was database replication. To replicate across three data centers, we had to configure MySQL in a ring topology. The ring would break when one data center went down for a long time or failed completely.
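A toy model shows why a ring is fragile. In the Python sketch below, each data center forwards changes only to the next one in the ring, so a single down node stops writes from reaching healthy nodes behind it. This is a deliberately simplified illustration with hypothetical names (`propagate`, `dc1` to `dc3`), not a model of Wix's actual MySQL configuration.

```python
from typing import AbstractSet, List, Set


def propagate(
    ring: List[str], origin: str, down: AbstractSet[str] = frozenset()
) -> Set[str]:
    """Return the nodes that receive a write made at `origin`.

    `ring` is ordered: each node relays changes to the next node, and the
    last node relays to the first. A node listed in `down` neither applies
    nor relays changes, so propagation stops when it is reached.
    """
    if origin in down:
        return set()
    reached = {origin}
    start = ring.index(origin)
    for step in range(1, len(ring)):
        node = ring[(start + step) % len(ring)]
        if node in down:
            break  # the chain is cut here; later nodes never see the write
        reached.add(node)
    return reached


if __name__ == "__main__":
    ring = ["dc1", "dc2", "dc3"]
    print(sorted(propagate(ring, "dc1")))                # all three nodes
    print(sorted(propagate(ring, "dc1", down={"dc2"})))  # dc3 is healthy but cut off
```

With `dc2` down, a write made at `dc1` never reaches `dc3`, even though `dc3` itself is fine.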
To address this, instead of implementing 3x active/active mode with our current infrastructure, we decided to run in 2x active/active mode, with the third replica running on an entirely different technology platform. The third replica also added protection against data poisoning (when a faulty piece of code unintentionally corrupts data and remains undetected for some time).
We decided to develop a fully functional, logical data center natively on Google Cloud Platform. After six months, in April 2013, we started to serve Wix media from Google Cloud Platform in monitored geographies. By the end of 2013, 100% of production traffic was served from Google Cloud Platform. We developed NORM on Google Cloud Platform. NORM (Not Only Replication Manager) is a generic replication bus that allows us to keep the data in sync in all logical locations: Google Cloud Platform, Amazon Web Services, and Wix data centers.
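This post does not describe NORM's internals, so the following Python sketch only illustrates the general idea of a replication bus. Every change is queued separately for each location, so an unreachable location falls behind and catches up later without blocking the others. The `Location` and `ReplicationBus` classes and the location names are hypothetical. The sketch also ignores conflict resolution, ordering across concurrent writers, and durability.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Tuple

Change = Tuple[str, bytes]


class Location:
    """In-memory stand-in for the data store of one logical location."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, bytes] = {}
        self.online = True

    def apply(self, key: str, value: bytes) -> None:
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


class ReplicationBus:
    """Fans each change out to every location through per-location queues."""

    def __init__(self, locations: Iterable[Location]) -> None:
        self.locations = list(locations)
        self.pending: Dict[str, Deque[Change]] = {
            loc.name: deque() for loc in self.locations
        }

    def publish(self, key: str, value: bytes) -> None:
        for loc in self.locations:
            self.pending[loc.name].append((key, value))
        self.flush()

    def flush(self) -> None:
        """Deliver queued changes in order; keep them queued on failure."""
        for loc in self.locations:
            queue = self.pending[loc.name]
            while queue:
                key, value = queue[0]
                try:
                    loc.apply(key, value)
                except ConnectionError:
                    break  # this location catches up later; others continue
                queue.popleft()


if __name__ == "__main__":
    a, b, c = Location("cloud-a"), Location("cloud-b"), Location("own-dc")
    bus = ReplicationBus([a, b, c])
    c.online = False
    bus.publish("logo.png", b"v1")  # a and b are updated, c is queued
    c.online = True
    bus.flush()                     # c catches up
    print(c.data)
```

Unlike the ring, one unreachable location here delays only its own copy of the data.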
Conclusion
As the leading cloud-based web development platform in the world, we have been paying very close attention to the string of recent cloud outages. Each minute of downtime is money our clients lose, so implementing a multi-cloud infrastructure to mitigate the risks associated with failures was a natural decision for us.
We believe the advantages of utilizing multiple cloud platforms heavily outweigh the challenges. Over time, we learned that the benefits went beyond extended capabilities, lower costs, and improved performance.
Operational efforts are way less stressful, and sleepless nights and crisis chat rooms are now in the past. In most cases we just switch traffic to a functional system and investigate failures afterwards. With this new implementation, our team can rest easier and still provide an exceptional customer experience.
- Posted by Eugene Olshenbaum, Director of Media Platform at Wix