Well, I am back from a little excursion outside of Tokyo to recharge and get temporarily away from the reality of the situation at the nuclear reactors up North in Fukushima.
What better time than now to post an article about Disaster Recovery and Business Continuity using VMware's Site Recovery Manager product?
As seismically active as Japan is, it would seem a functional requirement of vSphere designs to encompass DR/BC planning, so I was surprised to find out that we were one of the few companies in Japan implementing SRM in a production capacity when we started the project about 2 years ago. I suspect that the infancy of production VMware deployments here has some influence on this. I also suspect that companies with production deployments well in hand will now look toward the DR/BC benefits of this product, both now and in the near future.
Since I joined the company, going on 3 years ago, we have maintained 2 on-site datacenters (1 in Eastern and 1 in Western Japan). Up until my arrival, their main purpose was to provide fast access to essential services such as e-mail, file server, AD, etc., but the increase in speed of our private MPLS network between the Eastern and Western offices has made this less of an issue.
We looked at using the Western Japan datacenter as a recovery site in the event of a natural disaster or other break in business continuity. At first we used a manual process to test and implement this. We were then able to plan licensing for VMware SRM and implemented it in about 4 months, covering critical (SLA-backed) services comprising about 10 VMs.
We ended up having 2 "Recovery Plans", in VMware's terminology, each of which was tested multiple times in the past 2 years. We were able to automate the entire process, save the DNS switching, which required some manual intervention during the testing phase due to our use of static DNS. The only way to get the DNS to change automatically, and hence allow the services to be available to users externally, is through the use of dynamic DNS and DHCP. We will look at this more closely going forward, as we had some issues that I will describe shortly.
We also automated the switching of the VMs' IPs, DNS, routing, etc. using the IP address mapping CSV file described in the SRM documentation. This was tested as working until recently, when I upgraded to vCenter 4.1 and the IP address mapping configuration was inexplicably lost.
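In hindsight, a periodic check that every VM in a Recovery Plan still has an IP mapping would have caught this loss before it mattered. Here is a minimal sketch of such a check in Python. Note that the two-column CSV layout (`vm_name`, `recovery_ip`) and the `missing_mappings` helper are my own simplification for illustration, not the actual format of SRM's mapping file, and the list of plan VMs is assumed to come from wherever you track Recovery Plan membership.

```python
import csv
import io

def missing_mappings(plan_vms, mapping_csv_text):
    """Return the VMs in a recovery plan that have no usable recovery IP.

    The CSV layout (vm_name, recovery_ip) is a hypothetical simplification,
    not the real SRM mapping file format.
    """
    reader = csv.DictReader(io.StringIO(mapping_csv_text))
    # A VM counts as mapped only if its recovery_ip column is non-empty.
    mapped = {
        row["vm_name"].strip()
        for row in reader
        if (row.get("recovery_ip") or "").strip()
    }
    return sorted(set(plan_vms) - mapped)

# Made-up sample data: web01 has an empty IP, ad02 has no row at all.
sample_csv = """vm_name,recovery_ip
mail01,10.20.0.11
file01,10.20.0.12
web01,
"""
plan = ["mail01", "file01", "web01", "ad02"]

print(missing_mappings(plan, sample_csv))  # ['ad02', 'web01']
```

Run against an export of the real mapping data after every upgrade or Recovery Plan change, an empty list would mean every protected VM still has an address waiting for it at the recovery site.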
Finally, we use our SRM implementation in an additional, less common way. Whenever we have a scheduled power outage at one of our sites, we use SRM to carry out a controlled failover to the other site for critical external user services. Once the power outage or test is completed, we reverse the SAN replication and manually fail back the services (as VMware SRM 4.1 currently doesn't have a procedure for this).
Disaster and Resulting Failover
On 3/11 at 2:46PM local time, our VMware DR plan immediately went into motion. Thankfully we still had access to systems at our primary site, so we tested the accessibility of services as a first line of defense. Everything was fully operational. I performed some additional tests of the recovery plans at that time, to ensure that we were able to failover to the Western office, should another quake take out the power, or remove our access. All tests checked out OK.
After the smoke cleared, we learned that rolling blackouts would be taking place in the area of our Eastern Japan office, so we made plans to perform the "Recovery" operation to the other, functional office. For the most part the operation was successful, but not without a few issues. Namely, due to the lack of IP address mappings, we had some VMs for which the IP address did not switch over automatically. This, in combination with the fact that we hadn't documented some of the DNS changes in the appropriate zone files, meant that we had to figure out the IP addresses manually and add entries to the zone file instead of commenting/uncommenting the appropriate entries as we had done during previous testing. Also, it turned out that, without my knowledge, some additional VMs had been added to the Recovery Plan without having IPs reserved in the Recovery Site. We had to work dynamically to account for these last-minute changes.
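The commenting/uncommenting of zone file entries is the kind of step that could be scripted so it does not depend on memory during an outage. Below is a rough Python sketch of the idea. It assumes a convention we did not actually have in place: every site-specific record carries a trailing `; site=east` or `; site=west` tag, and the inactive copy is commented out with a leading `;`. The `switch_site` function, the tag format, and the sample addresses (taken from the documentation ranges) are all hypothetical.

```python
import re

# Hypothetical convention: site-specific records end with "; site=<name>".
TAG = re.compile(r";\s*site=(\w+)\s*$")
# One or more leading semicolons marks a commented-out record.
COMMENTED = re.compile(r"^;+\s?")

def switch_site(zone_text, active_site):
    """Uncomment records tagged for active_site, comment out the others."""
    out = []
    for line in zone_text.splitlines():
        match = TAG.search(line)
        if match is None:
            # Untagged lines (SOA, NS, shared records) pass through unchanged.
            out.append(line)
            continue
        record = COMMENTED.sub("", line)  # the record without a comment prefix
        if match.group(1) == active_site:
            out.append(record)
        else:
            out.append("; " + record)
    return "\n".join(out) + "\n"

zone = """$TTL 300
mail IN A 192.0.2.10 ; site=east
; mail IN A 198.51.100.10 ; site=west
www IN A 192.0.2.20
"""

print(switch_site(zone, "west"))
```

The sketch deliberately leaves the SOA serial alone, so the serial would still need to be incremented and the zone reloaded before secondaries pick up the change. It also only helps if every record is tagged ahead of time, which is exactly the documentation step we missed.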
All in all, we got the issues resolved manually and restored services to users within an acceptable timeframe. Thankfully this took place over the weekend, when the end users were out of the office. The problems we encountered will serve as lessons for better DR planning in the future and should prompt management to schedule more frequent DR tests.
I hope that the relative success of the DR plan will also open management's eyes to more prevalent use of VMware, specifically for the disaster recovery benefits. I can say, without going into detail, that other critical systems (outside of the virtual infrastructure) did not fare as well!
Thanks VMware! Thanks SRM! Mission accomplished...