Failures happen. When they do, getting things back to normal is usually stressful. What if you could avoid the stress of a complete failure of all your servers and storage, and recover quickly as a bonus?
One of the benefits of going down the hyperconverged route is that your infrastructure becomes simplified. Complexity is the enemy of uptime – so keeping things simple will win every time when uptime matters.
Here is the test environment:
A Nutanix NX-3450 block (essentially 4 ESXi hosts in a 2U appliance).
Nutanix Software AOS 4.5.1
ESXi 6.0 on the 4 hosts
vCenter 6.0 (vCSA) hosted on one of these hosts, with HA and DRS on.
This is a fully contained vSphere environment in a hyperconverged 2U package.
Now I’m going to unexpectedly kill this 4-node cluster.
That is, a complete and simultaneous power loss, caused by cutting the power to both power supplies in the Nutanix block.
All the Guest VMs get killed of course, including vCenter and including the Nutanix Controller VMs (CVMs).
How difficult will it be to get everything back up and running?
In the video below I hard power off the cluster, then power it back on, manually start a timer, and watch what happens and when.
(Keep reading below if you don’t want to watch the whole thing)
So what happened?
The video is posted above, but for those who don’t want to sit through it, here’s the summary timeline. Note that the times below are taken from the laptop timer shown in the video:
0m:00s – Timer is started once power is switched back on.
1m:11s – My VMware VIclient connection times out due to no response from vCenter.
2m:40s – First ESXi host responds to ping.
4m:18s – Disk lights start to respond, indicating that the Controller VMs are booting and the disks are passed through to them.
4m:49s – First CVM responds to ping.
6m:29s – Although not seen on the screen, the NFS datastore was auto-restored to each ESXi host (allowing the Guest VMs to start, including vCenter). This means that the Nutanix cluster services have started.
9m:23s – vCenter, hosted on this failed cluster, has restarted and starts to respond to ping.
11m:50s – First successful login attempt to vCenter (hurry up vCenter! :)
12m:03s – vCenter login successful and things look good – all guest VMs have restarted.
…so it took just over 12 minutes for the infrastructure to recover, without any human intervention at all. All the guest VMs are powered back on and running. Sure, they will be going through their crash-recovery procedures and you will probably need to fix up some applications, but the infrastructure is up and stable (which is of course a prerequisite before you can even start to troubleshoot the applications).
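If you want to see where those 12 minutes went, the gaps between milestones are easy to work out. Here is a minimal Python sketch that does the arithmetic. The timestamps are copied from the timeline above; the shortened labels and the function names (to_seconds, phase_durations) are my own for illustration and are not part of any Nutanix or VMware tooling.

```python
# Milestones copied from the timeline above; the labels are shortened paraphrases.
TIMELINE = [
    ("0m:00s", "power switched back on"),
    ("1m:11s", "VI client connection times out"),
    ("2m:40s", "first ESXi host responds to ping"),
    ("4m:18s", "disk lights start to respond"),
    ("4m:49s", "first CVM responds to ping"),
    ("6m:29s", "NFS datastore restored to each host"),
    ("9m:23s", "vCenter responds to ping"),
    ("11m:50s", "vCenter login attempt"),
    ("12m:03s", "vCenter login successful"),
]


def to_seconds(stamp):
    """Convert a stamp such as '12m:03s' into a number of seconds."""
    minutes, seconds = stamp.rstrip("s").split("m:")
    return int(minutes) * 60 + int(seconds)


def phase_durations(timeline):
    """Return (label, seconds since the previous milestone) for each milestone."""
    secs = [to_seconds(stamp) for stamp, _ in timeline]
    return [
        (timeline[i][1], secs[i] - secs[i - 1])
        for i in range(1, len(timeline))
    ]


if __name__ == "__main__":
    for label, gap in phase_durations(TIMELINE):
        print(f"+{gap // 60}m{gap % 60:02d}s  {label}")
```

Running it prints how long each step took after the previous one, which makes it easy to see which stage of the boot accounts for most of the wait.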
OK, so what?
You’ve just witnessed the power of treating your storage like any other application or VM in your environment.
You can see the Nutanix Controller VMs sitting next to the Windows Guest VMs on the vCenter screen. By remaining independent of the hypervisor and of vCenter, your software-defined storage can self-recover from a complete power failure without any human intervention or reliance on other technologies. Your workloads can then start as you’d expect (or want!) them to in a failure scenario.
There is no need for external “witnesses”, no mucking around, no hours/days of downtime while you speak to the vendor to help you recover…just get the job done and production back to normal ASAP.
The Nutanix software does the hard work for you. Software is where it’s at in 2016!
By contrast, what if you had “traditional” separate servers and storage? How long would it take to recover from a complete power outage of both? What about a failure of the SAN only? If you were to do the same with your SAN (pull the power hard), what would happen? How long would it take you to recover it? Could you get away with not touching it at all, and expect it to recover with the VM workloads intact as well?
Remember, no one who pays the bills cares – they just want the applications UP.
In fact, this is how I normally move my Nutanix block around – I usually just pull the power. It saves me some time :)
Why do this test? (or “Whatever…my DC is designed to prevent power loss!”)
Things go wrong. Mains power can die. UPSes, batteries and generators can die. I’ve seen cases where someone had not refilled their backup diesel generator, and didn’t know until they needed it. I’ve seen a UPS firmware update kill power to a whole row of racks in the middle of the day. I’m sure many of you reading this are getting flashbacks from similar scenarios. It isn’t fun when it happens. I know of Nutanix customers who have had similar power issues, and what I demo here is consistent with their experience.
I keep saying it, but ALWAYS test failure scenarios, especially in this new world of SDDC and hyperconvergence. You will find that not all hyperconverged players are created equal :)
Sure, you could perhaps auto-cutover to DR in this situation (depending on the length of outage) but isn’t it reassuring that you don’t have to worry about the infrastructure if you lost a whole cluster due to some unforeseen event?
How does Nutanix achieve this?
On a Nutanix cluster, writes always go to persistent storage, and are verified and checksummed before the write is acknowledged back to the Guest VM making the request. Therefore, your data is always consistent from the Guest VM’s perspective. If the Guest VM thinks that a write has occurred, you can be assured that there are at least 2 copies of that write in the Nutanix cluster, across nodes (the local node and at least one remote node).
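To make the idea concrete, here is a toy Python model of "acknowledge only after two verified copies exist". This is purely my own illustration and not how the Nutanix software is implemented: the Node class, the replicated_write function, the dictionary standing in for a disk, and the choice of SHA-256 as the checksum are all assumptions made for the sketch.

```python
import hashlib


class Node:
    """Hypothetical storage node; a dict stands in for persistent storage."""

    def __init__(self, name):
        self.name = name
        self.disk = {}

    def persist(self, key, data, checksum):
        """Store the data, then read it back and verify it against the checksum."""
        self.disk[key] = data
        return hashlib.sha256(self.disk[key]).hexdigest() == checksum


def replicated_write(local, remotes, key, data, copies=2):
    """Return the names of the nodes holding verified copies.

    The write is only acknowledged (the function only returns) once
    `copies` nodes have persisted and verified the data. The local node
    is tried first, then the remote nodes in order.
    """
    checksum = hashlib.sha256(data).hexdigest()
    confirmed = []
    for node in [local] + list(remotes):
        if node.persist(key, data, checksum):
            confirmed.append(node.name)
        if len(confirmed) == copies:
            return confirmed  # safe to acknowledge back to the Guest VM
    # Not enough verified copies: the Guest VM never sees an acknowledgement.
    raise RuntimeError("write not acknowledged: too few verified copies")


if __name__ == "__main__":
    a, b, c = Node("A"), Node("B"), Node("C")
    print(replicated_write(a, [b, c], "block-1", b"guest data"))
```

The point of the model is the ordering: the acknowledgement comes last, so anything the Guest VM believes it has written already exists on the local node and on one remote node when the power goes out.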
You have a completely self-contained distributed file system that is designed from the ground up to handle failures and self-heal.
Note that Nutanix can suffer a complete power loss of a block and your VMs can start on other blocks (normal HA) if you have a minimum of 3 blocks in your cluster. This is called Availability Domains (formerly ‘Block Awareness’) and it is inbuilt – you don’t have to configure anything. What this does is ensure data replicas are placed on different blocks from the source block. Cool.
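The placement rule described above (put the replica on a different block from the source) can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the place_replica function, the node and block names, and the "pick the first candidate alphabetically" choice are hypothetical, and the real placement logic will weigh far more than this.

```python
def place_replica(source_node, node_to_block):
    """Pick a node for the replica that sits in a different block from the source.

    node_to_block maps a node name to the name of the block (chassis) it lives in.
    """
    source_block = node_to_block[source_node]
    # Only nodes outside the source block are acceptable replica targets.
    candidates = sorted(
        node for node, block in node_to_block.items() if block != source_block
    )
    if not candidates:
        raise ValueError("no node available outside the source block")
    # Hypothetical tie-break: take the first candidate alphabetically.
    return candidates[0]


if __name__ == "__main__":
    layout = {
        "node-1": "block-A", "node-2": "block-A",
        "node-3": "block-B", "node-4": "block-B",
        "node-5": "block-C", "node-6": "block-C",
    }
    print(place_replica("node-1", layout))
```

With the replica guaranteed to live in another block, losing the whole source block still leaves a copy of the data available to the surviving nodes.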
Nutanix is simplifying the datacenter footprint, and more uptime is the result.
I often travel to remote locations and this demo has always had a positive response. Hopefully, I have shown you one of the many aspects in which Nutanix is unique in the hyperconverged and SDDC space.