Nutanix Power Loss and Self-Recovery Demo

Failures. They happen. When they do, it is usually stressful to get things back to normal. What if the stress of a complete failure of all your servers and storage could be avoided, with the added bonus of fast recovery?

One of the benefits of going down the hyperconverged route is that your infrastructure becomes simplified. Complexity is the enemy of uptime – so keeping things simple will win every time when uptime matters.

The Setup:

A Nutanix NX-3450 block (essentially 4 ESXi hosts in a 2U appliance).
Nutanix Software AOS 4.5.1
ESXi 6.0 on the 4 hosts
vCenter 6.0 (vCSA) hosted on one of these hosts, with HA and DRS on.

This is a fully contained vSphere environment in a hyperconverged 2U package.

The Test:

Now I’m going to unexpectedly kill this 4-node cluster.

Hard.

That is, a complete and simultaneous power loss, achieved by killing the power to both power supplies in the Nutanix block.

All the Guest VMs get killed of course, including vCenter and the Nutanix Controller VMs (CVMs).

How difficult will it be to get everything back up and running?

In the video below I power off the cluster hard, then power it back on, manually start a timer, and watch what happens and when.

(Keep reading below if you don’t want to watch the whole thing)

So what happened?

The video is posted above, but for those who don’t want to sit through it, here’s the summary timeline. Note that the times below are from the laptop timer shown in the video:

0m:00s – Timer is started once power is switched back on.

1m:11s – My VMware VI Client connection times out due to no response from vCenter.

2m:40s – First ESXi host responds to ping.

4m:18s – Disk lights start to respond, indicating that the Controller VMs are booting and the disks are passed through to them.

4m:49s – First CVM responds to ping.

6m:29s – Although not seen on the screen, the NFS datastore was auto-restored to each ESXi host (allowing the Guest VMs to start, including vCenter). This means the Nutanix cluster services have started.

9m:23s – vCenter, hosted on this failed cluster, has restarted and starts to respond to ping.

11m:50s – First successful login attempt to vCenter (hurry up vCenter! :)

12m:03s – vCenter login successful and things look good – all guest VMs have restarted.

…so it took just over 12 minutes for the infrastructure to recover… without any human intervention at all. All the guest VMs are powered back on and running. Sure, they will be going through their crash-recovery procedures and you will probably need to fix up some applications, but the infrastructure is up and stable (which is of course a prerequisite before you can even start to troubleshoot the applications).

OK, so what?

You’ve just witnessed the power of treating your storage like any other application or VM in your environment.

You can see the Nutanix Controller VMs sitting next to the Windows Guest VMs on the vCenter screen. Because it remains independent of the hypervisor and of vCenter, your software-defined storage can self-recover from a complete power failure without any human intervention and with no reliance on other technologies, and your workloads can then start as you’d expect (or want!) them to in a failure scenario.

There is no need for external “witnesses”, no mucking around, no hours or days of downtime while you wait on the vendor to help you recover… just get the job done and get production back to normal ASAP.

The Nutanix software does the hard work for you. Software is where it’s at in 2016!

By contrast, if you had “traditional” separate servers and storage, how long would it take to recover from a complete power outage of both? What about a failure of the SAN only? If you did the same with your SAN (pulled the power hard), what would happen? How long would it take you to recover it? Could you get away with not touching it at all and expect it to recover, with the VM workloads fine as well?

Remember, no one who pays the bills cares – they just want the applications UP.

In fact, this is how I normally move my Nutanix block around – I usually just pull the power. It saves me some time :)

Why do this test? (or “Whatever…my DC is designed to prevent power loss!”)

Things go wrong. Mains power can fail. UPSes, batteries and generators can die. I’ve seen cases where someone did not refill their backup diesel generator – and didn’t know until they needed it. I’ve seen a UPS firmware update kill power to a whole row of racks in the middle of the day. I’m sure many of you reading this are getting flashbacks from similar scenarios. It isn’t fun when it happens. I know of Nutanix customers who have had similar power issues, and what I demo here is consistent with their experience.

I keep saying it, but ALWAYS test failure scenarios, especially in this new world of SDDC and hyperconvergence. You will find that not all hyperconverged players are created equal :)

Sure, you could perhaps auto-cutover to DR in this situation (depending on the length of outage) but isn’t it reassuring that you don’t have to worry about the infrastructure if you lost a whole cluster due to some unforeseen event?

How does Nutanix achieve this?

On a Nutanix cluster, writes always go to persistent storage and are checksummed and verified before the write is acknowledged back to the Guest VM that made the request. Therefore, your data is always consistent from the Guest VM’s perspective. If the Guest VM thinks a write has occurred, you can be assured that there are at least 2 copies of that write in the Nutanix cluster, spread across the nodes (the local node and at least one remote node).
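To make that concrete, here is a minimal Python sketch of the “acknowledge only once the replicas are durable” idea described above. The Node object and its persist/verify methods are my own illustrative assumptions, not Nutanix’s actual implementation:

import hashlib

RF = 2  # replication factor: one local copy plus at least one remote copy

def write_extent(data: bytes, local_node, remote_nodes) -> bool:
    """Acknowledge a guest VM write only after RF copies are on persistent storage."""
    checksum = hashlib.sha1(data).hexdigest()
    # Commit locally first, then to at least one other node in the cluster.
    targets = [local_node] + list(remote_nodes)[:RF - 1]
    durable = 0
    for node in targets:
        # persist() and verify() are hypothetical: write to persistent media,
        # then confirm the stored checksum matches what was sent.
        if node.persist(data, checksum) and node.verify(checksum):
            durable += 1
    # Only now does the write get acknowledged back to the guest VM.
    return durable >= RF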

You have a completely self-contained distributed file system that is designed from the ground up to handle failures and self-heal.

Other points:

Note that Nutanix can suffer a complete power loss of a block and your VMs can start on other blocks (normal HA), provided you have a minimum of 3 blocks in your cluster. This is called Availability Domains (formerly ‘Block Awareness’) and it is inbuilt – you don’t have to configure anything. What it does is ensure data replicas are placed on blocks other than the source block, as sketched below. Cool.
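Here is a rough Python illustration of what that placement rule means (the node/block data model is made up for the example; it is not the real placement logic):

import random

def pick_replica_node(source_node, cluster_nodes):
    """Block-aware placement: prefer a target in a different block (chassis) than
    the source, so losing an entire block still leaves a replica elsewhere."""
    other_blocks = [n for n in cluster_nodes if n["block"] != source_node["block"]]
    candidates = other_blocks or [n for n in cluster_nodes if n is not source_node]
    return random.choice(candidates)

# Example: a 3-block cluster with 4 nodes per block. A write landing on a node
# in block "A" gets its replica placed on block "B" or "C", never back on "A".
nodes = [{"name": f"{blk}{i}", "block": blk} for blk in "ABC" for i in range(1, 5)]
print(pick_replica_node(nodes[0], nodes)["block"])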

Nutanix is simplifying the datacenter footprint, and more uptime is the result.

I often travel to remote locations and this demo has always had a positive response. Hopefully I have shown you one of the many ways in which Nutanix is unique in the hyperconverged and SDDC space.

Nutanix 4.5 Cross-Hypervisor VM Conversion

One of the reasons people stay with a particular hypervisor is that it is too hard (or too costly) to migrate to another one: all the drama of converting, testing and making sure everything is right, plus the risk of having to move back if something goes wrong.

Sure, there are separate software tools you can buy to do the conversion for you… but what if the virtualisation infrastructure itself – the thing that is actually providing your servers and storage – could do it as an in-built function? What if it could be done just by clicking a few buttons?

So in the demo video below, I take a running Windows VM on Nutanix Cluster “A” running vSphere, take a snapshot of it, send it to a second Nutanix Cluster “B” running Nutanix’s own free hypervisor (AHV), and then start the VM there. Job done. Easy.

Here’s the setup:


Basic lab setup using a flat L2 network. Production and DR deployments would use L3 networks – which is fine of course.

..and here’s the demo:

For brevity, I cut out the initial one-off processes to set up the replication. The full process is below, with a scripted sketch of the same steps after step 4 (check out the Nutanix Index for articles describing setting up replication):

1. Set up a Data Protection Remote Site ‘pair’ of clusters (so that they can replicate to each other) and test the connection.

Site A (ESXi cluster)
Site B (AHV cluster)

2. Set up a Protection Domain policy, add the VM you want to be a part of the replication policy and set a schedule.

3. On the Windows VM on ESXi at Site A that you want to snap to Site B running AHV, make sure you install the Nutanix VM Mobility drivers MSI from the my.nutanix.com support portal. (These will be included in Nutanix Guest Tools (NGT) from the Nutanix AOS 4.6 release onwards, so by installing NGT you will automatically get the VM Mobility drivers.) The Nutanix VM Mobility installer deploys the drivers that are required at the destination AHV cluster. After you prepare the source VMs, they can be exported (snapped) to the AHV cluster.

4. Run the snapshot and restore operation as per the video. That’s it!
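If you prefer scripting that one-off setup rather than clicking through Prism, something along these lines should be possible against the Prism REST API. Treat it purely as a sketch: the endpoint paths, payload fields, names and IPs below are my assumptions for illustration – check the REST API Explorer on your own cluster for the real schema.

# Sketch only: paths and payload fields below are assumptions, not verified
# against a live cluster. Use the Prism REST API Explorer for the real schema.
import requests

PRISM_A = "https://prism-site-a.example:9440/PrismGateway/services/rest/v2.0"

s = requests.Session()
s.auth = ("admin", "secret")  # placeholder credentials
s.verify = False              # lab only; use proper certificates in production

# 1. Register Site B (the AHV cluster) as a remote site on Site A.
s.post(f"{PRISM_A}/remote_sites", json={
    "name": "SiteB-AHV",
    "remote_ip_ports": {"10.0.0.50": 2020},   # assumed field name and port
})

# 2. Create a Protection Domain, protect the VM, and add a replication schedule.
s.post(f"{PRISM_A}/protection_domains", json={"value": "Win-Demo-PD"})
s.post(f"{PRISM_A}/protection_domains/Win-Demo-PD/protect_vms",
       json={"names": ["Win2012-Demo"]})
s.post(f"{PRISM_A}/protection_domains/Win-Demo-PD/schedules", json={
    "type": "HOURLY",
    "every_nth": 1,
    "remote_site_names": ["SiteB-AHV"],       # replicate snapshots to Site B
})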


Almost as easy as clicking this button

A few points to note:

In the video I am just taking a crash-consistent snapshot; if you want a clean snap, shut down the source VM first, then snap, then restore. Live app-consistent snapshots will be coming in 4.6 for ESXi and AHV.

Obviously, if your VMs have static IPs, or to avoid computer-name conflicts, you should take care of these before joining the newly created AHV VM to the network. When you restore the VM on AHV, there is no virtual NIC connected by default (so the risk is minimal if you just want to test). If you want it to connect to the network, attach a NIC to the restored VM via Prism on the AHV cluster (go to the VM page).

Only 64-bit guest operating systems are supported at the time of writing (Nutanix AOS 4.5).

For the Windows 7 and Windows 2008 R2 operating systems, you have to install the SHA-2 code-signing support patch before running the Nutanix VM Mobility installer. For more information, see https://technet.microsoft.com/en-us/library/security/3033929 (a scripted check is sketched after these notes).

More info can be found in the Nutanix Prism Web Console Guide under the “Nutanix VM Mobility for Windows” section – which can be found on the Nutanix Support Portal.
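On the SHA-2 prerequisite above: if you want to script the pre-check inside the guest, a quick (hypothetical) way is to look for hotfix KB3033929 in the installed-updates list:

# Pre-check sketch: is the SHA-2 code-signing patch (KB3033929) installed on
# this Windows 7 / Server 2008 R2 guest? Run inside the guest OS.
import subprocess

def has_sha2_patch() -> bool:
    output = subprocess.run(["wmic", "qfe", "get", "HotFixID"],
                            capture_output=True, text=True).stdout
    return "KB3033929" in output

if __name__ == "__main__":
    print("SHA-2 patch present:", has_sha2_patch())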

Use cases:

A lot of people are trying AHV for the first time, and larger customers usually have a test/dev set of Nutanix nodes. This method would be perfect for snapping production VMs over to AHV for testing and verifying that all is OK.

Also, I can see a use case where DR clusters could now use the in-built AHV on Nutanix clusters and save some licensing dollars.

It would also be possible to use Nutanix Community Edition as the AHV target – in case you have some spare hardware and want to try this out without needing a full set of Nutanix nodes.

Future software plans:

In a few weeks (early 2016), Nutanix will release AOS 4.6. With it, two-way VM conversion (ESXi<->AHV in either direction) should be included. In a future release AOS is expected to add support for Hyper-V, delta disks, and volume groups.

Yes, Nutanix will enable the ability to leave AHV and migrate your VMs *back* to ESXi (for example) should you choose. Put simply, the onus is on Nutanix to keep innovating to maintain your loyalty, rather than any technical or license ‘lock-in’. At the end of the day your workloads are just virtual machines – you should be free to move them wherever you see fit (even away from Nutanix if you choose).

There will be lots of improvement and extra features coming in future releases of course, which you will get by simply doing a standard Nutanix non-disruptive upgrade.

Conclusion: 

In essence, you can see why going hyperconverged makes doing things like this almost trivial compared to trying to do the same in a traditional 3-tier infrastructure (separate server and storage layers). As the Nutanix software improves, your life gets easier each time. With each Nutanix release, more features like this will continue to be added and improved. Being 100% in-software is going to be a necessity in the next decade and beyond.

Thanks to @danmoz for letting me borrow his Dell XC cluster… and I treated it badly too (eg. multiple times I hard powered it off with no care – and it self-recovered every time).


Hypervisor lock-in is sooo 2007 :)

On Nutanix Resiliency – The Controller VM


Spot on! Invisible Infra means not babysitting old DC constructs, instead deliver *business* advantages. Source: https://twitter.com/vjswami/status/562298721942401026

If you make the claim that your DC infrastructure should be invisible, clearly you need to have a solid story around handling failures. Coping with failure scenarios is critical in any infrastructure. I get asked often on how Nutanix clusters cope in situations where things go wrong unexpectedly – and rightly so.

Nutanix has been designed to expect failures of hardware, the hypervisor, or Nutanix’s own NOS software. I don’t care how long it takes to deploy a Nutanix cluster (5 minutes or 60 minutes, depending on whether you want to change hypervisor, etc.) – to me that is a one-off occurrence and fairly uninteresting (although a lot faster than the old way of deploying servers and a SAN). What really matters is whether you are going to be called in at 2am or on weekends when things die – or even worse, whether something goes wrong at 9am on a weekday.

In my personal view, uptime trumps all else in this modern 24×7 world. Gone are the days where ‘outage windows’ are acceptable. Of course, every deployment is unique and depends on the requirements and circumstances.

I usually demo HA events (eg. motherboard failure, NIC failure, etc) but none of that is particularly exciting for virtualisation administrators – failures of this nature are *expected* to be handled OK in 2015. Nutanix is no different here, as all of that is handled by the hypervisor.

One demo that really gets people excited is what I want to talk about here.

I’m going to assume you know a little about Nutanix terminology and architecture; if you don’t, check out The Nutanix Bible by Steven Poitras: http://stevenpoitras.com/the-nutanix-bible/

To set the scene, we have a fully working NX-3450 block (4 nodes of the 3050 series) – a 2U appliance. These four nodes are connected to a standard 10GbE switch. The Nutanix architecture is such that there is a Controller VM (CVM) on each and *every* node in the cluster (an NX-3450 therefore has 4 CVMs). This setup gives approx. 7.5TB usable (after replication) but before dedupe/compression/EC.

With Nutanix, the storage fabric and the compute fabric are independent of each other. For example, you can upgrade one without affecting the other. Therefore the failure domain is limited to one or the other.

Let’s focus on a single CVM on a node that is also hosting your guest VMs, and assume the hypervisor itself is healthy. What if someone powers off the CVM hard by mistake, or deletes the CVM, or the CVM kernel panics and simply “stops” in the middle of normal guest VM I/O operations? What are the effects? What will your users notice? Will all hell break loose? Let’s check out this worst-case scenario up-front, where the Nutanix “Data Path Redundancy” feature kicks into action.

I’ll demo this via this short video:

So what happened here? The CVM was killed while live guest VMs were reading and writing data. This is NOT a usual scenario – but it is a great example of the robustness of the Nutanix distributed fabric. What we see is the hypervisor trying to write to the NFS datastore. When the CVM dies, the hypervisor on that node can no longer communicate with the NFS datastore, so it retries again and again, as per the normal NFS timeout values. But the CVMs on the other nodes were already taking steps to fix the situation – before the hypervisor knew what was going on! After 30 seconds, the hypervisor on the affected node is told by one of the other CVMs to redirect I/O to one of the remaining healthy CVMs, and things carry on as normal. This is well within the hypervisor’s (in this case ESXi 5.5) default NFS timeout period before it marks the NFS datastore as unavailable (125 seconds is mentioned here: http://cormachogan.com/2012/11/27/nfs-best-practices-part-2-advanced-settings/). Cool.
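To picture the timing, here is a toy Python model of that redirect window. Only the 30-second redirect observed in the demo and the ~125-second ESXi NFS figure come from above; the CVM objects and everything else are illustrative:

import time

NFS_TIMEOUT_S = 125    # roughly when ESXi 5.5 marks the NFS datastore unavailable
REDIRECT_AFTER_S = 30  # window observed in the demo before I/O is redirected

def issue_io(op, local_cvm, all_cvms):
    """Toy model of Data Path Redundancy: the host keeps retrying its local storage
    path and is redirected to a healthy CVM well before the NFS timeout expires."""
    start = time.monotonic()
    target = local_cvm
    while time.monotonic() - start < NFS_TIMEOUT_S:
        if target.healthy:
            return target.serve(op)          # I/O resumes; no HA event needed
        if time.monotonic() - start >= REDIRECT_AFTER_S:
            # A surviving CVM tells the host to point its I/O at a healthy peer.
            target = next(cvm for cvm in all_cvms if cvm.healthy)
        else:
            time.sleep(1)                    # the hypervisor retries the same NFS path
    raise TimeoutError("datastore marked unavailable")  # never reached in the demo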

So what does all that mean to you and the end users of the guest VMs when a CVM is killed?

No data loss.
No loss in guest VM network connectivity.
No guest VM BSOD.
No hypervisor PSOD.
No vMotion / No HA event required. 

….just a temporary 30-second ‘pause’ in I/O, because the NFS datastore was unresponsive for 30 seconds but recovered before the hypervisor’s own timeout expired – so the guests keep going from where they last tried to write data. Job done!

Obviously, the same would be true if your Nutanix cluster were instead running Hyper-V (SMB) or Acropolis Hypervisor (iSCSI). Remember, to Nutanix the hypervisor itself sits ‘above’ the distributed storage fabric (deliberately designed that way). If it didn’t, a full outage of the VMs and an HA event would have been required.


Abstracting the physical disks away from the hypervisor means the Nutanix CVMs can present the protocol of choice to your hypervisor of choice. Treat storage just like an application VM… if virtualisation is good enough for production Oracle and SQL, then it is good enough for storage too (and it is :)

If you still are a bit hazy on how Data Path Redundancy works, check out this nu.school video: https://www.youtube.com/watch?v=9cigloapOXw

Please keep in mind that these I/O ‘pause’ effects would not be seen in a normal scenario such as a rolling one-click upgrade of Nutanix NOS software versions, where there would be no interruption of I/O at all (because the affected CVM can proactively redirect its hypervisor’s I/O to another CVM in preparation for a controlled reboot). The same is true for the 1-click hypervisor upgrades, because the guest VMs are vMotioned away before the CVM shuts down (since the hypervisor itself is restarted).

This is why people can expand and upgrade their production Nutanix clusters in the middle of the day.

Very cool stuff. I remember wasting a Christmas Day in 2008 upgrading firmware on an old IBM blade chassis. I wish Nutanix had existed then…. I could have done the same on Nutanix nodes over Christmas lunch…from home!

BTW, don’t worry if you have more nodes (than the 4 shown in the video) or want more protection than simply handling a single CVM failure. If you have enough nodes or appliances you can lose entire blocks or racks of Nutanix and keep working (ie. lose more than one CVM simultaneously). Check out the Nutanix Bible for ‘Availability Domains’ and RF3 for examples and check the minimum requirements for these situations. Try that with a dual-controller SAN. 

I hope this post has shown you one of the most powerful aspects of building your virtualisation infrastructure on Nutanix. In a lot of traditional environments, losing 6 disks at once would mean you’d be having a very bad day, maybe even having to invoke a DR strategy or restore from backups, perhaps with a lot of manual steps too. Who needs that drama in 2015? Go invisible and then go home and crack open a beer! Let the software do all the work for you, and with your free time you could learn some new skills instead of babysitting infrastructure. Win/win!

Thanks to Josh Odgers (@josh_odgers), Matt Northam (@twickersmatt) and Matt Day (@idiomatically … a champion Nutanix customer!) for reviewing this post.