Nutanix 4.5 Cross-Hypervisor VM Conversion

One of the reasons people stay with a particular hypervisor is that it is too hard (or too costly) to migrate to another. All that drama of converting, testing and making sure everything is right – and then the risk of having to move back if something goes wrong.

Sure, there are separate software tools you can buy to do the conversion for you... but what if the virtualisation infrastructure itself – the thing that is actually providing your servers and storage – could do it as an in-built function? What if it could be done just by clicking a few buttons?

So in the demo video below, I take a running Windows VM on Nutanix Cluster “A” (running vSphere), snapshot it, send the snapshot to a second Nutanix Cluster “B” (running Nutanix’s own free hypervisor, AHV), and then start the VM there. Job done. Easy.

Here’s the setup:


Basic lab setup using a flat L2 network. Production and DR deployments would typically use L3 networks – which is fine of course.

…and here’s the demo:

For brevity, I cut out the initial one-off processes to set up the Replication. The full process is below (check out the Nutanix Index for articles describing setting up Replication):

1. Set up a Data Protection Remote Site ‘pair’ of clusters (so that they can replicate to each other) and test the connection.

Site A (ESXi cluster)
Site B (AHV cluster)

2. Set up a Protection Domain policy, add the VM you want to be a part of the replication policy and set a schedule.

3. On the Windows VM (on ESXi at Site A) that you want to snap to the AHV cluster at Site B, make sure you install the Nutanix VM Mobility drivers MSI from the my.nutanix.com support portal. (These will be included in Nutanix Guest Tools (NGT) from the AOS 4.6 release onwards, so by installing NGT you will automatically get the VM Mobility drivers.) The Nutanix VM Mobility installer deploys the drivers that are required at the destination AHV cluster. After you prepare the source VMs, they can be exported (snapped) to the AHV cluster.

4. Run the snapshot and restore operation as per the video. That’s it!


Almost as easy as clicking this button

A few points to note:

In the video I am just taking a crash-consistent snapshot; if you want a clean snap, shut down the source VM first, then snap, then restore. Live application-consistent snapshots are coming in AOS 4.6 for ESXi and AHV.

Obviously, if your VMs have static IPs, or to avoid computer-naming conflicts, take care of these before joining the newly created AHV VM to the network. When you restore the VM on AHV, there is no virtual NIC connected by default (so the risk is minimal if you just want to test). If you want it connected to the network, attach a NIC to the restored VM via Prism on the AHV cluster (go to the VM page).

Only 64-bit guest operating systems are supported at the time of writing (Nutanix AOS 4.5).

For Windows 7 and Windows Server 2008 R2 operating systems, you have to install the SHA-2 code-signing support patch before running the Nutanix VM Mobility installer. For more information, see https://technet.microsoft.com/en-us/library/security/3033929.

More info can be found in the Nutanix Prism Web Console Guide under the “Nutanix VM Mobility for Windows” section, available on the Nutanix Support Portal.

Use cases:

A lot of people are trying AHV for the first time, and larger customers usually have a test/dev set of Nutanix nodes. This method would be perfect for snapping production VMs over to AHV for testing and verifying that all is OK.

Also, I can see a use case where DR clusters could now use the in-built AHV on Nutanix clusters and save some licensing dollars.

It would also be possible to use Nutanix Community Edition as the AHV target – in case you had some spare hardware and wanted to just try this out without the need for a full Nutanix set of nodes.

Future software plans:

In a few weeks (early 2016), Nutanix will release AOS 4.6, which should include two-way VM conversion (ESXi <-> AHV). A future AOS release is expected to add support for Hyper-V, delta disks, and volume groups.

Yes, Nutanix will enable the ability to leave AHV and migrate your VMs *back* to ESXi (for example) should you choose. Put simply, the onus is on Nutanix to keep innovating to maintain your loyalty, rather than any technical or license ‘lock-in’. At the end of the day your workloads are just virtual machines – you should be free to move them wherever you see fit (even away from Nutanix if you choose).

There will be lots of improvements and extra features coming in future releases of course, which you will get simply by doing a standard Nutanix non-disruptive upgrade.

Conclusion: 

In essence, you can see why going hyper-converged makes doing things like this almost trivial compared to trying to do the same in a traditional 3-tier infrastructure (separate server and storage layers). As the Nutanix software improves, your life gets easier each time. With each Nutanix release, more features like this will continue to be added and improved. Being 100% in-software is going to be a necessity in the next decade and beyond.

Thanks to @danmoz for letting me borrow his Dell XC cluster… and I treated it badly too (eg. multiple times I hard powered it off with no care – and it self-recovered every time).


Hypervisor lock-in is sooo 2007 :)

On Nutanix Resiliency – The Controller VM


Spot on! Invisible Infra means not babysitting old DC constructs, instead deliver *business* advantages. Source: https://twitter.com/vjswami/status/562298721942401026

If you make the claim that your DC infrastructure should be invisible, clearly you need a solid story around handling failures. Coping with failure scenarios is critical in any infrastructure. I often get asked how Nutanix clusters cope when things go wrong unexpectedly – and rightly so.

Nutanix has been designed to expect failures of hardware, the hypervisor or Nutanix’s own NOS software. I don’t care how long it takes to deploy a Nutanix cluster (5 mins or 60 mins depending on whether you want to change hypervisor etc.) – to me that is a one-off occurrence and fairly uninteresting (although a lot faster than the old way of deploying servers and a SAN). What really matters is whether or not you are going to be called in at 2am or on weekends when things die – or even worse, if something goes wrong at 9am on a weekday.

In my personal view, uptime trumps all else in this modern 24×7 world. Gone are the days where ‘outage windows’ are acceptable. Of course, every deployment is unique and depends on the requirements and circumstances.

I usually demo HA events (eg. motherboard failure, nic failure etc) but none of that is particularly exciting for virtualisation administrators – failures of this nature are *expected* to be handled OK in 2015. Nutanix is no different here as that is all handled by the hypervisor.

One demo that really gets people excited is what I want to talk about here.

I’m going to assume you know a little about the Nutanix terminology and architecture; if you don’t, check out The Nutanix Bible by Steven Poitras: http://stevenpoitras.com/the-nutanix-bible/

To set the scene, we have a fully working NX-3450 block (4 nodes of the 3050 series) – a 2U appliance. These four nodes are connected to a standard 10GbE switch. The Nutanix architecture is such that there is a Controller VM (CVM) on each and *every* node in the cluster (so an NX-3450 has 4 CVMs). This setup gives approx 7.5TB usable (after replication) but before dedupe/compression/EC.

With Nutanix, the storage fabric and the compute fabric are independent of each other. For example, you can upgrade one without affecting the other. Therefore the failure domain is limited to one or the other.

Let’s focus on a single CVM on a node that is also hosting your guest VMs. Let’s assume that the hypervisor is healthy and OK. What if someone powers off the CVM hard by mistake, or deletes the CVM, or if the CVM kernel panics and simply “stops” in the middle of normal guest VM I/O operations? What are the effects? What will your users notice? Will all hell break loose? Let’s check out this worst case scenario up-front, where the Nutanix “Data Path Redundancy” feature kicks into action.

I’ll demo this via this short video:

So what happened here? The CVM was killed with live guest VMs reading and writing data. This is NOT a usual scenario – but it is a great example of the robustness of the Nutanix distributed fabric. What we see is that the hypervisor is trying to write to the NFS datastore. When the CVM dies, the hypervisor on that node can no longer communicate with the NFS datastore and re-tries again and again. The hypervisor will continue to do so as per normal NFS timeout values. But here, the CVMs on other nodes were already taking steps to fix the situation – before the hypervisor knows what’s going on! After 30 seconds, the hypervisor on the affected node is told by one of the other CVMs to redirect I/O to one of the remaining healthy CVMs and things carry on as normal. This is well within the hypervisor’s (in this case ESXi 5.5) default NFS timeout period before it marks the NFS datastore as unavailable (125 seconds is mentioned here: http://cormachogan.com/2012/11/27/nfs-best-practices-part-2-advanced-settings/). Cool.
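To make that timing argument concrete, here is a toy sketch of the race described above. The two numbers are simply the ones mentioned in this post (the roughly 30-second redirect and the ~125-second ESXi NFS window quoted from Cormac’s article); it is an illustration only, not Nutanix or VMware code.

# Toy model of the timing race described above (illustrative numbers only)
CVM_REDIRECT_SECONDS = 30        # surviving CVMs redirect the node's NFS I/O after ~30s
ESXI_NFS_TIMEOUT_SECONDS = 125   # default window before ESXi 5.5 marks the datastore unavailable

def guest_vm_outcome(io_pause_s, hypervisor_timeout_s):
    """If storage access resumes before the hypervisor's own NFS timeout, guests only see a pause."""
    if io_pause_s < hypervisor_timeout_s:
        return "I/O pause only - guests resume where they left off, no HA event"
    return "datastore marked unavailable - HA restart of guest VMs required"

print(guest_vm_outcome(CVM_REDIRECT_SECONDS, ESXI_NFS_TIMEOUT_SECONDS))
# -> I/O pause only - guests resume where they left off, no HA event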

So what does all that mean to you and the end users of the guest VMs when a CVM is killed?

No data loss.
No loss in guest VM network connectivity.
No guest VM BSOD.
No hypervisor PSOD.
No vMotion / No HA event required. 

…just a temporary 30-second ‘pause’ in I/O while the NFS datastore was unresponsive – but it recovered before the hypervisor’s own timeout kicked in, so the guests keep going from where they last tried to write data. Job done!

Obviously the same would be true if your Nutanix cluster were instead running Hyper-V (SMB) or Acropolis Hypervisor (iSCSI). Remember, to Nutanix the hypervisor itself sits ‘above’ the distributed storage fabric (deliberately designed that way). If it didn’t, a full outage of the VMs and an HA event would have been required.


Abstracting the physical disks away from the hypervisor means the Nutanix CVMs can present the protocol of choice to your hypervisor of choice. Treat storage just like an application VM… if virtualisation is good enough for production Oracle and SQL, then it should be good enough for storage (and it is :)

If you still are a bit hazy on how Data Path Redundancy works, check out this nu.school video: https://www.youtube.com/watch?v=9cigloapOXw

Please keep in mind that these I/O ‘pause’ effects would not be seen in a normal scenario such as a rolling one-click upgrade of Nutanix NOS software versions – where there would be no interruption of I/O at all (because the affected CVM can proactively redirect its hypervisor’s I/O to another CVM in preparation for a controlled reboot). The same is true for the one-click hypervisor upgrades – because the guest VMs are vMotioned off before the CVM shuts down (since the hypervisor itself is restarted).

This is why people can expand/upgrade their production Nutanix clusters in the middle of the day:

Very cool stuff. I remember wasting a Christmas Day in 2008 upgrading firmware on an old IBM blade chassis. I wish Nutanix had existed then…. I could have done the same on Nutanix nodes over Christmas lunch…from home!

BTW, don’t worry if you have more nodes (than the 4 shown in the video) or want more protection than simply handling a single CVM failure. If you have enough nodes or appliances you can lose entire blocks or racks of Nutanix and keep working (ie. lose more than one CVM simultaneously). Check out the Nutanix Bible for ‘Availability Domains’ and RF3 for examples and check the minimum requirements for these situations. Try that with a dual-controller SAN. 

I hope this post has shown you one of the most powerful aspects of building your virtualisation infrastructure using Nutanix. In a lot of traditional environments, losing 6 disks at once would mean you’d be having a very bad day, maybe even having to invoke a DR strategy or restore from backups, perhaps with a lot of manual steps too. Who needs that drama in 2015? Go invisible and then go home and crack open a beer! Let the software do all the work for you, and with your free time you could maybe learn some new skills instead of babysitting infrastructure. Win/win!

Thanks to Josh Odgers (@josh_odgers), Matt Northam (@twickersmatt) and Matt Day (@idiomatically … a champion Nutanix customer!) for reviewing this post. 

Erasure Coding in NOS 4.1.3

Erasure Coding is new to me, so this topic has taken a bit of getting used to. I decided to reach out to Nutanix Engineering to get some of my basic questions answered – mainly because if I am going to recommend turning on a feature, I need to get my head around it first. After some discussions, I’ve put together this post – it’s a bit long, but it is important to understand the best way to implement EC.

If you are going to try Nutanix Erasure Coding (or EC-X; for simplicity I’ll drop the X), which is in tech preview with NOS 4.1.3, here are some recommendations and notes to consider. Please note that things could change with subsequent software updates – so as always, check the latest release notes and docs.

As with every ‘tech preview’ feature, it is not recommended for production environments “yet”.

For a quick intro to EC, see: http://www.joshodgers.com/2015/06/09/whats-next-erasure-coding/ and http://myvirtualcloud.net/?p=7106

Nutanix has always had data protection in-built, called Redundancy Factor (or RF). Typically, 2 copies of data are kept on a Nutanix cluster (one on the local node where the guest VM is running, and one on a remote node) – hence this is called RF2. Three copies on larger clusters (5 or more nodes) are also possible – RF3. In either case, this data protection comes at a cost, of course. In the RF2 case, for every 1GB of data you write there is a mirrored 1GB stored as well in case of some sort of failure, and you have to have the storage available to handle this.

EC is a method to try to reduce this overhead.

How EC works:

This is the normal situation in a Nutanix cluster of 6 nodes with RF2 as standard:


RF2 as per a standard Nutanix deployment – here 4 VMs on 4 nodes are writing data (and RF2 copies to other nodes for redundancy)

RF2 replicas are therefore a ‘100%’ overhead in a sense – you need a 1:1 copy of your data to protect it in case of some form of disk/node failure. Fair enough, but is there a way to get some better efficiency?

EC has been introduced to try to reduce that 100% overhead of the 2nd copy.

EC works by computing a parity function across a set of data blocks, such that if a member of the set is lost, the lost data can be recovered from the remaining members of the set plus the parity.
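As a rough illustration of that idea, here is a minimal single-parity sketch in Python. It uses plain XOR parity purely for readability – an assumption on my part, not Nutanix’s actual encoding – but the principle is the same: any one lost member can be rebuilt from the surviving members plus the parity.

# Simplified single-parity erasure coding illustration (XOR used for readability;
# this is not Nutanix's actual encoding)
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Four data members (think of the 4 data egroups in a 4/1 stripe)
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)                   # the single parity member

# Lose member 'C': rebuild it from the survivors plus the parity
survivors = [blk for i, blk in enumerate(data) if i != 2]
recovered = xor_blocks(survivors + [parity])
assert recovered == b"CCCC"                 # the lost member is recovered

In the Nutanix case, the members of a stripe are egroups sitting on different nodes, and the parity is placed on yet another node (see the facts further below).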

So let’s turn on EC. After a delay, the algorithm will act on the secondary copies of data and create a parity stripe, and then Curator will discard the original 2nd copies:


EC on Nutanix

OK – but what happens in a failure? Let’s fail a node and see what happens to the cold tier of data:


Node C loses a disk – EC recalculates the missing piece

…So we need to recover the ‘c’ piece of data:


Place the missing piece ‘c’ on another node

Standard EC requires at least 4 nodes (2/1 stripe), and 6 nodes are highly recommended (4/1 stripe), so as you can see ‘large’ clusters are preferred for EC. Beyond 6 nodes the stripe size remains at 4/1 by default (and recommended). You could change the stripe size via the CLI (to say 10/1), but you’ll get diminishing returns versus the extra read penalty (reading more data on more nodes to rebuild from the EC stripe).
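To see why wider stripes give diminishing returns, a quick back-of-the-envelope calculation helps. This assumes a simple n-data/1-parity model applied to fully cold data; real savings depend on how much of your data actually becomes EC-eligible.

# Capacity overhead per unit of data for the protection schemes mentioned above
def overhead(data_members, parity_members=1):
    """Extra space stored per unit of data, as a fraction (RF2 mirroring is 1.0, ie. 100%)."""
    return parity_members / data_members

print(f"RF2 mirror : {overhead(1):.0%}")   # 100% overhead
print(f"2/1 stripe : {overhead(2):.0%}")   # 50% overhead
print(f"4/1 stripe : {overhead(4):.0%}")   # 25% overhead
print(f"10/1 stripe: {overhead(10):.0%}")  # 10% overhead, but reads/rebuilds now span 10 nodes

Going from 4/1 to 10/1 only trims another 15 percentage points of overhead, while any rebuild has to read across more than twice as many nodes – hence 4/1 being the default.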

Small 3-node systems require GFLAG modifications via Support/Engineering – so try to avoid turning on EC for small 3-node clusters (ie. leave them as standard RF2 for all data tiers). The official line is that only 4 nodes and above are EC compatible.

Use Cases for EC: 

A good use case for EC is for workloads with the likelihood of a lot of cold tier data (eg. Snapshots, File Servers, Archive, Backups etc).

The following environments are not ideal for EC:

1. High amounts of re-writes or overwrites. A write-heavy container would not get the benefits of EC, given that the data would constantly be refreshed and so remain in RF2.

2. High numbers of small writes. The reason is that EC stripe calculation is done at the extent group – or egroup – level (4MB, with 1MB extents), so there could be overhead if a lot of small writes keep updating the extent groups.

Some Nutanix EC facts:

EC is a Container-level operation.

EC supports other Nutanix features such as dedupe/compression/DR/Metro; although Block Awareness is not supported with EC in this current NOS 4.1.3 release (I’d expect this to change with a future software update).

EC will take some time to realise savings due to the delay setting, which ensures only ‘cold’ data is a candidate for EC. See below for delay timing recommendations.

New writes will remain in the ‘hot’ tier and are still RF2/RF3 as per normal, as EC only acts on the ‘cold’ data tier.

EC is a background Curator task, so it has minimal impact on I/O.

Once parity blocks are calculated and confirmed, the original 2nd-copy RF replicas are discarded.

The parity blocks are never stored on the same nodes as the ones with the original stripe data.

If you have RF3/FT2 containers, there will be a 2nd EC parity block – to handle 2 simultaneous failures (as you’d expect).

EC Delay Timings: 

EC has a timer which records the time since a piece of data was last modified. Until that time exceeds the delay threshold, the standard RF mechanisms remain in place for that data.

Try to use a longer delay before EC kicks in (the default is 1 hour), so that only super-cold data is a candidate for EC – minimising the chance of erasure-coded data being needed by active VMs. Hot data is not affected, as it is still in a normal RF state.
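Expressed as a minimal sketch (my simplified model of the rule above, not Curator’s actual implementation):

DEFAULT_EC_DELAY = 3600          # default: data untouched for an hour becomes an EC candidate
ONE_WEEK = 604800                # the delay recommended below

def ec_candidate(seconds_since_last_write, delay=ONE_WEEK):
    """Data stays in normal RF2/RF3 until it has been cold for longer than the configured delay."""
    return seconds_since_last_write > delay

print(ec_candidate(2 * 86400))   # 2 days cold  -> False (still plain RF2 with a 1-week delay)
print(ec_candidate(10 * 86400))  # 10 days cold -> True  (now a candidate for erasure coding)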

If you configure EC with a delay of, say, 1 week, the impact of a failure scenario (or, for example, a NOS upgrade) should reduce, because data that hasn’t been modified for that long is clearly unlikely to be needed by the affected VMs.

I recommend configuring the EC delay for a week. EC should be treated as a long-term cold-data optimisation.

On a Nutanix cluster you can get an idea of the age (access/modification age) of your data by running the following command from any CVM:

curator_cli get_egroup_access_info

…to show the breakdown of the access/modification age of your data, so you can make an informed decision on the EC delay timing.

Set the EC delay to be longer than the modification age of the majority of your data – to minimise re-calculations in and out of EC stripes.


Output of the ‘curator_cli get_egroup_access_info’ command in 4.1.3

So the output in this example above shows that the value of ’16’ means 16 ‘egroups’ of data were last modified between 1 hour and 1 day ago (3600-86400 seconds), but the same 16 egroups were accessed (read only) over a week ago (604800-2592000 seconds).

Another way to look at it: 187+5501=5,688 egroups out of a total of roughly 5.7K egroups were last modified over 604800 seconds (a week) ago – so essentially all of this cluster’s data (98% or more) has not been written to within the last week. Yes, I had this cluster off for a while – it was being shipped :)
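If you want to do that arithmetic programmatically, a small helper like this works: feed it the bucket counts you read off the curator_cli output and a candidate delay. The bucket numbers below are illustrative values transcribed by hand (an assumption on my part – the exact output format can vary between NOS versions).

WEEK = 604800

def share_older_than(buckets, delay_seconds):
    """buckets: {lower bound of 'seconds since last write': egroup count} from the histogram."""
    total = sum(buckets.values())
    cold = sum(count for lower_bound, count in buckets.items() if lower_bound >= delay_seconds)
    return cold / total if total else 0.0

# eg. 16 egroups written 1h-1d ago, 187 written 1-4 weeks ago, 5501 written even longer ago
example = {3600: 16, 604800: 187, 2592000: 5501}
print(f"{share_older_than(example, WEEK):.1%} of egroups would be EC candidates with a 1-week delay")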

So, it looks like most of my data was last modified over a week ago. So let’s change the EC settings so that only data untouched for over a week is considered a candidate for EC. Therefore, cold data less than a week old remains as normal RF2 in my case.

As you can see, you can use this table to get a good feel for what EC delay setting makes most sense for your Nutanix cluster. If you are unsure of the best delay setting for your cluster, log a support call and one of the Nutanix support people will be happy to help.

From any CVM:

#set EC to one week (604800 seconds), note my container name was NFS1 
ctr edit name=NFS1 erasure-code="on" erasure-code-delay=604800

An EC delay of one week would be fine if you plan to fill up your cluster over a long time – say over a year. Try to only use EC on ‘stable’ clusters – ie. ones with no plans for node removals. A node removal needs to break the EC stripes and revert that data to standard RF2 – so container sizing still needs to account for RF2 if you plan to remove nodes later (and breaking EC and reverting to normal RF may take some time). This is a rare scenario of course (not many people remove Nutanix nodes :)

EC and reads in a failure situation:

Because of the Data Locality bias in Nutanix (which, despite what people may tell you, is absolutely necessary at scale to maintain consistent performance), Curator prefers the local copy to survive in the case of EC. That is, only the replicas on ‘other’ nodes are candidates for EC. This means the node on which your guest VMs are running still has their original data available. If a failure occurs locally (eg. a disk) and the local bits are now missing, the missing (cold) piece can be re-created by reading data from 4 places (in the case of a 4/1 stripe).

EC decodes only the needed amount on the fly and returns that to satisfy the read. There is no need to recompute parities or create new stripes – only to decode what’s not available. The stripe remains intact and the contents of all stripe members remain the same.

For writes, remember that new writes would be treated as per normal in an RF2 situation (hot/oplog etc) – hence EC isn’t involved.

Testing EC in your environment: 

When testing EC, make sure you don’t fall into the trap of not utilising the scale-out advantages that Nutanix brings. A lot of people when testing Nutanix features and performance just test things using 1 Virtual Machine – and I’ve yet to discover a production environment with only 1 VM. Scale out your tests and you’ll see the difference.

Dima from Nutanix Engineering explained it to me with regard to EC on Nutanix:

Data locality bias means that Curator prefers the local replica to survive (only one of two replicas survive Erasure Coding in RF=2). For example, to create one EC strip of 4 egroups we must have at least 4 local egroup replicas on 4 distinct CVMs. Some who experimented for the first time with EC wrote data through a single UVM. This means all primary replicas (local replicas) reside on just one CVM and no erasure coding is taking place! In real life UVMs are usually distributed around the cluster, so the local bias is a good thing to have. 

 
Hope that this post has helped explain when to use EC and some of the inner workings of the tech.
Thanks to Jerome (@getafix_nutanix), Dima and Kannan for helping me gather info for this post, and to Chad (@thechads) and Dan (@DanMoz) for reviewing.