Tag Archives: nutanix

On Nutanix Resiliency – The Controller VM

Spot on! Invisible Infra means not babysitting old DC constructs, instead deliver *business* advantages. Source: https://twitter.com/vjswami/status/562298721942401026

If you make the claim that your DC infrastructure should be invisible, clearly you need to have a solid story around handling failures. Coping with failure scenarios is critical in any infrastructure. I get asked often on how Nutanix clusters cope in situations where things go wrong unexpectedly – and rightly so.

Nutanix has been designed to expect failures of hardware, hypervisor or Nutanix’s own NOS software. I don’t care how long it takes to deploy a Nutanix cluster (5 mins or 60 mins depending if you want to change hypervisor etc) – as to me that is a one-off occurrence and fairly uninteresting (although a lot faster than the old way of deploying servers and a SAN). What really matters is whether or not you are going to be called in at 2am or on weekends when things die – or even worse if something goes wrong at 9am on a weekday.

In my personal view, uptime trumps all else in this modern 24×7 world. Gone are the days where ‘outage windows’ are acceptable. Of course, every deployment is unique and depends on the requirements and circumstances.

I usually demo HA events (eg. motherboard failure, nic failure etc) but none of that is particularly exciting for virtualisation administrators – failures of this nature are *expected* to be handled OK in 2015. Nutanix is no different here as that is all handled by the hypervisor.

One demo that really gets people excited is what I want to talk about here.

I’m going to assume you know a little about the Nutanix terminology and architecture. If you don’t, check out The Nutanix Bible by Steven Poitras: http://stevenpoitras.com/the-nutanix-bible/

To set the scene, we have a fully working NX-3450 block (4 nodes of 3050 series) – which is a 2U appliance. These four nodes will be connected to a standard 10GbE switch. The Nutanix architecture is such that there is a Controller VM (CVM) on each and *every* node in the cluster (here a NX-3450 would therefore have 4 CVMs). This setup gives approx 7.5TB usable (after replication) but before dedupe/compression/EC.

With Nutanix, the storage fabric and the compute fabric are independent of each other. For example, you can upgrade one without affecting the other. Therefore the failure domain is limited to one or the other.

Let’s focus on a single CVM on a node that is also hosting your guest VMs. Let’s assume that the hypervisor is healthy and OK. What if someone powers off the CVM hard by mistake, or deletes the CVM, or if the CVM kernel panics and simply “stops” in the middle of normal guest VM I/O operations? What are the effects? What will your users notice? Will all hell break loose? Let’s check out this worst case scenario up-front, where the Nutanix “Data Path Redundancy” feature kicks into action.

I’ll demo this via this short video:

So what happened here? The CVM was killed with live guest VMs reading and writing data. This is NOT a usual scenario – but it is a great example of the robustness of the Nutanix distributed fabric. What we see is that the hypervisor is trying to write to the NFS datastore. When the CVM dies, the hypervisor on that node can no longer communicate with the NFS datastore and re-tries again and again. The hypervisor will continue to do so as per normal NFS timeout values. But here, the CVMs on other nodes were already taking steps to fix the situation – before the hypervisor knows what’s going on! After 30 seconds, the hypervisor on the affected node is told by one of the other CVMs to redirect I/O to one of the remaining healthy CVMs and things carry on as normal. This is well within the hypervisor’s (in this case ESXi 5.5) default NFS timeout period before it marks the NFS datastore as unavailable (125 seconds is mentioned here: http://cormachogan.com/2012/11/27/nfs-best-practices-part-2-advanced-settings/). Cool.
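To put numbers on "well within", here is a minimal sketch of the timeout arithmetic. It assumes the ESXi 5.x NFS heartbeat defaults as I understand them from the article linked above (frequency 12s, 10 missed heartbeats, 5s timeout); these three values are not stated in this post, so check them on your own hosts. The 30 second pause is the figure from the demo.

```python
# Assumed ESXi 5.x NFS heartbeat defaults (not stated in this post;
# verify them against your own hosts before relying on the result).
NFS_HEARTBEAT_FREQUENCY_S = 12   # seconds between heartbeats
NFS_HEARTBEAT_MAX_FAILURES = 10  # missed heartbeats tolerated
NFS_HEARTBEAT_TIMEOUT_S = 5      # wait for the last heartbeat reply

CVM_FAILOVER_PAUSE_S = 30        # I/O pause seen in the demo


def nfs_datastore_timeout_s(frequency_s, max_failures, timeout_s):
    """Seconds before the hypervisor marks an NFS datastore unavailable."""
    return frequency_s * max_failures + timeout_s


timeout_s = nfs_datastore_timeout_s(
    NFS_HEARTBEAT_FREQUENCY_S,
    NFS_HEARTBEAT_MAX_FAILURES,
    NFS_HEARTBEAT_TIMEOUT_S,
)
headroom_s = timeout_s - CVM_FAILOVER_PAUSE_S

print(f"datastore timeout: {timeout_s}s")
print(f"failover pause:    {CVM_FAILOVER_PAUSE_S}s")
print(f"headroom:          {headroom_s}s")
```

With those assumed defaults the redirect finishes with roughly a minute and a half to spare before the hypervisor would give up on the datastore.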

So what does all that mean to you and the end users of the guest VMs when a CVM is killed?

No data loss.
No loss in guest VM network connectivity.
No guest VM BSOD.
No hypervisor PSOD.
No vMotion / No HA event required.

….just a temporary 30 second ‘pause’ in I/O because the NFS datastore became unresponsive for 30 seconds – but it recovered before the hypervisor’s own timeout function – so the guests keep going from where they last tried to write data. Job done!

Obviously the same situation would be true if your Nutanix cluster was instead running Hyper-V (SMB) or Acropolis Hypervisor (iSCSI). Remember, to Nutanix the hypervisor itself is ‘above’ the distributed storage fabric (designed deliberately so). If it wasn’t then there would have needed to be a full outage of the VMs and an HA event.

Abstracting the physical disks away from the hypervisor means the Nutanix CVMs can present the protocol of choice to your hypervisor of choice. Treat storage just like an application VM….If virtualisation is good enough for production Oracle and SQL, then so it should be for storage (and it is :)

If you still are a bit hazy on how Data Path Redundancy works, check out this nu.school video: https://www.youtube.com/watch?v=9cigloapOXw

Please keep in mind that these I/O ‘pause’ effects would not be seen in a normal scenario such as a rolling one-click upgrade of Nutanix NOS software versions – where there would be no interruption of I/O at all (due to the fact the affected CVM can pro-actively redirect its hypervisor I/O to another CVM in preparation for a controlled reboot). The same is true for the 1-click hypervisor upgrades – because the guest VMs are vMotioned anyway before CVM shutdown (because the hypervisor itself is restarted).

This is why people can expand/upgrade their production Nutanix clusters in the middle of the day:

https://twitter.com/hdex/status/593236976393826304

https://twitter.com/Kawa_Farid/status/616847465065508865

https://twitter.com/idiomatically/status/566674611547684864

Very cool stuff. I remember wasting a Christmas Day in 2008 upgrading firmware on an old IBM blade chassis. I wish Nutanix had existed then…. I could have done the same on Nutanix nodes over Christmas lunch…from home!

BTW, don’t worry if you have more nodes (than the 4 shown in the video) or want more protection than simply handling a single CVM failure. If you have enough nodes or appliances you can lose entire blocks or racks of Nutanix and keep working (ie. lose more than one CVM simultaneously). Check out the Nutanix Bible for ‘Availability Domains’ and RF3 for examples and check the minimum requirements for these situations. Try that with a dual-controller SAN.

I hope this post has shown you one of the most powerful aspects of building your virtualisation infrastructure using Nutanix. In a lot of traditional environments, losing 6 disks at once would mean you’d be having a very bad day, maybe even having to invoke a DR strategy or restore from backups, perhaps with a lot of manual steps too. Who needs that drama in 2015? Go invisible and then go home and crack open a beer! Let the software do all the work for you, and with your free time you could maybe learn some new skills instead of babysitting infrastructure. Win/win!

Thanks to Josh Odgers (@josh_odgers), Matt Northam (@twickersmatt) and Matt Day (@idiomatically … a champion Nutanix customer!) for reviewing this post.

Erasure Coding in NOS 4.1.3


Being new to Erasure Coding, this topic is taking me a bit of getting used to. So I decided to reach out to Nutanix Engineering and try and get some of my basic questions answered, mainly because if I am recommending to turn on a feature I need to get my head around it first. After some discussions, I’ve put together this post – which is a bit long – but it is important to get an understanding on the best way to implement it.

If you are going to try Nutanix Erasure Coding (or EC-X; for simplicity I’ll drop the X) which is in tech preview stage with NOS 4.1.3, here are some recommendations and notes for you to consider. Please note that things could change with subsequent software updates – so as always check with the latest release notes and docs.

As with every ‘tech preview’ feature, it is not recommended for production environments “yet”.

For a quick intro to EC, see: http://www.joshodgers.com/2015/06/09/whats-next-erasure-coding/ and http://myvirtualcloud.net/?p=7106

Nutanix has always had data protection in-built, called Redundancy Factor (or RF). Typically, 2 copies of data are kept on a Nutanix cluster (one on a local node where the guest VMs are running, and one remote) – therefore this is called RF2. Three copies on larger clusters (5 or more nodes) are also possible – so RF3. In either case, this investment in data protection comes at a cost of course. In the RF2 case, say for every 1GB of data you write, there is a mirrored 1GB stored as well in case of some sort of failure, and you have to have the storage available to handle this.

EC is a method to try to reduce this overhead.

How EC works:

This is the normal situation in a Nutanix cluster of 6 nodes with RF2 as standard:

RF2 as per a standard Nutanix deployment – here 4 VMs on 4 nodes are writing data (and RF2 copies to other nodes for redundancy)

RF2 replicas are therefore a ‘100%’ overhead in a sense – you need a 1:1 copy of your data to protect it in case of some form of disk/node failure. Fair enough, but is there a way to get some better efficiency?

EC has been introduced to try to reduce that 100% overhead of the 2nd copy.

EC works by creating a mathematical function around a data set such that if a member of the data set is lost, the lost data can be recovered easily from the rest of the members of the set.

So let’s turn on EC. After a delay, the algorithm will act on the secondary copies of data and create a parity stripe set, then Curator will discard the original 2nd copies:

EC on Nutanix

OK – but what happens in a failure? Let’s fail a node and see what happens to that cold-tier of data:

Node C loses a disk – EC recalculates the missing piece

…So we need to recover the ‘c’ piece of data:

Place the missing piece ‘c’ on another node
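If the idea of rebuilding ‘c’ from the other pieces feels like magic, here is a minimal sketch using plain XOR parity over a 4/1 stripe. This is an illustrative stand-in only: this post does not say which coding scheme Nutanix actually uses, and the block contents and helper names below are made up for the example.

```python
from functools import reduce


def xor_bytes(left: bytes, right: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(left, right))


def make_parity(blocks):
    """Parity block for a stripe: the XOR of all data blocks."""
    return reduce(xor_bytes, blocks)


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus parity."""
    return reduce(xor_bytes, surviving_blocks, parity)


# Hypothetical 4/1 stripe: four data pieces on four nodes, one parity piece.
stripe = {"a": b"AAAA", "b": b"BBBB", "c": b"CCCC", "d": b"DDDD"}
parity = make_parity(list(stripe.values()))

# Node C loses its piece: read the three surviving pieces plus the parity.
survivors = [data for name, data in stripe.items() if name != "c"]
rebuilt_c = recover(survivors, parity)

print(rebuilt_c == stripe["c"])
```

Note that the rebuild reads from 4 places (3 surviving data pieces plus the parity), which is the read penalty mentioned for a 4/1 stripe.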

Standard EC requires at least 4 nodes (2/1 stripe), and 6 nodes are highly recommended (4/1 stripe), so as you can see ‘large’ clusters are preferred for EC. After 6 nodes the stripe algorithm remains at 4/1 by default (and recommended). You could change the stripe sizes via CLI (to say 10/1) but you’ll get diminishing returns versus the extra read penalty (reading more data on more nodes to form the EC stripe).
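The diminishing returns are easy to see with some quick arithmetic. A rough sketch, assuming overhead is simply parity pieces divided by data pieces, and that it only applies to data that has actually been erasure coded (anything still in RF2 keeps its 100% overhead):

```python
RF2_OVERHEAD = 1.0  # one full extra copy of every piece of data


def ec_overhead(data_pieces: int, parity_pieces: int) -> float:
    """Extra capacity needed per unit of erasure-coded data."""
    return parity_pieces / data_pieces


def members_read_to_rebuild(data_pieces: int) -> int:
    """Stripe members read to rebuild one lost data piece (single parity)."""
    return (data_pieces - 1) + 1  # surviving data pieces plus the parity


print(f"RF2: {RF2_OVERHEAD:.0%} overhead")
for data_pieces, parity_pieces in [(2, 1), (4, 1), (10, 1)]:
    overhead = ec_overhead(data_pieces, parity_pieces)
    reads = members_read_to_rebuild(data_pieces)
    print(
        f"{data_pieces}/{parity_pieces} stripe: {overhead:.0%} overhead, "
        f"{reads} members read to rebuild one lost piece"
    )
```

Going from 2/1 to 4/1 halves the overhead (50% down to 25%), while going from 4/1 to 10/1 only gains another 15 points and more than doubles the number of members touched on a rebuild.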

Small 3 node systems will require GFLAG modifications via Support/Engineering – so try to avoid turning on EC for small 3-node clusters (ie. leave as standard RF2 for all data tiers). The official line is that only 4 nodes and above are EC compatible.

Use Cases for EC:

A good use case for EC is for workloads with the likelihood of a lot of cold tier data (eg. Snapshots, File Servers, Archive, Backups etc).

The following environments are not ideal for EC:

1. High amounts of re-writes or overwrites. A heavy write container would not get the benefits of EC given that the data would always be refreshed and remain in RF2.

2. High number of small writes. The reason is that EC stripe calculation is done on extent group – or egroup – level (4MB) with 1MB extents, so there could be overhead if a lot of small write changes occur to update the extent groups.

Some Nutanix EC facts:

EC is a Container-level operation.

EC supports other Nutanix features such as dedupe/compression/DR/Metro; although Block Awareness is not supported with EC in this current NOS 4.1.3 release (I’d expect this to change with a future software update).

EC will take some time to realise savings due to the delay setting to ensure only ‘cold’ data is a candidate for EC. See below on delay timing recommendations.

New writes will remain in the ‘hot’ tier and are still RF2/RF3 as per normal, as EC only acts on the ‘cold’ data tier.

EC is a background Curator task, therefore minimal impact on I/O.

Once parity blocks are calculated and confirmed, the original 2nd-copy RF replicas are discarded.

The parity blocks are never stored on the same nodes as the ones with the original stripe data.

If you have RF3/FT2 containers, there will be a 2nd EC parity block – to handle 2 simultaneous failures (as you’d expect).

EC Delay Timings:

EC has a timer which records the time since a piece of data was last modified. If the last time the data was modified hasn’t yet reached the delay threshold, the standard RF mechanisms remain current for that data.

Try and use a longer delay before EC kicks in (default is 1 hour) – so that only super-cold data is a candidate for EC – minimising the chance of EC replicas being needed by active VMs. ie. Hot data is not affected as it would still be in a normal RF state.

If you configure EC for say a delay of 1 week, the failure scenario effects should reduce because clearly the data isn’t likely to be accessed by the VMs affected by the failure (or for example, a NOS upgrade) due to the time since last modification.

I recommend EC delay to be configured for a week. EC should be treated as long term cold data optimisation.
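The delay check itself is simple enough to sketch in a couple of lines. This is only my mental model of the behaviour described above, not Curator’s actual code, and the function and constant names are hypothetical:

```python
DEFAULT_EC_DELAY_S = 3600        # default delay of 1 hour
ONE_WEEK_S = 7 * 24 * 3600       # 604800 seconds, the delay I recommend


def is_ec_candidate(last_modified_s, now_s, delay_s=DEFAULT_EC_DELAY_S):
    """True once data has gone unmodified for at least the EC delay."""
    return (now_s - last_modified_s) >= delay_s


now_s = 10 * 24 * 3600                       # pretend "now" is day 10
modified_two_days_ago_s = now_s - 2 * 24 * 3600

# Candidate under the 1 hour default, but still plain RF2 with a 1 week delay.
print(is_ec_candidate(modified_two_days_ago_s, now_s))
print(is_ec_candidate(modified_two_days_ago_s, now_s, delay_s=ONE_WEEK_S))
```

Data touched two days ago would already be erasure coded with the default delay, whereas a one week delay leaves it in normal RF2 until it has truly gone cold.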

On a Nutanix cluster you can get an idea of the age (access/modification age) of your data by running the following command from any CVM:

curator_cli get_egroup_access_info

..to show the breakdown of the age of access/modified data to make an informed decision on suggested EC delay timings.

Set the EC timings to be older than the majority of modify write age – to minimise re-calculations in/out of EC stripes.

Output of ‘curator_cli get_egroup_access_info’ command in 4.1.3

So the output in this example above shows that the value of ’16’ means 16 ‘egroups’ of data were last modified between 1 hour and 1 day ago (3600-86400 seconds), but the same 16 egroups were accessed (read only) over a week ago (604800-2592000 seconds).

Another way to look at it is that 187+5501=5.6K egroups out of a total of 5.7K egroups were last modified over 604800 seconds ago – so on this cluster 98% of the data has not been written to for over a week. Yes, I had this cluster off for a while – it was being shipped :)

So, looks like most of my data was last modified over a week ago. So let’s change the EC settings so that only cold data from over a week is considered as candidates for EC. Therefore, cold data less than a week old would remain as normal RF2 in my case.

As you can see, you can use this table to get a good feel for what EC delay setting makes most sense in your Nutanix cluster. If you are unsure on the best delay setting for your cluster, log a support call and one of the Nutanix support people will be happy to help.
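If you want to sanity-check a candidate delay against the table, the sum is easy to script. A rough sketch: the counts 16, 187 and 5501 are the ones quoted from my output, but which of the two oldest buckets holds 187 versus 5501, the bucket edges other than 3600-86400 and 604800-2592000, and the zero counts elsewhere are all assumptions for illustration. You would type in the numbers from your own cluster.

```python
# (lower_bound_s, upper_bound_s, egroup_count) by time since last modification.
# Only the counts 16, 187 and 5501 come from the example; the rest is assumed.
MODIFY_AGE_BUCKETS = [
    (0, 3600, 0),
    (3600, 86400, 16),
    (86400, 604800, 0),
    (604800, 2592000, 187),
    (2592000, None, 5501),   # None means "no upper bound"
]


def fraction_older_than(buckets, delay_s):
    """Share of egroups whose whole bucket lies at or beyond delay_s."""
    total = sum(count for _, _, count in buckets)
    older = sum(count for lower, _, count in buckets if lower >= delay_s)
    return older / total if total else 0.0


for label, delay_s in [("1 hour", 3600), ("1 day", 86400), ("1 week", 604800)]:
    share = fraction_older_than(MODIFY_AGE_BUCKETS, delay_s)
    print(f"EC delay {label:>6}: {share:.1%} of egroups would be candidates")
```

If the share barely drops as you lengthen the delay, as it does here, you lose almost nothing in savings by choosing the longer, safer delay.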

From any CVM:

#set EC to one week (604800 seconds), note my container name was NFS1 
ctr edit name=NFS1 erasure-code="on" erasure-code-delay=604800

EC delay of one week would be fine for when you plan to fill up your clusters over a long time – say over a year. Try and only use EC on ‘stable’ clusters – ie. no plans for node removals, as node removals will need to break the EC stripes, and hence revert to standard RF2 – so sizing of containers still needs to be considered for RF2 if you plan to remove nodes later (and that may take some time to break EC and revert to normal RF). This however is a rare scenario of course (not many people remove Nutanix nodes :)

EC and reads in a failure situation:

Because of the Data Locality bias in Nutanix (which despite what people may tell you is absolutely necessary at scale to maintain consistent performance), Curator prefers the local copy to survive in the case of EC. That is, only the replicas on ‘other’ nodes are candidates for EC. This means that the node on which your guest VMs are running will have their original data available. If a failure occurs locally (eg disk) and the local bits are now missing, we can now re-create the missing (cold data) piece from reading data from 4 places (in the case of a 4/1 stripe).

EC decodes only the needed amount on the fly and returns that to satisfy the read. No need to recompute parities or create new strips – only decode what’s not available. The strip remains intact and the contents of all strip members remain the same.

For writes, remember that new writes would be treated as per normal in an RF2 situation (hot/oplog etc) – hence EC isn’t involved.

Testing EC in your environment:

When testing EC, make sure you don’t fall into the trap of not utilising the scale-out advantages that Nutanix brings. A lot of people when testing Nutanix features and performance just test things using 1 Virtual Machine – and I’ve yet to discover a production environment with only 1 VM. Scale out your tests and you’ll see the difference.

Dima from Nutanix Engineering explained it to me with regard to EC on Nutanix:

Data locality bias means that Curator prefers the local replica to survive (only one of two replicas survive Erasure Coding in RF=2). For example, to create one EC strip of 4 egroups we must have at least 4 local egroup replicas on 4 distinct CVMs. Some who experimented for the first time with EC wrote data through a single UVM. This means all primary replicas (local replicas) reside on just one CVM and no erasure coding is taking place! In real life UVMs are usually distributed around the cluster, so the local bias is a good thing to have.
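Dima’s point about single-VM tests can be shown with a toy model. This is a deliberately simplified sketch of the constraint he describes (one strip needs local egroup replicas on 4 distinct CVMs), not how Curator really selects strip members; the function name, CVM names and egroup names are hypothetical.

```python
from collections import defaultdict


def form_strips(egroup_to_cvm, strip_width=4):
    """Greedily build strips of egroups, one per distinct CVM in each strip."""
    by_cvm = defaultdict(list)
    for egroup, cvm in egroup_to_cvm.items():
        by_cvm[cvm].append(egroup)

    strips = []
    while True:
        # CVMs that still hold an unassigned local egroup, fullest first.
        available = sorted(
            (cvm for cvm in by_cvm if by_cvm[cvm]),
            key=lambda cvm: len(by_cvm[cvm]),
            reverse=True,
        )
        if len(available) < strip_width:
            return strips
        strips.append([by_cvm[cvm].pop() for cvm in available[:strip_width]])


# Test written through a single UVM: every local replica sits on one CVM.
single_uvm = {f"egroup-{i}": "cvm-A" for i in range(8)}

# Same amount of data written by UVMs spread over four nodes.
cvms = ["cvm-A", "cvm-B", "cvm-C", "cvm-D"]
distributed = {f"egroup-{i}": cvms[i % 4] for i in range(8)}

print(len(form_strips(single_uvm)))
print(len(form_strips(distributed)))
```

The single-UVM layout can never fill a strip no matter how much data it writes, while the distributed layout forms strips straight away, which is why a one-VM test shows no EC savings.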

Hope that this post has helped explain when to use EC and some of the inner workings of the tech.
Thanks to Jerome (@getafix_nutanix), Dima and Kannan for helping me gather info for this post, and to Chad (@thechads) and Dan (@DanMoz) for reviewing.

Nutanix Community Edition Deployment Options

Invisible Infrastructure

Helping people make storage and compute invisible since 2012. Next stop : making the hypervisor and cloud services just as invisible.
