Being new to Erasure Coding, this topic is taking me a bit of getting used to. So I decided to reach out to Nutanix Engineering and try and get some of my basic questions answered, mainly because if I am recommending to turn on a feature I need to get my head around it first. After some discussions, I’ve put together this post – which is a bit long – but it is important to get an understanding on the best way to implement it.
If you are going to try Nutanix Erasure Coding (or EC-X; for simplicity I’ll drop the X) which is in tech preview stage with NOS 4.1.3, here are some recommendations and notes for you to consider. Please note that things could change with subsequent software updates – so as always check with the latest release notes and docs.
As with every ‘tech preview’ feature, it is not recommended for production environments “yet”.
For a quick intro to EC, see: http://www.joshodgers.com/2015/06/09/whats-next-erasure-coding/ and http://myvirtualcloud.net/?p=7106
Nutanix has always had data protection in-built, called Redundancy Factor (or RF). Typically, 2 copies of data are kept on a Nutanix cluster (one on a local node where the guest VMs are running, and one remote) – therefore this is called RF2. Three copies on larger clusters (5 or more nodes) are also possible – so RF3. In either case, this investment in data protection comes at a cost of course. In the RF2 case, say for every 1GB of data you write, there is a mirrored 1GB stored as well in case of some sort of failure, and you have to have the storage available to handle this.
EC is a method to try to reduce this overhead.
How EC works:
This is the normal situation in a Nutanix cluster of 6 nodes with RF2 as standard:
RF2 replicas are therefore a ‘100%’ overhead in a sense – you need a 1:1 copy of your data to protect it in case of some form of disk/node failure. Fair enough, but is there a way to get some better efficiency?
EC has been introduced to try to reduce that 100% overhead of the 2nd copy.
EC works by creating a mathematical function around a data set such that if a member of the data set is lost, the lost data can be recovered easily from the rest of the members of the set.
So lets turn on EC. After a delay, the algorithm will act on the secondary copies of data and create a parity stripe set, then Curator will discard the original 2nd copies:
OK – but what happens in a failure? Lets fail a node and see what happens to that cold-tier of data:
…So we need to recover the ‘c’ piece of data:
Standard EC requires at least 4 nodes (2/1 stripe), and 6 nodes are highly recommended (4/1 stripe), so as you can see ‘large’ clusters are preferred for EC. After 6 nodes the stripe algorithm remains at 4/1 by default (and recommended). You could change the stripe sizes via CLI (to say 10/1) but you’ll get diminishing returns versus the extra read penalty (reading more data on more nodes to form the EC stripe).
Small 3 node systems will require GFLAG modifications via Support/Engineering – so try to avoid turning on EC for small 3-node clusters (ie. leave as standard RF2 for all data tiers). The official line is that only 4 nodes and above are EC compatible.
Use Cases for EC:
A good use case for EC is for workloads with the likelihood of a lot of cold tier data (eg. Snapshots, File Servers, Archive, Backups etc).
The following environments are not ideal for EC:
1. High amounts re-writes or overwrites. A heavy write container would not get the benefits of EC given that the data would always be refreshed and remain in RF2.
2. High number of small writes. The reason is that EC stripe calculation is done on extent group – or egroup – level (4MB) with 1MB extents, so there could be overhead if a lot of small write changes occur to update the extent groups.
Some Nutanix EC facts:
EC is a Container-level operation.
EC supports other Nutanix features such as dedupe/compression/DR/Metro; although Block Awareness is not supported with EC in this current NOS 4.1.3 release (I’d expect this to change with a future software update).
EC will take some time to realise savings due to the delay setting to ensure only ‘cold’ data is a candidate for EC. See below on delay timing recommendations.
New writes will remain in the ‘hot’ tier and are still RF2/RF3 as per normal, as EC only acts on the ‘cold’ data tier.
EC is a background Curator task, therefore minimal impact on I/O.
Once parity blocks are calculated and confirmed, the original 2nd-copy RF replicas are discarded.
The parity blocks are never stored on the same nodes as the ones with the original stripe data.
If you have RF3/FT2 containers, there will be a 2nd EC parity block – to handle 2 simultaneous failures (as you’d expect).
EC Delay Timings:
EC has a timer which records the time since a piece of data was last modified. If the last time the data was modified hasn’t yet reached the delay threshold, the standard RF mechanisms remain current for that data.
Try and use a longer delay before EC kicks in (default is 1 hour) – so that only super-cold data is a candidate for EC – minimising the chance of EC replicas being needed by active VMs. ie. Hot data is not affected as it would still be in a normal RF state.
If you configure EC for say a delay of 1 week, the failure scenario effects should reduce because clearly the data isn’t likely to be accessed by the VMs affected by the failure (or for example, a NOS upgrade) due to the time since last modification.
I recommend EC delay to be configured for a week. EC should be treated as long term cold data optimisation.
On a Nutanix cluster you can get an idea of the age (access/modification age) of your data by running the following command from any CVM:
..to show the breakdown of the age of access/modified data to make an informed decision on suggested EC delay timings.
Set the EC timings to be older than the majority of modify write age – to minimise re-calculations in/out of EC stripes.
So the output in this example above shows that the value of ’16’ means 16 ‘egroups’ of data were last modified between 1 hour and 1 day ago (3600-86400 seconds), but the same 16 egroups were accessed (read only) over a week ago (604800-2592000 seconds).
Another way to look at it is that 187+5501=5.6K egroups out of a total of 5.7K egroups was modified over 604800 seconds ago – so this cluster 98% of its data has not been written to in less than a week. Yes, I had this cluster off for a while – it was being shipped :)
So, looks like most of my data was last modified over a week ago. So lets change the EC settings so that only cold data from over a week is considered as candidates for EC. Therefore, cold data less than a week old would remain as normal RF2 in my case.
As you can see, you can use this table to get a good feel for what EC delay setting makes most sense in your Nutanix cluster. If you are unsure on the best delay setting for your cluster, log a support call and one of the Nutanix support people will be happy to help.
From any CVM:
#set EC to one week (604800 seconds), note my container name was NFS1 ctr edit name=NFS1 erasure-code="on" erasure-code-delay=604800
EC delay of one week would be fine for when you plan to fill up their clusters over a long time – say over a year. Try and only use EC on ‘stable’ clusters – ie. no plans for node removals, as node removals will need to break the EC stripes, and hence revert to standard RF2 – so sizing of containers still need to be considered for RF2 if you plan to remove nodes later (and that may take some time to break EC and revert to normal RF). This however is a rare scenario of course (not many people remove Nutanix nodes :)
EC and reads in a failure situation:
Because of the Data Locality bias in Nutanix (which despite what people may tell you is absolutely necessary at scale to maintain consistent performance), Curator prefers the local copy to survive in the case of EC. That is, only the replicas on ‘other’ nodes to be candidates for EC. This means that the node on which your guest VMs are running will have their original data available. If a failure occurs locally (eg disk) and the local bits are now missing, we can now re-create the missing (cold data) piece from reading data from 4 places (in the case of a 4/1 stripe).
EC decodes only the needed amount on the fly and returns that to satisfy the read. No need to recompute parities or create new strips – only decode what’s not available. The strip remains intact and the contents of all strip members remains the same.
For writes, remember that new writes would be treated as per normal in an RF2 situation (hot/oplog etc) – hence EC isn’t involved.
Testing EC in your environment:
When testing EC, make sure you don’t fall into the trap of not utilising the scale-out advantages that Nutanix brings. A lot of people when testing Nutanix features and performance just test things using 1 Virtual Machine – and I’ve yet to discover a production environment with only 1 VM. Scale out your tests and you’ll see the difference.
Dima from Nutanix Engineering explained it to me with regard to EC on Nutanix:
Data locality bias means that Curator prefers the local replica to survive (only one of two replicas survive Erasure Coding in RF=2). For example, to create one EC strip of 4 egroups we must have at least 4 local egroup replicas on 4 distinct CVMs. Some who experimented for the first time with EC wrote data through a single UVM. This means all primary replicas (local replicas) reside on just one CVM and no erasure coding is taking place! In real life UVMs are usually distributed around the cluster, so the local bias is a good thing to have.
Hope that this post has helped explain when to use EC and some of the inner workings of the tech.
Thanks to Jerome (@getafix_nutanix), Dima and Kannan for helping me gather info for this post, and to Chad (@thechads) and Dan (@DanMoz) for reviewing.
It looks like the command ‘curator_cli get_egroup_access_info’ does not work on AOS 188.8.131.52.
At least not on my environment. Has the command been changed?
I just checked one of the HQ clusters running 184.108.40.206 and the command is still there and works. Thanks for reading and reach out if you need help.
Seems to work on my cluster as well now. Not sure what I did differently. Was connecting to the VIP earlier today so perhaps the command is now running on a different CVM. Thanks anyway for getting back to me.