
Intro to Nutanix Lifecycle Manager (LCM) v1.2

“Single Pane of Glass” gets thrown around a lot by vendors, as does “upgrades are easy”. If you’re lucky enough to be a Nutanix customer, you already live this dream.

But how could the Nutanix Engineering team make the experience even better?

While Hugh was happily using the tried-and-true traditional Nutanix AOS upgrade method within Nutanix Prism, each individual component (such as AOS, the hypervisor, and BIOS) had to be upgraded independently. It was reliable, of course, but what if you wanted to upgrade many components of your cluster at once, with the same ease and reliability, and have the dependencies taken care of for you?

Plus, with security releases coming thick and fast these days, it is imperative that we make it dead simple for customers like Hugh to react quickly and patch their infrastructure, regardless of hypervisor or hardware component type.

Thus, Nutanix Life Cycle Manager (LCM) was born.


The LCM feature is available with Nutanix AOS 5.0+

LCM is a framework that can detect and upgrade hardware and software components in a rolling fashion completely in-band via Nutanix Prism, taking care of any dependencies and maintenance mode operations as needed to conduct the upgrades.

The idea is that you can go to one location to manage all your Nutanix-related software and firmware updates, click a button, and LCM will orchestrate the entire process with no effect on your running workloads. All this while you go and do something else in the meantime, or perhaps just have a quiet glass of red.

LCM has the power to tell whatever brand of hypervisor you are using to evacuate VMs to other nodes and reboot the host, should the update require it. Only when a host is confirmed to have returned to service can the next host conduct its upgrade.
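To make that rolling pattern concrete, here is a minimal conceptual sketch of the loop. This is not the actual LCM implementation – the host names and helper functions are hypothetical placeholders – it just shows the “one host at a time, only proceed when healthy” idea:

```python
# Conceptual sketch of a rolling, one-host-at-a-time update loop.
# This is NOT the actual LCM implementation; the host names and helper
# functions are hypothetical placeholders for illustration only.

import time

HOSTS = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical hosts

def enter_maintenance_mode(host):
    print(f"{host}: evacuating VMs and entering maintenance mode")

def apply_update(host):
    print(f"{host}: applying update (rebooting if required)")

def back_in_service(host):
    print(f"{host}: checking host and CVM health")
    return True                                    # pretend the check passes

def exit_maintenance_mode(host):
    print(f"{host}: exiting maintenance mode")

for host in HOSTS:
    enter_maintenance_mode(host)
    apply_update(host)
    while not back_in_service(host):               # wait until confirmed healthy
        time.sleep(30)
    exit_maintenance_mode(host)                    # only then move to the next host
```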

LCM is also intelligent enough to let you select a single node for some updates. For example, you might want to upgrade just the BIOS or disk firmware in one node. Some updates are cluster-wide and some are node-based, depending on the component, but in either case LCM will take care of the operation for you.

If you are not familiar with Nutanix LCM, then take a look at the quick video demos:

LCM is the framework that will (eventually) become the method by which you manage all updates and upgrades to your Nutanix clusters. I say ‘eventually’ because it is still early days for LCM, but things are ramping up quickly. As such, check for new LCM updates every week and see which new features are unlocked with the latest LCM Framework release.

You don’t have to wait for a new version of the Nutanix AOS software either – LCM is independent of AOS – so you can upgrade LCM whenever a new update is available.

So what’s new in LCM v1.2?

Previously, only SATADOM updates were supported. With v1.2, LCM supports additional inventory and update components on Nutanix NX and Dell XC clusters. Lenovo HX and Nutanix Software-Only support is under development.

The following has been added in LCM v1.2:

Nutanix NX and SX platform LCM support requires AHV or ESXi 5.5, 6.0, or 6.5, and adds updates for the following components: HDD, SSD and NVMe drives.

Dell XC platform LCM support requires ESXi 5.5 or 6.0 and AOS 5.1.0.3 or newer, and adds updates for the following XC components: XC BIOS, XC iDRAC, XC HBA controller, XC NIC, and XC disks (SSD and HDD).

For more details, check the release notes on the Nutanix Support Portal.

Using LCM

If you’ve not had a look at LCM before, I suggest you update the LCM Framework to the latest version, run an ‘Inventory’ (discovery) job and take a look around.

It is a good idea to run a ‘Perform Inventory’ operation first. This will scan your cluster and check if there are any updates available for any components, including the LCM Framework itself.

[Screenshot: the Perform Inventory option]

Go to the LCM page and select Options -> Perform Inventory. The status will change to “Perform Inventory in Progress”, which takes a few minutes as your whole cluster is scanned.
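Under the covers, an inventory is simply a long-running task that you can watch until it finishes. As a rough illustration only (this is not a documented LCM v1.2 API example – the endpoint path, response field and credentials below are assumptions), polling such a task could look something like this:

```python
# Rough sketch: poll a long-running Prism task until it completes.
# The endpoint path, response field name and credentials are assumptions
# for illustration, not a documented LCM v1.2 API.

import time
import requests

CLUSTER = "https://prism.example.local:9440"       # placeholder cluster address
TASK_URL = CLUSTER + "/PrismGateway/services/rest/v2.0/tasks/{uuid}"
AUTH = ("admin", "secret")                         # placeholder credentials

def wait_for_task(task_uuid, poll_seconds=30):
    while True:
        resp = requests.get(TASK_URL.format(uuid=task_uuid),
                            auth=AUTH, verify=False)
        resp.raise_for_status()
        status = resp.json().get("progress_status", "unknown")
        print(f"task {task_uuid}: {status}")
        if status.lower() in ("succeeded", "failed", "aborted"):
            return status
        time.sleep(poll_seconds)
```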

[Screenshot: inventory in progress]
You may see some available updates:

[Screenshot: available updates list]

The screenshot above shows software and some hardware components with available updates. To update LCM to the latest version, I’ll select the ‘Cluster Software Component’ and hit the ‘Update Selected’ button.

 

[Screenshot: update confirmation dialog]

Run that update. You will see a message that “Services will be restarted”, meaning the Nutanix internal LCM-related services will restart. This is non-disruptive to your workloads, so it is safe to run this update at any time. Once you hit the “Apply 1 Update” button, the update process starts.

 

[Screenshot: update in progress]

Once the new update to LCM is installed, run Perform Inventory again to see if there are any new updates or components supported in the new version (now that you’ve updated the LCM Framework, there may be some more unlocked features).

If there are any other updates available, you may choose to apply them as well using the same process.

Future Plans

In the coming months you will see more unlocked updates appear in LCM, including broader hypervisor support, more hardware component support, and more Nutanix software support (e.g. NCC, Foundation), so that the current “Upgrade Software” menu will eventually be retired and LCM will take over all functions related to our “1-Click Upgrades”.

LCM in Prism Central will also launch in 2018, with the ability to expand LCM to handle upgrades across multiple clusters.

In the meantime, the LCM Engineering team would love to hear your suggestions and feedback. They also love Twitter mentions, so please keep them coming.

 

Nutanix Foundation 3.7

Here are some of the improvements in Nutanix Foundation 3.7, which was released on February 27, 2017.

Reorder of Blocks via Drag n Drop 

You can now reorder blocks so that they are in the desired positions whilst keeping the IP addressing sequence intact. This is especially useful for large deployments.


Clicking ‘Reorder Blocks’ will show a drag n drop popup

A demo video illustrating the effects of reordering blocks (thanks YJ for the video):
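To make the “IP addressing sequence stays intact” point concrete, here is a tiny sketch with made-up node names and addresses: IPs are handed out by position in the list, so reordering the blocks changes which node lands in which position while the sequence itself is unchanged.

```python
# Tiny sketch (made-up data): IPs are assigned by position in the list, so
# reordering blocks changes which node lands in which position while the
# IP sequence itself stays intact.

import ipaddress

def assign_ips(nodes, first_ip):
    start = ipaddress.IPv4Address(first_ip)
    return {node: str(start + i) for i, node in enumerate(nodes)}

nodes = ["block2-nodeA", "block2-nodeB", "block1-nodeA", "block1-nodeB"]
print(assign_ips(nodes, "10.0.0.10"))

# After "reordering" block1 ahead of block2, the same IP sequence is reused:
reordered = ["block1-nodeA", "block1-nodeB", "block2-nodeA", "block2-nodeB"]
print(assign_ips(reordered, "10.0.0.10"))
```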

IP Sequencing

Instead of automatically incrementing IP addresses by 1, you can use a ‘+’ operator to create the sequence you desire. See below for an example.


Using +2 to auto-fill every second IP address in the sequence
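For illustration, the ‘+2’ sequencing is simply a fixed stride applied to the starting address. A quick sketch with made-up addresses:

```python
# Quick sketch of the "+2" style sequencing: a fixed stride applied to a
# starting address. The addresses below are made up.

import ipaddress

def ip_sequence(start, count, step=1):
    first = ipaddress.IPv4Address(start)
    return [str(first + i * step) for i in range(count)]

print(ip_sequence("10.0.0.10", 4, step=2))
# ['10.0.0.10', '10.0.0.12', '10.0.0.14', '10.0.0.16']
```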

Timezone support

You can now optionally select the timezone for your cluster at the Global Configuration page.


Image Selection Page Improvements

The Image Selection page no longer forces you to upload new images; instead, you can use the images already on the detected nodes. You can also add and delete ISOs from the UI.


Image Selection Page Redesigned

Updating Foundation from UI

You can now use the UI to upload a Foundation tarball and update Foundation itself when new versions are available. Since it is highly recommended to always run the latest version, this makes staying up to date much easier. You can get the latest release notes and Foundation files from the Nutanix Support Portal.


Summary

This is a great release by the Foundation Engineering team at Nutanix, with over 70 bug fixes and improvements such as those above. There are several exciting features coming in future releases (approximately every month), so always check the Nutanix Support Portal for the latest updates. If there is a feature you’d like to see, feel free to contact me.

Nutanix Power Loss and Self-Recovery Demo

Failures. They happen. When they do, it is usually stressful to get things back to normal. What if that stress of a complete failure of all your servers and storage could be avoided, with the added bonus of recovery being fast?

One of the benefits of going down the hyperconverged route is that your infrastructure becomes simplified. Complexity is the enemy of uptime – so keeping things simple will win every time when uptime matters.

The Setup:

A Nutanix NX-3450 block (essentially 4 ESXi hosts in a 2U appliance).
Nutanix Software AOS 4.5.1
ESXi 6.0 on the 4 hosts
vCenter 6.0 (vCSA) hosted on one of these hosts, with HA and DRS on.

This is a fully contained vSphere environment in a hyperconverged 2U package.

The Test:

Now I’m going to unexpectedly kill this 4-node cluster.

Hard.

That is, a complete and simultaneous power loss, achieved by killing the power to both power supplies in the Nutanix block.

All the Guest VMs get killed of course, including vCenter and including the Nutanix Controller VMs (CVMs).

How difficult will it be to get everything back up and running?

In the video below I power off the cluster hard, then power it back on, manually start a timer, and watch what happens and when.
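If you want to reproduce the timing measurements yourself, a simple ping-watcher script is all you need. Here is a rough sketch (the component names and IP addresses are placeholders, and the ping flags are Linux-style):

```python
# Rough sketch: record how long after power-on each component first answers
# a ping. The names and IP addresses are placeholders; ping flags are
# Linux-style ("-c 1" = one packet, "-W 1" = one-second timeout).

import subprocess
import time

TARGETS = {
    "esxi-host-1": "10.0.0.11",
    "cvm-1":       "10.0.0.21",
    "vcenter":     "10.0.0.50",
}

start = time.time()
pending = dict(TARGETS)

while pending:
    for name, ip in list(pending.items()):
        ok = subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                            stdout=subprocess.DEVNULL).returncode == 0
        if ok:
            elapsed = int(time.time() - start)
            print(f"{name} first responded at {elapsed // 60}m:{elapsed % 60:02d}s")
            del pending[name]
    time.sleep(2)
```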

(Keep reading below if you don’t want to watch the whole thing)

So what happened?

The video is posted above, but for those who don’t want to sit through it, here’s the summary timeline. Note that the times below are from the laptop timer shown in the video:

0m:00s – Timer is started once power is switched back on.

1m:11s – My VMware vSphere Client connection times out due to no response from vCenter.

2m:40s – First ESXi host responds to ping.

4m:18s – Disk lights start to respond, indicating that the Controller VMs are booting and the disks are passed through to them.

4m:49s – First CVM responds to ping.

6m:29s – Although not visible on the screen, the NFS datastore was auto-restored on each ESXi host (allowing the Guest VMs to start, including vCenter). This means that the Nutanix cluster services have started.

9m:23s – vCenter, hosted on this failed cluster, has restarted and starts to respond to ping.

11m:50s – First successful login attempt to vCenter (hurry up, vCenter! :)

12m:03s – vCenter login successful and things look good – all guest VMs have restarted.

…so it took just over 12 minutes for the infrastructure to recover…. without any human intervention at all. All the guest VMs are powered back on and are running. Sure, they will be going through their crash-recovery procedures and you probably need to fix up some applications, but the infrastructure is up and stable (which is of course a prerequisite before you can even start to troubleshoot the applications).

OK, so what?

You’ve just witnessed the power of treating your storage like any other application or VM in your environment.

You can see the Nutanix Controller VMs sitting next to the Windows Guest VMs on the vCenter screen. By remaining independent of the hypervisor, and independent of vCenter, your software-defined storage can self-recover from a complete power failure without any human intervention, no reliance on other technologies, and then your workloads can start as you’d expect (or want!) them to in a failure scenario.

There is no need for external “witnesses”, no mucking around, no hours/days of downtime while you speak to the vendor to help you recover…just get the job done and production back to normal ASAP.

The Nutanix software does the hard work for you. Software is where it’s at in 2016!

By contrast, if you had “traditional” separate servers and storage, how long would it take to recover from a complete power outage of both? What about a failure of the SAN only? If you were to do the same with your SAN (pull the power hard), what would happen? How long would it take you to recover it? Could you get away with not touching it at all and expect it to recover, with the VM workloads coming back fine as well?

Remember, no one who pays the bills cares – they just want the applications UP.

In fact, this is how I normally move my Nutanix block around – I usually just pull the power. It saves me some time :)

Why do this test? (or “Whatever…my DC is designed to prevent power loss!”)

Things go wrong. Mains Power can die. UPS, Batteries, Generators can die. I’ve seen cases where someone has not refilled their backup diesel generator – and didn’t know until they needed it. I’ve seen a UPS firmware update kill power to a whole row of racks in the middle of the day. I’m sure many of you reading this are getting flashbacks from similar scenarios. It isn’t fun when it happens. I know of Nutanix customers who have had similar power issues and what I demo here is consistent with their experience.

I keep saying it, but ALWAYS test failure scenarios, especially in this new world of SDDC and hyperconvergence. You will find that not all hyperconverged players are created equal :)

Sure, you could perhaps auto-cutover to DR in this situation (depending on the length of outage) but isn’t it reassuring that you don’t have to worry about the infrastructure if you lost a whole cluster due to some unforeseen event?

How does Nutanix achieve this?

Writes on a Nutanix cluster always go to persistent storage and are verified and checksummed before the write is acknowledged back to the Guest VM making the request. Therefore, your data is always consistent from the Guest VM’s perspective. If the Guest VM thinks a write has occurred, you can be assured that there are at least two copies of that write in the Nutanix cluster, spread across nodes (the local node and at least one remote node).
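As a grossly simplified mental model of that write path (a conceptual sketch only, not Nutanix code): the write is acknowledged to the guest only after a checksummed copy is persisted locally and on at least one remote node.

```python
# Grossly simplified conceptual model of a replication-factor-2 write path:
# acknowledge the guest VM's write only after a checksummed copy is persisted
# locally AND on at least one remote node. This is NOT Nutanix code.

import hashlib

def persist(node, data, checksum):
    # placeholder for "write to persistent storage on this node and verify it"
    return hashlib.sha1(data).hexdigest() == checksum

def write_rf2(data, local_node, remote_nodes):
    checksum = hashlib.sha1(data).hexdigest()
    if not persist(local_node, data, checksum):
        raise IOError("local write failed")
    if not any(persist(node, data, checksum) for node in remote_nodes):
        raise IOError("no remote replica could be persisted")
    return "ACK"      # only now is the write acknowledged to the guest VM

print(write_rf2(b"guest VM write", "node-a", ["node-b", "node-c"]))
```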

You have a completely self-contained distributed file system that is designed from the ground up for handling failures and self-heal.

Other points:

Note that Nutanix can suffer a complete power loss of an entire block and your VMs can start on other blocks (normal HA), provided you have a minimum of 3 blocks in your cluster. This is called Availability Domains (formerly ‘Block Awareness’) and it is built in – you don’t have to configure anything. It ensures that data replicas are placed on blocks other than the source block. Cool.
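As a toy illustration of that placement rule (again a sketch with a made-up cluster layout, not the real algorithm): when choosing where the replica goes, skip any node that lives in the same block as the source.

```python
# Toy sketch of block-aware replica placement: choose a replica target in a
# different block from the source node. Not the real Nutanix algorithm, and
# the cluster layout below is made up.

NODE_TO_BLOCK = {
    "node-1": "block-A", "node-2": "block-A",
    "node-3": "block-B", "node-4": "block-B",
    "node-5": "block-C", "node-6": "block-C",
}

def pick_replica_target(source_node):
    source_block = NODE_TO_BLOCK[source_node]
    candidates = [n for n, b in NODE_TO_BLOCK.items() if b != source_block]
    return candidates[0]     # a real placer would also balance load/capacity

print(pick_replica_target("node-2"))   # always lands outside block-A
```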

Nutanix is simplifying the datacenter footprint, and more uptime is the result.

I often travel to remote locations and this demo has always had a positive response. Hopefully, I have shown you one of the many aspects in which Nutanix is unique in the hyperconverged and SDDC space.