In this post, I will be covering the process of upgrading NSX 6.1.4 to the newly released version 6.2.0.
The lab requiring the update is running vCenter 5.5 Build 2442329, with two ESXi hosts running 5.5 U2 Build 2638301. Most of the really cool improvements I saw at VMworld 2015 surrounding version 6.2 (cross-site functionality) are dependent on vSphere 6.0. My company hasn't touched 6.0 yet, so this lab remains at 5.5 in order to vet out functionality that can be leveraged back in the office. Stay tuned…. I will be teaming up with Chris Williams @mistwire in the next few weeks to write up the process of a fresh NSX 6.2 install on vSphere 6.0!
NOTE: The official documentation on the install/upgrade process is pretty detailed, and can be found on VMware's site at https://www.vmware.com/support/pubs/nsx_pubs.html. The process I am following closely mirrors the steps outlined in the "NSX 6.2 Upgrade Guide" located at that link.
My environment has quite a few NSX features enabled on it that we will be able to test both during and following the upgrade procedure. These include:
- NSX Edge IPsec VPN
- NSX Edge Load Balancer
- NSX Edge FW Ruleset
- NSX Distributed Firewall Ruleset
- Multiple Logical Switches
- Logical Distributed Router
- OSPF between LDR and Edge Gateway
There are specific areas of the product that I have not set up on this lab, namely:
- Service composer configurations
- Third party integrations
- Data Security implementations
Therefore, bear in mind the need to ALWAYS follow the official documentation, as not every component is going to be vetted for upgrade success in this blog.
The upgrade process is not a zero downtime event, as we will see during certain portions of the upgrade. Let’s get started!
ORDER OF OPERATIONS
The main components of NSX get updated in the following fashion:
- NSX Manager
- NSX Controllers
- NSX Enabled ESXi Hosts
- NSX Edges (LDR and GW)
PRE-EMPTIVE BACKUPS – NSX MANAGER
First, we’ll create a backup of the NSX Manager appliance. All of the information related to Edge Gateway and LDR deployments, as well as host Cluster settings, Transport settings, and Distributed Firewall rulesets are all kept in the NSX Manager appliance, so backing this up will be a critical step.
Log in to the NSX Manager Web UI. (This isn’t the vCenter Web Client I am referring to, but rather the admin page for the appliance itself.)
Once authenticated, select “Backup and Restore”.
You can select either an FTP or SFTP target for the backup. Since this is just my lab, I haven’t scheduled backups, which should be done for an enterprise implementation. For this one-off, I am just going to push to a random CentOS 7 guest I have on the network via SFTP.
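For reference, the target settings on the "Backup and Restore" page looked roughly like this (the host, user, and directory are from my lab, so treat them as placeholders):

```
Host IP/Name     : 10.0.0.50          <- my CentOS 7 guest
Port             : 22
Transfer Method  : SFTP
User name        : nsxbackup
Backup Directory : /srv/nsx-backups
Filename Prefix  : nsxmgr-614
```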
Once your target is set, let that baby rip! Just hit “Backup” and verify success.
OK – so we are backed up for NSX Manager in case things go south…. Let’s do the same for the Controller Cluster.
PRE-EMPTIVE BACKUPS – NSX CONTROLLERS
Log into the vSphere Web Client, and navigate to the “Networking and Security” tab.
Select "Installation" on the left hand menu, and navigate to the "Management" tab. From here, simply select the action to "Download the Controller Snapshot". (You will notice in the screenshot that I have only ONE controller….. ghetto-lab.) The Controller information has all of the VXLAN identifications and configurations, so this piece is just as critical as the NSX Manager.
The backup will end up downloading a gzipped tarball to your local system. Interestingly, the process for restoring a Controller Cluster isn't documented; the guide specifically calls out the need to engage VMware Support for a restore operation. Uggh… sounds like a potential headache if the upgrade goes bad and we need to restore. Hopefully this area of the product gets streamlined in the next release.
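Before relying on that download, it's worth a quick local sanity check on the archive. A small sketch (the example filename is hypothetical; use whatever the Web Client saved for you):

```shell
# Sanity-check a downloaded controller snapshot archive:
# gzip -t verifies the compressed stream, tar -tzf lists the contents.
check_snapshot() {
  gzip -t "$1" && echo "archive OK" || return 1
  tar -tzf "$1" | head
}

# Example (hypothetical filename from my download):
# check_snapshot controller-cluster-snapshot.tar.gz
```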
PRE-EMPTIVE BACKUPS – NSX EDGE DEVICES
All NSX Edge configurations (LDR and GW) are backed up as part of the NSX Manager backup. This includes related FW rule sets, as well as the DFW rule sets.
UPGRADE – NSX MANAGER
Now off to the races. Log back into the NSX Manager Web UI, and select “Upgrade”.
Verify we are starting from the expected version of NSX Manager, in our case 6.1.4. Select “Upgrade” and navigate to the TGZ upgrade file.
Once the TGZ file is selected, continue with the upgrade.
The NSX Manager will upload and check the integrity of the file, and then the upgrade will begin.
In my case, I had two runs fail to extract properly.
I had to shut down my NSX Manager, and increase the RAM allocation from the 6.1.4 setting of 12GB to the 6.2 requirement of 16GB. Once this was done, the extraction and install process moved forward without issue.
The wizard warns to take a snapshot of the VM, and also asks about SSH access. Click “Upgrade” once this has been done/answered.
The process will run the upgrade from this point without user interaction.
The NSX Manager will reboot into the newer version. SSH to the NSX Manager to ensure the service properly started, and also to verify the newer version.
Looks great! Also log in to the NSX Manager Web UI to verify all services have successfully started up. (NSX Universal Sync Service is for vSphere 6.0 only!)
The documentation says to restart the vSphere Web Client Service to allow the NSX plugin to upgrade as well. Once done, log in to the Web Client to verify the newer version is also visible in vCenter.
All good…. On to the controllers!
UPGRADE – NSX CONTROLLER CLUSTER
If you are following best practice, you have a cluster of three NSX Controllers. Log into the vSphere web client and navigate to "Networking & Security".
Select “Installation” on the left. The view will default to the “Management” tab open, which will list the Controller Cluster and associated status.
Select the “Upgrade Available” link advertised under the Controller Cluster Status column for the NSX Manager, and initiate the Controller Cluster upgrade.
The NSX Manager will push the upgrade to the controllers, and cycle the upgrade to one controller at a time.
As the process cycles through the upgrades, it is normal to see a “Disconnected” status come and go from the various nodes as quorum status changes during the restarts. I didn’t observe any network drops or issues during the process of my upgrade, which is to be expected since the controllers are outside of the NSX data plane.
Once all are upgraded, verify a “Normal” status via web client.
Also log in to each controller via SSH and run "show control-cluster status" to verify health at a more granular level.
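On a healthy node, the output looked roughly like the below in my lab (trimmed; your timestamps and node details will obviously differ):

```
controller # show control-cluster status
Type                Status                                       Since
--------------------------------------------------------------------------------
Join status:        Join complete                                08/29 14:02:01
Majority status:    Connected to cluster majority                08/29 14:02:07
Restart status:     This controller can be safely restarted      08/29 14:02:05
```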
And that is it! My controllers took between 5 and 9 minutes each to upgrade. The VMware documentation talks about steps to take if an upgrade process stalls, however I didn’t encounter any such issue.
UPGRADE – ESXi
Next up are the hosts. I have two 11th gen Dell rack servers in this cluster, so let's get started pushing the 6.2 VIBs to these guys.
SSH to the first host and run the command "esxcli software vib list | grep esx" to get a list of the current NSX VIB versions. We care about the "esx-dvfilter-switch-security", "esx-vsip", and "esx-vxlan" VIBs. We can see that my hosts are still running version 2691051.
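With more than a couple of hosts, it's handy to trim that output down to just the NSX VIB names and versions. A sketch against sample output captured from one of my hosts (on a live host you would pipe `esxcli software vib list` straight into the awk instead):

```shell
# Sample "esxcli software vib list" output captured from one of my hosts
# (columns: Name, Version, Vendor, Acceptance Level, Install Date).
cat > /tmp/vib-list.txt <<'EOF'
esx-dvfilter-switch-security  5.5.0-0.0.2691051  VMware  VMwareCertified  2015-06-10
esx-vsip                      5.5.0-0.0.2691051  VMware  VMwareCertified  2015-06-10
esx-vxlan                     5.5.0-0.0.2691051  VMware  VMwareCertified  2015-06-10
EOF

# Print just the NSX VIB names and their versions.
awk '/^esx-(vsip|vxlan|dvfilter-switch-security)/ {print $1, $2}' /tmp/vib-list.txt
```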
Log into the vSphere web client and navigate to "Networking & Security". (We will deal with maintenance mode shortly, once the new VIBs are staged.)
Navigate to “Installation” and the associated “Host Preparation” tab. Similar to the Controller Cluster upgrade process, there will be an “Upgrade Available” link for the cluster which we will select.
Once selected, the NSX Manager will remove the old VIBs and push the new ones to all hosts in the cluster at once. The old code remains running in memory, so this is a zero-downtime action. Once the VIBs are pushed, the cluster will show all hosts with a status of "Not Ready". The hosts will have to be rebooted into the new code set once the upgrade is complete.
Since I only have two hosts (and not the recommended bare minimum of three), I chose to vacate my hosts manually for better control over the process. I also disabled both HA and DRS on my cluster for the upgrade, so I don't have anti-affinity rules battling my upgrade process. Whether you do this manually or let DRS handle the work, each host will need to be cycled through maintenance mode for the upgrade, so begin by vacating the first host.
Next I hit “Resolve” for the host that I manually vacated, which initiates the reboot. If you have DRS enabled on the cluster, you can select “Resolve” at the cluster level, and the system will cycle through the cluster rebooting one host at a time automatically.
Once the host comes back up and re-registers with vCenter, we can verify the upgrade was successful.
Also log back into the shell to verify the VIB version. (In 6.2, the "esx-dvfilter-switch-security" VIB is folded into the "esx-vxlan" VIB.)
Sweet. On to the remaining hosts in the cluster for a reboot. Once all are done, we have successfully upgraded the hosts to 6.2!
UPGRADE – NSX EDGE DEVICES
The final components to upgrade are the Edge devices. These include any Logical Distributed Routers or Gateway devices. The process for the Edge upgrade actually instantiates a completely new Edge alongside the old one, and then cuts over to the new by swapping the vNIC connection and sending a GARP upstream. The upgrade guide says that we can expect some sort of network interruption for this process, which can be mitigated on the Edge Gateway if you are leveraging multiple Edge GW devices running active/active via ECMP. The LDR is not capable of running ECMP, so this mitigation is not available there. I am running my Edge devices only in HA mode, and am not leveraging ECMP, so I am very interested to see how many pings get dropped through my Edge devices during this cutover.
We’ll start this process with the LDR in my environment.
Log into the vSphere web client and navigate to “Networking & Security”.
Navigate to the Edge that needs the upgrade. Select “Actions” and then “Upgrade Version”.
We can see in the vSphere task window that the upgrade process is indeed deploying a new upgraded OVF to replace the existing.
As mentioned earlier, I am running both a single Edge LDR and a single Edge GW in this lab, and both are set with HA. This means there is a standby VM deployed as well for each LDR and GW. What I witnessed during my upgrade was a little unsettling…. I had assumed the described “vNIC swap and GARP” between the old and new appliances would be maybe seconds of interruption. Not so.
I had rolling pings going to two CentOS 7 systems on subnets fronted by this LDR HA pair. What I witnessed is that the pings began getting dropped at the very START of the upgrade process, or to be precise, at the same time that the upgrade process began spinning up the new appliance! Uggh. The steps of the upgrade and rolling pings went like this:
- Selected “Upgrade Version” at 13:05:43. First OVF to replace LDR0 begins deploying. Rolling pings stopped immediately.
- New appliance done deploying. Powered on at 13:09:04.
- vNIC swap, GARP and rename of appliance to replace LDR0. Shutdown and delete of old LDR0. All between about 13:09:24 and 13:09:45.
- Second OVF begins deploying to replace LDR1 in the HA pair at 13:09:50.
- New appliance done deploying. Powered on at 13:10:09.
- vNIC swap, GARP and rename of appliance to replace LDR1. Shutdown and delete of LDR1. All between about 13:10:30 and 13:10:57.
- Rolling pings start working again, at 13:10:57.
Yuck…. A little over a five-minute outage. Lame. Well, they did warn us at least. Plan your maintenance window accordingly!
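For what it's worth, the "rolling pings" above were just a timestamped probe loop along these lines (a sketch only; the target address behind the LDR is hypothetical, and I've bounded the loop here rather than running it forever):

```shell
# Log a timestamp whenever reachability to a target flips between up and down.
# Usage: probe_loop <iterations> <command...>
probe_loop() {
  count=$1; shift
  prev=""; i=0
  while [ "$i" -lt "$count" ]; do
    # Run the probe command; any nonzero exit counts as "down".
    if "$@" >/dev/null 2>&1; then state=up; else state=down; fi
    # Only log transitions, so the output reads as an outage timeline.
    [ "$state" != "$prev" ] && echo "$(date '+%H:%M:%S') target $state"
    prev=$state
    i=$((i+1))
  done
}

# Example (hypothetical CentOS 7 guest behind the LDR):
# probe_loop 600 ping -c 1 -W 1 172.16.10.20
```

Reading the transition timestamps off that log is how I got the outage windows listed above.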
My event list covering the above installer steps is here:
This makes me wonder if I could have cut my downtime by undoing the HA config before proceeding with the upgrade, since I observed a complete outage while it upgraded both the active and passive units. Stands to reason.
In any event, besides the lengthy outage (5 min) I am upgraded without issue!
Now on to the Edge Gateway device.
I am running several services via this Edge HA pair, including Load Balancer, IPsec VPN, and FW. I am expecting the same experience as with the LDR outage, since I am not leveraging ECMP in this lab. Here we go…
As with the LDR, navigate to the Edge that needs the upgrade. Select “Actions” and then “Upgrade Version”.
The process is the same as described above for the LDR section – new OVFs are deployed one at a time, and are swapped with the old active and passive GW appliances in serial fashion.
Surprisingly, my outage for this process on the GW devices was far less, only about a minute. As before, I had several rolling pings going to destinations egressing the Edge GW, including one to www.google.com (18.104.22.168), as well as one to a Fedora 20 system running at the other end of the IPsec tunnel maintained by this Edge GW. In addition, I had an aggressive wget command running in an endless loop against the LB running on this Edge, which is front-ending a couple of CentOS 7 systems running the Nginx web server.
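The wget loop was nothing fancy, roughly this (the LB VIP shown is a placeholder for my lab address, and the loop is bounded here for illustration):

```shell
# Hammer the LB VIP and log a timestamped OK/FAIL for each attempt.
# Usage: lb_probe <url> <attempts>
lb_probe() {
  url=$1; n=$2; i=0
  while [ "$i" -lt "$n" ]; do
    # -T 1 / -t 1: one-second timeout, single try, so failures log quickly.
    if wget -q -T 1 -t 1 -O /dev/null "$url"; then
      echo "$(date '+%H:%M:%S') LB OK"
    else
      echo "$(date '+%H:%M:%S') LB FAIL"
    fi
    i=$((i+1))
  done
}

# Example (hypothetical VIP front-ending the Nginx pool):
# lb_probe http://172.16.20.10/ 600
```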
Steps, timings, and observations for this run:
- Selected “Upgrade Version” at 13:36:00. First OVF to replace EDGE0 begins deploying. Rolling pings continue to flow.
- New appliance done deploying. Powered on at 13:36:15. Rolling pings continue to flow.
- vNIC swap, GARP and rename of appliance to replace EDGE0. Shutdown and delete of old EDGE0. All between about 13:36:33 and 13:36:49. Rolling pings continue to flow.
- Second OVF begins deploying to replace EDGE1 in the HA pair at 13:36:54.
- New appliance done deploying. Powered on at 13:37:07.
- vNIC swap, GARP and rename of appliance to replace EDGE1. Shutdown and delete of old EDGE1. All between about 13:37:26 and 13:37:42. Pings and wgets stop at 13:37:26.
- Rolling pings and wgets start working again, at 13:38:22.
This somehow went far better than the LDR: less than 60 seconds versus over 5 minutes. I'll bet with ECMP this would have been only a few seconds.
My event list covering the above installer steps is here:
Hmm. I am also wondering now if the length of the outage depends on whether the installer begins with the active or the passive unit in your HA pair. Unfortunately I didn't think to check this detail before beginning, so I will leave it as an open question for next time.
Either way you cut it, neither upgrade of Gateway or Logical Distributed Router is without some level of outage, so be sure to plan for this!
I now have both Edges upgraded, and all services that were running prior to the upgrade are working fine afterward, including FW rule sets, IPsec VPN tunnels, and Load Balancing.
This was pretty painless. No major hiccups, and the only outages I experienced were indeed mentioned in the upgrade documentation located at https://www.vmware.com/support/pubs/nsx_pubs.html.
All components that I have deployed in the lab worked fine after post upgrade testing.
Next time I am going to pay better attention to which Edge is active vs passive, or perhaps disable HA to see how the installer handles down time with no passive unit. If anyone has input in this area, please feel free to comment and share your experience!