Nutanix – Installer timeout, retrying at [2402/2430]

I ran into several problems while installing Nutanix CE2, across various installation media. Installing to USB and installing directly to SSD both hit timeouts and did not work as expected.

This post assumes you are installing from a USB stick. Other install media could pose different issues, but were not tested.

If you want the TL;DR version, go to Solution with USB or Solution with SSD, whichever fits your need.

1. Installation to USB

Installing Nutanix AHV to a USB is no longer recommended. I could not find any official documentation of this, but it is what Nutanix employees are saying on Reddit and on their Discord (references will be linked at the bottom of the page).

However, installing to USB also caused a timeout, apparently because the USB was not fast enough, regardless of USB 2.0 or 3.2. Maybe there is a driver “issue” that limits it to slow speeds either way. I tried Kingston and SanDisk USBs, and all of them timed out.

Nutanix Installation to USB timeout – current retry 0

A Reddit user told me to wait it out: the “retry” does not restart the installation, the installation simply continues. I left it running, and at around 500/2430 in retry 1 my installation completed successfully, on USB.

1.1 Solution with USB

Do not terminate the installation after it has failed its retry. Just let it run and eventually it will install.

Additionally, if you can SSH to the machine or otherwise access /tmp/installer_vm.log, run tail -f /tmp/installer_vm.log to follow the actual installation progress.

1.1.1 Caveats with USB

Worth mentioning, again, Nutanix Employees have stated that they no longer recommend USB.

If this is your only option, please make sure you understand that AHV updates (through LCM) are not supported (though they can sometimes be done with some hackery).
You will also face USB storage driver issues, which you can read about here: Nutanix – Updating CE2 from LCM, server not coming back up

2. Installation to SSD

After having successfully installed on USB, I figured I wanted to retry installing AHV onto my SSD to avoid the caveats listed above.

I started out by letting the installation run through the night, and woke up to the screen below: “InstallerVM timeout occurred, current retry 4”.

Nutanix Installation to SSD timeout – current retry 4

This screenshot shows it reaching its final retry and still not finishing the installation. The Python “crash” is simply due to poor error handling, as there is no error message if the installer reaches 4 retries without succeeding. I then dug through the logs to figure out what was going on. The log at /tmp/installer_vm.log clearly showed the installation finishing, including the post-installation scripts. However, the installer never reported that it had finished, and it finally crashed on retries.

Excerpt from the installer_vm.log file here:

Installation

 1) [x] Language settings                 2) [x] Time settings
        (English (United States))                (UTC timezone)
 3) [x] Installation source               4) [x] Software selection
        (Local media)                            (Custom software selected)
 5) [x] Installation Destination          6) [x] Network configuration
        (Warning checking storage                (Wired (enp0s2) connected)
        configuration)
 7) [ ] User creation
        (No user will be created)

Progress
Setting up the installation environment
.
Creating disklabel on /dev/sda
.
Creating efi on /dev/sda1
.
Creating ext4 on /dev/sda2
.
Running pre-installation scripts
.
Starting package installation process
Preparing transaction from installation source
Installing libgcc (1/517)
... ((Commented out due to too big file))
Installing nutanix-ahv-release (515/517)
Installing grub2-efi-x64 (516/517)
Installing dosfstools (517/517)
Performing post-installation setup tasks
Installing boot loader
.
Performing post-installation setup tasks
.

Configuring installed system
.
Writing network configuration
.
Creating users
.
Configuring addons
.
Generating initramfs
.
Running post-installation scripts

So, the progress counter we are presented with is not the actual installation progress.
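To see the real progress rather than the seconds counter, the package lines in the log can be parsed. This is only a minimal sketch, assuming the log lines look like the excerpt above ("Installing <name> (n/total)"). The function names are hypothetical and not part of the Nutanix installer.

```python
import re

# Matches lines such as "Installing grub2-efi-x64 (516/517)".
# The line format is an assumption based on the log excerpt above.
PACKAGE_LINE = re.compile(r"^Installing (\S+) \((\d+)/(\d+)\)\s*$")


def package_progress(log_text):
    """Return (installed, total, package) from the latest package line, or None."""
    latest = None
    for line in log_text.splitlines():
        match = PACKAGE_LINE.match(line.strip())
        if match:
            latest = (int(match.group(2)), int(match.group(3)), match.group(1))
    return latest


def post_install_started(log_text):
    """True once the log mentions the post-installation scripts step."""
    return "Running post-installation scripts" in log_text
```

Fed the contents of /tmp/installer_vm.log from the excerpt above, package_progress would return (517, 517, 'dosfstools') and post_install_started would return True, which is the state the installer was in while the on-screen counter was still running.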

2.1 Some observations

(As I’ve posted on Reddit.) The installer starts an “Installer VM”. So…

  1. The installation takes place inside the InstallerVM (whose log we were reading with tail -f /tmp/installer_vm.log).
  2. The script waits a default of 40 minutes (2400 seconds) before retrying, checking at 30-second intervals, which is where the [30/2430] counter comes from (2400 seconds plus one 30-second interval). The default is 4 retries.
  3. To determine whether the installation has succeeded, the loop checks if the InstallerVM is still on. If it is on, the installation is not done and the loop waits another 30 seconds. If it’s done, it says “Hypervisor installation is done, brother, we golden”.
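The three points above can be sketched roughly as follows. This is not the code from installer_vm.py, only an illustration of the behaviour as observed. The constants come from the numbers on screen, vm_is_running is a hypothetical callable, and counting the attempts as retry 0 through 4 is an assumption based on the two timeout messages seen.

```python
import time

TIMEOUT_SECONDS = 2400  # 40 minutes per attempt, as observed
POLL_INTERVAL = 30      # seconds between checks, as observed
MAX_RETRIES = 4         # last retry number seen on screen


def wait_for_installer_vm(vm_is_running, sleep=time.sleep):
    """Poll until the VM powers off. Return True on success, False if all attempts time out.

    vm_is_running is a hypothetical callable that returns True while the
    InstallerVM is still powered on. Nothing is restarted between attempts.
    """
    for retry in range(MAX_RETRIES + 1):
        elapsed = 0
        while elapsed <= TIMEOUT_SECONDS:
            if not vm_is_running():
                return True  # VM is off, so the installation is considered done
            sleep(POLL_INTERVAL)
            elapsed += POLL_INTERVAL
            print(f"[{elapsed}/{TIMEOUT_SECONDS + POLL_INTERVAL}]")
        print(f"InstallerVM timeout occurred, current retry {retry}")
    return False
```

Under these assumptions, an installation whose VM never powers off keeps polling for 5 × 2430 seconds, a little over three hours, before giving up, even if the packages finished installing long before that.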

I noticed from my logs that my installation had indeed finished and all packages were transferred, but the InstallerVM never shut down, so the installer never reported success. It appears that some part of the timeout or retry logic in the script makes the InstallerVM “stall”, or at least not shut down upon finishing. I am not saying that is definitely what happens, and it only failed this way when installing directly to SSD, but it is an observation nonetheless.

2.1.1 Why do we even retry?

According to comments in the installation script:

# ESXi has a ~5% chance to stuck at writing to disk in QEMU, retries are helpful to recover.

Having changed too much of the script to dig deeper into what actually happens, my assumption is that some part of the “unstuck” fix for ESXi users is occasionally messing with us. My logs, however, reported nothing.

I wonder though: how many Nutanix users are using ESXi? Out of all Nutanix users, whether ESXi, Hyper-V, KVM or Xen, what percentage really had an issue that was solved by this retry?

Additionally, a counter that increments in 30-second intervals up to a hardcoded 2400 seconds, then shows a “failed” message, after which the installation continues and finishes, all masqueraded as a “retry” sequence, seems like... a curious solution. Not hating, merely questioning. I would have loved to see the installer_vm.log output instead of the 2400-second counter.

Oh well, maybe next release will measure [5/290 miles per hour] — Obviously kidding.

2.2 Solution with SSD

Welcome back from my observations/rant.

At this point I had removed some of the retry code from the installation script, while leaving as much as possible in place so as not to hit other obstacles (which is why the script may look a bit messy).

I have chosen to share my installation file because I’m seeing so many poor, frustrated souls on Reddit and Discord facing the same issues I did. Hopefully this post will find its way onto Google. Unfortunately, I’m also seeing people who I know now know the solution, yet are misguiding these poor souls on Reddit and Discord.

Link to GitHub script: https://github.com/Oreax/nutanixinstaller/tree/main

2.2.1 How to run the script

The process is also written in the script itself, but for good measure I will post it here as well:

  1. Once the installer GUI has popped up, press CTRL + C to exit the installer.
  2. Type in the following command (optional – but easier for step 3):
    service sshd start
  3. Replace the contents of /phoenix/imaging_helper/installer_vm.py with the contents of the script from the GitHub link above.
  4. Relaunch the installer (from server screen) in /root/ with: ./ce_installer && screen -r
  5. (Optional, but highly recommended) Follow the installer progress in SSH terminal with: tail -f /tmp/installer_vm.log
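For step 3, it may be worth keeping a copy of the stock file before overwriting it, so the two versions can be compared or rolled back. A minimal sketch, assuming a Python interpreter is usable from the installer shell and that the modified script has already been copied onto the machine. The helper name and the .orig suffix are hypothetical.

```python
import os
import shutil


def replace_with_backup(target, replacement):
    """Back up target to target + '.orig' (first run only), then overwrite it."""
    backup = target + ".orig"
    if not os.path.exists(backup):
        # Keep the stock file; never overwrite an existing backup.
        shutil.copy2(target, backup)
    shutil.copy2(replacement, target)
    return backup
```

For example, replace_with_backup("/phoenix/imaging_helper/installer_vm.py", "/root/installer_vm.py"), where /root/installer_vm.py is an assumed location for the downloaded script.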

3. References

I hope this helps some of you out!

Over’n’Out,

me.
