GRUB2: How to Avoid Boot Loops by Limiting Retries


When it comes to systems where manual intervention is a rare luxury – think remote servers or embedded systems – managing boot slots in a resilient manner is critical. This article introduces a refined GRUB2 configuration designed to limit the number of boot attempts, helping you steer clear of infinite boot loops.

GRUB2 Configuration

The GRUB2 setup outlined below uses a MAX_TRIES variable to limit the number of boot attempts for each available slot, A or B. Since stock GRUB2 script does not support arithmetic, the try counters have to be incremented by hand with an if/elif chain. Some GRUB2 builds include Lua scripting support, but it is usually not part of a standard GRUB2 installation.

set default=0
set timeout=3

set MAX_TRIES=3
set ORDER="A B"
set A_OK=0
set B_OK=0
set A_TRY=0
set B_TRY=0
set A_INDEX=0
set B_INDEX=1
load_env

# Select bootable slot
set FOUND=0
for SLOT in $ORDER; do
    eval "INDEX=\${${SLOT}_INDEX}"
    eval "OK=\${${SLOT}_OK}"
    eval "TRY=\${${SLOT}_TRY}"

    # If bootable and has fewer than MAX_TRIES attempts
    if [ "$OK" -eq 1 -a "$TRY" -lt $MAX_TRIES ]; then
        set default=$INDEX
        set FOUND=1

        # Increment attempts and save back to slot.
        # GRUB script has no arithmetic, so this chain only counts up to 3;
        # extend it if MAX_TRIES is raised.
        if [ "$TRY" -eq 0 ]; then
            set TRY=1
        elif [ "$TRY" -eq 1 ]; then
            set TRY=2
        elif [ "$TRY" -eq 2 ]; then
            set TRY=3
        fi
        eval "${SLOT}_TRY=$TRY"

        break
    fi
done

# Disable timeout if no slot is safe to boot.
# FOUND is checked rather than the counters, because the selected slot's
# counter has already been incremented at this point.
if [ "$FOUND" -ne 1 ]; then
    set timeout=-1
fi

save_env A_OK A_TRY B_OK B_TRY

CMDLINE="panic=60 quiet"

menuentry "Slot A (OK=$A_OK TRY=$A_TRY)" {
    linux (hd0,2)/kernel root=/dev/sda2 $CMDLINE rauc.slot=A
}

menuentry "Slot B (OK=$B_OK TRY=$B_TRY)" {
    linux (hd0,3)/kernel root=/dev/sda3 $CMDLINE rauc.slot=B
}

Updated Boot Logic Explained

In this improved GRUB2 configuration, the bootloader is instructed to find a suitable bootable slot according to a set of conditions. To get started, a series of variables are initialized and loaded from the GRUB environment. These variables include:

  • ORDER: Specifies the boot sequence, for example, “A B”.
  • A_OK and B_OK: Indicate whether Slot A or Slot B is bootable (1) or not (0).
  • A_TRY and B_TRY: Store the number of boot attempts made for Slot A or Slot B.
  • MAX_TRIES: Specifies the maximum number of retries allowed for a boot slot.

Boot Slot Selection

A for loop iterates over the slots specified in the ORDER variable. Within the loop, the following steps occur:

  1. The script uses the eval command to read the current slot's variables into INDEX, OK, and TRY.
  2. It then checks whether the slot is bootable (OK=1) and whether it has been tried fewer than the maximum number of times (TRY < MAX_TRIES).
  3. If both conditions are met, the script sets default to the index of the slot, so that its menu entry is booted.
  4. The TRY count for the selected slot is incremented and written back to A_TRY or B_TRY.
  5. The loop stops at the first matching slot. The timeout keeps its initial value of 3, allowing automatic booting.
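
The selection steps above can be dry-run in an ordinary POSIX shell, which is handy for checking the behaviour before touching a real bootloader. The following is a minimal sketch and not GRUB script: select_slot and SELECTED are hypothetical names introduced here, and the starting state is made up for illustration.

```shell
#!/bin/sh
# Dry run of the slot selection in plain POSIX shell (not GRUB script).
# select_slot is a hypothetical helper. It reads ORDER, MAX_TRIES and the
# per-slot *_OK / *_TRY / *_INDEX variables, like the GRUB loop does.
select_slot() {
    SELECTED=""
    for SLOT in $ORDER; do
        eval "INDEX=\${${SLOT}_INDEX}"
        eval "OK=\${${SLOT}_OK}"
        eval "TRY=\${${SLOT}_TRY}"

        # Bootable and not yet tried MAX_TRIES times?
        if [ "$OK" -eq 1 ] && [ "$TRY" -lt "$MAX_TRIES" ]; then
            SELECTED=$SLOT
            default=$INDEX
            # A regular shell can do arithmetic, unlike stock GRUB script
            eval "${SLOT}_TRY=$((TRY + 1))"
            return 0
        fi
    done
    return 1   # no slot qualifies; GRUB would wait at the menu
}

# Illustrative state: slot A has already failed twice, slot B is untouched
MAX_TRIES=3
ORDER="A B"
A_OK=1; A_TRY=2; A_INDEX=0
B_OK=1; B_TRY=0; B_INDEX=1

# Simulate five consecutive boots without ever marking a slot healthy
for BOOT in 1 2 3 4 5; do
    if select_slot; then
        echo "boot $BOOT: slot $SELECTED (A_TRY=$A_TRY B_TRY=$B_TRY)"
    else
        echo "boot $BOOT: no bootable slot, waiting for operator"
    fi
done
```

With this starting state the sketch picks slot A once, then slot B three times, and reports no bootable slot on the fifth pass.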

Once the system has booted and is healthy, the TRY counter of the running slot (A_TRY or B_TRY) needs to be reset to 0; otherwise it keeps incrementing toward MAX_TRIES with each reboot. After installing an update, ORDER needs to be swapped so that the updated slot is tried first.
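
One way to do both from the running system is a small script around grub-editenv. This is a sketch under assumptions: the environment file is taken to be /boot/grub/grubenv, the running slot is read from the rauc.slot= argument that the menu entries pass on the kernel command line, and the helper names (slot_from_cmdline, reset_tries, set_order) are hypothetical.

```shell
#!/bin/sh
# Assumed locations; both can be overridden from the environment.
GRUBENV="${GRUBENV:-/boot/grub/grubenv}"
GRUB_EDITENV="${GRUB_EDITENV:-grub-editenv}"

# Print the slot name found in a kernel command line string
slot_from_cmdline() {
    for ARG in $1; do
        case "$ARG" in
            rauc.slot=*) echo "${ARG#rauc.slot=}"; return 0 ;;
        esac
    done
    return 1   # no rauc.slot= argument present
}

# Reset the try counter of the given slot (A or B)
reset_tries() {
    "$GRUB_EDITENV" "$GRUBENV" set "${1}_TRY=0"
}

# Store a new boot order, e.g. after installing an update
set_order() {
    "$GRUB_EDITENV" "$GRUBENV" set "ORDER=$*"
}

# Typical use from a late boot service or health check:
#   SLOT=$(slot_from_cmdline "$(cat /proc/cmdline)") && reset_tries "$SLOT"
#   set_order B A    # after writing an update to slot B
```

Running the reset only after the system's own health checks pass keeps a slot that boots but misbehaves from being treated as good.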

Error Handling

If no slot meets the conditions, GRUB waits indefinitely for manual intervention (timeout=-1). This happens, for example, when both slots have reached the MAX_TRIES limit.

With this approach, the system does not enter an infinite boot loop with problematic slots. Instead, it degrades gracefully to a state that allows user intervention. In some situations, booting into a rescue shell may be a good alternative; this can be done by pointing default at a rescue menu entry instead of disabling the timeout.

Benefits and Use Cases

The primary advantage of this approach is its robustness. A system configured with this GRUB2 setup won't remain stuck in an endless boot loop. Instead, it stops booting a slot that has failed a set number of times and automatically fails over to the next available slot.

Modifying Variables with grub-editenv

Variables such as MAX_TRIES, A_OK, and B_OK can be modified with the grub-editenv utility. This is useful when you want to override the automatic behavior manually, for example for testing or debugging. Note that the counter chain in the configuration above only counts up to 3, so a MAX_TRIES value above 3 would never be reached unless the chain is extended.

# Example: Setting MAX_TRIES to 2
grub-editenv /boot/grub/grubenv set "MAX_TRIES=2"
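
Besides set, grub-editenv also offers list and unset, which help when inspecting or clearing the stored state. The following is a rough sketch, assuming the environment file path /boot/grub/grubenv; get_var and reset_all are hypothetical helper names.

```shell
#!/bin/sh
# Assumed locations; both can be overridden from the environment.
GRUBENV="${GRUBENV:-/boot/grub/grubenv}"
GRUB_EDITENV="${GRUB_EDITENV:-grub-editenv}"

# Print the value of one variable from the NAME=VALUE lines of "list"
get_var() {
    "$GRUB_EDITENV" "$GRUBENV" list | sed -n "s/^$1=//p"
}

# Give both slots a fresh set of attempts, e.g. after fixing a fault by hand
reset_all() {
    "$GRUB_EDITENV" "$GRUBENV" set A_OK=1 B_OK=1 A_TRY=0 B_TRY=0
}

# Usage:
#   get_var A_TRY
#   reset_all
#   grub-editenv /boot/grub/grubenv unset MAX_TRIES   # fall back to the value set in grub.cfg
```

Because grub.cfg assigns its defaults before load_env runs, any variable removed from the environment file simply falls back to the value set in the configuration.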

Integration with RAUC

For those who are using the Robust Auto-Update Controller (RAUC) for managing updates on their embedded systems, this GRUB2 configuration can be plugged in seamlessly. While a full explanation is out of scope here, know that RAUC can set these GRUB environment variables as part of its update process, enhancing the reliability of your updates.

Conclusion

With a simple tweak in your GRUB2 configuration, you can make your systems more resilient and easier to maintain. So go ahead and integrate this setup into your projects.
