Reliability

The most important job of Artemis is to keep devices updatable. Since updates are delivered through the broker, it is critical that the device can reach the broker.

Strategy

The Artemis service that is installed on the devices periodically downloads its configuration from the broker.

If the device can't reach the broker, it will retry more and more aggressively. First it will reduce the interval between check-ins. At some point, it will turn off non-critical containers during connection attempts. Eventually, it will even disable critical containers for a short period of time.

None of these measures help if the broker itself is no longer available. In that case, the device attempts to reach a recovery URL. This URL can provide a new broker configuration, allowing the device to connect again.

Being able to contact the broker means that the device can fetch new configurations and, hopefully, new firmware. However, new firmware isn't guaranteed to be able to reach the broker as well. Artemis therefore only commits to a new firmware after that firmware has successfully connected to the broker.

As a last resort, Artemis also uses a watchdog. If the device can't connect to the broker for a long time, the watchdog will trigger and reboot the device. This also guards against the Artemis service itself getting into a bad state.

Max offline

The frequency at which Artemis contacts the broker can be configured using the max-offline setting in the pod specification. A value of 0s means that the device stays connected to the broker and polls it continuously (but at most once every 20 seconds). A value of 1h means that the device can be offline for up to an hour before it tries to reconnect.

Here is a pod specification with a max-offline value of 1h:

# yaml-language-server: $schema=https://toit.io/schemas/artemis/pod-specification/v1.json

$schema: https://toit.io/schemas/artemis/pod-specification/v1.json
name: my-pod
sdk-version: v2.0.0-alpha.163
artemis-version: v0.25.0
max-offline: 1h
firmware-envelope: x64-linux
containers: {}

The max-offline value can also be set using the artemis CLI:

artemis device -d DEVICE-ID set-max-offline 1h

This configuration change is not permanent and will be lost with the next firmware update.

Status states

The Artemis service on the device keeps track of its connection status. It is either green, yellow, orange or red. As of September 2024, the following heuristics are used to determine the status:

  • green: the device is within the max-offline window.
  • yellow: the device is outside the max-offline window, but within a factor of 2. For example, if max-offline is set to 1h, then the device is in yellow state if it hasn't connected for more than 1h, but less than 2h.
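  • orange: the device is outside the max-offline window by more than a factor of 2, but within a factor of 3. For example, if max-offline is set to 1h, then the device is in orange state if it hasn't connected for more than 2h, but less than 3h.
  • red: the device hasn't connected for more than 3 times the max-offline value.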
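The heuristics above can be sketched as a small classification function. This is only an illustration in Python, not the Artemis implementation: the function name `connection_status` is hypothetical, the handling of the exact boundaries (for instance, exactly 2h with a max-offline of 1h) is an assumption, and the sketch assumes a max-offline value greater than zero.

```python
from datetime import timedelta


def connection_status(since_last_contact: timedelta, max_offline: timedelta) -> str:
    """Classify the status from the time since the last broker contact.

    Hypothetical helper; whether a boundary value belongs to the lower
    or the higher state is an assumption of this sketch.
    """
    if since_last_contact <= max_offline:
        return "green"   # still within the max-offline window
    if since_last_contact <= 2 * max_offline:
        return "yellow"  # outside the window, within a factor of 2
    if since_last_contact <= 3 * max_offline:
        return "orange"  # outside the window, within a factor of 3
    return "red"         # more than 3 times the max-offline value
```

With a max-offline of 1h, a device that last connected 90 minutes ago would be classified as yellow, and one that last connected 4 hours ago as red.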

The following actions (again, as of September 2024) are taken based on the status.

Orange

When a device is in orange state, Artemis reduces the interval between reconnection attempts by 50%. For example, if max-offline is set to 1h, then the device will try to reconnect every 30 minutes.

Artemis also disables all non-critical containers during reconnection attempts.

In this state, Artemis always tries a random entry from the recovery-URLs list to get a new broker configuration.

Red

Devices that can't connect to the broker for a long time are put into red state.

The red state is a more extreme version of the orange state. Artemis will reduce the reconnect interval by a factor of 4. For example, if max-offline is set to 1h, then the device will try to reconnect every 15 minutes.

As in the orange state, Artemis disables all non-critical containers during reconnection attempts. However, Artemis now also disables critical containers 15% of the time.

Unlike in the orange state, Artemis does not always contact the recovery URL. In case contacting a recovery URL makes things worse, Artemis only tries the recovery URLs 20% of the time.
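The orange and red behaviors can be summarized in a short Python sketch. The names `ReconnectPlan` and `plan_reconnect` are hypothetical, and the treatment of the green and yellow states (no special measures, interval equal to max-offline) is an assumption, since no actions are described for them here. The percentages are modeled as independent random draws per reconnection attempt, which is also an assumption.

```python
import random
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ReconnectPlan:
    interval: timedelta       # time between reconnection attempts
    stop_non_critical: bool   # disable non-critical containers during the attempt
    stop_critical: bool       # also disable critical containers during the attempt
    try_recovery: bool        # contact a random recovery URL


def plan_reconnect(status: str, max_offline: timedelta, rng: random.Random) -> ReconnectPlan:
    """Pick the measures for the next reconnection attempt (illustrative only)."""
    if status == "orange":
        # Interval reduced by 50%; recovery URL is always tried.
        return ReconnectPlan(max_offline / 2, True, False, True)
    if status == "red":
        return ReconnectPlan(
            interval=max_offline / 4,             # interval reduced by a factor of 4
            stop_non_critical=True,
            stop_critical=rng.random() < 0.15,    # critical containers off 15% of the time
            try_recovery=rng.random() < 0.20,     # recovery URL tried 20% of the time
        )
    # Assumption: green and yellow take no special measures.
    return ReconnectPlan(max_offline, False, False, False)
```

For a max-offline of 1h, this yields a 30-minute interval in orange state and a 15-minute interval in red state, matching the examples above.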

Recovery URLs

Even though recovery URLs are baked into pods, they are stored as properties of the fleet. We don't expect recovery URLs to change often, and having them in the fleet configuration avoids duplication.

A recovery URL can be added to an existing fleet with the following command:

artemis fleet recovery add RECOVERY-URL

If the existing broker for a device doesn't exist anymore, then the device can use the recovery URL to get a new broker configuration. The new broker configuration can be generated by running artemis fleet recovery export -o recovery.json and serving the resulting recovery.json file at the recovery URL.
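Any web server that returns the exported file will do. As a minimal sketch, assuming the file was exported as recovery.json and that plain HTTP on port 8000 is acceptable for a test setup, the following Python snippet serves it using only the standard library. The function name `make_recovery_server`, the port, and the `application/json` content type are assumptions of this example, not requirements stated by Artemis.

```python
import http.server
from pathlib import Path


def make_recovery_server(recovery_file: Path, port: int = 8000) -> http.server.HTTPServer:
    """Build a server that answers every GET request with the recovery file."""
    body = recovery_file.read_bytes()  # read once at startup

    class RecoveryHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            # Assumption: the exported file is JSON.
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return http.server.HTTPServer(("", port), RecoveryHandler)


# Usage (blocks until interrupted):
# make_recovery_server(Path("recovery.json")).serve_forever()
```

A production setup would more likely place the file on a static hosting service or behind an existing web server.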

We recommend using a different domain for the recovery URL. This way, the recovery URL is not affected by the same issues as the main broker.

Note that recovery URLs don't need to be online until they are needed.

Watchdogs

As mentioned earlier, Artemis uses watchdogs to monitor its own health. Given a certain max-offline value, the watchdog will trigger if the device hasn't connected to the broker for more than 5 times the max-offline value, with a minimum of 2 hours when max-offline is small.
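Read as a formula, the watchdog timeout is the larger of 5 times max-offline and 2 hours. The following Python sketch illustrates that reading; the names `watchdog_timeout` and `MIN_WATCHDOG_TIMEOUT` are hypothetical, and the exact rule used by Artemis may differ.

```python
from datetime import timedelta

# Assumption: 2 hours acts as a lower bound on the watchdog timeout.
MIN_WATCHDOG_TIMEOUT = timedelta(hours=2)


def watchdog_timeout(max_offline: timedelta) -> timedelta:
    """Return how long the device may stay disconnected before the watchdog reboots it."""
    return max(5 * max_offline, MIN_WATCHDOG_TIMEOUT)
```

Under this reading, a max-offline of 1h gives a timeout of 5 hours, while a max-offline of 10 minutes gives the 2-hour minimum.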

This watchdog is a powerful last resort in that it handles many different failures. If Artemis itself gets into a bad state, the watchdog will eventually help it recover. Similarly, user programs can communicate with the Artemis service through the Artemis package, and a bug in a user program could accidentally prevent Artemis from connecting to the broker; the watchdog covers that case as well.