Updating Rook Ceph


    Rook Ceph is one of the storage providers of Lokomotive. With a distributed system as complex as Ceph, the update process is not trivial. This document lists the steps for performing the update and shows how to monitor the process.


    Prerequisites

    • A Lokomotive cluster accessible via kubectl.

    • The update process should not be disruptive, but it is recommended to schedule downtime for the applications consuming the Rook Ceph PVs, and to take a backup of the data before starting the procedure below in case a problem arises. Read more about the backup process.


    Steps

    The following steps are inspired by the Rook docs.

    Step 1: Ensure AUTOSCALE is set to on

    Start a shell in the toolbox pod as specified in this doc and run the following command:

    # ceph osd pool autoscale-status | grep replicapool
    replicapool      0                 3.0         3241G  0.0000                                  1.0      32              on

    Ensure that the AUTOSCALE column says on and not warn. This should always be on, but it is especially important during updates. If the AUTOSCALE column says warn, run the command below to enable pool autoscaling, which ensures that placement groups scale up as the data in the cluster grows.

    ceph osd pool set replicapool pg_autoscale_mode on
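    The check above covers only replicapool. If the cluster has more pools, a small filter over the same status output can flag every pool that is not autoscaling. A minimal sketch, assuming the AUTOSCALE value is the last field of each row, as in the sample output above (`check_autoscale` is a hypothetical helper name):

```shell
# Hypothetical helper: read `ceph osd pool autoscale-status` output on
# stdin and print every pool whose AUTOSCALE column (last field) is not
# "on". An empty result means all pools autoscale.
check_autoscale() {
  awk 'NF > 1 && $NF != "on" && $1 != "POOL" { print $1 }'
}

# Usage (inside the toolbox pod):
#   ceph osd pool autoscale-status | check_autoscale
```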

    Step 2: Watch

    Watch events, updates and pods.

    Step 2.1: Ceph status

    Leave the following running in the toolbox pod:

    watch ceph status

    Ensure that the output reports health: HEALTH_OK. Compare the output against what is shown in the Rook update docs.

    IMPORTANT: Don’t proceed further if the output is anything other than HEALTH_OK.

    During the ongoing update and after completion, the output should stay in the HEALTH_OK state, although a cluster that is more than 60% full can temporarily flip to HEALTH_WARN.
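    For scripted runs, this health gate can be expressed as a small wait loop. A minimal sketch (`wait_for_health_ok` is a hypothetical helper; the status command is passed as a parameter so the loop can be exercised without a live cluster):

```shell
# Hypothetical helper: block until the given status command reports
# HEALTH_OK. Inside the toolbox pod you would call it as:
#   wait_for_health_ok "ceph status"
wait_for_health_ok() {
  # $1 is intentionally unquoted so "ceph status" splits into a command
  until $1 | grep -q 'HEALTH_OK'; do
    echo 'cluster not HEALTH_OK yet; retrying in 10s...'
    sleep 10
  done
}
```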

    Step 2.2: Pods in rook namespace

    Open another terminal window and keep an eye on the STATUS column of the following output. Make sure that the pods restart correctly and do not enter the CrashLoopBackOff state. Leave the following command running:

    watch kubectl -n rook get pods -o wide
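    To surface only the problem pods instead of scanning the full listing by eye, the same output can be filtered on the STATUS column. A minimal sketch, assuming the default `kubectl get pods` column layout where STATUS is the third column (`unhealthy_pods` is a hypothetical helper):

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# pods whose STATUS (3rd column) is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 "\t" $3 }'
}

# Usage:
#   kubectl -n rook get pods | unhealthy_pods
```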

    Step 2.3: Rook version update

    Run the following commands in a new terminal window to keep an eye on the Rook version as it is updated across all the sub-components:

    watch --exec kubectl -n rook get deployments -l rook_cluster=rook -o \
      jsonpath='{range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'
    watch --exec kubectl -n rook get jobs -o \
      jsonpath='{range .items[*]}{.metadata.name}{"  \tsucceeded: "}{.status.succeeded}{"      \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'

    You should see that rook-version slowly changes to v1.6.5.
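    The same jsonpath output can also tell a script when the rollout is done: once no line reports an old version, every deployment has been updated. A minimal sketch (`not_yet_updated` is a hypothetical helper; pass the target version, e.g. v1.6.5):

```shell
# Hypothetical helper: read the deployment listing produced by the
# jsonpath command above on stdin and print every line that does not yet
# carry the target rook-version ($1). No output means the rollout is done.
not_yet_updated() {
  grep -v "rook-version=$1" || true   # `|| true`: an empty result is success
}
```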

    Step 2.4: Ceph version update

    In a new terminal window, run the following command to keep an eye on the Ceph version as the new pods come up:

    watch --exec kubectl -n rook get deployments -l rook_cluster=rook -o \
      jsonpath='{range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  \tceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}'

    You should see that ceph-version slowly changes to 15.2.13.

    Step 2.5: Events in rook namespace

    In a new terminal window, leave the following command running to keep track of the events in the rook namespace. Watch the TYPE column, especially for events that are not of type Normal.

    kubectl -n rook get events -w

    Step 3: Dashboards

    Monitor various dashboards.

    Step 3.1: Ceph

    Open the Ceph dashboard in a browser window. Read the docs here to access the dashboard.

    NOTE: Accessing the dashboard can be a hassle because while the components are upgrading you may lose access to it multiple times.

    Step 3.2: Grafana

    Gain access to the Grafana dashboard as instructed here, and keep an eye on the dashboard named Ceph - Cluster.

    NOTE: The data in the Grafana dashboard will always lag behind the watch ceph status output running inside the toolbox pod.

    Step 4: Make a note of existing image versions

    Make a note of the images of the pods in the rook namespace:

    kubectl -n rook get pod -o \
      jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n\n"}{end}'

    After the update is complete, we run the same command again (Step 6) and compare the output to verify that the workloads now run the updated images.
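    One way to make that comparison mechanical is to capture the image list to a file now and diff it after Step 5. A minimal sketch (the file paths and the `unchanged_images` helper are hypothetical):

```shell
# Capture one image per line before the update, e.g.:
#   kubectl -n rook get pod -o \
#     jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}' \
#     > /tmp/images-before.txt
# ...and again to /tmp/images-after.txt once Step 5 is done.

# Hypothetical helper: print images that appear in both captures, i.e.
# images that were NOT updated (expected to be empty after the update).
unchanged_images() {
  # $1: before file, $2: after file (one image per line)
  sort -u "$1" > /tmp/_before.$$
  sort -u "$2" > /tmp/_after.$$
  comm -12 /tmp/_before.$$ /tmp/_after.$$
  rm -f /tmp/_before.$$ /tmp/_after.$$
}
```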

    Step 5: Perform updates

    With everything monitored, you can start the update process now by executing the following commands:

    # The replace command may fail for CRDs that are not yet installed;
    # the apply command below creates those.
    kubectl replace -f https://raw.githubusercontent.com/rook/rook/v1.6.5/cluster/examples/kubernetes/ceph/crds.yaml
    kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.6.5/cluster/examples/kubernetes/ceph/crds.yaml
    lokoctl component apply rook rook-ceph

    Step 6: Verify that the CSI images are updated

    Verify that the images were updated by comparing the output of the command below with the output recorded in Step 4.

    kubectl -n rook get pod -o \
      jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n\n"}{end}'

    Step 7: Final checks

    Once everything is up to date, run the following command in the toolbox pod to verify that all the OSDs are in the up state:

    ceph osd status
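    In a script, the same check can be derived from `ceph osd stat`, whose summary line has the shape `3 osds: 3 up (since 2h), 3 in (since 2h); ...`. A minimal sketch, assuming that line shape (`all_osds_up` is a hypothetical helper):

```shell
# Hypothetical helper: read a `ceph osd stat` summary line on stdin and
# exit 0 only when the "up" count equals the total OSD count.
all_osds_up() {
  awk '{
    total = $1                              # "<N> osds: <M> up ..."
    for (i = 2; i <= NF; i++)
      if ($(i) ~ /^up,?$/) up = $(i - 1)    # field before "up"/"up,"
    exit !(up == total)
  }'
}

# Usage (inside the toolbox pod):
#   ceph osd stat | all_osds_up && echo "all OSDs up"
```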

    Additional resources