
Unexpected DB outage when cluster is removed from maintenance mode after cluster service restart. #121

@sairamgopal

Description


Issue:
On a fully operational cluster, when the cluster is put into maintenance mode and the Pacemaker/cluster service is restarted, then after removing the cluster from maintenance mode the DB on the primary is stopped and started again, which results in an outage for customers.

Recreate the issue with the steps below:

  1. Make sure the cluster is fully operational, with one promoted and one demoted node and HANA System Replication (HSR) in sync.
  2. Put the cluster into maintenance mode ( crm configure property maintenance-mode=true )
  3. Stop the cluster service on both nodes ( crm cluster stop )
  4. Start the cluster service on both nodes ( crm cluster start )
  5. Remove the cluster from maintenance mode ( crm configure property maintenance-mode=false )

After step 5, the DB on the primary is restarted, or sometimes a failover is triggered. A consolidated reproduction script is shown below.
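For convenience, the reproduction steps can be collected into one small script. This is only a sketch: the crm commands are the ones listed above, the sleep duration is arbitrary, and `crm cluster stop` / `crm cluster start` still have to be executed on both nodes.

```bash
#!/bin/bash
# Sketch of the reproduction steps above; timings are arbitrary.
set -e

# Step 2: put the cluster into maintenance mode.
crm configure property maintenance-mode=true

# Step 3: stop the cluster service (run on both nodes).
crm cluster stop

# Step 4: start the cluster service again (run on both nodes),
# then give Pacemaker time to run the one-shot probes.
crm cluster start
sleep 60

# Step 5: remove the cluster from maintenance mode and watch the primary DB.
crm configure property maintenance-mode=false
```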

Reason:
This is happening because, if you attempt to start cluster services on a node while the cluster or node is in maintenance mode, Pacemaker initiates a single one-shot monitor operation (a "probe") for every resource to evaluate which resources are currently running on that node. It takes no further action beyond determining the resources' status.

So after step 4, a probe is initiated for the SAPHana and SAPHanaTopology resources.
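For context, an OCF resource agent can tell a probe apart from a regular recurring monitor with the `ocf_is_probe` helper from ocf-shellfuncs. The sketch below only illustrates that mechanism; it is not the actual SAPHana/SAPHanaTopology code, and `check_local_state_only` / `full_health_check` are hypothetical stand-ins.

```bash
#!/bin/bash
# Illustration of probe detection in an OCF agent; not the SAPHanaSR code.
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

agent_monitor() {
    if ocf_is_probe; then
        # One-shot monitor issued at cluster/service start: Pacemaker only
        # wants to know whether the resource is running on this node.
        check_local_state_only
    else
        # Regular recurring monitor.
        full_health_check
    fi
}
```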

In SAPHanaTopology, when the monitor clone function identifies the operation as a probe, it only checks and sets the attribute for the HANA version; it does not check the current cluster state at all. Because of this, the "hana_<sid>_roles" and "master-rsc_SAPHana_HDB42" attributes are not set on the cluster primary.
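The missing attributes can be verified while the cluster is still in maintenance mode, for example with `crm_attribute`; `<primary-node>` and `<sid>` below are placeholders for this environment.

```bash
# Query the transient (reboot-lifetime) roles attribute on the primary node.
# <primary-node> and <sid> are placeholders for this environment.
crm_attribute --node <primary-node> --name hana_<sid>_roles --lifetime reboot --query

# SAPHanaSR also ships a helper that prints the hana_* attributes and the
# promotion scores for all nodes in one table.
SAPHanaSR-showAttr
```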

Also, in the SAPHana resource agent, the probe tries to read the roles attribute (which has not been set at that point) and sets the score to 5. Later, when the cluster is removed from maintenance mode, the resource agent checks the roles attribute and its score; because those values are not as expected, the agent tries to repair the cluster, and the DB stop/start happens.

Resolution:
To overcome this issue, if we add a check that identifies the status of the primary node and sets the "hana_<sid>_roles" attribute during the probe, then when the cluster is removed from maintenance mode it will not try to stop and start the DB or trigger a failover, because it will see an operational primary node.
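A rough sketch of the idea is shown below. This is not the actual patch: `saphana_probe_hook`, `sidadm`, and `detected_roles` are stand-ins for whatever the SAPHana agent really uses, and detecting the primary via `hdbnsutil -sr_state` is just one possible check for a running system replication primary.

```bash
# Sketch only: during a probe, record the roles attribute when a running
# system replication primary is found on the local node.
saphana_probe_hook() {
    local sidadm="<sid>adm"    # placeholder for the <sid>adm user
    local detected_roles       # placeholder; the real agent derives this
                               # from the landscapeHostConfiguration output

    if ocf_is_probe; then
        # Ask the local HANA instance whether it is the system replication primary.
        if su - "${sidadm}" -c "hdbnsutil -sr_state" | grep -q "mode: primary"; then
            detected_roles="4:P:master1:master:worker:master"   # example value only
            # Record the role so that leaving maintenance mode later does not
            # look like a missing primary and trigger a DB stop/start or failover.
            crm_attribute --node "$(crm_node -n)" \
                          --name "hana_<sid>_roles" \
                          --lifetime reboot \
                          --update "${detected_roles}"
        fi
    fi
}
```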

I have already modified the code and tested multiple scenarios; cluster functionality is not disturbed and the issue described above is resolved. I don't think these changes to the SAPHana resource agent will cause additional issues, because during the probe we set the attributes only if we identify the primary node. But I need your expertise to review this and decide whether the approach can be used, or to suggest an alternative fix for the issue described above.
