From 6d3d669bb9b3e4e7d79552ced851af8e3973e8a5 Mon Sep 17 00:00:00 2001 From: Meike Chabowski Date: Wed, 22 May 2024 22:07:39 +0200 Subject: [PATCH] Implemented edits after stylecheck Fixed typos, wording, punctuation, abbreviations, format after having performed stylecheck and spellcheck. --- ...P-convergent-mediation-ha-setup-sle15.adoc | 405 +++++++++--------- 1 file changed, 198 insertions(+), 207 deletions(-) diff --git a/adoc/SAP-convergent-mediation-ha-setup-sle15.adoc b/adoc/SAP-convergent-mediation-ha-setup-sle15.adoc index 61c14993..85a02fd4 100644 --- a/adoc/SAP-convergent-mediation-ha-setup-sle15.adoc +++ b/adoc/SAP-convergent-mediation-ha-setup-sle15.adoc @@ -116,7 +116,7 @@ here. - Other fencing is possible, but not explained here. -- Filesystem managed by the cluster - either on shared storage or NFS, not explained +- File system managed by the cluster - either on shared storage or NFS, not explained in detail here. - On-premises deployment on physical and virtual machines. @@ -127,10 +127,10 @@ details). === High availability for the {ConMed} ControlZone platform and UI The HA solution for CM ControlZone is a two node active/passive cluster. -A shared NFS filesystem is statically mounted by OS on both cluster nodes. This -filesystem holds work directories. Client-side write caching has to be disabled. +A shared NFS file system is statically mounted by OS on both cluster nodes. This +file system holds work directories. Client-side write caching needs to be disabled. The ControlZone software is installed into the central shared NFS, but is also -copied to both nodes´ local filesystems. The HA cluster uses the central directory +copied to both nodes´ local file systems. The HA cluster uses the central directory for starting/stopping the ControlZone services. However, for monitoring the local copies of the installation are used. @@ -138,17 +138,17 @@ The cluster can run monitor actions even when the NFS temporarily is blocked. Further, software upgrade is possible without downtime (rolling upgrade). // TODO PRIO2: Get rid of the central software. Use central NFS for work directory only. -.Two-node HA cluster and statically mounted filesystems +.Two-node HA cluster and statically mounted file systems image::sles4sap_cm_cluster.svg[scaledwidth=100.0%] The ControlZone services platform and UI are handled as active/passive resources. The related virtual IP adress is managed by the HA cluster as well. -A filesystem resource is configured for a bind-mount of the real NFS share. In -case of filesystem failures, the cluster takes action. However, no mount or umount +A file system resource is configured for a bind-mount of the real NFS share. In +case of file system failures, the cluster takes action. However, no mount or umount on the real NFS share is done. -All cluster resources are organised as one resource group. This results in -correct start/stop order as well as placement, while keeping the configuration +All cluster resources are organized as one resource group. This results in +a correct start/stop order and placement, while keeping the configuration simple. .ControlZone resource group @@ -174,9 +174,9 @@ and UI, together with related IP address. NOTE: Neither installation of the basic {sleha} cluster, nor installation of the CM ControlZone software is covered in the document at hand. 
-Please consult the {sleha} product documentation for installation instructions
+Consult the {sleha} product documentation for installation instructions
(https://documentation.suse.com/sle-ha/15-SP4/single-html/SLE-HA-administration/#part-install).
-For Convergent Mediation installation instructions, please refer to the respective
+For Convergent Mediation installation instructions, refer to the respective
product documentation
(https://infozone.atlassian.net/wiki/spaces/MD9/pages/4849683/Installation+Instructions).


@@ -184,26 +184,26 @@ product documentation
[[sec.prerequisites]]
=== Prerequisites

-For requirements of {ConMed} ControlZone, please refer to the product documentation
-(https://infozone.atlassian.net/wiki/spaces/MD9/pages/4849685/System+Requirements).
+For requirements of {ConMed} ControlZone, refer to the product documentation
+at https://infozone.atlassian.net/wiki/spaces/MD9/pages/4849685/System+Requirements.

-For requirements of {sles4sap} and {sleha}, please refer to the product documentation
-(https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/article-installation.html#sec-ha-inst-quick-req).
+For requirements of {sles4sap} and {sleha}, refer to the product documentation
+at https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/article-installation.html#sec-ha-inst-quick-req.

Specific requirements of the SUSE high availability solution for CM ControlZone
-are:
+are as follows:

- This solution is supported only in the context of {SAP} RISE.

- {ConMed} ControlZone version 9.0.1.1 or higher is installed and configured on
-both cluster nodes. If the software is installed into a shared NFS filesystem, the
-binaries are copied into both cluster nodes´ local filesystems. Finally the local
-configuration has to be adjusted. Please refer to {ConMed} documentation for details.
+both cluster nodes. If the software is installed into a shared NFS file system, the
+binaries are copied into both cluster nodes´ local file systems. Finally, the local
+configuration needs to be adjusted. Refer to the {ConMed} documentation for details.

- CM ControlZone is configured identically on both cluster nodes. User, path names
and environment settings are the same.

-- Only one ControlZone instance per Linux cluster. Thus one platform service and
+- There is only *one* ControlZone instance per Linux cluster. Accordingly, there is only one platform service and
one UI service per cluster.

- The platform and UI are installed into the same MZ_HOME.
@@ -211,56 +211,56 @@ one UI service per cluster.

- Linux shell of the mzadmin user is `/bin/bash`.

- The mzadmin´s `~/.bashrc` inherits MZ_HOME, JAVA_HOME and MZ_PLATFORM
-from SAPCMControlZone RA. This variables need to be set as described in the RA´s
-documentation, i.e. manual page ocf_suse_SAPCMControlZone(7).
+from SAPCMControlZone RA. These variables need to be set as described in the RA´s
+documentation, that is, the manual page ocf_suse_SAPCMControlZone(7).

-- When called by the resource agent, mzsh connnects to CM ControlZone services
-via network. The service´s virtual hostname or virtual IP address managed by the
+- When called by the resource agent, `mzsh` connects to CM ControlZone services
+via network. The service´s virtual host name or virtual IP address managed by the
cluster should not be used when called by RA monitor actions.

- Technical users and groups are defined locally in the Linux system. If users are
-resolved by remote service, local caching is neccessary. Substitute user (su) to
+resolved by remote service, local caching is necessary. `Substitute user` (su) to
the mzadmin needs to work reliably and without customized actions or messages.

-- Name resolution for hostnames and virtual hostnames is crucial. Hostnames of
+- Name resolution for host names and virtual host names is crucial. Host names of
cluster nodes and services are resolved locally in the Linux system.

-- Strict time synchronization between the cluster nodes, e.g. NTP. All nodes of a
+- Strict time synchronization between the cluster nodes, for example NTP, is required. All nodes of a
cluster have configured the same timezone.

-- Needed NFS shares (e.g. `/usr/sap/`) are mounted statically or by automounter.
-No client-side write caching. File locking might be configured for application
+- Needed NFS shares (for example `/usr/sap/`) are mounted statically or by automounter.
+No client-side write caching is allowed. File locking should be configured for application
needs.

-- The RA monitoring operations have to be active.
+- The RA monitoring operations need to be active.

-- RA runtime almost completely depends on call-outs to controlled resources, OS and
+- RA runtime almost completely depends on call-outs to controlled resources, operating system, and
Linux cluster. The infrastructure needs to allow these call-outs to return in time.

-- The ControlZone application is not started/stopped by OS. Thus there is no SystemV,
+- The ControlZone application is not started/stopped by the operating system. Thus, there is no SystemV,
systemd or cron job.

- As long as the ControlZone application is managed by the Linux cluster, the
-application is not started/stopped/moved from outside. Thus no manual actions are
+application is not started/stopped/moved from outside. Thus, no manual actions are
done. The Linux cluster does not prevent from administrative mistakes.
-However, if the Linux cluster detects the application running at both sites in
-parallel, it will stop both and restart one.
+However, if the Linux cluster detects the application is running on both sites in
+parallel, both are stopped and one of them is restarted.

-- Interface for the RA to the ControlZone services is the command mzsh. Ideally,
-the mzsh should be accessed on the cluster nodes´ local filesystems.
-The mzsh is called with the arguments startup, shutdown and status. Its return
-code and output is interpreted by the RA. Thus the command and its output needs
+- The interface for the RA to the ControlZone services is the command `mzsh`. Ideally,
+`mzsh` should be accessed on the cluster nodes´ local file systems.
+`mzsh` is called with the arguments `startup`, `shutdown` and `status`. Its return
+code and output are interpreted by the RA. Thus, the command and its output need
to be stable. The mzsh shall not be customized. Particularly environment
variables set thru `~/.bashrc` must not be changed.

-- The mzsh is called on the active node with a defined interval for regular resource
+- `mzsh` is called on the active node with a defined interval for regular resource
monitor operations. It also is called on the active or passive node in certain
situations. Those calls might run in parallel.

=== The setup procedure at a glance

For a better understanding and overview, the installation and setup is divided into
-nine nice steps.
+nine steps.

//
- Collecting information - <>
@@ -290,7 +290,7 @@ nine nice steps.
[[sec.information]]
=== Collecting information

-The installation should be planned properly. You should have all needed parameters
+The installation should be planned properly. You should have all required parameters
already in place. It is good practice to first fill out the parameter sheet.

[width="100%",cols="25%,35%,40%",options="header"]
|===
@@ -410,8 +410,8 @@ Check this on both nodes.

==== IP addresses and virtual names

Check if the file `/etc/hosts` contains at least the address resolution for
-both cluster nodes `{myNode1}`, `{myNode1}` as well as the ControlZone virtual
-hostname `sap{mySidLc}cz`. Add those entries if they are missing.
+both cluster nodes `{myNode1}`, `{myNode2}`, and the ControlZone virtual
+host name `sap{mySidLc}cz`. Add those entries if they are missing.

[subs="attributes"]
----
@@ -428,9 +428,9 @@ See also manual page hosts(8).
==== Mount points and NFS shares

Check if the file `/etc/fstab` contains the central NFS share MZ_HOME.
-The filesystem is statically mounted on all nodes of the cluster.
-The correct mount options are depending on the NFS server. However, client-side
-write caching has to be disabled in any case.
+The file system is statically mounted on all nodes of the cluster.
+The correct mount options depend on the NFS server. However, client-side
+write caching needs to be disabled in any case.

[subs="attributes"]
----
@@ -445,7 +445,7 @@ write caching has to be disabled in any case.
// TODO PRIO1: above output
Check this on both nodes.

-See also manual page mount(8), fstab(5) and nfs(5), as well as TID 20830, TID 19722.
+See also manual pages mount(8), fstab(5) and nfs(5), and TIDs 20830 and 19722.

==== Linux user and group number scheme
@@ -461,7 +461,7 @@ Check if the file `/etc/passwd` contains the mzadmin user `{mySapAdm}`.
Check this on both nodes.
See also manual page passwd(5).

-==== Password-free ssh login
+==== Password-free SSH login

// TODO PRIO2: content
@@ -476,7 +476,7 @@ See also manual page passwd(5).
Check this on both nodes.
See also manual page ssh(1) and ssh-keygen(1).

-==== Time synchronisation
+==== Time synchronization

// TODO PRIO2: content
@@ -503,7 +503,7 @@ See also manual page chronyc(1) and chrony.conf(5).

==== Watchdog

-Check if the watchdog module is loaded correctly.
+Check if the *watchdog* module is loaded correctly.

[subs="specialchars,attributes"]
----
@@ -523,15 +523,14 @@ sbd 686 root 4w CHR 10,130 0t0 410 /dev/watchdog
----

Check this on both nodes. Both nodes should use the same watchdog driver.
-Which dirver that is depends on your hardware or hypervisor.
-See also
-https://documentation.suse.com/sle-ha/15-SP4/single-html/SLE-HA-administration/#sec-ha-storage-protect-watchdog .
+Which driver is used depends on your hardware or hypervisor. For more information, see
+https://documentation.suse.com/sle-ha/15-SP4/single-html/SLE-HA-administration/#sec-ha-storage-protect-watchdog.

==== SBD device

It is a good practice to check if the SBD device can be accessed from both nodes
and contains valid records. Only one SBD device is used in this example. For
-production, always three devices should be used.
+production, three devices should always be used.

[subs="specialchars,attributes"]
----
@@ -566,11 +565,10 @@ Timeout (msgwait) : 120
Active: active (running) since Tue 2024-05-14 16:37:22 CEST; 13min ago
----

-Check this on both nodes.
-For more information on SBD configuration see
-https://documentation.suse.com/sle-ha/15-SP4/single-html/SLE-HA-administration/#cha-ha-storage-protect ,
-as well as TIDs 7016880 and 7008216. See also manual page sbd(8), stonith_sbd(7) and
-cs_show_sbd_devices(8).
+Check this on both nodes. For more information on SBD configuration, see:
+
+* https://documentation.suse.com/sle-ha/15-SP4/single-html/SLE-HA-administration/#cha-ha-storage-protect
+* TID 7016880 and TID 7008216
+* manual pages sbd(8), stonith_sbd(7), and cs_show_sbd_devices(8)

==== Corosync cluster communication

// TODO PRIO2: content

@@ -591,7 +589,7 @@ Check this on both nodes.
See appendix <> for a `corosync.conf` example.
See also manual page systemctl(1), corosync.conf(5) and corosync-cfgtool(1).

-==== systemd cluster services
+==== `systemd` cluster services

// TODO PRIO2: content

@@ -633,19 +631,18 @@ Check this on both nodes.
See also manual page crm_mon(8).

-
[[cha.cm-basic-check]]
== Checking the ControlZone setup

-The ControlZone needs to be tested without Linux cluster before integrating
+The ControlZone needs to be tested without the Linux cluster before integrating
both. Each test needs to be done on both nodes.

=== Checking ControlZone on central NFS share

-Check mzadmin´s environment variables MZ_HOME, JAVA_HOME, PATH and check the
+Check the mzadmin´s environment variables MZ_HOME, JAVA_HOME, PATH. Then check the
`mzsh startup/shutdown/status` functionality for MZ_HOME on central NFS.
-This is needed on both nodes. Before starting ControlZone services on one node,
-make very sure it is not running on the other node.
+This is needed on both nodes. Before starting the ControlZone services on one node,
+ensure they are not running on the other node.

[subs="specialchars,attributes"]
----
@@ -711,12 +708,12 @@ platform is not running
2
----

-Do the above on both nodes.
+Perform the above steps on both nodes.

=== Checking ControlZone on each node´s local disk

-Check mzadmin´s environment variables MZ_HOME, JAVA_HOME, PATH and check the
-`mzsh status` functionality for MZ_HOME on local disk.
+Check the mzadmin´s environment variables MZ_HOME, JAVA_HOME, PATH. Then check the
+`mzsh status` functionality for MZ_HOME on the local disk.
This is needed on both nodes.

[subs="specialchars,attributes"]
----
@@ -750,9 +747,8 @@ ui is running
0
----

-Do the above on both nodes. The ControlZone services should be running on either
-node, but not on both in parallel, of course.
-
+Perform the above steps on both nodes. The ControlZone services should be running on either
+node, but not on both in parallel.


[[cha.ha-cm]]
@@ -763,8 +759,8 @@

[[sec.ha-bashrc]]
=== Preparing mzadmin user ~/.bashrc file

-Certain values for environment variables JAVA_HOME, MZ_HOME and MZ_PLATFORM are
-needed. For cluster actions, the values are inherited from the RA thru related
+For the environment variables JAVA_HOME, MZ_HOME and MZ_PLATFORM,
+certain values are required. For cluster actions, the values are inherited from the RA through related
RA_... variables. For manual admin actions, the values are set as default.
This is needed on both nodes.

@@ -788,7 +784,7 @@ export JAVA_HOME=${RA_JAVA_HOME:-"{mzJavah}"}

See <> and manual page ocf_suse_SAPCMControlZone(7) for details.

[[sec.ha-filsystem-monitor]]
-=== Preparing the OS for NFS monitoring
+=== Preparing the operating system for NFS monitoring

// TODO PRIO2: content
This is needed on both nodes.
@@ -806,14 +802,14 @@ mount(8).

=== Adapting the cluster basic configuration

// TODO PRIO2: content
-All steps for loading configuration into the Cluster Information Base (CIB) need
-to be done only on one node.
+All steps to load the configuration into the Cluster Information Base (CIB) only need
+to be performed on one node.
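+
+Before loading any of the configuration snippets below, it can help to confirm that
+the cluster is idle and to review the currently active CIB. This is an optional
+pre-check, sketched here with commands that are also used elsewhere in this guide:
+
+[subs="specialchars,attributes"]
+----
+# cs_clusterstate -i
+# crm configure show
+----
+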
==== Adapting cluster bootstrap options and resource defaults

The first example defines the cluster bootstrap options, the resource and operation
-defaults. The stonith-timeout should be greater than 1.2 times the SBD on-disk msgwait
-timeout. The priority-fencing-delay should be at least 2 times the SBD CIB pcmk_delay_max.
+defaults. The STONITH timeout value should be greater than 1.2 times the SBD on-disk msgwait
+timeout value. The priority fencing delay value should be at least twice the SBD CIB pcmk_delay_max value.

[subs="specialchars,attributes"]
----
@@ -848,7 +844,7 @@ See also manual page crm(8), sbd(8) and SAPCMControlZone_basic_cluster(7).

==== Adapting SBD STONITH resource

-The next configuration part defines an disk-based SBD STONITH resource.
+The next configuration step defines a disk-based SBD STONITH resource.
Timing is adapted for priority fencing.

[subs="specialchars,attributes"]
----
@@ -866,7 +862,7 @@ Load the file to the cluster.
----
# crm configure load update crm-sbd.txt
----
-See also manual pages crm(8), sbd(8), stonith_sbd(7) and SAPCMControlZone_basic_cluster(7).
+See also manual pages crm(8), sbd(8), stonith_sbd(7), and SAPCMControlZone_basic_cluster(7).

[[sec.cm-ha-cib]]
=== Configuring ControlZone cluster resources
@@ -875,9 +871,9 @@ See also manual pages crm(8), sbd(8), stonith_sbd(7) and SAPCMControlZone_basic_

==== Virtual IP address resource

-Now an IP adress resource `rsc_ip_{mySid}` is configured.
-In case of IP address failure (or monitor timeout), the IP address resource gets
-restarted until it gains success or migration-threshold is reached.
+Next, configure an IP address resource `rsc_ip_{mySid}`.
+In the event of an IP address failure (or monitor timeout), the IP address resource is
+restarted until it is successful or the migration threshold is reached.

[subs="specialchars,attributes"]
----
@@ -898,19 +894,17 @@ Load the file to the cluster.
----
See also manual page crm(8) and ocf_heartbeat_IPAddr2(7).

-==== Filesystem resource (only monitoring)
+==== File system resource (only monitoring)

-A shared filesystem might be statically mounted by OS on both cluster nodes.
-This filesystem holds work directories. It must not be confused with the
-ControlZone application itself. Client-side write caching has to be disabled.
+A shared file system might be statically mounted by the operating system on both cluster nodes.
+This file system holds work directories. It must not be confused with the
+ControlZone application itself. Client-side write caching needs to be disabled.

-A Filesystem resource `rsc_fs_{mySid}` is configured for a bind-mount of the real
-NFS share.
-This resource is grouped with the ControlZone platform and IP address. In case
-of filesystem failures, the node gets fenced.
-No mount or umount on the real NFS share is done.
-Example for the real NFS share is `/usr/sap/{mySid}/.check` , example for the
-bind-mount is `/usr/sap/.check_{mySid}` . Both mount points have to be created
+A file system resource `rsc_fs_{mySid}` is configured for a bind-mount of the real
+NFS share. This resource is grouped with the ControlZone platform and IP address. In the event
+of a file system failure, the node gets fenced. No mount or umount on the real NFS share is done.
+An example for the real NFS share is `/usr/sap/{mySid}/.check` , an example for the
+bind-mount is `/usr/sap/.check_{mySid}` . Both mount points need to be created
before the cluster resource is activated (see the example at the end of this section).

[subs="specialchars,attributes"]
----
@@ -939,12 +933,12 @@
and nfs(5).
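+
+As noted above, the two mount points for the bind-mount need to exist on both nodes
+before the file system resource is activated. A minimal sketch, assuming the example
+paths from above are used and that both are directories:
+
+[subs="specialchars,attributes"]
+----
+# mkdir -p /usr/sap/{mySid}/.check
+# mkdir -p /usr/sap/.check_{mySid}
+----
+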
==== SAP Convergent Mediation ControlZone platform and UI resources -A ControlZone platform resource `rsc_cz_{mySid}` is configured, handled by OS user -`{mySapAdm}`. The local `{mzsh}` is used for monitoring, but for other actions +A ControlZone platform resource `rsc_cz_{mySid}` is configured, handled by the operating system user +`{mySapAdm}`. The local `{mzsh}` is used for monitoring, but for other actions, the central `/usr/sap/{mySid}/bin/mzsh` is used. -In case of ControlZone platform failure (or monitor timeout), the platform resource -gets restarted until it gains success or migration-threshold is reached. -If migration-threshold is reached, or if the node fails where the group is running, +In the event of a ControlZone platform failure (or monitor timeout), the platform resource +is restarted until it is successful or the migration threshold is reached. +If the migration threshold is reached, or if the node where the group is running fails, the group will be moved to the other node. A priority is configured for correct fencing in split-brain situations. @@ -972,12 +966,12 @@ Load the file to the cluster. # crm configure load update crm-cz.txt ---- -A ControlZone UI resource `rsc_ui_{mySid}` is configured, handled by OS user +A ControlZone UI resource `rsc_ui_{mySid}` is configured, handled by the operating system user `{mySapAdm}`. The local `{mzsh}` is used for monitoring, but for other actions the central `/usr/sap/{mySid}/bin/mzsh` is used. -In case of ControlZone UI failure (or monitor timeout), the UI resource gets -restarted until it gains success or migration-threshold is reached. -If migration-threshold is reached, or if the node fails where the group is running, +In the event of a ControlZone UI failure (or monitor timeout), the UI resource is +restarted until it is successful or the migration threshold is reached. +If the migration threshold is reached, or if the node where the group is running fails, the group will be moved to the other node. [subs="specialchars,attributes"] @@ -1004,7 +998,7 @@ Load the file to the cluster. # crm configure load update crm-ui.txt ---- -An overview on the RA SAPCMControlZone parameters are given below. +Find an overview on the RA SAPCMControlZone parameters below: [[tab.ra-params]] [width="100%",cols="30%,70%",options="header"] @@ -1027,8 +1021,8 @@ is used for all actions. In case two paths are given, the first one is used for monitor actions, the second one is used for start/stop actions. If two paths are given, the first needs to be on local disk, the second needs to be on the central NFS share with the original CM ControlZone installation. Two paths are separated -by a semi-colon (;). The mzsh contains settings that need to be consistent with -MZ_PLATFORM, MZ_HOME, JAVA_HOME. Please refer to Convergent Mediation product +by a semicolon (;). The mzsh contains settings that need to be consistent with +MZ_PLATFORM, MZ_HOME, JAVA_HOME. Refer to Convergent Mediation product documentation for details. Optional. Unique, string. Default value: "/opt/cm/bin/mzsh". @@ -1039,15 +1033,15 @@ actions. In case two paths are given, the first one is used for monitor actions, the second one is used for start/stop actions. If two paths are given, the first needs to be on local disk, the second needs to be on the central NFS share with the original CM ControlZone installation. See also JAVAHOME. Two paths are -separated by semi-colon (;). +separated by semicolon (;). Optional. Unique, string. Default value: "/opt/cm/". 
|MZPLATFORM |URL used by mzsh for connecting to CM ControlZone services.
Could be one or two URLs. If one URL is given, that URL is used for all actions.
In case two URLs are given, the first one is used for monitor and stop actions,
-the second one is used for start actions. Two URLs are separated by semi-colon
-(;). Should usually not be changed. The service´s virtual hostname or virtual IP
+the second one is used for start actions. Two URLs are separated by semicolon
+(;). Should usually not be changed. The service´s virtual host name or virtual IP
address managed by the cluster must never be used for RA monitor actions.
Optional. Unique, string. Default value: "http://localhost:9000".

@@ -1058,7 +1052,7 @@ actions.
In case two paths are given, the first one is used for monitor
actions, the second one is used for start/stop actions. If two paths are given,
the first needs to be on local disk, the second needs to be on the central NFS
share with the original CM ControlZone installation. See also MZHOME. Two paths are
-separated by semi-colon (;).
+separated by semicolon (;).

Optional. Unique, string. Default value: "/usr/lib64/jvm/jre-17-openjdk".
|===
@@ -1068,11 +1062,11 @@ See also manual page crm(8) and ocf_suse_SAPCMControlZone(7).

==== CM ControlZone resource group

ControlZone platform and UI resources `rsc_cz_{mySid}` and `rsc_ui_{mySid}` are grouped
-with filesystem `rsc_fs_{mySid}` and IP address resource `rsc_ip_{mySid}` into group
-`grp_cz_{mySid}`. The filesystem starts first, then platform, IP address starts before
-UI. The resource group might run on either node, but never in parallel.
-If the filesystem resource gets restarted, all resources of the group will restart as
-well. If the platform or IP adress resource gets restarted, the UI resource will
+with file system `rsc_fs_{mySid}` and IP address resource `rsc_ip_{mySid}` into group
+`grp_cz_{mySid}`. The file system starts first, then comes the platform. The IP address starts before
+the UI. The resource group might run on either node, but never in parallel.
+If the file system resource is restarted, all resources of the group will restart as
+well. If the platform or IP address resource is restarted, the UI resource will
restart as well.

[subs="specialchars,attributes"]
----
@@ -1134,7 +1128,7 @@ Full List of Resources:

Congratulations! The HA cluster is up and running, controlling the ControlZone
resources.

-Now it might be a good idea to make a backup of the cluster configuration.
+It is now advisable to create a backup of the cluster configuration.

[subs="specialchars,attributes,verbatim,quotes"]
----
@@ -1153,26 +1147,24 @@ See the appendix <> for a complete CIB example.
[[sec.testing]]
=== Testing the HA cluster

-As with any HA cluster, testing is crucial. Make sure that all test cases derived
-from customer expectations are conducted and passed. Otherwise the project is likely
+As with any HA cluster, testing is crucial. Ensure that all test cases derived
+from customer expectations are executed and passed. Otherwise, the project is likely
to fail in production.

- Set up a test cluster for testing configuration changes and administrative
procedures before applying them on the production cluster.

- Carefully define, perform, and document tests for all scenarios that should be
-covered, as well as all maintenance procedures.
+covered, and do the same for all maintenance procedures.

-- Test ControlZone features without Linux cluster before doing the overall
-cluster tests.
+- Before performing full cluster testing, test the ControlZone features without the Linux cluster.

-- Test basic Linux cluster features without ControlZone before doing the overall
-cluster tests.
+- Test basic Linux cluster features without ControlZone before performing full cluster testing.

-- Follow the overall best practices, see <>.
+- Follow general best practices, see <>.

-- Open an additional terminal window on an node that is expected to not get fenced.
-In that terminal, continously run `cs_show_cluster_actions` or alike.
+- Open an additional terminal window on a node that is expected to not be fenced.
+In that terminal, continuously run `cs_show_cluster_actions` or similar.
See manual page cs_show_cluster_actions(8) and SAPCMControlZone_maintenance_examples(7).

The following list shows common test cases for the CM ControlZone resources managed
@@ -1193,21 +1185,23 @@ by the HA cluster.
// Testing cluster reaction on network split-brain - <>

-This is not a complete list. Please define additional test cases according to your
+This is not a complete list. Define additional test cases according to your
needs. Some examples are listed in <>.
-And please do not forget to perform every test on each node.
+Do not forget to perform every test on each node.

-NOTE: Tests for the basic HA cluster as well as tests for the bare CM ControlZone
-components are not covered in this document. Please refer to the respective product
-documentation for this tests.
+NOTE: Tests for the basic HA cluster and tests for bare CM ControlZone
+components are not covered in this document. Information about these tests
+can be found in the relevant product documentation.
// TODO PRIO2: URLs to product docu for tests

-The test prerequisite, if not described differently, is always that both cluster
-nodes are booted and joined to the cluster. SBD and corosync are fine.
-NFS and local disks are fine. The ControlZone resources are all running.
-No failcounts or migration constraints are in the CIB. The cluster is idle, no
-actions are pending.
+Unless otherwise stated, the test prerequisites are as follows:
+
+* Both cluster nodes are booted and have joined the cluster.
+* SBD and corosync are fine.
+* NFS and local disks are fine.
+* The ControlZone resources are all running.
+* No failcounts or migration constraints are in the CIB.
+* The cluster is idle, no actions are pending.

[[sec.test-restart]]
==== Manually restarting ControlZone resources in-place
@@ -1262,8 +1256,8 @@ actions are pending.
----

.{testExpect}
-. The cluster stops all resources gracefully.
-. The filesystem stays mounted.
+. The cluster gracefully stops all resources.
+. The file system stays mounted.
. The cluster starts all resources.
. No resource failure happens.
==========
@@ -1309,8 +1303,8 @@ actions are pending.
----

.{testExpect}
-. The cluster stops all resources gracefully.
-. The filesystem stays mounted.
+. The cluster gracefully stops all resources.
+. The file system stays mounted.
. The cluster starts all resources on the other node.
. No resource failure happens.
==========
@@ -1332,7 +1326,7 @@ actions are pending.
# cs_wait_for_idle -s 5; crm_mon -1r
----
+
-. Manually kill ControlZone UI (on e.g. `{mynode1}`).
+. Manually kill ControlZone UI (on, for example, `{mynode1}`).
+

[subs="specialchars,attributes"]
----
@@ -1357,9 +1351,9 @@ actions are pending.
----

.{testExpect}
-. The cluster detects faileded resource.
-. The filesystem stays mounted.
-. The cluster re-starts UI on same node.
+. The cluster detects the failed resource.
+. The file system stays mounted.
+. The cluster restarts the UI on the same node.
. One resource failure happens.
==========

@@ -1380,7 +1374,7 @@ actions are pending.
# cs_wait_for_idle -s 5; crm_mon -1r
----
+
-. Manually kill ControlZone platform (on e.g. `{mynode1}`).
+. Manually kill ControlZone platform (on, for example, `{mynode1}`).
+
[subs="specialchars,attributes"]
----
@@ -1404,9 +1398,9 @@ actions are pending.
----

.{testExpect}
-. The cluster detects faileded resource.
-. The filesystem stays mounted.
-. The cluster re-starts resources on same node.
+. The cluster detects the failed resources.
+. The file system stays mounted.
+. The cluster restarts resources on the same node.
. One resource failure happens.
==========

@@ -1427,7 +1421,7 @@ actions are pending.
{mynode2}:~ # cs_wait_for_idle -s 5; crm_mon -1r
----
+
-. Manually kill cluster node, where resources are running (e.g. `{mynode1}`).
+. Manually kill the cluster node where resources are running (for example `{mynode1}`).
+
[subs="specialchars,attributes"]
----
@@ -1435,7 +1429,7 @@ actions are pending.
{mynode2}:~ # cs_wait_for_idle -s 5; crm_mon -1r
----
+
-. Re-join fenced node (e.g. `{mynode1}`) to cluster.
+. Rejoin the fenced node (for example `{mynode1}`) to the cluster.
+
[subs="specialchars,attributes"]
----
@@ -1452,10 +1446,10 @@ actions are pending.
----

.{testExpect}
-. The cluster detects failed node.
-. The cluster fences failed node.
+. The cluster detects a failed node.
+. The cluster fences the failed node.
. The cluster starts all resources on the other node.
-. The fenced node needs to be joined to the cluster.
+. The fenced node needs to be rejoined to the cluster.
. No resource failure happens.
==========

@@ -1476,7 +1470,7 @@ actions are pending.
{mynode2}:~ # cs_wait_for_idle -s 5; crm_mon -1r
----
+
-. Manually block port for NFS, where resources are running (e.g. `{mynode1}`).
+. Manually block the NFS port on the node where resources are running (for example `{mynode1}`).
+
[subs="specialchars,attributes"]
----
@@ -1485,7 +1479,7 @@ actions are pending.
{mynode2}:~ # cs_wait_for_idle -s 5; crm_mon -1r
----
+
-. Re-join fenced node (e.g. `{mynode1}`) to cluster.
+. Rejoin the fenced node (for example `{mynode1}`) to the cluster.
+
[subs="specialchars,attributes"]
----
@@ -1505,9 +1499,9 @@ actions are pending.

.{testExpect}
. The cluster detects failed NFS.
-. The cluster fences node.
+. The cluster fences the node.
. The cluster starts all resources on the other node.
-. The fenced node needs to be joined to the cluster.
+. The fenced node needs to be rejoined to the cluster.
. Resource failure happens.
==========

@@ -1538,7 +1532,7 @@ actions are pending.
{mynode2}:~ # cs_wait_for_idle -s 5; crm_mon -1r
----
+
-. Re-join fenced node (e.g. `{mynode1}`) to cluster.
+. Rejoin the fenced node (for example `{mynode1}`) to the cluster.
+
[subs="specialchars,attributes"]
----
@@ -1558,24 +1552,23 @@ actions are pending.

.{testExpect}
. The cluster detects failed corosync.
-. The cluster fences node.
+. The cluster fences the node.
. The cluster keeps all resources on the same node.
-. The fenced node needs to be joined to the cluster.
-. No resource failure.
+. The fenced node needs to be rejoined to the cluster.
+. No resource failure happens.
==========

[[sec.test-additional]]
=== Additional tests

-Please define additional test cases according to your needs. Some cases you might
-want to test are listed below.
+Define additional test cases according to your needs. Some suggested test cases
+are listed below.

- Remove virtual IP address.
-- Stop and re-start passive node.
-- Stop and parallel re-start of all cluster nodes.
+- Stop and restart the passive node.
+- Stop and restart all cluster nodes in parallel.
- Isolate the SBD.
-- Maintenance procedure with cluster continuously running, but application restart.
-- Maintenance procedure with cluster restart, but application running.
+- Maintenance procedure with the cluster continuously running, but with an application restart.
+- Maintenance procedure with a cluster restart, but with the application running.
- Kill the corosync process of one cluster node.

See also manual page crm(8) for cluster crash_test.

@@ -1584,38 +1577,38 @@

== Administration

-HA clusters are complex, the CM ControlZone is complex.
-Deploying and running HA clusters for CM ControlZonen needs preparation and
-carefulness. Fortunately, most pitfalls and lots of proven procedures are already
-known. This chapter outlines common administrative tasks.
+HA clusters are complex, and the CM ControlZone is also complex.
+Deploying and running HA clusters for CM ControlZone needs preparation, caution and
+care. Fortunately, most of the pitfalls and many best practices are already
+known. This chapter describes general administrative tasks.

[[sec.best-practice]]
=== Dos and don'ts

-Five basic rules will help to avoid known issues.
+The following five basic rules will help you avoid known issues:

- Carefully test all configuration changes and administrative procedures on the
-test cluster before applying them on the production cluster.
+test cluster before applying them to the production cluster.

-- Before doing anything, always check for the Linux cluster's idle status,
-left-over migration constraints, and resource failures as well as the
+- Before taking any action, always check the Linux cluster's idle status,
+remaining migration constraints, and resource failures, plus the
ControlZone status. See <>.

-- Be patient. For detecting the overall ControlZone status, the Linux cluster
-needs a certain amount of time, depending on the ControlZone services and the
-configured intervals and timeouts.
+- Be patient. The Linux cluster requires a certain amount of time to detect
+the overall status of the ControlZone, depending on the ControlZone services
+and the configured intervals and timeouts.

- As long as the ControlZone components are managed by the Linux cluster, they
-must never be started/stopped/moved from outside. Thus no manual actions are done.
+must never be started/stopped/moved from outside. This means that no manual
+actions are performed.

See also the manual page SAPCMControlZone_maintenance_examples(7),
-SAPCMControlZone_basic_cluster(7) and ocf_suse_SAPCMControlZone(7).
+SAPCMControlZone_basic_cluster(7), and ocf_suse_SAPCMControlZone(7).

[[sec.adm-show]]
=== Showing status of ControlZone resources and HA cluster

-This steps should be performed before doing anything with the cluster, and after
-something has been done.
+Perform the following steps each time before and after you do any work on the cluster.

[subs="specialchars,attributes"]
----
@@ -1626,23 +1619,23 @@ something has been done.
# cs_clusterstate -i
----
See also manual page SAPCMControlZone_maintenance_examples(7), crm_mon(8),
-cs_clusterstate(8), cs_show_cluster_actions(8).
+cs_clusterstate(8), and cs_show_cluster_actions(8).
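+
+In addition to the status commands above, it can be useful to check for left-over
+migration constraints and resource failcounts, as recommended in the dos and don'ts.
+A minimal sketch (`cli-prefer`/`cli-ban` are the constraint IDs created by manual
+resource moves or bans; adapt as needed):
+
+[subs="specialchars,attributes"]
+----
+# crm_mon -1rf
+# crm configure show | grep -e cli-prefer -e cli-ban
+----
+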
=== Watching ControlZone resources and HA cluster

-This can be done during tests and maintenance procedures, to see status changes
-almost in real-time.
+During testing and maintenance, you can run the following command to view
+near real-time status changes.

[subs="specialchars,attributes"]
----
# watch -s8 cs_show_cluster_actions
----
See also manual page SAPCMControlZone_maintenance_examples(7), crm_mon(8),
-cs_clusterstate(8), cs_show_cluster_actions(8).
+cs_clusterstate(8), and cs_show_cluster_actions(8).

=== Starting the ControlZone resources

-The cluster is used for starting the resources.
+Use the cluster for starting the resources.

[subs="specialchars,attributes"]
----
@@ -1654,7 +1647,7 @@ See also manual page SAPCMControlZone_maintenance_examples(7), crm(8).

=== Stopping the ControlZone resources

-The cluster is used for stopping the resources.
+Use the cluster for stopping the resources.

[subs="specialchars,attributes"]
----
@@ -1666,10 +1659,10 @@ See also manual page SAPCMControlZone_maintenance_examples(7), crm(8).

=== Migrating the ControlZone resources

-ControlZone application and Linux cluster are checked for clean and idle state.
-The ControlZone resources are moved to the other node. The related location rule
-is removed after the takeover took place. ControlZone application and HA cluster
-are checked for clean and idle state.
+ControlZone application and Linux cluster are checked for a clean and idle state.
+The ControlZone resources are moved to the other node.
+The associated location rule will be removed after the takeover took place. ControlZone
+application and HA cluster are then checked for a clean and idle state.

[subs="specialchars,attributes"]
----
@@ -1689,18 +1682,16 @@ are checked for clean and idle state.
----
See also manual page SAPCMControlZone_maintenance_examples(7).

-=== Example for generic maintenance procedure.
+=== Example for generic maintenance procedure

-Generic procedure, mainly for maintenance of the ControlZone components. The
-resources are temporarily taken out of cluster control. The Linux cluster remains
-running.
+Find below a generic procedure, mainly for maintenance of the ControlZone components.
+The resources are temporarily removed from cluster control. The Linux cluster remains active.

-ControlZone application and HA cluster are checked for clean and idle state.
-The ControlZone resource group is set into maintenance mode. This is needed to
-allow manual actions on the resources. After the manual actions are done, the
-resource group is put back under cluster control. It is neccessary to wait for
-each step to complete and to check the result. ControlZone application and HA
-cluster are finally checked for clean and idle state.
+ControlZone application and HA cluster are checked for a clean and idle state.
+The ControlZone resource group is set to maintenance mode. This is required to enable manual
+actions on the resources. After the manual actions are completed, the resource group is placed
+back under cluster control. It is necessary to wait for the completion of each step and to check the results.
+ControlZone application and HA cluster are finally checked for a clean and idle state.

[subs="specialchars,attributes"]
----
@@ -1723,7 +1714,7 @@ See also manual page SAPCMControlZone_maintenance_examples(7).

=== Showing resource agent log messages

-Failed RA actions on one node are shown from the current messages file.
+Failed RA actions on a node are displayed in the current messages file.
[subs="specialchars,attributes"] ---- @@ -1733,7 +1724,7 @@ See also manual page ocf_suse_SAPCMControlZone(7). === Cleaning up resource failcount -This might be done after the cluster has recovered the resource from a failure. +Cleaning up resource failcount can be done after the cluster has recovered the resource from a failure. [subs="specialchars,attributes"] ---- @@ -1790,8 +1781,8 @@ export JAVA_HOME=${RA_JAVA_HOME:-"{mzJavah}"} [[sec.appendix-crm]] === CRM configuration for a typical setup -Find below a typical CRM configuration for an CM ControlZone instance, -with a dummy filesystem, platform and UI services and related IP address. +Find below a typical CRM configuration for a CM ControlZone instance, +with a dummy file system, platform and UI services and related IP address. [subs="specialchars,attributes"] ----