Automating the Creation of SAP HANA and SAP S/4HANA Pacemaker Clusters
Start course
2h 10m

This course covers Ansible automation for SAP. We'll start off with introductions to both SAP and Ansible, and then we'll present the use cases of automation with Ansible that we have built for SAP. You'll then be guided through a demonstration of an end-to-end deployment of SAP HANA and SAP applications like NetWeaver and S/4HANA.

Learning Objectives

  • Learn the fundamentals of what SAP and Ansible are and how they work
  • Learn how to patch SAP landscapes
  • Understand SAP HANA and SAP Netweaver maintenance
  • Automate the deployment of SAP S/4HANA databases with Ansible Tower
  • Automate the creation of SAP HANA and SAP S/4HANA Pacemaker Clusters
  • Automate the migration of SAP workloads from SUSE Linux Enterprise Server to Red Hat Enterprise Linux
  • Learn how to carry out SAP Application Server Autoscaling

Intended Audience

This course is intended for anyone who wants to learn how Ansible automation can be used with their SAP workloads.


To get the most from this course, you should have basic knowledge of Ansible and SAP.


So, in this video we are going to showcase another use case of Ansible for SAP that is also very useful: building high availability environments, that is, high availability clusters for your SAP workloads. We can create these clusters with Pacemaker using the Red Hat High Availability Add-On, and we can create them at the database level and at the application level. At the database tier that means SAP HANA or any of the other databases supported by SAP, such as MaxDB, Oracle, SQL Server, and Db2, and at the application tier we can create a cluster for any NetWeaver-based application as well.

Let's have a look at the general architecture of what we can do in terms of clustering, and a typical solution design for these high availability scenarios for SAP.

So, we can see here that users connect from a mobile device, a laptop, a desktop, and so on, to the applications, and here we have the primary application server of the application tier of SAP NetWeaver or S/4HANA. We can have several application servers to distribute the workload, and then we have the core of the application tier: the ASCS instance, that is, the Central Services instance, and the ERS instance, the Enqueue Replication Server instance. Those two instances are potential single points of failure of an SAP installation. That's why we want to cluster them: so that they are highly available, and so that if there is any issue in one of them they can fail over to another node and remain resilient.

So, this is a typical deployment in a hyperscaler, for example, where we have different Availability Zones, different data centers. For the application tier we want one node in one Availability Zone or data center and another node in another data center, and on both nodes the Central Services instance and the Enqueue Replication Server can run. So, whenever there is an issue on one node, the services will fail over to the other node without any disruption. Then we have the database tier with two HANA databases. In this case HANA, but as we said it can be any other database supported by SAP, and in the same way we will have one HANA database in one data center or Availability Zone and the other in another one.

So that would be another cluster built with Pacemaker. In the case that we are going to show in the demo we are using HANA, so we are using HANA System Replication for that as well. We also have the fencing agent, which depends on the architecture the workloads are running on. For example, in a hyperscaler such as AWS, we will have a fencing agent that talks to the AWS API to fence a node. On VMware it is the same thing: the agent connects to the hypervisor and fences the node.

For those who are not familiar with the concept, fencing exists to avoid split-brain situations in the cluster. In a two-node cluster, if the nodes lose the connection between themselves but both are still working correctly, each might think it is still the primary node. Since it cannot reach the other node, each will assume there is a problem and that it needs to take over, and there can be problems if both access the database resources at the same time, because they can overwrite the records and cause a lot of mess. To avoid that, we have the fencing agent. What it does is this: when it detects that one node has lost connection to the other, it fences that node, that is, makes it unavailable. For example, say one node is not responding. The fencing agent will fence it, which normally means a reboot of the whole node. If it comes back and the connection is re-established, it is left like that; otherwise the agent can fence it and shut it down completely until it is recovered, because that will need manual intervention. That is the main concept of clustering in general, and with Pacemaker in particular.

Here we can also see the disaster recovery design, for which we are not using Pacemaker, because normally when we need to fail over to a disaster recovery site we do that manually, after having declared the disaster. So we will concentrate on the primary data centers and show how to build a Pacemaker cluster for a HANA database with Ansible. The first step will be to enable HANA System Replication between the two databases, and then we will proceed to create the cluster. As usual, all of this can be included in a single pipeline, a single workflow, and added to the end-to-end deployment of SAP HANA or S/4HANA environments. Just as we did in the other videos where we deployed a HANA database or an application, we can add this step to that workflow so that we don't need to run anything separately: a single run will turn the environment into a highly available one.

So, now let's go to Ansible Tower again. We already saw all the templates that we had, the projects, and so on. Since we want to show you first the HANA System Replication and then the Pacemaker cluster, let's just run the job templates separately, without adding them to any workflow. Okay!
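As an illustration of the fencing setup just described, a STONITH resource on AWS might be created with the `fence_aws` agent along these lines. The region, instance IDs, and resource name here are placeholders, not values from the demo:

```shell
# Sketch only: create an AWS fencing agent for the two cluster nodes.
# Region, instance IDs, and timeouts are illustrative placeholders.
pcs stonith create clusterfence fence_aws \
  region=eu-west-1 \
  pcmk_host_map="hana1:i-0123456789abcdef0;hana2:i-0fedcba9876543210" \
  power_timeout=240 pcmk_reboot_timeout=480 \
  op monitor interval=180s
```

On VMware, the same idea applies with an agent such as `fence_vmware_rest` pointed at the hypervisor instead of the cloud API.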
So, if we go first to the project associated with this HANA System Replication enablement or activation, we can see, as we saw already, where we're taking it from: the GitHub repository, the branch that we are using, and so on. And if we go back to the inventory, some of the variables that we saw there are relevant to this playbook. So, if we go to the SAP HANA host, for example hana1, we will see some of the variables that are used by this playbook.

Okay! The HANA System Replication settings: the SID that we are using for both installations, the instance number, and the passwords that we're going to use. Again, as we said, we can, and should, keep all the passwords encrypted with the Ansible feature for that, Ansible Vault.
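A host-variables file like the one being shown could look roughly like this. The variable names and values are illustrative, not the exact ones used by the playbook in the demo:

```yaml
# Illustrative host_vars for hana1; names and values are assumptions.
sap_hana_sid: RHE
sap_hana_instance_number: "00"
sap_hana_hsr_role: primary          # set to "secondary" on hana2
sap_hana_hsr_site_name: DC1
sap_hana_password: "{{ vault_sap_hana_password }}"   # kept in Ansible Vault
```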

Yeah! There are also the roles the hosts are going to perform in the system replication: whether a host is going to be the primary or a secondary. If we go to hana2, we will see that its role there is secondary in the system replication. Okay! Let's go back to the HSR job template and trigger it. As usual, we will see all the logs here of what is happening, and in the meantime, if we go to the servers, we can check that the system replication is not activated yet.

Okay! So, it tells us that the system doesn't have system replication enabled yet. That's what is being done at the moment.

Let's see how it's progressing. Okay! So, it's enabling the system replication on the primary node. Then it pauses for a bit to give it time to become active, and then it will register the secondary node with the primary. Once it's done, if we run this script again, we will see that the system replication has been configured. We will see its status and how the two servers are communicating. It has taken a backup of the primary database first, because that's one of the prerequisites for activating HANA System Replication. Okay!
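Under the hood, enabling replication on the primary corresponds to steps like the following, run as the `<sid>adm` OS user. The instance number 00 and the backup file name are assumptions:

```shell
# On hana1 (primary), as the <sid>adm user.
# An initial data backup is a prerequisite for enabling system replication.
hdbsql -i 00 -u SYSTEM -d SYSTEMDB "BACKUP DATA USING FILE ('initialbackup')"

# Enable system replication and give this site the logical name DC1.
hdbnsutil -sr_enable --name=DC1
```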

Now it's stopping the HANA database on the secondary so that we can register it with the primary.

As we can see, it says skipping on hana1, because this is a step that needs to be done only on the secondary, and vice versa: the tasks that need to be done only on the primary server, hana1, are being skipped on hana2.

Okay! So, it must be stopping the database on the secondary, and once that's done it will register it.
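The registration step the playbook performs on the secondary is roughly equivalent to these manual commands. The instance number, replication mode, and site name are assumptions based on the demo:

```shell
# On hana2 (secondary), as the <sid>adm user.
HDB stop                                  # stop HANA before registering

# Register this instance against the primary as site DC2.
hdbnsutil -sr_register --remoteHost=hana1 --remoteInstance=00 \
  --replicationMode=sync --operationMode=logreplay --name=DC2

HDB start                                 # start HANA as a replication target
```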

It's been pretty quick: it has registered the secondary with the primary and is now giving it time to sync.

Actually, let's check what the status says on the primary node. Yeah! So, now we can see that the system replication has been activated, and we can see here that we are the primary: hana1 is the primary server of the replication. Its name in the replication is DC1, the name we have given it, and here we can see all the information, that is, all the databases inside the HANA installation that are being replicated: the SYSTEMDB, the one for metadata that comes with an installation of HANA, and then the actual database that we have created, RHE. All the processes are being replicated onto the secondary: the nameserver, the indexserver, the xsengine, everything. The status is still UNKNOWN because it's still registering and syncing with the secondary, but we can see that the peer is hana2, and the ports being used for the replication of each one of the services. Let's go back to Tower. The playbook and the task have finished, so now we should be able to see the status.

Okay! So, now we can see the status is ACTIVE. If we log on to the secondary server we can also check from that side.

Okay! First we change to root, then to the HANA admin user, and we run the same command.

Okay, and here we can see that we are the secondary, in mode SYNC, which is the synchronization mode of the system replication. There are different modes, for example sync, syncmem, and async for asynchronous. This is the name of the site in the replication, so this is the logical name, and here is who the primary is: hana1. So, the system replication is enabled now, and the next thing we do is create the Pacemaker cluster, so let's go back to Tower.
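The status checks shown on both servers correspond to commands along these lines, run as the `<sid>adm` user:

```shell
# Show this node's replication role, site name, and replication mode.
hdbnsutil -sr_state

# Detailed per-service replication status (run on the primary).
HDBSettings.sh systemReplicationStatus.py
```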

We go back to the Inventories. Because the variables for the deployment of the cluster are going to be common to both HANA servers, we've put them here.

So, here we have all the variables for Pacemaker: the virtual IP with which we want to deploy the cluster, the actual IPs of the servers, their fully qualified domain names, and so on. We have everything here ready, so let's just run the template: HANA high availability Pacemaker. Let's see how it goes.
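The shared cluster variables being described might be structured something like this. The variable names, addresses, and domain are placeholders, not the values from the demo:

```yaml
# Illustrative group_vars shared by both HANA nodes; all values are assumptions.
sap_hana_cluster_vip: 192.168.1.100
sap_hana_cluster_nodes:
  - name: hana1
    ip: 192.168.1.11
    fqdn: hana1.example.com
  - name: hana2
    ip: 192.168.1.12
    fqdn: hana2.example.com
```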

It shouldn't take too long, because deploying the cluster is quite fast. Just to say that all these playbooks and roles that we are using are open source, open community, so you can find them on GitHub and on Galaxy. Galaxy is the official source of playbooks and roles for Ansible. Anyone can create their playbooks and roles and try to upload them to Galaxy; for that they need to go through testing and scoring by users, and once the roles have achieved a certain score they are deemed valid and official and moved to Ansible Galaxy, the central repository of roles for Ansible.

Okay! Now it's creating the cluster: the resources, the virtual IP, and the resources for the HANA database. It's adding the constraints needed for the correct operation of the cluster. Now let's go back to it, because it's finished, and let's try to run a Pacemaker command. We can see the status, and we have the cluster created. We can see both nodes, hana1 and hana2, are part of the cluster, and those are all the resources: the topology resource for the database, which tells the cluster the status at every moment, whether it's healthy or not, what the resources are doing, what the processes are doing; then the resource for the database itself; and then the virtual IP that the application uses to connect to the database. So, in case of any failure on the node that is currently the primary, the virtual IP and all the resources will be moved to the secondary node. We can also see the constraints that have been created.

So, those two are the constraints needed for the correct behavior of a Pacemaker cluster for SAP HANA. We have an order constraint that says that the topology resource of the database, all the status and health information we have mentioned, needs to be started before the database resource itself is started, and then a colocation constraint that means the primary of the database needs to be on the same node as the virtual IP. So, every time there's a failover of the database resource, the virtual IP will follow it to the node it has failed over to. That way we make sure the application will be connecting to the right IP.
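Created manually with pcs, the two constraints would look roughly like this. The resource names follow the common SAPHana naming scheme (SID RHE, instance 00) but are assumptions here:

```shell
# Order constraint: start the topology clone before the HANA database resource.
pcs constraint order SAPHanaTopology_RHE_00-clone then SAPHana_RHE_00-clone \
  symmetrical=false

# Colocation constraint: keep the virtual IP with the promoted (primary) HANA.
pcs constraint colocation add vip_RHE_00 with master SAPHana_RHE_00-clone 2000
```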

So, imagine that only the HANA database moved to the other node but the VIP, the virtual IP, stayed on this node: the application would try to talk to this node, and there would be no data here. That's why we need that constraint. If we go to the other node we will see the same thing, because when you run commands in a Pacemaker cluster the result is the same on either node. So, let's just check it. We need to run all the Pacemaker commands as root, and we will see the same output here. Okay! Now we are going to try to fail over the resources manually and see that it actually works. Okay! Let's move, for example, the VIP; since we've seen that the VIP needs to be on the same node as the master of the database, the database will follow. So, let's just fail it over.
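The manual failover being triggered corresponds to a pcs command along these lines; the VIP resource name is an assumption:

```shell
# Move the virtual IP to hana2; the colocated HANA primary will follow it.
pcs resource move vip_RHE_00 hana2

# Later, remove the location constraint that "move" leaves behind.
pcs resource clear vip_RHE_00
```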

Okay! It's moving it. If we now take a look at the status again, well, here we could see that the VIP was on hana1, as well as the HANA primary resource, the database primary resource. Let's look at it now. So, now we can see that the VIP has failed over to hana2, right, and the database is being moved, failed over; that's why it says promoting.

So, hana2 is being promoted to primary. Okay! Let's check again.

It's still promoting; actually, we can just watch the status and see how it changes.

So now the database is already accessible; it's just taking over the system replication as primary. But the application has not lost connection, because as soon as the virtual IP fails over, the connection is re-established. As we said before when we were talking about the near-zero-downtime solution, the application side of the SAP installation has a connection suspension feature, so this failover is transparent to the application and to the user. And now we have the master promoted on the former secondary, hana2. So that's how easily we can create a cluster for SAP HANA with Pacemaker and Ansible, and, as I said, we can create one big pipeline: the provisioning, the installation of SAP HANA, the installation of S/4HANA, the enablement or activation of HANA System Replication, and finally the creation of the cluster. So, with just one click, we can get all of this done. I hope this has been very helpful, mainly for the SAP Basis administrators like me, because, as I said, this saves a lot of time and, apart from that, prevents lots of misconfigurations. If you have experience with Pacemaker or clustering in general, you know that there are a lot of parameters that can change slightly from one installation to another. The best way to make sure this doesn't happen is to have a template for it and repeat it, so the clusters will be created with exactly the same characteristics and you won't find any issues. Okay! Here we can see, for example, that if there has been any kind of slight disconnection of the monitor agent, it will show up under failed resource actions, but that's just a temporary disconnection of the monitor.
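Chained into one pipeline, the whole flow described above could be sketched as a single top-level playbook. The host group and role names here are illustrative, not the exact roles used in the demo:

```yaml
# Illustrative end-to-end play; host group and role names are assumptions.
- name: End-to-end highly available SAP HANA deployment
  hosts: hanas
  become: true
  roles:
    - sap_hana_install        # provision and install the HANA database
    - sap_hana_hsr            # enable HANA System Replication
    - sap_hana_ha_pacemaker   # create the Pacemaker cluster on top
```

In Ansible Tower, the same chaining is typically done with a workflow job template rather than a single playbook, but the effect is the same: one trigger, a fully highly available environment.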

When this is resolved, the entry will be cleaned up, or you can clean it up manually, but normally you keep a list of failed resource actions, even for issues that have already been corrected, so that administrators can see what went wrong and what kind of conditions there have been.
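Clearing such entries manually is done with `pcs resource cleanup`; the resource name here is an assumption:

```shell
# Remove stale failed-action entries for a resource
# (omit the resource name to clean up all resources).
pcs resource cleanup SAPHana_RHE_00-clone
```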

About the Author
Learning Paths

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, Kubernetes (CKA, CKAD, CKS).