Workload Automation Foundations: High Availability
Keep critical jobs running reliably even when your environment lets you down.
High availability has one goal: keep jobs executing reliably even when your infrastructure fails. Whether the failure is brief, like a network outage, or prolonged, like a server failure, you need to know that your jobs won’t just get dropped. You need automation that is resilient.
Join our technical team as they take you through a 20-minute session focused on high availability in workload automation.
We’ll cover:
- Active/Passive vs. Active/Active
- HA vs. DR
- HA Architecture On-Prem and Cloud
- Value of Agents
- Logging during failure events
This webinar is the second in a series on Workload Automation Foundations. Be sure to look for the rest of our WLA Foundations webinar series, coming soon.
Thank you, everyone, for joining us for today’s webinar, Workload Automation Foundations: High Availability. My name is David Kluskiewicz, and I’m joined by Rob Newman, a senior member of our technical support team here in the Workload Automation business at HelpSystems.
Today, we’re going to look at high availability, or HA for short. This is the aspect of automation most of us don’t want to deal with, but one that we should all be prepared for. Enterprise workload automation solutions, as we’ll demonstrate today with JAMS, enable you to define a very predictable set of actions should your infrastructure be impacted by a critical event, like a disk failure, a network outage, a power failure, or maybe physical damage.
We’ll start with a quick overview of the different types of high availability. Then, we’ll look specifically at how you can prepare an automation environment to be ready for an event. And finally, we’ll walk you through an outage scenario, examine what could happen to your jobs, and show how an enterprise solution can mitigate any problems. There’s a chat window over on the right-hand side of your screen, and if you think of a question at any time during the webinar, you don’t have to wait until the end, just post it there. We’ll be monitoring that chat for questions as we go. And if we don’t get to your question right away, don’t worry, we’ll take some time at the end of our webinar to answer as many of those as we can. With that, I’ll hand it over to Rob.
Thanks, David. So, high availability is a safety net. In workload automation, high availability means you have a secondary scheduler on standby should an event compromise your primary scheduler. In a highly available configuration, the secondary scheduler sits in a passive mode, always ready to go when your active node goes down. The secondary scheduler maintains a heartbeat connection with the primary scheduler to monitor its health. By health, we mean that the passive node can communicate with the JAMS scheduler on the active node.
Three missed heartbeats constitute a failover event. By default, JAMS evaluates the primary scheduler’s health with this heartbeat every 60 seconds, and for most businesses that is sufficient, but you can adjust the heartbeat frequency to trigger a failover much more quickly if you need to. And with high availability in JAMS, all jobs will continue to run successfully to completion. The active/passive model is easy to configure and requires less overhead than other failover models. In active/passive configurations, the primary scheduler manages 100% of the jobs, while the secondary scheduler waits on standby. Active/active is the other type of failover model. It’s worth mentioning because it can be useful for businesses that need to scale out very quickly; however, it can be difficult to configure.
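To make the heartbeat model concrete, here is a minimal conceptual sketch in PowerShell. This is not JAMS source code or its actual health check; the host name and the use of Test-Connection as the health probe are assumptions for illustration. It simply shows the logic described above: poll the active node on a fixed interval and declare a failover event after three consecutive misses.

```powershell
# Conceptual sketch only -- not JAMS source code. It illustrates the
# active/passive heartbeat model: the passive node checks the active
# node on a fixed interval and declares a failover event after three
# consecutive missed heartbeats.

$heartbeatIntervalSeconds = 60   # default check frequency (adjustable)
$missedThreshold          = 3    # misses that constitute a failover event
$missed                   = 0

while ($true) {
    # Test-Connection stands in for the real scheduler health check,
    # and 'primary-scheduler' is a placeholder host name.
    $healthy = Test-Connection -ComputerName 'primary-scheduler' -Count 1 -Quiet

    if ($healthy) {
        $missed = 0                       # a healthy beat resets the counter
    }
    else {
        $missed++
        Write-Warning "Missed heartbeat $missed of $missedThreshold"
        if ($missed -ge $missedThreshold) {
            Write-Warning 'Failover event: secondary scheduler takes over.'
            break
        }
    }

    Start-Sleep -Seconds $heartbeatIntervalSeconds
}
```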
High availability is your first line of defense. It provides redundancy and eliminates that single point of failure. High availability is duplicate schedulers that act as one, whereas disaster recovery is a process that brings up a whole new instance, or infrastructure, one that mirrors your entire production infrastructure: its own schedulers, its own database, its own clients, et cetera. It even has its own high availability scheduler.
Disaster recovery, as the name implies, provides processing in the event of a disaster, like your whole data center going offline or an entire zone in your cloud going down. It replicates your entire environment and provides redundancy. It’s akin to an insurance policy: you probably won’t need it, but if you do, you want your environment made whole as quickly as possible. So, disaster recovery is an offsite, alternate location that is not affected when disaster hits your primary location. It is often cold, meaning not actively running, yet it is a mirror, snapshot, or restore of your production infrastructure, just waiting until that disaster happens and recovery is needed. Now, how closely your disaster recovery environment matches your production environment at any given time is completely dependent upon your business needs. For some, a nightly snapshot will suffice. For others, a hot backup created every minute is necessary. And others may actually require real-time mirroring.
Now, high availability is generally contained within a single site. That site could be a physical data center located on premises, a site leased from a provider, or even an Azure, AWS, or Google Cloud space. For example, your primary and secondary JAMS servers could both be located in a Connecticut data center. Your disaster recovery infrastructure, however, would be physically located at a separate site, say, a remote data center or a different zone within your cloud space. So, with both your primary and secondary JAMS servers located in Connecticut, you would want your disaster recovery infrastructure to reside someplace else, say, Nevada. Should a disaster happen in Connecticut, your disaster recovery infrastructure in Nevada would be activated. And, as a best practice, we recommend that you have both high availability and disaster recovery to cover all possible scenarios.
So, looking more broadly, don’t neglect the importance of the agents in your workload automation. Running your jobs on agents in a highly available configuration prevents job loss. In fact, if you’re not running all your jobs on agents, then you’re not truly highly available. Should the scheduler go down, the jobs executing on those agents will continue to run, so running your jobs on agents provides zero downtime for those jobs and allows them to run successfully to completion. So, let’s say the primary crashes due to a hardware failure. When the secondary scheduler takes over scheduling, it picks up the monitoring of those jobs, the capturing of the logs and statistics, and the completion results for all of the running jobs across all of the agents. And any jobs that were pending submission during a failover event will be submitted by the secondary scheduler, so no job loss will occur.
So, let’s take a brief look at a typical high availability event, so you know what to expect. The services on your primary scheduler are running as expected. Now, say something simple happens, like an intermittent network blip that a retry on the job can handle. In that case, the secondary scheduler just stands by. But what about a disk failure on the primary scheduler that interrupts scheduling? If the issue persists through those three heartbeat failures, then the secondary scheduler takes over. So, what does the secondary scheduler do? Well, it first queries the database for all of the executing jobs, then it queries the agents for the status of each of those jobs. And, within seconds, it has a complete picture of all the jobs as they were at the time of the failure, as well as any jobs that need to be submitted immediately, or even in the near future.
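As a rough outline of that takeover sequence, here is a hedged PowerShell sketch. Every function name in it is a hypothetical placeholder, not a real JAMS cmdlet; the point is only to show the order of operations just described.

```powershell
# Outline of the takeover sequence only -- every function below is a
# hypothetical placeholder, not a real JAMS cmdlet.

# 1. Query the shared database for jobs that were executing at failure time.
$executingJobs = Get-ExecutingJobsFromDatabase                         # hypothetical

# 2. Query each agent for the live status of the jobs it is running.
foreach ($job in $executingJobs) {
    $job.Status = Get-AgentJobStatus -Agent $job.Agent -JobId $job.Id  # hypothetical
}

# 3. Submit any jobs that came due during the failover window.
Get-PendingJobs | ForEach-Object { Submit-PendingJob $_ }              # hypothetical
```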
Now, here’s where I want to emphasize the use of agents to execute jobs. If you were to run jobs directly on the primary scheduler, which in this case had the failure, then you would lose those jobs. Jobs running on agents, by contrast, are not affected by a scheduler crash.
Okay, so what are the key takeaways from today’s presentation? There are two methods of configuring high availability: active/passive and active/active. Active/passive is relatively simple to set up and will protect your automated jobs from most issues. Active/active is another method, but it requires complex configuration of both the scheduling infrastructure and the machines and applications on which the jobs execute.
Disaster recovery differs from high availability in that it creates and syncs not only a redundant scheduler, but every element of automation: another database, another set of clients, another set of agents, et cetera. High availability is generally implemented within the same site. And agents are a key component of high availability, as they ensure that individual jobs can be executed and monitored by either the primary or the secondary scheduler. With centralized workload automation, the handling of failover events is predictable. Jobs can continue to execute successfully, and you can return control to the primary scheduler as soon as you’re confident that the issue has been resolved.
That’s everything I have from an HA/DR perspective. Dave, do we have any questions?
Yeah, that was really clear. Thanks, Rob. Just a couple here. So, when the failure event is over, how does one switch back to the primary scheduler?
Oh, that’s a great question. So, we have some PowerShell cmdlets that will allow you to fail over or fail back, depending upon the situation. And it is a manual process, and we do that on purpose, because you never really know how quickly you’re going to get the primary back up. It may take a couple of reboots, it may take installing a new piece of hardware. It may take some time, so it’s manual to allow you to be confident that the primary scheduler is fixed and up and running, and then you can fail back when the time is appropriate.
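As a sketch of what that manual process might look like from a PowerShell prompt: the JAMS PowerShell module is real, but Invoke-SchedulerFailover below is a placeholder name, not a documented JAMS cmdlet, so consult the cmdlet reference for your JAMS version for the actual failover and failback commands.

```powershell
# The JAMS PowerShell module is real, but 'Invoke-SchedulerFailover'
# is a placeholder -- check your JAMS version's cmdlet reference for
# the actual failover/failback cmdlets.

Import-Module JAMS

# Fail over to the secondary while the primary is being repaired...
Invoke-SchedulerFailover -Target 'secondary-scheduler'   # hypothetical

# ...repair, reboot, or replace hardware on the primary, verify it...

# ...then fail back once you're confident the primary is healthy.
Invoke-SchedulerFailover -Target 'primary-scheduler'     # hypothetical
```

The deliberate design choice here is that nothing fails back automatically: you run the fail-back only after you have verified the primary yourself.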
Got you, okay. Let’s move on to a different question here. One person asked, “If I have an emergency patch to apply to my primary server, and I know it’s going to cause a failover event, can I fail over manually to use the secondary scheduler?”
Oh, absolutely. That’s a perfect time to use those PowerShell cmdlets to manually fail over to your secondary. Then, you can patch your primary server and use those same cmdlets to fail back to the primary scheduler. And at that point, if you need to patch your secondary scheduler as well, go right ahead, yes.
Got you, okay. Another question: “What happens to the job logs for the jobs that execute on the secondary scheduler? Do I need to collect those if I’m going to do an audit and need a consolidated view of all the jobs, regardless of which scheduler they were run on?”
Oh no. JAMS has access to all the logs, whether they’re from the primary or the secondary scheduler. We have a concept called the common log location, which will store your log on a remote server at a UNC path, the DFS link, a NAS located.
Rob, would you just repeat that? Your audio broke up a little bit there.
Oh, sure, sorry about that. Absolutely. JAMS has complete access to all the logs from both the primary and the secondary scheduler. The common log location could be a NAS location, a DFS link, or a UNC path where those logs are stored, so that no matter which scheduler needs access to them, both always have access to those log files.
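To illustrate the idea of a common log location, here is a hedged snippet. The cmdlet and setting name are placeholders, not the real JAMS configuration interface; the concept is simply that both schedulers point at the same shared path.

```powershell
# Illustration only: the cmdlet and setting name below are placeholders.
# In practice you'd set the common log location through the JAMS
# configuration for your version.

$commonLogPath = '\\fileserver\jams\logs'   # UNC path; a DFS link or NAS share also works

# Both primary and secondary schedulers reference the same location, so
# either node can read any job's log regardless of where the job ran.
Set-SchedulerConfig -Name 'CommonLogLocation' -Value $commonLogPath   # hypothetical
```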
Great. Okay, that’s all we have for questions. Thanks very much, Rob. And thank you, everybody, for attending. A recording of this webinar will be sent out afterwards, and if you think of any other questions, feel free to send them to us. We’ll see you next time. Thanks a lot.