Workload Automation Foundations: Alerts & Notification
Never be in the dark again! Get (or distribute) clear, concise notifications about your jobs. In this on-demand webinar we will show you how to configure alerts and notifications that are useful to the people that receive them. Notify on failure, success, waiting on a dependency, or any number of conditions that could occur during a job’s execution.
Join our technical team as they take you through a 20-minute session focused on alerting.
We’ll Cover:
- Job conditions that can trigger alerts
- Adding dynamic content to alerts
- Sending alerts via email, phone, and text
- Sending job alerts to ChatOps, monitoring products, and service management apps
All right, thank you everyone for joining us for today’s webinar, Workload Automation Foundations Alerts & Notifications. My name is David Kluskiewicz and I’m joined by Rob Newman, a senior member of our technical support team here in the workload automation business at Fortra.
Today, we’re going to look at alerts and notifications. Once you’ve configured your jobs in a scheduler, you’ll want to have alerts in place so that you can respond quickly to issues that might impact your SLAs. When it comes to job scheduling, fire drills are terribly inefficient. Then again, so is an inbox full of miscellaneous alerts that you know you’re going to have to sift through and prioritize in order to ensure that those jobs eventually run successfully.
So today we’ll look at some best practices for setting up your job alerts, and hopefully give you some ideas for making the alerts that you do receive relevant, manageable, and useful, so you can respond and get back to higher priority projects. There’s a chat window over on the right-hand side of your screen. So at any time during the webinar, if you think of a question, don’t wait till the end, post it there. I’ll be monitoring that. And if we don’t get to your question right away, don’t worry, we’ll take some time at the end of the webinar to answer as many of those as we can. With that, I’ll hand it over to Rob.
Thanks, David. So today you may be generating alerts from one of those native tools for scheduling, and probably facing some challenges. Those tools include things like Task Scheduler, SQL Agent, and Cron. With Task Scheduler, aside from only being able to detect success or failure, you’re on your own for adding an alert for a job. You essentially have to write scripts to send alerts via email, to System Center Operations Manager (SCOM), or to JIRA or any other ticketing system. And again, with things like SQL Agent, the only conditions it can detect are success and failure. It can’t use regex, those regular expression patterns, to detect more nuanced events. Now, SQL Agent does have a mail utility, but do you really want the high-cost, often critical database handling email, and using up your precious SQL resources? I think not.
And then there’s Cron. Cron is very bare bones, and requires both complex syntax to define the event and a separate script to generate an alert. It’s not a fan favorite by any means. What you really want is a no-code system for job alerts with native built-in alerting capabilities that will save you a lot of time in the long run. There are many events in your environment, like job failures and job successes. There are also a lot of events that, even though they aren’t officially a failure, you want to be alerted to before they become a problem. These events include when a job stalls out, or takes way too long to run. With proper alerting you don’t have to sit around waiting for a failure notice. With proper alerting you can instead act before issues become bigger problems, before the customer even notices.
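To make the regex point concrete, here is a minimal Python sketch of scanning a job log for patterns that an exit code alone would miss. The pattern names, the patterns themselves, and the `scan_log` helper are illustrative assumptions, not part of any scheduler's API.

```python
import re

# Hypothetical patterns; a real scheduler would let you configure these per job.
ALERT_PATTERNS = {
    "deadlock": re.compile(r"deadlock", re.IGNORECASE),
    "disk_space": re.compile(r"no space left on device", re.IGNORECASE),
    "partial_load": re.compile(r"loaded (\d+) of (\d+) rows"),
}


def scan_log(log_text):
    """Return the sorted names of all patterns that match anywhere in the log."""
    return sorted(
        name for name, pattern in ALERT_PATTERNS.items() if pattern.search(log_text)
    )


# The job exited with code 0, yet the log still contains something worth an alert.
sample_log = "Step 3 OK\nloaded 950 of 1000 rows\nexit code 0\n"
print(scan_log(sample_log))
```

A job like this one would count as a success to a tool that only checks the exit code, even though the log shows an incomplete load.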
There’s quite a number of events that could happen on your system that you need to be aware of. Jobs sometimes run 200% longer than they normally do, or they finish successfully way too quickly. And we need to know about these things. A job scheduler can alert you to all of these conditions. Success and failure don’t always come with a nice, neat bow on them, nor do they always fit a specific model. A centralized scheduler has the functionality to detect more than just success and failure, and can fire an alert for many events. Most workload automation tools will let you set any level of escalating alerts. So if the first group doesn’t respond, it gets sent to the second group, then the third group and so on. This helps you ensure your alerts don’t just get dumped in an email box and ignored.
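The runtime conditions described above (running far longer than normal, or finishing suspiciously fast) can be sketched as a comparison against a job's average duration. This is a rough Python illustration only. The function name and both threshold values are assumptions chosen for the example, not settings taken from any product.

```python
def classify_runtime(elapsed_seconds, average_seconds, finished,
                     runaway_factor=2.0, too_fast_factor=0.1):
    """Classify a job's runtime against its historical average.

    runaway_factor and too_fast_factor are illustrative thresholds.
    """
    if average_seconds <= 0:
        return "unknown"  # no usable history to compare against
    ratio = elapsed_seconds / average_seconds
    if ratio >= runaway_factor:
        return "runaway"  # applies to running and finished jobs alike
    if finished and ratio <= too_fast_factor:
        return "too_fast"  # only meaningful once the job has completed
    return "normal"


# A job that normally takes 100 seconds and is still running after 1900 seconds.
print(classify_runtime(1900, 100, finished=False))
```

Neither condition is a failure in the exit-code sense, which is why they need their own alert rules.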
A centralized workload automation solution lets you configure a group of recipients that applies to every alert on every job within, say, a specific folder. From there, you can apply your escalation policies, so each group of stakeholders gets the alert until the job stops failing. A centralized workload automation solution makes this even easier when you can pre-configure the groups in your tools, so it doesn’t require special attention when they change in your environment. The goal is to go beyond the basic alert that says, “Something’s broken.” Instead, you want to make sure you not only get that alert to the right people, but also provide them with the information they need to address the issue. Centralized distribution lists and on-call lists make it easy to keep the right people informed. You can even segregate the alerts and the notifications based on folder structure, ensuring they’re sent to the appropriate groups. Most organizations have a directory system like Active Directory, or some type of LDAP authentication that can be leveraged.
A centralized workload automation solution makes it easy to leverage lists already defined. And any change in membership automatically updates the alert’s recipients. Plus you can send an email through any SMTP server to any email address, whether it’s personal or a group, whether it’s Lotus Notes, or Exchange, or even some generic in-house email system. It’s also important to be able to maintain an audit trail of those alerts, especially in your mission-critical applications. You often need to be able to show that alerts were sent, and you need to make sure those alerts get noticed. Alerts are all retained within each job for proper auditing measures. And this helps demonstrate that you are giving proper attention to your scheduling issues and that you have a solid action plan to resolve them quickly.
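An escalation policy of the kind described here can be sketched as a list of delays and groups: the longer an alert goes unacknowledged, the more groups receive it. The group names, delays, and the `recipients_for` helper below are hypothetical and exist only for this Python illustration.

```python
# Hypothetical escalation steps: (minutes without acknowledgement, group to notify).
ESCALATION_STEPS = [
    (0, "on-call"),
    (15, "team-lead"),
    (30, "ops-manager"),
]


def recipients_for(minutes_unacknowledged, steps=ESCALATION_STEPS):
    """Return every group whose escalation delay has already elapsed."""
    return [group for delay, group in steps if minutes_unacknowledged >= delay]


print(recipients_for(0))   # only the first group
print(recipients_for(20))  # first and second groups
```

Because the groups are referenced by name, a change in a group's membership (for example in a directory service) would not require touching the policy itself.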
Making alerts more useful starts with deciding how you want to receive those alerts. With a centralized workload automation solution, similar information can be delivered through email or even text messages, or you may prefer to send alerts to a chat product such as Teams or Slack. How about a service management tool like ServiceNow, or even ticketing systems such as JIRA or Remedy? Structured data is not only useful to the recipient, but it can be parsed and used directly in your ticketing system to respond to alerts. You can easily map job status, log information, and other important details to most common ticketing applications. Though less common, we also support teams that use SCOM and SNMP traps as a way to process job alerts. And the use of markdown formatting can help make the content very clear to that recipient. The phone notifications that just point you back to the schedule… Yeah, that’s not really useful. So what really happened? You want… No, actually you need that relevant information directly in the alerts so you have everything you need to address the alert right at your fingertips.
This could include things like the log file, the agent name, the exit code, and even specific values from the log file. Or how about a full stack trace? To really make notifications useful they should also include some recovery instructions, say a link to some documentation, or if you need some extra help, how about some directions on where that recipient can go to get that extra help? The biggest benefit to alerts with a centralized workload automation solution is creating alert templates for groups of jobs that are organized in a folder structure. You can set these up so a specific group of jobs receives the same format for all of its alerts. These templates can include the from address, the team that gets notified, and even the common recovery instructions, or anything else that is relevant for the group of jobs.
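As a rough illustration of an alert that carries its own context, the Python sketch below fills a markdown-formatted body with job details and a recovery link. The field names, the placeholder syntax (Python's `string.Template`), and the example URL are assumptions made for the example and do not reflect any particular product's template format.

```python
from string import Template

# Hypothetical markdown alert body; the $fields are filled in per job run.
ALERT_TEMPLATE = Template(
    "**Job failed:** $job_name\n"
    "- Agent: $agent\n"
    "- Exit code: $exit_code\n"
    "- Log excerpt: `$log_excerpt`\n"
    "- Recovery steps: [runbook]($runbook_url)\n"
)


def render_alert(**fields):
    """Fill in the template; raises KeyError if a required field is missing."""
    return ALERT_TEMPLATE.substitute(**fields)


print(render_alert(
    job_name="nightly_load",
    agent="agent01",
    exit_code=1,
    log_excerpt="connection reset by peer",
    runbook_url="https://example.com/runbook",
))
```

Failing loudly on a missing field is a deliberate choice here: an alert with blank details is close to the "something's broken" message the talk warns against.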
Those templates will also be inherited down to all of the jobs in that folder, so you don’t have to constantly configure new alerts. However, you can always override or add additional information if needed. You want the templates on the development folder to only send alerts to your development group. And when you promote those jobs to the QA folder, it inherits the QA template so that the QA team can get the information they need without having to get all of the dev alerts, and so on up the chain for production. This way your CEO is not getting the outage notification for the job while it’s in the dev environment, but only when the job gets to production.
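The folder inheritance described above can be pictured as merging settings from the root folder down to the job's folder, with job-level overrides applied last. The following Python sketch assumes a hypothetical `TEMPLATES` mapping and folder layout. It is meant to show the idea, not how any specific scheduler stores its templates.

```python
# Hypothetical per-folder alert templates; child folders inherit from parents.
TEMPLATES = {
    "/": {"from_address": "scheduler@example.com",
          "recovery_doc": "https://example.com/runbook"},
    "/dev": {"notify": ["dev-team"]},
    "/qa": {"notify": ["qa-team"]},
    "/prod": {"notify": ["ops-team", "executives"]},
}


def effective_template(folder, overrides=None):
    """Merge templates from the root down to `folder`, then apply job overrides."""
    parts = [p for p in folder.split("/") if p]
    paths = ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    merged = {}
    for path in paths:
        merged.update(TEMPLATES.get(path, {}))  # deeper folders win
    merged.update(overrides or {})              # job-level settings win last
    return merged


print(effective_template("/dev/etl")["notify"])
print(effective_template("/prod")["notify"])
```

Promoting a job from one folder to another changes who is notified without editing the job's own alert settings.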
When you start getting too many alerts, you’re naturally going to start ignoring them, or creating inbox rules to filter them out, and we don’t want that. We want every alert you receive to be something that actually deserves your attention. Alerts are not a substitute for centralized monitoring, especially with things like Task Scheduler, or other native tools. People use alerts from those tools to collect job statuses. They can’t be logged into every machine at once, so they use email to keep tabs on jobs. A good workload automation tool will monitor your job status and highlight the jobs that need your attention in one centralized place without sending out alerts.
So we’ve got some alerts here. It looks like one of them is running at 1900% complete. Yeah, that’s a runaway, so technically the job hasn’t failed yet, but we need to know about that. Let’s set up an alert for that. We’ve got another job that has been stalled. It’s waiting on a prerequisite, which could be a file, or another job, or some other prerequisite. Once again, it hasn’t failed, but hey, we need an alert on that. And we’ve also got a sequence here that is halted. And yes, we have alerts for those as well. And it’s very easy to visually identify in that monitor exactly what needs attention, and what doesn’t.
And what about those small issues that you know can be mitigated without a fire drill? A centralized workload automation solution can do things like retry a job on a failure. It can even retry the job a couple of times before raising the fail flag. If a brief table lock or a network interruption was all that stood between your job starting and completing successfully, then a retry can eliminate an alert. Another way to avoid unnecessary alerts is through recovery jobs. A good scheduler will allow you to include a recovery job to correct the root cause, and retry the job before sending an alert. Again, one more alert that can be avoided.
Okay. So what are some of the key takeaways from this presentation? Current alerts from Task Scheduler, SQL Agent, and Cron merely detect success or failure, and often require complex scripting to generate alerts. Alert events go beyond mere job failures or successes, and there are many other events that aren’t officially a failure that you can be alerted to before they become bigger problems. Setting up escalating alerts, and even the audit trail, ensures that the right people are seeing the proper alerts at the right time, and that they are responding. Default notifications are useless, and proper alerts should include as much relevant information to resolve problems as possible. You know you’re going to start to ignore those alerts if there are too many, and a centralized workload automation solution can automatically resolve problems that would usually require an alert, like retrying a job or starting a recovery job. So with that, Dave, do we have any questions?
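The retry-then-recover behavior described above can be sketched in a few lines of Python. Everything named here (`run_with_retries`, the `recovery` callback, the `send_alert` callback) is a hypothetical stand-in for what a scheduler would do internally. The point is only that the alert fires after the automatic remedies have been exhausted.

```python
def run_with_retries(job, max_retries=2, recovery=None, send_alert=print):
    """Run `job` (a callable returning True on success) with automatic retries.

    An optional `recovery` callable runs between attempts. `send_alert` is
    called once, and only after every attempt has failed.
    """
    attempts = 1 + max_retries
    for attempt in range(1, attempts + 1):
        if job():
            return True  # success: nobody needs to be notified
        if recovery is not None and attempt < attempts:
            recovery()   # try to correct the root cause before retrying
    send_alert(f"Job failed after {attempts} attempts")
    return False
```

With this shape, a job that fails twice because of a brief table lock and succeeds on the third attempt produces no alert at all.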
Great. Thanks Rob. Yeah, just two questions here. One is my ops team is managing most of these kinds of alerts with JIRA. Do we have to go back into JAMS, or is there some way we can respond to our failure alerts through JIRA?
Oh, you can have JIRA absolutely respond to events and take action within JAMS as well. That’s a great question. JIRA does have the ability to execute some external scripts, and those external scripts can respond to the alert, can retry the job, it can even start another job, and kick it off as needed as well. So yes, absolutely.
All right. And another question is, I have some additional requirements to annotate things for our overnight operators. In the emails, can we include hyperlinks to… I’m assuming documents, or other instructions for people when they receive an alert?
Yes. The email is completely customizable, you can add any number or level of documentation that you want to it, including the log file or additional files. And it actually uses the markdown language, so inserting things like hyperlinks is absolutely possible, as well as formatting it in a view that makes it very easy to find the information and get the direction you need to go resolve that alert, or resolve that problem.
Got you, all right. Good, that’s all the questions we got. Rob, thanks for the overview. This is great, and hope you found this useful. We’ve got other workload automation foundations videos on our website, and you’ll get a recording of this afterwards. So thanks everyone for joining us, and thanks Rob.
Thank you.
See you next time.