Waiting indefinitely, whether for the coffee you ordered 5 minutes ago or for the date who is an hour late, makes you question whether the expected outcome will ever happen. A well-designed workload automation strategy minimizes this type of uncertainty by establishing policies for runaways: jobs that are still running after the time limit within which they should have completed.
One Question. One Policy.
If you’re looking into centralized workload automation solutions, you’ve probably reached a point where you can’t possibly review the functionality of every job in the schedule. You could dig through log files to get an idea of average elapsed times, but that could require hours of work, especially if you’re currently running jobs with more than one scheduling platform.
The runaway property of a job enables you to ask one simple question: how long should this job take to run?
And, once that question is answered, you can immediately apply a policy that marks a job execution as unsuccessful as soon as its elapsed time exceeds a specified time limit.
Exact Time Values vs. Relative Time Values
In some cases, you can confidently set an exact time limit on a particular job, e.g. “10 minutes”. If the job doesn’t complete successfully within 10 minutes, it fails. But what about jobs whose completion times vary, say between 20 and 40 minutes? It can be difficult to say with certainty whether a completion time of 42 minutes is a true outlier or just a minor anomaly. “Runaway Elapsed Percent” offers another way to fine-tune an enterprise’s schedule. By creating a policy based on the typical run time of a job, you can cause the job to fail when it exceeds its average completion time by some percentage, e.g. “50% longer than the average completion time”.
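The difference between the two kinds of limit can be sketched in a few lines of Python. This is an illustration only, not the API of any particular scheduling product: the function names `runaway_limit` and `is_runaway`, their parameters, and the sample history are all assumptions made for the example. Times are in minutes.

```python
from statistics import mean
from typing import Optional, Sequence


def runaway_limit(
    history: Sequence[float],
    exact_limit: Optional[float] = None,
    elapsed_percent: Optional[float] = None,
) -> Optional[float]:
    """Return the runaway limit in minutes, or None if no policy applies.

    Hypothetical helper: an exact limit wins if given; otherwise the limit
    is the average of past run times plus the given percentage.
    """
    if exact_limit is not None:
        return exact_limit
    if elapsed_percent is not None and history:
        return mean(history) * (1 + elapsed_percent / 100)
    return None


def is_runaway(elapsed: float, limit: Optional[float]) -> bool:
    """A job is a runaway once its elapsed time exceeds the limit."""
    return limit is not None and elapsed > limit


# Assumed sample history: past runs took 20, 30, and 40 minutes.
past_runs = [20.0, 30.0, 40.0]
relative = runaway_limit(past_runs, elapsed_percent=50)  # average 30 + 50%
```

With this sample history the average is 30 minutes, so a 50% policy puts the limit at 45 minutes: a 42-minute run is tolerated, while a 46-minute run is flagged as a runaway.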
Runaway Action – “I’m Not Dead Yet”
Runaways aren’t always failures, even though most of them indicate that there is some larger issue preventing a job from completing successfully. The default option doesn’t have to be panic. Here are 3 ways you can mitigate runaway jobs with additional steps:
- Send Notifications – Whether it’s sent to you or to a sysadmin who can assist you, a prompt notification gives you ample time to troubleshoot a runaway and keep business users informed.
- Retry – Maybe a file was missing or there was a brief interruption in the application your job needed to access. Retrying a runaway job can give it one more opportunity to succeed, this time with new conditions.
- Run a Recovery Job – If you’ve repeatedly employed the same series of tasks to resolve the runaway status of a particular job, you’re on your way to leveraging a recovery job. To give critical jobs resiliency, add triggers that fire off a specific recovery job when the original job runs for too long.
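The three actions above can be combined into a single policy. The following Python sketch shows one possible ordering (notify first, then retry while retries remain, then fall back to a recovery job). The `RunawayPolicy` class, the `handle_runaway` function, and the job name are hypothetical and do not correspond to any specific product’s configuration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunawayPolicy:
    """Hypothetical policy: every field is optional."""
    notify: Optional[Callable[[str], None]] = None
    max_retries: int = 0
    recovery_job: Optional[Callable[[], None]] = None


def handle_runaway(
    job_name: str,
    retries_so_far: int,
    policy: RunawayPolicy,
    retry: Callable[[], None],
) -> List[str]:
    """Apply the policy to a runaway job and return the actions taken."""
    taken: List[str] = []
    if policy.notify is not None:
        policy.notify(f"Job '{job_name}' exceeded its runaway limit")
        taken.append("notify")
    if retries_so_far < policy.max_retries:
        # Give the job another chance before escalating.
        retry()
        taken.append("retry")
    elif policy.recovery_job is not None:
        # Retries are exhausted (or disabled): run the recovery job.
        policy.recovery_job()
        taken.append("recovery")
    return taken


# Usage with stand-in callbacks that only record what happened.
messages: List[str] = []
calls: List[str] = []
policy = RunawayPolicy(
    notify=messages.append,
    max_retries=1,
    recovery_job=lambda: calls.append("recovery"),
)
first = handle_runaway("nightly_load", 0, policy, lambda: calls.append("retry"))
second = handle_runaway("nightly_load", 1, policy, lambda: calls.append("retry"))
```

In this sketch the first runaway triggers a notification and a retry; the second, with retries exhausted, triggers a notification and the recovery job. Other orderings, such as notifying only after the retry fails, are equally reasonable.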
Business growth, network connectivity, resource availability, and other factors can gradually increase a job’s run time to a breaking point. Staying ahead of that breaking point helps you minimize downtime, avoid fire drills, and maintain business continuity.