The Retry – A Simple Hedge Against Infrequent Workload Failures

If at First You Don’t Succeed…

One of the simplest Workload Automation tactics is the retry. The retry is by no means a permanent solution for a failing job. But if your entire schedule of critical batch processes is better served by a successful execution, even one that finishes later than you would have liked, then the retry is an effective way to keep workloads on track in the short run.

Retry - A Workload Automation Concept

You often can’t afford to chase down every failure with the same degree of effort, especially when you are managing hundreds or thousands of executions per day. Retries can compensate for conditions that are beyond your control. Short-term issues such as network outages, file transfer delays, database locks, and maintenance windows can cause a job to fail one minute and succeed the next. Tracing every one of those failures to its source is time-consuming.

When you encounter issues that are infrequent and inconsistent, a retry lets you and your team stay focused on other work. Even the mental energy required to review a single job failure is worth something. Why waste that energy if success is only a retry away?
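To make the idea concrete, here is a minimal sketch of a retry wrapper in Python. It is not how JAMS or any particular scheduler implements retries; the function name `run_with_retry` and its parameters (`max_attempts`, `delay_seconds`, `sleep`) are hypothetical names chosen for illustration. It assumes the job is a callable that raises an exception when it fails.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    job: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `job`, retrying after a fixed delay if it raises.

    Hypothetical helper for illustration only. `sleep` is injectable so the
    wait can be skipped in tests.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: surface the last failure to the caller.
                raise
            logging.warning(
                "Attempt %d of %d failed (%s); retrying in %s seconds",
                attempt, max_attempts, exc, delay_seconds,
            )
            sleep(delay_seconds)

    # The loop always returns or raises, so this line is never reached.
    raise AssertionError("unreachable")
```

A capped number of attempts matters here: the retry is meant to absorb short-lived conditions such as a database lock, not to mask a job that fails every time. When the last attempt fails, the original exception is raised so the failure still gets reviewed.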

What the Retry Often Reveals

Retries often create useful data points for troubleshooting repeated job failures. If, for example, a job fails at 9:00 AM, succeeds when retried at 10:00 AM, and does so every day, there’s a good chance the root cause is a systemic change between 9:00 AM and 10:00 AM. If you use an enterprise automation solution such as JAMS, you’ll have an even greater advantage: log files that detail the execution of your retried jobs, and history tables that show the specific times when retries were required.
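As a rough sketch of how retry history can expose a pattern like the one above, the Python snippet below counts how often a retry was needed for each hour of the day. The record layout, a `(started_at, attempts)` pair per execution, is an assumption made for illustration and is not the schema of JAMS history tables; `retries_by_hour` is likewise a hypothetical name.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

# Assumed record shape: (scheduled start time, number of attempts it took).
ExecutionRecord = Tuple[datetime, int]


def retries_by_hour(history: Iterable[ExecutionRecord]) -> Counter:
    """Count executions that needed at least one retry, keyed by start hour."""
    counts: Counter = Counter()
    for started_at, attempts in history:
        if attempts > 1:  # more than one attempt means a retry happened
            counts[started_at.hour] += 1
    return counts
```

Calling `retries_by_hour(history).most_common(3)` would list the three hours with the most retried executions. An hour that dominates the list is a reasonable place to start looking for a recurring cause, such as a maintenance window or a late upstream file.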

Notice a Pattern?

Once you know the root cause that’s preventing your job from executing successfully every time, you are ready to automate smarter. The natural next step toward consistently reliable automation is the dependency.