Proactive Job Monitoring in an Imperfect World


Let’s face it: not every job runs on a perfect schedule. Your schedule isn’t filled with jobs that run on the hour, every hour, every time. What’s a busy system administrator to do?

Set up job monitors, that’s what.

Real businesses are complex and dynamic environments, in which job schedules are based on dependencies and prerequisites occurring on various servers throughout your data center. For example, if your job queues are full and your system is very busy, you could run into a case where a particular job doesn’t even get started. You could activate it and throw it into the queue, but the agent it’s supposed to run on is very busy. Should you monitor the agent and the job stream until you know your job is ready to start?

Automated monitoring for job overruns, underruns, and late starts can free your busy team from spending valuable time watching the job stream manually.

Understanding Job Underruns

A job underrun is probably the simplest case in job monitoring because it only comes into play once the job has actually started. Once the job is running, if it completes too soon, you can send out a message saying that it was an underrun. A job that completes too soon could indicate an error, but not necessarily. For example, if you know that a job usually runs for at least an hour, you could set an underrun monitor for 59 or 60 minutes. If the job completes before that, you get a notification, at which point you can check whether there was a problem.
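
The threshold check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how JAMS implements it: the function name `is_underrun`, its parameters, and the sample times are all hypothetical.

```python
from datetime import datetime, timedelta


def is_underrun(started: datetime, finished: datetime, min_runtime: timedelta) -> bool:
    """Return True when the job finished sooner than its expected minimum runtime."""
    return (finished - started) < min_runtime


# Hypothetical job that usually runs for at least an hour,
# watched with a 60-minute underrun threshold.
started = datetime(2024, 1, 15, 1, 0)
threshold = timedelta(minutes=60)

print(is_underrun(started, started + timedelta(minutes=12), threshold))  # True
print(is_underrun(started, started + timedelta(minutes=75), threshold))  # False
```

A True result here means only "take a look," not "the job failed." Whether the short run is a real problem is still your call.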

If there is a more drastic underrun, you have a stronger indication that something failed upstream. For example, you might have a job that is supposed to go to a database, get all the records of a certain type, process them, and change them into another format. The job spins through the database in less than a minute because there was no data! An underrun monitor in this situation would help you realize that something is wrong upstream, in the database that feeds the job.

However, job underruns do not always indicate a problem. You could simply have a day when there’s no clutter on the network and everything is running quickly. The point here is this: know your jobs! If you know the normal behavior of each job in your workflow, you’re better able to understand at what thresholds you should be notified of underruns or other job statuses.

In this way, job monitoring is extremely useful for proactive system monitoring, helping you to find problems faster before they affect your entire environment.

Monitoring Job Overruns

The same concept applies to overruns as to underruns: we are not monitoring jobs while they’re sitting in the job queue. Rather, the job scheduler monitors the job once it starts running. In the case of an overrun, you might expect a job to finish within ten minutes and set an overrun monitor of 30 minutes. If the job runs longer than the designated 30 minutes, you can decide to either fail the job or get a notification about it.
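
Here is a minimal Python sketch of that decision, assuming a monitor that is handed the job's elapsed runtime. The names `OverrunAction` and `check_overrun` are made up for illustration and are not part of any scheduler's API.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class OverrunAction(Enum):
    NOTIFY = "notify"  # let the job keep running, but alert someone
    FAIL = "fail"      # fail the job


def check_overrun(
    elapsed: timedelta, limit: timedelta, action: OverrunAction
) -> Optional[OverrunAction]:
    """Return the configured action once the job has run past its limit, else None."""
    if elapsed > limit:
        return action
    return None


# Job expected to finish within ten minutes, with a 30-minute overrun monitor.
limit = timedelta(minutes=30)
print(check_overrun(timedelta(minutes=8), limit, OverrunAction.NOTIFY))  # None
print(check_overrun(timedelta(minutes=45), limit, OverrunAction.FAIL))   # OverrunAction.FAIL
```

Keeping the action separate from the threshold mirrors the choice described above: the same 30-minute limit can either raise a notification or fail the job, depending on how risky it is to let the job finish.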

For example, you might have a job that is supposed to run for only a short period of time, but it ran very long. Perhaps it was collecting data from a database and the data was bad. If the job takes forever and still completes using this bad data, the workflow continues on to the next step, spreading the bad data through the rest of the system. A job overrun monitor helps here because you’d know that something was awry as soon as a job you expected to complete quickly ran long. You can clear out the bad data and restart your process so that the rest of your workflow goes smoothly.

How to Handle Late Starts

JAMS provides two options for jobs that start late. If a job is scheduled for 1 o’clock and it takes more than an hour to start running, you can flag it as a late start. You can also set a “must start by” time: if a job doesn’t start by 10:00 a.m., you can receive a notification or end the job.

Typically, you would put a late start monitor on a job that has a scheduled time, but you can also put one on a job that is strictly reactive. To achieve this with JAMS, you can use either setting: the job must start by a specific time, or it must start within a certain interval. For a reactive job, that interval is measured from the moment the job is kicked off by its trigger rather than from a scheduled time.
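
The two late-start rules can be sketched in Python as follows. This is a rough illustration under stated assumptions, not JAMS configuration: `is_late_start` and its parameters are hypothetical, and the reference time for a reactive job is assumed to be the moment it was triggered.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_late_start(
    now: datetime,
    reference_time: datetime,
    max_delay: Optional[timedelta] = None,
    must_start_by: Optional[datetime] = None,
) -> bool:
    """Check a job that has NOT started yet as of `now`.

    reference_time is the scheduled time, or (assumption) the trigger
    time for a reactive job.
    """
    # Rule 1: the job must start within max_delay of its reference time.
    if max_delay is not None and now > reference_time + max_delay:
        return True
    # Rule 2: the job must start by a fixed clock time.
    if must_start_by is not None and now > must_start_by:
        return True
    return False


scheduled = datetime(2024, 1, 15, 13, 0)  # scheduled for 1 o'clock
print(is_late_start(datetime(2024, 1, 15, 14, 5), scheduled,
                    max_delay=timedelta(hours=1)))                   # True
print(is_late_start(datetime(2024, 1, 15, 9, 30), datetime(2024, 1, 15, 9, 0),
                    must_start_by=datetime(2024, 1, 15, 10, 0)))     # False
```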

Proactive Management of Your Job Schedule

Ultimately, job monitoring is all about proactive management of your production job streams. With the right enterprise job scheduler and job monitors in place, you will have the assurance that your job schedule will run smoothly. If it doesn’t, you’ll be notified early enough to prevent large disruptions and to make the changes needed to keep the schedule on track.

Job monitoring acts as a secondary defensive measure. You might not know you need job monitoring until something actually goes wrong or you notice anomalies with certain jobs and their run times. Once that happens, you can add job monitors for the assurance that those jobs will run within the correct parameters in the future.

Successful job monitoring asks just one thing of the system administrator: know your jobs inside and out. If you are aware of the nuances of your job schedule, you’ll know that you want an underrun monitor for job A, but an overrun monitor of 30 minutes for job B.
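
One way to picture "knowing your jobs" is as a small per-job table of thresholds. The Python sketch below is purely illustrative: the `Monitor` class, the `MONITORS` table, the job names, and the 60-minute underrun value for job A are all assumptions. Only the 30-minute overrun for job B comes from the example above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Monitor:
    min_runtime: Optional[timedelta] = None  # underrun threshold, if any
    max_runtime: Optional[timedelta] = None  # overrun threshold, if any


# Hypothetical per-job settings; each job gets only the monitors it needs.
MONITORS = {
    "job_a": Monitor(min_runtime=timedelta(minutes=60)),  # assumed value
    "job_b": Monitor(max_runtime=timedelta(minutes=30)),
}


def evaluate(job: str, runtime: timedelta) -> str:
    """Classify a finished job's runtime as 'underrun', 'overrun', or 'ok'."""
    monitor = MONITORS.get(job, Monitor())  # unmonitored jobs always pass
    if monitor.min_runtime is not None and runtime < monitor.min_runtime:
        return "underrun"
    if monitor.max_runtime is not None and runtime > monitor.max_runtime:
        return "overrun"
    return "ok"


print(evaluate("job_a", timedelta(minutes=5)))   # underrun
print(evaluate("job_b", timedelta(minutes=45)))  # overrun
print(evaluate("job_b", timedelta(minutes=5)))   # ok
```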