Cron heartbeat monitoring

A cron check is a heartbeat URL that watches a scheduled job. Your job pings the URL when it runs, and BoxWatch alerts you when it doesn't. This works for cron, systemd timers, Kubernetes CronJobs, GitHub Actions, Windows Task Scheduler — anything that can run curl.

How it works

You give the check an expected interval (say, every 24 hours) and a grace period (say, 5 minutes). Your job hits its unique ping URL every time it runs successfully. BoxWatch keeps a timer.

If interval + grace elapses without a success ping, you get a missed alert. If your job explicitly tells BoxWatch it failed, you get a failed alert. There are two more states (covered below) for jobs that signal start-but-never-finish.
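The timing rule above is plain arithmetic on timestamps. The following is a minimal sketch of it, assuming Unix epoch seconds; the function names are illustrative and are not part of BoxWatch, and whether the real monitor treats the exact deadline second as late is an assumption.

```shell
#!/bin/sh
# Sketch of the "missed" rule: a check is late once interval + grace
# seconds have passed since the last success ping.
# Function names are hypothetical, not a BoxWatch API.

# missed_deadline LAST_SUCCESS INTERVAL GRACE -> epoch second of the deadline
missed_deadline() {
  echo $(( $1 + $2 + $3 ))
}

# is_missed NOW LAST_SUCCESS INTERVAL GRACE -> exit 0 once the deadline has passed
is_missed() {
  [ "$1" -gt "$(missed_deadline "$2" "$3" "$4")" ]
}
```

For a daily job (86400 seconds) with the default 300-second grace, `missed_deadline 1000000 86400 300` prints 1086700.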

The slug in the ping URL is a UUIDv4 — random, unguessable, and treated as the secret. No auth header, no API key. Just curl the URL.

Creating a check

Sign in and go to Dashboard → Checks → New Check.
Give it a name (e.g. "Nightly Postgres backup").
Set the interval — how often the job is supposed to run.
Set the grace period — how long after the interval BoxWatch waits before firing a missed alert. Default is 5 minutes.
Optionally link it to a server. Linking lets the check inherit that server's maintenance windows.
Optionally set a max duration to enable long-running-job alerts.
Pick which alert types you want (missed, fail, stuck, long).
Click Create. You'll land on the detail page with your ping URLs ready to copy.

Note: Grace period is capped at half the interval, rounded down (floor(interval / 2)). For a 60-second interval, the maximum grace is 30 seconds. The API rejects configurations outside that bound.
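If you script check creation, the cap can be checked locally before calling the API. This is only a sketch of the stated rule, with hypothetical function names; it is not the API's own validation code.

```shell
#!/bin/sh
# Local pre-check for the grace cap: grace <= floor(interval / 2).
# Both values are in seconds. Function names are illustrative.

# max_grace INTERVAL -> largest allowed grace (integer division rounds down)
max_grace() {
  echo $(( $1 / 2 ))
}

# grace_allowed INTERVAL GRACE -> exit 0 if GRACE is within the cap
grace_allowed() {
  [ "$2" -ge 0 ] && [ "$2" -le "$(max_grace "$1")" ]
}
```

For example, `grace_allowed 600 300` succeeds, while `grace_allowed 300 300` fails because a 5-minute interval allows at most 150 seconds of grace.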

The ping URL

Each check has one slug. From that slug you get four ping endpoints:

GET /ping/:slug

Auth: none

Job succeeded. Updates last_ping_at and last_success_at. If a /start came in earlier for this run, BoxWatch records the duration.

GET /ping/:slug/start

Auth: none

Job started. Optional, but recommended — it enables stuck-job detection and duration tracking.

GET /ping/:slug/fail

Auth: none

Job failed. Exit code not recorded.

GET /ping/:slug/fail/:code

Auth: none

Job failed with an explicit exit code (integer 0–255). Recorded in the ping history.

All four endpoints accept GET, POST, or HEAD. If you POST a body, the first 10 KB are captured and shown in the ping history — handy for stashing a short log tail or a JSON status blob.
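One way to use the POST body is to attach the tail of a log file to the ping. The sketch below assumes a placeholder slug and a hypothetical function name. It trims the body to 10000 bytes so it stays under the 10 KB capture limit whether that means 10000 or 10240 bytes. The --max-time value is an arbitrary choice.

```shell
#!/bin/sh
# Placeholder URL: replace YOUR-SLUG with the slug from your check.
PING_URL="${PING_URL:-https://api.boxwatch.app/ping/YOUR-SLUG}"
CURL="${CURL:-curl}"   # overridable so the function can be dry-run

# ping_with_log LOGFILE [URL]
# POSTs the last 10000 bytes of LOGFILE as the request body.
# --data-binary @- reads the body from stdin and preserves newlines.
ping_with_log() {
  tail -c 10000 "$1" | "$CURL" -fsS --max-time 10 --data-binary @- "${2:-$PING_URL}"
}
```

A job could call `ping_with_log /var/log/backup.log` as its last step. The log path is an example only.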

The response is always 200 OK with {"ok": true} — even for invalid slugs, to prevent enumeration.

Examples

Plain cron line

0 3 * * * /opt/backup.sh && curl -fsS https://api.boxwatch.app/ping/abc123-...

Runs the backup at 3 AM. If backup.sh exits zero, the curl runs and BoxWatch records a success.

Cron line with exit-code reporting

0 * * * * /opt/cleanup.sh; curl -fsS https://api.boxwatch.app/ping/abc123-.../fail/$?

Note the ; instead of && — this fires the ping regardless of exit status, and $? carries the exit code into the URL. BoxWatch interprets exit 0 as success and any non-zero code as fail.

Wrapped script with start, success, and fail signaling

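*/15 * * * * curl -fsS https://api.boxwatch.app/ping/abc123-.../start; if /opt/sync.sh; then curl -fsS https://api.boxwatch.app/ping/abc123-...; else curl -fsS https://api.boxwatch.app/ping/abc123-.../fail/$?; fi

This sends /start when the job kicks off, then a success ping or /fail/$? on the way out. The start ping is followed by ; rather than && so that a failed ping never prevents the job itself from running, and the if/else form ensures a failed success ping is not reported as a job failure. The entry is written on one line because crontab does not support backslash line continuations. You get duration tracking and stuck-job detection in exchange for a noisier cron line.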
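If the cron line gets too long, the same start, success, and fail signaling can live in a small wrapper. This is a sketch under assumptions: the function name and the --max-time value are illustrative, and every ping is best-effort so a BoxWatch outage never changes the job's behavior.

```shell
#!/bin/sh
# Sketch of a wrapper: boxwatch_run PING_URL COMMAND [ARGS...]
# Sends /start, runs the command, then reports success or the exit code.
CURL="${CURL:-curl}"   # overridable for dry runs

boxwatch_run() {
  url=$1; shift
  # A failed ping must never stop or fail the job, hence "|| true".
  "$CURL" -fsS --max-time 10 "$url/start" >/dev/null || true
  rc=0
  "$@" || rc=$?
  if [ "$rc" -eq 0 ]; then
    "$CURL" -fsS --max-time 10 "$url" >/dev/null || true
  else
    "$CURL" -fsS --max-time 10 "$url/fail/$rc" >/dev/null || true
  fi
  # Preserve the job's own exit status for cron or the caller.
  return "$rc"
}
```

Saved as a script that ends with `boxwatch_run "$@"`, the cron entry shrinks to the script path, the ping URL, and the job command.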

systemd timer

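A systemd OnFailure= hook is the cleanest way to wire the fail ping. OnFailure= belongs in the [Unit] section, and Type=oneshot makes ExecStartPost= wait until backup.sh has exited successfully before sending the success ping:

[Unit]
OnFailure=boxwatch-fail.service

[Service]
Type=oneshot
ExecStart=/opt/backup.sh
ExecStartPost=/usr/bin/curl -fsS https://api.boxwatch.app/ping/abc123-...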
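The unit above hands failures to boxwatch-fail.service, which is not shown. One possible shape for it is a oneshot unit that curls the /fail URL. The sketch below prints such a unit for a given ping URL; the function name, the Description text, and the --max-time value are assumptions for illustration.

```shell
#!/bin/sh
# fail_unit PING_URL -> prints a oneshot unit that sends the fail ping.
# Hypothetical helper; adjust the curl path to match your system.
fail_unit() {
  cat <<EOF
[Unit]
Description=Report a failed job to BoxWatch

[Service]
Type=oneshot
ExecStart=/usr/bin/curl -fsS --max-time 10 $1/fail
EOF
}
```

Redirect the output to /etc/systemd/system/boxwatch-fail.service and run systemctl daemon-reload so systemd picks it up.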

Kubernetes CronJob

spec:
  containers:
    - name: backup
      image: my-backup-image
      command:
        - /bin/sh
        - -c
        - "/opt/backup.sh && curl -fsS https://api.boxwatch.app/ping/abc123-..."

Tip: Use curl -fsS (or curl --fail --silent --show-error). The -s flag keeps successful pings out of your cron mail, -S still prints an error when the ping URL is unreachable, and -f turns an HTTP error response into a non-zero exit code.

Alert types

There are four alert types. Each one is toggleable per-check.

Missed

No success ping (or fail ping) received within interval + grace of the last terminal ping. This is the most common alert — your job didn't run, or it ran but couldn't reach BoxWatch.

Failing

The most recent terminal ping was a /fail. The job ran, but it told BoxWatch something went wrong. The check stays in failing until the next success ping clears it.

Stuck

A /start ping came in, but no matching success or /fail ping arrived within interval + grace of the start. The job started and never finished, for example because the process was killed, deadlocked, or stuck in an infinite loop.

Running long

A /start ping came in, and max_duration_seconds has elapsed without a terminal ping, but it hasn't been long enough to count as stuck yet. Useful for "this backup usually takes 10 minutes — tell me if it's been running for 30." Disabled by default; requires max_duration_seconds to be set.
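The difference between stuck and running long comes down to which threshold the elapsed time has crossed. The sketch below classifies a run that has sent /start but no terminal ping yet. The function name is hypothetical, all values are in seconds, and the strict greater-than comparisons at the boundaries are an assumption.

```shell
#!/bin/sh
# classify_run NOW STARTED_AT INTERVAL GRACE [MAX_DURATION]
# Prints one of: running | long | stuck
# Applies to a run with a /start ping and no terminal ping so far.
classify_run() {
  elapsed=$(( $1 - $2 ))
  if [ "$elapsed" -gt $(( $3 + $4 )) ]; then
    echo stuck        # past interval + grace since the start
  elif [ -n "${5:-}" ] && [ "$elapsed" -gt "$5" ]; then
    echo long         # past max duration, not yet stuck
  else
    echo running
  fi
}
```

For a daily backup with 300 seconds of grace and a 1800-second max duration, a run that started 2000 seconds ago is classified as long. Without a max duration the same run is still running.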

Grace period

Grace is the slack you allow before "late" becomes "missed." The default is 5 minutes (300 seconds), which works well for jobs that run every 10 minutes or less often. Tighter schedules have to use less because of the half-interval cap: a job that runs every minute can have at most 30 seconds of grace. For jobs whose runtime varies modestly (a backup that's bigger on Mondays), raise it.

A reasonable rule of thumb: grace should cover normal jitter, not normal runtime variance. If your backup sometimes takes 4 hours and sometimes takes 6, your interval should be 8 hours, not "6 hours plus 2 hours of grace."

The maximum grace is half the interval, enforced by the API.

Anti-storm: how alerts are deduped

You get one alert per state transition, not one per missed cycle.

When a check transitions from up to missed, an alert fires once. The check stays in missed (with alerted_state = 'missed' stored alongside) until something changes. Subsequent monitor ticks see the alerted state already set and stay quiet.

When a success ping arrives and the check transitions back to up, BoxWatch clears the alerted state. The next time it fails, a fresh alert fires.

If a maintenance window is open on the linked server during the transition, the alert is suppressed but the alerted state is still recorded. That prevents a backlog burst when the window closes — by the time alerts resume, the state machine already considers the alert "delivered."

There is also a per-tick safety cap: if more than 50 checks transition into bad states in a single 60-second monitor tick (e.g. the API was offline for hours and just came back), BoxWatch logs an "alert storm suppressed" warning and rolls the rest into the digest. Single guard, cheap insurance.
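The dedup rules can be read as a small decision table over the current state and the stored alerted state. The following is a sketch of that table, not BoxWatch's implementation. The function name, the "-" placeholder for an unset alerted state, and the choice to treat a move between two bad states as a new transition are all assumptions. The per-tick cap is left out.

```shell
#!/bin/sh
# alert_decision STATE ALERTED_STATE IN_MAINTENANCE
#   STATE:          up | missed | failing | stuck | long
#   ALERTED_STATE:  same values, or "-" when nothing is recorded
#   IN_MAINTENANCE: 1 if a maintenance window is open, else 0
# Prints "<action> <new alerted_state>", where action is one of
#   alert | suppress | quiet | clear
alert_decision() {
  state=$1; alerted=$2; maintenance=$3
  if [ "$state" = up ]; then
    # Recovery clears the record so the next failure alerts again.
    if [ "$alerted" = - ]; then echo "quiet -"; else echo "clear -"; fi
  elif [ "$state" = "$alerted" ]; then
    echo "quiet $alerted"       # already alerted for this state
  elif [ "$maintenance" -eq 1 ]; then
    echo "suppress $state"      # no alert sent, but state is recorded
  else
    echo "alert $state"
  fi
}
```

Because the suppressed case still records the state, a later tick after the window closes sees matching states and stays quiet, which is the no-backlog behavior described above.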

Maintenance windows

Linking a check to a server means the check inherits that server's maintenance windows. While a window is open, the check still tracks its state — you'll see it transition in the dashboard — but no alerts are dispatched.

For ad-hoc pauses (e.g. you're rewriting the backup script and don't want noise for a week), use the per-check pause toggle instead. Paused checks are excluded from the monitor entirely.

Ping history & retention

Each check keeps its last 100 pings. New pings push out old ones — there's no separate cleanup job. On the detail page you'll see the 25 most recent, with type, exit code, duration, source IP, user agent, and a preview of any POSTed body.

The 100-ping limit is a hard cap, not a plan setting. It applies to every plan.

Account limits

Every account gets unlimited cron checks, free. Existing checks keep working regardless of any other account changes.

Troubleshooting

Alerts are too noisy

Bump grace. If you're getting missed alerts because the job legitimately runs late sometimes, your grace is too small. Open the check and raise it.

The check shows "down" but my job is running

Make sure curl is exiting cleanly. Use curl -fsS so failures are visible. Check the cron user's PATH and any firewall that blocks outbound traffic to api.boxwatch.app. Then run the ping by hand from the server's terminal:

curl -fsS https://api.boxwatch.app/ping/YOUR-SLUG

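You should get {"ok":true}. If you don't, the server cannot reach BoxWatch, and that is why the check is not receiving pings. If you do, remember that invalid slugs also return 200, so compare the slug in your cron line against the one on the check's detail page.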
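When the manual ping fails, curl's exit code usually narrows down the cause. The helper below is a sketch that maps a few common codes to hints. The function name and the hint wording are illustrative; the numeric codes are curl's documented exit codes, plus the shell's 127 for a command that was not found.

```shell
#!/bin/sh
# explain_curl_exit CODE -> one-line hint for common curl exit codes
explain_curl_exit() {
  case "$1" in
    0)     echo "ping delivered" ;;
    6)     echo "could not resolve host: check DNS on this server" ;;
    7)     echo "could not connect: check outbound firewall rules" ;;
    22)    echo "HTTP error status returned (reported because of -f)" ;;
    28)    echo "timed out: check the network path or proxy settings" ;;
    35|60) echo "TLS problem: check CA certificates and the system clock" ;;
    127)   echo "curl not found: check the cron user's PATH" ;;
    *)     echo "curl exit code $1: see EXIT CODES in the curl man page" ;;
  esac
}
```

Run the ping, then pass the exit status to the helper: `curl -fsS https://api.boxwatch.app/ping/YOUR-SLUG; explain_curl_exit $?`.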
