title: "Cron heartbeat monitoring"
description: "Watch any scheduled job (cron, systemd timer, K8s CronJob, GitHub Actions) by having it ping a URL."
last_updated: "2026-05-24"
Cron heartbeat monitoring
A cron check is a heartbeat URL that watches a scheduled job. Your job pings the URL when it runs, and BoxWatch alerts you when it doesn't. This works for cron, systemd timers, Kubernetes CronJobs, GitHub Actions, Windows Task Scheduler — anything that can run curl.
How it works
You give the check an expected interval (say, every 24 hours) and a grace period (say, 5 minutes). Your job hits its unique ping URL every time it runs successfully. BoxWatch keeps a timer.
If interval + grace elapses without a success ping, you get a missed alert. If your job explicitly tells BoxWatch it failed, you get a failed alert. There are two more states (covered below) for jobs that signal start-but-never-finish.
The slug in the ping URL is a UUIDv4 — random, unguessable, and treated as the secret. No auth header, no API key. Just curl the URL.
Creating a check
- Sign in and go to Dashboard → Checks → New Check.
- Give it a name (e.g. "Nightly Postgres backup").
- Set the interval — how often the job is supposed to run.
- Set the grace period — how long after the interval BoxWatch waits before firing a missed alert. Default is 5 minutes.
- Optionally link it to a server. Linking lets the check inherit that server's maintenance windows.
- Optionally set a max duration to enable long-running-job alerts.
- Pick which alert types you want (missed, fail, stuck, long).
- Click Create. You'll land on the detail page with your four ping URLs ready to copy.
Grace period is capped at half the interval (floor(interval / 2)). For a 60-second interval, max grace is 30 seconds. The API will reject configurations outside that bound.
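The cap is easy to check before submitting a configuration. The following is a minimal sketch of the rule as stated above, not BoxWatch's actual validation code; the function names are made up for illustration, and both values are assumed to be in seconds.

```python
def max_grace_seconds(interval_seconds: int) -> int:
    """Largest grace period allowed for an interval: floor(interval / 2)."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return interval_seconds // 2


def is_valid_grace(interval_seconds: int, grace_seconds: int) -> bool:
    """True when the grace period is non-negative and within the cap."""
    return 0 <= grace_seconds <= max_grace_seconds(interval_seconds)


print(max_grace_seconds(60))       # 30
print(is_valid_grace(86400, 300))  # True: 5 minutes of grace on a daily job
print(is_valid_grace(300, 300))    # False: the cap for a 5-minute interval is 150
```

Note that the default 5-minute grace only passes this check when the interval is at least 10 minutes.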
The ping URL
Each check has one slug. From that slug you get four ping endpoints, all under https://api.boxwatch.app:
| Endpoint | Meaning |
|---|---|
| /ping/{slug} | Job succeeded. Updates last_ping_at and last_success_at. If a /start came in earlier for this run, BoxWatch records the duration. |
| /ping/{slug}/start | Job started. Optional, but recommended: it enables stuck-job detection and duration tracking. |
| /ping/{slug}/fail | Job failed. Exit code not recorded. |
| /ping/{slug}/fail/{code} | Job failed with an explicit exit code (integer 0–255). Recorded in the ping history. |
All four endpoints accept GET, POST, or HEAD. If you POST a body, the first 10 KB are captured and shown in the ping history — handy for stashing a short log tail or a JSON status blob.
The response is always 200 OK with {"ok": true} — even for invalid slugs, to prevent enumeration.
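If you build ping URLs in code rather than pasting them, a small helper keeps the four variants consistent. This is a sketch only: the URL shapes follow the curl examples on this page, while ping_url and its parameters are hypothetical names and YOUR-SLUG is a placeholder.

```python
from typing import Optional

# Host and path prefix as used in the curl examples on this page.
BASE_URL = "https://api.boxwatch.app/ping"


def ping_url(slug: str, kind: str = "success", exit_code: Optional[int] = None) -> str:
    """Build a ping URL for a check slug.

    kind is "success", "start", or "fail"; exit_code is only used with "fail".
    """
    if kind == "success":
        return f"{BASE_URL}/{slug}"
    if kind == "start":
        return f"{BASE_URL}/{slug}/start"
    if kind == "fail":
        if exit_code is None:
            return f"{BASE_URL}/{slug}/fail"
        if not 0 <= exit_code <= 255:
            raise ValueError("exit code must be an integer in 0-255")
        return f"{BASE_URL}/{slug}/fail/{exit_code}"
    raise ValueError(f"unknown ping kind: {kind!r}")


print(ping_url("YOUR-SLUG"))             # https://api.boxwatch.app/ping/YOUR-SLUG
print(ping_url("YOUR-SLUG", "fail", 3))  # https://api.boxwatch.app/ping/YOUR-SLUG/fail/3
```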
Examples
Plain cron line
0 3 * * * /opt/backup.sh && curl -fsS https://api.boxwatch.app/ping/abc123-...
Runs the backup at 3 AM. If backup.sh exits zero, the curl runs and BoxWatch records a success.
Cron line with exit-code reporting
0 * * * * /opt/cleanup.sh; curl -fsS https://api.boxwatch.app/ping/abc123-.../fail/$?
Note the ; instead of &&: this fires the ping regardless of exit status, and $? carries the exit code into the URL. BoxWatch interprets exit 0 as success and any non-zero code as fail.
Wrapped script with start, success, and fail signaling
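*/15 * * * * curl -fsS https://api.boxwatch.app/ping/abc123-.../start; if /opt/sync.sh; then curl -fsS https://api.boxwatch.app/ping/abc123-...; else curl -fsS https://api.boxwatch.app/ping/abc123-.../fail/$?; fi
This sends /start when the job kicks off, then the success ping or /fail/$? on the way out. The entry stays on one line because crontab does not support backslash line continuation. The ; after /start lets the job run even if the start ping cannot be delivered, and the if/else form keeps $? tied to sync.sh's exit code rather than to a failed success ping. You get duration tracking and stuck-job detection in exchange for a noisier cron line.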
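If the job is launched from Python rather than a bare cron line, the same start, success, and fail signaling can live in a small wrapper. The following is a minimal standard-library sketch; run_with_heartbeat, send_ping, and the YOUR-SLUG placeholder are illustrative names, not part of any BoxWatch client, and the signal handling assumes a POSIX host.

```python
import subprocess
import sys
import urllib.request
from typing import Callable, Sequence

PING_BASE = "https://api.boxwatch.app/ping/YOUR-SLUG"  # placeholder slug


def send_ping(url: str) -> None:
    """Fire a ping; log most network errors instead of raising, so a
    monitoring outage does not break the job itself."""
    try:
        urllib.request.urlopen(url, timeout=10).close()
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print(f"ping failed: {exc}", file=sys.stderr)


def run_with_heartbeat(cmd: Sequence[str], base: str = PING_BASE,
                       send: Callable[[str], None] = send_ping) -> int:
    """Send /start, run cmd, then send the success ping or /fail/<exit code>."""
    send(f"{base}/start")
    code = subprocess.run(cmd).returncode
    if code == 0:
        send(base)
    else:
        # A negative returncode means the child died from a signal (POSIX);
        # report it the way a shell would (128 + signal), capped at 255.
        reported = 128 - code if code < 0 else min(code, 255)
        send(f"{base}/fail/{reported}")
    return code


# Example call (not executed here): run_with_heartbeat(["/opt/sync.sh"])
```

Passing the sender in as a parameter keeps the wrapper testable without network access, and returning the job's exit code lets the caller propagate it with sys.exit.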
systemd timer
A systemd OnFailure= hook is the cleanest way to wire the fail ping:
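[Unit]
OnFailure=boxwatch-fail.service
[Service]
Type=oneshot
ExecStart=/opt/backup.sh
ExecStartPost=/usr/bin/curl -fsS https://api.boxwatch.app/ping/abc123-...
OnFailure= belongs in the [Unit] section, and Type=oneshot makes ExecStartPost= run only after backup.sh has exited successfully. boxwatch-fail.service is a separate unit you define to curl the /fail URL.
Kubernetes CronJob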
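spec:
  containers:
    - name: backup
      image: my-backup-image
      command:
        - /bin/sh
        - -c
        - "/opt/backup.sh && curl -fsS https://api.boxwatch.app/ping/abc123-..."
This fragment is the pod spec nested under jobTemplate.spec.template in the CronJob manifest, and the image needs curl installed.
Use curl -fsS (or curl --fail --silent --show-error) in all of these examples: -s hides the progress meter so successful pings add no noise to your cron mail, -S still prints an error if the ping URL is unreachable, and -f turns an HTTP error response into a non-zero exit.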
Alert types
There are four alert types. Each one is toggleable per-check.
Missed
No terminal ping (success or fail) received within interval + grace of the last terminal ping. This is the most common alert: your job didn't run, or it ran but couldn't reach BoxWatch.
Failing
The most recent terminal ping was a /fail. The job ran, but it told BoxWatch something went wrong. The check stays in failing until the next success ping clears it.
Stuck
A /start ping came in, but no matching success or /fail ping arrived within interval + grace of the start. The job started and never finished (process killed, deadlock, infinite loop).
Running long
A /start ping came in, and max_duration_seconds has elapsed without a terminal ping, but it hasn't been long enough to count as stuck yet. Useful for "this backup usually takes 10 minutes — tell me if it's been running for 30." Disabled by default; requires max_duration_seconds to be set.
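The four definitions above can be read as one decision function. The sketch below is a model of the documented rules, not BoxWatch's monitor code: the CheckSnapshot fields are hypothetical names, all times are in seconds, and the precedence when several conditions hold at once (stuck, then running long, then missed, then failing) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckSnapshot:
    interval: int                        # expected seconds between runs
    grace: int                           # slack in seconds
    last_terminal_at: float              # time of the latest success or fail ping
    last_terminal_kind: str              # "success" or "fail"
    last_start_at: Optional[float] = None
    max_duration: Optional[int] = None   # None disables "running long"


def classify(check: CheckSnapshot, now: float) -> str:
    """Return "up", "missed", "failing", "stuck", or "running_long"."""
    deadline = check.interval + check.grace
    # A run is open when a /start arrived after the latest terminal ping.
    run_open = (check.last_start_at is not None
                and check.last_start_at > check.last_terminal_at)
    if run_open:
        running_for = now - check.last_start_at
        if running_for > deadline:
            return "stuck"
        if check.max_duration is not None and running_for > check.max_duration:
            return "running_long"
    if now - check.last_terminal_at > deadline:
        return "missed"
    if check.last_terminal_kind == "fail":
        return "failing"
    return "up"


hourly = CheckSnapshot(interval=3600, grace=300,
                       last_terminal_at=0.0, last_terminal_kind="success")
print(classify(hourly, 3900.0))  # up: exactly at interval + grace
print(classify(hourly, 3901.0))  # missed: one second past the deadline
```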
Grace period
Grace is the slack you allow before "late" becomes "missed." Default is 5 minutes (300 seconds), which fits intervals of 10 minutes or longer; because grace is capped at half the interval, shorter intervals need a smaller value. For very tight schedules (every minute), drop it to 30 seconds or less. For jobs whose finish time drifts a little from run to run (a backup that's bigger on Mondays), raise it.
A reasonable rule of thumb: grace should cover normal jitter, not normal runtime variance. If your backup sometimes takes 4 hours and sometimes takes 6, your interval should be 8 hours, not "6 hours plus 2 hours of grace."
The maximum grace is half the interval, enforced by the API.
Anti-storm: how alerts are deduped
You get one alert per state transition, not one per missed cycle.
When a check transitions from up to missed, an alert fires once. The check stays in missed (with alerted_state = 'missed' stored alongside) until something changes. Subsequent monitor ticks see the alerted state already set and stay quiet.
When a success ping arrives and the check transitions back to up, BoxWatch clears the alerted state. The next time it fails, a fresh alert fires.
If a maintenance window is open on the linked server during the transition, the alert is suppressed but the alerted state is still recorded. That prevents a backlog burst when the window closes — by the time alerts resume, the state machine already considers the alert "delivered."
There is also a per-tick safety cap: if more than 50 checks transition into bad states in a single 60-second monitor tick (e.g. the API was offline for hours and just came back), BoxWatch logs an "alert storm suppressed" warning and rolls the rest into the digest. Single guard, cheap insurance.
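The per-check dedup rule can be sketched as a pure function evaluated on each monitor tick. This is one reading of the behavior described above, with hypothetical names (next_alert, BAD_STATES); it leaves out recovery notifications and the 50-check storm cap.

```python
from typing import Optional, Tuple

BAD_STATES = {"missed", "failing", "stuck", "running_long"}


def next_alert(state: str, alerted_state: Optional[str],
               in_maintenance: bool = False) -> Tuple[Optional[str], Optional[str]]:
    """Return (alert_to_send, new_alerted_state) for one check on one tick."""
    if state not in BAD_STATES:
        return None, None           # back to up: clear the alerted state
    if state == alerted_state:
        return None, alerted_state  # already alerted for this state: stay quiet
    if in_maintenance:
        return None, state          # suppressed, but recorded as if delivered
    return state, state             # fresh transition: alert exactly once


print(next_alert("missed", None))      # ('missed', 'missed')
print(next_alert("missed", "missed"))  # (None, 'missed')
print(next_alert("up", "missed"))      # (None, None)
```

Recording the state even when the alert is suppressed is what prevents the backlog burst: after the window closes, the second branch sees a matching alerted state and stays silent.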
Maintenance windows
Linking a check to a server means the check inherits that server's maintenance windows. While a window is open, the check still tracks its state — you'll see it transition in the dashboard — but no alerts are dispatched.
For ad-hoc pauses (e.g. you're rewriting the backup script and don't want noise for a week), use the per-check pause toggle instead. Paused checks are excluded from the monitor entirely.
Ping history & retention
Each check keeps its last 100 pings. New pings push out old ones — there's no separate cleanup job. On the detail page you'll see the 25 most recent, with type, exit code, duration, source IP, user agent, and a preview of any POSTed body.
The 100-ping limit is a hard cap, not a plan setting. It applies to every plan.
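A bounded queue shows the retention behavior in miniature. This is an in-memory illustration only; how BoxWatch actually stores pings is not described here, and the names below are hypothetical.

```python
from collections import deque

HISTORY_LIMIT = 100  # hard cap per check
PAGE_SIZE = 25       # rows shown on the detail page

history = deque(maxlen=HISTORY_LIMIT)


def record_ping(ping: dict) -> None:
    """Append a ping; once the deque is full, the oldest entry drops off."""
    history.append(ping)


def recent(n: int = PAGE_SIZE) -> list:
    """Newest-first view of the last n pings."""
    return list(history)[-n:][::-1]


for seq in range(130):
    record_ping({"seq": seq, "type": "success"})

print(len(history))        # 100
print(history[0]["seq"])   # 30: pings 0-29 were pushed out
print(recent()[0]["seq"])  # 129
```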
Plan caps
| Plan | Max cron checks |
|---|---|
| Hobby | 20 |
| Pro | 100 |
| Team | Unlimited |
| Scale | Unlimited |
Creating a check past your plan cap returns HTTP 402 plan_limit_reached from the API and shows an upgrade prompt in the dashboard. Existing checks keep working.
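A client can check the cap locally before calling the API. The sketch below mirrors the table above; can_create_check is a hypothetical helper, not a BoxWatch function, and None is used here to stand for "Unlimited".

```python
from typing import Dict, Optional

# Caps copied from the plan table; None means unlimited.
PLAN_CAPS: Dict[str, Optional[int]] = {
    "Hobby": 20,
    "Pro": 100,
    "Team": None,
    "Scale": None,
}


def can_create_check(plan: str, existing_checks: int) -> bool:
    """True if one more check fits under the plan's cap."""
    cap = PLAN_CAPS[plan]
    return cap is None or existing_checks < cap


print(can_create_check("Hobby", 19))   # True: this would be check number 20
print(can_create_check("Hobby", 20))   # False: the API would answer 402
print(can_create_check("Team", 5000))  # True
```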
Troubleshooting
Alerts are too noisy
Bump grace. If you're getting missed alerts because the job legitimately runs late sometimes, your grace is too small. Open the check and raise it. Grace can't exceed half the interval, so if you hit that cap, lengthen the interval instead.
The check shows "down" but my job is running
Make sure curl is exiting cleanly. Use curl -fsS so failures are visible. Check the cron user's PATH and any firewall blocking outbound to api.boxwatch.app. From the server's terminal:
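curl -fsS https://api.boxwatch.app/ping/YOUR-SLUG
You should get {"ok":true}. If you don't, the request is failing somewhere between the server and BoxWatch (DNS, firewall, proxy, or TLS). Because invalid slugs also return 200, a successful response confirms connectivity only; check the ping history to confirm the ping landed on the right check.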
I got an alert but the job actually ran
Check the ping history on the detail page. If the most recent ping is a /start with no terminal ping after it, the job started but didn't finish (or didn't manage to curl on its way out). If the most recent ping is a /fail, look at the exit code.
I deleted a check and the alert never resolved
Deleting a check cascades and removes the ping history. There's no "all clear" notification; the alert state just disappears with the check. If you need the alert to formally resolve, send a success ping before deleting.
My job has multiple steps — should I ping for each?
No. One check = one job. If you need fine-grained tracking, create separate checks for each step. The slug is the dedup key.
API
If you'd rather manage checks programmatically, see the API reference. Detailed cron-check API docs are coming soon.