Synthetic uptime checks

Synthetic uptime checks are user-defined probes (HTTP requests, TCP connections, TLS certificate expiry checks) that BoxWatch runs from your own servers. Most uptime services probe from their own cloud-hosted network; BoxWatch instead dispatches checks to the agents you've already installed.

That choice has two practical consequences:

You can probe services that aren't reachable from the public internet. Verify Redis on 10.0.0.5:6379 from web-01. Verify a private status endpoint behind your VPC's firewall. No tunneling, no exposing the service.
You're already paying for the servers. There's no per-check billing. Your fleet's geographic distribution becomes your monitoring topology for free.

The trade-off: probe locations are wherever your servers happen to be. If you want a probe from "Tokyo," you need a server in Tokyo.

Check types

Three kinds in v1:

HTTP — curl a URL, check status code and (optionally) the response body.
TCP — Open a TCP connection to host:port and verify it succeeds within a timeout.
TLS expiry — Connect via TLS to host:port and report when the cert expires.

There's no DNS resolution check in v1.

Adding a check

In the dashboard, go to Uptime → New check. Pick a type, target, and the servers it should run on. Save.

Or use the API:

POST /uptime-checks

Auth: bearer

curl -X POST https://api.boxwatch.app/uptime-checks \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Main site",
    "check_type": "http",
    "target": "https://example.com/health",
    "expected_status_codes": "200-299",
    "max_latency_ms": 2000,
    "body_contains": "OK",
    "follow_redirects": 1,
    "timeout_seconds": 10,
    "probe_server_ids": [42, 43, 44]
  }'

A check must have at least one probe server. The request body's probe_server_ids is the list of servers (by ID) that will run the probe. Up to 100 entries per check.
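
For scripted setups, the same request can be assembled in Python. This is a minimal sketch, not an official client: the helper name build_create_check_request is made up for illustration, the base URL and field names are copied from the curl example above, and the 1-100 check simply mirrors the documented limit on the client side. It builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.boxwatch.app"  # base URL taken from the curl example


def build_create_check_request(api_key: str, check: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /uptime-checks request.

    Hypothetical helper: only the endpoint, headers, and field names
    come from the documentation.
    """
    ids = check.get("probe_server_ids")
    # Mirror the documented rule: at least 1 and at most 100 probe servers.
    if not isinstance(ids, list) or not 1 <= len(ids) <= 100:
        raise ValueError("probe_server_ids must be a list of 1-100 server IDs")
    return urllib.request.Request(
        f"{API_BASE}/uptime-checks",
        data=json.dumps(check).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_create_check_request(
    "YOUR_API_KEY",
    {
        "name": "Main site",
        "check_type": "http",
        "target": "https://example.com/health",
        "probe_server_ids": [42, 43, 44],
    },
)
# To actually send it: urllib.request.urlopen(req)
```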

Probe servers and multi-vantage

When you assign a check to N servers, it runs N times per cycle — once on each. Each server reports its own result independently. The API aggregates the results into a single check status.

Aggregation logic:

up — every probe is OK.
degraded — at least one probe is down, but not a strict majority. The dashboard shows yellow; no alert fires.
down — strict majority of probes are down (more than half). After two consecutive down aggregations, an alert fires.

The two-tick flap guard is intentional. A single failed probe doesn't wake you up — a sustained majority outage does.
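
The aggregation rules above can be sketched as a small function. This is an illustration of the documented rules, not BoxWatch's actual implementation; the function name and the boolean input shape are assumptions.

```python
def aggregate(probe_ok: list[bool]) -> str:
    """Collapse per-probe results into "up", "degraded", or "down".

    probe_ok holds one entry per probe server: True if that probe passed.
    """
    if not probe_ok:
        raise ValueError("a check needs at least one probe result")
    down = sum(1 for ok in probe_ok if not ok)
    if down == 0:
        return "up"
    # Strict majority: more than half of the probes are down.
    if down * 2 > len(probe_ok):
        return "down"
    return "degraded"
```

Note the edge cases under these rules: with two probes, one failure is exactly half and therefore degraded, while a check with a single probe goes straight to down when that probe fails.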

Schedule

Checks run on the agent's heartbeat cadence. The push interval for every account is 5 minutes, so each probe server runs its assigned checks every 5 minutes.

There's no sub-tick scheduling in v1. The aggregate uses whatever the latest result from each probe is.
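
Because each probe server reports on its own heartbeat, the results feeding an aggregate are not necessarily from the same instant. A rough sketch of "latest result per probe" selection follows; the checked_at field and the function name are hypothetical, and only server_id and ok appear in the documented responses.

```python
def latest_per_probe(results: list[dict]) -> dict[int, dict]:
    """Keep only the newest result for each probe server."""
    latest: dict[int, dict] = {}
    for result in results:
        current = latest.get(result["server_id"])
        # checked_at is an assumed timestamp field (Unix seconds here).
        if current is None or result["checked_at"] > current["checked_at"]:
            latest[result["server_id"]] = result
    return latest


results = [
    {"server_id": 42, "checked_at": 1000, "ok": 1},
    {"server_id": 43, "checked_at": 1010, "ok": 1},
    {"server_id": 42, "checked_at": 1300, "ok": 0},  # newer result from server 42
]
snapshot = latest_per_probe(results)
```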

Alert types

Each toggle is per-check.

Down

Fires after two consecutive aggregated down observations. Default: on. Clears silently on the first non-down aggregate. (Set alert_on_recovery to also send a recovery notification.)

Recovery

Sends a "back up" notification on the first non-down aggregate after a down alert was sent. Default: off.
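
The Down and Recovery rules amount to a small state machine. The sketch below is one way to model it under the rules described here; the class, function, and event names are illustrative and not part of the BoxWatch API.

```python
from dataclasses import dataclass


@dataclass
class DownAlertState:
    consecutive_down: int = 0
    alert_active: bool = False


def step(state: DownAlertState, status: str, alert_on_recovery: bool = False) -> list[str]:
    """Feed one aggregated status in; return the notifications to send."""
    events: list[str] = []
    if status == "down":
        state.consecutive_down += 1
        # Flap guard: alert only on the second consecutive down aggregate.
        if state.consecutive_down >= 2 and not state.alert_active:
            state.alert_active = True
            events.append("down_alert")
    else:
        state.consecutive_down = 0
        if state.alert_active:
            # First non-down aggregate clears the alert (silently by default).
            state.alert_active = False
            if alert_on_recovery:
                events.append("recovery")
    return events
```

A single down tick followed by an up tick produces no events at all, which is the flap guard doing its job.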

TLS cert expiring

For tls_expiry checks and http checks against HTTPS targets, fires when the cert is within tls_warn_days_before_expiry days of expiring (default 14). Fires once when the threshold is first crossed; it doesn't re-alert until the cert has been renewed and later crosses the threshold again.
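
The fire-once behaviour can be sketched as follows. This is an illustration only: the function name is made up, and treating "within N days" as days_left <= N is an assumption about the boundary.

```python
def tls_warning_step(days_left: int, warn_days: int, already_warned: bool) -> tuple[bool, bool]:
    """Return (fire_alert, new_already_warned) for one observation."""
    inside_window = days_left <= warn_days  # assumed inclusive boundary
    if not inside_window:
        # Cert is comfortably valid (e.g. just renewed): re-arm the alert.
        return False, False
    if already_warned:
        # Still inside the window, but the alert has already fired once.
        return False, True
    return True, True


warned = False
fired = []
for days in [20, 14, 13, 80, 10]:  # the cert is renewed before the 80 reading
    fire, warned = tls_warning_step(days, 14, warned)
    fired.append(fire)
```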

HTTP options

The richest check type. Optional fields, all validated:

Field	Type	Default	Notes
`expected_status_codes`	string	`"200-299"`	Comma-separated list of codes and/or ranges, e.g. `"200,201"` or `"200-299,301"`. Codes must be 100-599.
`max_latency_ms`	int	unset	If set, response slower than this counts as a failed probe (`error_kind: latency_high`). 1-60000.
`body_contains`	string	unset	Literal-string match (`grep -F`), no regex. 1-500 chars. Failed match → `error_kind: body_mismatch`.
`follow_redirects`	0/1	1	When 0, a `301`/`302` counts according to the status-code rule.
`timeout_seconds`	int	10	1-60.

Custom request headers and request bodies for POST checks are on the roadmap. v1 uses GET with default headers.
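
To make the option semantics concrete, here is a sketch of how a probe result could be classified against these fields. The function names and the order in which the rules are applied are assumptions; only the field names, ranges, and error_kind values come from the table above.

```python
from typing import Optional


def parse_expected_status_codes(spec: str) -> list[range]:
    """Parse a spec such as "200-299,301" into a list of ranges."""
    ranges = []
    for part in spec.split(","):
        low_text, sep, high_text = part.strip().partition("-")
        low = int(low_text)
        high = int(high_text) if sep else low
        if not 100 <= low <= high <= 599:
            raise ValueError(f"invalid status code spec: {part!r}")
        ranges.append(range(low, high + 1))
    return ranges


def status_matches(code: int, spec: str) -> bool:
    return any(code in r for r in parse_expected_status_codes(spec))


def classify(status_code: int, latency_ms: int, body: str, *,
             expected: str = "200-299",
             max_latency_ms: Optional[int] = None,
             body_contains: Optional[str] = None) -> Optional[str]:
    """Return an error_kind, or None if the probe passes.

    The check order (status, then latency, then body) is an assumption.
    """
    if not status_matches(status_code, expected):
        return "http_status"
    if max_latency_ms is not None and latency_ms > max_latency_ms:
        return "latency_high"
    # Literal substring match, no regex, like grep -F.
    if body_contains is not None and body_contains not in body:
        return "body_mismatch"
    return None
```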

TCP and TLS-expiry examples

TCP

curl -X POST https://api.boxwatch.app/uptime-checks \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Redis from web tier",
    "check_type": "tcp",
    "target": "10.0.0.5:6379",
    "timeout_seconds": 5,
    "probe_server_ids": [42]
  }'

A TCP check asks only "did the connection succeed within the timeout?" There is no payload exchange.

TLS expiry

curl -X POST https://api.boxwatch.app/uptime-checks \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "example.com cert",
    "check_type": "tls_expiry",
    "target": "example.com:443",
    "tls_warn_days_before_expiry": 21,
    "probe_server_ids": [42]
  }'

The agent connects via TLS, reads the certificate, and reports cert_days_left. The "probe failed" state is reserved for connection failures — a cert that's still valid but close to expiring is a separate alert.
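
As a rough illustration of how a value like cert_days_left can be derived, the sketch below converts a certificate's notAfter string into whole days remaining. It is not the agent's code: the function name and the rounding (floor to whole days) are assumptions. The input format is the one Python's ssl module returns in the notAfter field of getpeercert().

```python
import ssl
import time
from typing import Optional


def cert_days_left(not_after: str, now: Optional[float] = None) -> int:
    """Whole days until expiry; negative once the cert has expired.

    not_after looks like "Jan  5 09:34:43 2018 GMT".
    now is the local clock in Unix seconds (defaults to time.time()).
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires_at - now) // 86400)
```

Passing now explicitly makes the function easy to test and makes its dependence on the local clock visible.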

Account limits

Every account gets unlimited uptime checks, free. Checks are counted account-wide, not per server: a single check that probes from 20 servers counts as one check.

API reference

GET /uptime-checks

Auth: bearer

List all checks with denormalized current status, probe count, and plan cap.

POST /uptime-checks

Auth: bearer

Create a check. probe_server_ids is required (array, 1-100 entries).

GET /uptime-checks/:id

Auth: bearer

Detail view. Returns the check, the list of probe servers, the latest result from each probe, the last 100 results combined, and a 24h uptime percentage.

{
  "check": { "id": 5, "name": "Main site", "last_status": "up", "..." },
  "probe_servers": [
    { "id": 42, "hostname": "web-01" },
    { "id": 43, "hostname": "web-02" }
  ],
  "per_probe_latest": [
    { "server_id": 42, "hostname": "web-01", "ok": 1, "status_code": 200, "latency_ms": 142 },
    { "server_id": 43, "hostname": "web-02", "ok": 1, "status_code": 200, "latency_ms": 156 }
  ],
  "uptime_pct_24h": 99.83
}
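
A script consuming this response might pull out the probes that are currently failing. The sketch below works on an already-decoded response dict; the function name and the shape of the summary are illustrative, while the per_probe_latest, hostname, and ok fields come from the sample above.

```python
def summarize_probes(detail: dict) -> dict:
    """Count probes and list the hostnames whose latest result failed."""
    probes = detail["per_probe_latest"]
    failing = [p["hostname"] for p in probes if not p["ok"]]
    return {"total": len(probes), "failing": failing}


detail = {
    "per_probe_latest": [
        {"server_id": 42, "hostname": "web-01", "ok": 1, "status_code": 200, "latency_ms": 142},
        {"server_id": 43, "hostname": "web-02", "ok": 0, "status_code": 503, "latency_ms": 98},
    ],
}
summary = summarize_probes(detail)
```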

PUT /uptime-checks/:id

Auth: bearer

Update mutable fields and/or probe_server_ids (which fully replaces the assignment list).

DELETE /uptime-checks/:id

Auth: bearer

Cascade-deletes the check, its probe assignments, and all stored results.

Why agent-side probing?

Three honest reasons:

No probe-traffic tax. Cloud-provider monitoring services bill you for the egress traffic their probes generate. Your agents already exist and already make HTTPS calls home; an extra probe is cheap.
Internal-network reach. A SaaS probe network can't see your private subnets. Your agents already live inside them. A tcp check against 10.0.0.5:6379 is trivial from web-01 and impossible from outside the network.
Honest geography. "Probed from 14 regions" is mostly theater if the regions don't match where your users are. Probes from your actual production servers are the closest possible proxy for what your actual users experience.

Troubleshooting

"Check is failing but the URL loads fine in my browser"

First, check whether the probe server has jq installed. Uptime checks need jq to parse the agent's config cache; without it, the agent skips them entirely.

command -v jq || sudo apt-get install -y jq   # Debian/Ubuntu; use your distro's package manager elsewhere

Then look at the per-probe error_kind in the dashboard. The most common reasons:

timeout — the server can't reach the URL at all. DNS, firewall, or the service is genuinely down.
http_status — the URL returned a code not in expected_status_codes. Adjust the codes or fix the endpoint.
body_mismatch — your body_contains string isn't in the response. Common when health endpoints change format.

"TLS expiry says expired but the cert is valid"

Clock skew on the probe server. Run timedatectl status and ensure NTP is on:

sudo timedatectl set-ntp true

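The TLS check compares the cert's notAfter to the local clock. A server whose clock is 48 hours ahead of real time will flag a cert that expires later today as already expired. (A clock that runs behind has the opposite effect: an expired cert can still look valid.)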
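
A quick worked example of how the direction of the skew changes the outcome, using made-up timestamps. A clock running ahead makes a valid cert look expired; a clock running behind does not.

```python
from datetime import datetime, timedelta, timezone


def looks_expired(not_after: datetime, local_now: datetime) -> bool:
    """True if the cert appears expired according to the local clock."""
    return local_now >= not_after


# Illustrative values: the cert expires later on the same day.
true_now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
not_after = datetime(2025, 6, 1, 23, 59, tzinfo=timezone.utc)

correct_clock = looks_expired(not_after, true_now)                      # still valid
clock_ahead = looks_expired(not_after, true_now + timedelta(hours=48))  # false "expired"
clock_behind = looks_expired(not_after, true_now - timedelta(hours=48)) # still valid
```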

"I want different schedules for different checks"

Per-check schedules (including sub-tick scheduling) aren't in v1. All servers run on the same 5-minute push interval, so every check runs every 5 minutes.

"I deleted a server but its old probe results are still showing"

Rows in uptime_probe_results are cascade-deleted along with their server, so removing a server also removes its probe history. If you still see stale data, refresh the dashboard.