title: "Process monitoring"
description: "Watch specific processes on each server: get alerted when nginx, postgres, or your queue worker crashes, restarts, or runs hot."
last_updated: "2026-05-24"
Process monitoring
Process monitoring lets you name specific processes that should be running on each server. The agent collects per-process CPU and memory data on every heartbeat, and BoxWatch fires an alert when a process disappears, restarts, or crosses a CPU or memory threshold.
This is host-side, not application-side. There's no SDK to import, no callback to register. You configure "watch nginx on web-01" in the dashboard, and the agent on web-01 does the rest.
How it works
- You tell BoxWatch a process name to watch on a server (e.g., nginx, postgres, redis-server).
- On the next heartbeat, the API tells the agent which processes to look at. The agent caches the list at /opt/boxwatch/processes.cache.
- Each tick, the agent runs pgrep -x <name> for each watched process. It sums CPU and RSS across all matching PIDs and records the oldest start time.
- The aggregated snapshot is sent in the heartbeat's processes array.
- The API compares the new sample against the previous one and fires an alert on any of the four state transitions described below.
Matching is exact name (pgrep -x). No wildcards, no command-line substring search. This is deliberate: it keeps the agent fast and the security surface small. If you want to watch a Python process, watch python; if you have multiple Python processes, BoxWatch will aggregate them all under that name.
Multi-instance processes (a worker pool, an Nginx with N workers) are summed into one row. Count is the number of running PIDs; CPU and RSS are totals.
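As a rough illustration of the aggregation described above, the sketch below collapses several PIDs into one row. The real agent is written in bash; the PidSample type and the cpu_pct, rss_kb, and start_unix field names are assumptions made for this example. Only count and oldest_start_unix appear elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass
class PidSample:
    """One matching PID. Field names are hypothetical."""
    pid: int
    cpu_pct: float
    rss_kb: int
    start_unix: int


def aggregate(samples: list[PidSample]) -> dict:
    """Collapse all PIDs sharing one process name into a single row."""
    if not samples:
        # No matching PID: count is 0 and there is no start time to report.
        return {"count": 0, "cpu_pct": 0.0, "rss_kb": 0, "oldest_start_unix": None}
    return {
        "count": len(samples),                                   # running PIDs
        "cpu_pct": sum(s.cpu_pct for s in samples),              # total CPU
        "rss_kb": sum(s.rss_kb for s in samples),                # total RSS
        "oldest_start_unix": min(s.start_unix for s in samples), # oldest PID
    }


# Two workers of the same pool end up as one row.
workers = [
    PidSample(pid=101, cpu_pct=2.5, rss_kb=4000, start_unix=1_700_000_000),
    PidSample(pid=102, cpu_pct=1.5, rss_kb=6000, start_unix=1_700_000_050),
]
row = aggregate(workers)
```

Because the totals are sums, a CPU threshold on a ten-worker pool applies to the pool as a whole, not to each worker.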
Adding a watched process
In the dashboard, open the server detail page and find the Processes panel. Click + Watch new, type the process name, and pick which alerts to enable.
A new watched process starts in pending state. On the next heartbeat (up to 60 minutes on Hobby, 5 minutes on Pro/Team, 1 minute on Scale), the agent picks up the new config and starts reporting.
Over the API:
curl -X POST https://api.boxwatch.app/servers/42/processes \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "process_name": "nginx",
    "alert_on_down": 1,
    "alert_on_restart": 0,
    "alert_on_cpu": 1,
    "cpu_threshold_pct": 75,
    "alert_on_memory": 0
  }'

{
  "process": {
    "id": 17,
    "server_id": 42,
    "process_name": "nginx",
    "enabled": 1,
    "alert_on_down": 1,
    "alert_on_restart": 0,
    "alert_on_cpu": 1,
    "alert_on_memory": 0,
    "cpu_threshold_pct": 75.00,
    "memory_threshold_pct": 25.00,
    "last_sample_at": null,
    "last_count": null,
    "alerted_state": null,
    "status": "pending"
  }
}

Process name validation rejects whitespace, newlines, and shell metacharacters. Specifically, any of the following characters cause a 400 validation error:
; | & $ ` \ " ' < > ( ) { }
Names that are 1-100 characters and look like a normal pgrep -x argument are accepted.
Alert types
There are four alert types, all toggleable per watched process.
Process down
Fires when the previous heartbeat had count > 0 and the new heartbeat has count = 0. Once alerted, the state is sticky: subsequent zero-count heartbeats do not re-alert. Only state transitions trigger.
Process restarted
Fires when count > 0, the oldest start time has moved forward since the previous heartbeat, and the restart alert is enabled. Detection is automatic from oldest_start_unix, so no agent-side state is required.
There's one suppression rule: if a process was already in the down state and recovers with a new PID, the recovery looks like a restart by definition, so the restart alert is suppressed. The earlier down alert was the real signal.
Process CPU high
Fires when aggregate CPU (sum across all PIDs with this name) crosses cpu_threshold_pct. Default threshold is 50%. The alert clears on the next sample that drops below the threshold.
Process memory high
Fires when aggregate RSS as a percent of host RAM crosses memory_threshold_pct. Default threshold is 25%. Clears the same way.
Priority order: a single heartbeat can match more than one condition, but only one alert fires. Priority is down > restarted > cpu_high > memory_high.
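The four conditions and their priority can be sketched as a single function. This is an illustration under stated assumptions, not BoxWatch's server code: the function name, the cpu_pct and mem_pct sample fields, and the choice to treat "crosses" as "previous sample at or below the threshold, current sample above it" are all hypothetical.

```python
from typing import Optional

PRIORITY = ("down", "restarted", "cpu_high", "memory_high")


def classify(prev: dict, curr: dict, cfg: dict) -> Optional[str]:
    """Return the single alert to fire for this heartbeat, or None."""
    matched = set()
    was_down = prev["count"] == 0

    # Down: only the transition from running to not running fires.
    if cfg["alert_on_down"] and not was_down and curr["count"] == 0:
        matched.add("down")

    # Restarted: oldest start time moved forward. Recovery from a down
    # state is suppressed, since the down alert was the real signal.
    if (
        cfg["alert_on_restart"]
        and curr["count"] > 0
        and not was_down
        and curr["oldest_start_unix"] > prev["oldest_start_unix"]
    ):
        matched.add("restarted")

    # Threshold alerts fire on an upward crossing (assumed definition).
    if cfg["alert_on_cpu"] and prev["cpu_pct"] <= cfg["cpu_threshold_pct"] < curr["cpu_pct"]:
        matched.add("cpu_high")
    if cfg["alert_on_memory"] and prev["mem_pct"] <= cfg["memory_threshold_pct"] < curr["mem_pct"]:
        matched.add("memory_high")

    # Only the highest-priority match is reported.
    return next((name for name in PRIORITY if name in matched), None)


cfg = {
    "alert_on_down": 1, "alert_on_restart": 1,
    "alert_on_cpu": 1, "cpu_threshold_pct": 50.0,
    "alert_on_memory": 1, "memory_threshold_pct": 25.0,
}
running = {"count": 2, "oldest_start_unix": 1000, "cpu_pct": 10.0, "mem_pct": 5.0}
gone = {"count": 0, "oldest_start_unix": None, "cpu_pct": 0.0, "mem_pct": 0.0}
restarted_hot = {"count": 2, "oldest_start_unix": 2000, "cpu_pct": 80.0, "mem_pct": 5.0}
```

With these samples, classify(running, gone, cfg) reports a down alert, classify(gone, gone, cfg) reports nothing, and classify(running, restarted_hot, cfg) reports only the restart even though CPU also crossed its threshold.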
Anti-storm protections
- State is sticky. A process that's been down for 12 hours doesn't generate 144 alerts on Pro's 5-minute tick (720 minutes / 5). It generates one. The state machine only fires on transitions.
- Per-heartbeat cap. The heartbeat handler dispatches at most 10 process alerts per server per heartbeat. If you have 50 watched processes and they all die at once (host outage), you get the first 10. A summary log entry is written for the rest.
- Down→restart suppression (described above) keeps a single incident from producing two alerts.
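The per-heartbeat cap can be pictured as a simple split of the pending alerts. The sketch below is an assumption about the shape of that logic: the function name and the wording of the summary entry are hypothetical, and only the limit of 10 comes from the text above.

```python
from typing import Optional

MAX_ALERTS_PER_HEARTBEAT = 10  # the cap stated above


def apply_cap(pending: list[str], cap: int = MAX_ALERTS_PER_HEARTBEAT) -> tuple[list[str], Optional[str]]:
    """Split pending alerts into those dispatched and a summary of the rest."""
    dispatched = pending[:cap]
    suppressed = pending[cap:]
    summary = None
    if suppressed:
        # Hypothetical log wording; the real entry format is not documented here.
        summary = f"{len(suppressed)} process alerts suppressed by per-heartbeat cap"
    return dispatched, summary


# A host outage takes down 50 watched processes in one heartbeat.
outage = [f"down:proc-{i}" for i in range(50)]
sent, note = apply_cap(outage)
```

In this example the first 10 alerts are dispatched and the remaining 40 are folded into one summary entry.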
Plan limits
Watched processes are capped per server, not per account:
- Hobby: 10 watched processes per server
- Pro: 50 watched processes per server
- Team: 100 watched processes per server
- Scale: unlimited
Hitting the cap returns HTTP 402 with scope: "per_server" and the current count/limit. Downgrading a plan never deletes existing watched processes; you'll just be blocked from adding new ones until you're back under the cap.
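To make the cap and downgrade behavior concrete, here is a minimal sketch of the check. The function name and the count and limit keys in the error body are assumptions; the page only states that the 402 response carries scope: "per_server" plus the current count and limit.

```python
from typing import Optional

# Per-server caps from the list above; None stands for unlimited.
PLAN_CAPS = {"hobby": 10, "pro": 50, "team": 100, "scale": None}


def cap_error(plan: str, current_count: int) -> Optional[dict]:
    """Return a 402-style error body if adding one more process would exceed the cap."""
    limit = PLAN_CAPS[plan]
    if limit is None or current_count < limit:
        return None  # the add is allowed
    # Key names other than "scope" are hypothetical.
    return {"status": 402, "scope": "per_server", "count": current_count, "limit": limit}


# A server that had 80 watched processes on Team, after a downgrade to Pro:
after_downgrade = cap_error("pro", 80)
```

Nothing is deleted in the downgrade case: the 80 existing rows stay, and adds are refused until the count drops below 50.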
API reference
Lists all watched processes for a server, with denormalized latest sample values and the plan cap.
Add a watched process. Fields:
| Field | Type | Default | Notes |
|---|---|---|---|
| process_name | string | required | 1-100 chars. Matched via pgrep -x. |
| enabled | 0/1 | 1 | Set to 0 to pause without deleting. |
| alert_on_down | 0/1 | 1 | |
| alert_on_restart | 0/1 | 0 | |
| alert_on_cpu | 0/1 | 0 | |
| cpu_threshold_pct | decimal | 50.00 | 0.01-100. |
| alert_on_memory | 0/1 | 0 | |
| memory_threshold_pct | decimal | 25.00 | 0.01-100. Of host RAM. |
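To see how the defaults combine with the fields you send, a client-side helper might merge them as follows. This is a sketch, not an official client: the build_payload name is hypothetical, and the server remains the source of truth for defaults and validation.

```python
# Defaults copied from the field table above.
DEFAULTS = {
    "enabled": 1,
    "alert_on_down": 1,
    "alert_on_restart": 0,
    "alert_on_cpu": 0,
    "cpu_threshold_pct": 50.00,
    "alert_on_memory": 0,
    "memory_threshold_pct": 25.00,
}


def build_payload(process_name: str, **overrides) -> dict:
    """Merge caller overrides onto the documented defaults and range-check thresholds."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    payload = {"process_name": process_name, **DEFAULTS, **overrides}
    for key in ("cpu_threshold_pct", "memory_threshold_pct"):
        if not 0.01 <= payload[key] <= 100:
            raise ValueError(f"{key} must be between 0.01 and 100")
    return payload


# Mirrors the nginx example: CPU alert on at 75%, everything else default.
payload = build_payload("nginx", alert_on_cpu=1, cpu_threshold_pct=75)
```

The memory threshold stays at its 25.00 default here, which matches the memory_threshold_pct value in the example response earlier on this page.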
Detail view. Returns the watched-process row plus up to 100 most recent samples and recent_restarts (count of unique restart events in the last 24 hours).
Update mutable fields. Same field list as POST, minus process_name, which is fixed (delete and re-add to rename).
Remove a watched process. Cascade-deletes its samples.
Why not full process-tree introspection?
BoxWatch isn't an APM. We deliberately watch only what you say to watch, by name. No tree walking, no command-line parsing, no auto-discovery. The result is:
- The agent stays under 1000 lines of bash.
- A new watched process appears on the next heartbeat without any code change.
- The data volume is bounded by your configured list, not the entire host's process table.
If you want full process-tree introspection, BoxWatch can run alongside an APM tool; they don't conflict.
Troubleshooting
"I added nginx but the dashboard says pending for ages"
The agent picks up new watched processes on its next heartbeat after the API tells it about them. That's one tick of cron, but on Hobby (60-minute push interval) it can take up to an hour. Pro/Team is 5 minutes; Scale is 1 minute.
Force a manual run if you don't want to wait:
sudo /opt/boxwatch/agent.sh

"Process is running but BoxWatch says it's missing"
Check the exact name pgrep -x is matching against:
pgrep -x nginx

If that returns nothing but pgrep nginx (without -x) does, the binary's basename isn't nginx. Watch the actual name. Common gotchas: python3 vs. python, node vs. the wrapper script that execs it.
"Spurious restart alerts every hour"
The process is restarting on its own. Check systemctl status <service> and look at the restart count. The agent isn't generating the events; your service manager is.
"Old agent" banner on the server detail page
Process monitoring requires agent v2.0+. Re-run the install command to upgrade:
curl -sL https://boxwatch.app/install.sh | bash -s YOUR_AGENT_KEY

"macOS server isn't reporting process data"
The collection block uses /proc/<pid>/stat, which is Linux-only. macOS and BSD agents don't report process samples in v1.
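For context on why this is Linux-specific, the sketch below parses one line in the /proc/<pid>/stat layout documented in proc(5). It is an illustration only: the agent itself is bash, and which fields it actually reads is an assumption. The function and key names are hypothetical.

```python
def parse_proc_stat(line: str) -> dict:
    """Extract a few fields from one /proc/<pid>/stat line (layout per proc(5))."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    lparen = line.find("(")
    rparen = line.rfind(")")
    rest = line[rparen + 2:].split()  # rest[0] is field 3 (state)
    return {
        "pid": int(line[:lparen].strip()),
        "comm": line[lparen + 1:rparen],
        "utime_ticks": int(rest[11]),      # field 14: user-mode CPU time
        "stime_ticks": int(rest[12]),      # field 15: kernel-mode CPU time
        "starttime_ticks": int(rest[19]),  # field 22: start time after boot
        "rss_pages": int(rest[21]),        # field 24: resident set size in pages
    }


# A made-up sample line, truncated after field 25.
sample = (
    "1234 (nginx) S 1 1234 1234 0 -1 4194560 500 0 0 0 "
    "15 7 0 0 20 0 1 0 98765 10485760 256 18446744073709551615"
)
fields = parse_proc_stat(sample)
```

macOS and the BSDs do not expose this file, so a port would need a different data source for the same numbers.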
Process monitoring is included on every plan. There's no "Pro+" gate: even free accounts can watch up to 10 processes per server. The plan affects the cap, not whether the feature exists.