ConnectWatch — Status & Alerts Guide

How probing works

ConnectWatch runs a probe cycle on a fixed interval — every 5 minutes on Enterprise, 15 on Pro, 60 on Starter. Each probe connects to your server, performs a real operation (SSH login, file write, HTTP request), measures how long it takes, and records the result.

Successful probe: Interval fires → Connect to server → Run operation → ✓ Record result → Update dashboard

Failed probe: Interval fires → Connection fails → Open incident → Send alert

Important: ConnectWatch does not just ping your server. It performs a real operation — SSH login, writing a 4KB test file, reading it back, deleting it. This catches silent failures like a mounted filesystem going read-only, or SSH keys that have been revoked.

A server can respond to ping and still fail a filesystem probe if the NFS mount has gone stale.
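
The write, read-back, delete sequence can be sketched in a few lines. This is a minimal local illustration, not ConnectWatch's implementation: the real probe runs over SSH, while this sketch runs on the machine itself. The function name `fs_probe` and the result keys are hypothetical.

```python
import os
import tempfile
import time
import uuid


def fs_probe(probe_dir, size=4096):
    """Write `size` bytes, read them back, delete the file, and time each step."""
    path = os.path.join(probe_dir, f"probe-{uuid.uuid4().hex}.tmp")
    payload = os.urandom(size)
    timings = {}
    try:
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the write through to storage
        timings["write_ms"] = (time.perf_counter() - start) * 1000

        start = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read()
        timings["read_ms"] = (time.perf_counter() - start) * 1000
        if data != payload:
            return {"success": False, "error": "read-back mismatch", **timings}

        start = time.perf_counter()
        os.remove(path)
        timings["delete_ms"] = (time.perf_counter() - start) * 1000
    except OSError as exc:
        # Read-only mounts, full disks and lost permissions all surface here,
        # which is exactly what a ping cannot detect.
        return {"success": False, "error": str(exc), **timings}
    return {"success": True, "error": None, **timings}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as probe_dir:
        print(fs_probe(probe_dir))
```

Pointing `fs_probe` at a directory on a stale or read-only mount returns `success: False` with the operating system's error text, even though the host still answers ping.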

Probe statuses

Every probe result has one of four statuses:

OK

The probe connected and completed the operation within the expected time threshold.

SSH: Connected, uploaded file, got confirmation
Filesystem: Wrote 4KB, read it back, deleted — within warn threshold
HTTPS: Got 2xx response within timeout

Degraded

The probe succeeded but took longer than your configured warn threshold. The operation completed — it's just slow.

Causes: High I/O load, network congestion, EFS burst credit exhaustion, NFS latency spike
Threshold: Set per connector in env-config (default 100 ms for Filesystem, 500 ms for SSH)

Failed

The probe could not complete the operation. This opens an incident immediately.

Causes: Server unreachable, SSH key rejected, permission denied, disk full, filesystem read-only, connection timeout

No data

The connector is enabled but has not run a probe yet, or the environment was just created.

When: First probe hasn't fired yet (wait one interval), or the connector was just toggled on
Action: Wait for the next probe cycle. Use On-Demand → Diagnostic to trigger one immediately.
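
The four statuses reduce to a small decision rule. The sketch below is illustrative only: `probe_status` and `DEFAULT_WARN_MS` are hypothetical names, and treating a latency exactly equal to the threshold as OK is an assumption.

```python
from typing import Optional

# Default warn thresholds quoted above, in milliseconds.
DEFAULT_WARN_MS = {"filesystem": 100, "ssh": 500}


def probe_status(success: Optional[bool], latency_ms: Optional[float], warn_ms: float) -> str:
    """Map one probe result to OK, Degraded, Failed, or No data."""
    if success is None:
        return "No data"   # connector enabled, but no probe has run yet
    if not success:
        return "Failed"    # the operation could not complete
    if latency_ms is not None and latency_ms > warn_ms:
        return "Degraded"  # completed, but slower than the warn threshold
    return "OK"


if __name__ == "__main__":
    print(probe_status(True, 180.0, DEFAULT_WARN_MS["filesystem"]))
```

With the default filesystem threshold, a successful probe that took 180 ms is Degraded, while the same latency on an SSH probe is still OK.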

Environment health states

Each environment card on your dashboard shows a combined health state across all its connectors.

State	Badge	Meaning	What triggers it
Healthy	✓ Healthy	All probes are passing within their thresholds	Every enabled connector returned `success: true` with latency within its warn threshold
Degraded	⚠ Degraded	All probes are completing but one or more are slow	At least one probe latency exceeded its warn threshold but no probe failed outright
Incident	✗ Incident	One or more probes have failed and an incident is open	Any probe returned `success: false`
No data	— No data	No probes have run yet for this environment	New environment or no connectors enabled

Degraded ≠ Down. A degraded environment is still working — all operations are completing. But slower-than-expected performance often predicts an upcoming failure. On AWS EFS, latency spikes are a common early warning sign of burst credit exhaustion.
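
The table above implies an order of precedence: any failure wins over slowness, and slowness wins over healthy. A minimal sketch of that roll-up follows. The function name and the shape of each result dict are assumptions, and connectors that have not yet run are simply left out of the input list.

```python
def environment_health(results):
    """Combine per-connector results into one environment state.

    Each result is a dict with 'success' (bool), 'latency_ms' and 'warn_ms'.
    """
    if not results:
        return "No data"    # new environment or no connectors enabled
    if any(not r["success"] for r in results):
        return "Incident"   # any failed probe outranks everything else
    if any(r["latency_ms"] > r["warn_ms"] for r in results):
        return "Degraded"   # all completed, at least one was slow
    return "Healthy"


if __name__ == "__main__":
    print(environment_health([
        {"success": True, "latency_ms": 40, "warn_ms": 100},
        {"success": True, "latency_ms": 650, "warn_ms": 500},
    ]))
```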

Incidents — P1 and P2

An incident opens automatically the first time a probe fails. ConnectWatch does not wait for a second failure — one failed probe is enough. This is intentional: filesystem failures are rarely transient.

P1 — Critical

Your HTTPS endpoint is unreachable. This means your service is likely down for end users.

Triggered by: HTTPS probe failure
Alerts: Email + Slack + PagerDuty immediately (Slack and PagerDuty where your plan includes them)
Response target: <15 minutes

P2 — Warning

A connectivity or filesystem probe has failed. Your service may still be partially working but infrastructure is degraded.

Triggered by: SSH, Filesystem, S3, Azure, SFTP, GCS probe failure
Alerts: Email + Slack
Response target: <1 hour

Incident lifecycle

1. Probe fails: first failure for this probe type in this environment.
2. Incident opens: status set to open, timestamp recorded.
3. Alerts sent: Email / Slack / PagerDuty, once per incident, not per probe.
4. Probes continue: subsequent failures do NOT re-alert; the incident stays open.
5. Probe succeeds: incident auto-resolves, duration calculated.
6. Recovery alert: Email / Slack / PagerDuty resolve notification sent.

One alert per incident. If your filesystem probe fails every 5 minutes for 2 hours, you get one alert when it opens and one when it recovers — not one every 5 minutes. This prevents alert fatigue.
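
The one-alert-per-incident rule is a small state machine keyed by environment and probe type. The sketch below is an illustration under assumed names (`IncidentTracker`, `record`), with timestamps given as plain minutes for readability.

```python
class IncidentTracker:
    """Open an incident on the first failure, resolve it on the next success."""

    def __init__(self):
        self.open_incidents = {}  # (environment, probe_type) -> opened_at
        self.alerts = []          # every notification that would be sent

    def record(self, environment, probe_type, success, at):
        key = (environment, probe_type)
        if not success and key not in self.open_incidents:
            # First failure: open the incident and alert once.
            self.open_incidents[key] = at
            self.alerts.append(("opened", environment, probe_type, at))
        elif success and key in self.open_incidents:
            # First success after a failure: resolve and report the duration.
            opened_at = self.open_incidents.pop(key)
            self.alerts.append(("resolved", environment, probe_type, at - opened_at))
        # Repeat failures and repeat successes produce no alert.


if __name__ == "__main__":
    tracker = IncidentTracker()
    for minute in range(0, 120, 5):  # 24 failed probes over 2 hours
        tracker.record("prod", "filesystem", False, minute)
    tracker.record("prod", "filesystem", True, 120)
    print(tracker.alerts)
```

Running the example records 24 failures followed by one success and produces exactly two alerts: one open and one resolve with a duration of 120 minutes.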

Alert channels

Channel	When	Plan	Configure
Email	Every incident open + resolve	All plans	Dashboard → Branding & Alerts → Alert email
Slack	Every incident open + resolve	Pro + Enterprise	Dashboard → Branding & Alerts → Slack webhook URL
PagerDuty	P1 incidents only (HTTPS failures)	Enterprise	Dashboard → Branding & Alerts → PagerDuty integration key
Custom webhook	Every incident open + resolve	Enterprise	Env-config → Webhook connector
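
Read together with the severity rules, the table above determines which channels fire for a given incident. The sketch below is a hypothetical helper, not a ConnectWatch API. It assumes every channel the plan allows has actually been configured, and it uses lowercase identifiers for plans and probe types.

```python
P2_PROBES = {"ssh", "filesystem", "s3", "azure", "sftp", "gcs"}
PLANS = {"starter", "pro", "enterprise"}


def alert_channels(probe_type, plan):
    """Return (severity, channels) for a failed probe on a given plan."""
    if plan not in PLANS:
        raise ValueError(f"unknown plan: {plan}")
    if probe_type == "https":
        severity = "P1"
    elif probe_type in P2_PROBES:
        severity = "P2"
    else:
        raise ValueError(f"unknown probe type: {probe_type}")

    channels = ["email"]                 # all plans
    if plan in ("pro", "enterprise"):
        channels.append("slack")         # Pro + Enterprise
    if plan == "enterprise":
        if severity == "P1":
            channels.append("pagerduty") # P1 incidents only
        channels.append("webhook")       # custom webhook connector
    return severity, channels


if __name__ == "__main__":
    print(alert_channels("https", "enterprise"))
    print(alert_channels("filesystem", "pro"))
```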

Probe types explained

Probe	What it does	What it catches	Severity
SSH	Opens SSH connection, uploads a small test file via SFTP, verifies receipt	SSH service down, key revoked, firewall change, disk full	P2
Filesystem	SSHes in, runs `dd` to write 4KB, reads it back, deletes it — measures each step	NFS/EFS mount stale, read-only filesystem, permission change, I/O latency spike, disk full	P2
HTTPS	HTTP GET to your configured health endpoint, checks status code and response time	Service down, certificate expired, load balancer failure, DNS failure	P1
S3	Uploads a test object, verifies ETag, deletes it	IAM permission change, bucket policy change, S3 regional outage	P2
Azure Blob	Uploads test blob to container, verifies, deletes	Connection string expired, container deleted, Azure outage	P2
SFTP	Connects via SFTP, uploads test file to configured path	SFTP service down, key rejected, upload path missing	P2
GCS	Uploads test object to GCS bucket using service account key	Service account key expired, bucket permissions changed	P2

Common error messages

Error message	Cause	Fix
`Permission denied — check that user has write access to /mnt/…`	Probe user lost write permission to the probe path	`sudo chown probeuser /mnt/efs/connectwatch-probe`
`No space left on device`	Filesystem is full	Free disk space. A regular probe needs only a few KB.
`Filesystem is mounted read-only`	Linux remounted the FS read-only after an I/O error	Check `dmesg` for I/O errors. Remount: `sudo mount -o remount,rw /mnt/efs`
`SSH authentication failed — check private key matches authorized_keys`	SSH key was changed or revoked on the server	Re-run key setup. See FS probe setup guide.
`Connection timed out`	Firewall blocking port 22, server unreachable, wrong IP	Check security group / firewall rules allow ConnectWatch egress IPs on port 22.
`Path not found: /mnt/efs/… — create the directory first`	Probe directory was deleted or never created	`sudo mkdir -p /mnt/efs/connectwatch-probe && sudo chown probeuser /mnt/efs/connectwatch-probe`
`Filesystem write timed out`	Filesystem I/O is hung — NFS server unreachable, EFS mount hanging	Check NFS server health. May need to unmount and remount the filesystem.
`Unexpected status 503`	HTTPS probe got a non-2xx response — service or load balancer is down	Check your application and load balancer logs.

What to do when something fails

Filesystem probe failed

1. Check the error message in the incident — it tells you exactly what failed.
2. SSH into the server manually and try: `touch /mnt/efs/connectwatch-probe/test && rm /mnt/efs/connectwatch-probe/test`
3. If permission denied → `sudo chown probeuser /mnt/efs/connectwatch-probe`
4. If disk full → free space, then wait for the next probe cycle.
5. If mount is hung → unmount and remount the filesystem.
6. Once fixed, the next probe cycle will auto-resolve the incident.

SSH probe failed

1. Try SSHing manually: `ssh -i /path/to/key probeuser@your-server-ip`
2. If connection refused → check SSH service: `systemctl status ssh` (the unit is named `sshd` on RHEL-based distributions)
3. If permission denied → check `~probeuser/.ssh/authorized_keys` still has the key.
4. If timeout → check firewall / security group allows port 22 from ConnectWatch.

HTTPS probe failed (P1)

1. Try the URL in your browser immediately.
2. Check your application logs for errors.
3. Check your load balancer health checks.
4. Check SSL certificate expiry: `openssl s_client -connect yourdomain.com:443 -servername yourdomain.com </dev/null 2>/dev/null | openssl x509 -noout -enddate`
5. Check DNS: `dig yourdomain.com`
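
The certificate check in step 4 can also be scripted with Python's standard `ssl` module. This is a minimal sketch: the function names are hypothetical and `yourdomain.com` is the same placeholder used in the steps above.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10):
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # An already-expired certificate fails verification here and raises
        # ssl.SSLCertVerificationError, which is itself the diagnosis.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds, default: current time) and expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


# Example (needs network access, so it is not executed here):
#   print(days_until_expiry(fetch_not_after("yourdomain.com")))
```

A negative or very small result points to an expired or soon-to-expire certificate as the cause of the P1.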

Environment showing Degraded

Degraded means slow, not down. Common causes:
• EFS burst credit exhaustion — run the 1GB on-demand test to confirm, then switch to Provisioned Throughput.
• NFS server under load — check NFS server CPU and I/O metrics.
• Network congestion — check latency between app server and storage server.
• High I/O on the server — another process is saturating disk I/O.

Incidents auto-resolve. You don't need to manually close an incident. As soon as the next probe succeeds, ConnectWatch marks it resolved and sends a recovery alert. You just need to fix the underlying issue.