
Status & Alerts Guide

Everything ConnectWatch shows on your dashboard — what each status means, when an incident opens, when you get alerted, and what to do about it.

On this page
How probing works
Probe statuses — OK, Degraded, Failed
Environment health states
Incidents — P1 and P2
Alert channels
Probe types explained
Common error messages
What to do when something fails

How probing works

ConnectWatch runs a probe cycle on a fixed interval — every 5 minutes on Enterprise, 15 on Pro, 60 on Starter. Each probe connects to your server, performs a real operation (SSH login, file write, HTTP request), measures how long it takes, and records the result.

Successful cycle: Interval fires → Connect to server → Run operation → ✓ Record result → Update dashboard

Failed cycle: Interval fires → Connection fails → Open incident → Send alert

Important: ConnectWatch does not just ping your server. It performs a real operation — SSH login, writing a 4KB test file, reading it back, deleting it. This catches silent failures like a mounted filesystem going read-only, or SSH keys that have been revoked.

A server can respond to ping and still fail a filesystem probe if the NFS mount has gone stale.
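
The write, read-back, delete sequence described above can be sketched locally. This is a minimal illustration, not ConnectWatch's implementation: the real probe runs over SSH, whereas this sketch touches a local directory, and the function name `filesystem_probe`, the temp file name, and the result keys are assumptions made for the example.

```python
import os
import time
import uuid


def filesystem_probe(probe_dir: str, size_bytes: int = 4096) -> dict:
    """Write a test file, read it back, delete it, and time the round trip.

    Hypothetical helper: a local stand-in for the remote probe described above.
    """
    # Unique, made-up file name so concurrent runs do not collide.
    path = os.path.join(probe_dir, f"cw-probe-{uuid.uuid4().hex}.tmp")
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    try:
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the write through to the filesystem
        with open(path, "rb") as fh:
            if fh.read() != payload:
                raise OSError("read-back mismatch")
        os.remove(path)
    except OSError as exc:
        # Read-only mounts, full disks, and missing paths all surface here.
        return {"success": False, "latency_ms": None, "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"success": True, "latency_ms": latency_ms, "error": None}
```

A ping would succeed in every failure case this function reports, which is the point of probing with a real operation.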

Probe statuses

Every probe result has one of three statuses (OK, Degraded, or Failed); a connector with no result yet shows No data:

OK
The probe connected and completed the operation within the expected time threshold.
SSH: Connected, uploaded file, got confirmation
Filesystem: Wrote 4KB, read it back, deleted — within warn threshold
HTTPS: Got 2xx response within timeout
Degraded
The probe succeeded but took longer than your configured warn threshold. The operation completed — it's just slow.
Cause: High I/O load, network congestion, EFS burst credit exhaustion, NFS latency spike
Threshold: Set per connector in env-config — default 100ms for FS, 500ms for SSH
Failed
The probe could not complete the operation. This opens an incident immediately.
Causes: Server unreachable, SSH key rejected, permission denied, disk full, filesystem read-only, connection timeout
No data
The connector is enabled but has not run a probe yet, or the environment was just created.
When: First probe hasn't fired yet (wait one interval), or the connector was just toggled on
Action: Wait for the next probe cycle. Use On-Demand → Diagnostic to trigger one immediately.
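
The status rules above reduce to a small decision function. The sketch below is an illustration under assumptions: the function name, the use of `None` for "no probe has run", and the choice to treat a latency exactly equal to the threshold as OK are not taken from ConnectWatch itself. The default thresholds are the ones quoted above.

```python
from typing import Optional

# Default warn thresholds quoted in this guide (milliseconds).
DEFAULT_WARN_MS = {"filesystem": 100.0, "ssh": 500.0}


def probe_status(success: Optional[bool],
                 latency_ms: Optional[float],
                 warn_ms: float) -> str:
    """Map one probe result to OK, Degraded, Failed, or No data.

    success=None is this sketch's (assumed) way of saying no probe has run.
    """
    if success is None:
        return "No data"
    if not success:
        return "Failed"
    if latency_ms is not None and latency_ms > warn_ms:
        return "Degraded"  # completed, but slower than the warn threshold
    return "OK"
```

For example, a filesystem probe that completes in 180 ms against the 100 ms default is Degraded, not Failed.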

Environment health states

Each environment card on your dashboard shows a combined health state across all its connectors.

| State | Badge | Meaning | What triggers it |
| --- | --- | --- | --- |
| Healthy | ✓ Healthy | All probes are passing within their thresholds | Every enabled connector returned success: true and latency below warn threshold |
| Degraded | ⚠ Degraded | All probes are completing but one or more are slow | At least one probe latency exceeded its warn threshold but no probe failed outright |
| Incident | ✗ Incident | One or more probes have failed and an incident is open | Any probe returned success: false |
| No data | — No data | No probes have run yet for this environment | New environment or no connectors enabled |

Degraded ≠ Down. A degraded environment is still working — all operations are completing. But slower-than-expected performance often predicts an upcoming failure. On AWS EFS, latency spikes are a common early warning sign of burst credit exhaustion.
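
One way to read the health states is as a precedence order: any failure wins, then any slow probe, then healthy. The sketch below assumes each result is a dict with `success`, `latency_ms`, and `warn_ms` keys; that shape and the function name are invented for illustration.

```python
from typing import Dict, List


def environment_health(results: List[Dict]) -> str:
    """Combine the latest result of every enabled connector into one state.

    Each result is assumed to look like:
    {"success": bool, "latency_ms": float or None, "warn_ms": float}
    """
    if not results:
        return "No data"  # new environment or no connectors enabled
    if any(not r["success"] for r in results):
        return "Incident"  # a single failed probe is enough
    if any(r["latency_ms"] > r["warn_ms"] for r in results):
        return "Degraded"  # everything completed, at least one was slow
    return "Healthy"
```

Failure is checked first, so an environment with one failed probe and one slow probe shows Incident rather than Degraded.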

Incidents — P1 and P2

An incident opens automatically the first time a probe fails. ConnectWatch does not wait for a second failure — one failed probe is enough. This is intentional: filesystem failures are rarely transient.

P1 — Critical
Your HTTPS endpoint is unreachable. This means your service is likely down for end users.
Triggered by: HTTPS probe failure
Alerts: Email + Slack + PagerDuty immediately
Response target: <15 minutes
P2 — Warning
A connectivity or filesystem probe has failed. Your service may still be partially working but infrastructure is degraded.
Triggered by: SSH, Filesystem, S3, Azure, SFTP, GCS probe failure
Alerts: Email + Slack
Response target: <1 hour
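
The severity rule above is a lookup on probe type. A minimal sketch, assuming lowercase probe identifiers such as `"https"` and `"filesystem"` (the identifiers and function name are illustrative, not ConnectWatch's own):

```python
# Probe types that open a P2 incident, per the list above (identifiers assumed).
P2_PROBES = {"ssh", "filesystem", "s3", "azure", "sftp", "gcs"}

# Response targets from this guide, in minutes.
RESPONSE_TARGET_MINUTES = {"P1": 15, "P2": 60}


def incident_severity(probe_type: str) -> str:
    """Return the incident severity opened by a failure of this probe type."""
    kind = probe_type.lower()
    if kind == "https":
        return "P1"  # endpoint unreachable: likely user-facing outage
    if kind in P2_PROBES:
        return "P2"
    raise ValueError(f"unknown probe type: {probe_type!r}")
```

For instance, `RESPONSE_TARGET_MINUTES[incident_severity("https")]` gives the 15-minute target for a failed HTTPS probe.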

Incident lifecycle

1. Probe fails: First failure for this probe type in this environment
2. Incident opens: Status set to open · timestamp recorded
3. Alerts sent: Email / Slack / PagerDuty, once per incident, not per probe
4. Probes continue: Subsequent failures do NOT re-alert; the incident stays open
5. Probe succeeds: Incident auto-resolves · duration calculated
6. Recovery alert: Email / Slack / PagerDuty resolve notification sent
One alert per incident. If your filesystem probe fails every 5 minutes for 2 hours, you get one alert when it opens and one when it recovers — not one every 5 minutes. This prevents alert fatigue.
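
The one-alert-per-incident behaviour amounts to a small state machine keyed by environment and probe type. The class below is a sketch under assumptions: `IncidentTracker`, its fields, and the alert strings are invented for illustration, and alerts are collected in a list rather than actually sent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class IncidentTracker:
    """Tracks open incidents and emits one alert on open, one on resolve."""
    open_since: Dict[Tuple[str, str], float] = field(default_factory=dict)
    alerts: List[str] = field(default_factory=list)

    def record(self, env: str, probe_type: str, success: bool, now: float) -> None:
        key = (env, probe_type)
        if not success:
            if key not in self.open_since:
                # First failure: open the incident and alert once.
                self.open_since[key] = now
                self.alerts.append(f"OPEN {env}/{probe_type}")
            # Later failures fall through silently; the incident stays open.
        elif key in self.open_since:
            # First success after a failure: resolve and report the duration.
            duration = now - self.open_since.pop(key)
            self.alerts.append(f"RESOLVED {env}/{probe_type} after {duration:.0f}s")


tracker = IncidentTracker()
for cycle in range(24):  # a failure every 5 minutes for 2 hours
    tracker.record("prod", "filesystem", False, now=cycle * 300.0)
tracker.record("prod", "filesystem", True, now=24 * 300.0)
```

After this run, `tracker.alerts` holds exactly two entries: the open alert and the recovery alert.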

Alert channels

| Channel | When | Plan | Configure |
| --- | --- | --- | --- |
| Email | Every incident open + resolve | All plans | Dashboard → Branding & Alerts → Alert email |
| Slack | Every incident open + resolve | Pro + Enterprise | Dashboard → Branding & Alerts → Slack webhook URL |
| PagerDuty | P1 incidents only (HTTPS failures) | Enterprise | Dashboard → Branding & Alerts → PagerDuty integration key |
| Custom webhook | Every incident open + resolve | Enterprise | Env-config → Webhook connector |
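
The channel rules above can be expressed as a routing function of plan and severity. This sketch assumes every channel available on the plan has been configured, and the plan and channel identifiers are illustrative names rather than ConnectWatch's own.

```python
from typing import List


def alert_channels(plan: str, severity: str) -> List[str]:
    """List the channels notified for an incident, given plan and severity.

    Assumes all channels the plan allows have been configured.
    """
    plan = plan.lower()
    channels = ["email"]  # available on all plans
    if plan in ("pro", "enterprise"):
        channels.append("slack")
    if plan == "enterprise":
        if severity == "P1":
            channels.append("pagerduty")  # P1 (HTTPS failures) only
        channels.append("webhook")
    return channels
```

On a Starter plan every incident goes to email only, so a P1 there does not page anyone.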

Probe types explained

| Probe | What it does | What it catches | Severity |
| --- | --- | --- | --- |
| SSH | Opens SSH connection, uploads a small test file via SFTP, verifies receipt | SSH service down, key revoked, firewall change, disk full | P2 |
| Filesystem | SSHes in, runs dd to write 4KB, reads it back, deletes it, and measures each step | NFS/EFS mount stale, read-only filesystem, permission change, I/O latency spike, disk full | P2 |
| HTTPS | HTTP GET to your configured health endpoint, checks status code and response time | Service down, certificate expired, load balancer failure, DNS failure | P1 |
| S3 | Uploads a test object, verifies ETag, deletes it | IAM permission change, bucket policy change, S3 regional outage | P2 |
| Azure Blob | Uploads test blob to container, verifies, deletes | Connection string expired, container deleted, Azure outage | P2 |
| SFTP | Connects via SFTP, uploads test file to configured path | SFTP service down, key rejected, upload path missing | P2 |
| GCS | Uploads test object to GCS bucket using service account key | Service account key expired, bucket permissions changed | P2 |

Common error messages

| Error message | Cause | Fix |
| --- | --- | --- |
| Permission denied — check that user has write access to /mnt/… | Probe user lost write permission to the probe path | sudo chown probeuser /mnt/efs/connectwatch-probe |
| No space left on device | Filesystem is full | Free disk space. Regular probe needs only a few KB. |
| Filesystem is mounted read-only | Linux remounted the FS read-only after an I/O error | Check dmesg for I/O errors. Remount: mount -o remount,rw /mnt/efs |
| SSH authentication failed — check private key matches authorized_keys | SSH key was changed or revoked on the server | Re-run key setup. See FS probe setup guide. |
| Connection timed out | Firewall blocking port 22, server unreachable, wrong IP | Check security group / firewall rules allow ConnectWatch egress IPs on port 22. |
| Path not found: /mnt/efs/… — create the directory first | Probe directory was deleted or never created | mkdir -p /mnt/efs/connectwatch-probe && chown probeuser /mnt/efs/connectwatch-probe |
| Filesystem write timed out | Filesystem I/O is hung: NFS server unreachable, EFS mount hanging | Check NFS server health. May need to unmount and remount the filesystem. |
| Unexpected status 503 | HTTPS probe got a non-2xx response; service or load balancer is down | Check your application and load balancer logs. |
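
If you forward incident text into your own tooling, the error messages above can be matched by substring to pull up the corresponding fix. This is a hypothetical helper, not part of ConnectWatch; the hints are shortened paraphrases of the fixes listed above, and the matching strategy is an assumption.

```python
from typing import List, Tuple

# (substring to look for, short hint paraphrasing the fix listed above)
FIX_HINTS: List[Tuple[str, str]] = [
    ("Permission denied", "Restore the probe user's write access to the probe path."),
    ("No space left on device", "Free disk space."),
    ("mounted read-only", "Check dmesg for I/O errors, then remount read-write."),
    ("SSH authentication failed", "Re-run key setup."),
    ("Connection timed out", "Check firewall / security group rules for port 22."),
    ("Path not found", "Create the probe directory and chown it to the probe user."),
    ("Filesystem write timed out", "Check NFS server health; remount if needed."),
    ("Unexpected status", "Check application and load balancer logs."),
]


def suggest_fix(error_message: str) -> str:
    """Return the first hint whose substring appears in the error message."""
    lowered = error_message.lower()
    for needle, hint in FIX_HINTS:
        if needle.lower() in lowered:
            return hint
    return "No known fix; read the full error in the incident."
```

Matching is case-insensitive and ignores the variable parts of a message, such as the path or the status code.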

What to do when something fails

Filesystem probe failed

1. Check the error message in the incident — it tells you exactly what failed.
2. SSH into the server manually and try: touch /mnt/efs/connectwatch-probe/test && rm /mnt/efs/connectwatch-probe/test
3. If permission denied → sudo chown probeuser /mnt/efs/connectwatch-probe
4. If disk full → free space, then wait for the next probe cycle.
5. If mount is hung → unmount and remount the filesystem.
6. Once fixed, the next probe cycle will auto-resolve the incident.

SSH probe failed

1. Try SSHing manually: ssh -i /path/to/key probeuser@your-server-ip
2. If connection refused → check SSH service: systemctl status ssh (the unit is named sshd on RHEL-family systems)
3. If permission denied → check ~probeuser/.ssh/authorized_keys still has the key.
4. If timeout → check firewall / security group allows port 22 from ConnectWatch.

HTTPS probe failed (P1)

1. Try the URL in your browser immediately.
2. Check your application logs for errors.
3. Check your load balancer health checks.
4. Check SSL certificate expiry: openssl s_client -connect yourdomain.com:443 -servername yourdomain.com </dev/null 2>/dev/null | openssl x509 -noout -enddate
5. Check DNS: dig yourdomain.com
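
The certificate check in the steps above can also be scripted with Python's standard library. This is a sketch for your own use, not something ConnectWatch runs; the function names are made up for the example. With the default TLS context, an already expired certificate makes the handshake itself fail with a verification error, which answers the question just as well.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    not_after uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect over TLS and return the peer certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after("yourdomain.com"))` returns the remaining days; a small or negative number points at the certificate as the cause of the P1.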

Environment showing Degraded

Degraded means slow, not down. Common causes:
EFS burst credit exhaustion — run the 1GB on-demand test to confirm, then switch to Provisioned Throughput.
NFS server under load — check NFS server CPU and I/O metrics.
Network congestion — check latency between app server and storage server.
High I/O on the server — another process is saturating disk I/O.
Incidents auto-resolve. You don't need to manually close an incident. As soon as the next probe succeeds, ConnectWatch marks it resolved and sends a recovery alert. You just need to fix the underlying issue.