Status & Alerts Guide
Everything ConnectWatch shows on your dashboard — what each status means, when an incident opens, when you get alerted, and what to do about it.
How probing works
ConnectWatch runs a probe cycle on a fixed interval — every 5 minutes on Enterprise, 15 on Pro, 60 on Starter. Each probe connects to your server, performs a real operation (SSH login, file write, HTTP request), measures how long it takes, and records the result.
A server can respond to ping and still fail a filesystem probe if the NFS mount has gone stale.
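The probe idea above (perform a real operation, time it, record the result) can be sketched in a few lines. This is a minimal local illustration, not ConnectWatch's implementation: the real filesystem probe runs over SSH, and the function name, the test file name, and the `latency_ms` and `error` fields are assumptions made for the example. Only the `success` field name appears elsewhere in this guide.

```python
import os
import tempfile
import time


def run_filesystem_probe(directory, size=4096):
    """Write `size` bytes, read them back, delete the file, and time the whole cycle."""
    payload = os.urandom(size)
    # Hypothetical file name, chosen for this sketch only.
    path = os.path.join(directory, "connectwatch-probe-test")
    start = time.perf_counter()
    try:
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the write through to the filesystem
        with open(path, "rb") as handle:
            read_back = handle.read()
        os.remove(path)
        success = read_back == payload
        error = None if success else "read-back mismatch"
    except OSError as exc:
        # Any I/O failure (missing path, permissions, full disk) counts as a failed probe.
        success, error = False, str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"success": success, "latency_ms": latency_ms, "error": error}


with tempfile.TemporaryDirectory() as scratch:
    print(run_filesystem_probe(scratch))
```

A ping cannot exercise this path at all, which is why a stale mount can go unnoticed until a write is attempted.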
Probe statuses
Every probe result has one of three statuses:
Filesystem: Wrote 4KB, read it back, deleted — within warn threshold
HTTPS: Got 2xx response within timeout
Threshold: Set per connector in env-config — default 100ms for FS, 500ms for SSH
Action: Wait for the next probe cycle. Use On-Demand → Diagnostic to trigger one immediately.
Environment health states
Each environment card on your dashboard shows a combined health state across all its connectors.
| State | Badge | Meaning | What triggers it |
|---|---|---|---|
| Healthy | ✓ Healthy | All probes are passing within their thresholds | Every enabled connector returned success: true and latency below warn threshold |
| Degraded | ⚠ Degraded | All probes are completing but one or more are slow | At least one probe latency exceeded its warn threshold but no probe failed outright |
| Incident | ✗ Incident | One or more probes have failed and an incident is open | Any probe returned success: false |
| No data | — No data | No probes have run yet for this environment | New environment or no connectors enabled |
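The table above reduces to a short precedence rule: no results, then any failure, then any slow probe, otherwise healthy. The sketch below is one way to express it, assuming each probe result is a dict with `success`, `latency_ms`, and `warn_ms` keys. The last two names are hypothetical. A latency exactly equal to the threshold is treated as healthy here, which the table does not specify.

```python
def environment_health(results):
    """Combine per-connector probe results into one environment health state."""
    if not results:
        return "No data"        # new environment or no connectors enabled
    if any(not r["success"] for r in results):
        return "Incident"       # any failed probe outranks slowness
    if any(r["latency_ms"] > r["warn_ms"] for r in results):
        return "Degraded"       # everything completed, but something was slow
    return "Healthy"


probes = [
    {"success": True, "latency_ms": 42.0, "warn_ms": 100.0},   # filesystem, default 100ms
    {"success": True, "latency_ms": 730.0, "warn_ms": 500.0},  # SSH, default 500ms
]
print(environment_health(probes))
```

The order of the checks matters: a failed probe produces Incident even when other probes are merely slow.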
Incidents — P1 and P2
An incident opens automatically the first time a probe fails. ConnectWatch does not wait for a second failure — one failed probe is enough. This is intentional: filesystem failures are rarely transient.
P1
Alerts: Email + Slack + PagerDuty immediately
Response target: <15 minutes
P2
Alerts: Email + Slack
Response target: <1 hour
Incident lifecycle
open · timestamp recorded
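The open-on-first-failure behaviour, together with the auto-resolve on the next passing probe described in the troubleshooting steps, can be modelled as a small state tracker. This is a sketch under assumptions: the class, method, and field names are invented for illustration, and severity follows the probe types table in this guide (HTTPS is P1, everything else P2).

```python
from datetime import datetime, timezone

# Per the probe types table: HTTPS failures are P1, all other probe types are P2.
SEVERITY_BY_PROBE = {"https": "P1"}


class IncidentTracker:
    """Opens an incident on the first failed probe and resolves it on the next success."""

    def __init__(self):
        self.open_incidents = {}  # connector name -> incident dict

    def record(self, connector, probe_type, success, now=None):
        now = now or datetime.now(timezone.utc)
        incident = self.open_incidents.get(connector)
        if not success and incident is None:
            incident = {
                "connector": connector,
                "severity": SEVERITY_BY_PROBE.get(probe_type, "P2"),
                "opened_at": now,       # timestamp recorded when the incident opens
                "resolved_at": None,
            }
            self.open_incidents[connector] = incident
            return "opened", incident
        if success and incident is not None:
            incident["resolved_at"] = now
            del self.open_incidents[connector]
            return "resolved", incident
        return "unchanged", incident


tracker = IncidentTracker()
print(tracker.record("web", "https", success=False)[0])
print(tracker.record("web", "https", success=True)[0])
```

Repeated failures while an incident is already open return "unchanged", so no duplicate incident is created in this model.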
Alert channels
| Channel | When | Plan | Configure |
|---|---|---|---|
| Email | Every incident open + resolve | All plans | Dashboard → Branding & Alerts → Alert email |
| Slack | Every incident open + resolve | Pro + Enterprise | Dashboard → Branding & Alerts → Slack webhook URL |
| PagerDuty | P1 incidents only (HTTPS failures) | Enterprise | Dashboard → Branding & Alerts → PagerDuty integration key |
| Custom webhook | Every incident open + resolve | Enterprise | Env-config → Webhook connector |
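Read as routing rules, the table above maps a plan and an incident severity to a set of channels. The function below is an illustrative sketch only: the lowercase plan names and channel identifiers are assumptions, and it presumes every eligible channel has actually been configured.

```python
def alert_channels(plan, severity):
    """Return the channels notified for an incident, given the plan and severity."""
    channels = ["email"]                  # every plan, every incident
    if plan in ("pro", "enterprise"):
        channels.append("slack")          # Pro + Enterprise
    if plan == "enterprise":
        if severity == "P1":
            channels.append("pagerduty")  # Enterprise, P1 incidents only
        channels.append("webhook")        # Enterprise custom webhook
    return channels


for plan in ("starter", "pro", "enterprise"):
    print(plan, alert_channels(plan, "P1"))
```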
Probe types explained
| Probe | What it does | What it catches | Severity |
|---|---|---|---|
| SSH | Opens SSH connection, uploads a small test file via SFTP, verifies receipt | SSH service down, key revoked, firewall change, disk full | P2 |
| Filesystem | SSHes in, runs dd to write 4KB, reads it back, deletes it, and measures each step | NFS/EFS mount stale, read-only filesystem, permission change, I/O latency spike, disk full | P2 |
| HTTPS | HTTP GET to your configured health endpoint, checks status code and response time | Service down, certificate expired, load balancer failure, DNS failure | P1 |
| S3 | Uploads a test object, verifies ETag, deletes it | IAM permission change, bucket policy change, S3 regional outage | P2 |
| Azure Blob | Uploads test blob to container, verifies, deletes | Connection string expired, container deleted, Azure outage | P2 |
| SFTP | Connects via SFTP, uploads test file to configured path | SFTP service down, key rejected, upload path missing | P2 |
| GCS | Uploads test object to GCS bucket using service account key | Service account key expired, bucket permissions changed | P2 |
Common error messages
| Error message | Cause | Fix |
|---|---|---|
| Permission denied — check that user has write access to /mnt/… | Probe user lost write permission to the probe path | sudo chown probeuser /mnt/efs/connectwatch-probe |
| No space left on device | Filesystem is full | Free disk space. Regular probe needs only a few KB. |
| Filesystem is mounted read-only | Linux remounted the FS read-only after an I/O error | Check dmesg for I/O errors. Remount: mount -o remount,rw /mnt/efs |
| SSH authentication failed — check private key matches authorized_keys | SSH key was changed or revoked on the server | Re-run key setup. See FS probe setup guide. |
| Connection timed out | Firewall blocking port 22, server unreachable, wrong IP | Check security group / firewall rules allow ConnectWatch egress IPs on port 22. |
| Path not found: /mnt/efs/… — create the directory first | Probe directory was deleted or never created | mkdir -p /mnt/efs/connectwatch-probe && chown probeuser /mnt/efs/connectwatch-probe |
| Filesystem write timed out | Filesystem I/O is hung: NFS server unreachable, EFS mount hanging | Check NFS server health. May need to unmount and remount the filesystem. |
| Unexpected status 503 | HTTPS probe got a non-2xx response: service or load balancer is down | Check your application and load balancer logs. |
What to do when something fails
Filesystem probe failed
2. SSH into the server manually and try:
touch /mnt/efs/connectwatch-probe/test && rm /mnt/efs/connectwatch-probe/test
3. If permission denied →
sudo chown probeuser /mnt/efs/connectwatch-probe
4. If disk full → free space, then wait for the next probe cycle.
5. If mount is hung → unmount and remount the filesystem.
6. Once fixed, the next probe cycle will auto-resolve the incident.
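The manual write test in the steps above can also be scripted so that the failure is classified automatically. The sketch below is a hypothetical helper, not part of ConnectWatch: the function name and the returned strings are invented, and it only reports the failures that surface as errors. A hung mount blocks the call instead of raising, so it would need an external timeout.

```python
import errno
import os
import tempfile


def check_probe_path(directory):
    """Try the touch-and-remove test and map the failure to a suggested next step."""
    test_file = os.path.join(directory, "test")
    try:
        with open(test_file, "w") as handle:
            handle.write("connectwatch")
        os.remove(test_file)
    except FileNotFoundError:
        return "path not found: create the probe directory"
    except PermissionError:
        return "permission denied: fix ownership of the probe directory"
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return "disk full: free space, then wait for the next probe cycle"
        if exc.errno == errno.EROFS:
            return "read-only filesystem: check dmesg, then remount"
        return f"other I/O error: {exc}"
    return "ok"


with tempfile.TemporaryDirectory() as scratch:
    print(check_probe_path(scratch))
```

On the server itself, the argument would be the probe directory, for example /mnt/efs/connectwatch-probe, and the script should run as the probe user so that permissions match what the probe sees.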
SSH probe failed
ssh -i /path/to/key probeuser@your-server-ip
2. If connection refused → check SSH service:
systemctl status ssh
3. If permission denied → check ~probeuser/.ssh/authorized_keys still has the key.
4. If timeout → check firewall / security group allows port 22 from ConnectWatch.
HTTPS probe failed (P1)
2. Check your application logs for errors.
3. Check your load balancer health checks.
4. Check SSL certificate expiry:
openssl s_client -connect yourdomain.com:443
5. Check DNS:
dig yourdomain.com
Environment showing Degraded
• EFS burst credit exhaustion — run the 1GB on-demand test to confirm, then switch to Provisioned Throughput.
• NFS server under load — check NFS server CPU and I/O metrics.
• Network congestion — check latency between app server and storage server.
• High I/O on the server — another process is saturating disk I/O.