Monitoring & Observability
Running a healthy node means watching more than “is the process up.” Validators and RPC nodes can stay online while still being operationally unhealthy: behind the cluster, not voting, not serving RPC correctly, or quietly filling disks.
What to monitor first
Validator signals
Watch these continuously:
- node visible in gossip
- catchup lag relative to cluster head
- vote-account credits and delinquency state
- snapshot / replay progress during restart
- CPU, RAM, and disk saturation
- ledger / accounts disk free space
- process restarts and crash loops
RPC signals
Watch these continuously:
- RPC response latency
- slot lag versus a trusted reference RPC
- WebSocket subscription stability
- account-index memory pressure
- request volume and error rate
- disk growth and snapshot sync status
First-response commands
These are the fastest useful checks when something feels wrong.
```bash
# Compare the cluster's version with the local binary version
solana cluster-version
agave-validator --version

# Check node visibility
solana gossip | grep <IDENTITY_PUBKEY>
solana validators | grep <IDENTITY_OR_VOTE_PUBKEY>

# Check catchup state
solana catchup <IDENTITY_PUBKEY>

# Inspect vote-account health
solana vote-account <VOTE_ACCOUNT_PUBKEY>

# Follow local logs
journalctl -u zink-validator -f
```
For RPC nodes:
```bash
curl -s http://127.0.0.1:8899/health
solana --url http://127.0.0.1:8899 block-height
solana --url http://127.0.0.1:8899 slot
```
What healthy looks like
Validator
A healthy validator usually shows:
- stable presence in solana gossip
- catchup that reaches cluster head or stays near it
- a vote account that continues accruing credits
- no repeated crash-loop or full snapshot rebuild pattern in logs
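One way to watch for the crash-loop pattern is to sample systemd's restart counter. This is a minimal sketch under assumptions: the validator runs as the zink-validator unit used in the journalctl commands on this page, the unit sets Restart= so that systemd counts automatic restarts, and the function names and the default limit of 3 are illustrative.

```bash
# Sketch only: function names and the default limit are illustrative assumptions.
UNIT="${UNIT:-zink-validator}"

# Print how many times systemd has automatically restarted the unit.
restart_count() {
  systemctl show "$UNIT" -p NRestarts --value
}

# Succeed (exit 0) when restarts grew by at least LIMIT between two samples.
# Usage: crash_loop_suspected <previous_count> <current_count> [limit]
crash_loop_suspected() {
  local prev="$1" cur="$2" limit="${3:-3}"
  [ $(( cur - prev )) -ge "$limit" ]
}
```

Sample restart_count on a timer, keep the previous value, and raise an alert whenever crash_loop_suspected succeeds.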
RPC node
A healthy RPC node usually shows:
- low slot lag relative to a trusted cluster reference
- stable getHealth responses
- consistent block-height progression
- acceptable application latency under real traffic
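Slot lag can be measured with the solana slot command from the first-response checks. The sketch below is hedged: REFERENCE_URL is a placeholder for an endpoint you trust, the 150-slot threshold is an illustrative assumption rather than a recommended value, and the function names are hypothetical.

```bash
# Sketch only: thresholds and names are assumptions, not project defaults.
LOCAL_URL="${LOCAL_URL:-http://127.0.0.1:8899}"
REFERENCE_URL="${REFERENCE_URL:-}"   # assumption: set to an RPC endpoint you trust
MAX_LAG="${MAX_LAG:-150}"            # assumption: illustrative limit, in slots

# Print how many slots the local node is behind (never negative).
# Usage: slot_lag <local_slot> <reference_slot>
slot_lag() {
  local lag=$(( $2 - $1 ))
  if [ "$lag" -lt 0 ]; then lag=0; fi
  echo "$lag"
}

# Succeed when the given lag is within MAX_LAG.
lag_ok() {
  [ "$1" -le "$MAX_LAG" ]
}

# Query both endpoints and report the result; exit status 2 means a query failed.
check_slot_lag() {
  local local_slot reference_slot lag
  [ -n "$REFERENCE_URL" ] || { echo "REFERENCE_URL is not set" >&2; return 2; }
  local_slot="$(solana --url "$LOCAL_URL" slot)" || return 2
  reference_slot="$(solana --url "$REFERENCE_URL" slot)" || return 2
  lag="$(slot_lag "$local_slot" "$reference_slot")"
  if lag_ok "$lag"; then
    echo "OK lag=$lag"
  else
    echo "LAGGING lag=$lag"
    return 1
  fi
}
```

Running check_slot_lag from cron or a systemd timer gives a simple signal for alerting; the non-zero exit status distinguishes a lagging node (1) from a failed query (2).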
Suggested alerting categories
Critical
Page immediately for:
- validator process down
- node missing from gossip unexpectedly
- vote account stops advancing
- RPC health endpoint failing
- ledger / accounts disk critically low
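The RPC health item can be probed through the same /health endpoint used in the first-response commands. This is a minimal sketch, assuming the endpoint answers with the plain-text body ok when the node is healthy (as Agave-based nodes do); the function names and the 5-second timeout are illustrative.

```bash
# Sketch only: names and timeout are assumptions.

# Map a /health response body to a simple state.
classify_health() {
  case "$1" in
    ok) echo healthy ;;
    *)  echo unhealthy ;;   # covers "behind", "unknown", and empty bodies
  esac
}

# Probe an RPC base URL; succeeds only when the node reports healthy.
# Usage: rpc_health http://127.0.0.1:8899
rpc_health() {
  local body state
  # -s: quiet, -f: fail on HTTP errors, -m 5: give up after 5 seconds
  body="$(curl -sf -m 5 "$1/health")" || { echo unreachable; return 1; }
  state="$(classify_health "$body")"
  echo "$state"
  [ "$state" = healthy ]
}
```

Treating anything other than ok as unhealthy keeps the check conservative: a node that reports it is behind pages the same way as one that does not answer.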
Warning
Investigate soon for:
- catchup lag growing over time
- unusually high RPC latency
- snapshot download loops
- memory pressure or swap activity
- sudden spikes in disk usage
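The split between warning and critical disk alerts can be expressed as two thresholds. The sketch below is illustrative only: the 80% and 90% values, the function names, and the /mnt/ledger path in the usage note are assumptions to adapt to your own layout.

```bash
# Assumed thresholds (percent of space used); tune for your disks.
WARN_PCT="${WARN_PCT:-80}"
CRIT_PCT="${CRIT_PCT:-90}"

# Print the used-space percentage of the filesystem that holds the given path.
disk_used_pct() {
  # df -P prints one header line, then one line per filesystem;
  # column 5 is the capacity, for example "42%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a used-space percentage to ok / warning / critical.
disk_severity() {
  if [ "$1" -ge "$CRIT_PCT" ]; then
    echo critical
  elif [ "$1" -ge "$WARN_PCT" ]; then
    echo warning
  else
    echo ok
  fi
}
```

For example, `disk_severity "$(disk_used_pct /mnt/ledger)"` prints the severity for a hypothetical ledger mount.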
Infrastructure metrics worth exporting
If you run Prometheus, Grafana, or another metrics stack, track at least:
- CPU utilization
- memory utilization
- NVMe read / write throughput and latency
- filesystem free space
- network throughput and packet loss
- process restart count
- RPC success / error rates
- slot lag versus a trusted reference endpoint
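If you already run Prometheus node_exporter, its textfile collector is one low-effort way to export node-specific values such as slot lag. This is a sketch under assumptions: the metric names (zink_validator_slot_lag, zink_validator_restarts) and function names are hypothetical, and the directory must match whatever you pass to node_exporter's --collector.textfile.directory flag.

```bash
# Sketch only: metric and function names are hypothetical.

# Print one gauge in the Prometheus text exposition format.
# Usage: emit_gauge <metric_name> <value>
emit_gauge() {
  printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# Write the metrics file atomically so the collector never reads a partial file.
# Usage: write_metrics <textfile_dir> <slot_lag> <restart_count>
write_metrics() {
  local dir="$1" tmp
  # The temporary name does not end in .prom, so the collector ignores it.
  tmp="$(mktemp "$dir/zink_validator.prom.XXXXXX")" || return 1
  {
    emit_gauge zink_validator_slot_lag "$2"
    emit_gauge zink_validator_restarts "$3"
  } > "$tmp"
  chmod 644 "$tmp"
  mv "$tmp" "$dir/zink_validator.prom"
}
```

The rename at the end is the important design choice: mv within one filesystem is atomic, so a scrape sees either the old file or the new one.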
Watchtower and external checks
agave-watchtower is useful on a separate machine because it tells you whether the validator still looks healthy from the outside, not just from inside the box.
A sensible lightweight pattern is:
- local logs and system metrics on the validator host
- one external watcher or monitoring node off-box
- alerts that fire when the validator disappears from gossip or the vote account stops advancing
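A minimal off-box watcher might be started as sketched below. It is hedged: the wrapper function and the WATCHTOWER_BIN variable are hypothetical conveniences, the two flags are the ones commonly shown for agave-watchtower, and the watcher is assumed to reach the cluster through the RPC URL in the machine's solana CLI config. Check `agave-watchtower --help` on your version before relying on it.

```bash
# Sketch only: the wrapper and variable names are assumptions.
WATCHTOWER_BIN="${WATCHTOWER_BIN:-agave-watchtower}"

# Start a watcher for one validator identity.
# Usage: start_watchtower <IDENTITY_PUBKEY>
start_watchtower() {
  local identity="$1"
  if [ -z "$identity" ]; then
    echo "usage: start_watchtower <IDENTITY_PUBKEY>" >&2
    return 2
  fi
  # --validator-identity: which validator to watch
  # --monitor-active-stake: also alert on drops in active cluster stake
  "$WATCHTOWER_BIN" --monitor-active-stake --validator-identity "$identity"
}
```

Running this under its own systemd unit on the external machine keeps the watcher alive independently of the validator host.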
Log strategy
Keep logs centralized or at least easy to query.
Recommended pattern:
- the systemd journal (journalctl) for on-host review
- central shipping to your log platform for retention and search
- alerts on repeated panic, snapshot, repair, or RPC health failures
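Alerting on repeated failures can start as a simple match count over a recent log window. The sketch below is an assumption-heavy illustration: the function names, the patterns, and the threshold are hypothetical, and real validator log wording should be confirmed against your own journal before alerting on it.

```bash
# Sketch only: patterns and thresholds are assumptions.

# Count lines on stdin that match a case-insensitive extended regex.
# Usage: journalctl ... | count_matches 'panic'
count_matches() {
  # grep -c prints the count but exits 1 when it is 0, so mask that status.
  grep -c -i -E "$1" || true
}

# Succeed when the count has reached the threshold.
# Usage: repeated <count> <threshold>
repeated() {
  [ "$1" -ge "$2" ]
}
```

For example, `journalctl -u zink-validator --since "30 minutes ago" --no-pager | count_matches 'panic'` yields a number that can be passed to repeated together with a threshold of your choosing.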
Useful command:
```bash
journalctl -u zink-validator --since "30 minutes ago"
```
Operator dashboard checklist
A practical dashboard should answer these six questions quickly:
- Is the node up?
- Is it visible to peers?
- Is it caught up?
- Is it voting?
- Is it serving traffic?
- Is it running out of disk or memory?
Zink-specific
Zink does not currently require a custom observability stack. Start with standard Linux + Agave operator monitoring, then layer in Zink-specific dashboards or runbooks as the network team publishes them.