
Monitoring & Observability

Running a healthy node means watching more than “is the process up.” Validators and RPC nodes can stay online while still being operationally unhealthy: behind the cluster, not voting, not serving RPC correctly, or quietly filling disks.

What to monitor first

Validator signals

Watch these continuously:

  • node visible in gossip
  • catchup lag relative to cluster head
  • vote-account credits and delinquency state
  • snapshot / replay progress during restart
  • CPU, RAM, and disk saturation
  • ledger / accounts disk free space
  • process restarts and crash loops

RPC signals

Watch these continuously:

  • RPC response latency
  • slot lag versus a trusted reference RPC
  • WebSocket subscription stability
  • account-index memory pressure
  • request volume and error rate
  • disk growth and snapshot sync status

First-response commands

These are the fastest useful checks when something feels wrong.

bash
# Compare the cluster's version with the local binary's version
solana cluster-version
agave-validator --version

# Check node visibility
solana gossip | grep <IDENTITY_PUBKEY>
solana validators | grep <IDENTITY_OR_VOTE_PUBKEY>

# Check catchup state
solana catchup <IDENTITY_PUBKEY>

# Inspect vote-account health
solana vote-account <VOTE_ACCOUNT_PUBKEY>

# Inspect local logs
journalctl -u zink-validator -f

For RPC nodes:

bash
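# Health endpoint: prints "ok" when the node considers itself healthy
curl -s http://127.0.0.1:8899/health

# Compare these values with a trusted reference RPC
solana --url http://127.0.0.1:8899 block-height
solana --url http://127.0.0.1:8899 slot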
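
The slot value only means something next to a reference. A minimal sketch of that comparison, assuming the solana CLI is installed; REFERENCE_URL is a placeholder for an RPC endpoint you trust, and the function names are illustrative rather than part of any tool:

```bash
# Sketch: measure how far the local RPC node is behind a reference RPC.
# REFERENCE_URL is a placeholder; set it to an endpoint you trust.

# Print the current slot reported by an RPC endpoint (needs the solana CLI).
get_slot() {
  solana --url "$1" slot
}

# Print how many slots the local node is behind the reference.
# A negative value means the local node is ahead of the reference.
slot_lag() {
  local local_slot=$1 reference_slot=$2
  echo $(( reference_slot - local_slot ))
}

# Example (not run automatically):
#   slot_lag "$(get_slot http://127.0.0.1:8899)" "$(get_slot "$REFERENCE_URL")"
```

A small, stable lag is normal; what matters is whether the number keeps growing between samples.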

What healthy looks like

Validator

A healthy validator usually shows:

  • stable presence in solana gossip
  • catchup that reaches cluster head or stays near it
  • a vote account that continues accruing credits
  • no repeated crash loops or full snapshot rebuilds in the logs
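
One way to check that the vote account keeps accruing credits is to sample it twice and compare. This is a sketch under assumptions: it needs the solana CLI and jq, and it assumes the JSON output of solana vote-account exposes a top-level "credits" field, which you should confirm against your CLI version. The function names are hypothetical.

```bash
# Sketch: detect a vote account whose credits have stopped growing.

# Print the cumulative credits of a vote account.
# Assumption: the JSON output has a top-level "credits" field.
vote_credits() {
  solana vote-account "$1" --output json | jq -r '.credits'
}

# Succeed when credits grew between two samples.
credits_advanced() {
  local previous=$1 current=$2
  [ "$current" -gt "$previous" ]
}

# Example (not run automatically), with a few minutes between samples:
#   before=$(vote_credits "$VOTE_ACCOUNT_PUBKEY"); sleep 300
#   after=$(vote_credits "$VOTE_ACCOUNT_PUBKEY")
#   credits_advanced "$before" "$after" || echo "ALERT: credits not advancing"
```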

RPC node

A healthy RPC node usually shows:

  • low slot lag relative to a trusted cluster reference
  • stable getHealth responses
  • consistent block-height progression
  • request latency that stays within your own targets under real traffic
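
The getHealth check can also be made over JSON-RPC instead of the plain /health endpoint. The sketch below assumes curl is available and that the response arrives as compact JSON; a proxy that reformats the body would need a real JSON parser instead of the string match used here. The function names are illustrative.

```bash
# Sketch: probe getHealth over JSON-RPC and classify the response.

# Send a getHealth request to the given RPC URL and print the raw response.
rpc_get_health() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' "$1"
}

# Succeed only when the response carries "result":"ok".
# Assumption: compact JSON with no whitespace around the colon.
is_healthy_response() {
  case "$1" in
    *'"result":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run automatically):
#   is_healthy_response "$(rpc_get_health http://127.0.0.1:8899)" \
#     || echo "ALERT: RPC unhealthy"
```

An unhealthy node answers with a JSON-RPC error object rather than a result, so anything other than "ok" (including an empty response) is treated as a failure.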

Suggested alerting categories

Critical

Page immediately for:

  • validator process down
  • node missing from gossip unexpectedly
  • vote account stops advancing
  • RPC health endpoint failing
  • ledger / accounts disk critically low

Warning

Investigate soon for:

  • catchup lag growing over time
  • unusually high RPC latency
  • snapshot download loops
  • memory pressure or swap activity
  • sudden spikes in disk usage
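
The disk entries in both categories can share one check that maps usage to a severity. A minimal sketch: the 80% and 90% thresholds are illustrative defaults, not recommendations from the Zink team, and the function names are hypothetical.

```bash
# Sketch: classify disk usage as ok, warning, or critical.

# Print the used percentage (digits only) of the filesystem holding a path.
disk_used_percent() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Map a usage percentage to an alert level.
# Arguments: used percent, warning threshold (default 80),
# critical threshold (default 90). Defaults are assumptions.
disk_alert_level() {
  local used=$1 warn=${2:-80} crit=${3:-90}
  if [ "$used" -ge "$crit" ]; then
    echo critical
  elif [ "$used" -ge "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

# Example (not run automatically); the path is a placeholder:
#   disk_alert_level "$(disk_used_percent /path/to/ledger)"
```

Run it separately for the ledger and accounts mounts if they live on different devices.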

Infrastructure metrics worth exporting

If you run Prometheus, Grafana, or another metrics stack, track at least:

  • CPU utilization
  • memory utilization
  • NVMe read / write throughput and latency
  • filesystem free space
  • network throughput and packet loss
  • process restart count
  • RPC success / error rates
  • slot lag versus a trusted reference endpoint
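
Node-level values such as slot lag are not produced by standard host exporters, so they need a small bridge. One common approach is the Prometheus node_exporter textfile collector, which reads metric files from a configured directory. The sketch below only formats a gauge; the metric name and output path in the example are hypothetical.

```bash
# Sketch: print one gauge in Prometheus text exposition format.
# Arguments: metric name, help text, value.
emit_gauge() {
  printf '# HELP %s %s\n# TYPE %s gauge\n%s %s\n' "$1" "$2" "$1" "$1" "$3"
}

# Example (not run automatically); name and path are placeholders:
#   emit_gauge zink_rpc_slot_lag "Slots behind the reference RPC" 12 \
#     > /var/lib/node_exporter/textfile/zink.prom.tmp
#   mv /var/lib/node_exporter/textfile/zink.prom.tmp \
#      /var/lib/node_exporter/textfile/zink.prom
```

Writing to a temporary file and renaming it keeps the collector from reading a half-written file.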

Watchtower and external checks

agave-watchtower is useful on a separate machine because it tells you whether the validator still looks healthy from the outside, not just from inside the box.

A sensible lightweight pattern is:

  • local logs and system metrics on the validator host
  • one external watcher or monitoring node off-box
  • alerts that fire when the validator disappears from gossip or the vote account stops advancing
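
If you want a hand-rolled external check alongside a watcher, gossip presence is the simplest one to script. This sketch is not how agave-watchtower works internally; it assumes the solana CLI is installed on the off-box machine, and REFERENCE_URL and IDENTITY_PUBKEY are placeholders.

```bash
# Sketch: off-box check that an identity is still visible in gossip.

# Succeed when the identity pubkey appears in gossip output read from stdin.
in_gossip() {
  grep -qF -- "$1"
}

# Example (not run automatically), querying through an RPC you trust
# rather than through the validator being checked:
#   solana --url "$REFERENCE_URL" gossip | in_gossip "$IDENTITY_PUBKEY" \
#     || echo "ALERT: validator missing from gossip"
```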

Log strategy

Keep logs centralized or at least easy to query.

Recommended pattern:

  • the systemd journal (journalctl) for on-host review
  • central shipping to your log platform for retention and search
  • alerts on repeated panic, snapshot, repair, or RPC health failures

Useful command:

bash
journalctl -u zink-validator --since "30 minutes ago"
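
To turn that query into a rough alert on repeated failures, count matching lines over the window. A minimal sketch: the search terms come from the list above, while the function name and any threshold you compare against are assumptions to tune for your own logs.

```bash
# Sketch: count log lines matching a pattern in a recent window.

# Count lines on stdin that match a case-insensitive extended regex.
# Prints 0 when nothing matches.
count_matches() {
  grep -ciE -- "$1" || true
}

# Example (not run automatically):
#   journalctl -u zink-validator --since "30 minutes ago" --no-pager \
#     | count_matches 'panic'
```

A single match may be noise; a count that keeps climbing across consecutive windows is the signal worth alerting on.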

Operator dashboard checklist

A practical dashboard should answer these six questions quickly:

  • Is the node up?
  • Is it visible to peers?
  • Is it caught up?
  • Is it voting?
  • Is it serving traffic?
  • Is it running out of disk or memory?

Zink-specific

Zink does not currently require a custom observability stack. Start with standard Linux + Agave operator monitoring, then layer in Zink-specific dashboards or runbooks as the network team publishes them.
