Monitoring & Observability

Running a healthy node means watching more than “is the process up.” Validators and RPC nodes can stay online while still being operationally unhealthy: behind the cluster, not voting, not serving RPC correctly, or quietly filling disks.

What to monitor first

Validator signals

Watch these continuously:

node visible in gossip
catchup lag relative to cluster head
vote-account credits and delinquency state
snapshot / replay progress during restart
CPU, RAM, and disk saturation
ledger / accounts disk free space
process restarts and crash loops

RPC signals

Watch these continuously:

RPC response latency
slot lag versus a trusted reference RPC
WebSocket subscription stability
account-index memory pressure
request volume and error rate
disk growth and snapshot sync status

First-response commands

These are the fastest useful checks when something feels wrong.

bash

# Compare the cluster's version with the locally installed binary
solana cluster-version
agave-validator --version

# Check node visibility
solana gossip | grep <IDENTITY_PUBKEY>
solana validators | grep <IDENTITY_OR_VOTE_PUBKEY>

# Check catchup state
solana catchup <IDENTITY_PUBKEY>

# Inspect vote-account health
solana vote-account <VOTE_ACCOUNT_PUBKEY>

# Inspect local logs
journalctl -u zink-validator -f

For RPC nodes:

bash

curl -s http://127.0.0.1:8899/health
solana --url http://127.0.0.1:8899 block-height
solana --url http://127.0.0.1:8899 slot
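Slot lag against a trusted reference is easier to alert on when it is reduced to a single number. The following is a minimal sketch, not an official tool: `REFERENCE_RPC_URL`, the function names, and the thresholds of 50 and 150 slots are all assumptions to adapt to your own setup.

```bash
# Sketch: compare the local RPC slot with a trusted reference RPC.
# REFERENCE_RPC_URL and both thresholds are placeholders, not official values.
LOCAL_RPC_URL="${LOCAL_RPC_URL:-http://127.0.0.1:8899}"
REFERENCE_RPC_URL="${REFERENCE_RPC_URL:-}"   # must be set by the operator
WARN_SLOTS="${WARN_SLOTS:-50}"               # hypothetical warning threshold
CRIT_SLOTS="${CRIT_SLOTS:-150}"              # hypothetical critical threshold

# Print how many slots the local node trails the reference (0 if ahead).
slot_lag() {
  local local_slot="$1" reference_slot="$2" lag
  lag=$(( reference_slot - local_slot ))
  if [ "$lag" -lt 0 ]; then lag=0; fi
  echo "$lag"
}

# Map a lag value to ok / warning / critical.
classify_lag() {
  local lag="$1"
  if [ "$lag" -ge "$CRIT_SLOTS" ]; then
    echo "critical"
  elif [ "$lag" -ge "$WARN_SLOTS" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# Query both endpoints and report; requires the solana CLI on PATH.
check_slot_lag() {
  local local_slot reference_slot lag
  local_slot="$(solana --url "$LOCAL_RPC_URL" slot)" || return 2
  reference_slot="$(solana --url "$REFERENCE_RPC_URL" slot)" || return 2
  lag="$(slot_lag "$local_slot" "$reference_slot")"
  echo "lag=${lag} status=$(classify_lag "$lag")"
}
```

Calling `check_slot_lag` from cron or a metrics exporter gives one line per run. Keeping the arithmetic in `slot_lag` and `classify_lag` separate from the network calls makes the thresholds easy to tune without touching a live node.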

What healthy looks like

Validator

A healthy validator usually shows:

stable presence in solana gossip
catchup that reaches cluster head or stays near it
a vote account that continues accruing credits
no repeated crash loops or full snapshot rebuilds in the logs

RPC node

A healthy RPC node usually shows:

low slot lag relative to a trusted cluster reference
stable getHealth responses
consistent block-height progression
acceptable application latency under real traffic

Suggested alerting categories

Critical

Page immediately for:

validator process down
node missing from gossip unexpectedly
vote account stops advancing
RPC health endpoint failing
ledger / accounts disk critically low

Warning

Investigate soon for:

catchup lag growing over time
unusually high RPC latency
snapshot download loops
memory pressure or swap activity
sudden spikes in disk usage
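The critical and warning tiers above map naturally onto threshold checks. As one illustration, here is a hedged sketch for the disk items; the 80% and 90% defaults and the function names are assumptions, not recommended values from the network team.

```bash
# Sketch: tier disk usage into ok / warning / critical.
# The default thresholds (80 and 90 percent used) are illustrative only.

# Print the used-space percentage (integer) for the filesystem holding a path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Classify a used-space percentage: $1 = percent, $2 = warn, $3 = critical.
classify_disk() {
  local pct="$1" warn="${2:-80}" crit="${3:-90}"
  if [ "$pct" -ge "$crit" ]; then
    echo "critical"
  elif [ "$pct" -ge "$warn" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

For example, `classify_disk "$(disk_used_pct /mnt/ledger)"` prints the tier for a ledger volume, where `/mnt/ledger` is a hypothetical mount point. Run it separately for the ledger and accounts volumes if they live on different disks.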

Infrastructure metrics worth exporting

If you run Prometheus, Grafana, or another metrics stack, track at least:

CPU utilization
memory utilization
NVMe read / write throughput and latency
filesystem free space
network throughput and packet loss
process restart count
RPC success / error rates
slot lag versus a trusted reference endpoint

Watchtower and external checks

agave-watchtower is useful on a separate machine because it tells you whether the validator still looks healthy from the outside, not just from inside the box.

A sensible lightweight pattern is:

local logs and system metrics on the validator host
one external watcher or monitoring node off-box
alerts that fire when the validator disappears from gossip or the vote account stops advancing
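An off-box gossip check of this kind can be sketched in a few lines of shell. This is not a replacement for agave-watchtower; it only illustrates the pattern. The state-file path, the function names, and the idea of alerting after three consecutive misses are assumptions chosen to avoid paging on a single transient miss.

```bash
# Sketch: off-box gossip presence check with a consecutive-miss counter.

# Succeed if the identity pubkey ($1) appears in gossip output read from stdin.
in_gossip() {
  grep -qF -- "$1"
}

# Update the consecutive-miss counter and print the new count.
# $1 = state file, $2 = "seen" or "missing"
update_miss_count() {
  local state_file="$1" result="$2" count
  count="$(cat "$state_file" 2>/dev/null)"
  count="${count:-0}"
  if [ "$result" = "seen" ]; then
    count=0
  else
    count=$(( count + 1 ))
  fi
  echo "$count" > "$state_file"
  echo "$count"
}
```

A cron job on the external machine might run `solana gossip | in_gossip "$IDENTITY_PUBKEY"`, pass `seen` or `missing` to `update_miss_count /var/tmp/gossip-misses` (a hypothetical path), and alert once the printed count reaches 3.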

Log strategy

Keep logs centralized or at least easy to query.

Recommended pattern:

systemd journal (journalctl) for on-host review
central shipping to your log platform for retention and search
alerts on repeated panic, snapshot, repair, or RPC health failures

Useful command:

bash

journalctl -u zink-validator --since "30 minutes ago"
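To turn that query into a rough alert signal, count how often a failure keyword appears in the window. The sketch below assumes nothing about the exact log format; the keyword `panic` and the function name are illustrative, so substitute the patterns you actually see in your logs.

```bash
# Sketch: count recent log lines that match a failure pattern.

# Count lines on stdin matching a case-insensitive extended regex ($1).
# grep -c exits non-zero when the count is 0, so the status is ignored.
count_matches() {
  grep -ciE -- "$1" || true
}
```

For example, `journalctl -u zink-validator --since "30 minutes ago" --no-pager | count_matches 'panic'` prints the number of matching lines, which a monitoring agent can compare against a threshold of your choosing.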

Operator dashboard checklist

A practical dashboard should answer these six questions quickly:

Is the node up?
Is it visible to peers?
Is it caught up?
Is it voting?
Is it serving traffic?
Is it running out of disk or memory?

Zink-specific

Zink does not currently require a custom observability stack. Start with standard Linux + Agave operator monitoring, then layer in Zink-specific dashboards or runbooks as the network team publishes them.
