Troubleshooting
This page is organized by symptom. Start with the behavior you see, run the smallest checks first, and only then move into invasive actions like rebuilding snapshots or changing flags.
Validator fails immediately on first boot
Checks
bash
journalctl -u zink-validator --since "10 minutes ago"
agave-validator --version
cat /proc/sys/vm/max_map_count
ulimit -n
Common causes
- missing or unreadable identity keypair
- incorrect file ownership or permissions
- missing Linux tuning (vm.max_map_count, file limits, memlock)
- wrong validator version for the cluster
- bad flag syntax in the launch script
What to do
- confirm the sol user can read the identity keypair and write logs
- re-check the startup script line by line with agave-validator --help
- verify systemd LimitNOFILE and LimitMEMLOCK are set
- compare the local validator version to the operator-provided cluster version
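The tuning checks above can be scripted so the comparison is explicit. This is a minimal sketch: the helper name check_min is hypothetical, and the thresholds in the usage comments are placeholders rather than Zink's published values, so substitute the numbers from your operator's requirements.

```bash
# check_min NAME ACTUAL REQUIRED
# Prints "OK ..." and returns 0 when ACTUAL >= REQUIRED,
# otherwise prints "LOW ..." and returns 1.
# Hypothetical helper; it only handles plain integers
# (not values such as "unlimited" or "infinity").
check_min() {
  name=$1; actual=$2; required=$3
  if [ "$actual" -ge "$required" ]; then
    echo "OK $name: $actual >= $required"
  else
    echo "LOW $name: $actual < $required"
    return 1
  fi
}

# Usage (thresholds are placeholders; use your operator's published values):
#   check_min vm.max_map_count "$(cat /proc/sys/vm/max_map_count)" 1000000
#   check_min nofile "$(ulimit -n)" 1000000
# The limits systemd actually applies to the service can be read with:
#   systemctl show zink-validator -p LimitNOFILE -p LimitMEMLOCK
```

Note that ulimit -n reports the limit of your interactive shell, which can differ from what the service receives, so the systemctl show output is the more reliable source for the running validator.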
Node does not appear in gossip
Checks
bash
solana gossip | grep <IDENTITY_PUBKEY>
ss -tulpn | grep agave-validator
journalctl -u zink-validator --since "10 minutes ago"
Common causes
- wrong --entrypoint
- blocked UDP/TCP traffic in the --dynamic-port-range
- wrong identity key configured
- host networking, NAT, or public-IP issue
- process never started successfully
What to do
- verify firewall rules for the configured dynamic port range
- confirm the identity public key matches the intended node
- confirm the cluster entrypoints are current
- inspect logs for bind failures or early startup exit
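One way to rule out an identity mix-up is to derive the pubkey from the keypair file the validator actually loads and search gossip for that exact value. A small sketch, assuming the keypair path shown in the usage comment is replaced with your own; in_gossip is a hypothetical helper name.

```bash
# in_gossip PUBKEY
# Reads `solana gossip` output on stdin and succeeds (exit 0)
# if PUBKEY appears anywhere in it as a fixed string.
in_gossip() {
  grep -qF -- "$1"
}

# Usage (the keypair path is a placeholder):
#   IDENTITY=$(solana-keygen pubkey /path/to/identity.json)
#   if solana gossip | in_gossip "$IDENTITY"; then
#     echo "$IDENTITY is visible in gossip"
#   else
#     echo "$IDENTITY is NOT in gossip"
#   fi
```

Matching on a fixed string avoids depending on the column layout of the gossip table.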
Node will not catch up
Checks
bash
solana catchup <IDENTITY_PUBKEY>
free -h
iostat -xz 1
journalctl -u zink-validator --since "30 minutes ago"
Common causes
- disks too slow
- insufficient RAM
- old or bad snapshot
- bandwidth bottleneck
- wrong cluster or genesis-hash mismatch
What to do
- verify local hardware really meets the published floor
- confirm --expected-genesis-hash is correct
- check for repeated snapshot replay or repair loops in logs
- check whether ledger/accounts disks are saturating
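The genesis-hash check can be done mechanically by pulling the value out of the launch script and comparing it with what the cluster reports. This is a sketch under assumptions: the script path is a placeholder, expected_genesis_hash is a hypothetical helper, and it assumes the flag and its value sit on the same line.

```bash
# expected_genesis_hash
# Reads a launch script on stdin and prints the value passed to
# --expected-genesis-hash. Accepts both "--flag VALUE" and "--flag=VALUE".
# Only the first match is printed.
expected_genesis_hash() {
  sed -n 's/.*--expected-genesis-hash[ =]\([^ \\]*\).*/\1/p' | head -n 1
}

# Usage (script path is a placeholder; the CLI must point at the intended cluster):
#   configured=$(expected_genesis_hash < /path/to/validator.sh)
#   reported=$(solana genesis-hash)
#   [ "$configured" = "$reported" ] || echo "mismatch: $configured vs $reported"
```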
Validator is up but the vote account is not progressing
Checks
bash
solana vote-account <VOTE_ACCOUNT_PUBKEY>
solana validators | grep <VOTE_ACCOUNT_OR_IDENTITY_PUBKEY>
solana stakes <VOTE_ACCOUNT_PUBKEY>
Common causes
- wrong vote account configured
- authority mismatch
- node too far behind to vote reliably
- onboarding / delegation issue on a permissioned cluster
What to do
- confirm the configured vote account matches onboarding records
- confirm the node is near cluster head
- verify the vote account has the expected authorities
- coordinate with Zink operators if validator-set admission is still pending
RPC node responds slowly or returns stale data
Checks
bash
curl -s http://127.0.0.1:8899/health
solana --url http://127.0.0.1:8899 slot
solana slot
Common causes
- slot lag versus cluster head
- overloaded account indexes
- insufficient RAM or disk throughput
- too much client traffic for a single node
What to do
- compare the local slot to a trusted reference RPC
- reduce unnecessary indexes
- move heavy workloads to separate RPC nodes
- place a proxy or load balancer in front of multiple nodes if traffic volume justifies it
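The slot comparison is easier to read as a single number. A minimal sketch, assuming REFERENCE_RPC_URL is set to an endpoint you trust; both that variable and the slot_lag helper are illustrative names.

```bash
# slot_lag REFERENCE_SLOT LOCAL_SLOT
# Prints how many slots the local node trails the reference
# (a negative number means the local node is ahead).
slot_lag() {
  echo $(( $1 - $2 ))
}

# Usage (REFERENCE_RPC_URL is a placeholder):
#   ref=$(solana --url "$REFERENCE_RPC_URL" slot)
#   mine=$(solana --url http://127.0.0.1:8899 slot)
#   echo "local node lag: $(slot_lag "$ref" "$mine") slots"
```

The two queries are not taken at the same instant, so a lag of a few slots is noise; a lag that keeps growing across repeated runs is the signal to look for.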
Snapshot sync or restart loops
Checks
bash
journalctl -u zink-validator --since "1 hour ago" | tail -200
lsblk
df -h /mnt/ledger /mnt/accounts
Common causes
- corrupt snapshot or ledger state
- not enough free disk space
- WAL / replay failures
- version mismatch with cluster
What to do
- confirm enough free space exists on ledger and accounts volumes
- confirm local agave-validator version matches cluster expectations
- review whether --wal-recovery-mode skip_any_corrupted_record is appropriate for your situation
- only rebuild local state after less destructive diagnostics fail
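For the free-space check, the POSIX output format of df is stable enough to parse. A sketch that reports available space in whole GiB per volume; free_gib is a hypothetical helper, and how much headroom counts as enough depends on your snapshot sizes, which this page does not specify.

```bash
# free_gib
# Reads `df -Pk PATH` output on stdin (header line plus one data line)
# and prints the "Available" column converted from KiB to whole GiB.
free_gib() {
  awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Usage:
#   for vol in /mnt/ledger /mnt/accounts; do
#     echo "$vol: $(df -Pk "$vol" | free_gib) GiB free"
#   done
```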
Wrong cluster / wrong endpoint mistakes
This one is embarrassingly common.
Checks
bash
solana config get
solana cluster-version
What to do
- verify the CLI is pointed at the intended Zink cluster
- verify your application config uses the same RPC URL you think it does
- verify browser wallets are not silently connected to a different network
- verify the validator startup script is using the current Zink bootstrap bundle instead of copied old values
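The CLI-targeting check can be turned into a guard that runs before other commands. This sketch assumes solana config get prints a line beginning with "RPC URL:", which may vary between CLI versions; EXPECTED_RPC_URL and rpc_url_from_config are hypothetical names.

```bash
# rpc_url_from_config
# Reads `solana config get` output on stdin and prints the configured RPC URL.
# Assumes a line of the form "RPC URL: <url>" (layout is an assumption).
rpc_url_from_config() {
  sed -n 's/^RPC URL:[[:space:]]*//p' | awk '{ print $1 }'
}

# Usage (EXPECTED_RPC_URL is a placeholder for the intended Zink endpoint):
#   actual=$(solana config get | rpc_url_from_config)
#   [ "$actual" = "$EXPECTED_RPC_URL" ] \
#     || echo "CLI points at $actual, expected $EXPECTED_RPC_URL"
```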
Zink recommendation
When debugging production incidents, check cluster targeting first. A surprising amount of “bad data” is just a tool or wallet pointed at the wrong chain.
Clock drift or time-sync weirdness
Checks
bash
timedatectl status
Common causes
- NTP disabled
- host clock drift after suspend / resume or VM host issues
What to do
- enable a reliable time-sync service such as systemd-timesyncd or chrony
- correct time drift before chasing consensus symptoms that may only be side effects
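For monitoring or pre-flight scripts, the sync state is easier to consume as a single property than as the full status output. A sketch, assuming a systemd version whose timedatectl supports the show subcommand; clock_status is a hypothetical helper.

```bash
# clock_status VALUE
# Interprets the value of systemd's NTPSynchronized property.
# Returns 0 for "yes", 1 for "no", 2 for anything else.
clock_status() {
  case "$1" in
    yes) echo "clock synchronized" ;;
    no)  echo "clock NOT synchronized"; return 1 ;;
    *)   echo "unknown sync state: $1"; return 2 ;;
  esac
}

# Usage:
#   clock_status "$(timedatectl show -p NTPSynchronized --value)"
```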
Before escalating
Gather:
- node identity pubkey
- vote account pubkey, if applicable
- exact cluster / RPC URL
- current validator version
- recent log excerpt
- output from solana catchup, solana gossip, and solana vote-account where relevant
That saves a lot of back-and-forth if you need help from the Zink team.
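The items above can be collected into one file with a small wrapper that keeps going when an individual command fails. This is a sketch, not an official Zink tool: run_section, the output filename, and the pubkey variables are all placeholders.

```bash
# run_section TITLE COMMAND [ARGS...]
# Prints a header, then the command's stdout and stderr.
# A failing command is reported inline instead of aborting the bundle.
run_section() {
  title=$1; shift
  echo "== $title =="
  "$@" 2>&1 || echo "(command exited with status $?)"
}

# Usage (IDENTITY_PUBKEY and VOTE_ACCOUNT_PUBKEY are placeholders):
#   {
#     run_section version agave-validator --version
#     run_section config  solana config get
#     run_section catchup timeout 30 solana catchup "$IDENTITY_PUBKEY"
#     run_section vote    solana vote-account "$VOTE_ACCOUNT_PUBKEY"
#     run_section logs    journalctl -u zink-validator --since "30 minutes ago" --no-pager
#   } > zink-diag.txt
```

The timeout wrapper around solana catchup is there because that command keeps polling while the node is behind. Review the resulting file for anything sensitive before sharing it.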