
Troubleshooting

This page is organized by symptom. Start with the behavior you see, run the smallest checks first, and only then move into invasive actions like rebuilding snapshots or changing flags.

Validator fails immediately on first boot

Checks

bash
journalctl -u zink-validator --since "10 minutes ago"
agave-validator --version
cat /proc/sys/vm/max_map_count
ulimit -n

Common causes

  • missing or unreadable identity keypair
  • incorrect file ownership or permissions
  • missing Linux tuning (vm.max_map_count, file limits, memlock)
  • wrong validator version for the cluster
  • bad flag syntax in the launch script

What to do

  • confirm the sol user can read the identity keypair and write logs
  • re-check the startup script line by line with agave-validator --help
  • verify LimitNOFILE and LimitMEMLOCK are set in the zink-validator systemd unit, not just in your login shell
  • compare the local validator version to the operator-provided cluster version
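A small helper makes it easier to compare these settings against a target. This is a sketch, not a Zink-provided tool. The check_min name is hypothetical, and the 1000000 targets in the example comments are assumptions taken from common Agave tuning guides, so substitute the values Zink publishes. The example reads service limits with systemctl show, which reports what systemd applies to the zink-validator unit rather than what your current shell allows.

```bash
#!/usr/bin/env bash
# Compare one tunable against a minimum.
# Prints OK/LOW/UNKNOWN and returns 0, 1, or 2 respectively.
# systemd reports unlimited limits as "infinity", which always passes.
check_min() {
  local name="$1" actual="$2" minimum="$3"
  if [[ "$actual" == "infinity" ]]; then
    echo "OK $name=infinity"
    return 0
  fi
  if [[ ! "$actual" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN $name='$actual'"
    return 2
  fi
  if (( actual >= minimum )); then
    echo "OK $name=$actual (min $minimum)"
    return 0
  fi
  echo "LOW $name=$actual (min $minimum)"
  return 1
}

# Example on the validator host (targets are assumptions, not Zink-published values):
#   check_min vm.max_map_count "$(cat /proc/sys/vm/max_map_count)" 1000000
#   check_min LimitNOFILE "$(systemctl show -p LimitNOFILE --value zink-validator)" 1000000
#   check_min LimitMEMLOCK "$(systemctl show -p LimitMEMLOCK --value zink-validator)" <MIN_MEMLOCK_BYTES>
```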

Node does not appear in gossip

Checks

bash
solana gossip | grep <IDENTITY_PUBKEY>
ss -tulpn | grep agave-validator
journalctl -u zink-validator --since "10 minutes ago"

Common causes

  • wrong --entrypoint
  • blocked UDP/TCP traffic in the --dynamic-port-range
  • wrong identity key configured
  • host networking, NAT, or public-IP issues
  • process never started successfully

What to do

  • verify firewall rules for the configured dynamic port range
  • confirm the identity public key matches the intended node
  • confirm the cluster entrypoints are current
  • inspect logs for bind failures or early startup exit
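The identity and gossip checks can be scripted together. This sketch assumes the identity keypair lives at /home/sol/validator-keypair.json (a hypothetical path) and that solana gossip prints each node's identity pubkey as its own column. pubkey_in_gossip is an illustrative helper, not part of the Solana CLI.

```bash
#!/usr/bin/env bash
# Read `solana gossip` output on stdin and report whether PUBKEY appears
# as a whole word. Base58 pubkeys contain only letters and digits, so
# grep -w avoids matching a pubkey that is merely a prefix of another.
pubkey_in_gossip() {
  local pubkey="$1"
  if grep -qwF -- "$pubkey"; then
    echo "FOUND $pubkey"
    return 0
  fi
  echo "MISSING $pubkey"
  return 1
}

# Example (keypair path is hypothetical):
#   id=$(solana-keygen pubkey /home/sol/validator-keypair.json)
#   [ "$id" = "<IDENTITY_PUBKEY>" ] || echo "identity keypair is not the intended node"
#   solana gossip | pubkey_in_gossip "$id"
```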

Node will not catch up

Checks

bash
solana catchup <IDENTITY_PUBKEY>
free -h
iostat -xz 1
journalctl -u zink-validator --since "30 minutes ago"

Common causes

  • disks too slow
  • insufficient RAM
  • old or bad snapshot
  • bandwidth bottleneck
  • wrong cluster or genesis-hash mismatch

What to do

  • verify local hardware actually meets the published minimum requirements
  • confirm --expected-genesis-hash is correct
  • check for repeated snapshot replay or repair loops in logs
  • check whether ledger/accounts disks are saturating
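To spot saturated disks without reading every iostat row by eye, you can filter on the %util column. This is a sketch: saturated_disks and its 90% default are hypothetical. It finds %util from the header because sysstat versions name and order the extended columns differently. On NVMe devices %util can sit near 100% without true saturation, so also check r_await and w_await before blaming the disk.

```bash
#!/usr/bin/env bash
# Read `iostat -x` output on stdin and print devices whose %util is at or
# above THRESHOLD (default 90). Returns 1 if any device was printed.
saturated_disks() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    # CPU summary lines are not device rows; ignore them until the next header.
    $1 == "avg-cpu:" { col = 0; next }
    # Find the %util column from each Device header; its position
    # differs between sysstat versions.
    $1 ~ /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && ($col + 0) >= t { print $1, $col; hot = 1 }
    END { if (hot) exit 1 }
  '
}

# Example: five one-second samples. The first report averages since boot,
# so focus on the later ones.
#   iostat -xz 1 5 | saturated_disks 90
```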

Validator is up but the vote account is not progressing

Checks

bash
solana vote-account <VOTE_ACCOUNT_PUBKEY>
solana validators | grep <VOTE_ACCOUNT_OR_IDENTITY_PUBKEY>
solana stakes <VOTE_ACCOUNT_PUBKEY>

Common causes

  • wrong vote account configured
  • authority mismatch
  • node too far behind to vote reliably
  • onboarding / delegation issue on a permissioned cluster

What to do

  • confirm the configured vote account matches onboarding records
  • confirm the node is near cluster head
  • verify the vote account has the expected authorities
  • coordinate with Zink operators if validator-set admission is still pending
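When comparing against onboarding records, it helps to read the value the launch script actually passes instead of relying on memory. The sketch below assumes a plain-text launch script at a hypothetical path and handles both --flag value and --flag=value forms. flag_values is an illustrative helper. If the script passes a keypair file, solana-keygen pubkey turns it into the pubkey you compare.

```bash
#!/usr/bin/env bash
# Print every value a launch script passes to FLAG, one per line.
# Handles "--flag value" and "--flag=value"; skips comment lines.
# Assumes the value is on the same line as the flag.
flag_values() {
  local flag="$1" script="$2"
  awk -v f="$flag" '
    /^[[:space:]]*#/ { next }
    {
      for (i = 1; i <= NF; i++) {
        if ($i == f && i < NF) print $(i + 1)
        else if (index($i, f "=") == 1) print substr($i, length(f) + 2)
      }
    }
  ' "$script"
}

# Example (script path is hypothetical; the second line applies when the
# value is a keypair file):
#   flag_values --vote-account /home/sol/bin/validator.sh
#   solana-keygen pubkey "$(flag_values --vote-account /home/sol/bin/validator.sh)"
```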

RPC node responds slowly or returns stale data

Checks

bash
curl -s http://127.0.0.1:8899/health
solana --url http://127.0.0.1:8899 slot
solana slot

Common causes

  • slot lag versus cluster head
  • too many account indexes enabled (--account-index), which increases RAM use
  • insufficient RAM or disk throughput
  • too much client traffic for a single node

What to do

  • compare the local slot to a trusted reference RPC
  • disable account indexes that your clients do not actually query
  • move heavy workloads to separate RPC nodes
  • place a proxy or load balancer in front of multiple nodes if traffic volume justifies it
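The slot comparison is easiest to act on as a single number. This sketch computes how far the local node trails a trusted RPC. slot_lag, the <TRUSTED_RPC_URL> placeholder, and the 150-slot tolerance are assumptions to tune for your workload. The two slot queries run one after the other, so a small lag is expected noise.

```bash
#!/usr/bin/env bash
# Print how many slots LOCAL trails REFERENCE and return non-zero when the
# lag exceeds MAX_LAG (2 if either slot is not a number).
slot_lag() {
  local local_slot="$1" ref_slot="$2" max_lag="$3"
  if [[ ! "$local_slot" =~ ^[0-9]+$ || ! "$ref_slot" =~ ^[0-9]+$ ]]; then
    echo "lag=unknown"
    return 2
  fi
  local lag=$(( ref_slot - local_slot ))
  echo "lag=$lag"
  (( lag <= max_lag ))
}

# Example (<TRUSTED_RPC_URL> and the 150-slot tolerance are placeholders):
#   slot_lag "$(solana --url http://127.0.0.1:8899 slot)" \
#            "$(solana --url <TRUSTED_RPC_URL> slot)" 150
```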

Snapshot sync or restart loops

Checks

bash
journalctl -u zink-validator --since "1 hour ago" | tail -200
lsblk
df -h /mnt/ledger /mnt/accounts

Common causes

  • corrupt snapshot or ledger state
  • not enough free disk space
  • WAL / replay failures
  • version mismatch with cluster

What to do

  • confirm enough free space exists on ledger and accounts volumes
  • confirm local agave-validator version matches cluster expectations
  • review whether --wal-recovery-mode skip_any_corrupted_record is appropriate for your situation; it recovers by discarding corrupted WAL records, so recent local ledger writes can be lost
  • only rebuild local state after less destructive diagnostics fail
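Free space is easy to check mechanically before touching any state. The sketch below reads df -Pk output (POSIX format, sizes in KiB) and flags mounts under a minimum. low_space and the 100 GiB figure are hypothetical, so use the headroom your ledger and accounts volumes actually need.

```bash
#!/usr/bin/env bash
# Read `df -Pk` output on stdin and print mount points with less than
# MIN_GIB GiB available. Returns 1 if any mount is below the minimum.
low_space() {
  local min_gib="$1"
  awk -v min="$min_gib" '
    NR == 1 { next }                      # skip the header line
    {
      free_gib = $4 / (1024 * 1024)       # column 4: available space in KiB
      if (free_gib < min) {
        printf "%s %.1f GiB free\n", $6, free_gib   # column 6: mount point
        low = 1
      }
    }
    END { if (low) exit 1 }
  '
}

# Example (the 100 GiB minimum is an assumption):
#   df -Pk /mnt/ledger /mnt/accounts | low_space 100
```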

Wrong cluster / wrong endpoint mistakes

This one is embarrassingly common.

Checks

bash
solana config get
solana cluster-version

What to do

  • verify the CLI is pointed at the intended Zink cluster
  • verify your application config uses the same RPC URL you think it does
  • verify browser wallets are not silently connected to a different network
  • verify the validator startup script is using the current Zink bootstrap bundle instead of copied old values
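These checks can be turned into explicit comparisons against values you get from the Zink team. In this sketch, <ZINK_RPC_URL> and <ZINK_GENESIS_HASH> stand for those published values, config_rpc_url assumes solana config get prints an "RPC URL:" line, and both helper names are hypothetical. A genesis-hash mismatch is the clearest sign that a tool is pointed at a different chain.

```bash
#!/usr/bin/env bash
# Extract the URL from the "RPC URL:" line of `solana config get` output.
config_rpc_url() {
  awk '/^RPC URL:/ { sub(/^RPC URL:[[:space:]]*/, ""); sub(/[[:space:]]+$/, ""); print }'
}

# Compare an expected value with what a tool reports.
expect_equal() {
  local label="$1" expected="$2" actual="$3"
  if [[ "$expected" == "$actual" ]]; then
    echo "OK $label"
    return 0
  fi
  echo "MISMATCH $label: expected '$expected', got '$actual'"
  return 1
}

# Example (<ZINK_RPC_URL> and <ZINK_GENESIS_HASH> stand for values from the Zink team):
#   expect_equal "CLI RPC URL" "<ZINK_RPC_URL>" "$(solana config get | config_rpc_url)"
#   expect_equal "genesis hash" "<ZINK_GENESIS_HASH>" "$(solana genesis-hash)"
```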

Zink recommendation

When debugging production incidents, check cluster targeting first. A surprising amount of “bad data” is just a tool or wallet pointed at the wrong chain.

Clock drift or time-sync weirdness

Checks

bash
timedatectl status

Common causes

  • NTP disabled
  • host clock drift after suspend / resume or VM host issues

What to do

  • enable a reliable time-sync service such as systemd-timesyncd or chrony
  • correct time drift before chasing consensus symptoms that may only be side effects
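timedatectl status is meant for humans. On systemd 239 or newer, timedatectl show prints the same state as key=value pairs, which is easier to check in a script. time_sync_ok is a hypothetical helper; hosts running chrony can use chronyc tracking for more detail.

```bash
#!/usr/bin/env bash
# Read `timedatectl show` output (key=value lines) on stdin and check that
# NTP is enabled and the clock reports as synchronized.
time_sync_ok() {
  awk -F= '
    $1 == "NTP"             { ntp = $2 }
    $1 == "NTPSynchronized" { synced = $2 }
    END {
      printf "NTP=%s NTPSynchronized=%s\n", ntp, synced
      if (ntp != "yes" || synced != "yes") exit 1
    }
  '
}

# Example (timedatectl show requires systemd 239 or newer):
#   timedatectl show | time_sync_ok
```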

Before escalating

Gather:

  • node identity pubkey
  • vote account pubkey, if applicable
  • exact cluster / RPC URL
  • current validator version
  • recent log excerpt
  • output from solana catchup, solana gossip, and solana vote-account where relevant

That saves a lot of back-and-forth if you need help from the Zink team.
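If you escalate often, a small collection script keeps the bundle consistent. This sketch labels each command's output and keeps going when one fails. run_section, the output file name, the keypair path, and the 60-second timeout around solana catchup (which otherwise keeps waiting on a lagging node) are all assumptions.

```bash
#!/usr/bin/env bash
# Run one diagnostic command under a heading. A failing command is noted
# and collection continues, so one unreachable endpoint does not hide the rest.
run_section() {
  local title="$1" status
  shift
  echo "=== $title ==="
  "$@" 2>&1
  status=$?
  if (( status != 0 )); then
    echo "(command failed with exit status $status)"
  fi
  echo
}

# Example (placeholders as in the checklist; keypair path and file name are hypothetical):
#   {
#     run_section "identity" solana-keygen pubkey /home/sol/validator-keypair.json
#     run_section "cli config" solana config get
#     run_section "cluster version" solana cluster-version
#     run_section "validator version" agave-validator --version
#     run_section "gossip" sh -c 'solana gossip | grep <IDENTITY_PUBKEY>'
#     run_section "catchup" timeout 60 solana catchup <IDENTITY_PUBKEY>
#     run_section "vote account" solana vote-account <VOTE_ACCOUNT_PUBKEY>
#     run_section "recent logs" journalctl -u zink-validator -n 200 --no-pager
#   } > zink-escalation.txt
```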

Zink is a general-purpose SVM network for programs, operators, and bridge integrations.