Auto-Failover SDK for Zcash lightwalletd: Multi-endpoint resilience for wallets & apps

Application Owners

@Emilio983

Organization Details

Organization Name: The Social Mask Development Team

How did you learn about ZCG?
Through Zcash Community Forum while researching privacy payment solutions and developing our Social Mask project. During development, we identified critical infrastructure gaps that repeatedly impact wallet reliability and user experience across the Zcash ecosystem.

Requested Amount & Funding Structure

Required (Base) Track: $38,000 USD
Optional (Node add-on) Track: $10,000 USD
Grand Total (if optional is activated): $48,000 USD

We are presenting a two-track budget. ZCG may fund the Required (Base) track alone, which fully delivers the SDK and core outcomes, or optionally activate the Node add-on to sponsor a lightwalletd test/staging node and extended chaos/monitoring capabilities. All milestones therefore show two amounts: the mandatory base and the optional add-on. If the add-on is not activated, all Required deliverables remain achievable and independently useful.

Timeline: 16 weeks (4 months) of focused development

Executive Summary

We propose a drop-in, non-custodial Auto-Failover SDK for lightwalletd that brings Electrum-like multi-server resilience to Zcash. The SDK provides transparent multi-endpoint routing, geo-latency optimization, and chaos-tested reliability (p95 failover ≤ 3 seconds), while never handling keys or custody.

Milestone 1 ships JS/TS SDK, CLI, and chaos harness; Milestone 2 adds geo-routing, a status widget, and one additional binding (PHP or Python); Milestone 3 completes a Rust binding, integration with public ecosystem health monitors, and advanced documentation.

We gate on SLOs (availability and failover times), publish packages across major ecosystems, and offer non-blocking adoption incentives. The result is a foundational reliability layer that makes existing infrastructure grants more valuable and lowers risk for every future Zcash wallet, merchant plugin, and dapp.

Project Overview

Project Title: Auto-Failover SDK for Zcash Lightwalletd: Multi-endpoint resilience

Category: Infrastructure

Team Information

Project Lead: Emilio Navarro Mejía

Role: Lead Developer & Technical Architect

Background: Founder of NordClip.com and Nortedu.com with extensive experience in blockchain development across Zcash, Algorand, and Ethereum. Previously presented the Social Mask decentralized social network proposal to ZCG. Our team has hands-on experience creating wallets for Algorand and implementing privacy technologies across blockchain platforms.

Responsibilities: SDK architecture design, multi-language implementation, security implementation, failover logic, health monitoring systems, WebZjs integration, documentation, and ecosystem coordination.

Core Team Member:

  • Oswaldo Navarro, Finance & Operations Manager
    15+ years of experience in accounting and financial management. Official legal representative with government credentials. Manages project budget, milestone payments, legal compliance, vendor contracts, financial records, and KYC coordination with ZCG.

Extended Development Team:

Our core team is supported by a rotating group of experienced blockchain developers who contribute based on availability and personal passion for privacy technology and freedom of expression. Many team members work on these projects as a hobby and personal mission rather than primary employment. The upcoming holiday season provides dedicated time for focused development, allowing our team to deliver high-quality work on this proposal and other initiatives we believe will significantly benefit the Zcash ecosystem.

This approach reflects our genuine motivation: we’re not primarily profit-driven, but rather committed to advancing privacy technology and supporting our personal vision of digital freedom of expression. Team members balance this work with other employment, contributing their expertise when available.

Letters of Intent: We have secured two preliminary letters of intent: one from a major Zcash wallet project (available to ZCG upon request) confirming interest in testing the SDK during Milestone 2, and one from a lightwalletd endpoint operator willing to participate in chaos testing and provide operational feedback.

Project Details

Background Evidence

The Zcash ecosystem has faced documented infrastructure challenges:

  • ECC Emergency Mode (2022): Extended wallet sync times and service degradation documented in ECC incident reports
  • Lightwalletd block resets: GitHub issue documenting mainnet.lightwalletd.com resetting latest block height, causing repeated sync failures
  • Zecwallet outage: Complete server shutdown requiring emergency stop-gap grant to restore service
  • Nighthawk shutdown (2024): Infrastructure discontinuation with 30-day notice, reducing available public endpoints

Today, few public endpoints remain operational, and availability varies. This single-point-of-failure architecture leaves users unable to access funds when default endpoints fail. Most users lack the technical knowledge to manually reconfigure wallet settings.

(Evidence links provided in Appendix)

Problem Statement

Zcash lightwalletd infrastructure operates as a single-endpoint architecture with no automatic failover mechanisms. When endpoints experience downtime, wallets become completely non-functional. Current manual server switching requires technical knowledge most users don’t possess, creating critical ecosystem fragility.

Why Existing Solutions Don’t Solve This:

  • Infrastructure hosting grants fund server operation, which is valuable but addresses a different layer
  • Server-side load balancing requires trust in single operators and doesn’t solve client-side coordination
  • Manual endpoint lists in wallets shift the burden to non-technical users
  • Bitcoin’s Electrum solved this years ago with automatic multi-server failover. Zcash has no comparable client-side solution.

Proposed Solution

Core Solution: Client-Side Auto-Failover SDK

We’re building a client-side SDK that provides transparent multi-endpoint management with automatic failover, requiring minimal code changes for existing wallet integrations. The SDK operates at the networking layer only (no custody of keys or funds) and makes all existing lightwalletd endpoints more reliable through intelligent routing.

Scope Control: Milestone 1 delivers JS/TS SDK, CLI, and chaos test harness. Additional language bindings (PHP/Python, then Rust) move to Milestones 2–3 to de-risk execution while preserving full roadmap.

Key Technical Components:

  1. Transparent Multi-Endpoint Management
  • Health-checked pool of lightwalletd servers across multiple operators
  • Automatic failover when primary endpoint fails or degrades (p95 ≤ 3 seconds)
  • Exponential backoff + circuit breaker patterns prevent thundering herd
  • Drop-in compatibility with existing WebZjs and wallet SDK code
  2. Geographic Latency Optimization
  • Routes to fastest available endpoint automatically based on measured response times
  • Targets a 200-500ms latency improvement for distant users, with zero configuration required
  3. Intelligent Anomaly Detection
  • Identifies misbehaving endpoints (block height resets, stale data)
  • Temporarily excludes outliers even if they pass basic health checks
  • Cross-validates chain state across multiple servers
  4. Security Model
  • Transport security via TLS (optional certificate pinning)
  • Client-side idempotency tokens and txid de-duplication ensure no double submissions during failover
  • Idempotency & No Double-Send: The SDK maintains a short-window local txid cache to prevent duplicate transaction submissions during failover. Retries are performed without re-sending already confirmed transactions, and the SDK does not interfere with wallet fee logic.
  • The SDK never handles keys or custody; it operates at the networking layer only
  • Zero trust in individual endpoints beyond correct chain state
  5. Observable Metrics & Developer Experience
  • Per-endpoint success rates, average latencies, historical uptime percentages
  • Embeddable HTML/JS status widget showing real-time endpoint health
  • CLI tool for operators to test, monitor, and troubleshoot endpoint health
  • Integration with Hosh monitoring service for ecosystem-wide coordination
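
To make the "exponential backoff + circuit breaker" item more concrete, here is a minimal sketch of how a per-endpoint breaker could behave. It is not the proposed SDK's implementation: the class name, method names, and default thresholds are all hypothetical, and the clock is passed in explicitly so the behaviour is easy to test.

```typescript
// Hypothetical sketch: names and defaults are illustrative, not the SDK's API.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;                    // failures since the last success
  private trips = 0;                       // consecutive trips; drives the backoff exponent
  private openedAt: number | null = null;  // time of the last trip, in ms
  private readonly failureThreshold: number;
  private readonly baseDelayMs: number;
  private readonly maxDelayMs: number;

  constructor(failureThreshold = 3, baseDelayMs = 500, maxDelayMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  // Cool-down before the next probe: base * 2^(trips - 1), capped at maxDelayMs.
  cooldownMs(): number {
    if (this.trips === 0) return 0;
    return Math.min(this.maxDelayMs, this.baseDelayMs * Math.pow(2, this.trips - 1));
  }

  state(nowMs: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return nowMs - this.openedAt >= this.cooldownMs() ? "half-open" : "open";
  }

  // Requests are blocked only while the breaker is open.
  allowRequest(nowMs: number): boolean {
    return this.state(nowMs) !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.trips = 0;
    this.openedAt = null;
  }

  recordFailure(nowMs: number): void {
    const current = this.state(nowMs);
    if (current === "open") return; // already tripped; ignore late failures
    this.failures += 1;
    // A failed half-open probe re-trips immediately with a longer cool-down.
    if (current === "half-open" || this.failures >= this.failureThreshold) {
      this.trips += 1;
      this.openedAt = nowMs;
    }
  }
}
```

A real client would likely add random jitter to the cool-down so that many wallets do not probe a recovering endpoint at the same instant; it is left out here to keep the sketch deterministic.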

Implementation Example (JavaScript/TypeScript):

    import { ZcashFailover } from '@zcash/auto-failover';

    const client = new ZcashFailover({
      endpoints: ['https://zec.rocks:443', 'https://mainnet.lightwalletd.com:9067'],
      strategy: 'latency-optimized'
    });

    // Existing WebZjs code works with minimal changes.
    // Note: getProxyURL() returns a local adapter URL from the SDK (no separate proxy required).
    const wallet = new WebWallet("main", client.getProxyURL(), 1);

Architecture Diagram:

┌──────────────────────────────────────────────────────────────────┐
│                     WALLET APPLICATION LAYER                     │
│          (Zashi, Ywallet, WordPress Plugin, Web Wallet)          │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             │ Standard gRPC/WebZjs API
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                    AUTO-FAILOVER SDK LAYER                       │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │              INTELLIGENT ROUTING ENGINE                    │  │
│  │  • Health monitoring    • Latency optimization             │  │
│  │  • Circuit breakers     • Anomaly detection                │  │
│  └────────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │              SECURITY & RELIABILITY LAYER                  │  │
│  │  • TLS validation       • Idempotency control              │  │
│  │  • Certificate pinning  • Duplicate prevention             │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬─────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│  Endpoint A   │    │  Endpoint B   │    │  Endpoint C   │
│  zec.rocks    │    │  mainnet.lwd  │    │  community    │
│  Status: 🟢   │    │  Status: 🟡   │    │  Status: 🟢   │
└───────────────┘    └───────────────┘    └───────────────┘

Endpoint Registry & Abuse Policy

Endpoint Registration:
Endpoints are added via signed pull requests with TLS verification and operator contact information. Each endpoint entry includes:

  • TLS certificate fingerprint
  • Operator contact (email or forum handle)
  • Geographic region hint
  • Minimum supported protocol version

Neutral Governance & Overrides:
The SDK ships with a signed default registry maintained via public pull requests. Any integrator can override or extend the endpoint list locally (env/config). Delisting criteria include stale/rollback heights, invalid TLS, or abusive behavior. Operators receive notice and a seventy-two-hour remediation window before delisting, except in critical security cases.
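
As an illustration of the registry entry fields and the local override described above, the following sketch shows one possible shape for the data and the merge step. The interface names, field names, and the `resolveEndpoints` helper are assumptions made for this example; the proposal does not specify a schema, and signature verification of the default registry is out of scope here.

```typescript
// Hypothetical schema: field names mirror the registry entry list above.
interface EndpointEntry {
  url: string;
  tlsFingerprint: string;      // TLS certificate fingerprint
  operatorContact: string;     // email or forum handle
  regionHint: string;          // geographic region hint
  minProtocolVersion: string;  // minimum supported protocol version
}

// Hypothetical local override, e.g. loaded from env/config by the integrator.
interface LocalOverrides {
  add?: EndpointEntry[];   // extra or replacement entries
  removeUrls?: string[];   // default entries the integrator does not want
}

// Merge the default registry with local overrides; local entries win on URL clash.
function resolveEndpoints(
  defaults: EndpointEntry[],
  overrides: LocalOverrides = {},
): EndpointEntry[] {
  const removed = new Set<string>(overrides.removeUrls ?? []);
  const byUrl = new Map<string, EndpointEntry>();
  for (const entry of defaults) {
    if (!removed.has(entry.url)) byUrl.set(entry.url, entry);
  }
  for (const entry of overrides.add ?? []) {
    byUrl.set(entry.url, entry);
  }
  return Array.from(byUrl.values());
}

// Placeholder values only; fingerprints and contacts are not real data.
const defaultRegistry: EndpointEntry[] = [
  { url: "https://zec.rocks:443", tlsFingerprint: "<fingerprint>",
    operatorContact: "<contact>", regionHint: "<region>", minProtocolVersion: "<version>" },
  { url: "https://mainnet.lightwalletd.com:9067", tlsFingerprint: "<fingerprint>",
    operatorContact: "<contact>", regionHint: "<region>", minProtocolVersion: "<version>" },
];

const resolved = resolveEndpoints(defaultRegistry, {
  removeUrls: ["https://mainnet.lightwalletd.com:9067"],
  add: [{ url: "https://lwd.example.org:443", tlsFingerprint: "<fingerprint>",
          operatorContact: "<contact>", regionHint: "<region>", minProtocolVersion: "<version>" }],
});
```

Keeping the override purely local means an integrator can drop or add endpoints without waiting for a registry pull request to be merged.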

Abuse Handling:

  • Rate-limits per client to prevent endpoint exhaustion
  • Temporary blacklisting for misbehaving endpoints (configurable thresholds)
  • Transparent delisting process with notification to operator
  • Community governance for endpoint registry changes

Privacy & Telemetry

Privacy of Metrics: The status widget and SDK metrics are opt-in (disabled by default), aggregated, and retained only briefly. No IP addresses are collected; operators can self-report health status via signed JSON. All telemetry is client-controlled and can be disabled completely.

Data Minimization:

  • No user transaction data logged or transmitted
  • Endpoint health checks use synthetic test queries
  • Aggregated statistics only (no individual client tracking)

Deliverables

  1. Open-Source SDK Packages:
  • JavaScript/TypeScript package published to npm (@zcash/auto-failover), Milestone 1
  • PHP package published to Packagist (zcash/auto-failover), Milestone 2
  • Python package published to PyPI (zcash-failover), Milestone 2 alternative
  • Rust crate published to crates.io (zcash-failover), Milestone 3
  • License: MIT. The repository will include LICENSE, SECURITY.md (responsible disclosure and security.txt), CONTRIBUTING.md, and CODE_OF_CONDUCT.md
  2. CLI Tool:
  • Standalone binary for Linux, macOS, Windows
  • Endpoint health testing and monitoring
  • JSON export for integration with existing monitoring systems
  3. Integration Examples:
  • WebZjs wrapper with drop-in compatibility
  • WordPress payment plugin integration
  • React/Vue web wallet examples
  • Serverless function examples (Vercel, Cloudflare Workers)
  4. Embeddable Status Widget:
  • Customizable HTML/JS component showing real-time endpoint health
  • Zero dependencies (vanilla JavaScript)
  5. Documentation:
  • API reference for all languages
  • Integration guides for major wallet frameworks
  • Security best practices and threat modeling
  • Operator guide for publishing endpoints
  • Troubleshooting guides
  6. Chaos Engineering Test Suite:
  • Simulates outages, lag, block resets, network partitions
  • Performance benchmarks comparing single-endpoint vs multi-endpoint
  • Integration test suite for major wallet libraries
  7. Compatibility Matrix (published in M2):
  • Runtime compatibility (browser/Node.js)
  • gRPC-web proxy expectations
  • WebZjs versions and integration notes for major wallets
  • The SDK never handles keys or fee logic; it only manages network routing and retries

Dependencies

Technical Dependencies:

No Conflicting Dependencies: This project has zero dependencies on other pending grant proposals. The SDK can be built and delivered independently.

Technical Approach & Milestones

Gated by SLOs, Not Partners: We gate milestones on measurable SLOs (p95 failover ≤ 3s; 14-day SLO ≥ 99%). Adoption (PRs, wrappers, production confirmations) is tracked and showcased but non-blocking for milestone payments.


Milestone 1: Core SDK Foundation (Weeks 1-5)

Required Deliverables:

  • JavaScript/TypeScript SDK with WebZjs compatibility
  • CLI tool for health testing and monitoring
  • Chaos engineering test harness (synthetic failure scenarios)
  • Core documentation (API reference, quick-start guide)
  • Automated test suite with >80% code coverage
  • Demo video showing failover in action

Optional (Node add-on) Deliverables:

  • Spin-up of lightwalletd test/staging node (compute + NVMe + TLS)
  • Initial dataset for controlled chaos testing
  • Basic observability dashboards (Grafana/Prometheus)
  • Egress/bandwidth credits for testing traffic
  • Initial chaos scenarios (latency injection, endpoint rotation)

Acceptance Criteria (Gating - Required Track):

  • p95 failover ≤ 3 seconds (chaos test confirmed)
  • SDK overhead ≤ 50ms during normal operation (benchmark published)
  • Core code coverage ≥ 80%
  • CLI successfully tests multiple endpoints and exports JSON
  • Documentation enables <15-minute integration
  • Public demo video posted to forum

Funding:

  • Required: $15,000
  • Optional (Node add-on): $5,000
  • Milestone 1 Total (if optional activated): $20,000

Milestone 2: Optimization & Second Binding (Weeks 6-10)

Required Deliverables:

  • Geographic latency routing implementation
  • Embeddable status widget (customizable themes)
  • PHP binding (Composer package) OR Python binding (pip package)
  • Security hardening (certificate pinning, enhanced idempotency)
  • Performance benchmarks comparing single vs. multi-endpoint
  • Security documentation (threat model, mitigation strategies)
  • Packages published to npm + (Packagist or PyPI)

Optional (Node add-on) Deliverables:

  • Extended node operation + observability (months 2-3)
  • Chaos credits (advanced latency injection, block-reset simulations)
  • Alarms and SLO tracking infrastructure
  • Multi-region latency testing
  • Bandwidth/egress for sustained testing

Acceptance Criteria (Gating - Required Track):

  • 14-day SLO ≥ 99% (controlled test environment)
  • Geographic routing demonstrates 200-500ms improvement for distant users
  • Benchmarks published showing <50ms overhead
  • Second language binding published to package registry
  • Status widget displays real-time health in live demo
  • Security documentation reviewed by ≥2 independent developers

SLO Measurement Methodology: We compute a rolling fourteen-day SLO using SDK telemetry from the chaos harness and synthetic clients across at least three regions. p95 failover time is measured under controlled endpoint faults (latency injection, drop, block-reset). Overhead is the median added client-side latency versus direct single-endpoint calls. Raw metrics and scripts will be published in the repo.
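
The measurement described above could be computed roughly as follows. This is a sketch under stated assumptions: it uses the nearest-rank percentile definition, assumes overhead samples are paired (the same request timed through the SDK and directly), and all function names are hypothetical. The gate values (3 seconds, 99%, 50ms) are the ones stated in this proposal.

```typescript
// Nearest-rank percentile over a non-empty list of samples (p in 1..100).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Fraction of synthetic requests that succeeded over the measurement window.
function availability(succeeded: number, total: number): number {
  return total === 0 ? 1 : succeeded / total;
}

// Median added latency, assuming paired samples of the same request
// measured via the SDK and via a direct single-endpoint call.
function medianOverheadMs(viaSdkMs: number[], directMs: number[]): number {
  if (viaSdkMs.length !== directMs.length) throw new Error("samples must be paired");
  const added = viaSdkMs.map((v, i) => v - directMs[i]);
  return percentile(added, 50);
}

// Gate check using the thresholds stated in the proposal.
function meetsGates(failoverMs: number[], succeeded: number, total: number,
                    viaSdkMs: number[], directMs: number[]): boolean {
  return percentile(failoverMs, 95) <= 3000
    && availability(succeeded, total) >= 0.99
    && medianOverheadMs(viaSdkMs, directMs) <= 50;
}
```

Publishing the raw samples alongside a script like this would let reviewers recompute each gate independently.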

Adoption Tracking (Non-Blocking):

  • Open PRs to major wallets
  • Integration examples published
  • Developer feedback collected via forum
  • Publish LOI excerpts (with consent)

Funding:

  • Required: $12,000
  • Optional (Node add-on): $2,500
  • Milestone 2 Total (if optional activated): $14,500

Milestone 3: Ecosystem Completion (Weeks 11-16)

Required Deliverables:

  • Rust binding (crates.io) OR remaining language binding from M2
  • Integration with public ecosystem health monitors operational
  • Advanced documentation (troubleshooting, operator handbook, security guide)
  • Runbooks for common failure scenarios
  • All packages published (npm, Packagist/PyPI, crates.io)
  • Public dashboard showing SDK adoption metrics
  • ≥3 production wallet/plugin integrations showcased (demos, PRs, or announcements)

Optional (Node add-on) Deliverables:

  • Node operation through project completion
  • Multi-region resilience testing
  • Final chaos testing report with reproducible scenarios
  • Operator experience guide based on real node operation
  • Extended monitoring and incident response data

Acceptance Criteria (Gating - Required Track):

  • Public demo of SDK handling real endpoint failover
  • All language bindings published to package registries
  • Integration with public ecosystem health monitors operational (health data exchange confirmed)
  • Documentation receives positive feedback from ≥3 independent developers
  • Runbooks validated by at least one endpoint operator
  • Dashboard live at public URL

Adoption Showcase (Non-Blocking for Payment, Required for Evidence):

  • ≥3 production integrations demonstrated via:
    • Wallet/plugin PRs merged or open
    • Public announcements from projects
    • Demo videos of SDK in production environments
    • GitHub stars, package downloads, forum testimonials

Funding:

  • Required: $11,000
  • Optional (Node add-on): $2,500
  • Milestone 3 Total (if optional activated): $13,500

Payment Schedule

| Milestone | Timeline | Required Track | Optional (Node add-on) | Milestone Total (if optional) |
|---|---|---|---|---|
| Milestone 1 | Weeks 1-5 | $15,000 | $5,000 | $20,000 |
| Milestone 2 | Weeks 6-10 | $12,000 | $2,500 | $14,500 |
| Milestone 3 | Weeks 11-16 | $11,000 | $2,500 | $13,500 |
| TOTAL | 16 weeks (4 months) | $38,000 | $10,000 | $48,000 |

Optional Node Rationale

The Optional (Node add-on) track sponsors a controlled lightwalletd test/staging node and extended chaos/monitoring infrastructure. This is not required to achieve the Required deliverables: the SDK can be fully developed and tested against existing public endpoints. However, the node accelerates development, improves reproducibility of edge cases, and reduces risk by providing:

Infrastructure Components:

  • Compute: VPS (16GB RAM, 4 vCPU, 400GB NVMe) for 4 months
  • NVMe Storage: High-speed SSD for blockchain data and reindexing tests
  • Bandwidth/Egress: Sustained testing traffic and multi-client simulations
  • Observability: Monitoring dashboards (Grafana/Prometheus), log aggregation
  • Backups & Snapshots: State preservation for reproducible failure scenarios
  • TLS Certificates: Proper HTTPS setup for realistic testing
  • Chaos Tooling: Latency injection, network partition simulation, block-reset scenarios

Why Optional:

  • All Required outcomes (SDK functionality, SLOs, packages) are achievable using existing public endpoints
  • The node is a development accelerator, not a production dependency
  • Controlled environment enables safer and faster iteration on failover logic
  • Reproducible chaos scenarios improve documentation and operator guides

If the Node add-on is not activated, the team will adapt testing strategies to rely exclusively on public endpoints and synthetic failure modes, with slightly longer iteration cycles.

Budget Breakdown

Required Track ($38,000):

| Category | Description | Amount |
|---|---|---|
| Core Development | SDK implementation (JS/TS, PHP/Python, Rust), circuit breakers, health monitoring | $20,000 |
| Testing & QA | Chaos engineering suite, performance benchmarks, integration tests | $6,000 |
| Documentation | API reference, integration guides, security docs, operator handbook | $4,500 |
| CLI & Tooling | CLI tool, status widget, monitoring integrations | $3,500 |
| Integration Support | Developer relations, community support, forum engagement | $3,000 |
| Project Management | Milestone coordination, reporting, KYC compliance | $1,000 |
| Subtotal | | $38,000 |

Optional Track ($10,000):

| Category | Description | Amount |
|---|---|---|
| Test Node Infrastructure | VPS compute (4 months), NVMe storage, bandwidth/egress | $3,500 |
| Observability & Monitoring | Grafana/Prometheus, log aggregation, alerting | $1,500 |
| Chaos Engineering Credits | Advanced chaos tooling (latency injection, partition simulation) | $2,000 |
| DevOps & Maintenance | Node setup, configuration, ongoing operation (4 months) | $2,500 |
| Backups & Certificates | State snapshots, TLS certificates, domain setup | $500 |
| Subtotal | | $10,000 |

Risk Assessment

Risk 1: Low Adoption by Wallet Developers

Mitigation:

  • Direct outreach to major wallet teams during development for feedback
  • Make integration trivial with drop-in compatibility
  • Track PRs, demos, and package downloads
  • Showcase integrations in milestone reports (non-blocking for payment)

Risk 2: Endpoint Behavior Variance

Mitigation:

  • Extensive testing across all known public endpoints
  • Configurable health check thresholds and compatibility modes
  • Document known endpoint quirks and recommended configurations

Risk 3: Security Vulnerabilities in Failover Logic

Mitigation:

  • TLS transport security and optional certificate pinning
  • Client-side idempotency tokens and txid de-duplication built-in from day one
  • Public code review on GitHub before production recommendations
  • SDK operates at networking layer only (no key custody reduces attack surface)

Risk 4: Timeline Delays

Mitigation:

  • Conservative 16-week (4-month) timeline with buffer
  • Phased scope (M1 focuses on JS/TS only, reducing complexity)
  • Team capacity reserved for Q1 2026
  • Milestone-based payments allow for scope adjustments

Post-Grant Maintenance Plan

Maintenance Commitment: Three months of post-grant support and security patches included in the grant budget.

Release Policy:

  • Semantic versioning (semver) for all packages
  • LTS (Long-Term Support) for minor versions with critical security patches
  • Security patches released within seventy-two hours for critical vulnerabilities

Support Channels:

  • GitHub Issues for bug reports and feature requests
  • Zcash Community Forum for integration support and announcements
  • SECURITY.md with responsible disclosure policy and security.txt contact

Branching Strategy:

  • main branch for stable releases
  • develop branch for ongoing development
  • Feature branches with pull request reviews
  • Automated CI/CD testing before merge

Success Metrics

Primary Success Metrics (Milestone Gating):

  1. Milestone 1: p95 failover ≤ 3s; overhead ≤ 50ms; core coverage ≥ 80%; demo video
  2. Milestone 2: 14-day SLO ≥ 99%; benchmarks published; second binding live
  3. Milestone 3: Public failover demo; all packages published; Hosh integration operational

Secondary Success Metrics (Tracked, Non-Blocking for Payment):

  • ≥3 wallet/plugin integrations showcased (PRs, demos, announcements)
  • ≥5 independent developers provide positive feedback
  • Package downloads and GitHub engagement
  • Community forum reports improved reliability

Long-Term Success Indicators:

  • SDK becomes de facto standard for Zcash wallet development
  • Existing infrastructure grants become more valuable due to intelligent coordination
  • Community forum reports fewer “wallet won’t sync” complaints

Why This Grant Matters to Zcash

Foundational Infrastructure That Unlocks Innovation:

The Zcash ecosystem needs reliable infrastructure before ambitious applications can thrive. By solving the lightwalletd reliability problem, we enable an entire category of future innovation. When developers know infrastructure won’t break their MVPs, they’ll build on Zcash. When users trust wallets won’t lose access to their funds, adoption grows.

Bitcoin proved the model years ago with Electrum’s multi-server architecture. Their documentation states: “Electrum servers are decentralized and redundant. Your wallet is never down.” Zcash deserves the same reliability standard.

Ecosystem Multiplication Effect:

Every grant that builds on Zcash infrastructure becomes more valuable when the infrastructure is reliable. This SDK is a force multiplier: it makes existing infrastructure grants more valuable by enabling intelligent client-side coordination and makes future application grants lower-risk by providing a reliable foundation.

Supporting Documents

Appendix: Evidence Links

Infrastructure Incident Documentation:

  1. ECC Emergency Mode (2022):
  2. Lightwalletd Block Height Resets:
  3. Zecwallet Server Outage:
  4. Nighthawk Wallet Infrastructure Shutdown (2024):

Contact:
Emilio Navarro Mejía
Email: hi@socialmask.org
GitHub: Emilio983 (Emilio Navarro Mejia)
Forum: @Emilio983

FYI, this is built into the tonic client library for gRPC in Rust.

I’d rather use that than depend on a failover infrastructure that becomes the single point of failure.


thx a ton, @hanh - that tonic::transport::Channel::balance_channel link is super helpful.

quick heads-up so I don’t give the wrong impression: we’re not building a centralized failover service. it’s a client-side sdk; every wallet/app keeps its own endpoint list + policy. no coordinator, no shared service, so no new single point of failure. the “node add-on” in the budget is just for staging/chaos tests, not for routing anyone’s traffic.

how this fits w/ what already exists in rust
• we’ll lean on tonic/tower for pooling/balancing and wrap it in an idiomatic layer.
• on top we add zcash-aware health: cross-checking block height, catching stale/reset servers, circuit-breaker + backoff, a bit of geo-latency pickin’.
• we add send-flow safety (idempotency tokens / de-dupe) so a failover during sendTransaction doesn’t accidentally double-send.
• same semantics beyond rust (JS/TS (WebZjs), PHP, Python) where there isn’t a real equivalent rn, so wallets, plugins, and serverless apps get the same reliability.
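
A rough idea of what the "zcash-aware health" check could look like: compare each endpoint's reported block height against the median of its peers, and separately flag an endpoint whose own height jumps backwards. This is only a sketch; the function names and the default thresholds (3 blocks of lag, 10 blocks of rollback) are arbitrary assumptions, not values from the proposal.

```typescript
// Hypothetical shape of one health probe result.
interface HeightReport {
  url: string;
  height: number; // latest block height reported by the endpoint
}

// Endpoints lagging the median height by more than maxLagBlocks are treated as stale.
function findStaleEndpoints(reports: HeightReport[], maxLagBlocks = 3): string[] {
  if (reports.length < 3) return []; // too few peers to form a majority view
  const heights = reports.map((r) => r.height).sort((a, b) => a - b);
  const median = heights[Math.floor(heights.length / 2)];
  return reports
    .filter((r) => median - r.height > maxLagBlocks)
    .map((r) => r.url);
}

// A single endpoint whose height drops sharply between probes has likely reset.
function isHeightReset(previousHeight: number, currentHeight: number,
                       maxRollbackBlocks = 10): boolean {
  return previousHeight - currentHeight > maxRollbackBlocks;
}
```

Using the median rather than the maximum means one endpoint reporting a bogus high tip cannot get every honest endpoint excluded.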

this isn’t only for wallets; it’s also for merchant plugins, bots, backends, and serverless funcs that talk to lightwalletd.

happy to tweak the shape if you’ve got a cleaner tower layer pattern you prefer - open to any pointers

Thank you for your comment, @hanh. This is important, because as far as I can tell, this infrastructure only makes sense if wallet developers want to use it. I would like to hear the same from the Zingo and Zashi teams. @nuttycom, @zancas, would you want to integrate the SDK from this proposal?


$875 / month for a VPS (16GB RAM, 4 vCPU, 400GB NVMe) sounds very expensive. Could you provide more details on this?


We priced the node on PRQ’s “Package 3D – Single Octa Core” (Intel Xeon 8-core, 16 GB RAM, 2×500 GB SATA RAID-1, 1 Gbit unmetered).
At today’s rate, 7,500 SEK ≈ 785 USD/month.

We budgeted 875 USD/month to be realistic in production. The ~90 USD difference covers:
• normal FX swings in SEK/USD (about ±5%),
• local VAT/taxes and provider fees where applicable,
• small transaction/egress overages during chaos tests.

So the 875 USD figure reflects the real, tax-inclusive operating cost for running this test node in PRQ’s privacy-oriented environment, not padding.


Thanks @hanh and @artkor - really appreciate the pushback.

Totally agree Rust devs can lean on tonic::transport for channel failover. What I’m proposing is a pure client-side SDK that brings comparable behavior to JS/TS (WebZjs), PHP, and Python (and a Rust helper for teams that want a uniform API). There’s no coordinator, no shared proxy, no cloud: every app ships its own endpoint list + policy, so there’s no new single point of failure.

A few design bits to clarify intent:
• Multi-endpoint health + latency pick with circuit-breakers and backoff.
• p95 failover ≤ 3s target under chaos tests; normal overhead ≤ 50ms.
• Idempotency + de-dupe on sendTransaction to prevent double-sends during retries.
• Works for wallets, merchant plugins, bots, and serverless backends that talk to lightwalletd, not just wallets.
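
To illustrate the de-dupe idea, here is a minimal sketch of a short-window txid cache that a client could consult before submitting a transaction. The class name, method names, and the ten-minute default window are assumptions for this example only; the clock is passed in so the behaviour is deterministic.

```typescript
// Hypothetical short-window cache keyed by txid.
class TxidCache {
  private readonly firstSeen = new Map<string, number>(); // txid -> first submission time (ms)
  private readonly windowMs: number;

  constructor(windowMs: number = 10 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Returns true if the caller should submit this txid now,
  // false if it was already claimed within the window.
  claim(txid: string, nowMs: number): boolean {
    this.evictExpired(nowMs);
    if (this.firstSeen.has(txid)) return false;
    this.firstSeen.set(txid, nowMs);
    return true;
  }

  // Call when no endpoint accepted the transaction, so a later retry is allowed.
  release(txid: string): void {
    this.firstSeen.delete(txid);
  }

  private evictExpired(nowMs: number): void {
    this.firstSeen.forEach((seenAt, txid) => {
      if (nowMs - seenAt >= this.windowMs) this.firstSeen.delete(txid);
    });
  }
}
```

In a send flow, the client would call `claim` once, try the raw transaction against endpoints in order, and call `release` only if every endpoint rejected it. Re-sending the same raw transaction to a second endpoint cannot create a second spend, so the cache mainly guards against the application layer triggering the send twice during a failover.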

Why this matters: in my own Zcash integrations I’ve hit flaky endpoints and ended up running full nodes just to keep moving. That’s expensive and pushes smaller teams away. Other ecosystems treat client-side failover as standard reliability tooling; Zcash devs outside Rust should have that too.

If Zashi or Zingo are open to trying a tiny pilot once M1 lands, I’ll tailor the API to your needs. Also happy to take must-have criteria (e.g., specific hooks, metrics, or error semantics) so we build the thing you’d actually use. Thanks again for sanity-checking the direction.


Thanks @hanh and @artkor for the comments; really good point about Rust.
If you’re already in Rust, tonic::transport::Channel::balance_channel does a great job providing client-side balancing and fast reconnection when an endpoint drops.

What we’re proposing isn’t to replace that, but to extend the same reliability model to other runtimes where it doesn’t exist: mainly JS/TS (WebZjs, browser/Node/serverless), PHP, and Python, plus a small Rust helper for teams who want a consistent API across stacks.
It’s entirely client-side: no proxies, no coordinator, every app ships with its own endpoint list and policy.


How it improves reliability beyond balance_channel

1. Multi-endpoint intelligence
Instead of just reconnecting, the SDK actively measures latency, height sync, and response health, always routing to the fastest and most up-to-date node. It uses weighted scoring, exponential backoff, and circuit breakers, keeping p95 failover under 3 seconds and normal overhead below 50 ms.

2. Zcash-aware safety
Unlike generic load balancers, this SDK understands lightwalletd. It adds built-in idempotency and transaction de-duplication, cross-checking the chain tip to avoid duplicate sends during retries, and supports optional certificate pinning for better integrity.

3. Developer experience & visibility
Includes an embeddable status widget, a CLI tool for endpoint diagnostics, JSON export for monitoring, and even a chaos-testing harness so teams can simulate node outages safely. It also integrates easily with ecosystem health dashboards.

4. Lower cost & friction
No extra infra, no single-point coordinator: the app itself manages its connection set. This reduces the need for teams to run their own full nodes just for reliability, which saves significant costs and bandwidth.

5. Ecosystem coverage beyond Rust
This isn’t only for wallets: it works for merchant plugins, marketplace backends, paywalls, bots, and analytics tools. Any project talking to lightwalletd can gain stable, multi-endpoint resilience with zero extra servers.
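
The "weighted scoring" mentioned in point 1 could be as simple as a penalty function over the measured signals, with the lowest penalty winning. The sketch below is an assumption about how such scoring might work: the field names, the weights, and the linear formula are all illustrative and would need tuning against real endpoint data.

```typescript
// Hypothetical per-endpoint measurements.
interface EndpointStats {
  url: string;
  latencyMs: number;     // recent measured response time
  successRate: number;   // 0..1 over a recent window
  blocksBehind: number;  // lag behind the best-known chain tip
}

interface Weights {
  latency: number; // per second of latency
  failure: number; // per unit of failure rate
  lag: number;     // per block behind
}

// Illustrative defaults only.
const DEFAULT_WEIGHTS: Weights = { latency: 1, failure: 5, lag: 2 };

// Lower is better: slow, flaky, or lagging endpoints accumulate penalty.
function penalty(s: EndpointStats, w: Weights = DEFAULT_WEIGHTS): number {
  return w.latency * (s.latencyMs / 1000)
    + w.failure * (1 - s.successRate)
    + w.lag * s.blocksBehind;
}

// Returns the endpoint with the lowest penalty, or undefined for an empty pool.
function pickEndpoint(stats: EndpointStats[],
                      w: Weights = DEFAULT_WEIGHTS): EndpointStats | undefined {
  let best: EndpointStats | undefined;
  for (const s of stats) {
    if (best === undefined || penalty(s, w) < penalty(best, w)) best = s;
  }
  return best;
}
```

With these weights, an endpoint that is fast but several blocks behind loses to a slightly slower endpoint that is at the tip, which matches the goal of routing to the fastest endpoint that is also up to date.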


Why it matters

Many Zcash developers don’t build in Rust.
Right now, their options are to tolerate flaky endpoints or pay to host redundant nodes.
This SDK makes reliable access to Zcash easier and cheaper, bringing wallet-grade reliability to the wider app ecosystem: web apps, bots, backends, and marketplaces.

If Zashi or Zingo want to try a small pilot after M1, I’ll happily tailor the API to your must-have hooks, metrics, and error semantics.
And if Rust already covers your workflow, great: this SDK just fills the gap for everyone else who doesn’t have balance_channel available.

Thanks again for the thoughtful discussion; this feedback helps shape something that benefits the whole ecosystem. @1337bytes @nuttycom @zancas @alchemydc @GGuy

Thank you for submitting your proposal. Following a thorough review by the ZCG and a period for community feedback on the forum, the committee has decided not to move forward with this proposal.

We sincerely appreciate the time and effort you invested in your application and encourage you to stay involved and continue contributing to the Zcash community.