Zecwallet Infra Funding for increased traffic

Applicant background

Zecwallet is a desktop and mobile Zcash wallet with full support for shielded and transparent transactions. Over the last month or so, traffic to Zecwallet’s LightwalletD servers has increased ~4x (probably correlated with the price of Zcash) as old and new users refresh their wallets.

This has put a large amount of strain on the Zecwallet server infra. Zecwallet needs to add a new server to cope with the increased traffic; without it, I’m worried the infra will melt if there is another large influx of users.

Motivation and overview

Zecwallet’s LightwalletD server served ~25 million blocks yesterday, with peak traffic exceeding 50,000 blocks/min, which is approx. 4x what it was at the start of Jan. (Note that this is all server-side bandwidth usage. Zecwallet clients don’t have any logging, so it is hard to estimate how much of this comes from new vs. existing Zecwallet users.)
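
As a rough sanity check on these numbers, the sketch below derives the average serving rate from the daily total and compares it with the quoted peak. Both inputs are the figures stated above; the variable names are illustrative only.

```python
# Figures quoted in the post; everything else is derived from them.
BLOCKS_SERVED_PER_DAY = 25_000_000  # "~25 million blocks yesterday"
PEAK_BLOCKS_PER_MINUTE = 50_000     # "peak traffic exceeding 50,000 blocks/min"
MINUTES_PER_DAY = 24 * 60

# Average rate if the daily total were spread evenly over the day.
average_per_minute = BLOCKS_SERVED_PER_DAY / MINUTES_PER_DAY

# How much higher the peak is than that average.
peak_to_average = PEAK_BLOCKS_PER_MINUTE / average_per_minute

print(f"average: ~{average_per_minute:,.0f} blocks/min")
print(f"peak is ~{peak_to_average:.1f}x the average")
```

On these figures the peak is roughly three times the daily average, so capacity has to be sized for bursts rather than for the mean load.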

This grant is based on the previous maintenance grant, but simplified down to only the basics needed to keep the lights on.

It features 3 work items:

  1. Add a new LightwalletD server in Asia

Add a new infra setup (LightwalletD server + zcashd fullnode *2) in a primary-backup configuration in an Asia AWS region. This will need 2x t3.xlarge machines (Zecwallet’s LightwalletD has higher memory and disk requirements; please see the previous grant application).

This work item also includes the devops needed to set up the new region.

  2. Zecwallet Fullnode

To protect against eclipse attacks, the zcashd nodes powering the LightwalletD run with maxconnections=500. A side effect of this is that the nodes are connected to a lot of new Zecwallet Fullnode users who are syncing the blockchain from scratch, and they end up serving a lot of bandwidth to those users.

Because of this issue, some nodes are downloading vastly more data than needed, probably causing a large corresponding increase in Zecwallet’s server bandwidth costs. This was fixed here

I need to investigate and patch this in the embedded Zecwallet Fullnode zcashd node, which should help reduce server traffic costs and ease some infra pressure. (Note: it is just a hunch that this issue is causing a large amount of bandwidth to be used; I need to investigate to be sure first.)

  3. Fix checkpoints

New Zecwallet users need to sync from a known ‘checkpoint’ to catch up to the latest block height on a lightclient. To help with decentralization, Zecwallet has so far embedded the checkpoint (which contains the block hash and the sapling commitment root) directly into Zecwallet’s code so that the users don’t have to trust the LightwalletD server.

A side effect of this is that Zecwallet needs regular releases to update the embedded checkpoint. The last Zecwallet release was over a month ago, which causes new users to download ~50k blocks just to catch up to the chain tip. This is unnecessary, and can probably be optimized away.
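
To put the ~50k figure in context, here is a back-of-the-envelope sketch of how the sync backlog grows as the embedded checkpoint ages. It assumes Zcash’s 75-second target block spacing (post-Blossom); the function names and constants are illustrative and are not taken from Zecwallet’s code.

```python
# Rough estimate of the sync backlog caused by an ageing embedded checkpoint.
# Assumption: 75-second target block spacing (Zcash after the Blossom upgrade).
TARGET_SPACING_SECONDS = 75
SECONDS_PER_DAY = 24 * 60 * 60


def blocks_behind(days_since_checkpoint: float) -> int:
    """Approximate number of blocks mined since the checkpoint was embedded."""
    return int(days_since_checkpoint * SECONDS_PER_DAY / TARGET_SPACING_SECONDS)


def days_for_backlog(blocks: int) -> float:
    """Approximate checkpoint age, in days, that produces the given backlog."""
    return blocks * TARGET_SPACING_SECONDS / SECONDS_PER_DAY


if __name__ == "__main__":
    for days in (1, 7, 30):
        print(f"{days:>2} days old -> ~{blocks_behind(days):,} blocks to sync")
    print(f"50,000 blocks -> ~{days_for_backlog(50_000):.1f} days")
```

Under that assumption the backlog grows by about 1,152 blocks per day, so ~50k blocks corresponds to a checkpoint roughly six weeks old, which is in line with a release that is over a month old.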

Technical approach

For (1), we already have deployed infrastructure, so the technical approach is just to copy the deployment to a new AWS region and reconfigure the DNS, SSL, etc.

AWS Server costs

This section describes funds that are spent directly on AWS, which hosts the Lightwallet infrastructure.

  • USD 2,000 / month

  • 6x AWS large instances (across 2 regions) for zcashd and LightwalletD (primary + backup)

  • 1x AWS t2.small instance for load balancing, website and other static hosting

  • AWS bandwidth costs, including zcashd serving blocks

  • SSL certs, App store developer fees and other miscellaneous one-time costs

This grant requests funding for 6 months of server costs.

For (2), we’ll measure bandwidth usage on both client and server with the patch applied, and deploy a new release if the patch reduces it.

For (3), we’ll add a new checkpoint to the lightclients, make new desktop + mobile releases, and look at pulling in the work done by the ECC wallet team to use the sapling root delivered from zcashd → lightwalletD → Zecwallet lightclient.

I estimate this is about 5 days’ worth of work (for all 3 items).

Execution risks

Since the scope is pretty small and the work items are well understood, the execution risk is minimal. There is a larger risk that Zecwallet’s LightwalletD is centralized, and needs to become compatible (at least on the API level) with the ECC’s lightwalletD so that users can easily switch between providers.

Downsides

This grant will help ease the infra/bandwidth costs a bit, and it also makes progress in reducing costs in the long run, but more can be done to address centralization and decrease costs.

Evaluation plan

Success of this project is mainly measured around:

  • Lightclient infrastructure uptime >99.5%
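
For a sense of what that target allows, the sketch below converts an uptime percentage into a downtime budget. The 30-day and 182-day periods (one month, and roughly the six months this grant covers) are assumptions chosen for illustration.

```python
# Downtime budget implied by an uptime target over a given period.
def allowed_downtime_hours(uptime_target: float, period_days: float) -> float:
    """Hours of downtime permitted while still meeting the uptime target."""
    return (1.0 - uptime_target) * period_days * 24


if __name__ == "__main__":
    # 99.5% is the target stated in the evaluation plan.
    print(f"per 30 days:  {allowed_downtime_hours(0.995, 30):.1f} hours")
    print(f"per 182 days: {allowed_downtime_hours(0.995, 182):.1f} hours")
```

At 99.5%, the budget works out to about 3.6 hours of downtime per 30 days.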

Budget and justification

Code and Devops :

  • USD 7,500 (@USD 187.5/hr)

  • USD 12,000 Cloud infrastructure (USD 2,000 * 6 months)

I’m very much in favor of funding this and I think we should be able to make a decision on this within days. Our meeting is on Tuesday evening and I hope we can decide even sooner than that.

+1 for immediate funding.

One question on this (it’s more for my curiosity than anything else and isn’t necessary for grant approval, so a response is not urgent and this question isn’t blocking anything): does the ECC work you’re referring to automatically generate the checkpoint in the release process?

Or is there a proposal for setting the checkpoint to the latest block, on the fly, on startup, rather than during the release process?

Having to release weekly to give users a good first time startup experience of fast syncing (and keep bandwidth costs from going up) isn’t ideal, but also weakening the security properties of the light client isn’t okay either. What’s the latest thinking from you and ECC on the best way to do this?

zcashd now has an RPC that allows LightwalletD to read the sapling root of the latest block, so the LightwalletD can now generate the checkpoint (which is a triplet of (block number, block hash, sapling root)) on the fly, for each block.

This opens up some good options.

  1. We could have new wallets query multiple lightwalletD servers, and if they agree on what the root is at a block, use that.
  2. Get the sapling root from a “trusted” location (instead of from the binary), and compare that against the LightwalletD’s root.
  3. Get the n-1 checkpoint, followed by the last n-1 compact blocks, and calculate the latest checkpoint, which should match what the LightwalletD reports.
  4. Temporarily trust the LightwalletD’s checkpoint, which will allow the user to start using the wallet immediately. In the background, fetch the blocks from the last known checkpoint, and verify that the LightwalletD is telling us the correct root. Since most new wallets need to receive some funds first before spending, this gives the wallets a few minutes to verify the checkpoints without having to make the user wait.
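
As an illustration of option 1, here is a minimal sketch of the agreement check a wallet could run after asking several servers for the checkpoint at the same height. It is not Zecwallet’s or ECC’s implementation: the `Checkpoint` type, the `agreed_checkpoint` function and the `min_servers` parameter are hypothetical names, and the network calls that would produce the responses are deliberately left out.

```python
from typing import NamedTuple, Optional, Sequence


class Checkpoint(NamedTuple):
    """The (block number, block hash, sapling root) triplet described above."""
    height: int
    block_hash: str
    sapling_root: str


def agreed_checkpoint(
    responses: Sequence[Optional[Checkpoint]], min_servers: int
) -> Optional[Checkpoint]:
    """Return the checkpoint only if enough servers answered and all agree.

    `responses` holds one entry per server queried for the same height;
    None means that server did not answer. Any disagreement returns None,
    so the caller can fall back to the checkpoint embedded in the binary.
    """
    answers = [r for r in responses if r is not None]
    if len(answers) < min_servers:
        return None  # too few independent answers to rely on
    first = answers[0]
    return first if all(a == first for a in answers) else None


if __name__ == "__main__":
    good = Checkpoint(1_000_000, "hash-a", "root-a")
    bad = Checkpoint(1_000_000, "hash-a", "root-b")
    print(agreed_checkpoint([good, good, None], min_servers=2))  # agreement
    print(agreed_checkpoint([good, bad], min_servers=2))         # conflict
```

Treating any conflict as a failure, rather than taking a majority vote, is the conservative choice here: a single honest server disagreeing is enough to make the wallet fall back to the slower but trusted embedded checkpoint.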

Some combination of these will likely help, and I can hopefully work with the other wallet developers/ECC’s wallet team to come up with an approach that works.

@adityapk00 I am pleased to inform you that ZOMG has approved your proposal in an emergency session majority vote. You should be contacted by the Zcash Foundation soon for disbursement of funds.

Big spikes in new users are a great “problem” for Zcash to have, and ZOMG is supportive of the work you are doing for the community!

Congratulations

Thank you @ZOMG

I will get to work right away.

This response was really helpful, thanks! And I’m looking forward to hearing about what you come up with!

Update on the work items in this grant:

1. Add new LightwalletD servers

This work item is done. My testing showed that moving to bigger machines gives higher performance than adding new servers, so we now run 2 high-network ops and NVMe servers (main + backup), which can easily handle the increased traffic and give us plenty of headroom for growth.

2. Zecwallet Fullnode

zcashd v4.2.0 really does seem to consume a lot of bandwidth for syncing nodes from scratch, and @LarryRuane’s patch cuts bandwidth use dramatically (~50% in my testing!).
As a result of these tests, I patched in this fix and have released new versions of Zecwallet Fullnode for Linux, Win and Mac, which should improve sync performance considerably (and have a nice side effect of reducing server bandwidth usage too).

The next release of zcashd should have this fix as well, and when it is released, I’ll make another release of Zecwallet Fullnode.

3. Checkpoints

New versions of Zecwallet Lite were released for desktop, iOS and Android, helping the sync speed and bandwidth usage for new users. Work on fixing this for the long term (using some mechanism of automatic checkpoints) is ongoing.

Other observations:

  • zcashd + lightwalletD seem to run on ARM64 as well, and since these instances are cheaper (at least on AWS, cheaper by ~40%), there is potential to cut server costs by moving to ARM64 instances. But more testing is needed to understand capacity.
  • The API incompatibility between ECC’s LightwalletD and Zecwallet’s LightwalletD is causing considerable user confusion, especially for users that want to run their own infra. This should be addressed (hopefully before Orchard/Halo 2), which will also improve the decentralization of the network.

Thanks for the update Aditya!

This seems actually really important both to user experience and infrastructure costs, but it’s delicate because it affects security properties.

For others following along, what’s at stake here is that, until this is addressed, the amount of time first-time users need to wait before their Zcash wallet is working increases steadily each day.

So a user who gets the app the day of the release will have this amazing experience of everything firing up almost instantly, but a user who gets the app 1 month after the release will have the painful experience of waiting quite a while before they can do much.

I’m glad you’re working on this and I’m curious what ECC thinks is the best solution.

I think #4 in your list above (starting from the latest block but going back and checking more blocks in the background) is a really cool solution and could be really useful for viewing keys too. (When working with viewing keys you’re often in a situation where there are transactions that happened before first sync that the user still cares about—because that’s why they imported the viewing key—but you don’t want to make them wait to sync tons of blocks before they can do anything.)

Since your own time seems like the most limited resource, it might be best to spend your time on other things than reducing server costs. But it’s up to you of course, and I can see situations where lower server costs will save time or improve UX in other ways!

I’m still very much in favor of funding this, and still have some questions!
