OpenSnell

Performance

Where the Go implementation stood vs. the official C/libuv binary, what we changed, and where we land today.

OpenSnell is benchmarked against the official snell-server v5.0.1 — the closed-source binary that ships behind the Surge client.

Method

Two co-located Linux hosts. One runs both servers on different ports so the upstream link, kernel, and CDN cooldown apply equally; the other runs two snell-client instances pointed at each. All traffic goes through SOCKS5 via curl --socks5-hostname to the same upstream URL.

Three phases, each run sequentially (never simultaneously) with a several-second pause between subjects so the upstream CDN doesn't throttle one side of the comparison:

  1. Latency — 50 sequential requests to a tiny endpoint (cloudflare.com/cdn-cgi/trace, ~200 B response). Measure time_connect, TTFB, and total via curl -w.
  2. Concurrent throughput — N = 2, 4, 8 parallel downloads of a 10 MB file. Measure aggregate MB/s = total bytes ÷ wall clock.
  3. Packet inspectiontcpdump server-side during a single 10 MB download per variant; count full TCP segments vs. empty ACKs.

What the official binary actually is

The official snell-server-v5.0.1-linux-amd64 was disassembled: 1.2 MB, statically linked, section headers stripped. String analysis shows it is built with GCC, links libuv (the same async-I/O library curl and Node.js use), and uses OpenSSL's AES-NI GCM implementation (the distinctive GCM module for x86_64 string is present).

In short: C/C++ + libuv + OpenSSL. That matters because libuv runs the whole proxy on a single event-loop thread — no per-connection goroutine, no GMP scheduling, no GC.

Initial finding (OpenSnell v1.0.1)

MetricOpenSnell v1.0.1Official v5.0.1Δ
TTFB medianwithin noisewithin noise~0
Single-stream throughputtiedtied~0
N = 8 concurrent throughput6.49 MB/s8.46 MB/s−30 %
Empty ACKs over a 10 MB transfer14441084+33 %

Single-stream and latency were already on par with the official server. The gap was concentrated in concurrent throughput.

Root cause

v4Reader.readFrame() deserialised every snell frame with two distinct io.ReadFull calls — one for the 23-byte AEAD'd frame header, one for padding + payload + tag — and the underlying net.Conn was being read directly, with no userspace buffering. At a typical frame size of ~1.5 KB, a 10 MB transfer touches ~7300 frames and therefore costs ~14 000 recv() syscalls per direction.

Two things follow from that:

  1. Empty ACKs. Linux delays ACKs when an application drains the receive buffer in big bursts, but issues them more aggressively when the buffer is drained through many small reads. Two syscalls per frame == many small reads == defeat delayed-ACK == ~33 % more empty ACKs on the wire than the C reference.
  2. Concurrent throughput. Each snell connection runs two goroutines (one per direction). At N = 8 concurrent SOCKS5 sessions that is 16 goroutines, each doing thousands of small syscalls and trading off through Go's runtime scheduler. libuv pays none of that — its single epoll-driven thread can absorb new TCP data at full rate.

Fix

One line:

// components/snell/v4.go — initReader()
c.r = &v4Reader{Reader: bufio.NewReaderSize(c.Conn, 64*1024), aead: aead}

A 64 KB read-side buffer pulls ~40 max-sized snell frames into userspace per recv(), cutting syscalls on the read path by roughly ~90×. This is a wire-format-transparent change: the v4 frame parser still sees the exact same byte stream, just delivered through fewer syscalls.

After OpenSnell v1.0.2

MetricOpenSnell v1.0.2Official v5.0.1Δ
TTFB median17.9 ms17.1 ms+4.7 %
TTFB p9525.4 ms24.5 ms+3.7 %
N = 2 throughput43.48 MB/s44.44 MB/s−2.2 %
N = 8 throughput47.34 MB/s48.19 MB/s−1.8 %
Empty ACKs over a 10 MB transfer25962343+10.8 %

The concurrent throughput gap collapsed from −30 % to −1.8 %, and the empty-ACK excess dropped from +33 % to +10.8 %. The remaining ~11 % ACK excess and ~2 % throughput delta is plausibly attributable to Go's runtime overhead vs. a hand-written C event loop — and below the noise floor of any realistic workload.

Takeaway

On Surge's published wire (snell v5), OpenSnell's snell-server runs at roughly 98 % of the official C reference under concurrency and is indistinguishable in latency. The bufio fix is +9/−1 lines in components/snell/v4.go — a useful reminder that profiling the read path (and not just application logic) is where most of the gap to a native C/libuv implementation lives.

On this page