Skip to content

smp-server: support namespaces#1784

Open
shumvgolove wants to merge 31 commits into
masterfrom
sh/smp-namespace
Open

smp-server: support namespaces#1784
shumvgolove wants to merge 31 commits into
masterfrom
sh/smp-namespace

Conversation

@shumvgolove
Copy link
Copy Markdown
Collaborator

No description provided.

@shumvgolove shumvgolove force-pushed the sh/smp-namespace branch 2 times, most recently from 13bb6f5 to a6b960b Compare May 22, 2026 09:48
Field additions in NameRecord will be gated by SMP version bumps,
matching the IDS QIK precedent at Protocol.hs:1912-1979.
The separate nrVersion field created a redundant version axis.

Add unNameOwner and unNameLink so downstream consumers can read
back the underlying bytes/text without the bare constructors.
Add Resolver party, RSLV command, NAME response, NameRecord (with
NameOwner/NameLink newtypes + smart ctors), LookupKey, and version
constant namesSMPVersion = 20 in Transport.hs.

Bump current/proxied SMP versions 19/18 -> 20.

Includes minimal stub clauses in vc/processCommand so the build passes;
proper forwarded gate and handler land in steps 2-3.
Add Bool forwarded arg to verifyTransmission/verifyLoadedQueue/
verifyQueueTransmission so vc SResolver (RSLV _) can distinguish direct
(reject CMD PROHIBITED) from forwarded (verify, defer auth to handler).
Direct path passes False; forwarded path (rejectOrVerify) passes True.
Add Cmd SResolver (RSLV _) to the forwarded-command whitelist in
rejectOrVerify so RSLV can be processed inside PFWD.

Relocate incStat from Server.hs to Server/Stats.hs and export it
so the upcoming Resolver module can use it.

processCommand still returns ERR AUTH for RSLV — real handler lands
once Env, Stats, and Resolver subtree are in place (steps 4-7).
Add Server/Names.hs skeleton (NamesConfig, RpcAuth, NamesEnv,
ResolveError, newNamesEnv, closeNamesEnv, stub resolveName). Real
implementation lands in step 5.

Add namesConfig field to ServerConfig and namesEnv field to Env.
newEnv constructs NamesEnv when [NAMES] enable: on, and refuses
startup when combined with allowSMPProxy unless
allow_dangerous_colocation is set. closeServer now closes the
NamesEnv on shutdown.

Add [NAMES] section parser (readNamesConfig) with parseEthAddr
(0x-prefixed 40 hex, EIP-55 check deferred to step 5) and
parseRpcAuth (bearer / basic). Default INI template carries the
[NAMES] block (enable: off) with documented keys.
Add Server/Names/Eth/SNRC.hs: Keccak-256 namehash (NOT SHA3-256), SNRC
selector, encodeGetRecord, and bounded ABI primitives (decodeWord256Int64,
decodeAddress, decodeString, decodeStringArray) enforcing the 8 safety
invariants. decodeGetRecord is a placeholder that maps non-zero owners
to Nothing pending the Part 1 contract ABI; the primitives are
production-ready.

Add Server/Names/Eth/RPC.hs: EthRpcEnv with http-client Manager,
http-client-tls for HTTPS, QSem-bounded concurrency, brReadSome-bounded
response body, Basic/Bearer Authorization, JSON-RPC envelope, and
scrubUrl that strips userinfo from log strings.

Add Server/Names/Resolver.hs: NamesConfig, ResolveError, NamesEnv with
HashPSQ cache (TTL + byte cap + FIFO eviction) and inflight TMap.
resolveName coalesces concurrent identical requests under E.mask so the
TMap is always cleaned up; fetchOnceTimed bounds RPCs by rpcTimeoutMs.

Refactor Server/Names.hs into a thin façade re-exporting from Resolver
(types/functions) and Eth.RPC (RpcAuth) — keeps the import surface
stable for Env/STM.hs and Main.hs while the implementation modules
break the cyclic import.

Cabal: add http-client, http-client-tls, network-uri, psqueues deps;
expose the three new Server.Names.* modules in the !flag(client_library)
block.
Add NameResolverStats and NameResolverStatsData sub-records to
ServerStats with seven counters: reqs, succ, notFound, ethErrs,
cacheHits, cacheMiss, disabled. Provide newNameResolverStats,
get/getReset/setNameResolverStats and a StrEncoding instance keyed by
"rslvStats:" with backwards-compatible parsing (zero defaults).

Append the seven counters to the daily CSV stats log and emit a
simplex_smp_names_* Prometheus metric block.
Pass rslvCacheHits and rslvCacheMiss IORefs from ServerStats.rslvStats
into NamesEnv at construction so the resolver's cache hit/miss counters
share storage with the periodic stats exporter (no double bookkeeping).

Replace the ERR-AUTH stub in processCommand with the real RSLV handler:
increment rslvReqs, dispatch to resolveName when namesEnv is Just, map
Right -> NAME / NotFound -> ERR AUTH / other Left -> ERR AUTH while
bumping the respective counters.
Add tests/SMPNamesTests.hs covering the highest-risk surface:

  - NameRecord wire encoding round-trip, negative-expiry rejection,
    combined channel+contact list cap, max-payload size budget.
  - LookupKey 64-byte parser cap; mkNameOwner / mkNameLink length
    invariants; smart-constructor / accessor round-trip.
  - Keccak-256 reference vectors (empty, "abc"); assert Keccak ≠
    SHA3-256; namehash of "" and "eth" matches ENS vectors; selector
    and encodeGetRecord byte layout.
  - ABI primitive bounds: each of decodeWord256Int64,
    decodeAddress, decodeString, decodeStringArray fails its
    relevant safety invariant without crashing.
  - decodeGetRecord: zero-owner returns Nothing, truncated buffer
    returns AbiTruncated.
  - Resolver: stub-backed resolveName routes zero-owner to NotFound,
    counts as a miss; concurrent identical lookups don't crash.

Tests run via `cabal test --test-option=--match="Names resolver
tests"` and complete in ~5ms.

Defer to follow-up: ForwardedRslvSpec (PFWD round-trip), MockRpcSpec
(fake HTTP server), AbiSpec fixture-based encoder, StartupGuardSpec,
UrlValidationSpec, EipChecksumSpec — all of which need either a
running server harness or pinned binary fixtures.

Derive Eq for ResolveError so tests can `shouldBe` against
constructors.
Bump SMP protocol version 19 -> 20 (header line, abstract, version
history list). Add "Resolver commands" subsection to the SMP commands
section with:

  - RSLV / NAME ABNF and byte layout for NameRecord (display name,
    owner, channel + contact links capped at 8 combined, optional
    admin contact, expiry, isTest).
  - Forwarded-only access via PFWD — direct RSLV rejected with
    CMD PROHIBITED.
  - Error semantics — every failure collapses to ERR AUTH; per-cause
    stats are out of band.
  - Note that the backing store is implementation-defined (the
    reference impl uses an Ethereum SNRC contract).

Update "Router security requirements" to spell out the names-role
outbound HTTP threat: the lookup key reaches the Ethereum endpoint,
so operators MUST run their own endpoint and MUST NOT co-locate the
names role with the SMP proxy role (RSLV cache miss can serialise
other forwarded commands).
SMP server gains public-namespace resolution (SMP protocol v20).
The names role is disabled by default; configure [NAMES] to enable.
The hard refusal in newEnv assumed allowSMPProxy was a real on/off
flag, but it's hardcoded to True in Server/Main.hs:606 — there's no
[PROXY] enable key. So every smp-server with [NAMES] enable: on would
fail to start unless the operator also set allow_dangerous_colocation,
which made that flag mandatory rather than an opt-in.

Demote to a one-time startup warning. The performance footgun is real
(slow RSLV cache miss can head-of-line-block other forwarded commands
on the same proxy-relay session) but it's a soft problem the operator
should be told about, not a hard refusal blocking trivial deployments.

Tighten back to a refusal once [PROXY] gains a real enable toggle or
forkForwardedCmd makes the head-of-line concern moot.
The flag served no purpose once the hard refusal became a warning —
operators couldn't change behaviour, only silence the message. Remove
it from NamesConfig, INI parser, default template, plan, and tests.
The warning now fires unconditionally for any [NAMES] enable: on
deployment co-located with the proxy role (i.e. every smp-server,
since proxy is always on).
decodeString returns raw bytes; the module header invariant 8 promised
UTF-8 validation via decodeUtf8' returning AbiBadUtf8, which was never
implemented. Add decodeUtf8Text :: ... -> Either AbiError Text that
composes decodeString + decodeUtf8' and returns AbiBadUtf8 on invalid
input. This will be used by the eventual NameRecord decoder for
displayName / link / admin fields; AbiBadUtf8 becomes a real return
path instead of a phantom constructor.
The previous leader-side catch used E.catch on SomeException, which
silently converted AsyncCancelled / ThreadKilled into a normal
Left EthHttpErr result and let the leader thread continue running.
Supervisors that try to cancel an outstanding eth_call (server
shutdown, client disconnect) had no way to actually abort the lookup.

Rewrite the leader path with E.try + SomeAsyncException detection:
on async exception, fill the leader's TMVar with EthHttpErr so waiters
unblock, remove the inflight entry to drop the cache pollution, then
rethrow so the supervisor sees cancellation actually take effect.
Synchronous exceptions still collapse to EthHttpErr as before.
The constant 4096 made cacheMaxBytes effectively a 16384-entry count cap
under defaults, regardless of actual content size. A maximal NameRecord
(8 × 1024-byte links + texts) is ~17 KB real memory, so operators who
set cacheMaxBytes = 64 MB could see ~280 MB actual usage.

Sum the encoded byte lengths of the variable-length fields (displayName,
links, admin contact/email) plus a 256-byte fixed overhead for the
wrapper (PSQ node + ByteString headers + key copy). Tight enough that
cacheMaxBytes becomes a meaningful upper bound; still cheap to compute
on insert.
Two pieces of dead code from the public-names branch:

- unLookupKey was exported but had zero callers; destructure via the
  constructor (RSLV (LookupKey key)) where needed. Delete.

- scrubUrl was exported but had zero callers; useful enough to wire in
  rather than delete. Add a logInfo line at newEnv when [NAMES] enable
  is on, showing the configured endpoint with userinfo stripped — so
  operators get auditable visibility without leaking basic-auth
  credentials into logs.
Server/Main.hs:hexDecode and Names/Eth/RPC.hs:decodeHex were byte-for-byte
identical, plus Main's version was partial-by-contract (caller had to
pre-validate length; B.index s 1 crashes on odd input). Export the
validating fromHex from Eth/RPC, rewrite parseEthAddr to use it, drop the
local hexDecode and the now-redundant isHex predicate.

Also clean up trailing imports unused after earlier refactors:
- Env/STM.hs no longer needs closeNamesEnv (used only in Server.hs)
- Main.hs no longer needs Control.Applicative.<|> (parseEthAddr was the
  only consumer)
- Resolver.hs fetchOnce never used the as-pattern env binding
The plan promised a URL validator for [NAMES] ethereum_endpoint that
rejects userinfo, query, fragment, requires an explicit port, and only
permits http/https schemes. Was not implemented — the previous code
passed the raw string straight to http-client's parseRequest, which
accepts userinfo (silently honoured: credentials leak via Host header
and logs).

Add validateUrl using network-uri's parseAbsoluteURI. Reject:
- non-http(s) schemes
- empty host
- userinfo (use rpc_auth instead — separate concern, separate parsing,
  no header-injection surface)
- missing explicit port (a config like "http://localhost" silently
  defaults to :80 when Reth listens on :8545 — that's a likely-typo)
- query string
- URL fragment

Also reject https://non-loopback without rpc_auth — almost always a
misconfiguration. Loopback exempt so http://127.0.0.1:8545 without auth
remains the recommended same-host topology.

Surfaces validation failures at startup as fail-fast error with the
specific reason ("[NAMES] ethereum_endpoint: ..."), rather than the
silent SSRF/credential-leak hazard before.
Previous test stubbed ethCall as a fast non-blocking function and
asserted callCount >= 1 — which a broken impl that fires one ethCall
per caller also satisfies. The test passed against any implementation.

Block the stub on a TMVar gate so the leader's eth_call cannot return
before the 7 waiters race to register at the inflight TMap. Fire 8
concurrent resolveName calls in a background async, sleep 50 ms to let
the waiters attach, then release the gate. Assert callCount == 1 — only
the leader hit the RPC.

Now the test genuinely fails if coalescing breaks.
decodeWord256Int64 rejected values with any non-zero byte in the high
24 bytes, but accepted values 2^63 .. 2^64-1 in the low 8 bytes — those
silently decode to *negative* Int64 because the Int64 type is signed.
Downstream length math (decodeString len, decodeStringArray cnt) sees
the negative value, compares len > cap as False, off + 32 + len as
not-overflowing, then B.take clamps negatives to 0 and returns empty
bytes instead of failing. A malicious or buggy SNRC contract could
null out string and array fields with no AbiOversized / AbiTruncated
signal.

Add the sign-bit check on byte (off + 24). Both invariants are now
necessary and together sufficient for a safe Int64 cast.

Add tests covering the sign-bit case and the max representable
positive value (Int64.maxBound = 0x7FFF...FFFF).
Infura-style endpoints like https://mainnet.infura.io/v3/<API_KEY>
embed the secret in the URL path. They were accepted by the previous
validator (path wasn't checked), and the startup logInfo line scrubbed
only userinfo — not path — so the API key would land in journald.

Reject any path beyond / at config time. Operators with a path-style
endpoint must move the credential into rpc_auth (Basic or Bearer)
instead. Loopback endpoints (http://127.0.0.1:8545) naturally have
empty/"/" paths, so the same-host topology is unaffected.

Also flips the URL leak vector promised to be defended-against when
scrubUrl was wired into the startup log.
- CHANGELOG bullet 6 referenced allow_dangerous_colocation which was
  dropped in 7f94f49, and the hard refusal was demoted to a warning
  in f9269be. Rewrite the bullet to match current behaviour.

- The [NAMES] INI template's snrc_address example was the zero address.
  An operator uncommenting it verbatim would have a syntactically valid
  config that eth_calls to 0x0, gets zero bytes back, hits the
  zero-owner sentinel, and silently AUTHs every RSLV. Replace with
  "0x<paste-your-contract-address>" so an unsubstituted copy fails
  hex parsing at startup.

- Add an INI comment explaining cache_max_entries / cache_max_bytes
  interaction — whichever fills first triggers eviction.
Previously fetchOnce returned Left NotFound and coalesce skipped cache
insertion for any Left, so every RSLV for a non-existent name hit
eth_call. An adversary holding a proxy connection could spam unique
non-existent names to keep rpcMaxConcurrency saturated indefinitely.

Cache NotFound results with a shorter TTL (30s, bounded by cacheSeconds)
so newly-registered names become visible quickly while still absorbing
unique-name bursts.

CacheEntry now carries ceResult :: Maybe NameRecord (Nothing = NotFound)
and ceTtlNs per-entry so positive and negative results expire on
different schedules. Transient errors (HTTP / decode / timeout) still
bypass the cache so a flapping endpoint doesn't durably poison entries.

Add a regression test: two sequential NotFound lookups must produce
exactly one ethCall (hit + miss counters confirm).
nrExpiry was parsed and bounds-checked at the wire layer but never
compared against current time. A pre-upgrade or buggy SNRC contract
that returns expired records unmodified would have its data cached
and served to clients.

In fetchOnce, after a successful ABI decode, compare nrExpiry to the
current POSIX time. If nrExpiry > 0 and < now, treat the record as
NotFound (now also cache-friendly per the previous fix). nrExpiry == 0
is the sentinel for "never expires" (reserved names); unaffected.

This is defense in depth — the contract is expected to filter expired
records itself by returning the zero-owner sentinel — but the smp-server
also enforces, so a stale contract doesn't poison the user experience.
Caddy and most HTTP/RFC 7235 docs write the auth scheme capitalized
(Bearer / Basic). Operators who paste "Bearer <token>" verbatim into
rpc_auth previously got a parse error because the matcher only accepted
the all-lowercase form.

Fold the scheme keyword to lowercase before matching. Token / user / pass
preserve case (HTTP wire convention applies the scheme name only).
Two observability fixes for [NAMES] enable: on deployments:

1. Endpoint probe. newEnv now calls pingEndpoint immediately after
   constructing NamesEnv: a single eth_call to getRecord(namehash "")
   against the configured snrcAddress. Logs "endpoint probe ok" on
   success or a warning on transport failure (DNS, TLS, refused, 4xx,
   5xx). JSON-RPC errors are tolerated — those indicate the endpoint
   is reachable but possibly misconfigured, which surfaces later via
   the rslvEthErrs counter.

   Deliberately does NOT exitFailure: a network blip or an Ethereum
   host that comes up minutes after smp-server should not block the
   server from accepting non-RSLV traffic. The warning is the operator
   signal.

2. Log exception text before mapping to EthHttpErr. The previous
   mapSyncEthExn returned EthHttpErr with no diagnostic — TLS errors,
   DNS failures, IOExceptions all collapsed identically and the only
   observability was the aggregate counter. Now the original exception
   message is logged at the synchronous catch point in coalesce.
Five parallel cases mapping EthRpcError to ResolveError were inlined in
fetchOnce, conflating the structural mapping with the resolver's main
flow. Extract mapEthRpcError so fetchOnce reads as one round-trip:
"call, map transport errors, decode, expiry-check." The mapping is a
named concept; the resolver loop reads as the domain.

No behavior change.
Single-use one-liner that discarded its argument — the exception text
was already logged immediately above the call site by the N7 fix. The
function was a remnant from before that logging was added. Inline the
constant return at the call site; the line is gone.
Four parallel branches that each (a) chose a counter and (b) built a
response carried the same shape — "incStat <X> $> response (corrId,
NoEntity, <msg>)". Compute the (counter, msg) pair in one dispatch,
then increment + respond once. The post-processing on a sum value is
now in one place; adding a new ResolveError variant means changing one
mapping line, not four.

No behavior change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant