[Bug] DNSCache never evicts unresolvable hostnames after a BE is dropped, causing be.WARNING flood and persistent brpc EPOLLOUT timeout #63358

@zhaorongsheng

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Doris BE 2.1.x.

What's Wrong?

After a group of BE nodes was permanently removed from the cluster (DROP BACKEND on the FE, the machines were shut down, and their DNS A/PTR records were deleted), every surviving BE in the same cluster keeps logging two kinds of WARNING forever:

Symptom A — DNSCache refresh thread floods be.WARNING

W network_util.cpp:115] failed to get ip from host: be-old-1.example.com err: Name or service not known
W status.h:415] meet error status: [INTERNAL_ERROR]failed to get ip from host: be-old-1.example.com, err: Name or service not known
0# doris::hostname_to_ipv4(...) at be/src/util/network_util.cpp:125
1# doris::hostname_to_ip(...) at be/src/util/network_util.cpp:104
2# doris::DNSCache::_update(...) at be/src/common/status.h:494
3# doris::DNSCache::_refresh_cache() at be/src/common/status.h:380

These lines repeat once per minute per stale hostname, indefinitely.

Symptom B — brpc keeps reconnecting to the cached (now unreachable) IP

W socket.cpp:1270] Fail to wait EPOLLOUT of fd=: Connection timed out [110]

In our case this fires ~4 times per second (~340K times per day), accumulating > 3.7M occurrences over 11 days. The IPs the BE keeps trying to reach are the last successfully resolved IPs of the dropped hostnames, served back by DNSCache::_resolve_hostname() after every refresh failure. A single BE's be.WARNING grew to 634 MB in 11 days, and the same growth occurs on every BE in the cluster.

Root cause

be/src/util/dns_cache.cpp (master HEAD, lines 57–121):

  • _refresh_cache() iterates every cached hostname every 60 s and calls _update.
  • _update → _resolve_hostname. On resolution failure, _resolve_hostname returns the stale cached IP so callers can keep using it. That is a reasonable graceful-degradation choice.
  • However, the entry is never removed from the cache map. There is no failure counter, no TTL, no eviction policy.
  • Consequence: as long as the BE process lives, the hostname is re-resolved (and fails again) once per minute, forever. BrpcClientCache / ClientCache keep handing the stale IP to brpc, and each connection attempt to it times out at the kernel level (ETIMEDOUT after tcp_syn_retries is exhausted, ~127 s).
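For reference, the ~127 s figure follows from the kernel's SYN retransmission backoff. The sketch below assumes Linux defaults (tcp_syn_retries = 6, initial retransmission timeout of 1 s, doubling after each attempt); the function name is hypothetical and the values on a given host may differ.

```cpp
// Approximate time before connect() fails with ETIMEDOUT when the peer
// never answers, assuming the timeout doubles after every SYN sent.
// Assumption: initial timeout of 1 s and tcp_syn_retries = 6 (Linux defaults).
int syn_timeout_seconds(int syn_retries, int initial_rto_seconds = 1) {
    int total = 0;
    int wait = initial_rto_seconds;
    // One wait after the initial SYN, plus one after each retransmission.
    for (int attempt = 0; attempt <= syn_retries; ++attempt) {
        total += wait;
        wait *= 2;
    }
    return total;
}
```

With the assumed defaults, syn_timeout_seconds(6) sums 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 s.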

What You Expected?

After a hostname has consistently failed to resolve for a configurable threshold (e.g. 30 consecutive refresh attempts = 30 minutes), the entry should be evicted from the cache. Subsequent callers would then either re-resolve successfully (if DNS comes back) or get a clean InternalError, rather than silently retrying a long-dead IP.
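A minimal sketch of the proposed behavior, not the actual DNSCache code: EvictingDnsCache, its Resolver callback, and max_consecutive_failures are hypothetical names introduced for illustration. It keeps the graceful degradation (the stale IP is still served while failures are below the threshold) and adds a per-entry consecutive-failure counter that triggers eviction.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the BE's DNS cache; names are illustrative only.
class EvictingDnsCache {
public:
    // Resolver writes the IP and returns true on success, false on failure.
    using Resolver = std::function<bool(const std::string& host, std::string* ip)>;

    EvictingDnsCache(Resolver resolver, int max_consecutive_failures)
            : _resolver(std::move(resolver)), _max_failures(max_consecutive_failures) {}

    // Serves the cached IP if present; otherwise resolves and caches it.
    // Returns false when the host is neither cached nor resolvable.
    bool get(const std::string& host, std::string* ip) {
        std::lock_guard<std::mutex> lock(_mu);
        auto it = _entries.find(host);
        if (it != _entries.end()) {
            *ip = it->second.ip;
            return true;
        }
        std::string resolved;
        if (!_resolver(host, &resolved)) return false;
        _entries[host] = Entry{resolved, 0};
        *ip = resolved;
        return true;
    }

    // One refresh pass (the issue describes one pass every 60 s).
    // For brevity the resolver runs under the lock; a real cache may not want that.
    void refresh() {
        std::lock_guard<std::mutex> lock(_mu);
        for (auto it = _entries.begin(); it != _entries.end();) {
            std::string resolved;
            if (_resolver(it->first, &resolved)) {
                it->second.ip = resolved;
                it->second.failures = 0;  // any success resets the counter
                ++it;
            } else if (++it->second.failures >= _max_failures) {
                it = _entries.erase(it);  // evict after N consecutive failures
            } else {
                ++it;  // below threshold: keep serving the stale IP
            }
        }
    }

    bool contains(const std::string& host) {
        std::lock_guard<std::mutex> lock(_mu);
        return _entries.count(host) > 0;
    }

private:
    struct Entry {
        std::string ip;
        int failures;
    };

    Resolver _resolver;
    int _max_failures;
    std::mutex _mu;
    std::unordered_map<std::string, Entry> _entries;
};
```

With a threshold of 30 and a 60 s refresh interval, an entry whose hostname no longer resolves would be dropped after about 30 minutes, after which get() fails cleanly instead of returning the dead IP. A single successful resolution resets the counter, so a transient DNS outage shorter than the threshold does not evict anything.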

How to Reproduce?

  1. Bring up a Doris cluster (≥ 2 BEs).
  2. Pick a hostname victim.example.com that points to a working BE. Issue queries / data ingestion that go through DNSCache::get (e.g. broker load, internal RPC) so the hostname enters the cache.
  3. Decommission and remove the BE: DROP BACKEND "victim.example.com:9050";
  4. Delete victim.example.com from DNS (or /etc/hosts).
  5. Observe be.WARNING on the other BEs. Within 1 minute the first "failed to get ip from host" line appears. It never goes away.

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
