Search before asking
Version
Doris BE 2.1.x.
What's Wrong?
After a group of BE nodes was permanently removed from the cluster (DROP BACKEND on the FE, the machines were shut down, and their DNS A/PTR records were deleted), every surviving BE in the same cluster keeps logging two kinds of WARNING forever:
Symptom A — DNSCache refresh thread floods be.WARNING
W network_util.cpp:115] failed to get ip from host: be-old-1.example.com err: Name or service not known
W status.h:415] meet error status: [INTERNAL_ERROR]failed to get ip from host: be-old-1.example.com, err: Name or service not known
0# doris::hostname_to_ipv4(...) at be/src/util/network_util.cpp:125
1# doris::hostname_to_ip(...) at be/src/util/network_util.cpp:104
2# doris::DNSCache::_update(...) at be/src/common/status.h:494
3# doris::DNSCache::_refresh_cache() at be/src/common/status.h:380
These lines repeat once per minute per stale hostname, indefinitely.
Symptom B — brpc keeps reconnecting to the cached (now unreachable) IP
W socket.cpp:1270] Fail to wait EPOLLOUT of fd=: Connection timed out [110]
In our case this fires ~4 times per second (~340K times per day), accumulating > 3.7M occurrences over 11 days. The IPs the BE keeps trying to reach are the last successfully resolved IPs of the dropped hostnames, served back by DNSCache::_resolve_hostname() after every refresh failure. A single BE's be.WARNING grew to 634 MB in 11 days, and the same growth occurs on every BE in the cluster.
Root cause
be/src/util/dns_cache.cpp (master HEAD, lines 57–121):
- _refresh_cache() iterates every cached hostname every 60 s and calls _update.
- _update → _resolve_hostname. On resolution failure, _resolve_hostname returns the stale cached IP so callers can keep using it. That is a reasonable graceful-degradation choice.
- However, the entry is never removed from the cache map. There is no failure counter, no TTL, no eviction policy.
- Consequence: as long as the BE process lives, the hostname is re-resolved (and fails again) once per minute, forever. BrpcClientCache / ClientCache keep handing the stale IP to brpc, which keeps timing out at the kernel level (ETIMEDOUT after tcp_syn_retries, ~127 s).
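For reference, the ~127 s figure can be reproduced with a short calculation. This is a sketch that assumes Linux defaults (tcp_syn_retries = 6, an initial SYN retransmission timeout of 1 s, doubling after each attempt); the helper name `syn_timeout_seconds` is made up for illustration, and hosts with tuned sysctls will see a different value.

```cpp
#include <cstdint>

// Seconds a connection attempt waits before the kernel reports ETIMEDOUT,
// assuming the retransmission timeout starts at `initial_rto_s` and doubles
// after every unanswered SYN. `syn_retries` counts retransmissions only, so
// the total number of SYNs sent is syn_retries + 1.
constexpr std::uint64_t syn_timeout_seconds(unsigned syn_retries,
                                            std::uint64_t initial_rto_s = 1) {
    std::uint64_t total = 0;
    std::uint64_t rto = initial_rto_s;
    // One wait after the initial SYN, plus one wait after each retransmission.
    for (unsigned i = 0; i <= syn_retries; ++i) {
        total += rto;
        rto *= 2;
    }
    return total;
}

// 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 with the assumed defaults.
static_assert(syn_timeout_seconds(6) == 127, "default Linux SYN timeout");
```

Under these assumptions, every connection attempt to a dead IP can hold a socket for about two minutes before it fails.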
What You Expected?
After a hostname has consistently failed to resolve for a configurable threshold (e.g. 30 consecutive refresh attempts = 30 minutes), the entry should be evicted from the cache. Subsequent callers will either re-resolve successfully (if DNS comes back) or get a clean InternalError rather than silently retrying a long-dead IP.
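The expected behaviour could look roughly like the standalone sketch below. It is not the actual doris::DNSCache code: the class `EvictingDnsCache`, the injected `Resolver` callback, and the `max_consecutive_failures` parameter are all hypothetical names chosen for illustration, and the resolver is called while holding the lock only to keep the example short.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical model of a DNS cache that evicts entries after repeated
// refresh failures. Illustrative only; not the Doris implementation.
class EvictingDnsCache {
public:
    // Returns the resolved IP, or std::nullopt if resolution failed.
    using Resolver = std::function<std::optional<std::string>(const std::string&)>;

    EvictingDnsCache(Resolver resolver, int max_consecutive_failures)
            : _resolver(std::move(resolver)), _max_failures(max_consecutive_failures) {}

    // Serves the cached IP if present; otherwise resolves and caches on success.
    // A failed resolution on a cache miss is reported to the caller as nullopt.
    std::optional<std::string> get(const std::string& host) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _entries.find(host);
        if (it != _entries.end()) return it->second.ip;
        auto ip = _resolver(host);
        if (ip) _entries[host] = Entry{*ip, 0};
        return ip;
    }

    // One pass of the periodic refresh (every 60 s in the scenario above).
    void refresh() {
        std::lock_guard<std::mutex> lock(_mutex);
        for (auto it = _entries.begin(); it != _entries.end();) {
            auto ip = _resolver(it->first);
            if (ip) {
                it->second = Entry{*ip, 0};  // success resets the failure counter
                ++it;
            } else if (++it->second.failures >= _max_failures) {
                it = _entries.erase(it);     // threshold reached: evict the entry
            } else {
                ++it;                        // below threshold: keep the stale IP
            }
        }
    }

    bool contains(const std::string& host) {
        std::lock_guard<std::mutex> lock(_mutex);
        return _entries.count(host) > 0;
    }

private:
    struct Entry {
        std::string ip;
        int failures = 0;
    };

    Resolver _resolver;
    int _max_failures;
    std::mutex _mutex;
    std::map<std::string, Entry> _entries;
};
```

With a threshold of 30 and a 60 s refresh interval, a hostname that stops resolving would keep its stale IP for 30 minutes and then be dropped, after which `get` returns an error to the caller instead of the dead address.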
How to Reproduce?
- Bring up a Doris cluster (≥ 2 BEs).
- Pick a hostname victim.example.com that points to a working BE. Issue queries / data ingestion that go through DNSCache::get (e.g. broker load, internal RPC) so the hostname enters the cache.
- Decommission and remove the BE: DROP BACKEND "victim.example.com:9050";
- Delete victim.example.com from DNS (or /etc/hosts).
- Observe be.WARNING on the other BEs. Within 1 minute the first "failed to get ip from host" line appears, and it keeps recurring for as long as the BE process runs.
Anything Else?
No response
Are you willing to submit PR?
Code of Conduct