
feat: add simple retries for status requests#64

Closed
pboros wants to merge 1 commit into main from pboros/simple-retry-status

Conversation


@pboros pboros commented Jan 23, 2026

Add simple retries for getting the status, along with timeouts.

@rbarabas rbarabas requested a review from Copilot January 23, 2026 14:18

Copilot AI left a comment


Pull request overview

This PR adds retry logic and timeouts for status requests to improve reliability when fetching FoundationDB cluster status. The changes introduce timeout handling for metric collection and implement a simple retry mechanism for status queries.

Changes:

  • Added timeout and concurrency control to metric collection with a 2-second timeout
  • Implemented retry logic for status requests with a 1-second delay between attempts
  • Extracted status completeness check into a reusable function

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
metrics/metric_provider.go Added context-based timeout handling for metric collection ticks and reduced collection interval
db/conn.go Refactored status retrieval to support retries and extracted status validation into a helper function


ulog.E(ctx.Err(), "metric collection tick timed out")
}

cancel()

Copilot AI Jan 23, 2026


The context cancel function should be called immediately after the timeout context is created using defer to ensure cleanup happens regardless of the code path taken. Move defer cancel() to line 70, right after the context creation.

@@ -56,24 +57,42 @@ func (m *MetricProvider) Close() {

// Periodic data collection, called from main in a goroutine
func (mp *MetricProvider) Collect() {

Copilot AI Jan 23, 2026


The interval change from 10 seconds to 3 seconds is undocumented. Consider adding a comment explaining why this interval was reduced, especially since the original TODO suggested making it configurable.

Suggested change (the signature is unchanged; the suggestion adds an explanatory comment):

func (mp *MetricProvider) Collect() {
	// NOTE: interval was reduced from 10s to 3s to improve metric freshness for dashboards
	// and alerts while keeping collection overhead acceptable. This value may still be
	// made configurable in the future if more tuning is required.

Comment thread db/conn.go Outdated
Comment on lines +155 to +160
log.Error().Msg("failed to get status, retrying")
time.Sleep(1 * time.Second)
status, err = getStatusOnce()
if err != nil {
log.Error().Msg("failed to get status, retry failed")
}

Copilot AI Jan 23, 2026


The retry sleep duration and count are hardcoded. Consider making these configurable constants at the package level to allow easier adjustment of retry behavior without code changes.
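One way to lift the hardcoded values into package-level constants, as the comment suggests. All names and the stub `getStatusOnce` are illustrative, not taken from the PR (the delay is shortened here; the PR uses 1 second):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative package-level retry knobs, so the behavior can be tuned
// in one place without touching the retry loop itself.
const (
	statusRetries    = 1
	statusRetryDelay = 5 * time.Millisecond
)

// getStatusOnce is a stub for the real single status request.
func getStatusOnce() (int, error) { return 0, errors.New("unavailable") }

// getStatus retries up to statusRetries times with statusRetryDelay
// between attempts.
func getStatus() (int, error) {
	status, err := getStatusOnce()
	for i := 0; i < statusRetries && err != nil; i++ {
		time.Sleep(statusRetryDelay)
		status, err = getStatusOnce()
	}
	return status, err
}

func main() {
	_, err := getStatus()
	fmt.Println("final error:", err)
}
```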

Comment thread db/conn.go
Str("status_json", statusString).
Msg("status json is missing cluster fields")

return nil, fmt.Errorf("incomplete status")

Copilot AI Jan 23, 2026


This code path returns an error when DEBUG_LOG_INCOMPLETE_STATUS is enabled and the status is incomplete, but does not return an error when DEBUG_LOG_INCOMPLETE_STATUS is disabled. This creates inconsistent behavior where incomplete status is only treated as an error in debug mode. The error return should be moved outside the debug flag check to ensure incomplete status is always treated as an error, or the function should explicitly return the incomplete status without error when not in debug mode.
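A sketch of the first option the comment proposes: always treat incomplete status as an error, with the debug flag affecting only logging. `isStatusIncomplete` and `validateStatus` are hypothetical stand-ins for the PR's helpers:

```go
package main

import (
	"fmt"
	"os"
)

// isStatusIncomplete is a stand-in for the PR's completeness check;
// here it only looks for a top-level "cluster" field.
func isStatusIncomplete(status map[string]any) bool {
	_, ok := status["cluster"]
	return !ok
}

// validateStatus rejects incomplete status unconditionally; the
// DEBUG_LOG_INCOMPLETE_STATUS flag only controls the extra log line,
// so enabling it does not change functional behavior.
func validateStatus(status map[string]any, statusJSON string) error {
	if isStatusIncomplete(status) {
		if os.Getenv("DEBUG_LOG_INCOMPLETE_STATUS") != "" {
			fmt.Printf("status json is missing cluster fields: %s\n", statusJSON)
		}
		return fmt.Errorf("incomplete status")
	}
	return nil
}

func main() {
	fmt.Println(validateStatus(map[string]any{}, "{}"))
}
```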

@rbarabas

@greptileai

@greptile-apps

greptile-apps Bot commented Jan 23, 2026

Greptile Summary

This PR adds retry logic for FoundationDB status requests and timeout handling for metric collection. The GetStatus() function now retries once after a 1-second delay if the initial request fails. The metric collection goroutine adds a 2-second timeout with a mutex to protect reporter swaps. The collection interval was reduced from 10s to 3s.

Key changes:

  • Extracted incomplete status validation into IsStatusIncomplete() helper function
  • Added single retry with 1s delay in GetStatus() for failed status requests
  • Introduced context-based timeout (2s) for metric collection cycles
  • Added reporterSwap mutex to protect reporter field during swaps (but not during reads in ServeHTTP)
  • Changed collection interval from 10s to 3s

The changes address reliability issues with status fetching and metric collection, though several concurrency and timeout issues remain as noted in previous review comments.

Confidence Score: 2/5

  • This PR has critical concurrency issues that could cause race conditions and goroutine leaks in production
  • Score reflects multiple critical issues already identified in previous threads: race condition between ServeHTTP reads and Collect() writes to mp.reporter, timeout (2s) shorter than internal sleep (1s) making it ineffective, and goroutine leak when collectOnce() errors occur. The retry logic in db/conn.go is straightforward and safe, but the metric_provider.go changes introduce concurrency problems that need resolution before merge.
  • metrics/metric_provider.go requires significant attention to resolve race condition, timeout logic, and goroutine leak issues

Important Files Changed

Filename Overview
db/conn.go Added retry logic for status requests with 1s delay between attempts; extracted incomplete status check into IsStatusIncomplete() helper
metrics/metric_provider.go Added timeout (2s) and mutex protection for reporter swaps; changed collection interval from 10s to 3s; introduced goroutine with channel-based timeout handling


@greptile-apps greptile-apps Bot left a comment


2 files reviewed, 5 comments


Comment thread metrics/metric_provider.go, lines +84 to +85
oldReporter := mp.reporter
mp.reporter = newReporter


logic: race condition: mp.reporter can be read by HTTP requests in ServeHTTP while being written here

The goroutine updates mp.reporter without synchronization. If an HTTP request calls ServeHTTP (line 49) at the same time, it reads mp.reporter.reporter.HTTPHandler() leading to a data race.

Suggested change

Before:
    oldReporter := mp.reporter
    mp.reporter = newReporter

After:
    mp.mu.Lock()
    oldReporter := mp.reporter
    mp.reporter = newReporter
    mp.mu.Unlock()

Add a sync.RWMutex field to MetricProvider and protect both reads (in ServeHTTP) and writes (here). Is there a reason you avoided using a mutex here, or was this an oversight?
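A minimal sketch of the suggested `sync.RWMutex` approach; a `string` stands in for the real reporter type, and the method names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// MetricProvider holds a reporter guarded by an RWMutex, so concurrent
// HTTP reads and Collect-goroutine swaps do not race.
type MetricProvider struct {
	mu       sync.RWMutex
	reporter string // stands in for the real reporter type
}

// swapReporter is the write path (the Collect goroutine).
func (mp *MetricProvider) swapReporter(newReporter string) string {
	mp.mu.Lock()
	defer mp.mu.Unlock()
	old := mp.reporter
	mp.reporter = newReporter
	return old
}

// currentReporter is the read path (what ServeHTTP would use).
func (mp *MetricProvider) currentReporter() string {
	mp.mu.RLock()
	defer mp.mu.RUnlock()
	return mp.reporter
}

func main() {
	mp := &MetricProvider{reporter: "old"}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); mp.swapReporter("new") }()
	go func() { defer wg.Done(); _ = mp.currentReporter() }()
	wg.Wait()
	fmt.Println(mp.currentReporter())
}
```

Running this under `go run -race` exercises both paths concurrently without a data-race report, which is the property the review comment asks for.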


Comment thread metrics/metric_provider.go

ulog.E(err, "failed to collect metrics in a tick")
}
time.Sleep(1 * time.Second) // Wait a bit before serving new tally's data (otherwise the first query will return 0)
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)


logic: timeout (2s) shorter than the sleep inside goroutine (1s at line 82)

If collectOnce() takes >1s, the goroutine will always timeout since it sleeps for 1s after collection. This defeats the purpose of the timeout. Consider using a timeout of at least 2.5-3s, or removing the 1s sleep and handling reporter swapping differently.


Comment thread metrics/metric_provider.go, lines +78 to +79
ulog.E(err, "failed to collect metrics in a tick")
return


logic: return on error leaks the goroutine

When collectOnce() fails, the goroutine returns early without closing done channel or swapping reporters. Meanwhile, the timeout still fires after 2s, and the cancel() is still called. The old reporter continues to be used, and errors will repeat on every tick.

Suggested change

Before:
    ulog.E(err, "failed to collect metrics in a tick")
    return

After:
    if err := newReporter.collectOnce(); err != nil {
        ulog.E(err, "failed to collect metrics in a tick")
        newReporter.Close()
        return
    }

Close the new reporter on error to avoid leaking resources.


Comment thread db/conn.go Outdated
func GetStatus() (*models.FullStatus, error) {
status, err := getStatusOnce()
if err != nil {
log.Error().Msg("failed to get status, retrying")


style: log message doesn't include error details

Should use log.Error().Err(err).Msg(...) to include the actual error in the log output for debugging.


Comment thread db/conn.go Outdated
time.Sleep(1 * time.Second)
status, err = getStatusOnce()
if err != nil {
log.Error().Msg("failed to get status, retry failed")


style: log message doesn't include error details

Should use log.Error().Err(err).Msg(...) to include the actual error in the log output for debugging.


@pboros pboros force-pushed the pboros/simple-retry-status branch from f5fa4e5 to 26b4438 Compare January 23, 2026 14:30
@pboros pboros force-pushed the pboros/simple-retry-status branch from 26b4438 to b391536 Compare January 23, 2026 14:33

pboros commented Jan 23, 2026

@greptileai


@greptile-apps greptile-apps Bot left a comment


2 files reviewed, 1 comment


Comment thread db/conn.go
Str("status_json", statusString).
Msg("status json is missing cluster fields")

return nil, fmt.Errorf("incomplete status")


logic: returning error only when DEBUG_LOG_INCOMPLETE_STATUS=true creates inconsistent behavior

When the env var is not set, incomplete status is returned as success. When set, it triggers retry. This means production (without the debug flag) silently accepts incomplete status while debug mode rejects it.

Consider either:

  • Always returning error for incomplete status (and remove the env var check)
  • Or never returning error (just log it)

The current behavior makes debugging harder since enabling the debug flag changes the functional behavior, not just logging. Is the intent to eventually always reject incomplete status, or is this env var a temporary debugging aid that should only affect logging?


@pboros pboros closed this Jan 23, 2026
3 participants