
feat: add simple retries for status requests#64

Closed
pboros wants to merge 1 commit into main from pboros/simple-retry-status

Conversation


@pboros pboros commented Jan 23, 2026

Add simple retries for getting the status, along with timeouts.

@rbarabas rbarabas requested a review from Copilot January 23, 2026 14:18

Copilot AI left a comment


Pull request overview

This PR adds retry logic and timeouts for status requests to improve reliability when fetching FoundationDB cluster status. The changes introduce timeout handling for metric collection and implement a simple retry mechanism for status queries.

Changes:

  • Added timeout and concurrency control to metric collection with a 2-second timeout
  • Implemented retry logic for status requests with a 1-second delay between attempts
  • Extracted status completeness check into a reusable function

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
metrics/metric_provider.go Added context-based timeout handling for metric collection ticks and reduced collection interval
db/conn.go Refactored status retrieval to support retries and extracted status validation into a helper function


ulog.E(ctx.Err(), "metric collection tick timed out")
}

cancel()

Copilot AI Jan 23, 2026


The context cancel function should be called immediately after the timeout context is created using defer to ensure cleanup happens regardless of the code path taken. Move defer cancel() to line 70, right after the context creation.

@@ -56,24 +57,42 @@ func (m *MetricProvider) Close() {

// Periodic data collection, called from main in a goroutine
func (mp *MetricProvider) Collect() {

Copilot AI Jan 23, 2026


The interval change from 10 seconds to 3 seconds is undocumented. Consider adding a comment explaining why this interval was reduced, especially since the original TODO suggested making it configurable.

Suggested change (the signature is unchanged; the suggestion adds an explanatory comment):

func (mp *MetricProvider) Collect() {
	// NOTE: interval was reduced from 10s to 3s to improve metric freshness for dashboards
	// and alerts while keeping collection overhead acceptable. This value may still be
	// made configurable in the future if more tuning is required.

Comment thread db/conn.go Outdated
Comment on lines +155 to +160
log.Error().Msg("failed to get status, retrying")
time.Sleep(1 * time.Second)
status, err = getStatusOnce()
if err != nil {
log.Error().Msg("failed to get status, retry failed")
}

Copilot AI Jan 23, 2026


The retry sleep duration and count are hardcoded. Consider making these configurable constants at the package level to allow easier adjustment of retry behavior without code changes.
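One way to lift the hardcoded values into package-level constants, as the comment suggests. All names and the stub `getStatusOnce` are illustrative, not taken from the PR (the delay is shortened here; the PR uses 1 second):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative package-level retry knobs, so the behavior can be tuned
// in one place without touching the retry loop itself.
const (
	statusRetries    = 1
	statusRetryDelay = 5 * time.Millisecond
)

// getStatusOnce is a stub for the real single status request.
func getStatusOnce() (int, error) { return 0, errors.New("unavailable") }

// getStatus retries up to statusRetries times with statusRetryDelay
// between attempts.
func getStatus() (int, error) {
	status, err := getStatusOnce()
	for i := 0; i < statusRetries && err != nil; i++ {
		time.Sleep(statusRetryDelay)
		status, err = getStatusOnce()
	}
	return status, err
}

func main() {
	_, err := getStatus()
	fmt.Println("final error:", err)
}
```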

Comment thread db/conn.go
Str("status_json", statusString).
Msg("status json is missing cluster fields")

return nil, fmt.Errorf("incomplete status")

Copilot AI Jan 23, 2026


This code path returns an error when DEBUG_LOG_INCOMPLETE_STATUS is enabled and the status is incomplete, but does not return an error when DEBUG_LOG_INCOMPLETE_STATUS is disabled. This creates inconsistent behavior where incomplete status is only treated as an error in debug mode. The error return should be moved outside the debug flag check to ensure incomplete status is always treated as an error, or the function should explicitly return the incomplete status without error when not in debug mode.
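A sketch of the first option the comment proposes: always treat incomplete status as an error, with the debug flag affecting only logging. `isStatusIncomplete` and `validateStatus` are hypothetical stand-ins for the PR's helpers:

```go
package main

import (
	"fmt"
	"os"
)

// isStatusIncomplete is a stand-in for the PR's completeness check;
// here it only looks for a top-level "cluster" field.
func isStatusIncomplete(status map[string]any) bool {
	_, ok := status["cluster"]
	return !ok
}

// validateStatus rejects incomplete status unconditionally; the
// DEBUG_LOG_INCOMPLETE_STATUS flag only controls the extra log line,
// so enabling it does not change functional behavior.
func validateStatus(status map[string]any, statusJSON string) error {
	if isStatusIncomplete(status) {
		if os.Getenv("DEBUG_LOG_INCOMPLETE_STATUS") != "" {
			fmt.Printf("status json is missing cluster fields: %s\n", statusJSON)
		}
		return fmt.Errorf("incomplete status")
	}
	return nil
}

func main() {
	fmt.Println(validateStatus(map[string]any{}, "{}"))
}
```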

@rbarabas

@greptileai

@greptile-apps

greptile-apps Bot commented Jan 23, 2026

Greptile Summary

This PR adds retry logic for FoundationDB status requests and timeout handling for metric collection. The GetStatus() function now retries once after a 1-second delay if the initial request fails. The metric collection goroutine adds a 2-second timeout with a mutex to protect reporter swaps. The collection interval was reduced from 10s to 3s.

Key changes:

  • Extracted incomplete status validation into IsStatusIncomplete() helper function
  • Added single retry with 1s delay in GetStatus() for failed status requests
  • Introduced context-based timeout (2s) for metric collection cycles
  • Added reporterSwap mutex to protect reporter field during swaps (but not during reads in ServeHTTP)
  • Changed collection interval from 10s to 3s

The changes address reliability issues with status fetching and metric collection, though several concurrency and timeout issues remain as noted in previous review comments.

Confidence Score: 2/5

  • This PR has critical concurrency issues that could cause race conditions and goroutine leaks in production
  • Score reflects multiple critical issues already identified in previous threads: race condition between ServeHTTP reads and Collect() writes to mp.reporter, timeout (2s) shorter than internal sleep (1s) making it ineffective, and goroutine leak when collectOnce() errors occur. The retry logic in db/conn.go is straightforward and safe, but the metric_provider.go changes introduce concurrency problems that need resolution before merge.
  • metrics/metric_provider.go requires significant attention to resolve race condition, timeout logic, and goroutine leak issues

Important Files Changed

Filename Overview
db/conn.go Added retry logic for status requests with 1s delay between attempts; extracted incomplete status check into IsStatusIncomplete() helper
metrics/metric_provider.go Added timeout (2s) and mutex protection for reporter swaps; changed collection interval from 10s to 3s; introduced goroutine with channel-based timeout handling


@greptile-apps greptile-apps Bot left a comment


2 files reviewed, 5 comments


Comment thread metrics/metric_provider.go, lines +84 to +85
oldReporter := mp.reporter
mp.reporter = newReporter


logic: race condition: mp.reporter can be read by HTTP requests in ServeHTTP while being written here

The goroutine updates mp.reporter without synchronization. If an HTTP request calls ServeHTTP (line 49) at the same time, it reads mp.reporter.reporter.HTTPHandler() leading to a data race.

Suggested change

Before:
    oldReporter := mp.reporter
    mp.reporter = newReporter

After:
    mp.mu.Lock()
    oldReporter := mp.reporter
    mp.reporter = newReporter
    mp.mu.Unlock()

Add a sync.RWMutex field to MetricProvider and protect both reads (in ServeHTTP) and writes (here). Is there a reason you avoided using a mutex here, or was this an oversight?
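A minimal sketch of the suggested `sync.RWMutex` approach; a `string` stands in for the real reporter type, and the method names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// MetricProvider holds a reporter guarded by an RWMutex, so concurrent
// HTTP reads and Collect-goroutine swaps do not race.
type MetricProvider struct {
	mu       sync.RWMutex
	reporter string // stands in for the real reporter type
}

// swapReporter is the write path (the Collect goroutine).
func (mp *MetricProvider) swapReporter(newReporter string) string {
	mp.mu.Lock()
	defer mp.mu.Unlock()
	old := mp.reporter
	mp.reporter = newReporter
	return old
}

// currentReporter is the read path (what ServeHTTP would use).
func (mp *MetricProvider) currentReporter() string {
	mp.mu.RLock()
	defer mp.mu.RUnlock()
	return mp.reporter
}

func main() {
	mp := &MetricProvider{reporter: "old"}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); mp.swapReporter("new") }()
	go func() { defer wg.Done(); _ = mp.currentReporter() }()
	wg.Wait()
	fmt.Println(mp.currentReporter())
}
```

Running this under `go run -race` exercises both paths concurrently without a data-race report, which is the property the review comment asks for.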


Comment thread metrics/metric_provider.go

ulog.E(err, "failed to collect metrics in a tick")
}
time.Sleep(1 * time.Second) // Wait a bit before serving new tally's data (otherwise the first query will return 0)
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)


logic: timeout (2s) shorter than the sleep inside goroutine (1s at line 82)

If collectOnce() takes >1s, the goroutine will always timeout since it sleeps for 1s after collection. This defeats the purpose of the timeout. Consider using a timeout of at least 2.5-3s, or removing the 1s sleep and handling reporter swapping differently.


Comment thread metrics/metric_provider.go, lines +78 to +79
ulog.E(err, "failed to collect metrics in a tick")
return


logic: return on error leaks the goroutine

When collectOnce() fails, the goroutine returns early without closing done channel or swapping reporters. Meanwhile, the timeout still fires after 2s, and the cancel() is still called. The old reporter continues to be used, and errors will repeat on every tick.

Suggested change

Before:
    ulog.E(err, "failed to collect metrics in a tick")
    return

After:
    if err := newReporter.collectOnce(); err != nil {
        ulog.E(err, "failed to collect metrics in a tick")
        newReporter.Close()
        return
    }

Close the new reporter on error to avoid leaking resources.


Comment thread db/conn.go Outdated
func GetStatus() (*models.FullStatus, error) {
status, err := getStatusOnce()
if err != nil {
log.Error().Msg("failed to get status, retrying")


style: log message doesn't include error details

Should use log.Error().Err(err).Msg(...) to include the actual error in the log output for debugging.


Comment thread db/conn.go Outdated
time.Sleep(1 * time.Second)
status, err = getStatusOnce()
if err != nil {
log.Error().Msg("failed to get status, retry failed")


style: log message doesn't include error details

Should use log.Error().Err(err).Msg(...) to include the actual error in the log output for debugging.


@pboros pboros force-pushed the pboros/simple-retry-status branch from f5fa4e5 to 26b4438 Compare January 23, 2026 14:30
@pboros pboros force-pushed the pboros/simple-retry-status branch from 26b4438 to b391536 Compare January 23, 2026 14:33

pboros commented Jan 23, 2026

@greptileai


@greptile-apps greptile-apps Bot left a comment


2 files reviewed, 1 comment


Comment thread db/conn.go
Str("status_json", statusString).
Msg("status json is missing cluster fields")

return nil, fmt.Errorf("incomplete status")


logic: returning error only when DEBUG_LOG_INCOMPLETE_STATUS=true creates inconsistent behavior

When the env var is not set, incomplete status is returned as success. When set, it triggers retry. This means production (without the debug flag) silently accepts incomplete status while debug mode rejects it.

Consider either:

  • Always returning error for incomplete status (and remove the env var check)
  • Or never returning error (just log it)

The current behavior makes debugging harder since enabling the debug flag changes the functional behavior, not just logging. Is the intent to eventually always reject incomplete status, or is this env var a temporary debugging aid that should only affect logging?


@pboros pboros closed this Jan 23, 2026
3 participants