Conversation
Pull request overview
This PR adds retry logic and timeouts for status requests to improve reliability when fetching FoundationDB cluster status. The changes introduce timeout handling for metric collection and implement a simple retry mechanism for status queries.
Changes:
- Added timeout and concurrency control to metric collection with a 2-second timeout
- Implemented retry logic for status requests with a 1-second delay between attempts
- Extracted status completeness check into a reusable function
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| metrics/metric_provider.go | Added context-based timeout handling for metric collection ticks and reduced collection interval |
| db/conn.go | Refactored status retrieval to support retries and extracted status validation into a helper function |
```go
	ulog.E(ctx.Err(), "metric collection tick timed out")
}

cancel()
```
The context cancel function should be called immediately after the timeout context is created using defer to ensure cleanup happens regardless of the code path taken. Move defer cancel() to line 70, right after the context creation.
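For reference, a minimal sketch of that ordering (the timeout value is the one quoted above):

```go
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel() // runs on every path out of this scope, including the timeout branch
```

Note that if this code sits inside a long-lived loop, the deferred cancel only fires when Collect returns, so wrapping each tick in its own small function may be needed for the defer to run once per tick.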
```diff
@@ -56,24 +57,42 @@ func (m *MetricProvider) Close() {

 // Periodic data collection, called from main in a goroutine
 func (mp *MetricProvider) Collect() {
```
The interval change from 10 seconds to 3 seconds is undocumented. Consider adding a comment explaining why this interval was reduced, especially since the original TODO suggested making it configurable.
```suggestion
func (mp *MetricProvider) Collect() {
	// NOTE: interval was reduced from 10s to 3s to improve metric freshness for dashboards
	// and alerts while keeping collection overhead acceptable. This value may still be
	// made configurable in the future if more tuning is required.
```
```go
log.Error().Msg("failed to get status, retrying")
time.Sleep(1 * time.Second)
status, err = getStatusOnce()
if err != nil {
	log.Error().Msg("failed to get status, retry failed")
}
```
The retry sleep duration and count are hardcoded. Consider making these configurable constants at the package level to allow easier adjustment of retry behavior without code changes.
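A possible shape for this, sketched against the identifiers already visible in this diff (the constant names are hypothetical; getStatusOnce and models.FullStatus are the existing ones):

```go
// Hypothetical package-level knobs for the retry behavior in db/conn.go.
const (
	statusRetryAttempts = 2               // total attempts, including the first
	statusRetryDelay    = 1 * time.Second // pause between attempts
)

func GetStatus() (*models.FullStatus, error) {
	var status *models.FullStatus
	var err error
	for attempt := 1; attempt <= statusRetryAttempts; attempt++ {
		status, err = getStatusOnce()
		if err == nil {
			return status, nil
		}
		log.Error().Err(err).Int("attempt", attempt).Msg("failed to get status")
		if attempt < statusRetryAttempts {
			time.Sleep(statusRetryDelay)
		}
	}
	return nil, err
}
```

This also folds the single hardcoded retry into a loop, so raising the attempt count later is a one-line change.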
| Str("status_json", statusString). | ||
| Msg("status json is missing cluster fields") | ||
|
|
||
| return nil, fmt.Errorf("incomplete status") |
This code path returns an error when DEBUG_LOG_INCOMPLETE_STATUS is enabled and the status is incomplete, but does not return an error when DEBUG_LOG_INCOMPLETE_STATUS is disabled. This creates inconsistent behavior where incomplete status is only treated as an error in debug mode. The error return should be moved outside the debug flag check to ensure incomplete status is always treated as an error, or the function should explicitly return the incomplete status without error when not in debug mode.
Greptile Summary

This PR adds retry logic for FoundationDB status requests and timeout handling for metric collection.

Key changes:

The changes address reliability issues with status fetching and metric collection, though several concurrency and timeout issues remain as noted in previous review comments.

Confidence Score: 2/5
Important Files Changed
```go
oldReporter := mp.reporter
mp.reporter = newReporter
```
logic: race condition: mp.reporter can be read by HTTP requests in ServeHTTP while being written here
The goroutine updates mp.reporter without synchronization. If an HTTP request calls ServeHTTP (line 49) at the same time, it reads mp.reporter.reporter.HTTPHandler() leading to a data race.
```suggestion
mp.mu.Lock()
oldReporter := mp.reporter
mp.reporter = newReporter
mp.mu.Unlock()
```
Add a sync.RWMutex field to MetricProvider and protect both reads (in ServeHTTP) and writes (here). Is there a reason you avoided using a mutex here, or was this an oversight?
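A self-contained sketch of that locking (all type and field names here are stand-ins mirroring the calls quoted in this review, not the PR's actual definitions):

```go
package metrics

import (
	"net/http"
	"sync"
)

// tallyReporter stands in for the PR's reporter wrapper; the nested reporter
// field mirrors the mp.reporter.reporter.HTTPHandler() call quoted above.
type tallyReporter struct {
	reporter interface{ HTTPHandler() http.Handler }
}

func (t *tallyReporter) Close() {}

type MetricProvider struct {
	mu       sync.RWMutex
	reporter *tallyReporter // guarded by mu
}

// swapReporter is the write side, called from the Collect goroutine.
func (mp *MetricProvider) swapReporter(newReporter *tallyReporter) {
	mp.mu.Lock()
	oldReporter := mp.reporter
	mp.reporter = newReporter
	mp.mu.Unlock()
	if oldReporter != nil {
		oldReporter.Close() // close the old reporter outside the lock
	}
}

// ServeHTTP is the read side; the read lock is held only while grabbing the handler.
func (mp *MetricProvider) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	mp.mu.RLock()
	h := mp.reporter.reporter.HTTPHandler()
	mp.mu.RUnlock()
	h.ServeHTTP(w, r)
}
```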
```go
	ulog.E(err, "failed to collect metrics in a tick")
}
time.Sleep(1 * time.Second) // Wait a bit before serving new tally's data (otherwise the first query will return 0)
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
```
logic: the 2s timeout also has to cover the 1s sleep inside the goroutine (line 82)
If collectOnce() takes >1s, the goroutine will always timeout since it sleeps for 1s after collection. This defeats the purpose of the timeout. Consider using a timeout of at least 2.5-3s, or removing the 1s sleep and handling reporter swapping differently.
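A sketch of the second option, with the 1s warm-up wait and the reporter swap moved outside the timed section; collectOnce is passed in as a parameter here only to keep the example self-contained, and the structure is an assumption rather than the PR's actual code:

```go
package metrics

import (
	"context"
	"log"
	"time"
)

// collectTick gives collectOnce the full 2-second budget and performs the
// warm-up wait only after collection has finished, outside the deadline.
func collectTick(collectOnce func() error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := collectOnce(); err != nil {
			log.Printf("failed to collect metrics in a tick: %v", err)
		}
	}()

	select {
	case <-done:
		// Wait a bit before serving the new tally data (the first query would return 0).
		time.Sleep(1 * time.Second)
		// swap reporters here
	case <-ctx.Done():
		log.Printf("metric collection tick timed out: %v", ctx.Err())
	}
}
```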
```go
ulog.E(err, "failed to collect metrics in a tick")
return
```
logic: early return on error leaks the new reporter and never closes done
When collectOnce() fails, the goroutine returns early without closing done channel or swapping reporters. Meanwhile, the timeout still fires after 2s, and the cancel() is still called. The old reporter continues to be used, and errors will repeat on every tick.
```suggestion
if err := newReporter.collectOnce(); err != nil {
	ulog.E(err, "failed to collect metrics in a tick")
	newReporter.Close()
	return
}
```
Close the new reporter on error to avoid leaking resources.
```go
func GetStatus() (*models.FullStatus, error) {
	status, err := getStatusOnce()
	if err != nil {
		log.Error().Msg("failed to get status, retrying")
```
style: log message doesn't include error details
Should use log.Error().Err(err).Msg(...) to include the actual error in the log output for debugging.
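For example, keeping the existing message:

```go
log.Error().Err(err).Msg("failed to get status, retrying")
```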
```go
		time.Sleep(1 * time.Second)
		status, err = getStatusOnce()
		if err != nil {
			log.Error().Msg("failed to get status, retry failed")
```
style: log message doesn't include error details
Should use log.Error().Err(err).Msg(...) to include the actual error in the log output for debugging.
Force-pushed f5fa4e5 to 26b4438, then 26b4438 to b391536.
| Str("status_json", statusString). | ||
| Msg("status json is missing cluster fields") | ||
|
|
||
| return nil, fmt.Errorf("incomplete status") |
logic: returning error only when DEBUG_LOG_INCOMPLETE_STATUS=true creates inconsistent behavior
When the env var is not set, incomplete status is returned as success. When set, it triggers retry. This means production (without the debug flag) silently accepts incomplete status while debug mode rejects it.
Consider either:
- Always returning error for incomplete status (and remove the env var check)
- Or never returning error (just log it)
The current behavior makes debugging harder since enabling the debug flag changes the functional behavior, not just logging. Is the intent to eventually always reject incomplete status, or is this env var a temporary debugging aid that should only affect logging?
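For illustration, the first option could look roughly like this inside the status check (the isComplete helper name is hypothetical, and how the env var is read is also an assumption):

```go
if !isComplete(status) {
	if os.Getenv("DEBUG_LOG_INCOMPLETE_STATUS") != "" {
		log.Error().
			Str("status_json", statusString).
			Msg("status json is missing cluster fields")
	}
	// Always reject an incomplete status so the caller's retry kicks in;
	// the env var only controls the extra (potentially large) log line.
	return nil, fmt.Errorf("incomplete status")
}
```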
Add simple retries for getting the status, and add timeouts.