
Ensure StateStore lock is acquired before updating RoomInfo#6478

Merged
poljar merged 16 commits into matrix-org:main from mgoldenberg:save-locked-state-store
May 6, 2026

Conversation

@mgoldenberg
Contributor

@mgoldenberg mgoldenberg commented Apr 21, 2026

Overview

Currently, updates to a RoomInfo object typically need to be applied in two places:

  1. An in-memory Room object via Room::set_room_info
  2. A persistent StateStore via StateStore::save_changes

It is generally expected that the caller acquire a lock - often referred to as a state store lock - to synchronize these updates, but this is not enforced in any systematic way and is easy to forget.

The primary purpose of this pull request is to provide a better - albeit imperfect - way of ensuring all RoomInfo updates are synchronized, while also allowing the RoomInfo object of a Room to be read at any time - even during an update (see #6384).

Changes

Synchronizing StateStore::save_changes

The primary change was to add a type, SaveLockedStateStore, that wraps an existing StateStore and ensures that calls to the underlying StateStore::save_changes are synchronized.

pub struct SaveLockedStateStore<T = Arc<DynStateStore>> {
    store: T,
    lock: Arc<Mutex<()>>,
}

This is accomplished by implementing StateStore for SaveLockedStateStore and ensuring that the lock is acquired before calling the underlying StateStore::save_changes.

impl<T: StateStore> StateStore for SaveLockedStateStore<T> {
    // -- snip --

    async fn save_changes(&self, changes: &StateChanges) -> Result<(), Self::Error> {
        let _guard = self.lock.lock().await;
        self.store.save_changes(changes).await
    }
}

Additionally, SaveLockedStateStore provides a means of acquiring the underlying lock and then making calls to StateStore::save_changes by providing the guard for that lock. This is enforced by checking that the provided guard references the same Mutex held by the SaveLockedStateStore.

The benefit is that the lock may be held for an extended period during which many operations take place to build up the StateChanges - e.g., while processing a sync response.

impl<T: StateStore> SaveLockedStateStore<T> {
    pub fn lock(&self) -> &Mutex<()> {
        self.lock.as_ref()
    }

    pub async fn save_changes_with_guard(
        &self,
        guard: &MutexGuard<'_, ()>,
        changes: &StateChanges,
    ) -> Result<(), StoreError> {
        if !std::ptr::eq(MutexGuard::mutex(guard), self.lock()) {
            Err(IncorrectMutexGuardError.into())
        } else {
            self.store.save_changes(changes).await.map_err(Into::into)
        }
    }
}
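
As a usage sketch (assuming the types above; StateChanges::default() stands in for whatever changes a caller would actually accumulate):

async fn persist_batch(store: &SaveLockedStateStore) -> Result<(), StoreError> {
    // Hold the lock across the whole batch of work...
    let guard = store.lock().lock().await;

    // ...build up the changes while no other writer can interleave...
    let changes = StateChanges::default();

    // ...then persist them, proving that we hold the correct lock.
    store.save_changes_with_guard(&guard, &changes).await
}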

Finally, SaveLockedStateStore replaces the underlying lock and store in BaseStateStore, and clones of it are provided to downstream structures.

Synchronizing RoomInfo updates to Room

The SaveLockedStateStore is cloned and provided to each Room created via the BaseStateStore. This allows a Room to synchronize its actions using the underlying lock in SaveLockedStateStore.

Furthermore, the previous functions on Room for making changes to RoomInfo are replaced with atomic versions: Room::update_room_info ensures that updates to the in-memory representation of the RoomInfo are atomic, and Room::update_and_save_room_info ensures that updates to both the in-memory and the persistent representations of RoomInfo are atomic. Both have versions which allow the caller to pass a guard for SaveLockedStateStore::lock, roughly as sketched below.
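
A rough sketch of the shape of the combined update (not the PR's exact code: the closure-based signature, and the use of clone_info/set_room_info as the in-memory accessors, are assumptions for illustration):

impl Room {
    pub async fn update_and_save_room_info<F>(&self, update: F) -> Result<(), StoreError>
    where
        F: FnOnce(&mut RoomInfo),
    {
        // Serialize against every other writer going through the store.
        let guard = self.store.lock().lock().await;

        // Apply the mutation to a copy of the in-memory RoomInfo.
        let mut info = self.clone_info();
        update(&mut info);

        // Persist first, proving to the store that we hold its lock...
        let mut changes = StateChanges::default();
        changes.add_room(info.clone());
        self.store.save_changes_with_guard(&guard, &changes).await?;

        // ...and only then publish the new in-memory value.
        self.set_room_info(info);
        Ok(())
    }
}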

Note that the proposal in #6384 suggested using a token issued by the state store lock to ensure updates to RoomInfo were synchronized. Instead, this has been accomplished by checking reference equality between the guard and the underlying mutex.

Considerations

While the abstractions outlined above offer some useful benefits, there are also some shortcomings to consider.

1. SaveLockedStateStore does not synchronize calls to all StateStore functions

While StateStore::save_changes is the primary function we care about, it's possible there are others that would be worth synchronizing. For instance, StateStore::remove_room, which modifies Room-related data.

2. SaveLockedStateStore does not provide atomic updates

While calls to StateStore::save_changes are synchronized, read-then-edit-then-write operations may overlap. For example, overlapping calls to StateStore::get_room_infos followed by StateStore::save_changes could overwrite one another's modifications (see the race sketch after this list).

3. It is easy to enter a deadlock

Because access is provided to the underlying lock in SaveLockedStateStore, it is a bit too easy to enter a deadlock. This can happen by acquiring the lock and then calling StateStore::save_changes rather than SaveLockedStateStore::save_changes_with_guard (see the deadlock sketch after this list).

An alternative here could be to make StateStore::save_changes simply try to acquire the lock and return an error if it fails.
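
To make the lost-update risk in consideration 2 concrete, here is an illustrative sketch (the exact get_room_infos signature and error types are assumptions, not this PR's code):

// Two tasks running this concurrently may both read the same snapshot
// before either saves. The lock serializes the two `save_changes` calls,
// but the second write silently overwrites the first task's modification.
async fn read_modify_write(store: &SaveLockedStateStore) -> Result<(), StoreError> {
    // Read...
    let mut infos = store.get_room_infos().await?;
    // ...modify (any edit to the in-memory copies)...
    let mut changes = StateChanges::default();
    for info in infos.drain(..) {
        changes.add_room(info);
    }
    // ...write: only this call is guarded, so the last writer wins.
    store.save_changes(&changes).await
}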
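
And the deadlock from consideration 3, sketched (the underlying Mutex is not reentrant, so the second acquisition never completes):

async fn deadlocks(store: &SaveLockedStateStore) -> Result<(), StoreError> {
    // First acquisition: we now hold the store's mutex.
    let _guard = store.lock().lock().await;

    // Wrong call: `save_changes` on the wrapper tries to take the same
    // mutex again and waits on ourselves forever. The correct call while
    // the guard is held is `save_changes_with_guard(&_guard, &changes)`.
    store.save_changes(&StateChanges::default()).await
}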

Questions

Currently, while executing a sync via SlidingSync::sync_once, a downstream function - i.e., SlidingSync::handle_response - acquires the underlying state store lock right before processing the response and holds it until it has finished processing the response. As of this pull request, the guard for that lock is also passed to many downstream functions so that updates to RoomInfo do not cause a deadlock.

let sync_response = {
    let _timer = timer!("response processor");

    let response_processor = {
        // Take the lock to synchronise accesses to the state store, to avoid concurrent
        // sliding syncs overwriting each other's room infos.
        let _state_store_lock = {
            let _timer = timer!("acquiring the `state_store_lock`");
            self.inner.client.base_client().state_store_lock().lock().await
        };

        let mut response_processor =
            SlidingSyncResponseProcessor::new(self.inner.client.clone());

        // Process thread subscriptions if they're available.
        //
        // It's important to do this *before* handling the room responses, so that
        // notifications can be properly generated based on the thread subscriptions,
        // for the events in threads we've subscribed to.
        if self.is_thread_subscriptions_enabled() {
            response_processor
                .handle_thread_subscriptions(
                    position.pos.as_deref(),
                    std::mem::take(
                        &mut sliding_sync_response.extensions.thread_subscriptions,
                    ),
                )
                .await?;
        }

        #[cfg(feature = "e2e-encryption")]
        if self.is_e2ee_enabled() {
            response_processor.handle_encryption(&sliding_sync_response.extensions).await?
        }

        // Only handle the room's subsection of the response, if this sliding sync was
        // configured to do so.
        if must_process_rooms_response {
            response_processor
                .handle_room_response(&sliding_sync_response, &requested_required_states)
                .await?;
        }

        response_processor
    };

    // Release the lock before calling event handlers
    response_processor.process_and_take_response().await?
};

On the other hand, while executing a sync via Client::sync_once, downstream functions - e.g., Client::process_sync - do not hold the state store lock for the duration of the sync. They simply acquire it at very particular points during the process - e.g., in Client::receive_sync_response_with_requested_required_states.

{
    let _state_store_lock = self.state_store_lock().lock().await;

    processors::changes::save_and_apply(
        context,
        &self.state_store,
        &self.ignore_user_list_changes,
        Some(response.next_batch.clone()),
    )
    .await?;
}

Should executing a sync through Client::sync_once mimic the behavior of SlidingSync::sync_once in this regard? It seems to me it should, but I may be overlooking a reason that Client::sync_once should not hold the lock for the duration of the response-processing logic.
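
If they should behave the same, the change would presumably amount to hoisting the lock acquisition so it brackets the whole processing phase, e.g. (a sketch only, not code proposed in this PR):

// Sketch: take the state store lock once, before processing begins, and
// keep it for the entire response-processing phase, mirroring
// SlidingSync::handle_response above.
let _state_store_lock = self.state_store_lock().lock().await;

processors::changes::save_and_apply(
    context,
    &self.state_store,
    &self.ignore_user_list_changes,
    Some(response.next_batch.clone()),
)
.await?;

// ... the rest of the sync-response processing, still under the lock ...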


Closes #6384.

  • I've documented the public API Changes in the appropriate CHANGELOG.md files.
  • This PR was made with the help of AI.

Signed-off-by: Michael Goldenberg m@mgoldenberg.net

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 87.74403% with 113 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.90%. Comparing base (d0cb664) to head (315fc51).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
crates/matrix-sdk-base/src/store/traits.rs 69.56% 90 Missing and 1 partial ⚠️
crates/matrix-sdk-base/src/room/room_info.rs 96.03% 0 Missing and 8 partials ⚠️
crates/matrix-sdk-base/src/client.rs 81.57% 0 Missing and 7 partials ⚠️
...matrix-sdk-base/src/response_processors/changes.rs 88.88% 0 Missing and 2 partials ⚠️
crates/matrix-sdk/src/room/mod.rs 83.33% 0 Missing and 2 partials ⚠️
crates/matrix-sdk-base/src/sliding_sync.rs 99.60% 0 Missing and 1 partial ⚠️
...es/matrix-sdk/src/event_cache/caches/room/state.rs 85.71% 0 Missing and 1 partial ⚠️
crates/matrix-sdk/src/sliding_sync/client.rs 97.05% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6478      +/-   ##
==========================================
- Coverage   89.95%   89.90%   -0.05%     
==========================================
  Files         381      381              
  Lines      106066   106780     +714     
  Branches   106066   106780     +714     
==========================================
+ Hits        95412    96001     +589     
- Misses       7007     7115     +108     
- Partials     3647     3664      +17     


@codspeed-hq

codspeed-hq Bot commented Apr 21, 2026

Merging this PR will not alter performance

✅ 50 untouched benchmarks


Comparing mgoldenberg:save-locked-state-store (315fc51) with main (d0cb664)


@mgoldenberg mgoldenberg force-pushed the save-locked-state-store branch from 66b178b to 2f89722 on April 21, 2026 at 21:26
@mgoldenberg mgoldenberg marked this pull request as ready for review April 22, 2026 14:10
@mgoldenberg mgoldenberg requested a review from a team as a code owner April 22, 2026 14:10
@mgoldenberg mgoldenberg requested review from stefanceriu and removed request for a team April 22, 2026 14:10
@stefanceriu stefanceriu requested review from poljar and removed request for stefanceriu April 22, 2026 15:14
@mgoldenberg mgoldenberg force-pushed the save-locked-state-store branch from 0db9ead to bdc3c56 on May 4, 2026 at 13:27
Contributor

@poljar poljar left a comment


Ok, this PR looked way scarier than it is. Most of the changes are mechanical because of the addition of another argument, namely the lock guard.

It's a bit annoying that we need to add some more implementations of the storage trait but I don't think I have a better idea for this.

I think this looks sensible and removes one of the footguns our codebase had; I can't spot anything I would object to, so I'm approving it.

Could you just resolve the merge conflicts?

@mgoldenberg
Contributor Author

Will do! And glad it was not so scary :)

Any thoughts, by the way, about the differences mentioned above between SlidingSync::sync_once and Client::sync_once? Or about the considerations outlined - in particular, how easy it is to deadlock?

mgoldenberg added 16 commits May 6, 2026 08:43
@mgoldenberg mgoldenberg force-pushed the save-locked-state-store branch from bdc3c56 to 315fc51 on May 6, 2026 at 12:45
@poljar
Contributor

poljar commented May 6, 2026

> For instance, StateStore::remove_room, which modifies Room-related data.

Hmm, this just deletes the room completely. It might be unlikely for this to clash with a sync, but I think it needs to be synchronized as well so it behaves correctly without edge cases.

> While calls to StateStore::save_changes are synchronized, read-then-edit-then-write operations may overlap. For example, overlapping calls to StateStore::get_room_infos followed by StateStore::save_changes could overwrite one another's modifications.

I think that should be fine, as we're only calling get_room_infos() once, when we instantiate the Client object.

> This can happen by acquiring the lock and then making a call to StateStore::save_changes rather than SaveLockedStateStore::save_changes_with_guard.

Yeah, but we can make a rule to avoid using StateStore::save_changes() directly; I don't think it'll be possible to remove the function from the public API, which would be the best approach.

> Should executing a sync through Client::sync_once mimic the behavior of SlidingSync::sync_once in this regard? It seems to me it should, but I may be overlooking a reason that Client::sync_once should not hold the lock for the duration of the response-processing logic.

Yes, this looks like a clear mistake.

@poljar
Contributor

poljar commented May 6, 2026

I'm going to merge this now. If you have the time, the remaining issues you raised would be neat to have fixed as well.

Otherwise please open an issue so someone else takes care of them.

@poljar poljar merged commit 99cfce8 into matrix-org:main May 6, 2026
54 checks passed

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RoomInfo modifications aren't protected by a lock in a systemic manner

2 participants