
Mount locks up on listing bucket #37

@tvogel

Description

This is a bit involved, but I found that, under certain circumstances, tigrisfs can lock up in a way that ls on the bucket hangs forever.

Please see this reproducer for a detailed description:

https://github.com/SupaAutoFly/tigrisfs-bug

Copying the README here:

Reproducer for tigrisfs locking up listing bucket

Instructions

  1. cp example.env .env and fill in the details of an editor Tigris access key for an empty bucket.
  2. Run docker-compose up --build to start the services (if you see duplicated output lines, run docker container prune as a workaround; see the corresponding docker issue).
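For orientation, the .env presumably carries entries along these lines (the variable names are my guesses based on the logs and standard AWS conventions; the authoritative list is in example.env in the reproducer repo):

```shell
# Hypothetical .env sketch; see example.env in the reproducer for the real keys.
AWS_ACCESS_KEY_ID=tid_yourAccessKeyId       # editor access key for the bucket
AWS_SECRET_ACCESS_KEY=yourSecretAccessKey   # matching secret
BUCKET_NAME=tv-mount-test                   # the (empty) bucket to mount
```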

Expected behavior

The borg-lock container mounts the Tigris bucket and creates a borg repository in it (if not already created). It then breaks any stale locks and creates a new lock. The lock is held for 20 seconds and then released. After that, the container shuts down.

In parallel, the borg-wait container first waits for the borg-lock container to init the repo and acquire the lock. It then waits for the lock to be released (for up to 60 minutes). While it is waiting, it prints an "alive" message with an incrementing number every 10 seconds and additionally runs ls /mnt/tigris.
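The waiting behavior described above can be sketched roughly like this (a minimal sketch: the mount path is from the reproducer, but the lock-file path and the file-based polling are my simplification; the actual script waits via borg itself):

```shell
#!/bin/sh
# Sketch of the borg-wait loop: poll until the borg lock disappears,
# printing a heartbeat and listing the mount on every iteration.
# (Hypothetical lock path; the real reproducer waits via borg, not a file.)
MOUNT=${MOUNT:-/mnt/tigris}
i=0
while [ -e "$MOUNT/repo/lock.exclusive" ]; do
    i=$((i + 1))
    echo "alive $i"
    ls "$MOUNT"    # the call that eventually hangs in this bug
    sleep 10
done
echo "lock gone after $i polls"
```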
Thus, the expected output is:

borg-lock-1  | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-lock-1  | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-lock-1  | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-lock-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-lock-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1  | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-wait-1  | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-wait-1  | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-wait-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-wait-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1  | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-wait-1  | Waiting for borg repo to exist and being locked
borg-wait-1  | Repo locked, now waiting to acquire borg lock
borg-lock-1  | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-lock-1  | Init borg repo
borg-lock-1  | A repository already exists at /mnt/tigris/repo.
borg-lock-1  | Acquire lock
borg-lock-1  | Lock acquired
borg-wait-1  | alive 1
borg-wait-1  | repo
borg-wait-1  | alive 2
borg-wait-1  | repo
borg-lock-1  | Released lock
borg-lock-1  | tigrisfs still running, waiting
borg-lock-1  | tigrisfs still running, waiting
borg-lock-1 exited with code 0
borg-wait-1  | Successfully acquired borg lock
borg-wait-1  | tigrisfs still running, waiting
borg-wait-1  | alive 3
borg-wait-1  | repo
borg-wait-1 exited with code 0

Actual behavior

There are two surprising deviations from the expected behavior:

  • Even though I am using --stat-cache-ttl 0, the borg-wait container does not see the released lock until about a minute later. It looks as if the directory is not re-checked immediately but served from a cache. But let's not get distracted, because the second issue is more severe:
  • While waiting for the lock to vanish, the borg-wait container's ls /mnt/tigris locks up at some point and never returns, i.e. the tigrisfs mount is no longer responsive.

The observed output is:

borg-lock-1  | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-lock-1  | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-lock-1  | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-lock-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-lock-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1  | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-wait-1  | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-wait-1  | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-wait-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-wait-1  | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1  | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-wait-1  | Waiting for borg repo to exist and being locked
borg-wait-1  | Repo locked, now waiting to acquire borg lock
borg-lock-1  | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-lock-1  | Init borg repo
borg-lock-1  | A repository already exists at /mnt/tigris/repo.
borg-lock-1  | Acquire lock
borg-lock-1  | Lock acquired
borg-wait-1  | alive 1
borg-wait-1  | repo
borg-wait-1  | alive 2
borg-wait-1  | repo
borg-lock-1  | Released lock
borg-lock-1  | tigrisfs still running, waiting
borg-lock-1  | tigrisfs still running, waiting
borg-lock-1 exited with code 0
borg-wait-1  | alive 3
borg-wait-1  | repo
borg-wait-1  | alive 4
borg-wait-1  | repo
borg-wait-1  | alive 5
borg-wait-1  | repo
borg-wait-1  | alive 6
borg-wait-1  | repo
borg-wait-1  | alive 7

Note the missing repo line after alive 7: the ls /mnt/tigris call has locked up.

At this point, you can also

$ docker compose exec borg-wait ash
/ # ls /mnt/tigris

and it will hang forever.
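A hands-off way to confirm the hang is to bound the ls with timeout(1), which exits with status 124 when it has to kill the command (MOUNT defaults to the reproducer's mount point):

```shell
#!/bin/sh
# Probe the mount without risking an indefinite hang in the probe itself.
MOUNT=${MOUNT:-/mnt/tigris}
if timeout 5 ls "$MOUNT" > /dev/null 2>&1; then
    state=responsive
elif [ $? -eq 124 ]; then
    state=hung      # timeout(1) had to kill ls: the mount is wedged
else
    state=absent    # mount point missing, or ls failed for another reason
fi
echo "mount state: $state"
```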

Diagnosis

Looking at the stack traces:

$ docker compose exec --privileged borg-wait ash
/ # cat /proc/$(pgrep ls)/stack
[<0>] request_wait_answer+0x10a/0x1e0 [fuse]
[<0>] __fuse_simple_request+0xd5/0x2b0 [fuse]
[<0>] fuse_readdir_uncached+0x12e/0x7f0 [fuse]
[<0>] iterate_dir+0xa7/0x270
[<0>] __x64_sys_getdents64+0x84/0x130
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/ # cat /proc/$(pgrep borg)/stack
[<0>] request_wait_answer+0x10a/0x1e0 [fuse]
[<0>] __fuse_simple_request+0xd5/0x2b0 [fuse]
[<0>] fuse_readdir_uncached+0x12e/0x7f0 [fuse]
[<0>] iterate_dir+0xa7/0x270
[<0>] __x64_sys_getdents64+0x84/0x130
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

This shows that both ls and borg are stuck in fuse_readdir_uncached, waiting for a response from the FUSE filesystem.

The stack traces of the tigrisfs threads are:

~ # for task in /proc/$(pgrep tigrisfs)/task/*; do echo $task; cat $task/stack; echo; done
/proc/22/task/22
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Tasks 23-30, 32-33, 35-39, and 54 all show the same futex_wait stack as task 22 above (snipped here for brevity). The only two threads not parked on a futex are:

/proc/22/task/34
[<0>] do_epoll_wait+0x4ba/0x4f0
[<0>] do_compat_epoll_pwait.part.0+0xe/0x70
[<0>] __x64_sys_epoll_pwait+0x7e/0x130
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

/proc/22/task/40
[<0>] fuse_dev_do_read+0xa51/0xab0 [fuse]
[<0>] fuse_dev_read+0x5c/0x90 [fuse]
[<0>] vfs_read+0x271/0x350
[<0>] ksys_read+0x6a/0xe0
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Conclusion

This looks like a deadlock or a stale lock inside tigrisfs: the kernel side is stuck in fuse_readdir_uncached waiting for a reply, while on the tigrisfs side one thread is waiting for new requests on /dev/fuse and all the others are parked on futexes, which suggests the pending readdir request was handed to userspace but its reply never came.
