This is a bit involved, but I found that under certain circumstances, tigrisfs can lock up in a way that `ls` on a bucket will hang forever.
Please see this reproducer for a detailed description:
https://github.com/SupaAutoFly/tigrisfs-bug
Copying the `README` here:
Reproducer for tigrisfs locking up listing bucket
Instructions
- `cp example.env .env` and fill in the details of an editor Tigris access key to an empty bucket (a rough sketch of what such a file might contain follows after this list).
- Run `docker-compose up --build` to start the services (if you receive duplicated output lines, run `docker container prune` as a workaround, see docker issue).
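For orientation, here is a rough sketch of what such an `.env` might contain. All variable names other than `AWS_ACCESS_KEY_ID` (which appears in the tigrisfs logs below) are assumptions on my part; the authoritative template is `example.env` in the linked repo.

```sh
# Hypothetical .env contents -- names other than AWS_ACCESS_KEY_ID are assumptions.
AWS_ACCESS_KEY_ID=tid_...       # Tigris access key with editor permissions
AWS_SECRET_ACCESS_KEY=...       # matching secret key
BUCKET_NAME=tv-mount-test       # the (empty) bucket to mount
```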
Expected behavior
The `borg-lock` container mounts the Tigris bucket and creates a borg repository in it (if not already created). It then breaks any stale locks and creates a new lock. The lock is held for 20 seconds and then released. After that, the container shuts down.
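In shell terms, the `borg-lock` side does roughly the following. This is only a sketch under the assumption that plain borg CLI commands are used; the actual entrypoint script (and the exact tigrisfs flags) are in the linked repo.

```sh
# Sketch of the borg-lock container (assumed commands; see the repo for the real script).
tigrisfs --stat-cache-ttl 0 tv-mount-test /mnt/tigris   # mount the bucket (flag discussed below)
export BORG_REPO=/mnt/tigris/repo
borg init --encryption=none "$BORG_REPO" || true        # create the repo if it does not exist yet
borg break-lock "$BORG_REPO"                            # break any stale locks
borg with-lock "$BORG_REPO" sleep 20                    # hold the lock for 20 seconds, then release
# afterwards the mount is torn down and the container exits
```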
In parallel, the `borg-wait` container first waits for the `borg-lock` container to init the repo and acquire the lock. It then waits for the lock (for up to 60 minutes). While it is waiting, it prints an "alive" message with an incrementing number every 10 seconds. In addition, it calls `ls /mnt/tigris`.
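Similarly, the `borg-wait` side roughly does the following (again just a sketch; `--lock-wait` is my assumption for how the script waits on the borg lock):

```sh
# Sketch of the borg-wait container's wait phase (assumed commands).
export BORG_REPO=/mnt/tigris/repo
# heartbeat: print a counter and list the mount every 10 seconds
( i=0; while true; do i=$((i+1)); echo "alive $i"; ls /mnt/tigris; sleep 10; done ) &
# block until the borg lock can be acquired, for up to 60 minutes
borg with-lock --lock-wait 3600 "$BORG_REPO" true && echo "Successfully acquired borg lock"
```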
Thus, the expected output is:
borg-lock-1 | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-lock-1 | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-lock-1 | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-lock-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-lock-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1 | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-wait-1 | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-wait-1 | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-wait-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-wait-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1 | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-wait-1 | Waiting for borg repo to exist and being locked
borg-wait-1 | Repo locked, now waiting to acquire borg lock
borg-lock-1 | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-lock-1 | Init borg repo
borg-lock-1 | A repository already exists at /mnt/tigris/repo.
borg-lock-1 | Acquire lock
borg-lock-1 | Lock acquired
borg-wait-1 | alive 1
borg-wait-1 | repo
borg-wait-1 | alive 2
borg-wait-1 | repo
borg-lock-1 | Released lock
borg-lock-1 | tigrisfs still running, waiting
borg-lock-1 | tigrisfs still running, waiting
borg-lock-1 exited with code 0
borg-wait-1 | Successfully acquired borg lock
borg-wait-1 | tigrisfs still running, waiting
borg-wait-1 | alive 3
borg-wait-1 | repo
borg-wait-1 exited with code 0
Actual behavior
There are two surprising deviations from the expected behavior:
- Even though I am using `--stat-cache-ttl 0`, the `borg-wait` container does not see the released lock until about a minute later. This feels like the directory is in fact not checked again immediately but served from a cache. But let's not get distracted, because the second issue is more severe:
- While waiting for the lock to vanish, the `borg-wait` container's `ls /mnt/tigris` locks up at some point and never returns. This means that the tigrisfs mount is no longer responsive.
The observed output is:
borg-lock-1 | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-lock-1 | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-lock-1 | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-lock-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-lock-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1 | {"level":"info","module":"","version":"1.2.1","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:126","message":"Starting TigrisFS version"}
borg-wait-1 | {"level":"info","module":"","bucketName":"tv-mount-test","mountPoint":"/mnt/tigris","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:127","message":"Mounting"}
borg-wait-1 | {"level":"info","module":"","uid":0,"defaultGid":0,"time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:128","message":"Default uid=gid"}
borg-wait-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:120","message":"env AWS_ACCESS_KEY_ID = tid_dYGIftMTQNXmWLNFlOYkODJcfa_MciOkCdydgBBEonBApUGDAU"}
borg-wait-1 | {"level":"info","module":"","time":1758099577,"caller":"/home/runner/work/tigrisfs/tigrisfs/core/cfg/conf_s3.go:129","message":"default endpoint = https://t3.storage.dev"}
borg-wait-1 | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-wait-1 | Waiting for borg repo to exist and being locked
borg-wait-1 | Repo locked, now waiting to acquire borg lock
borg-lock-1 | {"level":"info","module":"","time":1758099578,"caller":"/home/runner/work/tigrisfs/tigrisfs/main.go:161","message":"File system has been successfully mounted."}
borg-lock-1 | Init borg repo
borg-lock-1 | A repository already exists at /mnt/tigris/repo.
borg-lock-1 | Acquire lock
borg-lock-1 | Lock acquired
borg-wait-1 | alive 1
borg-wait-1 | repo
borg-wait-1 | alive 2
borg-wait-1 | repo
borg-lock-1 | Released lock
borg-lock-1 | tigrisfs still running, waiting
borg-lock-1 | tigrisfs still running, waiting
borg-lock-1 exited with code 0
borg-wait-1 | alive 3
borg-wait-1 | repo
borg-wait-1 | alive 4
borg-wait-1 | repo
borg-wait-1 | alive 5
borg-wait-1 | repo
borg-wait-1 | alive 6
borg-wait-1 | repo
borg-wait-1 | alive 7
Note the missing `repo` line after `alive 7`: it indicates that the `ls /mnt/tigris` command has locked up.
At this point, you can also
$ docker compose exec borg-wait ash
/ # ls /mnt/tigris
and it will hang forever.
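One way to confirm that requests really are stuck inside the FUSE connection (rather than `ls` merely being slow) is to look at the connection's pending-request counter in sysfs. A sketch, assuming sysfs is visible inside the container (the `--privileged` exec used below should suffice):

```sh
# Each mounted FUSE connection has a directory under /sys/fs/fuse/connections/.
# "waiting" is the number of requests queued for or being processed by the daemon;
# a value that stays > 0 while nothing makes progress points at a stuck daemon.
for c in /sys/fs/fuse/connections/*; do
  echo "$c: $(cat "$c"/waiting) waiting request(s)"
done
```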
Diagnosis
Looking at the stack traces:
$ docker compose exec --privileged borg-wait ash
/ # cat /proc/$(pgrep ls)/stack
[<0>] request_wait_answer+0x10a/0x1e0 [fuse]
[<0>] __fuse_simple_request+0xd5/0x2b0 [fuse]
[<0>] fuse_readdir_uncached+0x12e/0x7f0 [fuse]
[<0>] iterate_dir+0xa7/0x270
[<0>] __x64_sys_getdents64+0x84/0x130
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/ # cat /proc/$(pgrep borg)/stack
[<0>] request_wait_answer+0x10a/0x1e0 [fuse]
[<0>] __fuse_simple_request+0xd5/0x2b0 [fuse]
[<0>] fuse_readdir_uncached+0x12e/0x7f0 [fuse]
[<0>] iterate_dir+0xa7/0x270
[<0>] __x64_sys_getdents64+0x84/0x130
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
This shows that both `ls` and `borg` are stuck in `fuse_readdir_uncached`, waiting for a response from the FUSE filesystem.
The stack traces of the tigrisfs threads are:
~ # for task in /proc/$(pgrep tigrisfs)/task/*; do echo $task; cat $task/stack; echo; done
/proc/22/task/22
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/23
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/24
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/25
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/26
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/27
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/28
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/29
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/30
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/32
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/33
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/34
[<0>] do_epoll_wait+0x4ba/0x4f0
[<0>] do_compat_epoll_pwait.part.0+0xe/0x70
[<0>] __x64_sys_epoll_pwait+0x7e/0x130
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/35
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/36
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/37
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/38
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/39
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/40
[<0>] fuse_dev_do_read+0xa51/0xab0 [fuse]
[<0>] fuse_dev_read+0x5c/0x90 [fuse]
[<0>] vfs_read+0x271/0x350
[<0>] ksys_read+0x6a/0xe0
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
/proc/22/task/54
[<0>] futex_wait_queue+0x69/0xc0
[<0>] __futex_wait+0x145/0x1b0
[<0>] futex_wait+0x6b/0x110
[<0>] do_futex+0xc5/0x190
[<0>] __x64_sys_futex+0x112/0x200
[<0>] do_syscall_64+0x7b/0x820
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
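These kernel-side stacks mostly show futex waits, which is expected for a blocked Go program and does not reveal which goroutine holds what. Go-level goroutine dumps would likely be more telling; a sketch of how to get one, assuming tigrisfs behaves like a standard Go binary and does not install its own SIGQUIT handler:

```sh
# Sends SIGQUIT to the tigrisfs process; the Go runtime then prints all
# goroutine stacks to stderr (visible in the container logs).
# Note: this terminates the process, so the mount goes away afterwards.
kill -QUIT "$(pgrep tigrisfs)"
```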
Conclusion
This looks like a deadlock or a stale lock in tigrisfs.