Fix race condition in concurrent artifact additions #483
Conversation
✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6526

Packit jobs failed. @containers/packit-build please check.
mtrmac left a comment:
Thanks!
The diagnosis + fix is correct. Are we fine with paying the price? Earlier related discussion: containers/podman#26524. Cc: @mheon @baude @Luap99.
Alternatively, it should be possible for oci/layout's newImageDestination to defer reading the index.json until the start of the "write manifests" phase; that would avoid the need to hold the lock for the duration of blob copies, with the fairly major downside that this would (again) give pkg/libartifact a very remote dependency on internal implementation choices of `oci/layout`.
```go
as.lock.Lock()
locked = true
// Lock is still held from the beginning of the function
```
This comment does not make much sense stand-alone for future readers of the code, without seeing it in this diff.
Yes, ref containers/podman#27574. It would be much better to fix this properly, so that we reload index.json after the relock. That means redoing the conflict check again.
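The reload-and-recheck being asked for can be sketched as a toy Go model (this is illustrative, not the actual libartifact code: `indexStore`, its in-memory map, and `errArtifactAlreadyExists` stand in for the on-disk index.json and `libartTypes.ErrArtifactAlreadyExists`):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Stand-in for libartTypes.ErrArtifactAlreadyExists.
var errArtifactAlreadyExists = errors.New("artifact already exists")

// indexStore is a toy model of the artifact store: the real store
// persists an OCI index.json to disk; a map is enough to show the
// reload-and-recheck step after the lock is reacquired.
type indexStore struct {
	lock  sync.Mutex
	index map[string]string // artifact name -> manifest digest (illustrative)
}

func (s *indexStore) Add(name, digest string, copyBlobs func()) error {
	s.lock.Lock()
	locked := true
	defer func() {
		if locked {
			s.lock.Unlock()
		}
	}()

	// First check, under the lock, before the slow blob copy.
	if _, ok := s.index[name]; ok {
		return fmt.Errorf("%s: %w", name, errArtifactAlreadyExists)
	}

	locked = false
	s.lock.Unlock() // allow concurrent additions while blobs copy

	copyBlobs()

	s.lock.Lock()
	locked = true
	// Reload and re-check: another Add may have committed an entry for
	// the same name while the lock was released. Error out instead of
	// silently overwriting the concurrent change.
	if _, ok := s.index[name]; ok {
		return fmt.Errorf("%s: %w", name, errArtifactAlreadyExists)
	}
	s.index[name] = digest // commit
	return nil
}
```

Without the second check, the final `s.index[name] = digest` would clobber whatever a concurrent Add committed during the copy window, which is exactly the lost-artifact symptom described in this PR.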
```go
locked = false
// Release the lock during blob copying to allow concurrent operations.
// We'll reacquire it later and check for conflicts before committing.
as.lock.Unlock()
```
If you're keeping this, you absolutely need to keep the `locked` variable; otherwise an error return while the lock is released risks a double unlock when the defer fires before the lock is reacquired.
This fixes a race condition where concurrent 'podman artifact add' commands for different artifacts would result in only one artifact being created, without any error messages.

The root cause was in the artifact store's Add() method. When the lock was released during blob copying (for performance), concurrent additions could overwrite each other's index entries:

- Process A: Lock → Read index → Unlock → Copy blobs → Lock → Commit
- Process B: Lock → Read index (missing A) → Unlock → Copy blobs → Lock → Commit
- Result: Process B's commit overwrites Process A's entry

The fix adds conflict detection after reacquiring the lock. Before committing, we reload the artifact index and verify the artifact name hasn't been added by another process. If a conflict is detected, we return an error rather than silently overwriting the concurrent change.

A 'locked' boolean variable tracks lock state to prevent double-unlock if an error occurs during blob copying (when the lock is released). The deferred unlock only fires if 'locked' is true.

This preserves the performance benefits of concurrent blob copying while preventing the race condition that caused artifacts to be lost.

Fixes: containers/podman#27569
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Generated-with: Cursor AI
```go
	return nil, fmt.Errorf("%s: %w", dest, libartTypes.ErrArtifactAlreadyExists)
	}
}
```
The original imageDest, as identified in the original PR description, continues to exist, will be used below, and contains the cached data. So this no longer fixes the race at all.
(*grumble* I think ferrying data between an AI and review comments with little human review is effectively equivalent to asking the reviewers to do the work themselves with the help of an AI. In that sense, it takes advantage of the PR as a signal “I care about this so much to do the work, so you should care about this more than about just a filed issue”, and risks breaking that signal. In the worst case, PRs will stay in the same queue as issues, with no reason to look at them with any extra priority.)