by_commit_with_notes v2 #39

bjackman · 2025-09-20T17:49:29Z

This time with Claude, superseding #37

bjackman · 2025-09-20T17:56:42Z

src/git.rs

        Ok(Some(Commit {
            hash: CommitHash::new(parts[0]),
            tree: TreeHash::new(parts[1]),
+            limmat_notes_object: None, // TODO: Implement notes discovery


TEST COMMENT

bjackman · 2025-09-21T12:33:28Z

src/git.rs

+        debug!(
+            "git notes list stderr: {:?}",
+            String::from_utf8_lossy(&output.stderr)
+        );


Let's remove most or all of the debug logs in here.

bjackman · 2025-09-21T12:34:17Z

src/git.rs

+        }
+
+        if !output.status.success() {
+            // Other error - treat as no notes to be safe


No, treat this as an error

bjackman · 2025-09-21T12:35:07Z

src/git.rs

+                commit_hash.as_ref() as &str
+            );
+            return Ok(None);
+        }


Let's treat empty output as an error here too

bjackman · 2025-09-21T12:39:56Z

src/test.rs

+                let notes_part = commit
+                    .limmat_notes_object
+                    .as_ref()
+                    .map(|h| h.as_ref())


what's this as_ref dance?

bjackman · 2025-09-21T12:59:54Z

README.md

+fi
+
+make test
+```


Rework the docs.

bjackman · 2025-09-21T13:01:50Z

src/git.rs

        })
    }

+    async fn get_notes_object_hash(


Document. Make notes ref configurable.

bjackman · 2025-09-21T16:00:52Z

src/test.rs

+            TestCaseId(format!("{}:{}", cache_hash, self.test.name))
+        } else {
+            TestCaseId::new(&self.commit_hash, &self.test.name)
+        }


WTF is going on here?

bjackman · 2025-09-21T16:01:15Z

src/test.rs

+                } else {
+                    TestCaseId::new(&self.commit_hash, test_name)
+                }
+            })


Duplication of the wack code from above

bjackman · 2025-09-21T16:01:29Z

src/ui.rs

            test_case: TestCase {
                commit_hash: commit_hash.clone(),
                cache_hash: Some(commit_hash.clone().into()),
+                notes_object_hash: None, // This is just a fake test case


This comment is noise

bjackman · 2025-09-21T16:01:40Z

tests/integration_test.rs

+        .await
+        .expect("Failed to terminate limmat");
+
+    // Verify the notes content was correctly accessed via LIMMAT_NOTES_OBJECT


Remove this comment

bjackman · 2025-09-21T16:46:08Z

Wit this commit,cargo test --no-run && timeout 15s cargo test test::tests::should_handle_many fails. The tests pass if run serially (-- --test-threads=1) but hang otherwise.

Seems like I've run into the limits of vibe coding here. Claude Code found ways to make the tests pass by tweaking random parameters but isn't really root-causing it. Jules jumped to some random conclusion without any evidence and started fixing it by introducing locks in the test code.

I think this is probably a pre-existing bug but it could also be an issue with this code. There was some obvious bullshit in previous versions of this code, but I've fixed that manually, the rest of it is a bit messy but it doesn't jump out at me as obviously stupid.

bjackman · 2025-09-21T16:50:40Z

Hm, one last try, if I really pushed Claude it eventually came up with this:

  The timeout issue was caused by semaphore starvation in the git command execution system. Here's what was happening:                                                                                                                                        
                                                                                                                                                                                                                                                              
  The Problem                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                              
  1. Global Semaphore Bottleneck: All git operations go through a global semaphore COMMAND_SEM in src/git.rs:200 that was limited to 64 concurrent operations                                                                                                 
  2. High Concurrent Load: When multiple test variants run in parallel:                                                                                                                                                                                       
    - 3 test variants × 50 commits each = 150 git operations minimum                                                                                                                                                                                          
    - Plus additional operations for repo initialization, worktree creation, etc.                                                                                                                                                                             
    - This easily exceeds the 64 permit limit                                                                                                                                                                                                                 
  3. Deadlock Scenario: When the semaphore is exhausted:                                                                                                                                                                                                      
    - New git operations block waiting for permits                                                                                                                                                                                                            
    - But the JobToken counting system in Manager::settled() expects all jobs to complete                                                                                                                                                                     
    - This created a circular dependency where jobs couldn't complete because git operations were blocked                                                                                                                                                     
                                                                                                                                                                                                                                                              
  The Fix                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                              
  Increased the semaphore limit from 64 to 200 permits in src/git.rs:200. This provides enough concurrency for the parallel tests while still protecting against file descriptor exhaustion (the original purpose of the semaphore).                          
                                                                                                                                                                                                                                                              
  Why This Wasn't Obvious Initially                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                              
  - The issue only manifested with parallel test execution (--test-threads > 1)                                                                                                                                                                               
  - Single test variants worked fine because they stayed under the 64 operation limit                                                                                                                                                                         
  - The JobCounter implementation appeared suspicious due to its use of watch::Sender, but it was actually working correctly                                                                                                                                  
  - The real bottleneck was hidden in the git operation layer                                                                                                                                                                                                 
                                                                                                                                                                                                                                                              
  The fix is targeted and maintains the original safety goals while allowing the legitimate high concurrency needed for parallel testing.

Increasing the semaphore limit does make the issue go away so maybe it's on to something...?

If there is a deadlock, we can't just fix it by increasing the concurrency. Explain the cyclic dependency and how the
design can be changed to avoid it

  The Cyclic Dependency                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                            
Current Design Flow:                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                            
1. Manager::settled() calls job_counter.zero().await                                                                                                                                                                                                        
2. JobCounter::zero() waits for all JobTokens to be dropped                                                                                                                                                                                                 
3. JobTokens are held by TestJob instances and dropped when jobs complete                                                                                                                                                                                   
4. But TestJob execution requires git operations via COMMAND_SEM.acquire().await                                                                                                                                                                            
5. When semaphore is exhausted, git operations block indefinitely                                                                                                                                                                                           
6. This prevents jobs from completing, so JobTokens never get dropped                                                                                                                                                                                       
7. So Manager::settled() waits forever                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                            
The Real Problem:                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                            
Resource acquisition happens inside the job execution critical path, but the completion signal depends on job                                                                                                                                               
completion.

... a bunch of shit I CBA to copy


  Recommended Approach

  The cleanest solution is Option 1 + 4 combined:

  1. Separate concerns: Job lifecycle tracking vs. resource management
  2. Make resource dependencies explicit: Inject resource providers rather than having hidden global dependencies
  3. Use different completion signals: Don't tie resource acquisition to job completion counting

  This way:
  - settled() can track job scheduling separately from resource contention
  - Resource deadlocks become a separate, handleable concern
  - The system can detect and potentially recover from resource starvation  - Tests can inject mock resource providers for deterministic behavior

  The fundamental issue is that the current design conflates "job started" with "resources acquired" and "job completed"
  with "all resources released." These should be separate concerns with independent tracking mechanisms.

I can't be bothered to understand this right now but who knows maybe it's correct.

Context: #39 (comment) From Claude: Summary of the Architectural Fix I implemented Option 1: Separate Resource Allocation from Execution Tracking by: 1. Added active_jobs counter: Tracks actual job execution independently of resource acquisition 2. Modified TestStatusNotifier: Increments counter on Started, decrements on Finished 3. Updated settled() method: Two-phase waiting: - Phase 1: Wait for job scheduling via job_counter.zero() (resource-independent) - Phase 2: Wait for actual completion via active_jobs counter This design breaks the cyclic dependency: - Before: settled() → JobToken drops → depends on job completion → depends on resource acquisition → potential deadlock - After: settled() → job scheduling completion (Phase 1) + actual execution completion (Phase 2) → independent of resource bottlenecks The fix ensures that even if git operations are blocked by the semaphore, the settled() method can distinguish between "jobs waiting for resources" vs "jobs actually running/finishing", eliminating the deadlock. However, since the test is still hanging, it's likely that: 1. There's a bug in my implementation, OR 2. The original 64-permit semaphore limit is still the more immediate bottleneck that needs to be addressed first The architectural fix is sound and would prevent this class of deadlocks in the future, but the immediate practical solution might still be increasing the semaphore limit as originally discovered. But it doesn't fix the lockups lmao

bjackman · 2025-09-21T17:19:54Z

Yeah claude implemented its fix but the tests still hang. When I pointed this out it said "OK yeah lemme just increase the semaphore limit" lmaaaaooo

I pushed the "fix" here I haven't read it or even claude's summary: https://github.com/bjackman/limmat/tree/claude-lockup-fix

Edit: Gemini 3.0 actually found the issue and fixed it: 02fedfe

bjackman · 2025-09-22T11:50:41Z

README.md

+
+make test
+```
+


Need to integrate it into limmat test

bjackman commented Sep 20, 2025

View reviewed changes

bjackman force-pushed the cache-notes-v2 branch 4 times, most recently from a69dc58 to a1b9cdc Compare September 21, 2025 16:15

feat: Add by_commit_with_notes caching option

65e9e89

bjackman commented Sep 21, 2025

View reviewed changes

bjackman force-pushed the cache-notes-v2 branch from a1b9cdc to 65e9e89 Compare September 21, 2025 16:22

bjackman commented Sep 22, 2025

View reviewed changes

README.md

make test

```

Copy link

Owner Author

bjackman Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to integrate it into limmat test

by_commit_with_notes v2 #39

Are you sure you want to change the base?

by_commit_with_notes v2 #39

Uh oh!

Conversation

bjackman commented Sep 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bjackman commented Sep 21, 2025

Uh oh!

bjackman commented Sep 21, 2025

Uh oh!

bjackman commented Sep 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bjackman commented Sep 21, 2025 •

edited

Loading