Skip to content

Conversation

@bjackman
Copy link
Owner

This time with Claude, superseding #37

src/git.rs Outdated
Ok(Some(Commit {
hash: CommitHash::new(parts[0]),
tree: TreeHash::new(parts[1]),
limmat_notes_object: None, // TODO: Implement notes discovery
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEST COMMENT

@bjackman bjackman force-pushed the cache-notes-v2 branch 4 times, most recently from a69dc58 to a1b9cdc Compare September 21, 2025 16:15
debug!(
"git notes list stderr: {:?}",
String::from_utf8_lossy(&output.stderr)
);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove most or all of the debug logs in here.

}

if !output.status.success() {
// Other error - treat as no notes to be safe
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, treat this as an error

commit_hash.as_ref() as &str
);
return Ok(None);
}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's treat empty output as an error here too

let notes_part = commit
.limmat_notes_object
.as_ref()
.map(|h| h.as_ref())
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this as_ref dance?

fi

make test
```
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rework the docs.

})
}

async fn get_notes_object_hash(
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document. Make notes ref configurable.

src/test.rs Outdated
TestCaseId(format!("{}:{}", cache_hash, self.test.name))
} else {
TestCaseId::new(&self.commit_hash, &self.test.name)
}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WTF is going on here?

src/test.rs Outdated
} else {
TestCaseId::new(&self.commit_hash, test_name)
}
})
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplication of the wack code from above

test_case: TestCase {
commit_hash: commit_hash.clone(),
cache_hash: Some(commit_hash.clone().into()),
notes_object_hash: None, // This is just a fake test case
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is noise

.await
.expect("Failed to terminate limmat");

// Verify the notes content was correctly accessed via LIMMAT_NOTES_OBJECT
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment

@bjackman
Copy link
Owner Author

Wit this commit,cargo test --no-run && timeout 15s cargo test test::tests::should_handle_many fails. The tests pass if run serially (-- --test-threads=1) but hang otherwise.

Seems like I've run into the limits of vibe coding here. Claude Code found ways to make the tests pass by tweaking random parameters but isn't really root-causing it. Jules jumped to some random conclusion without any evidence and started fixing it by introducing locks in the test code.

I think this is probably a pre-existing bug but it could also be an issue with this code. There was some obvious bullshit in previous versions of this code, but I've fixed that manually, the rest of it is a bit messy but it doesn't jump out at me as obviously stupid.

@bjackman
Copy link
Owner Author

Hm, one last try, if I really pushed Claude it eventually came up with this:

  The timeout issue was caused by semaphore starvation in the git command execution system. Here's what was happening:                                                                                                                                        
                                                                                                                                                                                                                                                              
  The Problem                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                              
  1. Global Semaphore Bottleneck: All git operations go through a global semaphore COMMAND_SEM in src/git.rs:200 that was limited to 64 concurrent operations                                                                                                 
  2. High Concurrent Load: When multiple test variants run in parallel:                                                                                                                                                                                       
    - 3 test variants × 50 commits each = 150 git operations minimum                                                                                                                                                                                          
    - Plus additional operations for repo initialization, worktree creation, etc.                                                                                                                                                                             
    - This easily exceeds the 64 permit limit                                                                                                                                                                                                                 
  3. Deadlock Scenario: When the semaphore is exhausted:                                                                                                                                                                                                      
    - New git operations block waiting for permits                                                                                                                                                                                                            
    - But the JobToken counting system in Manager::settled() expects all jobs to complete                                                                                                                                                                     
    - This created a circular dependency where jobs couldn't complete because git operations were blocked                                                                                                                                                     
                                                                                                                                                                                                                                                              
  The Fix                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                              
  Increased the semaphore limit from 64 to 200 permits in src/git.rs:200. This provides enough concurrency for the parallel tests while still protecting against file descriptor exhaustion (the original purpose of the semaphore).                          
                                                                                                                                                                                                                                                              
  Why This Wasn't Obvious Initially                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                              
  - The issue only manifested with parallel test execution (--test-threads > 1)                                                                                                                                                                               
  - Single test variants worked fine because they stayed under the 64 operation limit                                                                                                                                                                         
  - The JobCounter implementation appeared suspicious due to its use of watch::Sender, but it was actually working correctly                                                                                                                                  
  - The real bottleneck was hidden in the git operation layer                                                                                                                                                                                                 
                                                                                                                                                                                                                                                              
  The fix is targeted and maintains the original safety goals while allowing the legitimate high concurrency needed for parallel testing.             

Increasing the semaphore limit does make the issue go away so maybe it's on to something...?

If there is a deadlock, we can't just fix it by increasing the concurrency. Explain the cyclic dependency and how the
design can be changed to avoid it

  The Cyclic Dependency                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                            
Current Design Flow:                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                            
1. Manager::settled() calls job_counter.zero().await                                                                                                                                                                                                        
2. JobCounter::zero() waits for all JobTokens to be dropped                                                                                                                                                                                                 
3. JobTokens are held by TestJob instances and dropped when jobs complete                                                                                                                                                                                   
4. But TestJob execution requires git operations via COMMAND_SEM.acquire().await                                                                                                                                                                            
5. When semaphore is exhausted, git operations block indefinitely                                                                                                                                                                                           
6. This prevents jobs from completing, so JobTokens never get dropped                                                                                                                                                                                       
7. So Manager::settled() waits forever                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                            
The Real Problem:                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                            
Resource acquisition happens inside the job execution critical path, but the completion signal depends on job                                                                                                                                               
completion.                                                                                                                     

... a bunch of shit I CBA to copy


  Recommended Approach

  The cleanest solution is Option 1 + 4 combined:

  1. Separate concerns: Job lifecycle tracking vs. resource management
  2. Make resource dependencies explicit: Inject resource providers rather than having hidden global dependencies
  3. Use different completion signals: Don't tie resource acquisition to job completion counting

  This way:
  - settled() can track job scheduling separately from resource contention
  - Resource deadlocks become a separate, handleable concern
  - The system can detect and potentially recover from resource starvation  - Tests can inject mock resource providers for deterministic behavior

  The fundamental issue is that the current design conflates "job started" with "resources acquired" and "job completed"
  with "all resources released." These should be separate concerns with independent tracking mechanisms.

I can't be bothered to understand this right now but who knows maybe it's correct.

bjackman added a commit that referenced this pull request Sep 21, 2025
Context: #39 (comment)

From Claude:

  Summary of the Architectural Fix

  I implemented Option 1: Separate Resource Allocation from Execution Tracking by:

  1. Added active_jobs counter: Tracks actual job execution independently of resource acquisition
  2. Modified TestStatusNotifier: Increments counter on Started, decrements on Finished
  3. Updated settled() method: Two-phase waiting:
    - Phase 1: Wait for job scheduling via job_counter.zero() (resource-independent)
    - Phase 2: Wait for actual completion via active_jobs counter

  This design breaks the cyclic dependency:
  - Before: settled() → JobToken drops → depends on job completion → depends on resource acquisition → potential deadlock
  - After: settled() → job scheduling completion (Phase 1) + actual execution completion (Phase 2) → independent of
  resource bottlenecks

  The fix ensures that even if git operations are blocked by the semaphore, the settled() method can distinguish between
  "jobs waiting for resources" vs "jobs actually running/finishing", eliminating the deadlock.

  However, since the test is still hanging, it's likely that:
  1. There's a bug in my implementation, OR
  2. The original 64-permit semaphore limit is still the more immediate bottleneck that needs to be addressed first

  The architectural fix is sound and would prevent this class of deadlocks in the future, but the immediate practical
  solution might still be increasing the semaphore limit as originally discovered.

But it doesn't fix the lockups lmao
@bjackman
Copy link
Owner Author

bjackman commented Sep 21, 2025

Yeah claude implemented its fix but the tests still hang. When I pointed this out it said "OK yeah lemme just increase the semaphore limit" lmaaaaooo

I pushed the "fix" here I haven't read it or even claude's summary: https://github.com/bjackman/limmat/tree/claude-lockup-fix

Edit: Gemini 3.0 actually found the issue and fixed it: 02fedfe


make test
```

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to integrate it into limmat test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants