Skip to content

Conversation

@alpe
Copy link
Contributor

@alpe alpe commented Oct 15, 2025

This PR got bigger than expected when debugging test failures. The PR includes

  • Call InitChain by Syncer intial state setup as Executor does
  • Moved new state validation to state.AssertValidForNextState with additional checks to ensure consistency. Used by Syncer and Executor
  • Abort execution on invalid state errors
  • Return runtime errors in fullnode run for better error tracing

@github-actions
Copy link
Contributor

github-actions bot commented Oct 15, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedOct 30, 2025, 12:28 PM

@claude
Copy link
Contributor

claude bot commented Oct 15, 2025

Claude finished @alpe's task —— View job


Comprehensive Code Review Complete

I've performed a comprehensive code review of PR #2763 focusing on the areas requested. Below is my analysis organized by focus area:


1. Code Quality & Architecture

Strong Points

  • Clean separation of concerns: The validation logic has been properly consolidated into State.AssertValidForNextState() (types/state.go:55-85), which is now shared by both Syncer and Executor
  • Consistent error handling: Both executor.go:636 and syncer.go:573 use the same validation method, ensuring consistency
  • Good use of atomic operations: Syncer properly uses atomic.Pointer[types.State] for concurrent state access (syncer.go:51)

Areas for Improvement

1. Proto field ordering (proto/evnode/v1/state.proto:18)

bytes last_header_hash = 9;

As noted in review comments, this should be = 8 to follow protobuf conventions (keeping fields sequential). Field 8 is currently used by app_hash, but that should be swapped. Update: The review comment suggests keeping it as-is for backward compatibility reasons.

2. Backward compatibility issue (types/state.go:66-68) - CRITICAL

if len(s.LastHeaderHash) == 0 { // initial state
    return nil
}

This addresses the testnet compatibility concern raised by @tac0turtle. When upgrading from an older binary, LastHeaderHash will be empty in the loaded state, causing validation to fail. The current fix skips validation for initial states, which is correct. However, there's a subtle bug: this check happens AFTER basic validation (line 62-64), so the validation will still fail if the header itself doesn't have LastHeaderHash populated. The logic should ensure both old and new nodes can handle the transition gracefully.

Recommendation: Consider adding a migration path or clearer documentation about the upgrade procedure.

3. Error handling pattern inconsistency (syncer.go:405-411)

case errors.Is(err, errInvalidBlock):
    // do not reschedule
case errors.Is(err, errInvalidState):
    s.sendCriticalError(fmt.Errorf("invalid state detected..."))
default:
    s.cache.SetPendingEvent(height, event)

The errInvalidBlock case doesn't reschedule the event, which is correct, but it also doesn't send a critical error like errInvalidState does. While this may be intentional (blocks from DA can be invalid), it would be clearer to add a comment explaining why invalid blocks don't trigger critical errors.


2. Error Handling & Resilience

Good Practices

  • Non-blocking error channel: Both syncer.go:580 and executor.go:640 use sendCriticalError() with non-blocking sends, preventing deadlocks
  • Graceful shutdown: node/full.go:273-442 properly handles runtime errors separately from shutdown errors, returning them appropriately

Concerns

1. Error propagation in fullnode (node/full.go:324-332)

var runtimeErr error
if err := n.blockComponents.Start(ctx); err != nil {
    if !errors.Is(err, context.Canceled) {
        runtimeErr = fmt.Errorf("running block components: %w", err)
    }
}

This is an improvement over previous code, but the runtime error only captures errors from blockComponents.Start(). Critical errors sent via errorCh from Syncer or Executor won't be returned by FullNode.Run(). The error channel is created but never read from in the fullnode code.

Recommendation: Consider reading from errorCh in a goroutine within FullNode.Run() and propagating those errors as runtime errors.

2. Missing context for validation errors (types/state.go:77-82)
The error messages for validation failures are good, but they don't include the current state height, which would be helpful for debugging:

return fmt.Errorf("invalid last header hash - got: %x, want: %x (state height: %d)", 
    header.LastHeaderHash, s.LastHeaderHash, s.LastBlockHeight)

3. Security

Positive Security Measures

  • Hash validation: The new LastHeaderHash validation (state.go:77-79) prevents block reordering attacks and ensures chain continuity
  • Legacy field size limits: serialization.go:434-441 properly limits unknown field sizes to prevent DoS attacks
  • Defensive copying: syncer.go:160-161 properly clones byte slices when returning state

Potential Issues

1. Empty hash bypass (types/state.go:66-68)

if len(s.LastHeaderHash) == 0 { // initial state
    return nil
}

While this is needed for backward compatibility, it creates a window where validation is skipped. An attacker could potentially exploit this during the upgrade window. Consider adding additional checks like verifying that LastBlockHeight == InitialHeight - 1 when LastHeaderHash is empty.

2. No signature verification in AssertValidForNextState
The validation method checks structural consistency but doesn't verify signatures. While signature verification is done elsewhere (syncer.go:567-571), it would be more robust to include signature validation in the state assertion method.


4. Performance & Resource Efficiency

Efficient Patterns

  • Batch operations: syncer.go:473-492 properly uses batch writes for atomic state updates
  • Atomic state access: Using atomic pointers prevents lock contention for state reads

Performance Considerations

1. Hash computation overhead (types/hashing.go:14-63)

func (h *Header) Hash() Hash {
    slimHash, err := h.HashSlim()
    if err != nil {
        return nil
    }
    if h.Legacy != nil && !h.Legacy.IsZero() {
        legacyHash, err := h.HashLegacy()
        // ...
    }
    return slimHash
}

The Hash() method potentially computes both slim and legacy hashes. In hot paths (block validation, syncing), this could be optimized by caching the hash after first computation.

2. Repeated state cloning (syncer.go:159-163)

stateCopy := *state
stateCopy.AppHash = bytes.Clone(state.AppHash)
stateCopy.LastHeaderHash = bytes.Clone(state.LastHeaderHash)

Every call to GetLastState() creates a new copy with cloned byte slices. For read-only access, consider returning a read-only view or using copy-on-write semantics.


5. Testing & Reliability

Test Coverage Analysis

Strong Coverage:

  • types/state_test.go provides comprehensive table-driven tests for AssertValidForNextState() covering all validation paths
  • Edge cases well tested: chain ID mismatch, invalid heights, time, hashes

Coverage Gaps (39 missing lines per Codecov):

  1. node/full.go:273-442 - The enhanced error handling and shutdown logic has 13 missing lines + 3 partials. Critical paths like the runtime error propagation aren't fully tested.
  2. types/hashing.go - 5 missing + 5 partials. The legacy hash fallback logic needs more test coverage.
  3. syncer.go:405-411 - The errInvalidState error handling path may not be fully covered

Missing Test Scenarios:

  1. Upgrade scenario: No test for the backward compatibility case where an old state (without LastHeaderHash) is loaded and a new block is validated
  2. Concurrent state access: No stress tests for concurrent GetLastState() calls during state updates
  3. InitChain error handling: syncer.go:188-196 calls InitChain, but error paths aren't well tested

Recommendation: Add integration tests for:

  • Upgrading from old state format to new format
  • Syncer handling errInvalidState critical errors
  • FullNode runtime error propagation

6. Documentation & Developer Experience

Good Documentation

  • Clear comments explaining the purpose of validations (state.go:55-57)
  • Review comments provide good context for design decisions

Documentation Gaps

1. Migration guide missing
The PR introduces a breaking change (LastHeaderHash field) that affects existing testnets. There should be:

  • A migration section in CHANGELOG
  • Instructions for node operators on how to upgrade safely
  • Explanation of the backward compatibility logic

2. Error message documentation
The new errInvalidState is critical (triggers node shutdown) but isn't documented. Consider adding package-level documentation explaining the error taxonomy:

  • errInvalidBlock - block fails validation, don't retry
  • errInvalidState - state divergence, manual intervention required

3. Proto field reservation
proto/evnode/v1/state.proto:20 has reserved 7; but no comment explaining why. This should be documented for future maintainers.


7. Specific Issues from Review Comments

Issue 1: Proto field naming (RESOLVED)

Comment: @tac0turtle suggested bytes last_header_hash = 8; instead of = 9
Status: This appears to be a valid point for proto standards, but may have backward compatibility implications since field 8 is currently app_hash.

Issue 2: Testnet compatibility (ADDRESSED)

Comment: @tac0turtle tested on Eden testnet and found it can't sync when switching branches
Status: The backward compatibility check (state.go:66-68) addresses this, but see security concern above.

Issue 3: Orphaned error variable (ACKNOWLEDGED)

Comment: @julienrbrt noted errInvalidBlock at line 430 is unused
Status: @alpe acknowledged this is from a merge conflict. Should be removed if truly unused.


Summary & Recommendations

Critical Issues (Must Fix)

  1. Testnet upgrade path: Strengthen the backward compatibility logic to prevent security issues during migration
  2. Runtime error propagation: Fix FullNode.Run() to properly capture and return critical errors from errorCh
  3. Remove orphaned error variable if confirmed unused (syncer.go:430)

High Priority (Should Fix)

  1. Add integration tests for the upgrade scenario
  2. Document the migration process for node operators
  3. Add state height to validation error messages
  4. Consider caching header hashes to reduce repeated computation

Medium Priority (Consider)

  1. Optimize state cloning for read-only access
  2. Add package-level documentation for error taxonomy
  3. Document proto field reservation
  4. Add comments explaining why errInvalidBlock doesn't trigger critical errors

Low Priority (Nice to Have)

  1. Stress tests for concurrent state access
  2. More comprehensive test coverage for error paths
  3. Consider including signature verification in AssertValidForNextState

Conclusion

This PR makes important improvements to block validation and error handling, particularly by consolidating validation logic and improving error traceability. The main concerns are around the backward compatibility implementation (which could create security issues) and the incomplete error propagation in the fullnode. With the critical issues addressed, this will be a solid improvement to the codebase.

The code quality is generally high, following Go idioms and the project's architectural patterns. The test coverage is good for happy paths but could be strengthened for error scenarios and the upgrade path.

alpe added 3 commits October 17, 2025 17:36
* main:
  feat(cache): cache pruning (#2761)
  refactor: replace sort.Slice with slices.Sort for natural ordering (#2768)
  chore!: cleanup header unused types (#2766)
  chore: remove unused flag (#2765)
  chore: fix some comments (#2762)
@alpe alpe force-pushed the alex/fix_block_validation branch from d3aa64e to 74cc608 Compare October 17, 2025 15:59
@alpe alpe marked this pull request as ready for review October 20, 2025 06:35
alpe added 2 commits October 27, 2025 10:37
* main:
  fix: make signature verification backwards compatible  (#2782)
  chore: adding upgrade test for evm-single (#2780)
  refactor: replace interface{} with any for clarity and modernization (#2781)
  feat: replay mechanism to sync node with execution layer (#2771)
  docs: update readme for sync pkg (#2776)
  build(deps): Bump the all-go group across 6 directories with 4 updates (#2772)
  refactor:  remove obsolete // +build tag (#2774)
  build(deps): Bump vite from 5.4.20 to 5.4.21 in /docs in the npm_and_yarn group across 1 directory (#2775)
  build(deps): Bump actions/setup-node from 5 to 6 (#2773)
@julienrbrt julienrbrt self-requested a review October 27, 2025 12:38
alpe added 2 commits October 28, 2025 11:25
* main:
  build(deps): Bump actions/upload-artifact from 4 to 5 (#2788)
  build(deps): Bump actions/download-artifact from 5 to 6 (#2786)
  chore!: audit fixes (#2764)
  refactor(executing): add retries on ExecuteTxs (#2784)
  feat: increase default ReadinessMaxBlocksBehind from 3 to 30 (#2779)
@codecov
Copy link

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 58.41584% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.40%. Comparing base (da374bf) to head (e04714a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
node/full.go 15.78% 13 Missing and 3 partials ⚠️
types/hashing.go 54.16% 6 Missing and 5 partials ⚠️
block/internal/syncing/syncer.go 71.42% 5 Missing and 3 partials ⚠️
types/serialization.go 40.00% 1 Missing and 2 partials ⚠️
types/state.go 88.23% 2 Missing ⚠️
pkg/store/batch.go 50.00% 0 Missing and 1 partial ⚠️
pkg/store/store.go 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2763      +/-   ##
==========================================
+ Coverage   62.24%   62.40%   +0.15%     
==========================================
  Files          82       82              
  Lines        7221     7273      +52     
==========================================
+ Hits         4495     4539      +44     
- Misses       2186     2189       +3     
- Partials      540      545       +5     
Flag Coverage Δ
combined 62.40% <58.41%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

outReader, err := cmd.StdoutPipe()
require.NoError(s.t, err)

if s.debug {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store logs for debugging issues

cmdToPids: make(map[string][]int),
outBuff: ring.New(100),
errBuff: ring.New(100),
debug: testing.Verbose(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabled, when tests are run with -v flag

google.protobuf.Timestamp last_block_time = 5;
uint64 da_height = 6;
bytes app_hash = 8;
bytes LastHeaderHash = 9;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store last header hash to validate ensure consistency

if headerTime := header.Time(); s.LastBlockTime.After(headerTime) {
return fmt.Errorf("invalid block time - got: %v, last: %v", headerTime, s.LastBlockTime)
}
if !bytes.Equal(header.LastHeaderHash, s.LastHeaderHash) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensures that state is correct in scenarios with empty data. Sometimes only header timestamps differed on double sign scenarios

julienrbrt
julienrbrt previously approved these changes Oct 29, 2025
Copy link
Member

@julienrbrt julienrbrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

<!--
Please read and fill out this form before submitting your PR.

Please make sure you have reviewed our contributors guide before
submitting your
first PR.

NOTE: PR titles should follow semantic commits:
https://www.conventionalcommits.org/en/v1.0.0/
-->

## Overview

<!-- 
Please provide an explanation of the PR, including the appropriate
context,
background, goal, and rationale. If there is an issue with this
information,
please provide a tl;dr and link the issue. 

Ex: Closes #<issue number>
-->


Updated header hashing to respect both encodings. Header.Hash() now
tries the slim hash first and, when
legacy fields are present, recomputes using the legacy binary payload so
historical blocks keep their
original digest. Added standalone HashSlim / HashLegacy helpers and a
regression test that ensures the legacy
path is chosen even when legacy hashes are zero-filled. Tests run: go
test ./types. You may still see state
mismatches from other code paths that compare hashes generated before
this change—rerun your sync tests to
  confirm where the remaining divergence happens.
tac0turtle
tac0turtle previously approved these changes Oct 29, 2025
alpe added 2 commits October 30, 2025 13:06
* main:
  docs: abci migration  (#2783)
  perf: cleanup marshaling in batch (#2794)
  chore: add debug logs in submitter (#2792)
@tac0turtle tac0turtle added this pull request to the merge queue Oct 30, 2025
Merged via the queue into main with commit 1fce716 Oct 30, 2025
37 of 38 checks passed
@tac0turtle tac0turtle deleted the alex/fix_block_validation branch October 30, 2025 12:57
@github-project-automation github-project-automation bot moved this to Done in Evolve Oct 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants