-
Notifications
You must be signed in to change notification settings - Fork 62
Accelerate udf native build by using cuDF from rapids-4-spark jar #614
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Extract the libcudf.so from rapids-4-spark jar to accelerate the building Signed-off-by: Gary Shen <gashen@nvidia.com>
Signed-off-by: Gary Shen <gashen@nvidia.com>
Signed-off-by: Gary Shen <gashen@nvidia.com>
Greptile SummaryThis PR significantly improves the UDF examples build process by extracting prebuilt Key Changes:
Implementation Quality:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Maven
participant AntScript as Maven Ant Script
participant CMake
participant JarFile as rapids-4-spark.jar
participant NativeDeps as target/native-deps
participant CudfRepo as target/cudf-repo
User->>Maven: mvn clean package -Pudf-native-examples
alt USE_PREBUILT_CUDF=ON (default)
Maven->>Maven: Try download rapids-4-spark jar<br/>(with/without classifier)
Maven->>AntScript: generate-sources phase
AntScript->>AntScript: Check USE_PREBUILT_CUDF flag
alt rapids-4-spark jar found
AntScript->>JarFile: Extract libcudf.so & libnvcomp.so
JarFile-->>NativeDeps: Libraries extracted
AntScript->>AntScript: Execute clone-cudf-repo.sh
AntScript->>CudfRepo: Clone cuDF repo (shallow, headers only)
CudfRepo-->>AntScript: Headers available
else rapids-4-spark jar NOT found
AntScript->>AntScript: Create USE_SOURCE_BUILD marker
AntScript-->>Maven: Will fallback to source build
end
Maven->>CMake: compile phase with -DUSE_PREBUILT_CUDF=ON
CMake->>CMake: Check USE_SOURCE_BUILD marker
alt Prebuilt libraries available
CMake->>NativeDeps: Find libcudf.so
CMake->>CudfRepo: Get cuDF headers
CMake->>CMake: Use rapids-cpm for CCCL & RMM
CMake->>CMake: Create cudf_imported target
CMake->>CMake: Build only UDF native code (fast!)
else Prebuilt NOT available
CMake->>CMake: Auto-fallback to source build
CMake->>CMake: Build cuDF from source (30+ min)
end
else USE_PREBUILT_CUDF=OFF
Maven->>CMake: compile phase with -DUSE_PREBUILT_CUDF=OFF
CMake->>CMake: Skip prebuilt checks
CMake->>CMake: Build cuDF from source (30+ min)
end
CMake-->>Maven: libudfexamplesjni.so built
Maven->>Maven: Package into final jar
Maven-->>User: Build complete!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds a fast build mode for UDF native examples by enabling the use of prebuilt cuDF libraries extracted from the rapids-4-spark jar, significantly reducing build time from approximately 2 hours to just a few minutes.
Key Changes:
- Introduces
USE_PREBUILT_CUDFoption (default: ON) to toggle between prebuilt and source build modes - Implements automatic extraction of
libcudf.soand related libraries from rapids-4-spark jar during Maven build - Adds automatic fallback to source build when prebuilt libraries are unavailable
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt | Adds CMake logic to detect and use prebuilt cuDF libraries, with fallback to source build |
| examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml | Implements Maven build phases for downloading rapids-4-spark jar, extracting native libraries, and cloning cuDF headers |
| examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh | Provides standalone script for manually extracting cuDF dependencies from rapids-4-spark jar |
| examples/UDF-Examples/RAPIDS-accelerated-UDFs/README.md | Updates documentation with instructions for both fast build (prebuilt) and traditional source build options |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh
Outdated
Show resolved
Hide resolved
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt
Outdated
Show resolved
Hide resolved
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt
Outdated
Show resolved
Hide resolved
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt
Outdated
Show resolved
Hide resolved
Signed-off-by: Gary Shen <gashen@nvidia.com>
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh, line 81 (link)logic: inconsistent library extraction: bash script extracts
libnvcomp.sobut notlibrmm.so, while pom.xml extractslibrmm.sobut notlibnvcomp.so(line 333). should extract the same libraries in both places.
4 files reviewed, 1 comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (3)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 333-334 (link)style: Extracting
librmm.soappears unnecessary since CMake fetches RMM viarapids_cpm_rmm()and only uses its headers, not the prebuilt library. The CMake code on line 231 (CMakeLists.txt:231) links againstrmm::rmmtarget (headers only) rather than the extractedlibrmm.so. -
examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh, line 81 (link)style: Same as pom.xml: extracting
librmm.sois unnecessary since it's not used by the build process.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 36-41 (link)style: The
rapids-cmake-dirproperty is never set byinclude(rapids-cmake), so this always uses the hardcoded fallback path${CMAKE_BINARY_DIR}/_deps/rapids-cmake-src. Theget_property()call is ineffective.
4 files reviewed, 3 comments
Signed-off-by: Gary Shen <gashen@nvidia.com>
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 232 (link)logic:
SKIP_EXTRACTIONmarker file is created but never checked or used. OnlyUSE_SOURCE_BUILDmarker is checked in CMakeLists.txt:103 -
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 362 (link)syntax: The git clone command uses
${cudf.git.branch}in a bash string without proper quoting, which could cause issues if the branch name contains special shell characters
4 files reviewed, 2 comments
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (3)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 355-358 (link)style: git clone operation could fail silently in some edge cases. The
set -eensures early exit, but if the directory exists but git fetch/pull fails (e.g., network issues), the build continues with potentially stale headers. -
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 103-109 (link)style: Marker file check creates tight coupling between Maven and CMake. If Maven creates the marker but CMake cache already has
USE_PREBUILT_CUDF=ON, the FORCE override could surprise users who explicitly set the option.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 238-243 (link)style: Check that Maven property paths work cross-platform. On Windows,
${settings.localRepository}may use backslashes while the rest uses forward slashes, potentially breaking path construction.
4 files reviewed, 3 comments
Signed-off-by: Gary Shen <gashen@nvidia.com>
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 369-413 (link)logic: inline bash script in Maven build could fail on Windows
This
<exec executable="bash">block will fail on Windows systems that don't have bash installed. While the path normalization earlier in the file accounts for cross-platform compatibility, this bash script execution doesn't.Alternatively, check that this is only expected to run on Linux/Unix systems and document this requirement.
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 115-137 (link)logic: user explicitly requesting prebuilt mode when jar is missing could be handled more gracefully
When a user explicitly sets
-DUSE_PREBUILT_CUDF=ONbut the jar is missing, the warning says "The build will likely FAIL" but then continues anyway. This leads to confusing build failures later.Consider failing immediately when the user's explicit intent cannot be fulfilled:
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
4 files reviewed, 2 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 380 (link)style: This extremely long single-line bash command (380+ characters) is very hard to maintain and debug. While the comment explains why it needs to be on one line, consider these alternatives:
- Move this git logic to
extract-cudf-libs.shand call that script from Maven - Use a heredoc with proper escaping
- Create a separate shell script file
The current approach makes the pom.xml harder to read and the bash script is difficult to validate or test independently.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- Move this git logic to
4 files reviewed, 1 comment
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 380 (link)style: Single-line bash script is hard to read and maintain. The complex conditional logic with multiple branches makes debugging difficult if git operations fail.
Consider extracting to separate
.shfile (similar toextract-cudf-libs.sh) or using multi-line XML CDATA section for better readability.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
4 files reviewed, 1 comment
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 205 (link)style: Bash command is extremely long (over 1000 chars) and complex. Consider extracting this logic to
extract-cudf-libs.sh(which already exists) and calling that script instead for better maintainability.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
4 files reviewed, 1 comment
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/clone-cudf-repo.sh, line 1 (link)logic: Script lacks executable permissions in git (mode 100644 instead of 100755). Maven uses
<chmod>to fix this at runtime (line 369 in pom.xml), but users running the script manually will hit permission errors.Run:
git update-index --chmod=+x clone-cudf-repo.sh -
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 300-305 (link)style: The condition creates
should.skip.extractionbased onuse.source.build, but still creates directories (mkdir) even when skipping extraction. These directories are created before the skip check.Lines 230-232 create the directories unconditionally:
<mkdir dir="${project.build.directory}/native-deps"/> <mkdir dir="${project.build.directory}/cudf-repo"/> <mkdir dir="${project.build.directory}/temp-extract"/>
Move the
mkdircommands inside the conditional block (after line 305) to avoid unnecessary directory creation during source builds.
5 files reviewed, 2 comments
Add executable to script Move mkdir to correct place Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 116-132 (link)logic: When user explicitly sets
USE_PREBUILT_CUDF=ONbut jar is missing, the warning is shown but build continues, which will fail later. Consider usingFATAL_ERRORinstead ofWARNINGto fail fast and provide clearer feedback.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
5 files reviewed, 1 comment
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh, line 84 (link)style: potential issue if multiple
.so*files have the same name in different subdirectories
5 files reviewed, 1 comment
Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh, line 81 (link)logic: Unquoted variable allows path injection. If
$JAR_PATHcontains spaces or special characters, the unzip command will fail or behave unexpectedly. -
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 174-191 (link)style: Fallback logic in nested condition updates cache variable within conditional block. This creates state that may not persist correctly across CMake runs. The
FORCEupdate should happen outside the conditional or use a different mechanism.Consider restructuring to make the fallback decision earlier and more explicit, or document that users should delete
CMakeCache.txtif they encounter issues after a failed prebuilt attempt.
5 files reviewed, 2 comments
|
@greptile full review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh, line 28-31 (link)style: hardcoded defaults may become stale over time - consider reading these values from
pom.xmlor making them required arguments rather than optional environment variables with defaultsNote: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 192 (link)style: verify that SNAPSHOT builds use the classifier pattern - using
${cuda.version}as classifier assumes all artifact naming follows this conventionNote: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
5 files reviewed, 2 comments
|
@greptileai Can you run a full review once and list up all the problems you have? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll conduct a comprehensive review of this PR that adds support for using prebuilt cuDF libraries from the rapids-4-spark jar to accelerate UDF native builds. Let me analyze all the changes systematically.
Overall Assessment
This is a well-structured PR that addresses a real performance pain point - reducing cuDF build time from 1-2 hours to minutes by reusing prebuilt libraries. The implementation is comprehensive but has several issues that need attention.
Major Issues
1. Copyright Year Inconsistency
Files: clone-cudf-repo.sh, extract-cudf-libs.sh, pom.xml, CMakeLists.txt
Issue: Copyright year shows 2026 instead of 2025
The copyright headers use "2026" which is incorrect for current development.
2. Shell Script Permissions
The new shell scripts (clone-cudf-repo.sh, extract-cudf-libs.sh) may not have execute permissions in the repository, which could cause build failures.
3. Complex Maven Ant Logic
The Maven pom.xml contains very complex Ant scripting (200+ lines) that's hard to maintain and debug. This logic handles:
- Multiple jar file naming patterns
- Cross-platform path normalization
- Complex conditional extraction logic
4. Inconsistent Property Naming
USE_PREBUILT_CUDF(CMake/Maven)skipCudfExtraction(Maven only)cudf.git.branch(Maven only)
The naming conventions aren't consistent between build systems.
5. Error Handling Edge Cases
CMakeLists.txt Line 101-149: The auto-fallback logic is complex and has potential race conditions:
# Check if Maven created a marker to force source build
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/USE_SOURCE_BUILD")This relies on Maven creating a marker file, which creates tight coupling between build tools.
6. Path Assumptions
Multiple hardcoded relative paths assume specific directory structure:
../../../target/native-deps../../../target/cudf-repo../../../target/USE_SOURCE_BUILD
Medium Issues
7. Documentation Structure
The README.md becomes quite long with the new options. Consider breaking into subsections or a separate build guide.
8. Maven Repository Configuration
Adding Sonatype snapshots repository without clear justification:
<repository>
<id>sonatype-snapshots</id>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
...
</repository>9. Version Management
Hardcoded versions in extract script:
RAPIDS4SPARK_VERSION="${RAPIDS4SPARK_VERSION:-26.02.0-SNAPSHOT}"Should use Maven properties instead.
10. CMake Target Aliasing
add_library(cudf::cudf ALIAS cudf_imported)Creating aliases for imported targets can be fragile across CMake versions.
Minor Issues
11. Verbose Logging
The extraction process is very verbose with lots of echo statements that could clutter build logs.
12. Windows Compatibility
While path normalization is attempted, the shell scripts are bash-specific and won't work on Windows without WSL/Cygwin.
13. Cleanup Logic
The temporary extraction directory cleanup could be more robust:
<delete dir="${project.build.directory}/temp-extract"/>14. Variable Naming in Scripts
Some bash variables use inconsistent naming (e.g., CONFLICT_COUNT vs conflict_count).
Positive Aspects
- Clear Performance Benefit: Reduces build time by ~2 hours
- Backward Compatibility: Maintains ability to build from source
- Comprehensive Documentation: Good README updates with examples
- Automatic Fallback: Gracefully handles missing dependencies
- Configurable: Multiple Maven properties for customization
Recommendations
- Fix copyright years to 2025
- Simplify Maven logic - consider using Maven plugins instead of complex Ant scripts
- Add shell script permissions to git repository
- Standardize property naming across build systems
- Reduce hardcoded paths - use Maven/CMake variables
- Add Windows support or document Linux/macOS requirement
- Consider breaking up the large pom.xml changes into smaller, focused commits
- Add integration tests to verify the new build path works correctly
Security Considerations
The extraction of libraries from JARs is generally safe, but consider:
- Validating JAR signatures before extraction
- Checksums for extracted libraries
- Clear documentation about trusted sources
Overall Verdict
This is a valuable improvement that solves a real problem, but the implementation complexity introduces maintenance challenges. The core concept is sound, but several implementation details need refinement for production readiness.
|
@greptileai It's 2026 now. Why do you think copyright shouldn't use 2026? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right! Since it's 2026 now, the copyright should indeed use 2026. Looking at the diff, I can see that most of the new files correctly use 2026 in their copyright headers:
clone-cudf-repo.sh:Copyright (c) 2026, NVIDIA CORPORATION.extract-cudf-libs.sh:Copyright (c) 2026, NVIDIA CORPORATION.pom.xml: Updated fromCopyright (c) 2020-2025toCopyright (c) 2020-2026CMakeLists.txt: Updated fromCopyright (c) 2021-2025toCopyright (c) 2021-2026
This is the correct approach for copyright notices - they should reflect the current year when the code is being modified or created. The files show the proper copyright date range from the original creation year through the current year (2026).
The copyright updates in this PR are appropriate and follow standard practices for maintaining copyright notices in source code.
Handle classifier of plugin jar Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (2)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh, line 139 (link)logic: extraction will fail silently if jar contains no matching files. The
unzipcommand with-qsuppresses warnings when pattern doesn't match. Without the pattern matching anything, the subsequent check on line 173 will fail, but error message won't indicate whether unzip found no matches or if extraction failed for another reason -
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 166-170 (link)style: checking
CUDF_LIBRARY_CHECKbut variable might not be defined iffind_libraryfails.find_librarydoesn't fail the build when library isn't found - it just leaves the variable unset
5 files reviewed, 2 comments
Handle no found library from jar Signed-off-by: Gary Shen <gashen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (3)
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 36-40 (link)logic:
rapids-cmake-dirproperty may not be set. Theget_property()doesn't guarantee the property exists, and the fallback path assumes_deps/rapids-cmake-srcwhich may not exist yet at this point in the build. -
examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt, line 251-252 (link)logic: Using
include()with a path constructed from variables requires the file to exist. Ifrapids-cmake-diris incorrect or rapids-cmake isn't downloaded yet, this will fail with a cryptic error.Add existence check before include:
-
examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml, line 391-397 (link)logic: Unzip includes symlinks (e.g.,
libcudf.so.26), but symlinks inside JARs are not preserved when extracted. The pattern**/libcudf.so*will match versioned files likelibcudf.so.26.2.0, but the symlinklibcudf.so→libcudf.so.26won't work after extraction, causing linker to fail when looking forlibcudf.so.Check if the jar contains the base
libcudf.soor only versioned files, and create symlink if needed. Does the rapids-4-spark jar contain libcudf.so as a regular file or only as a symlink to versioned files?
5 files reviewed, 3 comments
Signed-off-by: Gary Shen <gashen@nvidia.com>
Check path existence before including Signed-off-by: Gary Shen <gashen@nvidia.com>
|
cc @YanxuanLiu to help with the following CI updates, thanks |
Add a parameter
USE_PREBUILT_CUDF(default: ON) to extract the libcudf.so and others from rapids-4-spark jar file.So we don't need to build native cuDF from the source code everytime, which requires about 1-2 hours.
You can still use
-DUSE_PREBUILT_CUDF=OFFto build native cuDF.Close #576