Changes to TRT-LLM download tool for multigpu distributed case #3830
base: main
Conversation
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py 2025-09-22 06:35:28.523784+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py 2025-09-22 06:36:00.657186+00:00
@@ -863,6 +863,6 @@
     return False
 def is_thor() -> bool:
     if torch.cuda.get_device_capability() in [(11, 0)]:
-        return True
\ No newline at end of file
+        return True
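For context, a minimal sketch of what the `is_thor()` helper could look like with an explicit fall-through return; this is illustrative only and may not match the exact body in the PR.

```python
import torch


def is_thor() -> bool:
    # Thor reports compute capability 11.0 (per the diff above).
    # Sketch only: adds an explicit False fall-through for clarity.
    if torch.cuda.get_device_capability() in [(11, 0)]:
        return True
    return False
```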
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py 2025-09-25 19:33:28.176615+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py 2025-09-25 19:34:02.325958+00:00
@@ -100,11 +100,10 @@
             return True
     except Exception as e:
         logger.warning(f"Failed to detect CUDA version: {e}")
         return False
-
     return True
 def _cache_root() -> Path:
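The commit messages in this PR mention guarding the TRT-LLM wheel download with a lock file so that multiple ranks on the same node do not race. A rough sketch of that pattern is below, assuming a POSIX `fcntl` file lock; the `_cache_root` location and the `download_with_lock` helper are illustrative names, not necessarily what the PR implements.

```python
import fcntl
import os
from pathlib import Path


def _cache_root() -> Path:
    # Assumed cache location; the PR may place the wheel elsewhere.
    base = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
    return base / "torch_tensorrt"


def download_with_lock(target: Path, do_download) -> Path:
    """Run do_download(target) at most once across concurrent ranks on a node."""
    lock_path = target.with_suffix(".lock")
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock_file:
        # Blocks until whichever rank grabbed the lock first has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not target.exists():
                do_download(target)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```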
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-11-19 02:00:51.654228+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-11-19 02:01:22.401857+00:00
@@ -60,38 +60,40 @@
     return True
 def create_distributed_config(item: Dict[str, Any]) -> Dict[str, Any]:
     """Create distributed test configuration from a regular config.
-
+
     Takes a standard test config and modifies it for distributed testing:
     - Changes runner to multi-GPU instance
     - Adds num_gpus field
     - Adds config marker
     """
     import sys
-
+
     # Create a copy to avoid modifying the original
     dist_item = item.copy()
-
+
     # Debug: Show original config
     print(f"[DEBUG] Creating distributed config from:", file=sys.stderr)
     print(f"[DEBUG] Python: {item.get('python_version')}", file=sys.stderr)
     print(f"[DEBUG] CUDA: {item.get('desired_cuda')}", file=sys.stderr)
-    print(f"[DEBUG] Original runner: {item.get('validation_runner')}", file=sys.stderr)
-
+    print(
+        f"[DEBUG] Original runner: {item.get('validation_runner')}", file=sys.stderr
+    )
+
     # Override runner to use multi-GPU instance
     dist_item["validation_runner"] = "linux.g4dn.12xlarge.nvidia.gpu"
-
+
     # Add distributed-specific fields
     dist_item["num_gpus"] = 2
     dist_item["config"] = "distributed"
-
+
     # Debug: Show modified config
     print(f"[DEBUG] New runner: {dist_item['validation_runner']}", file=sys.stderr)
     print(f"[DEBUG] GPUs: {dist_item['num_gpus']}", file=sys.stderr)
-
+
     return dist_item
 def main(args: list[str]) -> None:
     parser = argparse.ArgumentParser()
@@ -131,38 +133,43 @@
         raise ValueError(f"Invalid matrix structure: {e}")
     includes = matrix_dict["include"]
     filtered_includes = []
     distributed_includes = []  # NEW: separate list for distributed configs
-
+
     print(f"[DEBUG] Processing {len(includes)} input configs", file=sys.stderr)
     for item in includes:
         if filter_matrix_item(
             item,
             options.jetpack == "true",
             options.limit_pr_builds == "true",
         ):
             filtered_includes.append(item)
-
+
             # NEW: Create distributed variant for specific configs
             # Only Python 3.10 + CUDA 13.0 for now
             if item["python_version"] == "3.10" and item["desired_cuda"] == "cu130":
-                print(f"[DEBUG] Creating distributed config for py3.10+cu130", file=sys.stderr)
+                print(
+                    f"[DEBUG] Creating distributed config for py3.10+cu130",
+                    file=sys.stderr,
+                )
                 distributed_includes.append(create_distributed_config(item))
-
+
     # Debug: Show summary
     print(f"[DEBUG] Final counts:", file=sys.stderr)
     print(f"[DEBUG] Regular configs: {len(filtered_includes)}", file=sys.stderr)
-    print(f"[DEBUG] Distributed configs: {len(distributed_includes)}", file=sys.stderr)
+    print(
+        f"[DEBUG] Distributed configs: {len(distributed_includes)}", file=sys.stderr
+    )
     # NEW: Output both regular and distributed configs
     filtered_matrix_dict = {
         "include": filtered_includes,
-        "distributed_include": distributed_includes  # NEW field
+        "distributed_include": distributed_includes,  # NEW field
     }
-
+
     # Output to stdout (consumed by GitHub Actions)
     print(json.dumps(filtered_matrix_dict))
 if __name__ == "__main__":
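To make the resulting matrix shape concrete, here is an illustrative sketch of one input entry and the distributed variant derived from it; the original runner value is a made-up placeholder, the other fields mirror the diff above.

```python
# Illustrative only: what create_distributed_config() produces for a
# py3.10 + cu130 entry, per the diff above.
item = {
    "python_version": "3.10",
    "desired_cuda": "cu130",
    "validation_runner": "linux.g5.4xlarge.nvidia.gpu",  # placeholder original runner
}

dist_item = {
    **item,
    "validation_runner": "linux.g4dn.12xlarge.nvidia.gpu",  # multi-GPU instance
    "num_gpus": 2,
    "config": "distributed",
}

# main() then emits both lists on stdout for GitHub Actions to consume:
# {"include": [...regular configs...], "distributed_include": [dist_item, ...]}
```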
…T-LLM wheel by using lock file
…ding race conditions
# the below two functions are used to set the environment variables for the pytest single and multi process
# this is for the github CI where we use pytest
def set_environment_variables_pytest_single_process():
    port = 29500 + random.randint(1, 1000)
Why does it need to be random?
I have encountered cases locally where the port was already in use, which then led to an address bus error. I have changed the logic a bit to handle the single-process and multi-process cases differently, since the multi-process case needs the same port for the two processes. In the multi-process case we set it via the environment variable instead, and I added a warning for that.
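A sketch of how that single-process/multi-process split could look, following the helper quoted above; the multi-process body and the use of MASTER_ADDR/MASTER_PORT (the variables torch.distributed reads) are my approximation, not necessarily the exact code in this PR.

```python
import logging
import os
import random

logger = logging.getLogger(__name__)


def set_environment_variables_pytest_single_process() -> None:
    # Randomize the port so repeated single-process runs on the same machine
    # don't collide with a stale or already-bound 29500.
    port = 29500 + random.randint(1, 1000)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)


def set_environment_variables_pytest_multi_process() -> None:
    # All ranks must rendezvous on the same port, so take it from the
    # environment instead of randomizing, and warn if it was not provided.
    if "MASTER_PORT" not in os.environ:
        logger.warning("MASTER_PORT not set; defaulting to 29500 for all ranks")
        os.environ["MASTER_PORT"] = "29500"
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
```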
…shared memory in teardown class
… pre-req for L2 distributed test
TRT-LLM installation tool for distributed