Changes to TRT-LLM download tool for multigpu distributed case #3830
base: main
Conversation
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py 2025-09-22 06:35:28.523784+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py 2025-09-22 06:36:00.657186+00:00
@@ -863,6 +863,6 @@
     return False
 def is_thor() -> bool:
     if torch.cuda.get_device_capability() in [(11, 0)]:
-        return True
\ No newline at end of file
+        return True
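For context, a minimal sketch of what the `is_thor()` helper could look like with an explicit fall-through return; this is illustrative only and may not match the exact body in the PR.

```python
import torch


def is_thor() -> bool:
    # Thor reports compute capability 11.0 (per the diff above).
    # Sketch only: adds an explicit False fall-through for clarity.
    if torch.cuda.get_device_capability() in [(11, 0)]:
        return True
    return False
```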
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py 2025-09-25 19:33:28.176615+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py 2025-09-25 19:34:02.325958+00:00
@@ -100,11 +100,10 @@
             return True
     except Exception as e:
         logger.warning(f"Failed to detect CUDA version: {e}")
         return False
-
     return True
 def _cache_root() -> Path:
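The commit messages in this PR mention guarding the TRT-LLM wheel download with a lock file so that multiple ranks on the same node do not race. A rough sketch of that pattern is below, assuming a POSIX `fcntl` file lock; the `_cache_root` location and the `download_with_lock` helper are illustrative names, not necessarily what the PR implements.

```python
import fcntl
import os
from pathlib import Path


def _cache_root() -> Path:
    # Assumed cache location; the PR may place the wheel elsewhere.
    base = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
    return base / "torch_tensorrt"


def download_with_lock(target: Path, do_download) -> Path:
    """Run do_download(target) at most once across concurrent ranks on a node."""
    lock_path = target.with_suffix(".lock")
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock_file:
        # Blocks until whichever rank grabbed the lock first has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not target.exists():
                do_download(target)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```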
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-11-19 02:00:51.654228+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-11-19 02:01:22.401857+00:00
@@ -60,38 +60,40 @@
     return True
 def create_distributed_config(item: Dict[str, Any]) -> Dict[str, Any]:
     """Create distributed test configuration from a regular config.
-
+
     Takes a standard test config and modifies it for distributed testing:
     - Changes runner to multi-GPU instance
     - Adds num_gpus field
     - Adds config marker
     """
     import sys
-
+
     # Create a copy to avoid modifying the original
     dist_item = item.copy()
-
+
     # Debug: Show original config
     print(f"[DEBUG] Creating distributed config from:", file=sys.stderr)
     print(f"[DEBUG] Python: {item.get('python_version')}", file=sys.stderr)
     print(f"[DEBUG] CUDA: {item.get('desired_cuda')}", file=sys.stderr)
-    print(f"[DEBUG] Original runner: {item.get('validation_runner')}", file=sys.stderr)
-
+    print(
+        f"[DEBUG] Original runner: {item.get('validation_runner')}", file=sys.stderr
+    )
+
     # Override runner to use multi-GPU instance
     dist_item["validation_runner"] = "linux.g4dn.12xlarge.nvidia.gpu"
-
+
     # Add distributed-specific fields
     dist_item["num_gpus"] = 2
     dist_item["config"] = "distributed"
-
+
     # Debug: Show modified config
     print(f"[DEBUG] New runner: {dist_item['validation_runner']}", file=sys.stderr)
     print(f"[DEBUG] GPUs: {dist_item['num_gpus']}", file=sys.stderr)
-
+
     return dist_item
 def main(args: list[str]) -> None:
     parser = argparse.ArgumentParser()
@@ -131,38 +133,43 @@
         raise ValueError(f"Invalid matrix structure: {e}")
     includes = matrix_dict["include"]
     filtered_includes = []
     distributed_includes = []  # NEW: separate list for distributed configs
-
+
     print(f"[DEBUG] Processing {len(includes)} input configs", file=sys.stderr)
     for item in includes:
         if filter_matrix_item(
             item,
             options.jetpack == "true",
             options.limit_pr_builds == "true",
         ):
             filtered_includes.append(item)
-
+
             # NEW: Create distributed variant for specific configs
             # Only Python 3.10 + CUDA 13.0 for now
             if item["python_version"] == "3.10" and item["desired_cuda"] == "cu130":
-                print(f"[DEBUG] Creating distributed config for py3.10+cu130", file=sys.stderr)
+                print(
+                    f"[DEBUG] Creating distributed config for py3.10+cu130",
+                    file=sys.stderr,
+                )
                 distributed_includes.append(create_distributed_config(item))
-
+
     # Debug: Show summary
     print(f"[DEBUG] Final counts:", file=sys.stderr)
     print(f"[DEBUG] Regular configs: {len(filtered_includes)}", file=sys.stderr)
-    print(f"[DEBUG] Distributed configs: {len(distributed_includes)}", file=sys.stderr)
+    print(
+        f"[DEBUG] Distributed configs: {len(distributed_includes)}", file=sys.stderr
+    )
     # NEW: Output both regular and distributed configs
     filtered_matrix_dict = {
         "include": filtered_includes,
-        "distributed_include": distributed_includes  # NEW field
+        "distributed_include": distributed_includes,  # NEW field
     }
-
+
     # Output to stdout (consumed by GitHub Actions)
     print(json.dumps(filtered_matrix_dict))
 if __name__ == "__main__":
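To make the resulting matrix shape concrete, here is an illustrative sketch of one input entry and the distributed variant derived from it; the original runner value is a made-up placeholder, the other fields mirror the diff above.

```python
# Illustrative only: what create_distributed_config() produces for a
# py3.10 + cu130 entry, per the diff above.
item = {
    "python_version": "3.10",
    "desired_cuda": "cu130",
    "validation_runner": "linux.g5.4xlarge.nvidia.gpu",  # placeholder original runner
}

dist_item = {
    **item,
    "validation_runner": "linux.g4dn.12xlarge.nvidia.gpu",  # multi-GPU instance
    "num_gpus": 2,
    "config": "distributed",
}

# main() then emits both lists on stdout for GitHub Actions to consume:
# {"include": [...regular configs...], "distributed_include": [dist_item, ...]}
```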
…T-LLM wheel by using lock file
…ding race conditions
# the below two functions are used to set the environment variables for the pytest single and multi process
# this is for the github CI where we use pytest
def set_environment_variables_pytest_single_process():
    port = 29500 + random.randint(1, 1000)
Why does it need to be random?
I have encountered cases locally where the port was already in use, which then led to an address bus error. I have changed the logic a bit to handle the single-process and multi-process cases differently, since the multi-process case needs the same port for the two processes. In the multi-process case we set it via the environment variable instead, and I added a warning for that.
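A sketch of how that single-process/multi-process split could look, following the helper quoted above; the multi-process body and the use of MASTER_ADDR/MASTER_PORT (the variables torch.distributed reads) are my approximation, not necessarily the exact code in this PR.

```python
import logging
import os
import random

logger = logging.getLogger(__name__)


def set_environment_variables_pytest_single_process() -> None:
    # Randomize the port so repeated single-process runs on the same machine
    # don't collide with a stale or already-bound 29500.
    port = 29500 + random.randint(1, 1000)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)


def set_environment_variables_pytest_multi_process() -> None:
    # All ranks must rendezvous on the same port, so take it from the
    # environment instead of randomizing, and warn if it was not provided.
    if "MASTER_PORT" not in os.environ:
        logger.warning("MASTER_PORT not set; defaulting to 29500 for all ranks")
        os.environ["MASTER_PORT"] = "29500"
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
```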
…shared memory in teardown class
… pre-req for L2 distributed test
TRT-LLM installation tool for distributed