
Commit a7a8039

make explicit typing as default, polish examples and docs
1 parent 3e008c2 commit a7a8039

File tree

4 files changed, +163 -66 lines changed


docsrc/user_guide/mixed_precision.rst

Lines changed: 58 additions & 20 deletions
@@ -32,8 +32,9 @@ Consider the following PyTorch model which explicitly casts intermediate layer t
         return x
 
 
-If we compile the above model using Torch-TensorRT with the following settings, layer profiling logs indicate that all the layers are
-run in FP32. This is because TensorRT picks the kernels for layers which result in the best performance (i.e., weak typing in TensorRT).
+Before TensorRT 10.12, if we compile the above model using Torch-TensorRT with the following settings,
+layer profiling logs indicate that all the layers are run in FP32. This is because older TensorRT versions
+pick the kernels that give the best performance for each layer (i.e., weak typing).
 
 .. code-block:: python
 
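The compile settings that paragraph refers to are outside this hunk's context. As a rough, hedged sketch (not the documentation's exact snippet), the pre-10.12 weak-typing style of invocation looked like this, reusing the ``MyModule``, ``ep``, and ``inputs`` defined earlier in the guide:

    # Sketch only: weak typing lets TensorRT pick FP32 or FP16 kernels per layer
    # purely for performance, ignoring the dtypes written into the model.
    trt_gm = torch_tensorrt.dynamo.compile(
        ep,
        inputs=inputs,
        use_explicit_typing=False,           # weak typing (pre-TensorRT 10.12 behavior)
        enabled_precisions={torch.float16},  # allow FP16 kernels where TensorRT prefers them
    )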
@@ -49,8 +50,10 @@ run in FP32. This is because TensorRT picks the kernels for layers which result
     # Name: __myl_AddResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye146_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_AddResMulSumAdd_0xcdd0085ad25f5f45ac5fafb72acbffd6, StreamId: 0, Metadata:
 
 
-In order to respect the types specified by the user in the model (eg: in this case, ``linear2`` layer to run in FP16), users can enable
-the compilation setting ``use_explicit_typing=True``. Compiling with this option results in the following TensorRT logs:
+However, TensorRT 10.12 deprecated weak typing, so we must set ``use_explicit_typing=True`` to enable strong
+typing, which means users must specify the precision of the nodes in the model. In the case above we set the
+``linear2`` layer to run in FP16, so if we compile the model with the following settings, ``linear2`` will run
+in FP16 and the other layers will run in FP32, as shown in the following TensorRT logs:
 
 .. code-block:: python
 
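The strongly typed compile call this paragraph describes is likewise not visible in this hunk; a minimal sketch, assuming the same ``MyModule``, ``ep``, and ``inputs`` as above:

    # Sketch only: strong typing makes the engine honor the dtypes expressed in
    # the model, so the layer the model casts to FP16 (linear2) runs in FP16.
    trt_gm = torch_tensorrt.dynamo.compile(
        ep,
        inputs=inputs,
        use_explicit_typing=True,  # strong typing: user-specified precisions are respected
    )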
@@ -68,32 +71,67 @@ the compilation setting ``use_explicit_typing=True``. Compiling with this option
 Autocast
 ---------------
 
-Weak typing behavior in TensorRT is deprecated. However it is a good way to maximize performance. Therefore, in Torch-TensorRT,
-we want to provide a way to enable weak typing behavior in Torch-TensorRT, which is called `Autocast`.
+Weak typing is deprecated in TensorRT, but mixed precision remains a good way to maximize performance.
+Torch-TensorRT therefore provides a way to get mixed-precision behavior similar to weak typing in older
+TensorRT versions, called `Autocast`.
 
-Torch-TensorRT Autocast intelligently selects nodes to keep in FP32 precision to maintain model accuracy while benefiting from
-reduced precision on the rest of the nodes. Torch-TensorRT Autocast also supports users to specify which nodes to exclude from Autocast,
-considering some nodes might be more sensitive to affecting accuracy. In addition, Torch-TensorRT Autocast can cooperate with PyTorch
-native Autocast, allowing users to use both PyTorch and Torch-TensorRT Autocast in the same model. Torch-TensorRT respects the precision
-of the nodes within PyTorch Autocast.
+Before we dive into Torch-TensorRT Autocast, let's first take a look at PyTorch Autocast. PyTorch Autocast is
+context-based: it affects the precision of the nodes inside its context. For example, in PyTorch we can do the
+following:
 
-To enable Torch-TensorRT Autocast, users need to set both ``enable_autocast=True`` and ``use_explicit_typing=True``. For example,
+.. code-block:: python
+
+    x = self.linear1(x)
+    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.float16):
+        x = self.linear2(x)
+    x = self.linear3(x)
+
+This runs ``linear2`` in FP16 while the other layers remain in FP32. Please refer to the `PyTorch Autocast documentation <https://docs.pytorch.org/docs/stable/amp.html#torch.autocast>`_ for more details.
+
+Unlike PyTorch Autocast, Torch-TensorRT Autocast is rule-based: it intelligently selects which nodes to keep in
+FP32 to maintain model accuracy while benefiting from reduced precision on the rest of the nodes. Torch-TensorRT
+Autocast also lets users specify which nodes to exclude from Autocast, since some nodes may be more sensitive to
+reduced precision. In addition, Torch-TensorRT Autocast can cooperate with PyTorch Autocast, allowing both to be
+used in the same model; Torch-TensorRT Autocast respects the precision of the nodes within a PyTorch Autocast
+context.
+
+To enable Torch-TensorRT Autocast, we need to set both ``enable_autocast=True`` and ``use_explicit_typing=True``.
+On top of these, we can also choose the reduced precision with ``autocast_low_precision_type``, and exclude
+certain nodes/ops from Autocast with ``autocast_excluded_nodes`` or ``autocast_excluded_ops``. For example,
 
 .. code-block:: python
 
+    class MyModule(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.linear1 = torch.nn.Linear(10, 10)
+            self.linear2 = torch.nn.Linear(10, 30)
+            self.linear3 = torch.nn.Linear(30, 40)
+
+        def forward(self, x):
+            x = self.linear1(x)
+            x = self.linear2(x)
+            x = self.linear3(x)
+            return x
+
     inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
     mod = MyModule().eval().cuda()
     ep = torch.export.export(mod, tuple(inputs))
-    trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, enable_autocast=True, use_explicit_typing=True)
-
+    trt_gm = torch_tensorrt.dynamo.compile(
+        ep,
+        inputs=inputs,
+        enable_autocast=True,
+        use_explicit_typing=True,
+        autocast_low_precision_type=torch.float16,
+        autocast_excluded_nodes={"^linear2$"},
+    )
 
-Users can also specify the precision of the nodes by ``autocast_low_precision_type``, or ``autocast_excluded_nodes`` / ``autocast_excluded_ops``
-to exclude certain nodes/ops from Autocast.
+This model excludes ``linear2`` from Autocast, so ``linear2`` runs in FP32 and the other layers run in FP16.
 
-In summary, there are three ways in Torch-TensorRT to enable mixed precision:
-1. TRT chooses precision (weak typing): ``use_explicit_typing=False + enable_autocast=False``
-2. User specifies precision (strong typing): ``use_explicit_typing=True + enable_autocast=False``
-3. Autocast chooses precision (autocast + strong typing): ``use_explicit_typing=True + enable_autocast=True``
+In summary, there are now two ways in Torch-TensorRT to choose the precision of the nodes:
+1. User specifies precision (strong typing): ``use_explicit_typing=True + enable_autocast=False``
+2. Autocast chooses precision (autocast + strong typing): ``use_explicit_typing=True + enable_autocast=True``
 
 FP32 Accumulation
 -----------------
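The hunk above also mentions op-level exclusions via ``autocast_excluded_ops``, which the new example script in this commit exercises. A short hedged sketch of such an exclusion (the op string is taken from that example; any op not present in a given graph would simply have nothing to exclude):

    # Sketch only: exclude an entire ATen op from Autocast instead of matching
    # node names, so flatten calls keep their original FP32 precision.
    trt_gm = torch_tensorrt.dynamo.compile(
        ep,
        inputs=inputs,
        enable_autocast=True,
        use_explicit_typing=True,
        autocast_low_precision_type=torch.float16,
        autocast_excluded_ops={"torch.ops.aten.flatten.using_ints"},
    )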
Lines changed: 97 additions & 45 deletions
@@ -1,7 +1,21 @@
+"""
+.. _autocast_example:
+
+An example of using Torch-TensorRT Autocast
+================
+
+This example demonstrates how to use Torch-TensorRT Autocast with PyTorch Autocast to compile a mixed precision model.
+"""
+
 import torch
 import torch.nn as nn
 import torch_tensorrt
 
+# %% Mixed Precision Model
+#
+# We define a mixed precision model that consists of a few layers, a ``log`` operation, and an ``abs`` operation.
+# Among them, the ``fc1``, ``log``, and ``abs`` operations are within the PyTorch Autocast context with ``dtype=torch.float16``.
+
 
 class MixedPytorchAutocastModel(nn.Module):
     def __init__(self):
@@ -20,51 +34,89 @@ def __init__(self):
         self.fc1 = nn.Linear(16 * 8 * 8, 10)
 
     def forward(self, x):
-        x = self.conv1(x)
-        x = self.relu1(x)
-        x = self.pool1(x)
-        x = self.conv2(x)
-        x = self.relu2(x)
-        x = self.pool2(x)
-        x = self.flatten(x)
+        out1 = self.conv1(x)
+        out2 = self.relu1(out1)
+        out3 = self.pool1(out2)
+        out4 = self.conv2(out3)
+        out5 = self.relu2(out4)
+        out6 = self.pool2(out5)
+        out7 = self.flatten(out6)
         with torch.autocast(x.device.type, enabled=True, dtype=torch.float16):
-            x = self.fc1(x)
-            out = torch.log(
-                torch.abs(x) + 1
+            out8 = self.fc1(out7)
+            out9 = torch.log(
+                torch.abs(out8) + 1
             )  # log is fp32 due to Pytorch Autocast requirements
-        return out
-
-
-if __name__ == "__main__":
-    model = MixedPytorchAutocastModel().cuda().eval()
-    inputs = (torch.randn((8, 3, 32, 32), dtype=torch.float32, device="cuda"),)
-    ep = torch.export.export(model, inputs)
-    calibration_dataloader = torch.utils.data.DataLoader(
-        torch.utils.data.TensorDataset(*inputs), batch_size=2, shuffle=False
-    )
-
-    with torch_tensorrt.dynamo.Debugger(
-        "graphs",
-        logging_dir=".",
-        engine_builder_monitor=False,
-    ):
-        trt_autocast_mod = torch_tensorrt.compile(
-            ep.module(),
-            arg_inputs=inputs,
-            min_block_size=1,
-            use_python_runtime=True,
-            ##### weak typing #####
-            # use_explicit_typing=False,
-            # enabled_precisions={torch.float16},
-            ##### strong typing + autocast #####
-            use_explicit_typing=True,
-            enable_autocast=True,
-            autocast_low_precision_type=torch.float16,
-            autocast_excluded_nodes={"^conv1$", "relu"},
-            autocast_excluded_ops={"torch.ops.aten.flatten.using_ints"},
-            autocast_max_output_threshold=512,
-            autocast_max_depth_of_reduction=None,
-            autocast_calibration_dataloader=calibration_dataloader,
-        )
+        return x, out1, out2, out3, out4, out5, out6, out7, out8, out9
+
+
+# %%
+# Define the model, inputs, and calibration dataloader for Autocast, then run the original PyTorch model to get the reference outputs.
+
+model = MixedPytorchAutocastModel().cuda().eval()
+inputs = (torch.randn((8, 3, 32, 32), dtype=torch.float32, device="cuda"),)
+ep = torch.export.export(model, inputs)
+calibration_dataloader = torch.utils.data.DataLoader(
+    torch.utils.data.TensorDataset(*inputs), batch_size=2, shuffle=False
+)
+
+pytorch_outs = model(*inputs)
+
+# %% Compile the model with Torch-TensorRT Autocast
+#
+# We compile the model with Torch-TensorRT Autocast by setting ``enable_autocast=True``, ``use_explicit_typing=True``, and
+# ``autocast_low_precision_type=torch.bfloat16``. To illustrate, we exclude the ``conv1`` node, all nodes whose names
+# contain ``relu``, and the ``torch.ops.aten.flatten.using_ints`` ATen op from Autocast. In addition, we also set
+# ``autocast_max_output_threshold``, ``autocast_max_depth_of_reduction``, and ``autocast_calibration_dataloader``. Please refer to
+# the documentation for more details.
+
+trt_autocast_mod = torch_tensorrt.compile(
+    ep.module(),
+    arg_inputs=inputs,
+    min_block_size=1,
+    use_python_runtime=True,
+    use_explicit_typing=True,
+    enable_autocast=True,
+    autocast_low_precision_type=torch.bfloat16,
+    autocast_excluded_nodes={"^conv1$", "relu"},
+    autocast_excluded_ops={"torch.ops.aten.flatten.using_ints"},
+    autocast_max_output_threshold=512,
+    autocast_max_depth_of_reduction=None,
+    autocast_calibration_dataloader=calibration_dataloader,
+)
+
+autocast_outs = trt_autocast_mod(*inputs)
+
+# %% Verify the outputs
+#
+# We verify that both the dtypes and values of the model outputs are correct.
+# As expected, ``fc1`` is in FP16 because of PyTorch Autocast;
+# ``pool1``, ``conv2``, and ``pool2`` are in BF16 because of Torch-TensorRT Autocast;
+# the rest remain in FP32. Note that ``log`` is in FP32 because of PyTorch Autocast requirements.
+
+should_be_fp32 = [
+    autocast_outs[0],
+    autocast_outs[1],
+    autocast_outs[2],
+    autocast_outs[5],
+    autocast_outs[7],
+    autocast_outs[9],
+]
+should_be_fp16 = [
+    autocast_outs[8],
+]
+should_be_bf16 = [autocast_outs[3], autocast_outs[4], autocast_outs[6]]
 
-        autocast_outs = trt_autocast_mod(*inputs)
+assert all(
+    a.dtype == torch.float32 for a in should_be_fp32
+), "Some Autocast outputs are not float32!"
+assert all(
+    a.dtype == torch.float16 for a in should_be_fp16
+), "Some Autocast outputs are not float16!"
+assert all(
+    a.dtype == torch.bfloat16 for a in should_be_bf16
+), "Some Autocast outputs are not bfloat16!"
+for i, (a, w) in enumerate(zip(autocast_outs, pytorch_outs)):
+    assert torch.allclose(
+        a.to(torch.float32), w.to(torch.float32), atol=1e-2, rtol=1e-2
+    ), f"Autocast and Pytorch outputs do not match! autocast_outs[{i}] = {a}, pytorch_outs[{i}] = {w}"
+print("All dtypes and values match!")

py/torch_tensorrt/dynamo/_compiler.py

Lines changed: 7 additions & 0 deletions
@@ -543,6 +543,13 @@ def compile(
             stacklevel=2,
         )
 
+    if kwargs.get("use_explicit_typing", False) == False:
+        warnings.warn(
+            "`use_explicit_typing` is deprecated. This setting will be removed and you should enable autocast instead.",
+            DeprecationWarning,
+            stacklevel=2,
+        )
+
     if "truncate_long_and_double" in kwargs.keys():
         if truncate_double is not _defaults.TRUNCATE_DOUBLE:
             raise ValueError(
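The warning message steers users toward Autocast; a hedged sketch of the migration it suggests, keeping ``use_explicit_typing`` at its new default rather than passing ``False`` (``ep`` and ``inputs`` as in the docs above):

    # Sketch only: instead of relying on weak typing, keep strong typing (now the
    # default) and let Autocast choose which nodes to run in reduced precision.
    trt_gm = torch_tensorrt.dynamo.compile(
        ep,
        inputs=inputs,
        use_explicit_typing=True,  # now the default; see _defaults.py below
        enable_autocast=True,
        autocast_low_precision_type=torch.float16,
    )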

py/torch_tensorrt/dynamo/_defaults.py

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@
 ENGINE_CACHE_DIR = os.path.join(tempfile.gettempdir(), "torch_tensorrt_engine_cache")
 ENGINE_CACHE_SIZE = 5368709120  # 5GB
 CUSTOM_ENGINE_CACHE = None
-USE_EXPLICIT_TYPING = False
+USE_EXPLICIT_TYPING = True
 USE_FP32_ACC = False
 REFIT_IDENTICAL_ENGINE_WEIGHTS = False
 STRIP_ENGINE_WEIGHTS = False
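For context, a hedged sketch of what the flipped default means for a plain compile call with no typing-related arguments:

    # Sketch only: with USE_EXPLICIT_TYPING defaulting to True, this call now builds
    # a strongly typed engine, so layer dtypes follow the model rather than
    # TensorRT's performance-driven kernel selection.
    trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs)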
