From 85af24bdb551da3bea126cdb2d04889aa3cbe011 Mon Sep 17 00:00:00 2001
From: Guillaume Dalle <22795598+gdalle@users.noreply.github.com>
Date: Sat, 29 Nov 2025 20:36:00 +0100
Subject: [PATCH 1/3] Add docs page on kernel raising

---
 docs/src/tutorials/kernels.md | 73 +++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 docs/src/tutorials/kernels.md

diff --git a/docs/src/tutorials/kernels.md b/docs/src/tutorials/kernels.md
new file mode 100644
index 0000000000..2a82948e0a
--- /dev/null
+++ b/docs/src/tutorials/kernels.md
@@ -0,0 +1,73 @@
# Kernels

Suppose your codebase contains custom GPU kernels, typically those defined with [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl).

## Example

```@example kernels
using KernelAbstractions

@kernel function square_kernel!(y, @Const(x))
    i = @index(Global)
    @inbounds y[i] = x[i] * x[i]
end

function square(x)
    y = similar(x)
    backend = KernelAbstractions.get_backend(x)
    kernel! = square_kernel!(backend)
    kernel!(y, x; ndrange=length(x))
    return y
end
```

```jldoctest kernels
x = float.(1:5)
y = square(x)

# output

5-element Vector{Float64}:
  1.0
  4.0
  9.0
 16.0
 25.0
```

## Kernel compilation

To compile such kernels with Reactant, you need to pass the option `raise=true` to the `@compile` or `@jit` macro.
Furthermore, the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package needs to be loaded (even on non-NVIDIA hardware).

```jldoctest kernels
import CUDA
using Reactant

xr = ConcreteRArray(x)
yr = @jit raise=true square(xr)

# output

5-element ConcretePJRTArray{Float64,1}:
  1.0
  4.0
  9.0
 16.0
 25.0
```

## Differentiated kernel

In addition, if you want to compute derivatives of your kernel with [Enzyme.jl](https://github.com/EnzymeAD/Enzyme.jl), the option `raise_first=true` also becomes necessary.

```jldoctest kernels
import Enzyme

sumsquare(x) = sum(square(x))
gr = @jit raise=true raise_first=true Enzyme.gradient(Enzyme.Reverse, sumsquare, xr)

# output

(ConcretePJRTArray{Float64, 1, 1}([2.0, 4.0, 6.0, 8.0, 10.0]),)
```

From 5fe6e01bb52b43d7265c5ca53358a42c70ade1a8 Mon Sep 17 00:00:00 2001
From: Paul Berg <9824244+Pangoraw@users.noreply.github.com>
Date: Sun, 30 Nov 2025 18:03:09 +0100
Subject: [PATCH 2/3] add section about kernel raising

---
 docs/src/tutorials/kernels.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/docs/src/tutorials/kernels.md b/docs/src/tutorials/kernels.md
index 2a82948e0a..e61b10f3f3 100644
--- a/docs/src/tutorials/kernels.md
+++ b/docs/src/tutorials/kernels.md
@@ -57,6 +57,19 @@

## GPU Kernel raising

Kernel raising refers to Reactant's ability to transform a program written in a GPU kernel style, that is, kernel functions which are evaluated in a grid of blocks and threads, where operations are done at the scalar level. The transformation raises the program to a tensor-style function (in the StableHLO dialect) where operations are broadcasted.

This transformation enables several features:

 - Running the raised compute kernel on hardware that the original kernel was not designed for (_e.g._ running a CUDA kernel on a TPU).
 - Enabling further optimizations: since the raised kernel is now indistinguishable from the rest of the program, it can be optimized together with it. For example, two sequential kernel launches operating on the result of each other can be fused if they are both raised,
resulting in a single kernel launch in the final optimized StableHLO program.
 - Lastly, automatic differentiation in Reactant is currently not supported for GPU kernels. Raising kernels enables Enzyme to differentiate the raised kernel. For this to work, one must use the `raise_first` compilation option to make sure the kernels are raised before Enzyme performs automatic differentiation on the program.

!!! note
    Not all classes of kernels are currently raisable to StableHLO. If your kernel encounters an error while being raised, please open an issue on [the Reactant.jl repository](https://github.com/EnzymeAD/Reactant.jl/issues/new?labels=raising).

From da85ba6f55f4a6b0e298d1cd4341b129836692c8 Mon Sep 17 00:00:00 2001
From: Guillaume Dalle <22795598+gdalle@users.noreply.github.com>
Date: Thu, 18 Dec 2025 09:53:18 +0100
Subject: [PATCH 3/3] Switch to examples

---
 docs/src/tutorials/kernels.md | 96 +++++++++++++++++++++--------------
 1 file changed, 58 insertions(+), 38 deletions(-)

diff --git a/docs/src/tutorials/kernels.md b/docs/src/tutorials/kernels.md
index e61b10f3f3..204ae11f5b 100644
--- a/docs/src/tutorials/kernels.md
+++ b/docs/src/tutorials/kernels.md
@@ -1,12 +1,20 @@
# [GPU Kernels](@id gpu-kernels)

```@meta
ShareDefaultModule = true
```

Suppose your code base contains custom GPU kernels, such as those defined with [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl) or directly with a backend like [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl).

## Example

```@example
using KernelAbstractions
```

Here we define a very simple squaring kernel:

```@example
@kernel function square_kernel!(y, @Const(x))
    i = @index(Global)
    @inbounds y[i] = x[i] * x[i]
end

function square(x)
    y = similar(x)
    backend = KernelAbstractions.get_backend(x)
    kernel! = square_kernel!(backend)
    kernel!(y, x; ndrange=length(x))
    return y
end
```

Let's test it to make sure it works:

```@example
x = float.(1:5)
y = square(x)
@assert y == x .^ 2 # hide
```

## Kernel compilation

To compile this kernel with Reactant, the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package needs to be loaded (even on non-NVIDIA hardware).

```@example
import CUDA
using Reactant
```

The rest of the compilation works as usual:

```@example
xr = ConcreteRArray(x)
square_compiled = @compile square(xr)
```

```@example
yr = square_compiled(xr)
@assert yr == xr .^ 2 # hide
```

## Kernel raising

Kernel raising refers to Reactant's ability to transform a program written in a GPU kernel style (that is, kernel functions which are evaluated in a grid of blocks and threads, where operations are done at the scalar level).
The transformation raises the program to a tensor-style function (in the StableHLO dialect) where operations are broadcasted.

Raising is achieved by passing the keyword `raise = true` during compilation:

```@example
square_compiled_raised = @compile raise=true square(xr)
```

```@example
yr2 = square_compiled_raised(xr)
@assert yr2 == xr .^ 2 # hide
```
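If you are curious about what the kernel was raised to, you can print the generated program instead of executing it. The snippet below is only a sketch: it assumes that Reactant's `@code_hlo` macro accepts the same `raise` option as `@compile`.

```julia
# Print the StableHLO module produced once the kernel has been raised
# (assumes `@code_hlo` forwards the same keyword options as `@compile`).
@code_hlo raise=true square(xr)
```

Instead of a call into the original GPU kernel, the printed module should only contain ordinary tensor operations.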
This transformation unlocks several features:

- Running the raised compute kernel on hardware that the original kernel was not designed for (_e.g._ running a CUDA kernel on a TPU).
- Enabling further optimizations: since the raised kernel is now indistinguishable from the rest of the program, it can be optimized together with it. For example, two sequential kernel launches operating on the result of each other can be fused if they are both raised. This results in a single kernel launch in the final optimized StableHLO program (see the sketch below).
- Supporting automatic differentiation, which Reactant currently cannot handle for GPU kernels. Raising kernels enables Enzyme to differentiate the raised kernel (more on this below).
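To illustrate the second point, here is a minimal sketch chaining two launches of the same kernel (the `quartic` helper is hypothetical and not part of the tutorial above); once both launches are raised, the compiler can optimize them as a single tensor program:

```julia
# Two sequential kernel launches: the second consumes the result of the first.
quartic(x) = square(square(x))

# With raising enabled, both launches become tensor operations inside one
# StableHLO program, so the optimizer is free to fuse them.
quartic_compiled = @compile raise=true quartic(xr)
wr = quartic_compiled(xr)  # equivalent to xr .^ 4
```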
!!! note
    Not all classes of kernels are currently raisable to StableHLO. If your kernel encounters an error while being raised, please open an issue on [the Reactant repository](https://github.com/EnzymeAD/Reactant.jl/issues/new?labels=raising).

## Kernel differentiation

If you want to compute derivatives of your kernel, combining Reactant with [Enzyme.jl](https://github.com/EnzymeAD/Enzyme.jl) is the best choice.

```@example
import Enzyme
```

You must use the `raise_first = true` compilation option to make sure the kernel is raised before Enzyme performs automatic differentiation on the program.

```@example
sumsquare(x) = sum(square(x))
gradient_compiled = @compile raise=true raise_first=true Enzyme.gradient(Enzyme.Reverse, sumsquare, xr)
```

Note that the mode and function arguments are partially evaluated at compilation time, but we still need to provide them again at execution time:

```@example
gr = gradient_compiled(Enzyme.Reverse, sumsquare, xr)[1]
@assert gr == 2xr # hide
```
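If you do not need to keep the compiled function around, the `@jit` macro runs the call right after compiling it and accepts the same `raise` and `raise_first` options as `@compile`. A minimal sketch (the `gr_tuple` name is just for illustration):

```julia
# Compile and execute in a single call; returns a one-element tuple,
# just like `Enzyme.gradient` itself.
gr_tuple = @jit raise=true raise_first=true Enzyme.gradient(Enzyme.Reverse, sumsquare, xr)
```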