gw = torch.sum(gw, dim=0)
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.