model training error #19

@guandailu

Command:

python -u ./selene/selene_sdk/cli.py train.yml --lr=0.1

Error information:
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    parse_configs_and_run(configs, lr=0.01)
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/utils/config_utils.py", line 344, in parse_configs_and_run
    execute(operations, configs, current_run_output_dir)
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/utils/config_utils.py", line 188, in execute
    train_model.train_and_validate()
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/train_model.py", line 417, in train_and_validate
    self.train()
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/train_model.py", line 453, in train
    loss.backward()
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (try_all at /opt/conda/conda-bld/pytorch_1591914855613/work/aten/src/ATen/native/cudnn/Conv.cpp:693)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x14c3d6230b5e in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd5d68d (0x14c3d775d68d in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd5e1d1 (0x14c3d775e1d1 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd6220b (0x14c3d776220b in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::cudnn_convolution_backward_input(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0xb2 (0x14c3d7762762 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xdc9280 (0x14c3d77c9280 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xe0db18 (0x14c3d780db18 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x4fa (0x14c3d7763dfa in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0xdc95ab (0x14c3d77c95ab in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xe0db74 (0x14c3d780db74 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0x29dee26 (0x14c4043dee26 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x2a2e634 (0x14c40442e634 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x14c403ff6ff8 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x2ae7df5 (0x14c4044e7df5 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x14c4044e50f3 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x14c4044e5ed2 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_init(int) + 0x39 (0x14c4044de549 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x14c407f0a638 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: + 0xd3e79 (0x14c41efd3e79 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/matplotlib/../../../libstdc++.so.6)
frame #19: + 0x94b43 (0x14c42c894b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: + 0x126a00 (0x14c42c926a00 in /lib/x86_64-linux-gnu/libc.so.6)
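
Environment check (sketch):

The failure occurs in cudnn_convolution_backward_input during loss.backward(), so the PyTorch, CUDA, and cuDNN versions and the GPU memory size are likely useful for triage. The script below is a minimal sketch for collecting them, not part of selene_sdk. The function name collect_env_report is hypothetical. The assumption that a version mismatch or memory pressure is involved is a guess and is not confirmed by the traceback above.

```python
import importlib
import platform


def collect_env_report():
    """Gather version details that are often requested when triaging cuDNN errors.

    Hypothetical helper for illustration; not part of selene_sdk or PyTorch.
    """
    report = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_available": None,
        "cuda": None,
        "cudnn": None,
        "gpus": [],
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this interpreter; return what we have.
        return report

    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # CUDA version torch was built with
    report["cuda_available"] = torch.cuda.is_available()
    report["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    if report["cuda_available"]:
        for index in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(index)
            # Record the device name and total memory in MiB.
            report["gpus"].append((props.name, props.total_memory // 2**20))
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Run it inside the same conda environment (selene_sdk) that produced the traceback, so the reported versions match the ones used for training.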
