
[BUG] unit test failures on DeepSpeed upstream #56

@bmedishe

Description


Error log:
=========================== short test summary info ============================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_moe[4]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_moe_and_zero[4-True]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_moe_and_zero[2-True]
FAILED tests/unit/test_configurable_parallel.py::TestConfigurableMP::test_gpt2_basic
====== 4 failed, 581 passed, 58 skipped, 1 warning in 3850.22s (1:04:10) =======
Steps to reproduce:
Follow the steps in this PR to install PyTorch with hipify_torch as a submodule.
After building and installing PyTorch from source, clone DeepSpeed from upstream, do a JIT build, and run the unit tests:

  1. git clone https://github.com/microsoft/DeepSpeed.git
  2. Remove #include<THC/THCGeneral.h> from csrc/lamb/fused_lamb_cuda_kernel.cu before building
  3. ./install.sh (JIT build)
  4. DEEPSPEED_TEST_WITH_ROCM=1 pytest --forked tests/unit/test_* 2>&1 | tee deepspeed_unit_test
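A minimal shell sketch of step 2, plus a helper that reruns only the four failing tests by their pytest node IDs instead of the full hour-long suite. The function names strip_thc_include and rerun_failed_tests are hypothetical (not part of DeepSpeed); the sketch assumes GNU sed (for in-place editing with a bare -i) and that it is run from the root of the DeepSpeed checkout.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: GNU sed, run from the DeepSpeed checkout root.
# strip_thc_include and rerun_failed_tests are hypothetical helper names.
set -euo pipefail

# Delete every "#include <THC/THCGeneral.h>" line (with or without a space
# after "#include", and with optional leading whitespace) in place.
strip_thc_include() {
  local file="$1"
  sed -i '/^[[:space:]]*#include[[:space:]]*<THC\/THCGeneral\.h>/d' "$file"
}

# Rerun only the tests listed as FAILED in the summary above.
# Node IDs are quoted because the brackets would otherwise be glob patterns.
rerun_failed_tests() {
  DEEPSPEED_TEST_WITH_ROCM=1 pytest --forked \
    'tests/unit/test_checkpointing.py::test_checkpoint_moe[4]' \
    'tests/unit/test_checkpointing.py::test_checkpoint_moe_and_zero[4-True]' \
    'tests/unit/test_checkpointing.py::test_checkpoint_moe_and_zero[2-True]' \
    'tests/unit/test_configurable_parallel.py::TestConfigurableMP::test_gpt2_basic'
}
```

Usage, after sourcing the functions: strip_thc_include csrc/lamb/fused_lamb_cuda_kernel.cu, then ./install.sh, then rerun_failed_tests.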


Labels: bug (Something isn't working)
