This repository was archived by the owner on Nov 19, 2025. It is now read-only.

llama-70b SFT OSError: [Errno 5] Input/output error #502

@songwangnlp

Description

Describe the bug

Any potential directions for debugging this error? The SFT job fails while saving a checkpoint with the traceback below.

OSError: [Errno 5] Input/output error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 173, in main
    sft_trainer.fit()
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 249, in fit
    self.save(metrics, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 269, in save
    self.ckpt_callback.custom_save(monitor_candidates=monitor_candidates, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 73, in custom_save_ckpt_func
    super(NeMoModelCheckpoint, self)._save_last_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 696, in _save_last_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 544, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only, storage_options=storage_options)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1365, in save_checkpoint
    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 416, in save_checkpoint
    self.checkpoint_io.save_checkpoint(checkpoint, ckpt_to_dir(filepath), storage_options=storage_options)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 271, in save_checkpoint
    return dist_checkpointing.save(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 395, in save
    sharded_strategy.save(sharded_state_dict, checkpoint_dir)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/base.py", line 227, in save
    async_calls.maybe_finalize_async_calls(blocking=True)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/async_utils.py", line 209, in maybe_finalize_async_calls
    finalize_fn()
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/torch.py", line 706, in finalize_fn
    save_state_dict_async_finalize(*save_state_dict_ret)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/state_dict_saver.py", line 144, in save_state_dict_async_finalize
    write_results = storage_writer.retrieve_write_results()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/filesystem_async.py", line 328, in retrieve_write_results
    raise RuntimeError(f'Worker failure: {write_results_or_exc}') from write_results_or_exc
RuntimeError: Worker failure: [Errno 5] Input/output error
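
Errno 5 (EIO) surfacing from retrieve_write_results means a background checkpoint-writer worker hit a low-level I/O failure while writing shard files, which usually points at the storage backing the checkpoint directory (e.g. a flaky or overloaded shared filesystem) rather than the training code. A raw write probe against the same directory can help rule storage in or out. The script below is only a minimal sketch for that check; the script name, path argument, and file sizes are placeholders, not from the original report:

```python
# write_probe.py -- hypothetical helper, not part of NeMo-Aligner.
# Writes a few large files to the checkpoint directory and fsyncs them,
# to check whether plain sequential writes to that filesystem already
# fail with EIO independently of the training job.
import os
import sys

CHUNK = b"\0" * (64 * 1024 * 1024)  # 64 MiB per write call

def probe(target_dir: str, num_files: int = 4, chunks_per_file: int = 16) -> None:
    os.makedirs(target_dir, exist_ok=True)
    for i in range(num_files):
        path = os.path.join(target_dir, f"write_probe_{i}.bin")
        with open(path, "wb") as f:
            for _ in range(chunks_per_file):   # ~1 GiB per file
                f.write(CHUNK)
            f.flush()
            os.fsync(f.fileno())               # force the data out to storage
        os.remove(path)
        print(f"ok: {path}")

if __name__ == "__main__":
    probe(sys.argv[1])  # e.g. python write_probe.py "$RESULTS_DIR/checkpoints"
```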

Steps/Code to reproduce bug

torchrun \
  --nnodes ${WORLD_SIZE} \
  --nproc_per_node=${GPU_NUM} \
  --node_rank ${RANK} \
  --master_addr ${MASTER_ADDR} \
  --master_port ${MASTER_PORT} \
  ${GPFS}/examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=bf16 \
   trainer.num_nodes=${WORLD_SIZE} \
   trainer.devices=${GPU_NUM} \
   trainer.sft.max_steps=-1 \
   trainer.sft.max_epochs=2 \
   trainer.sft.limit_val_batches=40 \
   trainer.sft.val_check_interval=$SAVE_STEPS \
   model.megatron_amp_O2=True \
   model.restore_from_path=${PRETRAINED_ACTOR_NEMO_FILE} \
   model.tensor_model_parallel_size=${TP_SIZE} \
   model.pipeline_model_parallel_size=${PP_SIZE} \
   model.sequence_parallel=False \
   model.encoder_seq_length=${SEQUENCE_LENGTH} \
   model.optim.lr=${LEARNING_RATE} \
   model.answer_only_loss=True \
   model.data.num_workers=0 \
   model.use_flash_attention=True \
   ++model.data.train_ds.packed_sequence=True \
   ++model.data.train_ds.file_path=${TRAIN_DATA_PATH} \
   model.data.train_ds.max_seq_length=${SEQUENCE_LENGTH} \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=${GLOBAL_BATCH_SIZE} \
   ++model.data.validation_ds.packed_sequence=True \
   ++model.data.validation_ds.file_path=${VALID_DATA_PATH}  \
   model.data.validation_ds.max_seq_length=${SEQUENCE_LENGTH} \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=${GLOBAL_BATCH_SIZE} \
   exp_manager.create_wandb_logger=False \
   exp_manager.explicit_log_dir=${RESULTS_DIR} \
   exp_manager.wandb_logger_kwargs.project=sft_run_instruct_data \
   exp_manager.wandb_logger_kwargs.name=${RUN_NAME} \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   ++exp_manager.checkpoint_callback_params.always_save_nemo=True \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss
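
Since the failure comes out of the parallel checkpoint writer, a probe that exercises a comparable multi-rank write path (torch.distributed.checkpoint with its default filesystem writer) against the same exp_manager.explicit_log_dir may reproduce the EIO outside the full SFT job. This is only a sketch under the assumption that a RESULTS_DIR environment variable points at the same directory as above; the script name and tensor sizes are arbitrary. It can be launched with the same torchrun --nnodes/--nproc_per_node arguments as the training command:

```python
# dcp_write_probe.py -- hypothetical repro script, not from the original report.
# Every rank writes its own shard to RESULTS_DIR via torch.distributed.checkpoint,
# loosely mimicking the parallel shard writes of the checkpoint save.
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

def main() -> None:
    dist.init_process_group(backend="gloo")  # gloo is enough for a write-only test
    rank = dist.get_rank()
    target = os.path.join(os.environ["RESULTS_DIR"], "dcp_write_probe")
    # Each rank contributes ~1 GiB of tensor data.
    state_dict = {f"rank{rank}_blob": torch.randn(256, 1024, 1024)}
    dcp.save(state_dict, checkpoint_id=target)
    if rank == 0:
        print(f"distributed checkpoint written to {target}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```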

Environment details

Docker image: registry.zoomdev.us/languagetech/nemo:24.12.01

  • OS version: Linux
  • PyTorch version: 2.5.0a0+e000cf0ad9.nv24.10
  • Python version: 3.10.12
