@cui0523 cui0523 commented Dec 1, 2025

Description

Changes

Testing & Benchmark

Checklist

  • Read and followed the Contributing Guidelines.
  • Self-tested locally to ensure the code runs correctly and achieves expected results (all CI checks expected to pass).
  • Updated documentation if needed.
  • Verified accuracy or performance benchmarks if applicable.

Reviewers

@@ -0,0 +1,342 @@
# Experiment Overview

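Building on a completed Qwen2.5-7B-Instruct pretraining run, this experiment uses the memory simulation tool DryRun to simulate pretraining under the same configuration and compares the memory usage, thereby demonstrating that DryRun can serve as an effective means of resource planning and problem diagnosis for model training, avoiding the resource consumption and time cost of real training.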
Collaborator

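Don't phrase the second half ("thereby demonstrating ...") this way; demonstrating this capability is your implicit logic. Change it to: "introduces how to use the DryRun tool, helping users reduce development and debugging costs."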


```

# RryRun Simulation of the Qwen2.5-7B-Instruct Pretraining Workflow
Collaborator

Please don't let careless spelling mistakes like this through.

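Building on a completed Qwen2.5-7B-Instruct pretraining run, this experiment uses the memory simulation tool DryRun to simulate pretraining under the same configuration and compares the memory usage, thereby demonstrating that DryRun can serve as an effective means of resource planning and problem diagnosis for model training, avoiding the resource consumption and time cost of real training.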


# Environment Setup
Collaborator

There's no need to repeat this part; just refer to the content in Chapter 2.


# RryRun Simulation of the Qwen2.5-7B-Instruct Pretraining Workflow

## Weight Conversion
Collaborator

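Weight conversion, data preprocessing, and so on don't need to be covered again either, since they are identical to the earlier chapters. Just mention them briefly, along the lines of "After completing the xx, xx, and xx steps as described in chapter xx".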


## DryRun Simulated Pretraining

1. In the file /MindSpeed-Core-MS/MindSpeed-LLM/mindspeed_llm/training/training.py, modify the pretain function by adding the following lines of code at the beginning of the pretain function
Collaborator

Likewise, don't use absolute paths.


![training](./image/training.png)

2. In the file /MindSpeed-Core-MS/MSAdapter/mmsadpter/distributed/distributed_c10d.py, change init_method to "tcp://ip:port", replacing ip and port with the values for your environment
Collaborator

Same as above.

</tr>
<tr>
<td>Real pretraining + no recomputation</td>
<td>Device MOC memory size: 62420M<br>
Collaborator

I think it would be better to split the memory usage here into four columns, one per metric.

Used peak memory usage (without fragments): 55405M<br>
Actual peak memory usage (with fragments): 57400M
</td>
<td rowspan="2" style="vertical-align: middle;">1. DryRun can simulate the device memory usage of actual pretraining<br>
Collaborator

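Change conclusion 1 to: "The device memory usage simulated by DryRun differs only slightly from the actual device memory usage." The same goes for the later conclusions; wording like "DryRun can simulate" is a bit too assertive.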

```
The result is shown in the figure below

![training](./image/training.png)
Collaborator

This image is too small; please capture a larger screenshot.
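
For reference, the three figures quoted in the table row under review can be split into one value per metric, in line with the table-layout comment in this thread. The following is a minimal Python sketch and not part of the PR: the helper names `parse_metrics`, `summarize`, and `relative_gap` are hypothetical, the `M` suffix is taken exactly as printed in the log, and no DryRun figures are assumed because none appear in this thread.

```python
import re

# Log lines quoted from the table row under review
# (real pretraining, no recomputation).
LOG_LINES = [
    "Device MOC memory size: 62420M",
    "Used peak memory usage (without fragments): 55405M",
    "Actual peak memory usage (with fragments): 57400M",
]

# Matches "<metric name>: <integer>M".
METRIC_RE = re.compile(r"^(?P<name>.+?):\s*(?P<value>\d+)M$")


def parse_metrics(lines):
    """Return {metric name: value in M} for every line that matches."""
    metrics = {}
    for line in lines:
        match = METRIC_RE.match(line.strip())
        if match:
            metrics[match.group("name")] = int(match.group("value"))
    return metrics


def summarize(metrics):
    """Derive per-metric quantities that could each fill one table column."""
    total = metrics["Device MOC memory size"]
    used = metrics["Used peak memory usage (without fragments)"]
    actual = metrics["Actual peak memory usage (with fragments)"]
    return {
        "fragment_overhead_m": actual - used,  # memory lost to fragmentation
        "headroom_m": total - actual,          # device memory left at peak
        "peak_utilization": actual / total,    # fraction of device memory used
    }


def relative_gap(simulated_m, real_m):
    """Relative difference between a simulated and a real peak (hypothetical helper)."""
    return abs(simulated_m - real_m) / real_m


if __name__ == "__main__":
    for key, value in summarize(parse_metrics(LOG_LINES)).items():
        print(f"{key}: {value}")
```

With the figures above, `summarize` reports a fragmentation overhead of 1995M and 5020M of headroom below the device total. `relative_gap` is the kind of check the reworded conclusion implies: pass it the DryRun peak and the real peak once both have been measured.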

3 participants