pretrain + dryrun #7
base: dev
| @@ -0,0 +1,342 @@ | |||
| # Experiment Overview | |||
|
|
|||
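| Building on a completed Qwen2.5-7B-Instruct pretraining run, this experiment uses the memory simulation tool DryRun to simulate pretraining under the same configuration and compares memory usage, thereby demonstrating that DryRun can serve as an effective means of resource planning and problem diagnosis for model training, avoiding the resource consumption and time cost of real training. | |||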
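Don't phrase the second half ("thereby demonstrating...") this way; demonstrating this capability is your implicit logic. Change it to: "introduces how to use the DryRun tool, helping users reduce development and debugging costs".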
|
|
||
| ``` | ||
|
|
||
| # RryRun Simulation of the Qwen2.5-7B-Instruct Pretraining Workflow |
No more careless spelling mistakes like this one, please.
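| Building on a completed Qwen2.5-7B-Instruct pretraining run, this experiment uses the memory simulation tool DryRun to simulate pretraining under the same configuration and compares memory usage, thereby demonstrating that DryRun can serve as an effective means of resource planning and problem diagnosis for model training, avoiding the resource consumption and time cost of real training. | ||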
|
|
||
|
|
||
| # Environment Setup |
This part doesn't need to be introduced again; just reference the material in Chapter2.
|
|
||
| # RryRun Simulation of the Qwen2.5-7B-Instruct Pretraining Workflow | ||
|
|
||
| ## Weight Conversion |
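Weight conversion, data preprocessing, and the like don't need to be introduced either, since they are identical to the earlier chapters. Just mention them briefly, along the lines of "after completing the xx, xx, and xx steps as described in section xx".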
|
|
||
| ## Simulating Pretraining with DryRun | ||
|
|
||
| 1. Modify the pretain function in /MindSpeed-Core-MS/MindSpeed-LLM/mindspeed_llm/training/training.py by adding the following lines of code at the beginning of the pretain function |
Likewise, don't use absolute paths.
|
|
||
|  | ||
|
|
||
| 2. In /MindSpeed-Core-MS/MSAdapter/mmsadpter/distributed/distributed_c10d.py, change init_method to "tcp://ip:port", replacing ip and port with the values for your environment |
Same as above.
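As a side note on step 2, here is a minimal sketch of how a `tcp://ip:port` string could be assembled and sanity-checked before it is written into `init_method`. The helper names `build_init_method` and `init_method_from_env` are hypothetical and not part of this PR, and reading the address from `MASTER_ADDR` / `MASTER_PORT` is an assumption about how the environment is configured. The sketch only handles hostnames and IPv4 addresses.

```python
import os


def build_init_method(ip: str, port: int) -> str:
    """Return a TCP rendezvous URL of the form tcp://ip:port.

    Hypothetical helper for illustration; handles hostnames and IPv4 only.
    """
    if not ip:
        raise ValueError("ip must be a non-empty string")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"tcp://{ip}:{port}"


def init_method_from_env(default_ip: str = "127.0.0.1",
                         default_port: int = 29500) -> str:
    """Build the URL from MASTER_ADDR / MASTER_PORT, falling back to defaults.

    Using these two environment variables is an assumption, not something
    the PR prescribes.
    """
    ip = os.environ.get("MASTER_ADDR", default_ip)
    port = int(os.environ.get("MASTER_PORT", default_port))
    return build_init_method(ip, port)


if __name__ == "__main__":
    print(init_method_from_env())
```

Deriving the value from the environment rather than hard-coding it would also keep machine-specific details out of the tutorial text.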
| </tr> | ||
| <tr> | ||
| <td>Real pretraining + no recomputation</td> | ||
| <td>Device MOC memory size: 62420M<br> |
I think it would be better to split the memory usage into 4 columns here, one column per metric.
| Used peak memory usage (without fragments): 55405M<br> | ||
| Actual peak memory usage (with fragments): 57400M | ||
| </td> | ||
| <td rowspan="2" style="vertical-align: middle;">1. DryRun can simulate the device memory usage of real pretraining<br> |
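Change conclusion 1 to: "The device memory usage simulated by DryRun differs only slightly from the actual device memory usage." The same goes for the later conclusions; wording like "DryRun can simulate" is a bit too assertive.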
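To back a phrase like "differs only slightly" with a number, the gap could be computed explicitly. The following is a rough sketch, not the author's method: the function names are hypothetical, and the only figures used are the three values from the "real pretraining + no recomputation" row above. The DryRun figures are not visible in this hunk, so they are left as parameters of `relative_gap`.

```python
def fragmentation_overhead_mb(used_peak_mb: int, actual_peak_mb: int) -> int:
    """Memory lost to fragments: peak with fragments minus peak without."""
    return actual_peak_mb - used_peak_mb


def relative_gap(simulated_mb: float, actual_mb: float) -> float:
    """Relative difference between a simulated and an actual peak value."""
    if actual_mb <= 0:
        raise ValueError("actual_mb must be positive")
    return abs(simulated_mb - actual_mb) / actual_mb


# Figures from the "real pretraining + no recomputation" row of the table.
DEVICE_MOC_MB = 62420    # Device MOC memory size
USED_PEAK_MB = 55405     # Used peak memory usage (without fragments)
ACTUAL_PEAK_MB = 57400   # Actual peak memory usage (with fragments)

overhead = fragmentation_overhead_mb(USED_PEAK_MB, ACTUAL_PEAK_MB)
print(f"fragment overhead: {overhead}M")  # fragment overhead: 1995M
print(f"share of device memory at peak: {ACTUAL_PEAK_MB / DEVICE_MOC_MB:.1%}")
```

Calling `relative_gap(dryrun_peak, ACTUAL_PEAK_MB)` with the DryRun figure from the other table row would then give a percentage that could be quoted directly in the conclusion.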
| ``` | ||
| The result is shown in the figure below | ||
|
|
||
|  |
This image is too small; please capture a larger one.