[Feature] Auto scaling factor tuning for FP8 collective communication #140

wkcn · 2023-12-07T03:34:51Z

Description
Adds support for auto scaling factor tuning (#41).
Related example: Azure/MS-AMP-Examples#21

Performance (model: GPT-345M, script: https://github.com/Azure/MS-AMP-Examples/blob/main/gpt3/pretrain_345m_megatron.sh):

msamp w/o auto scaling:
validation loss at iteration 5000 | lm loss value: 3.531525E+00 | lm loss PPL: 3.417605E+01 |
samples per second: 519.524 | TFLOPs: 155.99 |

msamp w/ auto scaling (add the argument --wgrad-auto-scaling):
validation loss at iteration 5000 | lm loss value: 3.529646E+00 | lm loss PPL: 3.411188E+01 |
samples per second: 516.702 | TFLOPs: 155.14 |
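
For a quick read of the two runs, the Python sketch below computes the deltas between them and checks that each logged PPL equals exp(lm loss). The values are copied from the log lines above; the variable and function names are illustrative only.

```python
import math

# Values copied from the two validation/throughput log lines above.
baseline = {"loss": 3.531525, "ppl": 34.17605, "samples_per_s": 519.524, "tflops": 155.99}
auto_scaling = {"loss": 3.529646, "ppl": 34.11188, "samples_per_s": 516.702, "tflops": 155.14}


def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


loss_delta = auto_scaling["loss"] - baseline["loss"]
throughput_change = relative_change(auto_scaling["samples_per_s"], baseline["samples_per_s"])
tflops_change = relative_change(auto_scaling["tflops"], baseline["tflops"])

# Perplexity is exp(loss), so each logged PPL should match within rounding.
ppl_consistent = all(
    math.isclose(math.exp(run["loss"]), run["ppl"], rel_tol=1e-5)
    for run in (baseline, auto_scaling)
)

print(f"loss delta: {loss_delta:+.6f}")
print(f"samples/s change: {throughput_change:+.2f}%")
print(f"TFLOPs change: {tflops_change:+.2f}%")
print(f"PPL consistent with exp(loss): {ppl_consistent}")
```

In this single pair of runs, auto scaling lowers the validation loss by about 0.0019 at a throughput cost of roughly 0.5%.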

Major Revisions

Add a new variable pre_scale to ScalingMeta
Support pre_scale in Arithmetic.add_to_fp8
Add auto scaling factor tuning to the Megatron FP8DistributedOptimizer
Add unit tests
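
To illustrate why a pre-scale helps when FP8 values are summed across ranks, here is a toy Python sketch. It is not the MS-AMP implementation: the function names (`saturating_sum`, `tune_pre_scale`), the clamping model of overflow, and the halve-on-overflow / double-after-clean-steps rule are all assumptions made for illustration, since the PR text does not describe the actual tuning rule. The only fixed fact used is that the largest finite FP8 E4M3 magnitude is 448. Quantization error is ignored; only the dynamic range is modelled.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def saturating_sum(values, pre_scale):
    """Sum per-rank values after multiplying each by `pre_scale`.

    Overflow is modelled (as an assumption) by clamping the sum to the
    E4M3 range. Returns the sum scaled back by 1 / pre_scale and a flag
    telling whether the scaled sum left the representable range.
    """
    total = sum(v * pre_scale for v in values)
    overflowed = abs(total) > E4M3_MAX
    clamped = max(-E4M3_MAX, min(E4M3_MAX, total))
    return clamped / pre_scale, overflowed


def tune_pre_scale(pre_scale, overflowed, clean_steps, window=2):
    """Hypothetical tuning rule, not taken from the PR.

    Halve `pre_scale` when an overflow is seen; double it again (up to 1.0)
    after `window` consecutive steps without overflow.
    Returns the new pre_scale and the updated clean-step counter.
    """
    if overflowed:
        return pre_scale / 2.0, 0
    clean_steps += 1
    if clean_steps >= window and pre_scale < 1.0:
        return pre_scale * 2.0, 0
    return pre_scale, clean_steps


# Eight ranks each contribute 100.0: the plain sum (800) exceeds 448.
grads = [100.0] * 8
unscaled_result, unscaled_overflow = saturating_sum(grads, pre_scale=1.0)
scaled_result, scaled_overflow = saturating_sum(grads, pre_scale=0.5)
print(unscaled_result, unscaled_overflow)
print(scaled_result, scaled_overflow)
```

With a pre-scale of 0.5 the scaled sum stays inside the E4M3 range and the true total is recovered after dividing the pre-scale back out, whereas the unscaled sum saturates.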

wkcn added 2 commits December 7, 2023 11:25

wgrad auto scaling in optimizer

6352d6f

pre_scale support

57d4a42

wkcn marked this pull request as draft December 7, 2023 03:34

wkcn added 4 commits December 7, 2023 11:42

fp8e4m3 for wgrad

69f73ce

refine code

55735b7

comment

bcd0f74

pre_scale support in add_to_fp8

d248505

wkcn mentioned this pull request Dec 8, 2023

Auto scaling factor tuning on FP8 weight gradients reduction for Megatron-LM Azure/MS-AMP-Examples#21

Open

wkcn added 8 commits December 8, 2023 16:51

pre_scale as Tensor

4a7fd1f

pre_scale *fp32

c850d63

pre_scale ut

617ee0b

test meta

199eb3a

auto scaling freq

1be092c

Merge branch 'wgrad_auto_scaling' of github.com:wkcn/ms-amp into wgrad_auto_scaling

c08bc0d

unittest

6e58e1e

bug fixed when stacking pre_scales

df68631

wkcn marked this pull request as ready for review December 11, 2023 03:10

wkcn requested review from guoshzhao and tocean December 12, 2023 06:47

wkcn enabled auto-merge (squash) December 12, 2023 06:51

wkcn changed the title ~~Auto scaling factor tuning for FP8 collective communication~~ [Feature] Auto scaling factor tuning for FP8 collective communication Dec 14, 2023

wkcn mentioned this pull request Dec 14, 2023

V0.4 Release Plan #123

Open

9 tasks

Merge branch 'main' into wgrad_auto_scaling

664025d
