Commit 95c9459

Update llm finetune scripts and doc (#4722)
1 parent 4847352 commit 95c9459

18 files changed: +576 -192 lines changed

README.md

Lines changed: 23 additions & 11 deletions
@@ -46,28 +46,40 @@ In the current technological landscape, Generative AI (GenAI) workloads and mode
 
 #### LLM fine-tuning
 
-**Note**:
-Intel® Data Center Max 1550 GPU: supports all the models in the model list above. Intel® Core™ Ultra Processors with Intel® Arc™ Graphics: supports Llama 2 7B, Llama 3 8B and Phi-3-Mini 3.8B.
-
-| MODEL FAMILY | Verified < MODEL ID > (Hugging Face hub) | Mixed Precision (BF16+FP32) | Full fine-tuning | LoRA | Intel® Data Center Max 1550 GPU | Intel® Core™ Ultra Processors with Intel® Arc™ Graphics |
-|---|:---:|:---:|:---:|:---:|:---:|:---:|
-| Llama 2 7B | "meta-llama/Llama-2-7b-hf" | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
-| Llama 2 70B | "meta-llama/Llama-2-70b-hf" | 🟩 | 🟥 | 🟩 | 🟩 | 🟥 |
-| Llama 3 8B | "meta-llama/Meta-Llama-3-8B" | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
-| Qwen 7B | "Qwen/Qwen-7B" | 🟩 | 🟩 | 🟩 | 🟩 | 🟥 |
-| Phi-3-mini 3.8B | "Phi-3-mini-4k-instruct" | 🟩 | 🟩 | 🟩 | 🟥 | 🟩 |
+##### LLM fine-tuning optimized with Intel® Data Center Max 1550 GPU on Linux
 
 
 | Benchmark mode | Full fine-tuning | LoRA |
 |---|:---:|:---:|
 | Single-GPU | 🟥 | 🟩 |
 | Multi-GPU (FSDP) | 🟩 | 🟩 |
 
+| MODEL FAMILY | Verified < MODEL ID > (Hugging Face hub) | Mixed Precision (BF16+FP32) | Full fine-tuning | LoRA |
+|---|:---:|:---:|:---:|:---:|
+| [Llama 2 7B](./Llama2/README.md) | "meta-llama/Llama-2-7b-hf" | 🟩 | 🟩 | 🟩 |
+| [Llama 2 70B](./Llama2/README.md) | "meta-llama/Llama-2-70b-hf" | 🟩 | 🟥 | 🟩 |
+| [Llama 3 8B](./Llama3/README.md) | "meta-llama/Meta-Llama-3-8B" | 🟩 | 🟩 | 🟩 |
+| [Llama 3 70B](./Llama3/README.md) | "meta-llama/Meta-Llama-3-70B" | 🟩 | 🟥 | 🟩 |
+| [Qwen 7B](./Qwen/README.md) | "Qwen/Qwen-7B" | 🟩 | 🟩 | 🟩 |
+| [Phi-3-mini 3.8B](./Phi3/README.md#fine-tuning-on-intel-data-center-max-1550-gpu-on-linux) | "Phi-3-mini-4k-instruct" | 🟩 | 🟩 | 🟩 |
+
+
+\* Intel® Data Center Max 1550 GPU: supports all the models in the list above.
+
+##### LLM fine-tuning optimized with Intel® Core™ Ultra Processors with Intel® Arc™ Graphics
+
+| MODEL FAMILY | Verified < MODEL ID > (Hugging Face hub) | Mixed Precision (BF16+FP32) | Full fine-tuning | LoRA |
+|---|:---:|:---:|:---:|:---:|
+| [Phi-3-mini 3.8B](./Phi3/README.md#fine-tuning-on-intel-core-ultra-processors-with-intel-arc-graphics) | "Phi-3-mini-4k-instruct" | 🟩 | 🟩 | 🟩 |
+
+
 - 🟩 signifies that it is supported.
 
 - 🟥 signifies that it is not supported yet.
 
+\* Intel® Core™ Ultra Processors with Intel® Arc™ Graphics: supports Phi-3-Mini 3.8B.
+
+
 
 ## Installation
 

docs/tutorials/features/torch_compile_gpu.md

Lines changed: 12 additions & 0 deletions
@@ -13,6 +13,9 @@ Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph c
 - `intel_extension_for_pytorch` : v2.3
 - `triton` : >= v3.0.0
 
+
+Install [Intel® oneAPI Base Toolkit 2024.2.1](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).
+
 Follow [Intel® Extension for PyTorch\* Installation](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/) to install `torch` and `intel_extension_for_pytorch` first.
 
 Triton can be installed directly using the following command:
@@ -21,8 +24,17 @@ Triton can be installed directly using the following command:
 pip install --pre pytorch-triton-xpu==3.0.0+1b2f15840e --index-url https://download.pytorch.org/whl/nightly/xpu
 ```
 
+Remember to activate the oneAPI base toolkit with the following commands.
+
+```bash
+# {dpcpproot} is the DPC++ root path, i.e. where you installed the oneAPI DPC++ compiler;
+# usually /opt/intel/oneapi/compiler/latest or ~/intel/oneapi/compiler/latest
+source {dpcpproot}/env/vars.sh
+```
+
+
 # Example Usage
 
+
 ## Inference with torch.compile
 
 ```python
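
The hunk ends at the opening of the doc's own example. For orientation, a minimal `torch.compile` inference sketch for an XPU device might look like the following; it assumes a working `torch` + `intel_extension_for_pytorch` install, and the toy model and tensor shapes are illustrative, not taken from the file.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the "xpu" device

# Illustrative model; any nn.Module is handled the same way.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).to("xpu")
model.eval()

# Compilation is lazy: Triton kernels are generated on the first call.
compiled_model = torch.compile(model)

x = torch.randn(8, 64, device="xpu")
with torch.no_grad():
    y = compiled_model(x)
print(y.shape)  # torch.Size([8, 64])
```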

docs/tutorials/llm.rst

Lines changed: 19 additions & 13 deletions
@@ -67,8 +67,8 @@ LLM Inference
 
 *Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp16). For other LLM families, we are still working to cover those optimizations, which will expand the model list above.
 
-LLM fine-tuning
-~~~~~~~~~~~~~~~
+LLM fine-tuning on Intel® Data Center Max 1550 GPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. list-table::
    :widths: auto
@@ -79,42 +79,48 @@ LLM fine-tuning
      - Mixed Precision (BF16+FP32)
      - Full fine-tuning
      - LoRA
-     - Intel® Data Center Max 1550 GPU
-     - Intel® Core™ Ultra Processors with Intel® Arc™ Graphics
    * - Llama2
      - "meta-llama/Llama-2-7b-hf"
      - ✅
      - ✅
      - ✅
-     - ✅
-     - ✅
    * - Llama2
      - "meta-llama/Llama-2-70b-hf"
      - ✅
      - ❎
      - ✅
-     - ✅
-     - ❎
    * - Llama3
      - "meta-llama/Meta-Llama-3-8B"
      - ✅
      - ✅
      - ✅
-     - ✅
-     - ✅
    * - Qwen
      - "Qwen/Qwen-7B"
      - ✅
      - ✅
      - ✅
-     - ✅
-     - ❎
    * - Phi-3-mini 3.8B
      - "Phi-3-mini-4k-instruct"
      - ✅
      - ✅
      - ✅
-     - ❎
+
+LLM fine-tuning on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. list-table::
+   :widths: auto
+   :header-rows: 1
+
+   * - Model Family
+     - Verified < MODEL ID > (Huggingface hub)
+     - Mixed Precision (BF16+FP32)
+     - Full fine-tuning
+     - LoRA
+   * - Phi-3-mini 3.8B
+     - "Phi-3-mini-4k-instruct"
+     - ✅
+     - ✅
      - ✅
 
 Check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.3.110/examples/gpu/llm>`_ for instructions to install/set up the environment and example scripts.
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+## Llama2 fine-tuning
+
+
+
+### Download a Model
+During execution, you may need to log in to your Hugging Face account to download model files. Refer to [HuggingFace Login](https://huggingface.co/docs/huggingface_hub/quick-start#login).
+
+```
+huggingface-cli login --token <your_token_here>
+```
+
+**Note**: If you have downloaded a Llama2 model from Meta's official GitHub, you can also convert it to Hugging Face format by following the [guide](https://huggingface.co/docs/transformers/main/en/model_doc/llama2#usage-tips).
+
+### Download a Dataset
+
+You can get the Alpaca dataset here: [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json).
+```
+git clone https://github.com/tatsu-lab/stanford_alpaca
+cd stanford_alpaca
+mv alpaca_data.json <Llama2_folder>
+```
+
+
+**Note**: During execution, you need to log in to your wandb account. Refer to [Wandb Login](https://docs.wandb.ai/ref/cli/wandb-login).
+```
+wandb login
+```
+
+### Fine-tuning on multi-GPU
+
+**Note**:
+The default `fsdp_config.yaml` is set for 1 machine with 4 cards (8 tiles). If you use a different setup, change `num_processes` accordingly.
+
+#### Full fine-tuning
+
+
+Example: Llama 2 7B full fine-tuning with the Alpaca dataset; you can change the model name/path to use another Llama2 model.
+
+
+**Note**:
+We provide examples for the Alpaca dataset (52k samples) and the guanaco-llama2-1k dataset from Hugging Face. We recommend the [Alpaca dataset](#download-a-dataset), which has been recognized by some popular projects.
+Removing the `--data_path` flag from the fine-tuning command makes `llama2_ft.py` load the guanaco-llama2-1k dataset from Hugging Face by default.
+
+
+```bash
+export CCL_PROCESS_LAUNCHER=none
+export TORCH_LLM_ALLREDUCE=1
+
+export model='meta-llama/Llama-2-7b-hf'
+
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
+    --model_name_or_path ${model} \
+    --data_path ./alpaca_data.json \
+    --bf16 True \
+    --use_flashattn True \
+    --output_dir ./result \
+    --num_train_epochs 3 \
+    --per_device_train_batch_size 4 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 4 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 2000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --optim "adamw_torch_fused"
+```
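
As a sanity check on the flags above: the effective global batch size is the per-device batch size times the gradient accumulation steps times the number of processes. A small worked example, assuming the default 8-process launch (1 machine, 4 cards, 2 tiles each) from `fsdp_config.yaml`:

```python
# Assumed launch topology; adjust num_processes to your accelerate config.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_processes = 8

global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
print(global_batch_size)  # 128
```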
+
+
+#### LoRA fine-tuning
+
+Example: Llama 2 7B LoRA fine-tuning with the Alpaca dataset; you can change the model name/path to use another Llama2 model.
+
+**Note**:
+We provide examples for the Alpaca dataset (52k samples) and the guanaco-llama2-1k dataset from Hugging Face. We recommend the [Alpaca dataset](#download-a-dataset), which has been recognized by some popular projects.
+Removing the `--data_path` flag from the fine-tuning command makes `llama2_ft.py` load the guanaco-llama2-1k dataset from Hugging Face by default.
+
+
+```bash
+export CCL_PROCESS_LAUNCHER=none
+export TORCH_LLM_ALLREDUCE=1
+
+export model='meta-llama/Llama-2-7b-hf'
+
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
+    --model_name_or_path ${model} \
+    --data_path ./alpaca_data.json \
+    --bf16 True \
+    --use_flashattn True \
+    --use_peft True \
+    --output_dir ./result \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 2000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --optim "adamw_torch_fused"
+```
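
`llama2_ft.py` itself is not shown in this diff, but a `--use_peft True` code path typically wraps the base model with Hugging Face `peft` before handing it to the trainer. A rough sketch, where the rank, alpha, and target modules are assumptions rather than the script's actual values:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model in bf16, matching the --bf16 True flag above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # a common choice for Llama attention
    task_type="CAUSAL_LM",
)

# Freeze the base weights and inject trainable low-rank adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights receive gradients during fine-tuning, which is why LoRA remains feasible for Llama 2 70B while full fine-tuning is marked unsupported in the tables above.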
File renamed without changes.

examples/gpu/llm/fine-tuning/Llama2/run_llama2_70b_fsdp.sh

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ export TORCH_LLM_ALLREDUCE=1
 # llama2-70b alpaca dataset peft lora
 Run_llama2-70b_fsdp_alpaca_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -31,7 +31,7 @@ Run_llama2-70b_fsdp_alpaca_dataset_peft() {
 # llama2-70b huggingface dataset peft lora
 Run_llama2-70b_fsdp_huggingface_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --bf16 True \
 --use_flashattn True \
@@ -56,7 +56,7 @@ Run_llama2-70b_fsdp_huggingface_dataset_peft() {
 
 main() {
 
-model=meta-llama/Llama-2-70b-hf
+model="meta-llama/Llama-2-70b-hf"
 
 Run_llama2-70b_fsdp_alpaca_dataset_peft
 # Run_llama2-70b_fsdp_huggingface_dataset_peft

examples/gpu/llm/fine-tuning/Llama2/run_llama2_7b_fsdp.sh

Lines changed: 4 additions & 4 deletions
@@ -7,7 +7,7 @@ export TORCH_LLM_ALLREDUCE=1
 ## alpaca dataset full-ft
 Run_llama2-7b_fsdp_alpaca_dataset() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -33,7 +33,7 @@ Run_llama2-7b_fsdp_alpaca_dataset() {
 ## alpaca dataset peft lora
 Run_llama2-7b_fsdp_alpaca_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -59,7 +59,7 @@ Run_llama2-7b_fsdp_alpaca_dataset_peft() {
 ## huggingface dataset full-ft
 Run_llama2-7b_fsdp_huggingface_dataset() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --bf16 True \
 --use_flashattn True \
@@ -85,7 +85,7 @@ Run_llama2-7b_fsdp_huggingface_dataset() {
 ## huggingface dataset peft lora
 Run_llama2-7b_fsdp_huggingface_dataset_peft() {
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --bf16 True \
 --use_flashattn True \

examples/gpu/llm/fine-tuning/Llama2/run_llama2_converge.sh

Lines changed: 3 additions & 16 deletions
@@ -1,26 +1,13 @@
-
 export CCL_PROCESS_LAUNCHER=none
 
-# profiling set
-# export PROFILE=1
-# export KINETO=1
-
 # settings for torch-ccl
 export TORCH_LLM_ALLREDUCE=1
 
-# torch-ccl verbose
-# export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1
-
-# oneccl runtime
-# source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
-# source /home2/zhuhong/LLM/ccl-inference-dev-3/build/_install/env/setvars.sh
-
 ## alpaca dataset full-ft
 Run_llama2-7b_fsdp_alpaca_converge() {
 
 model='meta-llama/Llama-2-7b-hf'
-
-torchrun --nproc_per_node=8 --master_port='29900' train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -50,7 +37,7 @@ Run_llama2-7b_fsdp_alpaca_peft_converge() {
 
 model='meta-llama/Llama-2-7b-hf'
 
-torchrun --nproc_per_node=8 --master_port='29900' train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
@@ -81,7 +68,7 @@ Run_llama2-70b_fsdp_alpaca_peft_converge() {
 
 model='meta-llama/Llama-2-70b-hf'
 
-accelerate launch --config_file "fsdp_config.yaml" train.py \
+accelerate launch --config_file "fsdp_config.yaml" llama2_ft.py \
 --model_name_or_path ${model} \
 --data_path ./alpaca_data.json \
 --bf16 True \
