pip install bert-pytorch
```

## Quickstart

**NOTICE: Your corpus should be prepared with two sentences in one line, separated by a tab (\t).**
```
Welcome to the \t the jungle \n
I can stay \t here all night \n
```

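As a concrete illustration of this layout (not part of the package), the short Python sketch below writes a toy corpus in the expected form and reads the sentence pairs back; the file path and example pairs are made up.

```python
import os

# Hypothetical sentence pairs, purely for illustration.
pairs = [
    ("Welcome to the", "the jungle"),
    ("I can stay", "here all night"),
]

os.makedirs("data", exist_ok=True)

# Write one example per line: sentence A, a tab, sentence B.
with open("data/corpus.small", "w", encoding="utf-8") as f:
    for sent_a, sent_b in pairs:
        f.write(f"{sent_a}\t{sent_b}\n")

# Reading it back: every line splits into exactly two sentences on the tab.
with open("data/corpus.small", encoding="utf-8") as f:
    for line in f:
        sent_a, sent_b = line.rstrip("\n").split("\t")
        print(sent_a, "|", sent_b)
```
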
### 1. Building vocab based on your corpus
``` shell
bert-vocab -c data/corpus.small -o data/corpus.small.vocab
```
``` shell
usage: bert-vocab [-h] -c CORPUS_PATH -o OUTPUT_PATH [-s VOCAB_SIZE]
                  [-e ENCODING] [-m MIN_FREQ]
```

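To make the `-s VOCAB_SIZE` and `-m MIN_FREQ` options above more concrete, here is a hedged sketch of what a frequency-based vocabulary build does: count tokens, drop rare ones, keep the most frequent up to a size limit, and reserve special tokens. This is not the package's implementation; the whitespace tokenizer and the special-token names are assumptions.

```python
from collections import Counter

def build_vocab(corpus_path, vocab_size=None, min_freq=1):
    """Illustrative frequency-based vocab: most common tokens above min_freq."""
    counter = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            # Both sentences on a line contribute tokens.
            counter.update(line.replace("\t", " ").split())

    specials = ["<pad>", "<unk>", "<eos>", "<sos>", "<mask>"]  # assumed names
    kept = [tok for tok, freq in counter.most_common(vocab_size) if freq >= min_freq]
    itos = specials + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

stoi, itos = build_vocab("data/corpus.small", vocab_size=20000, min_freq=1)
print(f"{len(itos)} tokens in the toy vocab")
```
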
### 2. Building BERT train dataset with your corpus
``` shell
bert-dataset -d data/corpus.small -v data/corpus.small.vocab -o data/dataset.small
```

``` shell
usage: bert-dataset [-h] -v VOCAB_PATH -c CORPUS_PATH [-e ENCODING] -o
                    OUTPUT_PATH [-w WORKERS]
```

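Conceptually, this step turns each tab-separated pair into index sequences using the vocabulary from step 1. The sketch below shows only that lookup, with an assumed `<unk>` fallback; the actual on-disk format produced by `bert-dataset` (and any masking or caching it performs) is defined by the package, not by this snippet.

```python
# A toy vocabulary; in practice this comes from the bert-vocab step.
stoi = {"<pad>": 0, "<unk>": 1, "welcome": 2, "to": 3, "the": 4, "jungle": 5}

def encode_pair(line, stoi, unk="<unk>"):
    """Map one corpus line ("sent_a<TAB>sent_b") to two lists of token ids."""
    def encode(sentence):
        return [stoi.get(tok, stoi[unk]) for tok in sentence.lower().split()]
    sent_a, sent_b = line.rstrip("\n").split("\t")
    return encode(sent_a), encode(sent_b)

print(encode_pair("Welcome to the\tthe jungle\n", stoi))
# -> ([2, 3, 4], [4, 5])
```
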
### 3. Train your own BERT model
``` shell
bert -d data/dataset.small -v data/corpus.small.vocab -o output/
```
``` shell
usage: bert [-h] -d TRAIN_DATASET [-t TEST_DATASET] -v VOCAB_PATH -o
            OUTPUT_DIR [-hs HIDDEN] [-n LAYERS] [-a ATTN_HEADS] [-s SEQ_LEN]
            [-b BATCH_SIZE] [-e EPOCHS] [-w NUM_WORKERS]
            [--corpus_lines CORPUS_LINES] [--lr LR]
            [--adam_weight_decay ADAM_WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
            [--adam_beta2 ADAM_BETA2] [--log_freq LOG_FREQ] [-c CUDA]
```

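In the options above, `-hs`, `-n`, `-a`, and `-s` are the usual transformer sizes (hidden dimension, layer count, attention heads, maximum sequence length). As a hedged sketch of what those numbers control, the snippet below assembles a generic PyTorch encoder of that shape; it is a stand-in for illustration, not the package's actual BERT module, and the sizes are arbitrary examples.

```python
import torch
import torch.nn as nn

hidden, layers, attn_heads, seq_len, batch_size = 256, 8, 8, 64, 4  # example sizes

# A generic Transformer encoder stack with the same shape parameters.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden, nhead=attn_heads, dim_feedforward=4 * hidden, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

# Dummy token embeddings, just to show the tensor shapes involved.
dummy = torch.randn(batch_size, seq_len, hidden)
print(encoder(dummy).shape)  # torch.Size([4, 64, 256])
```
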
## Language Model Pre-training

The next sentence prediction task teaches the model the relationship between two text sentences, which is
not directly captured by language modeling. Training pairs are built with a simple rule:

1. 50% of the time, the next sentence is the sentence that actually follows in the corpus.
2. 50% of the time, the next sentence is a randomly chosen, unrelated sentence.

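A minimal sketch of that 50/50 sampling, assuming the corpus is just an in-memory list of consecutive sentences; the function name and structure here are illustrative rather than the package's actual data pipeline.

```python
import random

def make_nsp_example(sentences, index):
    """Return (sentence_a, sentence_b, is_next) for next sentence prediction."""
    sent_a = sentences[index]
    if random.random() < 0.5 and index + 1 < len(sentences):
        # 50% of the time: the true continuation, labeled "is next".
        return sent_a, sentences[index + 1], 1
    # Otherwise: a randomly drawn (likely unrelated) sentence, labeled "not next".
    return sent_a, random.choice(sentences), 0

corpus = ["Welcome to the", "the jungle", "I can stay", "here all night"]
print(make_nsp_example(corpus, 0))
```
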
## Author
Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatter.co.kr)
