Fairseq Training Breakdown - Run training for fun and education!

Ensure the system has an NVIDIA GPU and that the nouveau driver is not loaded. This guide is for Ubuntu 18.04 - it may not run on your card… there are some bugs.
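If you want to check this before going further, a quick sanity check looks like this (standard Ubuntu tooling assumed - lspci should list your GPU, and the lsmod check should print nothing if nouveau is not loaded) :

lspci | grep -i nvidia

lsmod | grep nouveau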

You may see :

RuntimeError: view size is not compatible with input tensor’s size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(…) instead.

If you do, I do not have a resolution.


Install CUDA Toolkit :

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu

Download the newest 18.04 version and install it. The run file is recommended.

Install it by running the downloaded file as root : sudo bash ./filename.run (substituting the actual filename you downloaded).
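Once the installer finishes you can confirm the driver and toolkit are visible - nvidia-smi ships with the driver and nvcc with the toolkit. If nvcc is not found, you likely need to add /usr/local/cuda/bin (the default install location) to your PATH.

nvidia-smi

nvcc --version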

Install Docker :
This is a breakdown of the “Install Docker Engine on Ubuntu” page in the Docker documentation.

sudo apt install apt-transport-https ca-certificates curl gnupg lsb-release

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update

sudo apt install docker-ce docker-ce-cli containerd.io -y
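To make sure the Docker engine is actually working, a quick test with the standard hello-world image does the trick :

sudo docker run --rm hello-world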

Install Nvidia-Docker :

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt update

sudo apt install nvidia-docker2 -y
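The Docker daemon needs a restart to pick up the new nvidia runtime. A quick smoke test confirms the GPU is reachable from inside a container - the CUDA image tag below is only an example, substitute whichever base image matches your driver :

sudo systemctl restart docker

sudo nvidia-docker run --rm nvidia/cuda:11.0-base nvidia-smi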

This workload will train a model that will “translate” English to German and vice versa. I also assume you have the CUDA toolkit, docker-ce and nvidia-docker2 installed. This is a breakdown of the README.md at https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/Transformer

Setup :

Create a directory for your data to be stored in.

sudo mkdir /data

Clone the DeepLearningExamples repo from GitHub into whatever directory you want.

git clone https://github.com/NVIDIA/DeepLearningExamples.git

Navigate to the model directory.

cd DeepLearningExamples/PyTorch/Translation/Transformer

Pull the Pytorch container :

docker pull nvcr.io/nvidia/pytorch:21.05-py3

Now launch the container with your data directory mounted. The sample command runs the container detached, so it stays in the background after launch.

nvidia-docker run -itd --rm --ipc=host -v /data/:/data/wmt14_en_de_joined_dict (YOUR PYTORCH CONTAINER, e.g. nvcr.io/nvidia/pytorch:21.05-py3) bash

Attach to your container.

First, list the containers that are currently running.

sudo docker ps -a

The output will look something like this -

sudo docker ps -a

CONTAINER ID    IMAGE                    COMMAND                  CREATED         STATUS         PORTS                NAMES
36f4c752321d9   transformer_pyt:latest   "/usr/local/bin/nvid…"   4 seconds ago   Up 3 seconds   6006/tcp, 8888/tcp   great_montalcini

Notice the container’s name. This is randomly generated. Now attach to the container.

sudo docker attach great_montalcini

You will be connected to the container's stdin/stdout. BEWARE that if you type exit, press ctrl+c, or otherwise quit, you will hose whatever process is running or kill the container. To disconnect without killing the container, press ctrl+p and then ctrl+q - it will confirm that you've disconnected.

Now that you’re in the container you’ll need to get the training data and process it. Nvidia has provided a push-button-get-bacon script to do that.

scripts/run_preprocessing.sh

This will take a while, let it finish. In order to do something else in the meantime, detach with ctrl+p and then ctrl+q. You can reconnect to the container to check its progress using the attach command.
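Because the container was launched with the host's /data mounted at /data/wmt14_en_de_joined_dict, you can also keep an eye on preprocessing from the host without attaching (swap in your own container name from docker ps) :

sudo docker logs --tail 20 great_montalcini

du -sh /data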

After the data is downloaded and processed, run training in the container.

python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py /data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006 \
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--fuse-layer-norm \
--amp \
--amp-level O2 \
--save-dir /workspace/checkpoints \
--distributed-init-method env://

WARNING : Notice the following parameter in the massive command-line example from NVIDIA : --distributed-init-method env:// - if you are not running multi-GPU training you must remove it from the command.

SECOND-BREAKFAST-WARNING : Notice the following parameter in the first line of the example : --nproc_per_node 8 - change this to match your GPU count (1 GPU means --nproc_per_node 1). An adjusted single-GPU version of the command is shown below.
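For reference, here is the same command adjusted per the two warnings above for a single GPU - --nproc_per_node dropped to 1 and --distributed-init-method env:// removed, everything else untouched :

python -m torch.distributed.launch --nproc_per_node 1 /workspace/translation/train.py /data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006 \
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--fuse-layer-norm \
--amp \
--amp-level O2 \
--save-dir /workspace/checkpoints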

Let the training run. The loss will change as it goes, and the run will tune itself as best it can to your GPU(s)' VRAM. You can tune the run further by changing parameters. The finished model will be pretty huge, 1.3+ TB, so… yay?
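You can watch the run from the host while it goes : nvidia-smi shows GPU utilization and memory, and since the checkpoints land in /workspace/checkpoints inside the container, docker exec lets you check how the disk usage is adding up (again, substitute your own container name) :

watch -n 5 nvidia-smi

sudo docker exec great_montalcini du -sh /workspace/checkpoints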

Here is a list of parameters you can set and what they do.

--arch - select the specific configuration for the model. You can select between various predefined hyperparameter values like the number of encoder/decoder blocks, the dropout value, or the size of the hidden state representation.

--share-all-embeddings - use the same set of weights for the encoder and decoder word embeddings.

--optimizer - choose optimization algorithm.

--clip-norm - set a value that gradients will be clipped to.

--lr-scheduler - choose learning rate change strategy.

--warmup-init-lr - start linear warmup with a learning rate at this value.

--warmup-updates - set number of optimization steps after which linear warmup will end.

--lr - set learning rate.

--min-lr - prevent the learning rate from falling below this value, whatever the learning rate schedule does.

--dropout - set dropout value.

--weight-decay - set weight decay value.

--criterion - select loss function.

--label-smoothing - distribute the value of one-hot labels between all entries of the dictionary. The value set by this option is subtracted from the one-hot label.

--max-tokens - set batch size in terms of tokens.

--max-sentences - set batch size in terms of sentences. Note that the actual batch size will then vary a lot more than when using the --max-tokens option.

--seed - set random seed for NumPy and PyTorch RNGs.

--max-epochs - set the maximum number of epochs.

--online-eval - perform inference on the test set and compute the BLEU score after every epoch.

--target-bleu - works like --online-eval and sets a BLEU score threshold; once that score is reached, training stops.

--amp - use mixed precision.

--save-dir - set directory for saving checkpoints.

--distributed-init-method - method for initializing the torch.distributed package. You can either provide addresses with the tcp method or use environment-variable initialization with the env method.

--update-freq - use gradient accumulation. Set the number of training steps across which gradients will be accumulated.
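As an example of how the optional flags slot in, you simply append them to the training command. Tacking this onto the end, for instance, turns on per-epoch BLEU evaluation and accumulates gradients over two steps (example values only, not recommendations) :

--online-eval --update-freq 2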
