Training Argos Translate Models using vast.ai

Training requires a CUDA-capable GPU. For this tutorial, we will rent a GPU/CUDA-enabled server instance on vast.ai using a Docker image prepared specifically for training Argos Translate models. A future tutorial will cover training on your own Debian-based PC at home (the init script assumes the apt package manager, but you can perform the same steps manually on any comparable Linux distribution).

First off, check out the information provided by argosopentech directly here: https://github.com/argosopentech/argos-train and watch the video tutorial at https://odysee.com/@argosopentech:7/training-an-Argos-Translate-model-tutorial-2022:2?r=DMnK7NqdPNHRCfwhmKY9LPow3PqVUUgw

This tutorial is mainly a wordier version of those instructions and the video tutorial, with more step-by-step details added from my own experience learning to train.

I. Preparing your Dataset(s)

A primary source for translation data is the OPUS project. You can train a model from a single dataset or from a combination of multiple datasets.

Argos Translate currently uses a pivot method of translation: to translate French to German, it will first translate French to English and then English to German. So we only ever train to and from English, and we can use the same dataset(s) to create models for both en -> xy and xy -> en.
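For example, once French-to-English and English-to-German models are installed, the argostranslate Python API pivots automatically. A minimal sketch (the input string is just an illustration):

import argostranslate.translate

# With fr->en and en->de models installed, Argos Translate chains them
# automatically to produce a French -> German translation.
translated = argostranslate.translate.translate("Bonjour le monde", "fr", "de")
print(translated)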

You can, of course, choose your own target non-English language. In this example I will be doing Greek / English.

A .argosdata file is just a ZIP file with a renamed extension. The structure is as follows:

    [dir] data-{datasource}-{source language}_{target language}
        - source
        - target
        - metadata.json
        - LICENSE
        - README
    

We will get most of this from the dataset we download from OPUS, but we will create the metadata.json file with the relevant information needed for the training.

Training will use the plain-text MOSES/GIZA++ download file for our language combination. Choose your source data (not all sources have all language combinations); here we will use the Europarl data.

Scroll down to the bottom grid, where we want to select our language pair from the bottom-left triangle.

Find the row and column for your language pair. Here, that is row en, column el ([1.3M]). Make note of the sentence-pair count in the tooltip - in this case 1292180 - we'll need that for our metadata.json file.

Save the file. I suggest renaming it to something more descriptive, for example data-europarl-en_el-orig.zip, so that you know which source you used (remember, we can use multiple sources for a single training), and it's semi-ready for creating the .argosdata zip from - once we make a few adjustments...

Unzip the file into its own, appropriately named directory - data-europarl-en_el - and enter the new directory. You should have a file listing like:

-[dir]data-europarl-en_el
    - README
    - LICENSE
    - Europarl.el-en.xml    <---- We won't need this
    - Europarl.el-en.en     <---- This will be our source
    - Europarl.el-en.el     <---- This will be our target

While we are going to create both Greek-to-English and English-to-Greek models, it is best to keep the standard of English as the original source and the new language as the original target. You will see in the training section later that the argos-train program can create both directions from the same dataset.

So...

  1. Delete the .xml file; it's not needed.
  2. Rename the file ending in .en to source (no extension).
  3. Rename the file ending in .el (or your chosen language) to target (again, no extension).
  4. Create a new file in this directory named metadata.json.
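If you prefer to script steps 1-3, here is a minimal Python sketch run from inside the dataset directory (the filenames assume the Europarl en-el example; adjust for your own dataset). It also counts the sentence pairs, which should match the tooltip number from OPUS and goes into the size field of metadata.json below:

import os

os.remove("Europarl.el-en.xml")           # step 1: not needed
os.rename("Europarl.el-en.en", "source")  # step 2: English side becomes source
os.rename("Europarl.el-en.el", "target")  # step 3: target language side

# Count sentence pairs for the "size" field in metadata.json
with open("source", encoding="utf-8") as f:
    print(sum(1 for _ in f))  # should print 1292180 for this example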

Now open metadata.json in your favorite text editor - Notepad, Notepad++, gedit, Atom, nano...

And paste this in as a template:

{
  "name": "Europarl",
  "type": "data",
  "from_code": "en",
  "to_code": "da",
  "size": 1991647,
  "reference": " J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)"
}
  • Adjust name and reference in relation to the data source used (taken from the top of the source page... Europarl, CCAligned, WikiMatrix, etc.).
  • Leave from_code as "en" and change to_code to your target language - in this case "to_code": "el"
  • Change size to the number of sentence pairs in this data source. See above: the tooltip shown when hovering over the file you are downloading from OPUS. In this case the en-el source has 1292180 sentence pairs.

New metadata.json file for this example:

{
  "name": "Europarl",
  "type": "data",
  "from_code": "en",
  "to_code": "el",
  "size": 1292180,
  "reference": " J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)"
}

* Hang on to this metadata; we will need this information again during the training setup (for the data-index.json file).

Now move back up a directory and zip the whole data-europarl-en_el directory into data-europarl-en_el.zip, then change the extension to .argosdata so you now have the file data-europarl-en_el.argosdata.
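This step can also be scripted; a minimal sketch using Python's standard library, run from that parent directory (same example names assumed):

import os
import shutil

# Zip the prepared directory so it sits at the top level of the archive...
shutil.make_archive("data-europarl-en_el", "zip", root_dir=".", base_dir="data-europarl-en_el")

# ...then swap the extension to .argosdata
os.rename("data-europarl-en_el.zip", "data-europarl-en_el.argosdata")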

Congratulations! You now have a trainable dataset.

Now upload it to a web-accessible location so the training scripts can download it - a location on your own hosting, or any other publicly accessible location.
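If your own machine is reachable from the internet, Python's built-in web server can serve the file temporarily for a quick test (a properly hosted, always-on location is usually the better option):

node1:~$ python3 -m http.server 8000

Run it from the directory containing the .argosdata file, which is then reachable at http://your-host:8000/data-europarl-en_el.argosdata (your-host being a placeholder for your machine's public address).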

II. Setup Vast.ai for Training

While this can also be done on a home PC with a Linux OS and a CUDA-enabled GPU, it is resource intensive. A tutorial for training "at home" will be made eventually, but a cheap, temporary GPU-enabled cloud server is easy and efficient.

Visit vast.ai, but let's not set up a server quite yet. Click Sign In to create a new account and follow the usual steps of verifying your email address and such.

Once you are all set, log in and go to Billing. Add a card; once it's been added, add some credit. 10 USD is more than plenty for a few trainings.

Now that you have the account set up and some credit for renting, click on the Create link in the Client section of the sidebar menu.

First off, we want to use the nicely prepared Docker image from Argos Open Tech. In the Instance Configuration box, click the [EDIT IMAGE & CONFIG] button. In the new window, scroll down to where you can enter a custom Docker image:

Enter argosopentech/argostrain as the custom image, and click Select.

Disk space is up to you, relative to which and how many datasets you might want to use in training. Some can be over 3 GB each, so to be safe, set this to at least 25-50 GB.

Now that the configuration is ready, we just need to pick a server to rent. Again, you don't need overly fancy hardware - you'll be waiting and doing other things while it trains anyway. A 1X RTX 3090 has worked fine for me at about 0.30-0.60 USD/hour.

Depending on the number and size of the data files you will be using for training, you may want to opt for a server with higher upload/download speeds.

You'll need to create an SSH key pair and paste the public key into Vast.ai. That's beyond the scope of this tutorial, but in general for Linux, OSX, and Windows 10+, run this from a shell/command prompt:

ssh-keygen
It will suggest a default location and the filename id_rsa. Keep the suggested path, but change the filename to something like vastai. Go to the suggested directory (C:\Users\myusr\.ssh, ~/.ssh, etc.) and you should have a private key vastai and a public key vastai.pub. Open the vastai.pub file, then copy and paste this public key into the Vast.ai prompt.
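Alternatively, you can pass the filename and a modern key type directly; a one-shot sketch:

node1:~$ ssh-keygen -t ed25519 -f ~/.ssh/vastai
node1:~$ cat ~/.ssh/vastai.pub

The -f flag writes the key pair to the given path and -t ed25519 picks the key type; cat then prints the public key so you can copy it.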

Click Rent to create your new server instance. You'll see a popup notification that the instance was created. Go to CLIENT -> Instances in the left-side menu and you will see your server instance being created. It will take a minute or two as it sets up the server and installs the base system as well as the argosopentech/argostrain Docker image.

Once the instance is done setting up, your server is ready and you will have a blue Connect button. Click Connect and Vast.ai will give you the SSH connection info. You may, of course, connect over SSH however you like - PuTTY, MobaXterm, etc. - just make sure to use the vastai private key. Vast.ai will give you the direct ssh command.

Copy and paste this into a shell/terminal (Linux/OSX) or command prompt (Windows) to connect to your server instance.

node1:~$ ssh -p 12345 root@ssh4.vast.ai -L 8080:localhost:8080

You are now the root user in your new server instance. I've found it helpful to make sure the Python virtual environment package is installed, so go ahead and do that while you are root:

node1:~$ apt-get install virtualenv nano

This will also install some related Python packages that the argostrain init script would install anyway, plus the nano text editor (my preference; feel free to use your own).

Now we want to switch to the user that the Docker image set up for our training and make sure we are in that user's home directory:

node1:~$ su argosopentech
node1:~$ cd

III. Training your Model

Run the argos-train-init script. This will install additional required system packages and Python libraries, and compile some necessary C libraries:

argosopentech@8856a5cd4738:~$ ./argos-train-init

Once the init is finished, we want to activate the Python virtual environment for the training - you should then see (env) in front of your shell prompt - and change into our main working directory, ~/argos-train/:

argosopentech@8856a5cd4738:~$ source ~/env/bin/activate
(env) argosopentech@8856a5cd4738:~$ cd ~/argos-train
(env) argosopentech@8856a5cd4738:~/argos-train$

Remove the default data-index.json file in the current directory and create a new one:

(env) argosopentech@8856a5cd4738:~/argos-train$ rm data-index.json
(env) argosopentech@8856a5cd4738:~/argos-train$ nano data-index.json

This will be a JSON array listing the source .argosdata files you want to use in this training.

Each entry will be the same as your metadata.json file, but with link(s) to where you uploaded the .argosdata files (see the metadata.json section above for reference).

New data-index.json file for this example:

[
  {
    "name": "Europarl",
    "type": "data",
    "from_code": "en",
    "to_code": "el",
    "size": 1292180,
    "reference": " J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)",
    "links": [
      "http://link to your/data-europarl-en_el.argosdata"
    ]
  }
]
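If you train from multiple sources - say Europarl plus CCMatrix - the array simply gets one entry per .argosdata file. A sketch (the size of 0, the reference text, and the links are placeholders to replace with the real values from OPUS and your own hosting):

[
  {
    "name": "Europarl",
    "type": "data",
    "from_code": "en",
    "to_code": "el",
    "size": 1292180,
    "reference": "(citation from the OPUS source page)",
    "links": [
      "http://link to your/data-europarl-en_el.argosdata"
    ]
  },
  {
    "name": "CCMatrix",
    "type": "data",
    "from_code": "en",
    "to_code": "el",
    "size": 0,
    "reference": "(citation from the OPUS source page)",
    "links": [
      "http://link to your/data-ccmatrix-en_el.argosdata"
    ]
  }
]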

Now run argos-train. We will do English to target (Greek here) first, so follow the prompts:

(env) argosopentech@b8f995ca3e6e:~/argos-train$ argos-train
From code (ISO 639): en
To code (ISO 639): el
From name: English
To name: Greek
Version: 1.0

Barring any errors, the training will begin, first downloading a local copy of your .argosdata file and doing some pre-processing of that data. If you decide to use multiple data sources - Europarl plus CCMatrix plus ParaCrawl - it will repeat this process for each source .argosdata included in the data-index.json file.

europarl-en_el
Downloading run/cache/europarl-en_el.argosdata
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 81 3373M   81 2757M    0     0  3137k      0  0:18:20  0:15:00  0:03:20 1170k

Once all of the data files have been downloaded and pre-processed, the real training into a .argosmodel begins. It will output some benchmarks every few cycles, but overall the final training can take from an hour up to a day or more, depending on the number and size of the sources you are using.

Once the training is finished, it will let you know that the .argosmodel file has been saved.

Package saved to /home/argosopentech/argos-train/run/translate-en_el-1_0.argosmodel
(env) argosopentech@4012fbd31806:~/argos-train$

Go ahead and save this file somewhere safe. For example, open a new shell/terminal/prompt on your local machine and use scp to download it:

node1:~$ mkdir argosmodels
node1:~$ scp -P 12345 root@ssh4.vast.ai:/home/argosopentech/argos-train/run/translate-en_el-1_0.argosmodel ~/argosmodels

Use the same port and host given by Vast.ai that you used for the SSH connection.

Now that we have a model for translating from English to Greek, we could go ahead and install it for argostranslate and LibreTranslate, but we would only be able to translate from other installed languages to our new language (pivoting through English to get from one to the other), not back from it. So while we still have our data sources in place, go ahead and train the other direction - el to en - now.

Remove the run/source and run/target files created during the first training, then rerun argos-train and just switch the language codes/names at the prompts:

(env) argosopentech@4012fbd31806:~/argos-train$ rm run/source
(env) argosopentech@4012fbd31806:~/argos-train$ rm run/target
(env) argosopentech@4012fbd31806:~/argos-train$ argos-train
From code (ISO 639): el
To code (ISO 639): en
From name: Greek
To name: English
Version: 1.0

This will now train a new model for Greek to English. Once finished, repeat the steps above to download the newly trained model. You should now have both translate-en_el-1_0.argosmodel and translate-el_en-1_0.argosmodel files.

 

IV. Using Your Models

1. Use it directly

You don't have to be running your own LibreTranslate server to train a language - but odds are you might be!

Upload your new .argosmodel files to your server. You'll need to install the models as the same user that the libretranslate server runs as. If you're not sure, run ps aux | grep libretranslate from the shell, which will show the username the libretranslate process is running under.

I find it easiest to put the .argosmodel files in this user's home directory.

Create a new file in the same directory as your model files named addModels.py with the following code:

#!/usr/bin/python3

# Run this script as the user the libretranslate service runs as.
# If unsure, check with: ps aux | grep -i libretranslate
import glob

from argostranslate import package, translate

models_path = './'
models_list = glob.glob(models_path + '*.argosmodel')

# Install every .argosmodel file found in this directory
for model in models_list:
    print('Adding model: ' + model)
    package.install_from_path(model)

# List the installed languages to confirm the new ones appear
print("Installed languages:")
for language in translate.get_installed_languages():
    print(language)

Now run the file to import your models:

$ chmod 755 addModels.py
$ ./addModels.py

This will add your new language models to argostranslate, which the libretranslate server can then use. Now restart your libretranslate server; how to do this depends on how you set up your server. I set mine up as a systemd service, so:

$ systemctl restart libretranslate
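For reference, a minimal unit file sketch, e.g. /etc/systemd/system/libretranslate.service - the user name, ExecStart path, and flags here are all assumptions to adapt to your own install:

[Unit]
Description=LibreTranslate server
After=network.target

[Service]
# Run as the same user that installed the .argosmodel files
User=libretranslate
ExecStart=/home/libretranslate/.local/bin/libretranslate --host 127.0.0.1 --port 5000
Restart=on-failure

[Install]
WantedBy=multi-user.target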

Visit your LibreTranslate front-end web page, or try it out with your backend libretranslate API. Your new language should now be available to translate from or to.
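For example, a quick check against the /translate API endpoint using only the Python standard library (the localhost host and port are assumptions for a default local install):

import json
import urllib.request

url = "http://localhost:5000/translate"  # adjust host/port to your install
payload = json.dumps({
    "q": "Hello world",
    "source": "en",
    "target": "el",
    "format": "text",
}).encode("utf-8")

# POST the request and print the translated text from the JSON response
req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["translatedText"])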

2. Share it with the community

If you haven't been there yet, visit the LibreTranslate Community website. Sign up - welcome!

You can create a new thread about your new language models, but there is also an existing Language Support topic for both requesting new translations and posting about newly made translations.

In either case, make sure to include links to both the .argosdata file(s) you used for training and to the final .argosmodel files from the training.

Credits / Acknowledgements