Training requires a CUDA-enabled GPU. For this tutorial, we will be renting a GPU/CUDA-enabled server instance on vast.ai using a prepared Docker image specifically created for training Argos Translate models. A future tutorial will be made for training on your own Debian-based PC at home (the init script assumes the apt package manager, but you can manually perform the same steps on any comparable Linux distribution).
First off, check out and read the information provided by argosopentech directly here: https://github.com/argosopentech/argos-train and watch the video tutorial at https://odysee.com/@argosopentech:7/training-an-Argos-Translate-model-tutorial-2022:2?r=DMnK7NqdPNHRCfwhmKY9LPow3PqVUUgw
This tutorial is mainly a wordier version of those instructions and the video, with some more step-by-step details added from my experience learning to train.
A main source of translation data is the OPUS project. You can train a model from a single dataset, or a combination of multiple datasets.
Argos Translate currently uses a pivot method of translation. To translate French to German, it will first translate French to English and then from English to German. But we can use the same dataset(s) to create models for both en -> xy and xy -> en.
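For example, once models are installed, argostranslate handles the pivot for you. Here is a minimal sketch of its Python API - the fr -> en and en -> de models being installed is an assumption for illustration:

# Sketch: pivot translation with argostranslate
# Assumes fr->en and en->de models are already installed
from argostranslate import translate

installed = translate.get_installed_languages()
fr = next(lang for lang in installed if lang.code == "fr")
de = next(lang for lang in installed if lang.code == "de")

# get_translation() pivots through English when no direct model exists
print(fr.get_translation(de).translate("Bonjour le monde"))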
You can, of course, choose your own target non-English language. In this example I will be doing Greek / English.
A .argosdata file is just a ZIP file with a renamed extension. The structure is as follows:

[dir] data-{datasource}-{source language}_{target language}
- source
- target
- metadata.json
- LICENSE
- README
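Since it really is just a ZIP, you can sanity-check the layout of any existing .argosdata package with standard tools:

unzip -l data-europarl-en_el.argosdata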
We will get most of this from the dataset we download from OPUS, but we will create the metadata.json file ourselves with the relevant information needed for the training.
Training will use the plain-text MOSES/GIZA++ download file for our language combination. Choose your source data (not all sources have all language combinations); here we will use the EUROPARL data.
Scroll down to the bottom grid; we want to select our language pair from the bottom-left triangle.
Find the row and column for your language pair. Here, row en, column el ([1.3M]). Make note of the sentence-pairs number in the tooltip - in this case 1292180 - we'll need that in our metadata.json file.
Save the file. I suggest renaming it to something more descriptive, for example data-europarl-en_el-orig.zip, so that you know which source you used (remember, we can use multiple sources for a single training) and it's semi-ready for creating the .argosdata zip from - once we make a few adjustments...
Unzip the file into its own, appropriately named directory - /data-europarl-en_el - and enter the new directory. You should have a file listing like:

[dir] data-europarl-en_el
- README
- LICENSE
- Europarl.el-en.xml <---- We won't need this
- Europarl.el-en.en <---- This will be our source
- Europarl.el-en.el <---- This will be our target
While we are going to create both Greek-to-English and English-to-Greek models, it is best to keep to the standard of English as the original source and the new language as the original target. You will see in the training section later that the argos-train program can create both directions from the same dataset.
So...
- Rename the .en file to source (no extension)
- Rename the .el file (or your chosen language) to target (again, no extension)
- Create a new metadata.json file
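From inside the data-europarl-en_el directory, those renames might look like this (removing the unneeded XML file is optional, since it simply won't be included in the final package):

mv Europarl.el-en.en source
mv Europarl.el-en.el target
rm Europarl.el-en.xml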
Now open metadata.json in your favorite text editor - Notepad, Notepad++, gedit, Atom, nano... - and paste this in as a template:
{ "name": "Europarl", "type": "data", "from_code": "en", "to_code": "da", "size": 1991647, "reference": " J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)" }
"to_code": "el"
The new metadata.json file for this example:

{
    "name": "Europarl",
    "type": "data",
    "from_code": "en",
    "to_code": "el",
    "size": 1292180,
    "reference": "J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)"
}
* Hang on to this metadata - we will need this information again during the training setup (for the training data-index.json).
Now move back up a directory and ZIP the whole data-europarl-en_el directory to data-europarl-en_el.zip, and then change the extension to .argosdata, so you now have the file data-europarl-en_el.argosdata.
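On Linux/OSX, assuming the zip utility is installed, this can be done in one go:

cd ..
zip -r data-europarl-en_el.zip data-europarl-en_el
mv data-europarl-en_el.zip data-europarl-en_el.argosdata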
Congratulations! You now have a trainable dataset.
Now... upload it to a web-accessible location so the training scripts can download it - a location on your own hosting, or another publicly accessible location.
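For example, if your hosting allows SSH access, a hypothetical upload could be as simple as (user, host, and path here are placeholders):

scp data-europarl-en_el.argosdata user@example.com:/var/www/html/argosdata/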
While this can also be done on a home PC with a Linux OS and a CUDA-enabled GPU, it is resource-intensive. A tutorial for training "at home" will be made eventually, but a cheap GPU-enabled temporary cloud server is easy and efficient.
Visit vast.ai, but let's not set up a server quite yet. Click Sign In to create a new account and follow the usual steps of verifying your email address and such.
Once you are all set, log in and go to Billing. Add a card, and once it's been added, add some credit. 10 USD is more than plenty for a few trainings.
Now that you have the account set up and some credit to rent with, click on the Create link in the Client section of the sidebar menu.
First off, we want to use the nicely prepared Docker image from Argos Open Tech. So in the Instance Configuration box, click the [EDIT IMAGE & CONFIG] button. In the new window, scroll down to where you can enter a custom Docker image:
Enter argosopentech/argostrain for the custom image, and click Select.
Disk space is up to you, relative to which and how many datasets you might want to use in training - some can be over 3GB each. To be safe, set this to at least 25-50GB.
Now that the configuration is ready, we just need to pick a server to rent. Again, you don't need anything overly fancy - you'll be waiting and doing other things while it trains anyway. A 1X RTX 3090 has worked fine for me at about 0.30-0.60 USD/hour.
Depending on the number and size of the data files you will be using for training, you may want to opt for a server with higher upload/download speeds.
You'll need to create an SSH key pair and paste the public key into Vast.ai. That's beyond the scope of this tutorial, but in general, for Linux, OSX, and Windows 10+, from a shell/command prompt run:
ssh-keygen

It will suggest a default location and the filename id_rsa. Copy the path, but change the filename to something like vastai. Go to the suggested directory (C:\Users\myusr\.ssh, or ~/.ssh, etc.) and you should have a private key vastai and a public key vastai.pub. Open the vastai.pub file, then copy and paste this public key into the Vast.ai prompt.
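Alternatively, you can pass the key type and output filename directly on the command line (on Windows, substitute your own .ssh path):

ssh-keygen -t ed25519 -f ~/.ssh/vastai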
Click Rent to create your new server instance. You'll see a popup notification about Instance Created. Go to CLIENT -> Instances from the left-side menu and you will see your server instance being created. It will take a minute or two as it sets up the server and installs the base system as well as the argosopentech/argostrain Docker setup.
Once it's done setting up the instance, your server is ready and you will have a blue Connect button. Click Connect and Vast.ai will give you the SSH connection info. You may, of course, connect over SSH however you like - PuTTY, MobaXterm, etc. Make sure to add the vastai private key. Vast.ai will give you the direct ssh command.
Copy and paste this into a shell/terminal (Linux/OSX) or command prompt (Windows) to connect to your server instance.
node1:~$ ssh -p 12345 root@ssh4.vast.ai -L 8080:localhost:8080
You are now the root user in your new server instance. I've found it helpful to go ahead and make sure the Python virtual environment package is installed, so do that while you are root:
node1:~$ apt-get install virtualenv nano
This will also install additional related Python packages that the argostrain init script would install anyway, plus the nano text editor (which I prefer; feel free to use your own).
Now we want to switch to the user that the Docker image set up for our training and make sure we are in that user's home directory:
node1:~$ su argosopentech
node1:~$ cd
Run the argos-train-init script. This will install additional required system packages and Python libraries, and compile some necessary C libraries:
argosopentech@8856a5cd4738:~$ ./argos-train-init
Once the init is finished, we want to move into a virtual environment to do the training - you should then see (env) in front of your shell prompt - and change into our main working directory, ~/argos-train:
argosopentech@8856a5cd4738:~$ source ~/env/bin/activate
(env) argosopentech@8856a5cd4738:~$ cd ~/argos-train
(env) argosopentech@8856a5cd4738:~/argos-train$
Remove the default data-index.json file in the current directory and edit a new one:
(env) argosopentech@8856a5cd4738:~$ nano data-index.json

This will be a JSON array of the source .argosdata files you want to use in this training.
Each entry will be the same as your metadata.json file, but with link(s) to where you placed the .argosdata files (see the Argosdata metadata.json above for reference).
The new data-index.json file for this example:
[ { "name": "Europarl", "type": "data", "from_code": "en", "to_code": "el", "size": 1292180, "reference": " J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)", "links": [ "http://link to your/data-europarl-en_el.argosdata" ] } ]
Now run argos-train. We will do English to target (Greek here) first, so follow the prompts:
(env) argosopentech@b8f995ca3e6e:~/argos-train$ argos-train
From code (ISO 639): en
To code (ISO 639): el
From name: English
To name: Greek
Version: 1.0
Barring any errors, the training will begin, first downloading a local copy of your .argosdata file and doing some pre-processing of that data. If you decide to use multiple data sources - Europarl plus CCMatrix plus Paracrawl - it will repeat this process for each source .argosdata included in the data-index.json file.
europarl-en_el
Downloading run/cache/europarl-en_el.argosdata
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 81 3373M   81 2757M    0     0  3137k      0  0:18:20  0:15:00  0:03:20 1170k
Once all of the data files being used have been downloaded and pre-processed, the real training into a .argosmodel will begin. It will output some benchmarks every few cycles, but overall the final training can take from an hour up to a day, depending on the number and size of the sources you are using for the training.
Once the training is finished, it will let you know that the .argosmodel file has been saved:
Package saved to /home/argosopentech/argos-train/run/translate-en_el-1_0.argosmodel
(env) argosopentech@4012fbd31806:~/argos-train$
Go ahead and save this file somewhere. For example, open a new shell/terminal/prompt and use scp to download it:
node1:~$ mkdir argosmodels
node1:~$ scp -P 12345 root@ssh4.vast.ai:/home/argosopentech/argos-train/run/translate-en_el-1_0.argosmodel ~/argosmodels
Use the same port and host given by Vast.ai that you used to connect over SSH.
Now that we have a model for translating from English to Greek, we could go ahead and install it for argostranslate and LibreTranslate, but we would only be able to translate from other installed languages to our new language (pivoting through English to get from one to the other). So while we have our data sources already in place, go ahead and train the other direction - el to en - now.
Remove the run/source and run/target files created in the first training, then rerun argos-train and just switch the language codes/names at the prompt:
(env) argosopentech@4012fbd31806:~/argos-train$ rm run/source
(env) argosopentech@4012fbd31806:~/argos-train$ rm run/target
(env) argosopentech@4012fbd31806:~/argos-train$ argos-train
From code (ISO 639): el
To code (ISO 639): en
From name: Greek
To name: English
Version: 1.0
This will now train a new model for Greek to English. Once finished, repeat the steps above to download the newly trained model.
You should now have both translate-en_el-1_0.argosmodel and translate-el_en-1_0.argosmodel files.
You don't have to be running your own LibreTranslate server to train a language. Odds are you might be though!
Upload your new .argosmodel files to your server. You'll need to install the models as the same user that the LibreTranslate server is running as. If you're not sure, run ps aux | grep libretranslate from the shell, which will show the username the libretranslate process is running under.
I find it easiest to put the .argosmodel files in this user's home directory.
Create a new file in the same directory as your model files named addModels.py with the following code:
#!/usr/bin/python3
# Run this script as the user which the libretranslate service is running as.
# Do: ps aux | grep -i libretranslate from a shell if unsure.
import glob

from argostranslate import package, translate

# Install every .argosmodel file found in the current directory
models_path = './'
models_list = glob.glob(models_path + '*.argosmodel')

for model in models_list:
    print('Adding model: ' + model)
    package.install_from_path(model)

# Show what argostranslate now has available
print("Installed languages:")
installed_languages = translate.get_installed_languages()
for language in installed_languages:
    print(language)
Now run the file to import your models:
$ chmod 755 addModels.py
$ ./addModels.py
This will add your new language models to argostranslate, which the libretranslate server can then use. Now restart your libretranslate server; this will depend on how you set up your server. I set mine up as a systemd service, so:
$ systemctl restart libretranslate
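For reference, a minimal sketch of such a systemd unit (/etc/systemd/system/libretranslate.service) - the user name and ExecStart path are assumptions; adjust them to your own install:

[Unit]
Description=LibreTranslate translation server
After=network.target

[Service]
User=libretranslate
ExecStart=/usr/local/bin/libretranslate --host 127.0.0.1 --port 5000
Restart=on-failure

[Install]
WantedBy=multi-user.target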
Visit your LibreTranslate frontend web page, or try it out with your backend libretranslate API. Your new language should now be available to translate from or to.
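For a quick backend check against a default install listening on port 5000, the /translate endpoint can be exercised with curl:

curl -X POST http://localhost:5000/translate \
  -H "Content-Type: application/json" \
  -d '{"q": "Hello world", "source": "en", "target": "el"}'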
If you haven't been there yet, visit the LibreTranslate Community website. Sign up, welcome!
You can create a new thread about your new language models, but there is already a Language Support topic for both requesting new translations and posting about newly made ones.
In either case, make sure to include links to both the .argosdata file(s) you used for training and the final .argosmodel files from the training.