Alina Karakanta

Machine Translation for Languages of Limited Diffusion
Alina Karakanta, Universität des Saarlandes

Machine translation (MT) is greatly changing the translator’s profession with significant
implications for translator training. Languages of limited diffusion (LoLD) have not yet fully
reaped the benefits of MT due to the lack of digital resources, such as monolingual and parallel
corpora and language-specific processing tools. Recently, attempts have been made to improve
the availability and quality of MT for LoLD with data collection and adaptation techniques
and new MT methods. In this presentation, we briefly discuss state-of-the-art methods for
improving MT for LoLD. One such method is triangulation, the process of using one or more
intermediate (usually major) languages as pivots (Cohn and Lapata, 2007; Razmara and Sarkar,
2013; Durrani and Koehn, 2014). For Neural Machine Translation, training a model on a
high-resource language pair and then resuming training on the LoLD pair, a process known as
transfer learning (Zoph et al., 2016), has shown promising improvements in quality. More
recently, zero-shot approaches (Johnson et al., 2016) translate between multiple languages with
a single model, without relying on explicit parallel data for a specific language pair.
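
As an illustrative sketch of triangulation (not the exact method of any of the works cited
above), the Python snippet below combines a source-to-pivot and a pivot-to-target translation
table by summing over the shared pivot words, approximating p(t | s) as the sum over pivot
words p of p(p | s) * p(t | p). The function name `triangulate`, the toy Romanian-English-Greek
entries and all probabilities are assumptions invented for this example.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Build a source-to-target table through a pivot language.

    Both inputs map a word to a dict of {translation: probability}.
    p(t | s) is approximated by summing p(p | s) * p(t | p) over pivot words p.
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot in pivots.items():
            # Pivot words with no entry in the second table contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot * p_tgt
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy tables with invented probabilities (Romanian -> English -> Greek).
ro_en = {"casă": {"house": 0.8, "home": 0.2}}
en_el = {"house": {"σπίτι": 0.9, "οικία": 0.1}, "home": {"σπίτι": 1.0}}

ro_el = triangulate(ro_en, en_el)
for target, prob in sorted(ro_el["casă"].items(), key=lambda kv: -kv[1]):
    print(target, round(prob, 2))
```

In this toy setting no Romanian-Greek parallel data is needed: the Romanian-Greek entries are
derived entirely from the two tables that share English as the pivot.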
We further discuss techniques for increasing the data available for training MT systems:
extracting parallel sentences or bilingual dictionaries from comparable corpora such as
Wikipedia (Munteanu et al., 2004; Fiser and Ljubesic, 2011; Irvine and Callison-Burch, 2013),
taking advantage of data from closely related languages (Nakov and Ng, 2012; Currey et al.,
2016; Karakanta et al., 2017), and using monolingual data either for the language model or as
synthetic (back-translated) parallel data (Gülçehre et al., 2015; Sennrich et al., 2015). Lastly,
we hope that these improvements in MT quality will foster a stronger collaboration between
translators and MT researchers, leading to high-quality resources for LoLD, and we propose
ways to facilitate it. 1

1 The form of this contribution is an oral presentation, but it can also be adapted as a speed presentation.

Keywords

translation technology, machine translation, languages of limited diffusion, resources

Alina Karakanta is currently a Research Assistant at the Department of Language Science and Technology, Saarland University, Germany. Her research concentrates on human and machine translation, especially in low-resource scenarios, viewing the two as friends rather than foes and exploring how they can improve each other. Her work has been published at major conferences (ACL SRW 2016) and in journals (Machine Translation, special issue on low-resource languages). In addition, she has served on the programme/organising committees of top NLP workshops (ACL SRW 2018, LoResMT 2018). She received her degree in Translation Studies from the Ionian University and later a second degree in Interpreting Studies from the same university. Searching for innovation in language research, she obtained an M.Sc. in Computational Linguistics at Saarland University. Since 2013, she has been an official translator for Greek, Romanian, English and German, and she has also worked as a project manager for a global translation company. She shares her experience by participating in mentoring programmes and organising training sessions and lectures on translation technologies for students and professionals.

References
Cohn, Trevor and Mirella Lapata (2007). “Machine Translation by Triangulation: Making Effective
Use of Multi-Parallel Corpora”. In: Proc. of the ACL.

Currey, Anna, Alina Karakanta, and Jon Dehdari (2016). “Using Related Languages to Enhance
Statistical Language Models”. In: Proceedings of the 2016 Conference of the North American
Chapter of the Association for Computational Linguistics: Student Research Workshop. San
Diego, CA, USA: Association for Computational Linguistics, pp. 24–31.
url: https://www.aclweb.org/anthology/N/N16/#2000.

Durrani, Nadir and Philipp Koehn (2014). “Improving Machine Translation via Triangulation and
Transliteration”. In: Proceedings of the 17th Annual Conference of the European Association
for Machine Translation. Dubrovnik, Croatia: EAMT’14, pp. 71–78.
url: http://www.mt-archive.info/10/EAMT-2014-Durrani.pdf.

Fiser, Darja and Nikola Ljubesic (2011). “Bilingual lexicon extraction from comparable corpora for
closely related languages”. In: Proceedings of Recent Advances in Natural Language Processing.
Hissar, Bulgaria, pp. 125–131.

Gülçehre, Çağlar et al. (2015). “On Using Monolingual Corpora in Neural Machine Translation”.
In: CoRR abs/1503.03535. url: http://arxiv.org/abs/1503.03535.

Irvine, Ann and Chris Callison-Burch (2013). Combining Bilingual and Comparable Corpora for
Low Resource Machine Translation.

Johnson, Melvin et al. (2016). “Google’s Multilingual Neural Machine Translation System:
Enabling Zero-Shot Translation”. In: ArXiv Preprint. url: https://arxiv.org/abs/1611.04558.

Karakanta, Alina, Jon Dehdari, and Josef van Genabith (2017). “Neural machine translation for
low-resource languages without parallel corpora”. In: Machine Translation. issn: 1573-0573.
doi: 10.1007/s10590-017-9203-5. url: https://doi.org/10.1007/s10590-017-9203-5.

Munteanu, Dragos Stefan, Alexander M. Fraser, and Daniel Marcu (2004). “Improved Machine
Translation Performance via Parallel Sentence Extraction from Comparable Corpora”. In:
Human Language Technology Conference of the North American Chapter of the Association for
Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2-7, 2004,
pp. 265–272. url: http://aclweb.org/anthology/N/N04/N04-1034.pdf.

Nakov, Preslav and Hwee Tou Ng (2012). “Improving statistical machine translation for a resource-poor language using related resource-rich languages”. In: Journal of Artificial Intelligence Research, pp. 179–222.

Razmara, Majid and Anoop Sarkar (2013). “Ensemble triangulation for statistical machine translation”. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 252–260.

Sennrich, Rico, Barry Haddow, and Alexandra Birch (2015). “Improving Neural Machine Translation Models with Monolingual Data”. In: CoRR abs/1511.06709. url: http://arxiv.org/abs/1511.06709.

Zoph, Barret et al. (2016). “Transfer Learning for Low-Resource Neural Machine Translation”.
In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics. Austin, Texas, pp. 1568–1575.
url: https://aclweb.org/anthology/D16-1163.