Software:Apertium

Short description: Open-source rule-based machine translation platform
Apertium
Apertium-tolk, a simple desktop user interface for Apertium that translates as the user types
Repository: github.com/apertium
Written in: C++
Operating system: POSIX compatible and Windows NT (limited support)
Available in: 35 languages, see below
Type: Rule-based machine translation
License: GNU General Public License
Website: www.apertium.org

Apertium is a free/open-source rule-based machine translation platform, released under the terms of the GNU General Public License.

Overview

Apertium is a transfer-based machine translation system that uses finite-state transducers for all of its lexical transformations, and Constraint Grammar taggers as well as hidden Markov models or perceptrons for part-of-speech tagging (word-category disambiguation).[1] A structural transfer component is responsible for word movement and agreement; most Apertium language pairs to date use "chunking" (shallow-transfer) rules, though newer pairs use (possibly recursive) rules defined in a context-free grammar.[2]
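To make the idea of a finite-state lexical transformation concrete, the following is a minimal Python sketch, not Apertium's actual implementation (which is written in C++). It stores input strings in a letter-by-letter state machine and emits the paired output strings at final states. The class name ToyTransducer, the word list and the tag names are illustrative assumptions.

```python
class ToyTransducer:
    """A tiny letter-by-letter transducer (illustrative only, not lttoolbox)."""

    def __init__(self, pairs):
        self.pairs = list(pairs)
        # transitions[state] maps an input symbol to the next state; 0 is the start state.
        self.transitions = [{}]
        # outputs[state] lists the strings emitted when the input ends in that state.
        self.outputs = {}
        for source, target in self.pairs:
            state = 0
            for symbol in source:
                if symbol not in self.transitions[state]:
                    self.transitions.append({})
                    self.transitions[state][symbol] = len(self.transitions) - 1
                state = self.transitions[state][symbol]
            self.outputs.setdefault(state, []).append(target)

    def lookup(self, text):
        """Follow one arc per input symbol and return every output at the final state."""
        state = 0
        for symbol in text:
            state = self.transitions[state].get(symbol)
            if state is None:
                return []  # no path through the machine: the form is unknown
        return list(self.outputs.get(state, []))

    def inverted(self):
        """Swap input and output sides, turning an analyser into a generator."""
        return ToyTransducer((target, source) for source, target in self.pairs)


# Invented entries: surface form paired with lemma plus tags.
analyser = ToyTransducer([
    ("cats", "cat<n><pl>"),
    ("saw", "saw<n><sg>"),
    ("saw", "see<vblex><past>"),
])
generator = analyser.inverted()

print(analyser.lookup("saw"))          # ['saw<n><sg>', 'see<vblex><past>']
print(generator.lookup("cat<n><pl>"))  # ['cats']
```

The ambiguous result for "saw" is the kind of output that a part-of-speech tagger then has to resolve, and inverting the same data gives the generation direction.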

Many existing machine translation systems are commercial or use proprietary technologies, which makes them very hard to adapt to new uses. Apertium's code and data are free software and use a language-independent specification, which makes it easier to contribute to Apertium, speeds up development, and supports the project's overall growth.

As of December 2020, Apertium had released 51 stable language pairs,[3] delivering fast translation with reasonably intelligible results whose errors are easy to correct. Being an open-source project, Apertium provides tools for prospective developers to build their own language pairs and contribute them to the project.

History

Apertium originated as one of the machine translation engines in the OpenTrad project, which was funded by the Spanish government and developed by the Transducens research group at the Universitat d'Alacant. It was originally designed to translate between closely related languages, although it has since been expanded to handle more divergent language pairs. Creating a new machine translation system only requires developing linguistic data (dictionaries, rules) in well-specified XML formats.
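As a rough illustration of what such XML linguistic data can look like, the fragment below is a simplified sketch modelled on Apertium's bilingual dictionary format. The entries are invented, real dictionaries contain further sections and definitions, and the helper functions side and load_entries are hypothetical names written for this example using Python's standard library, not part of Apertium.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment: each <e> entry pairs a left (source) side
# with a right (target) side; <s n="..."/> elements carry grammatical tags.
BIDIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>cat<s n="n"/></l><r>gato<s n="n"/><s n="m"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
  </section>
</dictionary>"""


def side(element):
    """Return (lemma, tags) for one <l> or <r> element."""
    lemma = element.text or ""
    tags = tuple(s.get("n") for s in element.findall("s"))
    return lemma, tags


def load_entries(xml_text):
    """Map each source-side (lemma, tags) to its target-side (lemma, tags)."""
    root = ET.fromstring(xml_text)
    entries = {}
    for pair in root.iter("p"):
        entries[side(pair.find("l"))] = side(pair.find("r"))
    return entries


entries = load_entries(BIDIX)
print(entries[("cat", ("n",))])  # ('gato', ('n', 'm'))
```

Because the data live in plain XML rather than in program code, a contributor can extend a language pair by adding entries without touching the engine.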

Language data developed for it (in collaboration with the Universidade de Vigo, the Universitat Politècnica de Catalunya and the Universitat Pompeu Fabra) currently support, in stable versions, the Arabic, Aragonese, Asturian, Basque, Belarusian, Breton, Bulgarian, Catalan, Crimean Tatar, Danish, English, Esperanto, French, Galician, Hindi, Icelandic, Indonesian, Italian, Kazakh, Macedonian, Malaysian, Maltese, Northern Sami, Norwegian (Bokmål and Nynorsk), Occitan, Polish, Portuguese, Romanian, Russian, Sardinian, Serbo-Croatian, Silesian, Slovene, Spanish, Swedish, Tatar, Ukrainian, Urdu, and Welsh languages. A full list is available below. Several companies are also involved in the development of Apertium, including Prompsit Language Engineering, Imaxin Software and Eleka Ingeniaritza Linguistikoa.

The project has taken part in the 2009,[4] 2010,[5] 2011,[6] 2012,[7] 2013[8] and 2014[9] editions of Google Summer of Code and the 2010,[10] 2011,[11] 2012,[12] 2013,[13] 2014,[14] 2015,[15] 2016[16] and 2017[17] editions of Google Code-In.

Translation methodology

Pipeline of Apertium machine translation system

This is an overall, step-by-step view of how Apertium works.

The diagram displays the steps that Apertium takes to translate a source-language text (the text to be translated) into a target-language text (the translated text).

  1. Source language text is passed into Apertium for translation.
  2. The deformatter removes formatting markup (HTML, RTF, etc.) that should be kept in place but not translated.
  3. The morphological analyser segments the text (expanding elisions, marking set phrases, etc.) and looks up the segments in the language dictionaries, returning dictionary forms and tags for all matches. In pairs that involve agglutinative morphology, including a number of Turkic languages, Helsinki Finite-State Technology (HFST) is used. Otherwise, an Apertium-specific finite-state transducer system called lttoolbox[18] is used.
  4. The morphological disambiguator resolves ambiguous segments (i.e., those with more than one match) by choosing one match; together, the morphological analyser and the morphological disambiguator form the part-of-speech tagger. Apertium uses Constraint Grammar rules (with the vislcg3 parser[19]) for most of its language pairs.
  5. Retokenisation uses a finite-state transducer to match sequences of lexical units and may reorder or translate tags (often used to translate idiomatic expressions into something closer to the target-language grammar).
  6. Lexical transfer looks up disambiguated source-language base words to find their target-language equivalents (i.e., it maps source language to target language). For lexical transfer, Apertium uses an XML-based dictionary format called bidix.[20]
  7. Lexical selection chooses between alternative translations when the source-text word has alternative meanings. Apertium uses a dedicated XML-based technology, apertium-lex-tools,[21] to perform lexical selection.
  8. Structural transfer (specified in an XML format that allows writing complex structural transfer rules) can consist of one-step chunking transfer, three-step chunking transfer or a CFG-based transfer module. The chunking modules flag grammatical differences between the source language and target language (e.g. gender or number agreement) by creating a sequence of chunks containing markers for them. They then reorder or modify chunks in order to produce a grammatical translation in the target language. The newer CFG-based module matches input sequences against possible parse trees, selects the best-ranking one and applies transformation rules to the tree.
  9. The morphological generator uses the tags to deliver the correct target language surface form. The morphological generator is a morphological transducer,[22] just like the morphological analyser. A morphological transducer both analyses and generates forms.
  10. The post-generator makes any necessary orthographic changes due to the contact of words (e.g. elisions).
  11. The reformatter restores the formatting markup (HTML, RTF, etc.) that was removed by the deformatter in step 2.
  12. Apertium delivers the target-language translation.
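The steps above can be sketched as a chain of small functions. The following is a toy Python illustration under strong simplifying assumptions, not Apertium's code: all dictionaries, tag names and function names are invented for this example, disambiguation simply picks the first reading, and the deformatter, retokenisation, lexical selection, post-generator and reformatter stages are omitted.

```python
# Toy English-to-Spanish data; every entry is invented for illustration.
ANALYSES = {
    "white": [("white", ("adj",))],
    "cats": [("cat", ("n", "pl"))],
}
BILINGUAL = {
    ("white", "adj"): ("blanco", ("adj",)),
    ("cat", "n"): ("gato", ("n", "m")),
}
GENERATION = {
    ("gato", ("n", "m", "pl")): "gatos",
    ("blanco", ("adj", "m", "pl")): "blancos",
}


def analyse(text):
    """Morphological analysis: every word gets a list of (lemma, tags) readings."""
    return [ANALYSES[word] for word in text.lower().split()]


def disambiguate(ambiguous):
    """Stand-in for the tagger: keep the first reading of each word."""
    return [readings[0] for readings in ambiguous]


def lexical_transfer(units):
    """Replace each source lemma by its target equivalent, keeping number tags."""
    result = []
    for lemma, tags in units:
        target_lemma, target_tags = BILINGUAL[(lemma, tags[0])]
        result.append((target_lemma, target_tags + tags[1:]))
    return result


def structural_transfer(units):
    """Reorder adjective + noun to noun + adjective and copy agreement tags."""
    result = list(units)
    i = 0
    while i + 1 < len(result):
        (adj, adj_tags), (noun, noun_tags) = result[i], result[i + 1]
        if adj_tags[0] == "adj" and noun_tags[0] == "n":
            agreed = (adj, ("adj",) + noun_tags[1:])  # gender and number from the noun
            result[i], result[i + 1] = (noun, noun_tags), agreed
            i += 2
        else:
            i += 1
    return result


def generate(units):
    """Morphological generation: look up the surface form for each lemma plus tags."""
    return " ".join(GENERATION[unit] for unit in units)


def translate(text):
    units = disambiguate(analyse(text))
    return generate(structural_transfer(lexical_transfer(units)))


print(translate("white cats"))  # gatos blancos
```

Each function consumes the previous one's output, which mirrors how the stages of the pipeline are kept separate so that the data for a given language pair can be developed independently of the engine.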

Language pairs

The currently stable language pairs are listed below.

Each entry gives a language, its code, and the languages it is paired with: (→) marks translation from the first language into the listed one, (←) translation in the opposite direction, and (⇄) translation in both directions.

  • Afrikaans (af): Dutch (⇄)
  • Arabic (ar): Maltese (←)
  • Aragonese (an): Catalan (⇄), Spanish (⇄)
  • Asturian (ast): Spanish (⇄)
  • Basque (eu): English (→), Spanish (→)
  • Breton (br): French (→)
  • Bulgarian (bg): Macedonian (⇄)
  • Catalan (ca): Aragonese (⇄), English (⇄), Esperanto (→), French (⇄), Italian (←), Occitan (⇄), Portuguese (⇄), Sardinian (→), Spanish (⇄)
  • Danish (da): Norwegian Bokmål (⇄), Norwegian Nynorsk (⇄), Swedish (←)
  • Dutch (nl): Afrikaans (⇄)
  • English (en): Basque (←), Catalan (⇄), Esperanto (⇄), Galician (⇄), Icelandic (←), Macedonian (←), Serbo-Croatian (←), Spanish (⇄), Welsh (←)
  • Esperanto (eo): Catalan (←), English (⇄), French (←), Spanish (←)
  • Finnish (fi): German (⇄)
  • French (fr): Breton (←), Catalan (⇄), Esperanto (→), Occitan (→), Spanish (⇄)
  • Galician (gl): English (⇄), Portuguese (⇄), Spanish (⇄)
  • German (de): Finnish (⇄)
  • Hindi (hin): Urdu (⇄)
  • Icelandic (is): English (→), Swedish (⇄)
  • Indonesian (id): Malaysian (⇄)
  • Italian (it): Catalan (→), Sardinian (⇄)
  • Kazakh (kaz): Tatar (⇄)
  • Macedonian (mk): Bulgarian (⇄), English (→), Serbo-Croatian (←)
  • Malaysian (ms): Indonesian (⇄)
  • Maltese (mt): Arabic (→)
  • Northern Sami (sme): Norwegian Bokmål (→)
  • Norwegian Bokmål (nb): Danish (⇄), Northern Sami (←), Norwegian Nynorsk (⇄)
  • Norwegian Nynorsk (nn): Danish (⇄), Norwegian Bokmål (⇄)
  • Occitan (oc): Catalan (⇄), French (←), Spanish (⇄)
  • Portuguese (pt): Catalan (⇄), Galician (⇄), Spanish (⇄)
  • Romanian (ro): Spanish (→)
  • Sardinian (sc): Catalan (←), Italian (⇄)
  • Serbo-Croatian (hbs): English (→), Macedonian (→), Slovenian (⇄)
  • Slovenian (slv): Serbo-Croatian (⇄)
  • Spanish (es): Aragonese (⇄), Asturian (⇄), Basque (←), Catalan (⇄), English (⇄), Esperanto (→), French (⇄), Galician (⇄), Occitan (⇄), Portuguese (⇄), Romanian (←)
  • Swedish (sv): Danish (→), Icelandic (⇄)
  • Tatar (tat): Kazakh (⇄)
  • Urdu (urd): Hindi (⇄)
  • Welsh (cy): English (→)

Notes

  1. Francis M. Tyers (2010). "Rule-based Breton to French machine translation". Proceedings of the 14th Annual Conference of the European Association for Machine Translation, EAMT10, pp. 174–181.
  2. Khanna, Tanmai; Washington, Jonathan N.; Tyers, Francis M.; Bayatlı, Sevilay; Swanson, Daniel G.; Pirinen, Tommi A.; Tang, Irene; Alòs i Font, Hèctor (1 December 2021). "Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages". Machine Translation 35 (4): 475–502. doi:10.1007/s10590-021-09260-6. 
  3. "Apertium". https://wiki.apertium.org/wiki/Main_Page. 
  4. "Accepted organizations for Google Summer of Code 2009". https://www.google-melange.com/gsoc/org/list/public/google/gsoc2009. 
  5. "Accepted organizations for Google Summer of Code 2010". https://www.google-melange.com/gsoc/org/list/public/google/gsoc2010. 
  6. "Accepted organizations for Google Summer of Code 2011". https://www.google-melange.com/gsoc/org/list/public/google/gsoc2011. 
  7. "Accepted organizations for Google Summer of Code 2012". https://www.google-melange.com/gsoc/org/list/public/google/gsoc2012. 
  8. "Accepted organizations for Google Summer of Code 2013". https://www.google-melange.com/gsoc/org/list/public/google/gsoc2013. 
  9. "Accepted organizations for Google Summer of Code 2014". https://www.google-melange.com/gsoc/org/list/public/google/gsoc2014. 
  10. "Accepted organizations for Google Code-in 2010". https://www.google-melange.com/archive/gci/2010. 
  11. "Accepted organizations for Google Code-in 2011". https://www.google-melange.com/archive/gci/2011. 
  12. "Accepted organizations for Google Code In 2012". https://www.google-melange.com/archive/gci/2012. 
  13. "Accepted organizations for Google Code-in 2013". https://www.google-melange.com/archive/gci/2013. 
  14. "Accepted organizations for Google Code-in 2014". https://www.google-melange.com/archive/gci/2014. 
  15. "Accepted organizations for Google Code-in 2015". https://codein.withgoogle.com/archive/2015/organization/. 
  16. "Accepted organizations for Google Code-in 2016". https://codein.withgoogle.com/organizations/. 
  17. "Accepted organizations for Google Code-in 2017". https://codein.withgoogle.com/organizations/. 
  18. "Lttoolbox - Apertium". http://wiki.apertium.org/wiki/Lttoolbox. 
  19. "VISL". http://beta.visl.sdu.dk/visl/vislcg-doc.html. 
  20. "Bilingual dictionary - Apertium". http://wiki.apertium.org/wiki/Bidix. 
  21. "Constraint-based lexical selection module - Apertium". http://wiki.apertium.org/wiki/Constraint-based_lexical_selection_module. 
  22. "Morphological dictionary - Apertium". http://wiki.apertium.org/wiki/Morphological_dictionary. 

References

  • Corbí-Bellot, M. et al. (2005) "An open-source shallow-transfer machine translation engine for the romance languages of Spain" in Proceedings of the European Association for Machine Translation, 10th Annual Conference, Budapest 2005, pp. 79–86
  • Armentano-Oller, C. et al. (2006) "Open-source Portuguese-Spanish machine translation" in Lecture Notes in Computer Science 3960 [Computational Processing of the Portuguese Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006], p 50–59.
  • Forcada, M. L. et al. (2010) "Documentation of the Open-Source Shallow-Transfer Machine Translation Platform Apertium" in Departament de Llenguatges i Sistemes Informatics, University of Alacant.
  • Forcada, M. L. et al. (2011) "Apertium: a free/open-source platform for rule-based machine translation". doi:10.1007/s10590-011-9090-0
