Language Technology Resources

Introduction

In October 2012, the European Union (EU) agency 'European Centre for Disease Prevention and Control' (ECDC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-five languages. The data is distributed via the web pages of the European Commission's Joint Research Centre (JRC). This page describes the resource, which is called the ECDC Translation Memory, or ECDC-TM for short.

Translation memories are parallel texts, i.e. texts and their manually produced translations, also referred to as bi-texts. More precisely, a translation memory is a collection of small text segments and their translations (referred to as translation units, TUs). These TUs can be sentences or parts of sentences. Translation memories support translators by ensuring that pieces of text that have already been translated do not need to be translated again.

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:

  • training automatic systems for statistical machine translation (SMT);
  • producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
  • training and testing multilingual information extraction software;
  • checking translation consistency automatically;
  • testing and benchmarking alignment software (for sentences, words, etc.).

The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora are abundant for some languages, there are few or none for most language pairs. Apart from being freely available, the most outstanding advantage of the various parallel corpora offered via our web pages is the number of rare language pairs they cover (e.g. Maltese-Estonian, Slovene-Finnish, etc.).

The ECDC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of public health. It also includes translation units for Irish (Gaeilge, GA), Norwegian (Norsk, NO) and Icelandic (IS).

Languages / File Format

ECDC-TM covers 25 languages: the 23 official languages of the EU plus Norwegian (Norsk) and Icelandic. ECDC-TM was created by translating from English into the following 24 languages: Bulgarian, Czech, Danish, Dutch, Estonian, Gaeilge (Irish), German, Greek, Finnish, French, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian (Norsk), Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish. The JRC then combined the resulting 24 translation memory files into one large translation memory, which makes it possible to extract translation units for other language pairs as well.
All documents and sentences were thus originally written in English. They were then translated into the other languages by professional translators from the Translation Centre CdT in Luxembourg.
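
Because every bilingual file shares English as its source language, the combination step can be pictured as a join on the English segment. The following Python sketch illustrates that idea only; it is not the JRC's actual procedure. The function name `merge_on_pivot`, the input layout (a dict of language code to a list of English/target pairs) and the rule of keeping the first translation seen are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_on_pivot(bilingual_tms, pivot="EN"):
    """Merge bilingual memories into multilingual units.

    bilingual_tms maps a target language code to a list of
    (pivot_text, target_text) pairs (assumed layout).
    Returns a list of dicts {language code: segment text}.
    """
    merged = defaultdict(dict)
    for lang, pairs in bilingual_tms.items():
        for pivot_text, target_text in pairs:
            unit = merged[pivot_text]
            unit[pivot] = pivot_text
            # Assumption: keep the first translation seen per language.
            unit.setdefault(lang, target_text)
    return list(merged.values())


english = "Vaccination against hepatitis C is not yet available."
units = merge_on_pivot({
    "BG": [(english, "Засега няма ваксина срещу хепатит С.")],
    "CS": [(english, "Očkování proti hepatitidě C zatím není k dispozici.")],
})

# A Bulgarian-Czech pair now exists although no one translated BG -> CS.
bg_cs = [(u["BG"], u["CS"]) for u in units if "BG" in u and "CS" in u]
print(bg_cs)
```

Pairs obtained this way are indirect translations: both sides were translated from English, not from each other.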

The documents are distributed in the widely used Translation Memory eXchange (TMX) format. They are encoded in the UTF-8 character set. The files have the following structure:

<tu>
<tuv xml:lang="EN">
<seg>Vaccination against hepatitis C is not yet available.</seg>
</tuv>
<tuv xml:lang="BG">
<seg>Засега няма ваксина срещу хепатит С.</seg>
</tuv>
<tuv xml:lang="CS">
<seg>Očkování proti hepatitidě C zatím není k dispozici.</seg>
</tuv>


...

 

<tuv xml:lang="SV">
<seg>Det finns ännu inget vaccin mot hepatit C.</seg>
</tuv>
</tu>
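
A file with this structure can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` to stream translation units one at a time; `iter_units` is a hypothetical helper name, and the embedded sample is a stripped-down stand-in for the real file (its empty header would not be valid TMX), reusing two segments from the excerpt above.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_units(source):
    """Yield one {language code: segment text} dict per <tu> element.

    source is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[tuv.get(XML_LANG, "").upper()] = "".join(seg.itertext())
        yield unit
        elem.clear()  # release the processed unit to limit memory use


SAMPLE = """<tmx version="1.4"><header/><body>
<tu>
<tuv xml:lang="EN"><seg>Vaccination against hepatitis C is not yet available.</seg></tuv>
<tuv xml:lang="SV"><seg>Det finns ännu inget vaccin mot hepatit C.</seg></tuv>
</tu>
</body></tmx>"""

sample_units = list(iter_units(io.BytesIO(SAMPLE.encode("utf-8"))))
print(sample_units[0]["SV"])
```

For the distributed file, a call such as `iter_units("ECDC.tmx")` should work in the same way, assuming the file has been unpacked into the working directory.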

Text types / Domain

ECDC-TM was built from the website of the European Centre for Disease Prevention and Control (ECDC). Most of the documents deal with health-related topics (anthrax, botulism, cholera, dengue fever, hepatitis, etc.), but some of the web pages describe the ECDC itself (e.g. its organisation, job opportunities) and its activities (e.g. epidemic intelligence, surveillance). The file ECDC-domains.xlsx gives further details.

Statistics for the ECDC Translation Memory

The following table shows the size of the ECDC Translation Memory per language: the number of translation units, the total number of words and characters, and the average number of words and characters per translation unit. For details, there is also a file containing the statistics on the size of the ECDC-TM per language pair.

 

Language  No. of TUs  No. of words  No. of chars  No. of words per TU  No. of chars per TU
BG        2567        53557         293635        20±37.02             114±100.02
CS        2562        45564         271290        17±32.44             105±93.31
DA        2577        41955         261529        16±28.41             101±90.24
DE        2560        43187         306148        16±25.99             119±92.17
EL        2530        50658         317722        20±24.85             125±93.88
EN        3919        72085         395269        18±24.12             100±92.98
ES        2564        52406         300495        20±25.06             117±93.49
ET        2581        39435         255112        15±28.36             98±92.87
FI        2617        38467         277958        14±27.62             106±92.10
FR        2561        50106         303936        19±26.88             118±92.49
GA        1356        22619         143006        16±26.40             105±91.99
HU        2571        45744         290470        17±28.39             112±92.69
IS        2511        42005         256966        16±27.68             102±91.99
IT        2534        47038         295964        18±27.08             116±92.13
LT        2545        102229        347591        40±83.88             136±129.47
LV        2542        48095         273604        18±82.86             107±128.24
MT        2539        61855         315865        24±80.75             124±126.89
NL        2510        46666         292721        18±78.82             116±125.35
NO        2537        40149         254315        15±76.83             100±123.45
PL        2546        91237         347955        35±83.69             136±128.72
PT        2531        49239         294449        19±81.78             116±127.24
RO        2555        46999         292453        18±80.00             114±125.90
SK        2525        88810         323179        35±85.24             127±129.73
SL        2545        84756         308808        33±89.63             121±132.99
SV        2527        39442         259710        15±87.91             102±131.39
ALL       63,912      1,344,303     7,280,150

 

Size of ECDC's Translation Memory (expressed as the number of translation units, number of words and number of characters) per language for each of the 25 European languages (all 23 official EU languages plus Icelandic and Norwegian).
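
Figures of this kind can be approximated with a few lines of Python. The sketch below is illustrative only: this page does not say how words were counted or what the ± values denote, so whitespace tokenisation and the population standard deviation are assumptions, and `size_stats` is a hypothetical helper name. Dividing the Bulgarian word count by its TU count (53557 / 2567 ≈ 20.9, shown as 20) suggests that the published averages are truncated, not rounded.

```python
from statistics import mean, pstdev


def size_stats(segments):
    """Return size statistics for the segments of one language.

    Words are split on whitespace and the spread is the population
    standard deviation (both are assumptions, see the text above).
    Raises statistics.StatisticsError for an empty list.
    """
    words = [len(s.split()) for s in segments]
    chars = [len(s) for s in segments]
    return {
        "tus": len(segments),
        "words": sum(words),
        "chars": sum(chars),
        "words_per_tu": (mean(words), pstdev(words)),
        "chars_per_tu": (mean(chars), pstdev(chars)),
    }


stats = size_stats([
    "Vaccination against hepatitis C is not yet available.",
    "Det finns ännu inget vaccin mot hepatit C.",
])
avg, spread = stats["words_per_tu"]
# int() truncates, mirroring how the table appears to present averages.
print(f"{stats['tus']} TUs, {stats['words']} words, {int(avg)}±{spread:.2f} words per TU")
```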

Terms of Use

By downloading or using the ECDC-Translation Memory, you are bound by the ECDC-TM usage conditions (PDF).

Further Translation Memories (and more) available on our site

The public release of the ECDC-Translation Memory follows the release of various other multilingual resources via the JRC's website. These include the JRC-Acquis parallel corpus since 2006 (22 languages); the DGT-Translation Memory (DGT-TM) since 2007 (22 languages); the JRC-Names multilingual and multi-script name variant list and related software (since 2011); and the JRC Eurovoc Indexer (JEX) multilingual document categorisation software (22 languages) since 2012. For details and other, smaller linguistic resources, see the JRC-Resources page.
Further multilingual linguistic resources will be made available in the future.

Download the ECDC Translation Memory

The distribution of the ECDC Translation Memory consists of a single zip file (ECDC-TM.zip), which can be downloaded by clicking on the link below. The zip file contains: the main file ECDC.tmx, with the aligned translation units for all languages; the DTD file, which should be kept in the same directory as ECDC.tmx; a PDF file with the statistics on the corpus; and a PDF document describing the terms of use.
Should you be interested in the full-text version of the English files that were used to produce the translation memory, you can download these also. If needed, you can furthermore download a Java utility that allows you to extract a TMX file containing only one single language pair and to produce statistics on the number of translation units.
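
For readers who prefer not to use the Java utility, the same kind of extraction can be sketched in Python. The code below is not the JRC tool and does not reproduce its options; `build_pair_tmx` is a hypothetical helper, the input is assumed to be a list of dicts mapping language codes to segment text, and the generated header is reduced to a single attribute, so real TMX tools may require more metadata.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_pair_tmx(units, src, tgt):
    """Build a TMX tree with only the units that have both languages.

    Returns (ElementTree, number of translation units kept).
    """
    root = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(root, "header", {"srclang": src})  # simplified header
    body = ET.SubElement(root, "body")
    count = 0
    for unit in units:
        if src not in unit or tgt not in unit:
            continue  # skip units lacking one side of the pair
        tu = ET.SubElement(body, "tu")
        for lang in (src, tgt):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = unit[lang]
        count += 1
    return ET.ElementTree(root), count


demo_units = [
    {"EN": "Vaccination against hepatitis C is not yet available.",
     "BG": "Засега няма ваксина срещу хепатит С.",
     "CS": "Očkování proti hepatitidě C zatím není k dispozici."},
    {"EN": "Vaccination against hepatitis C is not yet available.",
     "SV": "Det finns ännu inget vaccin mot hepatit C."},
]
tree, kept = build_pair_tmx(demo_units, "BG", "CS")
print(kept, "translation unit(s) kept")
print(ET.tostring(tree.getroot(), encoding="unicode"))
```

To store the result, `tree.write("ECDC-BG-CS.tmx", encoding="utf-8", xml_declaration=True)` writes a UTF-8 file; the output file name is only an example.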

 

ECDC-TM (October 2012)  Download size
ECDC-TM.zip             3.7 MB

Referring to this resource

There is currently no scientific or technical publication describing the ECDC Translation Memory (ECDC-TM), so please simply refer to this web page with the address:

Acknowledgement and Contact

The ECDC Translation Memory was offered by the European Centre for Disease Prevention and Control (ECDC). The original files - one for each of the 24 language pairs - were cleaned and combined by Mohamed Ebrahim from the European Commission's Joint Research Centre JRC.
The European Centre for Disease Prevention and Control (ECDC) is an EU agency whose aim is to strengthen Europe's defences against infectious diseases. It was established in 2005 and is based in Stockholm, Sweden.
The ECDC's mission: According to Article 3 of the founding Regulation, ECDC's mission is to identify, assess and communicate current and emerging threats to human health posed by infectious diseases. In order to achieve this mission, ECDC works in partnership with national health protection bodies across Europe to strengthen and develop continent-wide disease surveillance and early warning systems. By working with experts throughout Europe, ECDC pools Europe's health knowledge, so as to develop authoritative scientific opinions about the risks posed by current and emerging infectious diseases.
The Joint Research Centre (JRC) is a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further smaller linguistic resources.
The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM collects and aggregates about 150,000 online news articles per day in 50 languages from about 3500 news portals world-wide (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every ten minutes. Because EMM not only displays news articles but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's four publicly accessible media monitoring applications are:

  • NewsBrief: Breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
  • MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
  • NewsExplorer: Summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; detection of quotations by and about people; individual, daily-updated pages for over one million names; automatic generation of social networks.
  • EMM-Labs: A collection of more experimental text analysis applications in up to 50 languages not yet entirely integrated with the main Europe Media Monitor pages. EMM-Labs includes tools for event extraction (event scenario template filling), multi-document summarisation, social networks, news maps, media impact analysis, machine translation and more.

For more information on ECDC-TM, you can contact the following persons:
 

Web Editor for Multilingual Content
Email address: webmaster@ecdc.europa.eu
European Centre for Disease Prevention and Control (ECDC)
Tomtebodavägen 11A
171 83 Stockholm, Sweden
URL: http://www.ecdc.europa.eu

 

Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)
More information on the JRC and its activities on language technology.