21 Jun 2012
New software for automatic multilingual indexing of parliamentary documents
Librarians can breathe a sigh of relief. No more manual indexing of thousands of documents: new software is now freely available to automatically categorise parliamentary documents in 22 official EU languages according to EuroVoc, the EU's multilingual thesaurus. The software tool JEX ('JRC EuroVoc Indexer'), developed by the Joint Research Centre (JRC), the European Commission's science service, can make the work of national parliaments' libraries and documentation centres easier and, in turn, facilitate citizens' access to legislation across EU borders.
To be able to retrieve relevant documents efficiently – even if written in a different language – libraries need to categorise their documents using a closed set of subject domain classes, i.e. a controlled vocabulary from a thesaurus. EuroVoc is the standard thesaurus used in most EU institutions and also in many EU Member States. It contains over 6,700 classes covering the activities of the EU, in particular those of the European Parliament. The EuroVoc labels have been translated one-to-one into all EU languages.
Currently, most parliamentary libraries assign EuroVoc subject domain labels to their documents manually, which is a slow and expensive process. The JRC software tool JEX can automatically or semi-automatically categorise documents according to the thousands of EuroVoc classes in 22 official EU languages, significantly improving speed and efficiency while ensuring consistent classification.
Automatic indexing presents significant challenges because EuroVoc has thousands of classes and each document belongs, on average, to six of them. The innovative 'profile-based category ranking' method developed by JRC researchers allows the software to learn from previously manually indexed documents and to predict the most appropriate classes for new documents. JEX learns which words in documents belonging to a certain class are particularly typical for that class; if many of these words are then found in a new document, there is a good chance that the new document belongs to the same class.
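The idea behind this kind of ranking can be illustrated with a minimal Python sketch. It is not the JEX implementation: the function names (`tokenize`, `train_profiles`, `rank_classes`), the log-ratio word weighting and the crude whitespace tokenisation are all assumptions made for illustration. Each class receives a profile of words that are more frequent in its manually indexed documents than in the collection as a whole, and a new document is scored against every profile.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Crude tokenisation (assumption): lower-case, split on whitespace,
    # keep purely alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def train_profiles(indexed_docs, min_weight=0.0):
    """Build one word profile per class.

    indexed_docs: iterable of (text, set_of_class_labels) pairs,
    i.e. documents that were previously indexed manually.
    """
    overall = Counter()
    per_class = defaultdict(Counter)
    for text, labels in indexed_docs:
        words = tokenize(text)
        overall.update(words)
        for label in labels:
            per_class[label].update(words)

    total = sum(overall.values())
    profiles = {}
    for label, counts in per_class.items():
        class_total = sum(counts.values())
        profile = {}
        for word, n in counts.items():
            # Illustrative weight (assumption): log ratio of the word's
            # relative frequency in the class to that in the collection.
            weight = math.log((n / class_total) / (overall[word] / total))
            if weight > min_weight:
                profile[word] = weight  # keep only words typical for the class
        profiles[label] = profile
    return profiles


def rank_classes(text, profiles, top_n=6):
    """Return up to top_n class labels, best match first."""
    doc = Counter(tokenize(text))
    scores = {}
    for label, profile in profiles.items():
        score = sum(profile.get(w, 0.0) * c for w, c in doc.items())
        if score > 0:
            scores[label] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy training data with made-up class labels (not real EuroVoc descriptors).
    training = [
        ("fishing quota for cod and herring fishing fleet", {"fisheries"}),
        ("milk quota for dairy farms and cattle", {"agriculture"}),
        ("fishing vessels and dairy exports trade", {"fisheries", "agriculture", "trade"}),
    ]
    profiles = train_profiles(training)
    print(rank_classes("new cod fishing rules for the fleet", profiles))
```

The default `top_n=6` simply mirrors the average of six classes per document mentioned above. Retraining in this sketch amounts to calling `train_profiles` again on the enlarged collection of manually indexed documents.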
JEX can be used as a fully automatic system or as an interactive application in which the librarian can correct the automatic results, benefiting both from the machine's speed and consistency and from the human specialist's accuracy. Users can train and periodically re-train the software simply by providing their own (ever-growing) text collections that have previously been indexed manually.
JEX is freely available for download at http://langtech.jrc.ec.europa.eu/Eurovoc.html
The software can also be used as a component of further multilingual Language Technology applications, for example to detect document translations or plagiarised text. By helping the development of Language Technology applications for all EU languages, this software tool contributes to the European Commission’s general effort to support multilingualism and the re-use of Commission information.
Other recently released JRC software tools and resources, such as 'JRC-Names' for automatic recognition of names in texts and the 'DGT-Translation Memory' for the Acquis Communautaire, are available at http://langtech.jrc.ec.europa.eu/.
JRC's language tools are highly multilingual: they cover a wide range of languages beyond the most commonly used ones, including for example Estonian, Hungarian, Lithuanian, Maltese, and Slovak.
EuroVoc currently exists not only in 22 official EU languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish), but also in Basque, Catalan, Croatian, Russian and Serbian. Further non-official translations exist.
Further information on EuroVoc can be found at: http://eurovoc.europa.eu
META Prize 2012
The Multilingual Europe Technology Alliance (META) has recognised JRC's Activities on Language Technology with the META Prize 2012 at the META-FORUM 2012 in Brussels on 20 June 2012. The META Prize is awarded to outstanding products, services and organisations that actively contribute to the European Multilingual Information Society.