Science will soon speak one language (fiction)

January 28, 2013

A fictional article I wrote for the Swiss Federal Institute of Technology in Lausanne (EPFL).

EPFL will join forces with the search-engine giant Google and other scientific partners to develop a translation tool for scientific literature. EPFL’s Probabilistic Machine Learning Lab will be part of a global initiative called linguaSCIENCE, which aims to let scientists translate scientific papers from English into their native language and vice versa. The tool is intended to overcome the language barrier often faced by scientists from non-English-speaking countries, and thus to promote global scientific collaboration.

[Figure: facebook map]

Collaboration lost in translation?

Science has always been about collaboration. The globalisation of science has resulted in scientists from all over the world working with each other to add to the body of scientific knowledge. It is now almost the norm for scientific papers to have multiple authors of diverse nationalities. However, the extent of this collaboration may be overestimated.

In the spirit of the well-circulated Facebook friendship map by Paul Butler, research analyst Olivier Beauchesne at Science-Metrix examined scientific collaboration from 2005 to 2009 by extracting and aggregating data on collaboration between cities all over the world (see fig.). Looking at the map, there doesn’t seem to be much collaboration outside of the United States and Europe. Beauchesne is unsure whether that is because of a limited dataset or because there really is little collaboration in those areas. Could language be one of the contributing factors?

Machine learning to the rescue

If so, “statistical machine learning” (SML) could help prevent science from being lost in translation. SML is the process of seeking patterns in large amounts of text. It has already helped Google carve a niche for itself in translation with Google Translate, a free translation service that provides instant translations between 64 different languages.

When Google Translate generates a translation, it looks for patterns in hundreds of millions of documents to help decide on the best translation. By detecting patterns in documents that have already been translated by human translators, Google Translate can make intelligent guesses as to what an appropriate translation should be. Examples of human-translated documents used include those produced by the United Nations and the European Parliament.
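The idea of guessing a translation from patterns in already-translated text can be sketched in a few lines. This is only a toy illustration, not how Google Translate is implemented: the three-sentence corpus and the function names `cooccurrence_table` and `best_guess` are hypothetical, and real systems use far larger corpora and proper alignment models rather than raw co-occurrence counts.

```python
from collections import Counter, defaultdict

# Hypothetical toy parallel corpus: (English, French) sentence pairs
# standing in for human-translated documents.
PARALLEL = [
    ("the cell", "la cellule"),
    ("the membrane", "la membrane"),
    ("a cell", "une cellule"),
]

def cooccurrence_table(pairs):
    """Count how often each source word appears alongside each target word."""
    table = defaultdict(Counter)
    for source, target in pairs:
        for s in source.split():
            for t in target.split():
                table[s][t] += 1
    return table

def best_guess(table, word):
    """Return the target word that co-occurs most often with `word`."""
    return table[word].most_common(1)[0][0]

table = cooccurrence_table(PARALLEL)
print(best_guess(table, "cell"))  # seen twice with "cellule", once with "la" and "une"
print(best_guess(table, "the"))   # seen twice with "la"
```

A word such as "membrane", which occurs in only one sentence pair here, co-occurs equally often with every word in that sentence, so the counts alone cannot single out its translation.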

However, SML has its limitations. The more human-translated documents that Google Translate can analyse in a specific language, the better the translation quality will be. This is why translation accuracy will sometimes vary across languages.

EPFL is part of the solution

EPFL’s role in the project is to help improve translation accuracy for languages in which relatively few human-translated documents are available, particularly in the area of scientific literature. EPFL’s expertise in Bayesian inference (a principled way of combining new evidence with prior information) could hold the key to improving translation accuracy. Bayesian inference is already being applied in fields as diverse as spam filtering and the analysis of evidence in a courtroom.
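To make "combining new evidence with prior information" concrete, here is a minimal sketch of Bayes' rule applied to the spam-filtering case. The function name `posterior` and all of the numbers are assumptions chosen for illustration; they do not come from any real filter.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a yes/no hypothesis H: returns P(H | evidence)."""
    # Total probability of seeing the evidence, whether or not H holds.
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Hypothetical numbers: 20% of mail is spam (the prior); the word "prize"
# appears in 40% of spam messages and in 1% of legitimate ones (the evidence).
p_spam = posterior(prior=0.2, likelihood_if_true=0.4, likelihood_if_false=0.01)
print(round(p_spam, 3))  # 0.909
```

Under these assumed figures, a single observed word moves the belief that a message is spam from 20% to about 91%.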

“Applying Bayesian inference to statistical machine learning could give global scientific collaboration a turbo boost”, says Michael Singer of EPFL’s Probabilistic Machine Learning Lab. Singer adds that Bayesian inference has been shown to outperform standard methods (such as the expectation-maximization method), especially in the scientific domain. This is because standard methods take into account only the single most likely set of word-translation probabilities and ignore the contributions of other plausible values. Bayesian inference provides a bigger picture and better results in the long term, which could make all the difference when translating languages that lack a significant body of scientific literature. Bayesian inference and EPFL could thus make a valuable contribution to improving the accuracy of linguaSCIENCE and overcoming the language barrier in science.
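The contrast between a single most-likely estimate and a Bayesian one can be illustrated with a deliberately simplified sketch. It is not the method of the reference below or of any real system: it compares relative frequencies with the mean of a Dirichlet posterior for one rare source word, and the counts, the vocabulary, and the prior strength `alpha` are all hypothetical.

```python
def point_estimate(counts):
    """Maximum-likelihood estimate: relative frequencies of observed words only."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def posterior_mean(counts, vocabulary, alpha=0.5):
    """Mean of the Dirichlet posterior under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocabulary}

# Hypothetical alignment counts for one rare source word seen three times.
counts = {"cellule": 2, "pile": 1}
vocabulary = ["cellule", "pile", "cellulaire", "batterie"]

mle = point_estimate(counts)                # unseen translations get no probability
bayes = posterior_mean(counts, vocabulary)  # unseen translations keep some mass

print(mle)
print(bayes)
```

With only three observations, the point estimate rules out every translation it has not yet seen, whereas the posterior mean averages over all plausible probability values and keeps a small probability for them. That difference matters most when data are scarce.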

REFERENCES

Mermer C and Saraclar M (2011). Bayesian Word Alignment for Statistical Machine Translation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 2, pages 182-187. ISBN: 978-1-932432-88-6.
