How we can source reliable news: tackling the ‘disinfodemic’ crisis

Supervised machine learning vs deep learning architecture

By: Andrii Elyiv, Nikhil Aggarwal & Aldo Visibelli

One trend exacerbated by the spread of the Internet, and of social media in particular, is the broader reach of fake news. On 15 April 2020, a report by the nonprofit group Avaaz labelled Facebook an “epicenter of coronavirus misinformation” and cited numerous posts containing dangerous health advice and fake cures. The company pushed back on this accusation, saying it had removed a large number of pieces of misinformation in the preceding weeks. For less concerning content, it cited statistics suggesting that warning labels have a real effect (see image above).

The United Nations recently wrote on its news portal (UN News) that unreliable and false information is spreading around the world to such an extent that some commentators now refer to the avalanche of misinformation that has accompanied the COVID-19 pandemic as a ‘disinfodemic’. Fake news is present across a broad range of topics: “There seems to be barely an area left untouched by disinformation in relation to the COVID-19 crisis, ranging from the origin of the coronavirus, through to unproven prevention and ‘cures’, and encompassing responses by governments, companies, celebrities and others.”

Even when fake news isn’t as consequential as often feared, it’s better to be able to spot it. There are various ways to identify misinformation circulating on the web. One option is to question the source. The source is indeed a valid cue: many successful pieces of fake news about Covid-19 circulating on WhatsApp, writes Hugo Mercier in the Guardian, start with “A friend who has an uncle in Wuhan” or “A friend whose dad works at the Centre for Disease Control”.

Assessing and reviewing the sources of a large pool of news and information can, however, be a repetitive and overwhelming task. For this reason there is growing pressure on online publishers and communication channels to find automated, real-time solutions for identifying reliable news. Connexun’s brand new technology seeks to source only reliable news.

First and foremost, Connexun closely screened and examined the list of sources it crawls. Rather than focusing merely on the total number of scraped sources, our technology treats the quality and reliability of the sources as a central tenet. Handpicking the outlets and scrutinizing the quality of their content was a first step towards building a solid pool of sources. The quality of the content of the online publishers under scrutiny is indeed central to the value provided by our news intelligence engine.

Secondly, Connexun’s clustering technology and rankings give visibility to news published by media outlets and online sources from different countries, possibly discussing the same topic in different languages. It is in fact highly unlikely for the same piece of fake news to be published by different sources, in different countries and in different languages. Our robust clusters include news from different sources, belonging to a wide range of countries, in different languages.
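The cross-source reasoning above can be illustrated with a minimal Python sketch. It is not Connexun’s actual implementation: the `Article` fields, the function names and the thresholds are all hypothetical, chosen only to show how the diversity of a cluster could be measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical minimal metadata for one article in a story cluster.
    source: str
    country: str
    language: str


def cluster_diversity(cluster):
    """Count the distinct sources, countries and languages in a cluster."""
    return {
        "sources": len({a.source for a in cluster}),
        "countries": len({a.country for a in cluster}),
        "languages": len({a.language for a in cluster}),
    }


def is_widely_corroborated(cluster, min_sources=3, min_countries=2, min_languages=2):
    """Return True when the cluster spans enough independent outlets.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    d = cluster_diversity(cluster)
    return (
        d["sources"] >= min_sources
        and d["countries"] >= min_countries
        and d["languages"] >= min_languages
    )


# Invented example cluster: three outlets, three countries, three languages.
example_cluster = [
    Article(source="outlet-a", country="IT", language="it"),
    Article(source="outlet-b", country="FR", language="fr"),
    Article(source="outlet-c", country="US", language="en"),
]
print(cluster_diversity(example_cluster))
print(is_widely_corroborated(example_cluster))
```

A story reported by a single outlet would fail this check, which matches the intuition that uncorroborated items deserve more scrutiny.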

Traditionally, fake news has been recognized with natural language processing (NLP) methods based on supervised machine learning. This approach requires training a model on a human-labelled sample of real and fake news on similar topics, so that it can learn clear distinctions between them. The main goal is to find useful features and vectorize them to differentiate fake news from real news. Bag-of-words and Term Frequency–Inverse Document Frequency (TF-IDF) models are commonly used in news classification, where the frequency of occurrence of each word or phrase (n-gram) is used as a feature to train a classifier. The underlying assumption is that fake news has a specific combination and frequency of words. For example, real news uses the verb “said” more often than fake news, because most real journalistic publications quote their sources directly, as in “Prime Minister of Italy said […]”.
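To make the feature extraction step concrete, here is a minimal sketch of TF-IDF over words and two-word phrases, written with the Python standard library only. The function names and the two sample sentences are invented for illustration, and the weighting uses one common variant (relative term frequency times log(N / document frequency)); production systems usually rely on an established library and may use a smoothed formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n_max=2):
    """Return all n-grams from length 1 up to n_max as space-joined strings."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def tfidf_vectors(documents, n_max=2):
    """Map each document to a sparse {n-gram: TF-IDF weight} dictionary."""
    doc_counts = [Counter(ngrams(tokenize(d), n_max)) for d in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())

    vectors = []
    for counts in doc_counts:
        total = sum(counts.values())
        vec = {}
        for term, count in counts.items():
            tf = count / total                       # relative frequency in this document
            idf = math.log(n_docs / doc_freq[term])  # 0 when the term is in every document
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


# Two invented sentences, used only to show the shape of the output.
sample_docs = ["the minister said yes", "the miracle cure works"]
sample_vectors = tfidf_vectors(sample_docs)
for doc, vec in zip(sample_docs, sample_vectors):
    top_terms = sorted(vec, key=vec.get, reverse=True)[:3]
    print(doc, "->", top_terms)
```

Terms shared by every document, such as “the” here, receive a weight of zero, while terms specific to one document stand out. These sparse vectors are what a classifier would then be trained on.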

Classifiers such as Naive Bayes, random forests and support-vector machines (SVM) are typically employed. The accuracy of these models is normally under 90%. Accuracy is the share of true positives (human-labelled fake news that the model classified as fake) and true negatives (human-labelled real news that the model classified as real) among all observations.
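The accuracy definition above can be written out in a few lines of Python. This is a minimal sketch: the function names and the five example labels are invented for illustration, with “fake” treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="fake"):
    """Return (TP, TN, FP, FN) with `positive` treated as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # labelled fake, classified as fake
            else:
                fp += 1  # labelled real, classified as fake
        else:
            if truth == positive:
                fn += 1  # labelled fake, classified as real
            else:
                tn += 1  # labelled real, classified as real
    return tp, tn, fp, fn


def accuracy(y_true, y_pred, positive="fake"):
    """Share of correctly classified items: (TP + TN) / all observations."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty sample")
    return (tp + tn) / total


# Invented labels: human annotations versus model output.
human_labels = ["fake", "fake", "real", "real", "real"]
model_labels = ["fake", "real", "real", "real", "fake"]
print(confusion_counts(human_labels, model_labels))
print(accuracy(human_labels, model_labels))
```

Note that accuracy can be misleading when one class is much rarer than the other, which is why the individual counts are worth inspecting alongside it.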

A more advanced technique is to detect fake news with deep learning architectures, for example Long Short-Term Memory (LSTM) networks, which are a subclass of recurrent neural networks (RNN), Convolutional Neural Networks (CNN) and BERT-based language models. All of them can reach an accuracy level higher than 90%. The most compelling is BERT, created by Google, a model composed of several stacked transformer-encoder blocks. BERT is already pre-trained on large text corpora (books, news archives, Wikipedia), so the user only needs to fine-tune the model to adapt it to a specific task. Fake news classifiers built on a BERT-based model can reach an accuracy level of 97% or higher.

For more information on our news API or our news feed, follow us on LinkedIn or Twitter, or reach out to us at aldo.visibelli@connexun.com.

Connexun is the ultimate AI news engine — turning unstructured news content into multi-purpose actionable data.