Not all news APIs are the same: what are the key distinguishing factors?
By: Andrii Elyiv & Nikhil Aggarwal
World media produce thousands of news items per hour in various languages. Many are original, some are aggregated from social networks, but most are simply republished from other news sources with minor edits. News items carry different attributes, metadata, entities and keywords, and can be written with different levels of sentiment and objectivity. To manage this volume of unstructured data, news Application Programming Interfaces (APIs) were developed so that end users can retrieve events, topics and other useful information from news in a well-organized, clear form, consistently over time.
Applying web scrapers directly to news sources runs into many challenges and constraints when the goal is reliable data. News RSS feeds have become less popular, and many sources have limited their support for them. News APIs are widely used by developers, data analysts, data scientists and NLP engineers. They are usually complemented by text analysis APIs, which extract valuable information from news through language detection, ad hoc summaries, keyword generation, etc.
We evaluated five of the top news APIs worldwide in 2020 based on their technical features and main endpoints.
We do not discuss pricing or result formats here.
Event Registry is a media intelligence platform that analyzes current and archived news content. To collect news articles, Event Registry uses RSS feeds from around 30,000 sources in 39 languages. Besides the main national media, Event Registry also covers local and minor news sources. Event Registry’s approach revolves around the “Concept”: a term that represents either an entity (person, location, organization) or a non-entity (things, etc.) and is identified by its Wikipedia page. Each concept may have names in different languages, synonyms, images and descriptions. Since Wikipedia pages about the same thing in different languages are linked, the concept acts as the bridge between languages for Event Registry. They also introduce a score for how strongly a given news article is associated with a mentioned concept. This is a kind of similarity between the article and the corresponding Wikipedia page.
They provide news categorization based on the hierarchical DMOZ ontology, using three levels with 50,000 categories, for English only. Event Registry uses clustering to group news about the same event into stories. Each story contains articles in a single language and has attributes such as language, title, summary, date, concepts and categories.
At the top level is the “Event”, a collection of one or more stories that report the same world event. Events are a kind of “agglomerate of clusters”: their stories can be in different languages, linked by concepts via Wikipedia pages. Each event provides a title and summary (in all available languages), date, location, list of stories, article count, list of concepts, categories and frequently mentioned dates. Event Registry also provides daily trends for concepts and categories.
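The story/event hierarchy described above can be pictured with a small sketch. It is only an illustration under stated assumptions, not Event Registry's actual algorithm: the Story and Event classes, the Jaccard overlap measure and the 0.5 threshold are all hypothetical choices made here to show how language-independent concepts could link single-language stories into one cross-lingual event.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Story:
    """A cluster of same-language articles about one event (hypothetical model)."""
    story_id: str
    language: str
    concepts: Set[str]  # language-independent concept IDs, e.g. Wikipedia titles


@dataclass
class Event:
    """A group of stories, possibly in different languages."""
    stories: List[Story] = field(default_factory=list)

    @property
    def concepts(self) -> Set[str]:
        # Union of the concepts of all stories in the event.
        merged: Set[str] = set()
        for story in self.stories:
            merged |= story.concepts
        return merged


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two concept sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_into_events(stories: List[Story], threshold: float = 0.5) -> List[Event]:
    """Greedily attach each story to the first event with enough shared concepts."""
    events: List[Event] = []
    for story in stories:
        for event in events:
            if jaccard(story.concepts, event.concepts) >= threshold:
                event.stories.append(story)
                break
        else:
            # No existing event is similar enough: start a new one.
            events.append(Event(stories=[story]))
    return events
```

With this sketch, an English and a French story that share most of their concepts end up in the same event, while a story with unrelated concepts starts an event of its own.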
Event Registry reports the social impact of news, i.e. how many times an article has been shared on social networks. They also provide a search tool over news, stories and events that supports keyword queries with complex logic. News articles in Event Registry have the following attributes: title, text, date, time, source, image, list of concepts, categories and extracted dates. A source has a title, description, geolocation and importance. Event Registry’s text analysis tools perform sentiment detection and language detection for news and events.
The Aylien news API offers enriched news data as a service. It collects, analyzes, aggregates and searches news content from across the globe in real time from thousands of sources.
The basic endpoints of Aylien are:
Stories - news articles enriched with NLP metadata,
Clusters - sets of news items about a common event,
Time Series - visualize and detect spikes in story volumes over time,
Trends - quantitative analyses on news content,
Autocompletes - search functions and features,
Related Stories - semantically similar or related stories,
Coverages - track how often a story is covered in the media.
One cluster in Aylien corresponds to one event or topic.
A story always belongs to exactly one cluster, and the relationship between a story and its cluster does not change over time. Aylien makes it possible to monitor news relevance and follow “breaking” events, and provides real-time monitoring of stories trending online.
Aylien extracts sentiment from a piece of text such as a tweet, a review or an article, i.e. whether the tone is positive, neutral or negative. Classification into subjective and objective text works for tweets, at the sentence level. Here, subjective means the text reflects the author’s opinion, and objective means it expresses a fact.
To work with entities (people, dates, organizations, places, products, links, telephone numbers, email addresses, currency amounts and percentages), Aylien uses DBpedia. They apply categorization with hierarchical taxonomies, and concept extraction using Wikipedia as context.
Aylien suggests hashtags for news to help it get more exposure on social media. Aylien also offers news summarization, classification and language detection. To sort query results, they offer several rankings: relevance, recency, hotness, social impact, number of photos or videos in the news item, and Alexa ranking.
For extracted news they show title, text, author, image, video, RSS feed, publish date and keywords. Aylien provides Image Tagging to associate images with text. Aylien works with 16 major world languages. Classification, Entity Extraction, Concept Extraction and Sentiment are applied directly to only 6 of them (en, de, fr, it, es, pt). For the others (ar, da, fi, nl, no, ru, sv, tr, zh-cn, zh-tw), the text is first translated into English.
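The language split above amounts to a simple routing rule. The following is a minimal sketch of that rule, assuming only the language codes listed in this article; the function name and the route labels are hypothetical and are not part of Aylien's API.

```python
# Language codes as listed in the article; everything else is illustrative.
NATIVE = frozenset({"en", "de", "fr", "it", "es", "pt"})
VIA_TRANSLATION = frozenset(
    {"ar", "da", "fi", "nl", "no", "ru", "sv", "tr", "zh-cn", "zh-tw"}
)


def analysis_route(lang_code: str) -> str:
    """Return how a text in the given language would be analyzed."""
    code = lang_code.strip().lower()
    if code in NATIVE:
        return "native"
    if code in VIA_TRANSLATION:
        return "translate-to-english"
    return "unsupported"
```

For example, analysis_route("de") gives "native", while analysis_route("zh-TW") gives "translate-to-english". Together the two sets cover the 16 supported languages.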
Connexun describes itself as the ultimate News Intelligence Engine with a focus on international news. It aggregates news in various languages by crawling the web in real time from more than 20,000 trusted sources. The average pipeline throughput is about 100,000 processed news items per day.
It places emphasis on the origin of media sources and on classifying news by country. A great deal of attention is also given to local news about cities or regions. Connexun performs multi-language online clustering for 8 languages (en, uk, it, hi, pa, pl, ru, es), meaning that a single topic can be composed of news in different languages.
News labels are kept for a long time so that topic evolution can be followed over time. Connexun clustering uses not only the text but also the images of news items, i.e. it is multi-modal. It also includes a topic ranking system that depends on how many unique sources published news about a given topic, from how many unique countries and in how many unique languages. This helps avoid bias toward particular sources and countries in topic trends.
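To make the idea of diversity-based topic ranking concrete, here is a minimal sketch. The article only says that the ranking depends on the number of unique sources, countries and languages; the weighted sum, the default weights and all names below are assumptions made for illustration, not Connexun's actual formula.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Article:
    source: str    # publisher identifier
    country: str   # country of the source
    language: str  # language of the article


def topic_diversity(articles: Iterable[Article]) -> Tuple[int, int, int]:
    """Count unique sources, countries and languages covering a topic."""
    items = list(articles)
    return (
        len({a.source for a in items}),
        len({a.country for a in items}),
        len({a.language for a in items}),
    )


def topic_score(
    articles: Iterable[Article],
    w_source: float = 1.0,
    w_country: float = 1.0,
    w_language: float = 1.0,
) -> float:
    """Hypothetical score: weighted sum of the three diversity counts."""
    sources, countries, languages = topic_diversity(articles)
    return w_source * sources + w_country * countries + w_language * languages
```

Because only unique values are counted, a topic repeated many times by one outlet scores lower than a topic covered once each by several outlets in different countries, which is the bias-avoiding property described above.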
An interesting feature of Connexun is summarization using algorithms designed specifically for news articles. It can generate so-called dynamic summaries, which contain entities related to a specific case.
Connexun periodically updates an intercountry index, which measures the level of mutual mentions between countries and is relevant for monitoring relations between them. Connexun provides world trending news, as well as trending news about a given country or from sources published in a given country. The Connexun news API also extracts specific entities such as cities, airports and embassies/consulates.
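One way to picture an index of mutual mentions is sketched below. The article does not say how Connexun computes its intercountry index, so this is purely an assumption: each article is reduced to its source country and the set of countries it mentions, and the mutual level for a pair is taken as the smaller of the two directed counts.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def mention_counts(articles: Iterable[Tuple[str, Set[str]]]) -> Counter:
    """Count directed mentions: (source country, mentioned country) -> articles.

    Each article is given as (source_country, mentioned_countries).
    """
    counts: Counter = Counter()
    for source_country, mentioned in articles:
        for country in set(mentioned):
            if country != source_country:  # ignore self-mentions
                counts[(source_country, country)] += 1
    return counts


def mutual_index(counts: Counter, a: str, b: str) -> int:
    """Hypothetical mutual level: the weaker of the two mention directions."""
    return min(counts[(a, b)], counts[(b, a)])
```

Taking the minimum means a pair only scores highly when both countries' media mention each other; a sum or a normalized ratio would be equally plausible design choices.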
An important part of Connexun text analysis is Short Text Geoparsing, which links any word, phrase or short text with a list of countries.
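As a rough illustration of what geoparsing a short text involves, the sketch below looks up each word in a tiny hand-made gazetteer. The gazetteer entries, the function name and the single-word matching are all simplifying assumptions; a production system would need a far larger gazetteer, multi-word place names and disambiguation.

```python
import re
from typing import Dict, List

# Tiny illustrative gazetteer: place name -> ISO country codes (hypothetical data set).
GAZETTEER: Dict[str, List[str]] = {
    "milan": ["IT"],
    "rome": ["IT"],
    "paris": ["FR"],
    "pyrenees": ["ES", "FR", "AD"],  # a place can map to several countries
}


def geoparse(text: str) -> List[str]:
    """Return the countries linked to known place names, in order of appearance."""
    countries: List[str] = []
    for token in re.findall(r"[a-z]+", text.lower()):
        for code in GAZETTEER.get(token, []):
            if code not in countries:  # keep each country once
                countries.append(code)
    return countries
```

With this gazetteer, geoparse("Flights from Milan to Paris") links the text to Italy and France, and a text with no known place names yields an empty list.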
Connexun provides topic search based on the textual content of the news, displaying all articles related to a specific topic or key term. Three main keywords or phrases are extracted from each news item in its original language, and for the whole topic in English. Connexun provides sentiment analysis for news, classifying it as positive, neutral or negative with a corresponding score.
NewsAPI was one of the very first companies in the API industry to deal with news. It provides access to breaking world news headlines and article search across more than 50,000 news sources. It tracks headlines from sources in over 50 countries, grouped into 7 distinct categories (business, entertainment, general, health, science, sports, technology).
NewsAPI allows searching for news that mentions a specific topic or keywords within the last 24 months. Complex queries can retrieve data from multiple sources and information sites with a single request. NewsAPI’s main endpoint provides live top and breaking headlines for a specific country, for a specific category in a country, or from given source(s).
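A request to the headlines endpoint can be sketched as follows. This assumes NewsAPI's publicly documented v2 top-headlines endpoint with its country, category, sources and apiKey query parameters, and the documented restriction that sources cannot be combined with country or category; check the current documentation before relying on it. The helper function itself is hypothetical and only builds the URL, without sending a request.

```python
from typing import List, Optional
from urllib.parse import urlencode

# Assumed endpoint, taken from NewsAPI's public v2 documentation.
BASE_URL = "https://newsapi.org/v2/top-headlines"


def top_headlines_url(
    api_key: str,
    country: Optional[str] = None,
    category: Optional[str] = None,
    sources: Optional[List[str]] = None,
) -> str:
    """Build a top-headlines request URL (hypothetical helper, no network call)."""
    if sources and (country or category):
        # Assumed API rule: sources cannot be mixed with country/category.
        raise ValueError("use either sources or country/category, not both")
    params = {}
    if country:
        params["country"] = country
    if category:
        params["category"] = category
    if sources:
        params["sources"] = ",".join(sources)
    params["apiKey"] = api_key
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, top_headlines_url("KEY", country="us", category="business") produces a URL for live business headlines from the United States, which can then be fetched with any HTTP client.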
Articles are sorted by date or by the relevance of the source. For the major sources they index name, description and category. They work with the following 14 languages: ar, de, en, es, fr, he, it, nl, no, pt, ru, se, ud, zh. Each news article has the following attributes, if available: source, author, title, a description or snippet from the article, URL of the article, URL of the image, date of publication, and the unformatted content of the article truncated to 200 characters.
ContextualWeb uses cutting-edge search technology that originated in the field of neuroscience. It is the 3rd largest search engine in the world by number of indexed webpages. It searches over 100,000 news sources spread all over the world and features the latest news articles and blog posts.
ContextualWeb’s APIs include computationally efficient news and image search engines. Their search engine is based on the inverted indexing scheme, but bypasses its intrinsic inefficiency: ContextualWeb indexes and retrieves webpages without intersecting lists, by implementing the Hippocampal Memory Indexing approach. They also perform entity extraction and extract primary keywords from news, taking into account the context of the whole article.
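For readers unfamiliar with the inverted indexing scheme mentioned above, here is a minimal sketch of the classical version, including the list-intersection step that ContextualWeb says it avoids. It does not show the Hippocampal Memory Indexing approach, whose details are not given in this article; the function names and tokenization are illustrative choices.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the set of document IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return IDs of documents containing all query terms (AND semantics)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    # Intersect the posting lists, smallest first. This intersection is the
    # costly step for multi-term queries over very large indexes.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result
```

Every additional query term adds another posting list to intersect, which is the kind of inefficiency the paragraph above refers to.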
The news APIs described above offer different kinds of endpoints, varying in quantity and quality. The advantages of Event Registry, NewsAPI and ContextualWeb are the huge number of processed news sources and their effective search tools. Event Registry, Connexun, NewsAPI and Aylien let partners work with different languages.
A wide palette of text analysis tools is offered by Event Registry, Connexun and Aylien. Connexun stands out with unique features such as: classification of news topics by country or countries, fast multi-language clustering using both text and images, a topic ranking system for news about or from each country in the world, a summarizer designed specifically for news, various kinds of entity extraction, and Short Text Geoparsing, which links any word, phrase or short text with a list of countries.