AI Identifies Who Was Important In History
Written by Sue Gee   
Wednesday, 20 October 2021

A new algorithm is being used to search historic documents to discover who were the influential people in bygone days. Is this yet another example of AI expanding the horizons of our knowledge?

Today it is social media that defines who is important - he who tweets most frequently outranks those of greater merit. Things were not very different in the eighteenth and nineteenth centuries, but the names of many noteworthy personages have never appeared in the "official" historical sources. Now a paper accepted for Decision Support Systems outlines an algorithm that can help discover important people from old newspapers. It is co-authored by Haimonti Dutta, assistant professor in the Department of Management Science and Systems at the University at Buffalo, and Aayushee Gupta, research scholar in the Department of Computer Science at the International Institute of Information Technology Bangalore.

The problem with using scanned documents from the period is that their text is often "messy". As Dutta puts it:

“It’s a known fact that when OCR software is run, very often the text gets garbled. With old newspapers, books and magazines, problems can arise from poor ink quality, crumpled or torn paper, or even unusual page layouts the software isn’t expecting.”

To develop an algorithm that could automatically identify important people, the researchers partnered with the New York Public Library (NYPL) and analyzed more than 14,000 articles from the New York City newspaper The Sun, published during November and December of 1894. The NYPL has scanned more than 200,000 newspaper pages as part of Chronicling America, an initiative of the National Endowment for the Humanities and the Library of Congress that is working to develop an online, searchable database of historical newspapers from 1777 to 1963.

Their algorithm ranks people’s names by importance based on a number of attributes, including the context of the name, the title before the name, the article length and how frequently the name is mentioned in an article. It works in three steps: first it populates a list of person names using out-of-the-box Named Entity Recognition software, then it extracts content-based features for the identified entities, and finally it ranks them using a novel unsupervised ranking algorithm based on Kernel Density Estimation. This generative model learns rankings from the data distribution and therefore requires limited manual intervention.
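To make the shape of such a pipeline concrete, here is a minimal Python sketch. It is not the authors' PNRank algorithm, whose details are not given here. It assumes the Named Entity Recognition step has already produced a list of mentions, it uses only three of the features mentioned above, and it swaps in a simple stand-in for the ranking step: each feature is scored by a Gaussian kernel estimate of its cumulative distribution and the scores are averaged. The function names, the bandwidth rule and the sample data (including the counts and article lengths attached to Chauncey Depew) are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def extract_features(mentions):
    """Aggregate per-name features from (name, has_title, article_words) mentions."""
    grouped = defaultdict(list)
    for name, has_title, article_words in mentions:
        grouped[name].append((has_title, article_words))
    features = {}
    for name, rows in grouped.items():
        features[name] = (
            float(len(rows)),                          # how often the name is mentioned
            mean(1.0 if t else 0.0 for t, _ in rows),  # share of mentions with a title
            mean(float(w) for _, w in rows),           # mean article length in words
        )
    return features


def bandwidth_for(sample):
    """Normal-reference rule of thumb, with a floor for constant samples."""
    return max(1.06 * pstdev(sample) * len(sample) ** -0.2, 1e-6)


def kde_cdf(x, sample, bandwidth):
    """Gaussian-kernel estimate of P(X <= x) for a one-dimensional sample."""
    scale = bandwidth * math.sqrt(2.0)
    return mean(0.5 * (1.0 + math.erf((x - s) / scale)) for s in sample)


def rank_names(mentions):
    """Rank names by the average smoothed percentile of their features (assumed scheme)."""
    features = extract_features(mentions)
    names = list(features)
    n_features = 3
    scores = dict.fromkeys(names, 0.0)
    for j in range(n_features):
        column = [features[n][j] for n in names]
        h = bandwidth_for(column)
        for n in names:
            scores[n] += kde_cdf(features[n][j], column, h) / n_features
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical mentions: (name, title present?, article length in words).
sample_mentions = [
    ("Chauncey Depew", True, 900),
    ("Chauncey Depew", True, 1200),
    ("Chauncey Depew", False, 700),
    ("John Smith", False, 150),
    ("Mary Jones", False, 200),
    ("Mary Jones", True, 300),
]

ranking = rank_names(sample_mentions)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

No labeled examples of "important" people are needed: the scores come entirely from where each name sits in the distribution of the data, which is the sense in which an approach like this is unsupervised.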

The focus of the paper was to examine the effect of the noise introduced by garbled text on the operation of the algorithm. The researchers compared the raw text produced by OCR software with a set of the same documents that had been manually cleaned, and found that the ranking algorithm was able to sort people’s names with a high degree of precision, even from the noisy OCR text.
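One common way to quantify this kind of comparison is precision at k: the fraction of the top k names from the noisy ranking that also appear in the top k of the clean ranking. The metric actually used in the paper is not stated here, so the Python sketch below is only an assumption about how such a check might look, and the names in it are hypothetical.

```python
def precision_at_k(reference, candidate, k):
    """Fraction of the candidate's top-k names that also appear in the reference top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_reference = set(reference[:k])
    hits = sum(1 for name in candidate[:k] if name in top_reference)
    return hits / k


# Hypothetical rankings: one from manually cleaned text, one from raw OCR output
# in which a name has been garbled ("Mary J0nes").
clean_ranking = ["Chauncey Depew", "Mary Jones", "John Smith", "Ann Lee"]
noisy_ranking = ["Chauncey Depew", "John Smith", "Mary J0nes", "Ann Lee"]

print(precision_at_k(clean_ranking, noisy_ranking, k=3))
```

In this toy case the garbled name counts as a miss, which shows how OCR errors can lower the score even when the ranking itself is sound.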


Important historical figures identified by the algorithm include Grand Duke George Alexandrovich of Russia (1871-1899); Captain William Bainbridge-Hoff (1774-1833), a commodore in the U.S. Navy; Fanny Gordon (1837-?), wife of Confederate General John Brown Gordon; and Chauncey Mitchell Depew (1834-1928), attorney, businessman and Republican politician. 

Claiming that this process has wide-reaching implications for discovering important people throughout history, Dutta stated: 

“We recently used this technique on African American literature from the Civil War to learn more about the important people during the era of slavery. Going forward, we’ll be expanding the technique to examine relationships between people and build out the social networks of the past.”  

The methodology may need some refining, but the general approach represents a paradigm shift from relying on established knowledge bases to using a wide range of more informal documentation to understand how society really operated in the past.


More Information

Data mining the past

PNRank: Unsupervised ranking of person name entities from noisy OCR text

Related Articles

Colorization Of Early Films Good or Bad?

Google AI Recreates Lost Klimt Artworks

Face Recognition Applied to Portraits

Image Processing Reveals the Young Leonardo


Last Updated ( Wednesday, 20 October 2021 )