Google Helps Tell An Apple From Apple
Written by Alex Armstrong   
Saturday, 09 March 2013

One of the big problems of working with language is that a word can mean more than one thing. For example, apple might mean a fruit or a computer company. Humans are generally very good at working out, or "disambiguating" the meaning of words but now we need AI to be just as good.

Google Research has just released the Wikilinks corpus of over 40 million disambiguated "mentions" within over 10 million web pages. This is much bigger, over 100 times, what has been available until now.  A "mention" is a term that has a link to a Wikipedia page. The anchor text of the link can be thought of as being defined or disambiguated by the content of the Wikipedia page.

To get anything much out of this database it has to be processed. For example if different mentions link to the same Wikipedia page, then presumably the web pages are talking about the same entity. You could also build a more detailed definition of an entity by aggregating linked pages. You can even deal with disambiguation directly by, say, finding that Apple links to two separate Wikipedia pages one about fruit and the other about a computer company.

 

disambiguate

 

This is the raw material that can be used to create new AI applications. You can get the data (about 1Gbyte compressed) as a download. The data includes the URL of the web page, the anchor text, the Wikipedia target and a few extras. There are also some tools to help you get started on processing it from UMass Amherst.

The data has more to say about a mention than simply disambiguating its meaning and has a lot of potential to help with general natural language processing. The statistical approach to understanding needs lots and lots of data and this corpus is a big step in the right direction. This is the sort of data that powers Google Translate and could, in time, be used to convert Google's search into a semantic rather than an ad-hoc collection of signals.

More Information

Google’s Wikilinks Corpus

Tools from UMass site

Learning from Big Data: 40 Million Entities in Context

Related Articles

Microsoft Web N-gram Services go public

Handbook of Natural Language Processing (2e)

Grants Awarded To Kivy and NLTK To Boost Python 3

Strongsteam - an online AI API

 

To be informed about new articles on I Programmer, install the I Programmer Toolbar, subscribe to the RSS feed, follow us on, Twitter, Facebook, Google+ or Linkedin,  or sign up for our weekly newsletter.

 

blog comments powered by Disqus

 

Banner


NFC For Man's Best Friend
01/04/2014

Barclalycard has unveiled plans for PayWag, a payment chip that is discretely embedded into a dog's collar to allow it make small purchases. Does it mean you can stay in bed while the dog collects, an [ ... ]



Bredan Eich Leaves Mozilla
04/04/2014

Last week Brendan Eich was appointed CEO of Mozilla. Yesterday he resigned that position and left the company he co-founded 15 years ago, saying "I will be taking time before I decide what to do next. [ ... ]


More News

Last Updated ( Saturday, 09 March 2013 )
 
 

   
RSS feed of news items only
I Programmer News
Copyright © 2014 i-programmer.info. All Rights Reserved.
Joomla! is Free Software released under the GNU/GPL License.