N-gram services might sound like messaging systems, but they could be a key part of building the intelligent web. Microsoft has now made petabytes of data available in a public beta.
One of the most valuable things on the web is the huge amount of natural language data that every web page and document represents. Before the web, natural language researchers had limited access to the raw material they were studying - a few digitised books and other special resources. This frustrated any purely statistical analysis and prediction of meaning.
Today the web provides examples of language in all its forms and functions, and the statistical approach to many tasks that are simply too difficult to do analytically is a real possibility. For example, machine translation can be improved by looking for the occurrence of the translated phrase in all valid examples of text in the language.
The statistics you need if you are going to work with on-line text as a language resource start out simple, with just counts of how many times a given word occurs. This is useful, but not very useful. Much better is a count of how many times a particular word pairing occurs, how many times a particular three-word sequence occurs and so on - these are the N-gram frequencies, and hence the name of the service.
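To make the idea concrete, here is a minimal sketch in Python - not tied to the Microsoft service or any particular API - of how N-gram frequencies are computed from raw text. A real system would tokenize far more carefully than a simple whitespace split:

```python
from collections import Counter

def ngram_counts(text, n):
    # Lowercase and split on whitespace - a deliberately naive tokenizer.
    words = text.lower().split()
    # Slide a window of length n across the word list and count each tuple.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Bigram (order-2) counts for a tiny sample text.
counts = ngram_counts("the cat sat on the mat and the cat slept", 2)
print(counts[("the", "cat")])  # the pair "the cat" occurs twice -> 2
```

The same function with n set to 3, 4 or 5 gives the higher-order counts an order-5 service deals in, just at web scale rather than toy scale.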
The Microsoft N-gram web service is an XML-based service created by Microsoft Research and Bing. The raw data is provided by the Bing search engine, which parses and tokenizes the text of hundreds of billions of web pages. The service offers N-grams up to order 5, based on different sections of the web documents - title, body, anchor text and so on.
The Microsoft N-gram data isn't the only offering; Google, for example, also provides a set of CDs, Google's N-Gram Corpus. The Microsoft N-gram Service is, however, a web service based on all of the documents indexed by Bing, which makes it possible to consider using the real-time information to work with the dynamics of the web. The statistical models used are also smoothed, which minimises the distorting effect of rarely occurring n-grams. Possible uses for the service include: predictive text entry, spelling correction, dealing with ungrammatical sentences, language segmentation, word breaking, translation - and, of course, novel applications waiting to be thought up by users of the service.
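Smoothing matters because most word sequences never occur even in a very large corpus, so their raw relative frequency is zero - useless for ranking candidate phrases. As a toy illustration of the principle, here is simple add-one (Laplace) smoothing; the actual service uses more sophisticated smoothed models, but the idea of giving unseen n-grams a small non-zero probability is the same:

```python
from collections import Counter

def laplace_prob(counts, ngram, vocab_size):
    # Add-one smoothing: every possible n-gram, seen or not, gets one
    # extra pseudo-count, so unseen n-grams score small but non-zero.
    total = sum(counts.values())
    return (counts[ngram] + 1) / (total + vocab_size)

# Tiny hand-built bigram table (3 observations, assumed vocabulary of 10).
counts = Counter({("the", "cat"): 2, ("cat", "sat"): 1})
seen = laplace_prob(counts, ("the", "cat"), vocab_size=10)     # (2+1)/(3+10)
unseen = laplace_prob(counts, ("cat", "flew"), vocab_size=10)  # (0+1)/(3+10)
```

Without smoothing, `unseen` would be exactly zero and a spelling corrector or translator could never prefer a plausible-but-unobserved phrase over gibberish.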
Until recently access to the beta was by invitation only, but it has now been widened to members of accredited colleges and universities worldwide. This isn't quite as public as you might like, but you can always ask for access if you have a good idea you want to follow up. Natural language processing may be a great game for academics to play, but it also has huge commercial potential for anyone who can make it work - and making it work using a statistical approach needs data on the scale that Microsoft has just provided.
More information: Web N-gram Services.
For a demo: Multi-word Tag Cloud from Government Dataset Titles