Imagine being able to search audio as easily as you search text. Thanks to MAVIS, that is no longer just a dream.
A Microsoft research project is being used to index recordings of speech in a number of technical previews.
The Microsoft Research Audio Video Indexing System (MAVIS) has been running on a trial basis on digital archives for the U.S. states of Georgia, Montana, and Washington, as well as the U.S. Department of Energy, the British Library, and, most recently, CERN, the European Organization for Nuclear Research.
More interestingly for developers, the software components run as a service on Windows Azure, with text search components for SQL Server 2005 and 2008 and client-side PowerShell and .NET tools. This opens up the possibility of building tools into your applications that give your users the means to carry out text searches on digital audio content.
MAVIS is a set of software components that use speech recognition technology to enable searching of digitized spoken content such as presentations, online lectures or recordings of telephone calls or meetings.
The MAVIS UI is a set of aspx pages that can be adapted to suit different applications. The MAVIS client-side tools let you submit audio and video content to the speech recognition application running in the Azure service using an RSS-formatted file, then retrieve the results so they can be imported into SQL Server for full text indexing. This enables the audio and video content to be searched just like any other text.
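The exact manifest schema used by the MAVIS preview tools is not public, but since the article says submission is via an RSS-formatted file, a rough illustration can be sketched as a standard RSS 2.0 feed with one item per media file. The channel title, URLs, and MIME types below are invented for the example:

```python
# Sketch of building an RSS 2.0 submission feed for audio/video content.
# Everything beyond plain RSS (the enclosure pointing at the media file,
# the titles and URLs) is an assumption -- the real MAVIS schema is not public.
import xml.etree.ElementTree as ET

def build_submission_feed(items):
    """items: list of (title, media_url, mime_type) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "MAVIS submission batch"
    for title, url, mime in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Standard RSS enclosure element: points the indexer at the media file.
        ET.SubElement(item, "enclosure", url=url, type=mime, length="0")
    return ET.tostring(rss, encoding="unicode")

feed = build_submission_feed([
    ("Board meeting 2011-03-01", "http://example.org/meeting.wma", "audio/x-ms-wma"),
])
print(feed)
```

Once the recognition results come back, the transcript text would be loaded into a SQL Server table covered by a full-text index, at which point ordinary full-text queries apply to the spoken content.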
According to Microsoft Research, MAVIS enables search not only within audio files but also within video. Footage from meetings, presentations, online lectures, and other typically non-closed-captioned content all benefits from a speech-based approach.
MAVIS is currently a research project with a limited technical preview program. If you have deployed Microsoft SQL Server, have large speech archives, and are interested, you can contact Microsoft Research to join the technical preview; the details are on the MAVIS website. The rest of us will have to wait for the tools to be made more publicly available.
One key feature of MAVIS is a technique developed by researchers at Microsoft Research Asia called Probabilistic Word-Lattice Indexing, which improves accuracy when indexing conversational speech. Lattice indexing records the system's confidence rating for each recognized word along with the alternate recognition candidates.
“When we recognize the audio track of a video,”
Microsoft Research Asia’s Frank Seide, senior researcher and research manager explains,
“we keep the alternatives. If I say ‘Crimean War,’ the system may think I’ve said ‘crime in a war,’ because it lacks context. But we retain that as an alternative. By keeping the multiple word alternatives as well as the highest-confidence word, we get much better recall rates during the search phase.”
“We represent word alternatives as a graph structure: the lattice. Experiments showed that when it came to multiword queries, indexing and searching this word lattice significantly improved results for document-retrieval accuracy compared with plain speech-to-text transcripts: a 30- to 60-percent improvement for phrase queries and more than 200-percent better for queries consisting of multiple words or phrases.”
You can check out a demo of MAVIS at the Microsoft Video Web where more than 15,000 MSNBC news videos have been indexed, and you can read more about the technicalities of MAVIS on the Microsoft Research website.