Document Caching With dtSearch

Written by Ian Elliot

Wednesday, 07 March 2012

Continuing our look at facilities available in dtSearch, we examine the option to store the contents of a document within the index and show how it provides hit highlighting, report generation and document retrieval. It has lots of possibilities.

When you are searching for a document, what matters first is finding that it exists. Although occasionally just knowing it exists is enough, what you usually need to do next is to access the document.

In most cases it is enough to build an index of words that occur in the document, complete with links to where the document is stored. An index is usually a lot smaller than the original collection of documents and in the past this was an important consideration. Now, however, with big multi-terabyte disk drives available at reasonable prices, storage is cheap and plentiful. Today getting at the data you want is often a bigger design priority than minimizing storage.

dtSearch is an indexing system whose API we have explored before, but so far we have ignored one intriguing option that it makes available when you set up an index: you can opt to store the contents of a document within the index. This means that once a document is found by searching the index, it is more-or-less instantly available. If the original is stored at a difficult-to-reach location then this is not only a useful convenience, it may be the only way to make document retrieval actually work.

Notice that the document to be cached could be stored on a website, a cloud datastore or a file share. Adding the document to the index creates a local copy and this not only has advantages for reliability and speed but it also provides a form of backup.

To most programmers there is something slightly scary about the idea of making what amounts to a complete copy of the document filing system within an index, which is traditionally thought of as a compressed lightweight entity. However, you need to keep various options in mind.

The first is that you can opt to simply store the extracted text (without any formatting) in the cache. Text is reliably highly compressible and the resulting extra storage requirements are generally small. Also in many cases the text of a document is all you really need to present to the user quickly - a full document can be retrieved when it is available. The text can also be used to show search results in context much faster than if the original document has to be retrieved.

If text alone isn't sufficient for your purpose, for example if you want to show hit highlighting in the retrieved document with formatting, or if the original documents are hard to access, then you can opt to store the entire document in the index. In this case you have the security of complete retrieval as part of the search but you cannot be sure that the storage requirements are going to be small. It really does depend on the document type and how compressible it is.

Even if you do need to cache the original files, it can still be useful to cache the text as well. Caching of text makes search report generation much faster, so you can efficiently include a brief hits-in-context snippet in search results without making searches too slow.

Even in the case of full document caching, storage is rarely an issue. The important point is that caching makes no difference to the speed of search, while speeding up any task, such as search report generation, that needs access to the originals. The only downside is that building the index will be slower because the documents have to be stored, but an incremental approach to updating the index usually makes this worthwhile.

There are also security implications - documents cached in the index can be accessed by anyone who can perform a search.

Creating an index with cached documents

To create an index with cached documents, access the advanced options when creating the index, either by selecting the Advanced button or by using the Create Index (Advanced) option. You can then select either Cache document text in the index or Cache documents in the index. When you start the indexing process, either the text or the full document will be cached in the index.

Notice that you can opt to remove or keep documents that have been deleted when the index is updated. This in itself opens up some interesting possible uses of the index. For example, you could keep deleted documents as an audit trail or just as a way to protect users from accidentally deleting their documents. You could even give users a way to find their lost documents and restore them.

If you set up an index programmatically then it all works in the usual way - i.e. create an IndexJob object and use its Execute method. All you have to do is remember to set the IndexingFlags property to dtsIndexCacheText or dtsIndexCacheOriginalFile before you execute the job. For example:

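IndexJob IJob1 = new IndexJob();
// set up the usual properties
// to specify the index

// Option 1: cache the complete original files
IJob1.IndexingFlags =
    IndexingFlags.dtsIndexCacheOriginalFile;

// Option 2: cache only the extracted text
// (use this instead of, not after, the assignment above)
// IJob1.IndexingFlags =
//     IndexingFlags.dtsIndexCacheText;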

Notice that you can only opt to cache text or documents when the index is first created. You cannot change this option for an index that exists. The only solution if you have forgotten to include the text or complete files is to rebuild the index. Once the index is built complete with cached files, updates are incremental as you would expect.
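As a rough sketch of how a complete job might look when you want both the text and the original files cached, the two flags can be combined with a bitwise OR. The index path and document folder below are hypothetical, and the property names used to specify the index (IndexPath, FoldersToIndex, ActionCreate, ActionAdd) are assumed from the usual IndexJob setup rather than taken from this article, so check them against your version of the dtSearch .NET API:

```csharp
using dtSearch.Engine;

class BuildCachedIndex
{
    static void Main()
    {
        IndexJob job = new IndexJob();

        // Hypothetical locations, used only for illustration.
        job.IndexPath = @"C:\Indexes\Docs";
        job.FoldersToIndex.Add(@"C:\Docs");

        // Assumed usual setup: create a new index and add documents to it.
        job.ActionCreate = true;
        job.ActionAdd = true;

        // Cache both the extracted text and the original files.
        // This has to be set when the index is first created.
        job.IndexingFlags =
            IndexingFlags.dtsIndexCacheText |
            IndexingFlags.dtsIndexCacheOriginalFile;

        job.Execute();
    }
}
```

Combining the flags gives fast hits-in-context reports from the cached text while still allowing the formatted originals to be retrieved.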

Using the cache

There are so many ways to use the text or files cached in the index that it is impossible to cover them all, but the basic operations are few in number.

Hit highlighting

You can opt to use the cached document or text to highlight hits. This is faster than retrieving the original data and it works even if the original data is no longer accessible.

To do this you simply use the FileConverter as you normally would but to use the cached version you need to set:

fc.Flags = ConvertFlags.dtsConvertGetFromCache;

After this the highlighted document will be returned even if the original is currently unavailable or has been deleted. For example, assuming the results of the search are in Results:

FileConverter fc=new FileConverter();
fc.Flags = ConvertFlags.dtsConvertGetFromCache;
fc.SetInputItem(Results,0);
fc.OutputFormat = OutputFormats.itHTML;
fc.BeforeHit = "<h1>";
fc.AfterHit = "</h1>";
fc.OutputToString = true;
fc.Execute();

This returns the text of the file as a string in HTML format with each hit tagged as H1.
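To make use of the result you still have to read it back from the converter. The helper below is a minimal sketch of one way to wrap the steps; the class and method names are made up for illustration, and it assumes that the converter exposes the generated text through an OutputString property once OutputToString has been set:

```csharp
using dtSearch.Engine;

static class CachedHighlighter
{
    // Returns the n-th search result as HTML with hits highlighted,
    // reading the document from the index cache, not the original file.
    public static string HighlightFromCache(SearchResults results, int n)
    {
        FileConverter fc = new FileConverter();
        fc.Flags = ConvertFlags.dtsConvertGetFromCache;
        fc.SetInputItem(results, n);
        fc.OutputFormat = OutputFormats.itHTML;
        fc.BeforeHit = "<b>";
        fc.AfterHit = "</b>";
        fc.OutputToString = true;
        fc.Execute();

        // Assumption: the converted text is available in OutputString.
        return fc.OutputString;
    }
}
```

A call such as CachedHighlighter.HighlightFromCache(Results, 0) would then give the HTML for the first document found.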

Report generation

The second typical procedure is to generate a search report using the cache. Once again this is just a matter of setting the flags to the correct value:

RJob.Flags = ReportFlags.dtsReportGetFromCache;

As we haven't actually given an example of creating a SearchReport in previous articles, it is worth listing the complete instructions to generate a simple report from some search results assumed to be stored in Results:

SearchReportJob RJob = new SearchReportJob();
RJob.SetResults(Results);
RJob.OutputToString = true;
RJob.SelectAll();
RJob.BeforeHit = "<b>";
RJob.AfterHit = "</b>";
RJob.WordsOfContext = 10;
RJob.Flags = ReportFlags.dtsReportGetFromCache;
RJob.Execute();

In this case, the report consists of all of the hits, with the hits highlighted using bold and an additional 10 words of context surrounding the hit. Using the cached documents should make the report generation faster and more reliable as it works even if the documents are inaccessible.
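For completeness, here is a sketch that runs a search and then builds the report from the cache in one go. The SearchJob members used (IndexesToSearch, Request, MaxFilesToRetrieve, Results) and the OutputString property of the report job are assumptions based on the usual dtSearch .NET API, and the class and method names are invented for the example:

```csharp
using dtSearch.Engine;

static class CachedReport
{
    // Searches one index and returns a hits-in-context report built
    // from the cached documents instead of the original files.
    public static string Run(string indexPath, string request)
    {
        // Assumed usual search setup.
        SearchJob sj = new SearchJob();
        sj.IndexesToSearch.Add(indexPath);
        sj.Request = request;
        sj.MaxFilesToRetrieve = 10;
        sj.Execute();

        SearchReportJob rj = new SearchReportJob();
        rj.SetResults(sj.Results);
        rj.OutputToString = true;
        rj.SelectAll();
        rj.BeforeHit = "<b>";
        rj.AfterHit = "</b>";
        rj.WordsOfContext = 10;
        rj.Flags = ReportFlags.dtsReportGetFromCache;
        rj.Execute();

        // Assumption: the report text is available in OutputString.
        return rj.OutputString;
    }
}
```

Because the report is generated from the cache, this should work even on a machine that has the index but no access to the original documents.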

Document retrieval

The final mode of use is to retrieve the entire document from the cache. It is obvious that this has so many potential applications that it is pointless trying to list them. The basic mechanism is logical. The standard way of retrieving document content is to use the FileConverter along with a conversion to HTML, XML, RTF or plain text. In this case we use the FileConverter but without performing any file conversion. As before we simply need to set some flags. The only other restriction is that the file can only be retrieved to disk, i.e. you can't save it in memory - which is reasonable enough.

So to retrieve the first file specified in Results you would follow the usual steps. First create a FileConverter:

FileConverter fc=new FileConverter();

But now you need to set the flags:

fc.Flags = ConvertFlags.dtsConvertGetFromCache
      | ConvertFlags.dtsConvertExtractOnly;

Now everything is more or less the same. Select the search item you want to retrieve (the first in this case), set the output file and call Execute:

fc.SetInputItem(Results,0);
fc.OutputFile = @"\MyFile.pdf";
fc.Execute();

At this point the file will be retrieved from the cache and stored where you specify. Of course you have access to the file name and type as part of the Search item and you can restore it to its original location if you want to.
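As an illustration of using the stored file name, the sketch below extracts a cached document into a chosen folder under its original name. It assumes that calling GetNthDoc on the results selects an item and that CurrentItem.Filename then gives the original path; the class and method names are hypothetical, and the code only makes sense for documents that came from a file system rather than a URL:

```csharp
using System.IO;
using dtSearch.Engine;

static class CacheRestore
{
    // Extracts the n-th result from the index cache into the given
    // folder, keeping the original file name. Returns the path written.
    public static string RestoreTo(SearchResults results, int n, string folder)
    {
        // Assumption: GetNthDoc selects the item exposed by CurrentItem.
        results.GetNthDoc(n);
        string original = results.CurrentItem.Filename;
        string target = Path.Combine(folder, Path.GetFileName(original));

        FileConverter fc = new FileConverter();
        fc.Flags = ConvertFlags.dtsConvertGetFromCache
                 | ConvertFlags.dtsConvertExtractOnly;
        fc.SetInputItem(results, n);
        fc.OutputFile = target;
        fc.Execute();
        return target;
    }
}
```

To restore a file to exactly where it came from, you could set OutputFile to the original path instead, although writing to a separate recovery folder is the safer default because it cannot overwrite a newer version.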

Now you know how the cache works, it is easy to use it for whatever you want and you can start to be creative.

An index is for more than just looking things up!
