Full Text Database Indexing with dtSearch

Written by Ian Elliot

Wednesday, 30 November 2011

Continuing our look at how dtSearch makes full text indexing and search easy, we now move on to consider the strange topic of indexing databases.

You might think that bringing in a separate piece of software to index a database is slightly crazy - after all, aren't databases supposed to be all about indexes?

The answer is yes, but compared with the sort of thing dtSearch is designed to do, database indexes are very simple. Much of the time the documents that you want to index are stored in a collection of folders, but increasingly databases are used to store documents of all types, complete with a simple keyword or metadata index. This is fine for a lot of retrieval situations, but what if you want to perform complex searches on the contents of the documents? For example, in a database-driven website the articles are stored in a database. If you want to create a full text search facility, you have to index the database.

This is where dtSearch comes in. It can index documents stored in a database, or anywhere else for that matter. In this article we take a look at how you can index documents of all kinds stored in any container you care to think up - and not just a database.

Creating an index under program control

The first thing we need to learn is how to create an index under program control. Most of the time you want to create a full text index of a set of files stored in a collection of directories. This is something that dtSearch can do automatically using the desktop utility. In fact there is usually very little reason to write programs to create an index - but it can be done and it isn't difficult. You do, however, need to write code to create an index if the data source is something other than files stored in folders.

So let's first look at how to create a standard index in code and then extend this to using a more general data source.

First you will need a copy of dtSearch. To follow this example it is suggested that you download the 30-day evaluation from dtsearch.com. It is also assumed that you already know how to create an index and search it using, say, C#. If not, read Getting started with dtSearch.

A simple index

Start a new C# project and make sure you have added a reference to the dtSearch library and have added:

using dtSearch.Engine;

to the start of the source file.

Creating an index under program control with dtSearch is exceptionally simple. All you need is an IndexJob object:

IndexJob indexJob = new IndexJob();

You simply set the properties of the IndexJob object to specify the index you want to create and call one of the Execute methods to build or update the index.

So what do you have to specify to create an index?

First you have to say where you want the index to be created:

indexJob.IndexPath = @"C:\Users\name\AppData\Local\dtSearch\test2";

There is no particular reason to use this location; it is just the default used by the dtSearch Desktop utility for the indexes it creates. Notice that you specify the directory that the files for the index are created in.

Next you have to specify the folders and files that you would like to index. This is achieved using the FoldersToIndex string collection. You can add as many folder paths to this collection as you need. For the example we will add just one:

indexJob.FoldersToIndex.Add(@"C:\Users\name\Documents");

You can add a <+> to the end of the path to signify that all of the subfolders should be indexed. If you don't add <+> then just the content of the specified folder is indexed. You can also add include and exclude filters to specify which types of file are to be indexed. For simplicity we will ignore filters.
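As a minimal sketch of the difference, the fragment below adds one folder without the suffix and one with it. The Projects folder is a hypothetical path used only for illustration, and name stands for your own user name, as in the rest of the example.

```csharp
using dtSearch.Engine;

class FolderSelectionSketch
{
    static void Main()
    {
        IndexJob indexJob = new IndexJob();

        // No suffix: only the files directly inside Documents are indexed.
        indexJob.FoldersToIndex.Add(@"C:\Users\name\Documents");

        // <+> suffix: Projects (a hypothetical folder) and all of its
        // subfolders are indexed.
        indexJob.FoldersToIndex.Add(@"C:\Users\name\Projects<+>");
    }
}
```

Nothing is indexed at this point; the collection only records what the job should cover when it is eventually executed.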

Finally, we have to set some "Action" properties that indicate how the indexing operation should be performed. The ActionCreate property has to be set to true for the indexing operation to create a new index. If the index already exists then it is overwritten. The ActionAdd property allows new documents to be added to the index. To create a new empty index and add files to it you have to set both:

indexJob.ActionCreate = true;
indexJob.ActionAdd = true;

The IndexJob is now set up with a minimal configuration and we can start it going. The simplest way to do this is to use the Execute method. This starts the indexing and returns only when the index is complete, with a Boolean to indicate success or failure. So, to complete the program, we have to add:

bool result = indexJob.Execute();

The complete program is:

IndexJob indexJob = new IndexJob();
indexJob.FoldersToIndex.Add(@"C:\Users\name\Documents");
indexJob.IndexPath = @"C:\Users\name\AppData\Local\dtSearch\test2";
indexJob.ActionCreate = true;
indexJob.ActionAdd = true;
bool result = indexJob.Execute();
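To try the listing, it can be wrapped in a console program that acts on the Boolean returned by Execute. This is only a sketch: the class name and the messages are illustrative, and the paths are the placeholder paths used above, so replace name with a real user name before running it.

```csharp
using System;
using dtSearch.Engine;

class SimpleIndexSketch
{
    static void Main()
    {
        IndexJob indexJob = new IndexJob();

        // What to index and where to put the index files.
        indexJob.FoldersToIndex.Add(@"C:\Users\name\Documents");
        indexJob.IndexPath = @"C:\Users\name\AppData\Local\dtSearch\test2";

        // Create a new index (overwriting any existing one) and add documents.
        indexJob.ActionCreate = true;
        indexJob.ActionAdd = true;

        // Blocks until indexing has finished.
        bool result = indexJob.Execute();

        Console.WriteLine(result ? "Index created." : "Indexing failed.");
    }
}
```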

Execute may be simple but it isn't really of much use.

Do you really want your indexing program to wait unresponsively while the index is constructed?

No, probably not.

In most cases the construction of an index takes more time than you can afford to have the UI blocked for. The standard solution is to run the long blocking process on another thread, and dtSearch makes this very easy for you.

Instead of calling Execute, all you have to do is call ExecuteInThread and the call returns immediately and the indexing proceeds on another thread. You can keep control of the progress of the index using IsThreadDone, AbortThread and so on.
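A rough sketch of the threaded version might look like the following. It assumes an IsThreadDone overload that takes a wait time in milliseconds and an IndexProgressInfo object to fill with the current status; that signature is an assumption not confirmed by this article, so check it against the API reference for your version of the engine.

```csharp
using System;
using dtSearch.Engine;

class ThreadedIndexSketch
{
    static void Main()
    {
        IndexJob indexJob = new IndexJob();
        indexJob.FoldersToIndex.Add(@"C:\Users\name\Documents");
        indexJob.IndexPath = @"C:\Users\name\AppData\Local\dtSearch\test2";
        indexJob.ActionCreate = true;
        indexJob.ActionAdd = true;

        // Returns immediately; indexing continues on another thread.
        indexJob.ExecuteInThread();

        // Assumed overload: wait up to 250 ms, update status, and
        // return true once the indexing thread has finished.
        IndexProgressInfo status = new IndexProgressInfo();
        while (!indexJob.IsThreadDone(250, status))
        {
            Console.Write(".");
        }

        Console.WriteLine();
        Console.WriteLine("Indexing finished.");
    }
}
```

In a desktop application the polling loop would normally be replaced by a timer, so that the UI stays responsive, and a Cancel button could call AbortThread to stop the job early.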

Implementing a full indexing application using these facilities is fairly easy - everything works as you would expect - and so for simplicity of the example we will avoid the slight complication of making the indexing asynchronous. In this case it doesn't matter too much because the index is small and completed in a few minutes or less.

Other data sources

One of the nice things about dtSearch is that it tends to implement facilities in ways that are simple, direct and probably the way you would choose to do it yourself. Of course this means that you don't get the chance to use a lot of new jargon, but you do get the program completed more quickly.

Rather than implementing lots of different interfaces to work with standard data exchange protocols, dtSearch simply provides a DataSource class. This uses any protocol you care to name internally to retrieve the data and then presents it to the indexing engine in a simple and uniform way.

Now in all probability you are already an expert on ADO, LINQ or RSS and so I'm not going to go over any of these technologies. What I am going to concentrate on is how the DataSource class is used to feed the data to the indexing engine.

Let's get started.

Last Updated ( Tuesday, 06 December 2011 )