Sun, 7 Aug 2011

10:28 AM - New search tool for MidnightBSD

One of the nice features of Mac OS X is Apple's Spotlight.  It makes it easy to find documents because it supports full text search and is aware of different file types.  In the open source world, there are many search tools for Linux, but they all fall short in different ways.  Some of them are slow.  Others don't support full text search and rely on inotify, which is specific to Linux.

Linux solutions 

With inotify, the Linux kernel can notify a program that a file has changed by path name.  In the BSD community, we have kqueue, which reports changes via file descriptor.  Ideally, one would create a system daemon that monitors changes in files and updates the index on the fly.  This is planned for a future version of msearch(1).  A flaw with most BSD approaches is that it's easy to hit the kern.maxfiles limit, as one has to keep many directories and files open to detect changes.  kqueue approaches tend to work with UFS and UFS2 file systems only.  Someone using ZFS or FAT32 would not get changes unless polling was used.  Most modern Linux systems use gamin or FAM to monitor file changes.
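To make the polling fallback concrete, here is a minimal sketch in Python of how a poller could detect changes without holding any files open.  It is not code from msearch; the function names snapshot and diff_snapshots are made up for illustration, and a real daemon would also need to decide how often to rescan.

```python
import os

def snapshot(root):
    """Map each file under root to its (mtime_ns, size) pair."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            state[path] = (st.st_mtime_ns, st.st_size)
    return state

def diff_snapshots(old, new):
    """Return sorted (added, removed, modified) path lists."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, modified
```

Polling like this works on any file system and never touches kern.maxfiles, but the cost is a full directory walk per scan, which is why a kernel notification interface is the more attractive long-term option.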

Many of the Linux solutions are under the GPL license. They were not designed for BSD.  I've started down the path of solving this problem.  The first iteration of my work is called msearch.  msearch(1) is a command line tool that searches for files on the computer, either by matching elements of the path or by using the full text search feature.

Indexing

All text files on the computer can be indexed by msearch.  It uses libmagic to determine the MIME type of the file.  This allows it to skip files that are empty, binary, or otherwise useless to the search tool.
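The filtering step can be illustrated with a rough stand-in.  The sketch below does not call libmagic; it uses a much cruder heuristic (non-empty, no NUL bytes, valid UTF-8 in the first few kilobytes) just to show where such a check sits in an indexer.  The name looks_indexable and the sample size are assumptions for the example.

```python
import codecs

def looks_indexable(path, sample_size=4096):
    """Crude text check: a stand-in for a libmagic MIME lookup."""
    try:
        with open(path, "rb") as fh:
            sample = fh.read(sample_size)
    except OSError:
        return False  # unreadable or missing files are skipped
    if not sample:
        return False  # empty file, nothing to index
    if b"\x00" in sample:
        return False  # NUL bytes almost always mean binary data
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multibyte character cut off
        # at the end of the sample.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

libmagic is far more precise than this, since it recognizes many file formats by signature and reports an actual MIME type, but the control flow is the same: classify first, then only hand text files to the full text engine.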

msearch(1) uses two index files generated by a program called msearch.index.  /var/db/msearch.db is a SQLite database containing path information, owner, group, and file size at the time of indexing.  /var/db/msearch_full.db contains a SQLite 3 FTS4 full text index of the text files on the computer.  It makes use of zlib to compress the text data.  On my computer, approximately 350,000 files were indexed and 84,000 were considered text files indexable by the full text engine.  Prior to adding compression, the database used 850MB of space.  After compression, the file uses 413MB, a bit less than half the original size.  Another compression algorithm might save additional space at the expense of indexing performance.
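As an illustration of the storage idea, here is a small Python sketch that keeps file metadata and zlib-compressed text in SQLite.  It is a simplification and not msearch's schema: it uses one ordinary table instead of two databases and an FTS4 index, and the table, column, and function names are invented for the example.

```python
import sqlite3
import zlib

def open_index(db_path=":memory:"):
    """Create the (hypothetical) single-table index."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY, owner INTEGER, grp INTEGER,"
        " size INTEGER, body BLOB)"
    )
    return conn

def add_file(conn, path, owner, group, text):
    """Store metadata plus the zlib-compressed text."""
    raw = text.encode("utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
        (path, owner, group, len(raw), zlib.compress(raw)),
    )

def read_text(conn, path):
    """Return the original text, or None if path is not indexed."""
    row = conn.execute(
        "SELECT body FROM files WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

Comparing the size column with length(body) for each row gives a quick per-file view of how much the compression saves.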

The current version of msearch relies on a periodic script similar to locate(1).  It is run weekly and must be turned on with weekly_msearch_enable="YES" in periodic.conf.  I would like to replace this process with a daemon that handles search requests and indexing.  Apple's search features work in this manner.

Graphical Search

Most of the logic for msearch(1) was placed in a shared library, libmsearch, which can be used to create a graphical search tool.  I envision a Sherlock-like search tool for the initial release and possibly an integrated solution if MidnightBSD ever gets its own window manager.

Security

There are several possible issues with generating an index of all files.  If the index is readable by any user, it could allow one to open the SQLite file and read the contents of sensitive files.  For this reason, I've limited the indexer so that it cannot run as the root user.  Files must be readable by nobody (if using the periodic script) to become part of the index.

There is also the possibility of SQL injection.  The database files aren't writable by normal users and the indexer uses prepared statements.  The searching functionality, however, currently uses a custom built search string, and this could result in undesired behavior.  It's also not recommended to do a search as the root user.  SQLite does have the ability to load extensions, and this feature is used to compress and rank full text data.  The extension loading is turned off right after the database is created to avoid problems from users.
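For readers unfamiliar with the distinction, the following Python sketch shows a path search that binds the user's term as a parameter rather than pasting it into the SQL text.  It is not how msearch builds its queries; the files table, the path column, and the search_paths function are assumptions for the example.

```python
import sqlite3

def search_paths(conn, term):
    """Find paths containing term, treating term as literal text."""
    # Escape LIKE wildcards so "%" and "_" in the term match themselves.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    cur = conn.execute(
        # The term is bound via "?", so it never becomes part of the SQL.
        "SELECT path FROM files WHERE path LIKE ? ESCAPE '\\' ORDER BY path",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in cur]
```

With binding, a term such as "x'; DROP TABLE files; --" is just an odd string that matches nothing.  The same term concatenated into a hand-built query would change the shape of the statement itself.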

Future directions

I have a large list of features to add to msearch(1).  I plan to add filtering based on file size, user id, group id, created and modified times.  I've considered adding a network search feature in combination with the plans for the search daemon and indexing in near "real time" with file monitoring.  In order for this to work efficiently, a new kernel interface would need to be created or kqueue would need to be modified.

I don't intend for this tool to replace locate(1), find(1) or similar search functions, but merely allow users to have an additional option with full text.  

Performance

Full text searches are quite fast.  Simple queries such as searching for Linux are done in seconds.  A search against path names takes longer than locate(1), but is still respectable.  locate(1) uses a path compression technique to keep the database small and was optimized for low resources.  msearch(1) takes advantage of the convenience of SQLite 3 and the performance of modern PCs.
