Modifying The Filters in Recoll For Better Indexing

s243a


Post by s243a »

recoll allows you to build indexes for file system searches.

Slackbuild notes the optional dependencies:
Optional dependencies are antiword, unrtf, and exiftool. Recoll uses Xapian as the back-end. On dpup buster CE I tried to use an index I created via omindex (part of the xapian-omega package), but I got an error saying that the index version was incompatible.

As a consequence, I decided to let recoll build the index. Recoll is a bit different than omindex. In omindex you want to define a filter to convert each file type to plain text for indexing. Recoll already has these defined as Python scripts, and they are located at:

Code: Select all

/usr/share/recoll/filters

For instance, rclrtf.py is a filter defined as part of the recoll package to convert RTF files to text. On line #38 we have:

Code: Select all

        cmd = rclexecm.which("unrtf")
        if cmd:
            return ([cmd, "--nopict", "--html"], RTFProcessData(self.em))
        else:
            return ([], None)

which tells us that recoll is using unrtf, so for this to work one must have unrtf installed. Also note how recoll searches for this command:

Code: Select all

# Helper routine to test for program accessibility
# Note that this works a bit differently from Linux 'which', which
# won't search the PATH if there is a path part in the program name,
# even if not absolute (e.g. will just try subdir/cmd in current
# dir). We will find such a command if it exists in a matching subpath
# of any PATH element.
# This is very useful esp. on Windows so that we can have several bin
# filter directories under filters (to avoid dll clashes). The
# corresponding c++ routine in recoll execcmd works the same.

https://fossies.org/linux/recoll/filters/rclexecm.py
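
To make the comment above a bit more concrete, here is a minimal sketch of the lookup it describes. This is not the actual code from rclexecm.py; the names find_in_path and _is_executable are made up for illustration, and the only behaviour assumed is what the comment states: a relative name with a directory part (like subdir/cmd) is still tried under every PATH element.

```python
import os


def _is_executable(candidate):
    """True if candidate is a regular file that the current user may execute."""
    return os.path.isfile(candidate) and os.access(candidate, os.X_OK)


def find_in_path(program, path=None):
    """Return the first executable match for program, or None.

    Hypothetical helper: unlike the shell's 'which', a relative name that
    contains a directory part (e.g. "subdir/cmd") is still joined onto
    every PATH element instead of being tried only in the current dir.
    """
    if os.path.isabs(program):
        # Absolute names are checked directly, no PATH search needed.
        return program if _is_executable(program) else None
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if _is_executable(candidate):
            return candidate
    return None
```

With something like this, find_in_path("subdir/cmd") would find the command if any directory on the PATH contains a subdir holding an executable cmd, which seems to be the case the comment has in mind for having several bin directories under filters.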

I'm not sure of the implications of this, though, or whether it would ever cause recoll any difficulty in finding the unrtf command.
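
A quick way to see whether the helper programs are reachable at all is to look them up on the PATH yourself. The sketch below uses shutil.which from the Python standard library, not recoll's rclexecm.which, so it only approximates what the filters do (for plain names like unrtf I would expect the two to agree, but that is an assumption). The function name find_helpers is made up for illustration; the program names are the optional dependencies listed above.

```python
import shutil

# Optional helper programs named in the SlackBuild notes.
OPTIONAL_HELPERS = ["antiword", "unrtf", "exiftool"]


def find_helpers(names, path=None):
    """Map each program name to its full path, or None if it is not found."""
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    for name, location in find_helpers(OPTIONAL_HELPERS).items():
        print(f"{name}: {location or 'not found'}")
```

If unrtf shows up as not found here, the rclrtf.py filter will most likely get nothing back from its own lookup either, and RTF files would not get converted.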

So one can modify these filters or even add their own. For instance, here is a custom filter download for Lotus Notes:
http://rcollnotesfiltr.sourceforge.net/
