xapian indexing software for searching

s243a
Posts: 501
Joined: Mon Dec 09, 2019 7:29 pm
Has thanked: 90 times
Been thanked: 37 times

xapian indexing software for searching

Post by s243a »

I created slackbuilds for the following programs (Compiled in Fatdog810b):

xapian-core-1.4.17-x86_64-1_SBo.tgz
https://slackbuilds.org/repository/14.1 ... pian-core/
*slackbuild modified to use newer source

xapian-omega-1.4.17-x86_64-1_SBo.tgz
https://slackbuilds.org/repository/14.2 ... ian-omega/
*slackbuild modified to use newer source

xapian-bindings-1.4.17-x86_64-2_SBo.tgz
https://slackbuilds.org/repository/14.2 ... -bindings/
*slackbuild modified to use newer source

These programs can be used for things like indexing directories and web pages. They support many document types, such as HTML, PDF and office documents. I tested this and it works.
Create an index with the following command:

Code:

omindex -p --db info --url documents /mnt/data0/Documents

https://www.ibm.com/developerworks/libr ... index.html
https://manpages.ubuntu.com/manpages/ar ... dex.1.html

Query the database as follows:

Code:

quest --db=info redbook

https://www.ibm.com/developerworks/libr ... index.html
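
The two steps above (index, then query) can be sketched as one small script. This is only an illustration: the names DB, SRC, DRY_RUN, run, build_index and search_index are made up for this sketch and are not part of xapian. By default it only prints the commands, so it is safe to try before anything is installed.

```shell
#!/bin/sh
# Sketch: index a directory with omindex, then query it with quest.
# DB, SRC, DRY_RUN and the function names are illustrative assumptions.
DB="${DB:-info}"                      # database directory
SRC="${SRC:-/mnt/data0/Documents}"    # directory to index
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands

# Print a command; also execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != 1 ]; then
        "$@"
    fi
}

# Same options as the omindex command in this post.
build_index() {
    run omindex -p --db "$DB" --url documents "$SRC"
}

# Same form as the quest command in this post; search terms are passed through.
search_index() {
    run quest --db="$DB" "$@"
}

build_index
search_index redbook
```

Running it with DRY_RUN=0 executes the two commands for real, assuming omindex and quest are on the PATH.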

Some related links:

https://wiki.python.org/moin/HelpOnXapi ... g_an_index
https://xapian.org/download
https://github.com/xapian/xapian-docspr ... de/python3
https://getting-started-with-xapian.rea ... ample-code
https://xapian.org/docs/

Alternatives to xapian:
https://unix.stackexchange.com/question ... t-indexing
https://www.tecmint.com/count-word-occu ... text-file/
http://swishplusplus.sourceforge.net/
https://web.archive.org/web/20061223111 ... ish-e.org/
https://en.wikipedia.org/wiki/SWISH-E
https://metacpan.org/pod/SWISH
https://www.linuxjournal.com/article/6652

Last edited by s243a on Wed Nov 11, 2020 7:59 am, edited 3 times in total.

Re: xapian indexing software for searching

Post by s243a »

So here's a usage example. On my dropbox, I have a number of folders for various job applications, which contain resumes and job ads. I might want to search for a term such as "CAD" in one of the documents, so that I can use it as an example cover letter.

First navigate to the folder, and then create an index for it as follows:

Code:

omindex -p --db ~/info_s243a_personal --url dropbox .

"dropbox" is the first part of the URL after the domain (this is just an abstraction; I should have used a longer path than just "dropbox", because the folder I indexed is nested more deeply in my dropbox than this). "." is the directory that I'm indexing. I navigated to this folder before running the command, which keeps me from having to type out the whole path.

Now I can search for the term as follows:

Code:

quest --db=info_s243a_personal CAD

*quest is installed as part of xapian-core
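** info_s243a_personal is a directory (i.e. the database) in "~" (i.e. my root user home directory), so the relative path above only works when quest is run from "~"; from anywhere else, give the full path to the database.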
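
Because the index is built from inside the indexed folder while the database sits in the home directory, one way to avoid path mix-ups is to build the database path once and reuse it for both commands. A minimal sketch, assuming the database lives directly under $HOME; db_path is a hypothetical helper name, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: use one absolute database path for both indexing and searching,
# so the commands work from any directory. db_path is a hypothetical helper.
db_path() {
    # $1 = database name; the database is assumed to live in $HOME
    printf '%s/%s\n' "$HOME" "$1"
}

DB="$(db_path info_s243a_personal)"

# Print the two commands; copy them once the paths look right.
echo "omindex -p --db $DB --url dropbox ."
echo "quest --db=$DB CAD"
```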

I'm using maestral to access dropbox. Roughly it can be installed as follows:

Code:

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade maestral
python3 -m pip install --upgrade 'maestral[gui]'

It requires devX to be loaded and may also need some other dependencies. For instance, see:
http://www.murga-linux.com/puppy/viewto ... 26#1046607

The search with xapian was very fast. It was much faster than even using the find command to search just file names, yet it was searching file contents.
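
For anyone who wants to check the speed difference on their own files, here is a rough timing sketch. The helper names elapsed and compare are invented for this example, the resolution is only whole seconds, and the quest step is skipped if quest is not installed.

```shell
#!/bin/sh
# Sketch: compare wall-clock time of a file-name scan with an indexed search.
# elapsed and compare are hypothetical helpers; resolution is one second.
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

compare() {
    term="$1"   # search term, e.g. CAD
    db="$2"     # path to the xapian database
    # File-name scan of the current directory with find.
    echo "find:  $(elapsed find . -name "*${term}*")s"
    # Indexed content search, only if quest is available.
    if command -v quest >/dev/null 2>&1; then
        echo "quest: $(elapsed quest --db="$db" "$term")s"
    else
        echo "quest: not installed"
    fi
}

# Usage: sh this-script TERM DATABASE (run from the indexed folder)
if [ $# -ge 2 ]; then
    compare "$1" "$2"
fi
```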

Re: xapian indexing software for searching

Post by s243a »

So in my example above, I successfully indexed ".pdf" and ".html" documents but not ".doc" documents. I think that to do so I need to specify a filter for the omindex command. For example:

Code:

--filter=application/msword:'abiword --to=txt --to-name=fd://1'

https://xapian.org/docs/omega/overview.html

Some alternative commands:

Code:

soffice --headless --convert-to txt:Text YOUR-DOCUMENT-HERE.DOC

https://ask.libreoffice.org/en/question ... ext-files/
*LibreOffice must not already be running when this command is used

Code:

odt2txt YOUR-DOCUMENT-HERE.DOC

Edit: I think the full indexing command should look like this:

Code:

omindex --filter=application/msword:'soffice --headless --convert-to txt:Text fd://1' -p --db ~/info_john_personal --url dropbox .

I will post whether or not this works in the next post.
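
As far as I know, soffice --convert-to writes its result to a file rather than to standard output, while an omindex filter command is expected to print the extracted text. One possible workaround is a small wrapper that converts into a temporary directory and then prints the result. This is only a sketch under those assumptions: txt_name and doc_to_stdout are hypothetical names, and it assumes soffice accepts --outdir and names the output after the input file with a .txt extension.

```shell
#!/bin/sh
# Hypothetical wrapper (not from this thread): convert one document to
# plain text on stdout via soffice and a temporary directory.

# Name of the text file soffice is assumed to produce for a given input.
txt_name() {
    base=$(basename "$1")
    echo "${base%.*}.txt"
}

doc_to_stdout() {
    src="$1"
    tmp=$(mktemp -d) || return 1
    # Conversion messages are discarded so only the text reaches stdout.
    soffice --headless --convert-to txt:Text --outdir "$tmp" "$src" >/dev/null 2>&1
    cat "$tmp/$(txt_name "$src")"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Usage: sh this-script DOCUMENT
if [ $# -gt 0 ]; then
    doc_to_stdout "$1"
fi
```

If saved as an executable script, it could then be named in the filter, for example --filter=application/msword:/path/to/script, assuming omindex appends the file name to the filter command.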

Re: xapian indexing software for searching

Post by s243a »

The quoted files in the original post will be replaced by the following:

xapian-core-1.4.17-x86_64-1_SBo.tgz
xapian-omega-1.4.17-x86_64-1_SBo.tgz
xapian-bindings-1.4.17-x86_64-2_SBo.tgz

I decided to try a newer version because in the older version the --filter option wasn't using my conversion script to convert rtf files; instead it was using antiword. I don't have this issue with the newer version.

Here is the command I was testing with:

Code:

omindex -i -v --filter='text/rtf:s243a-convert' -p --db ~/info_john_personal_test --url dropbox . --overwrite

Where s243a-convert is defined at:

In theory, I should also be able to use wildcards in the MIME type and filter by other things such as extension and encoding. I haven't tested this with the new package yet, though.
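
To illustrate what a conversion script used as a filter might look like, here is a hypothetical stand-in. It is not the actual s243a-convert script, which is not shown in this thread. It picks a converter from the file extension and writes the text to stdout; pick_converter and convert are invented names, and unrtf, antiword and odt2txt are assumed to be installed if those file types are used.

```shell
#!/bin/sh
# Hypothetical stand-in for a conversion script (NOT the real s243a-convert).
# Chooses a text converter by file extension and prints the text to stdout.
pick_converter() {
    case "$1" in
        *.rtf|*.RTF) echo "unrtf --text" ;;
        *.doc|*.DOC) echo "antiword" ;;
        *.odt|*.ODT) echo "odt2txt" ;;
        *)           echo "cat" ;;   # fall back to passing the file through
    esac
}

convert() {
    # Word splitting of the converter string is intentional here,
    # so that "unrtf --text" becomes a command plus one option.
    $(pick_converter "$1") "$1"
}

# Usage: sh this-script FILE
if [ $# -gt 0 ]; then
    convert "$1"
fi
```

With a script like this on the PATH, the filter would be given in the same form as in the test command above, e.g. --filter='text/rtf:name-of-script'.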

Re: xapian indexing software for searching

Post by s243a »

s243a wrote: Wed Nov 11, 2020 7:55 am

The quoted files in the original post will be replaced by the following:

xapian-core-1.4.17-x86_64-1_SBo.tgz
xapian-omega-1.4.17-x86_64-1_SBo.tgz
xapian-bindings-1.4.17-x86_64-2_SBo.tgz

The following dependency is needed:
chmlib-0.40-x86_64-1_SBo.tgz
