Indexing Excel Spreadsheets using htDig

The htDig intranet search engine does not provide a filter for Excel spreadsheets. However, the xlHtml Excel-to-HTML converter allows htDig to index, and thus search, Microsoft Excel files.

htDig allows you to define an external parser file for indexing any document. Usually this is a file called parse_doc.pl, available from the htDig website.
If you want to preserve your original parser file, follow this instruction; otherwise proceed with the standard instruction below. For troubleshooting hints, see the Troubleshooting section below.

Standard Instruction

  1. download xlHtml from www.xlHtml.org and install it
    (in this example it has been installed to /usr/local, otherwise the path needs to be updated in the file in step 3)
  2. make sure the following two lines are included in mime.types
    application/msexcel   xls
    application/vnd.ms-excel   xls

  3. copy the new parse_doc.pl to /usr/doc/packages/htdig/contrib
    (or wherever your parse_doc.pl is currently installed)
    (in this file the path to xlHtml from step 1 might need to be updated)
  4. edit your htdig configuration file (usually /opt/www/htdig/conf/htdig.conf)
    and add the following to the external_parsers section
    application/msexcel   /usr/doc/packages/htdig/contrib/parse_doc.pl  \
    application/vnd.ms-excel   /usr/doc/packages/htdig/contrib/parse_doc.pl
    (remember to check the path to parse_doc.pl from step 3)
  5. start rundig and then search with htDig for a word included in the Excel file
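
Step 2 is easy to get wrong, so the check can be scripted. The following is a minimal sketch, not part of htDig or xlHtml: the function name missing_mime_entries is hypothetical, and it assumes mime.types uses the common layout of one MIME type per line followed by its extensions, with '#' starting a comment.

```python
# MIME type / extension pairs required by step 2.
REQUIRED = {
    ("application/msexcel", "xls"),
    ("application/vnd.ms-excel", "xls"),
}


def missing_mime_entries(mime_types_text):
    """Return the required (type, extension) pairs not declared in the text.

    Hypothetical helper: assumes lines of the form "type ext [ext ...]"
    and treats everything after '#' as a comment.
    """
    found = set()
    for line in mime_types_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        mime_type, *extensions = line.split()
        for ext in extensions:
            found.add((mime_type.lower(), ext.lower()))
    # Sorted so the result is stable regardless of set ordering.
    return sorted(REQUIRED - found)
```

To use it, read your mime.types file (its location depends on your installation) into a string and pass it to the function; an empty list means both entries from step 2 are present.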

Troubleshooting