External converters
TextIndexNG provides a registry for external converters. Each converter is wrapped in a Python class and converts a document or an object to plain text before the text is passed to the splitter. The converter is selected based on the mime-type and the extension of the object.
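The following is a minimal sketch of how such a registry could work. It is not taken from the TextIndexNG source: the class names (BaseConverter, HTMLConverter, ConverterRegistry), the lookup() method, and the rule "try the mime-type first, then fall back to the extension" are all assumptions made for illustration.

```python
# Hypothetical sketch of a converter registry. None of these names
# come from the TextIndexNG source code.


class BaseConverter:
    """A converter turns a document into plain text."""

    mime_types = ()  # mime-types this converter handles
    extensions = ()  # file extensions this converter handles (no dot)

    def convert(self, data):
        raise NotImplementedError


class HTMLConverter(BaseConverter):
    mime_types = ("text/html",)
    extensions = ("html", "htm")

    def convert(self, data):
        # A real converter would strip the markup here, typically by
        # calling an external tool. This placeholder returns the input.
        return data


class ConverterRegistry:
    def __init__(self):
        self._by_mime = {}
        self._by_ext = {}

    def register(self, converter):
        """Index the converter under every mime-type and extension it declares."""
        for mime_type in converter.mime_types:
            self._by_mime[mime_type.lower()] = converter
        for ext in converter.extensions:
            self._by_ext[ext.lower()] = converter

    def lookup(self, mime_type=None, filename=None):
        """Assumed rule: try the mime-type first, then the file extension."""
        if mime_type and mime_type.lower() in self._by_mime:
            return self._by_mime[mime_type.lower()]
        if filename and "." in filename:
            ext = filename.rsplit(".", 1)[1].lower()
            return self._by_ext.get(ext)
        return None


registry = ConverterRegistry()
registry.register(HTMLConverter())
```

With this sketch, registry.lookup(mime_type="text/html") and registry.lookup(filename="index.htm") both return the HTML converter, and an unknown type returns None, so the caller can skip indexing that object.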
Supported formats
- HTML, SGML, XML
- PDF (requires xpdf)
- RTF (requires rtf2xml)
- Postscript (requires ghostscript)
- WinWord (requires wvWare Version 1(!), no support for V 2)
- PowerPoint (requires ppthtml from the xlhtml package)
- Excel (requires xls2csv from the catdoc package)
- OpenOffice
- (all other converters from the DocumentLibrary product)
If you are on Linux, most converters can be installed using the
distribution's package manager, e.g. apt-get install catdoc ppthtml
IMPORTANT: All converters must be in the executable search path ($PATH on Unix-like systems). They must be callable through Python's os.system() or os.popen().
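A quick way to check the search path is to ask Python where it would find each executable. The sketch below uses shutil.which() from the Python 3 standard library. The tool list is only an illustration, and pdftotext is assumed here to be the xpdf program used for PDF conversion; adjust the list to the converters you actually installed.

```python
import shutil

# Illustrative list only. "pdftotext" is assumed to be the xpdf tool
# used for PDF; replace these names with your installed converters.
REQUIRED_TOOLS = ["pdftotext", "xls2csv", "ppthtml"]


def missing_tools(tools):
    """Return the names that cannot be found on the executable search path."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED_TOOLS):
        print("not found on PATH:", name)
```

Run the check as the same user and in the same environment as the Zope process, since its search path may differ from that of your login shell.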
If you upload files to Zope, CMF or Plone, you must ensure that the
content_type property of the object is set to the correct mime-type,
e.g. application/pdf if your content is PDF. This setting is
extremely important: without it, TextIndexNG may not be able to
determine the type of your file and cannot choose the required
converter.
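If an upload script has only the filename, the mime-type can be guessed from the extension with Python's standard mimetypes module. This is a sketch: the helper name guess_content_type and the fallback value are assumptions, and writing the result to the object's content_type property is left out because it depends on the content class in use.

```python
import mimetypes


def guess_content_type(filename, default="application/octet-stream"):
    """Guess a mime-type from the filename, falling back to a generic default."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default
```

For example, guess_content_type("report.pdf") yields application/pdf, which is the value the content_type property should carry for a PDF upload.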
