OpenOffice.org Format Converter


The OpenOffice.org Format Converter provides a way to use OpenOffice.org software to batch convert word processing files from one format to another. This code does not perform any format conversions itself. Instead, it calls OpenOffice.org through its published API to convert the files.

If you want to learn more about the OpenOffice.org API, there are numerous great examples of its usage on the OpenOffice.org Developer's Site, which helped immensely in the creation of this code.

In particular, using OpenOffice.org 2.x, this Format Converter supports the following conversions:

From any of these formats:

  • Microsoft Office (Word, PowerPoint, Excel)
  • OpenOffice.org formats (Writer, Impress, Calc)

To any of these formats:

  • corresponding OpenOffice.org format (Writer, Impress, Calc)
  • PDF
  • HTML (has some problems, since images aren't always converted)
  • CSV (for Excel or Calc)
  • DocBook XML, Word XML, RTF, Plain text (for Word or Writer)

info A complete list of all OpenOffice.org 3.0 Import/Export Filters is available on the OpenOffice.org Wiki. The Format Converter uses these filters to perform its format conversions, so it can generally support any transformation that these Import/Export Filters support.

For a slightly more detailed overview (with diagrams and pictures), see my "Format Conversion in DSpace using OpenOffice.org" poster, a "best" poster from Open Repositories 2007.

Prerequisites / System Requirements

The Format Converter requires the following prerequisites in order to function properly:

  • A local or remote installation of OpenOffice.org (which must be running as a service, as described below)
  • A local or remote installation of DSpace 1.4.x or above (to allow the optional Media Filter to perform scheduled format conversions in DSpace)

Although Linux is not required, the Format Converter was built on Linux. It has not been fully tested on Windows or Mac OS X, although I believe it should function properly on both systems.

Installing the Converter

Configure OpenOffice.org Installation

Install Extra Fonts

To make your format conversions more faithful to the original document, it is highly recommended that you install extra fonts into your OpenOffice.org installation. Otherwise, OpenOffice.org will often be forced to replace the fonts used in the output file with a font that ships with OpenOffice.org by default. Although this will not change the content of the output file, it will affect the look of the file, and in some cases the page breaks.

warning Even with the extra fonts installed, there is always the possibility that OpenOffice.org will encounter an unknown font in an input file and be forced to change it during the format conversion process.

The easiest way to install additional fonts is to use the Font Installation Wizard available from the following menu in OpenOffice.org Writer:

File -> Wizards -> Install fonts from the web

warning The Font Installation Wizard is only available for OpenOffice.org 2.x. In 3.x, all system-level fonts are auto-recognized by OpenOffice.org. So, you can add more fonts just by adding them to your local system.

If possible, be sure to open OpenOffice.org Writer as the user who installed OpenOffice.org. This ensures the user has the rights to install these fonts into the OpenOffice.org installation directory!

From within the Font Installation Wizard in OpenOffice.org 2.x, do the following:

  1. Select English language (or whatever language you want the wizard to run in)
  2. Click "Start FontOOo" to start the Wizard
  3. Select "Administrative Setup" and click "Next"
    1. If this option is disabled, then you are not the user who installed OpenOffice.org on this system. If you cannot log in as that user, you may be able to get by using the "Current user Setup", if the current user is the one who will run the open-office-server (described in the next section)
  4. Click the "Retrieve the List" button at the top of the first "font packages" page
    1. If you get a message that there's a newer version of FontOOo, download it to [open-office]/share/dict/ooo/ (where [open-office] is the location where you installed OpenOffice.org)
    2. If you downloaded a new version, it will be loaded automatically, but you'll have to start over at step #1 above!
  5. Highlight all font packages (using Shift + mouse click) and click "Next"
    1. If you already have the font packages installed, they will appear grayed out. You will be unable to highlight these packages, since they are already installed, so just click "Next"
  6. Repeat steps #4 and #5 above for all available font pages
  7. Complete the installation and restart OpenOffice.org in order for these fonts to be loaded.

info Additional information on installing fonts in OpenOffice.org is available at:
There is also a somewhat outdated Font FAQ at:

Starting OpenOffice.org as a "service"

In order to send files in bulk to OpenOffice.org for conversion, it is highly recommended to start OpenOffice.org as a service, so that it will "listen" for conversion requests on a particular port. This is advantageous since it allows you to schedule your bulk format conversions for a time when your server's CPU is not being fully utilized (e.g. overnight). Technically, you could even schedule your OpenOffice.org service to start up just before your scheduled conversions (assuming you have a set conversion schedule, of course).

Creating a Service on Unix-based Systems (including Mac OS X)

Unfortunately, OpenOffice.org always requires an X Window System to run, even when it's running as a service. Normally, on Unix-based systems, someone needs to be logged in for the X Window System to be running. But you can get around this by using a virtual X Window System. I recommend installing Xvfb (X Virtual Frame Buffer). Many Linux distributions provide Xvfb (try running which Xvfb at the command line). If you cannot find it for your system, it's also available as part of XFree86.

Assuming you decide to use Xvfb, I have a freely available open-office-server init.d startup script which you can use to automatically start up OpenOffice.org each time your server boots. (Although this open-office-server script was built for RedHat Linux, it will likely work in similar Linux environments.)

If you decide to use the provided open-office-server script, you first must do the following:

  • Ensure that both Xvfb and killall (or a similar kill command) are installed on your server
  • Modify the following local parameters in the script:
    • pidfile = this parameter is in the header of the script, and gives the full path of a process-id file for this open-office-server service. Although you can probably set this to whatever you want, I recommend you provide a path in the location where OpenOffice.org is installed. (e.g. somewhere under /opt/openoffice.org2.0/)
    • JAVA_HOME = the full path of your JDK installation (e.g. /usr/java/jdk1.5.0_04)
    • SERVER_HOST = the host name of the server where this service is running. Any requests to the server must use this exact name. If you set this to localhost, only requests from this same server will be accepted by OpenOffice.org.
    • SERVER_PORT = the port on which you want the server to listen for requests. (e.g. 9000)
    • OOo_CMD = the full path to the OpenOffice.org executable (e.g. /usr/bin/soffice or /usr/bin/openoffice or similar)
    • OOo_PARAM = the parameters given to the OpenOffice.org executable. You probably don't need to modify these, unless you don't want it to run in invisible mode while you are testing.
    • Xvfb_CMD = the full path to the Xvfb executable (e.g. /usr/X11R6/bin/Xvfb)
    • Xvfb_TEMP_DIR = Xvfb needs a temporary directory to write frame buffers to on occasion. This should probably just be a location in your system /tmp or /var/tmp (e.g. /var/tmp/tempfb)
    • Xvfb_PARAM = the parameters given to the Xvfb executable. You probably don't need to modify these.
    • KILL_ALL = the full path of the killall executable (e.g. /usr/bin/killall)
    • XAUTH (optional) = if specified, then the xauth script is called to ensure that this program is authorized to start OpenOffice.org on the specified $VIRTUAL_DISPLAY (useful if you start to receive "connection refused by server" errors). This variable corresponds to the full path of the xauth executable (e.g. /usr/X11R6/bin/xauth)
    • MCOOKIE (optional) = Required if you are using $XAUTH, since it needs to generate a Magic Cookie to give authorization to the specified $VIRTUAL_DISPLAY
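
To make these parameters more concrete, here is a rough sketch of the two command lines that a startup script of this kind would typically end up running. It is not taken from the open-office-server script itself: the display number, the screen geometry, and the exact flags are assumptions on my part. The -invisible and -accept flags are the single-dash forms used by OpenOffice.org 2.x/3.x, and -fbdir is the Xvfb option for a frame buffer directory (the directory must already exist). The commands are only printed, so you can review them before running anything.

```shell
# Hypothetical values mirroring the script parameters described above.
SERVER_HOST=localhost
SERVER_PORT=9000
VIRTUAL_DISPLAY=:1                 # assumed display number for the virtual X server
Xvfb_CMD=/usr/X11R6/bin/Xvfb
Xvfb_TEMP_DIR=/var/tmp/tempfb
OOo_CMD=/usr/bin/soffice

# UNO "accept" string that tells the office process which host/port to listen on.
accept_string() {
  printf 'socket,host=%s,port=%s;urp;' "$1" "$2"
}

# Print the Xvfb command line (printed, not executed).
xvfb_command() {
  printf '%s %s -screen 0 1024x768x24 -fbdir %s\n' \
    "$Xvfb_CMD" "$VIRTUAL_DISPLAY" "$Xvfb_TEMP_DIR"
}

# Print the office command line; DISPLAY points it at the virtual X server.
office_command() {
  printf 'DISPLAY=%s %s -invisible -accept="%s"\n' \
    "$VIRTUAL_DISPLAY" "$OOo_CMD" \
    "$(accept_string "$SERVER_HOST" "$SERVER_PORT")"
}

xvfb_command
office_command
```

If the printed commands look right for your system, you can start Xvfb first and the office process second. A client then has to connect using the same host name and port that appear in the accept string.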

info Graham Triggs (of BioMed Central) has created a similar open-office-server script suitable for Debian Etch environments. To use it, do the following:
(1) Download open-office-server-debian and rename it to open-office-server
(2) Place the script in /etc/init.d
(3) Run the following to install it into startup (note that the -n flag makes update-rc.d only print what it would do; drop -n to actually create the startup links):
update-rc.d -n open-office-server start 85 2 3 4 5 .
(4) Finally, ensure Xvfb is installed by running: apt-get install xvfb

Creating a Service on Microsoft Windows

See Creating an OpenOffice.org Service on Windows, which, ironically, comes from JODConverter, a very similar Java OpenOffice.org converter tool (but without the DSpace integration functionality).

Please note, I've never tried running OpenOffice.org as a Windows Service, so I'm trusting that the above directions will actually work!

Download & Install the Converter

Download the Converter

The Format Converter is released under the terms and conditions of the University of Illinois/NCSA Open Source License. The code and documentation are provided "as-is", with no promises of support or maintenance by the University of Illinois.

The latest Java source code can be downloaded here:

This includes the following file structure:

  • build.xml - the Apache Ant build file
  • bin/ - contains run scripts to more easily execute OOConverter with the proper Java CLASSPATH
  • lib/ - the necessary JARs, included for convenience
    • README - contains information about each of the included JARs and their current version numbers
    • commons-cli.jar - a copy of the Jakarta Commons Command Line Interface (CLI)
    • commons-io.jar - a copy of the Jakarta Commons I/O
    • log4j.jar - a copy of Log4j
    • oo-converter.jar - a pre-compiled version of the Format Converter (in case you don't want to compile it yourself!)
    • juh.jar, jurt.jar, ridl.jar, unoil.jar - the required OpenOffice.org API JARs. These JARs are just here so that you can compile this code on a server without OpenOffice.org. However, they cannot perform format conversions without a full installation of OpenOffice.org to communicate with.
  • src/ - the Java source files for the Format Converter
    • edu/uiuc/ideals/conversion/oo/
      • - the main class, which performs all format conversions by making conversion requests to an installation of OpenOffice.org running as a service and listening on a particular server and port (see Starting OpenOffice.org as a "service" above)
      • - class to read the ooconverter-formats.xml configuration file for the Format Converter
      • - class to represent a single <export-format> tag within the ooconverter-formats.xml configuration file
      • ooconverter-formats.xml - XML configuration file which lists all of the valid format conversions that the OOConverter script currently understands. This is based on the full list of OpenOffice.org 3.0 Import/Export Filters. As filter names change or more filters are added, this configuration file can be updated to automatically allow the OOConverter to support those filters.

Since the downloaded file already includes a compiled version of the Format Converter (in /lib/oo-converter.jar), you do not need to re-compile it.

If you want to recompile it, use Apache Ant to run the following from the directory where you unzipped the download:

ant update

Configure Java CLASSPATH

You will need to configure the Java CLASSPATH of whatever user(s) will be running the Converter (including the dspace user if you plan to use the Media Filter for DSpace).

In particular, you will need to include all of the JARs in your [oo-converter]/lib/ directory in your user's Java CLASSPATH. However, you may wish to consider the following:

  • Rather than placing the included OpenOffice.org JAR files in your CLASSPATH, it may be beneficial to directly reference the version of those JARs which came with your OpenOffice.org installation. (warning This becomes especially important if you are using a different version of OpenOffice.org. Currently the included JAR files are from OpenOffice.org version 2.2.)
    • All four of the necessary JAR files are usually found in [open-office]/program/classes/:
      • juh.jar
      • jurt.jar
      • ridl.jar
      • unoil.jar
    • Make sure to add all of these JARs to your CLASSPATH so that the Converter can find them.
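
As an illustration of the CLASSPATH setup described above, the following sketch collects every JAR in a directory into a colon-separated list. The function name build_classpath and the variable OO_CONVERTER_HOME in the usage comment are my own placeholders, not part of the download, and the colon separator assumes a Unix-like system.

```shell
# Print a colon-separated list of every *.jar file found in the given directory.
build_classpath() {
  dir=$1
  cp=
  for jar in "$dir"/*.jar; do
    # Skip the unexpanded pattern when the directory contains no JARs.
    [ -e "$jar" ] || continue
    cp=${cp:+$cp:}$jar
  done
  printf '%s' "$cp"
}

# Example usage (OO_CONVERTER_HOME is a hypothetical variable pointing at
# the directory where you unzipped the converter):
#   CLASSPATH=$(build_classpath "$OO_CONVERTER_HOME/lib")
#   export CLASSPATH
```

If you prefer the UNO JARs that shipped with your own installation, you could call the function a second time on [open-office]/program/classes/ and put that result ahead of the bundled lib/ entries, so that those versions are found first.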

Command-Line run scripts

To make it easier to run the OOConverter from the command line, two run scripts are included in the [oo-converter]/bin directory. These scripts ensure that all JARs in the [oo-converter]/lib/ directory are loaded into the Java CLASSPATH at runtime.

For Unix-based systems:

  • run - adds all JARs in the [oo-converter]/lib/ directory to the CLASSPATH
    • usage: run <full-classname> [arg1 [arg2 ...]]

For Windows:

  • run.bat - adds all JARs in the [oo-converter]/lib/ directory to the CLASSPATH
    • usage: run.bat <full-classname> [arg1 [arg2 ...]]

Running the Converter

There are currently three main ways to run the Converter: via the command line, via the Java API, or via the Media Filter for DSpace.

warning Before you can run the Converter, you must have OpenOffice.org running as a service and listening on a server and port accessible from your current computer.

Running via Command-Line

This example assumes you are using one of the run scripts which are packaged in the [oo-converter]/bin directory. If you do not wish to use one of the run scripts, you just need to be sure that all of the JARs in the [oo-converter]/lib directory are in your Java CLASSPATH.

From the command line you would run something similar to the following:

cd [oo-converter]/bin
run edu.uiuc.ideals.conversion.oo.OOConverter [source-file]

Usage Options:

This class currently accepts the following command-line options (run it with the -h flag for a list of all options).

| flag | arg(s) | description |
| -c | [config-file] | specifies the Format Configuration file to load. Defaults to the internal ooconverter-formats.xml configuration file within the oo-converter.jar |
| -f | [format] | specifies the output format extension (acceptable values currently include 'pdf', 'xls', 'doc', 'ppt', 'html', 'rtf', 'csv', 'txt' or any OpenOffice.org 2.x extension). Defaults to converting [source-file] into the appropriate OpenOffice.org format. |
| -o | [out-file] | specifies the location of the output file (if unspecified, the new extension is just appended to the source file name) |
| -s | [server] | name of the server where the OpenOffice.org service is listening (default: localhost) |
| -p | [port] | port number where the OpenOffice.org service is listening (default: 9000) |
| -h | none | display all available command line options and exit immediately |
| -v | none | run in verbose mode |
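
Since the converter is meant for batch work, here is a small sketch of how the options above could be combined in a loop over a directory. The function name convert_all and the RUN_CMD variable are hypothetical helpers of mine. I am also assuming that the options may be placed before [source-file], which this page does not spell out, so check the -h output first.

```shell
# Path to the bundled run script; override RUN_CMD (e.g. with "echo") for a dry run.
RUN_CMD=${RUN_CMD:-./run}
CONVERTER_CLASS=edu.uiuc.ideals.conversion.oo.OOConverter

# convert_all <dir> <source-ext> <output-format> [server] [port]
# Sends every <dir>/*.<source-ext> file to the listening service, one at a time.
convert_all() {
  src_dir=$1
  src_ext=$2
  format=$3
  server=${4:-localhost}   # same default as the -s flag
  port=${5:-9000}          # same default as the -p flag
  for src in "$src_dir"/*."$src_ext"; do
    # Skip the unexpanded pattern when nothing matches.
    [ -e "$src" ] || continue
    "$RUN_CMD" "$CONVERTER_CLASS" -f "$format" -s "$server" -p "$port" "$src"
  done
}

# Example usage, run from [oo-converter]/bin:
#   convert_all /path/to/documents doc pdf
```

Because no -o flag is passed, each output file should land next to its source file, following the default naming described in the table.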

Additional Notes:

  • The [source-file] is required, and must be of a format which OpenOffice.org software can successfully open (e.g. Word, Excel, PowerPoint, RTF, CSV, HTML, or any OpenOffice.org format)
  • warning You may only convert to formats which are supported by OpenOffice.org, based on the format of the [source-file]. Take a look at the listing of currently supported OpenOffice.org 3.0 Import/Export Filters for more information. In this listing of Filters, the following is always true:
    • If the [source-file] is textual in nature, you can only convert to export formats which are listed for OpenOffice.org Writer
    • If the [source-file] is a presentation format, you can only convert to export formats which are listed for OpenOffice.org Impress
    • If the [source-file] is a spreadsheet format, you can only convert to export formats which are listed for OpenOffice.org Calc
    • etc.
  • info If you are having trouble converting to a specific format, first check to make sure it's a supported OpenOffice.org export filter. If it is, chances are I neglected to list it by default in the ooconverter-formats.xml file within the oo-converter.jar. You can use the -c flag to specify your own custom ooconverter-formats.xml to use (just make sure to follow the current structure of that config file).
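
The rule about source formats can be sketched as a simple lookup from file extension to the application whose export filters apply: textual sources use the Writer filters, spreadsheets the Calc filters, and presentations the Impress filters. The function name export_family and the extension lists are illustrative assumptions based on the formats mentioned on this page. The converter itself relies on ooconverter-formats.xml, not on anything like this.

```shell
# Print which application's export filters apply to a file, judged by its extension.
export_family() {
  # Take the text after the last dot and lowercase it.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    doc|rtf|txt|odt|sxw) echo Writer ;;    # textual documents
    xls|csv|ods|sxc)     echo Calc ;;      # spreadsheets
    ppt|odp|sxi)         echo Impress ;;   # presentations
    *)                   echo unknown ;;
  esac
}

# Example usage:
#   export_family report.doc
#   export_family budget.ods
```

A check like this can be handy before a large batch run, for example to avoid asking for CSV output from a presentation file.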

Running via Java API

You can also request a format conversion directly from another Java program, by accessing the Java API directly. In general, you would request a format conversion by doing the following:

  1. Call the connectListeningOpenOffice() method of the edu.uiuc.ideals.conversion.oo.OOConverter class, with the appropriate parameters for your listening OpenOffice.org service.
  2. Call one of the various convertDocument() methods on the initialized OOConverter object.

Running within DSpace

Running the converter on DSpace content requires the installation of the OpenOffice.org Media Filter for DSpace!

Media Filter for DSpace (optional)

Overview of Media Filter

The Media Filter allows you to perform format conversions of content currently being held in a local DSpace repository. It performs all format conversions using the above Converter, so it works in a very similar manner and has the same abilities and limitations as the Converter.

At a higher level, when the Media Filter executes, it performs the following processing in DSpace:

  1. It first searches DSpace for all Items which have file(s) with matching input formats (in the ORIGINAL bundle only)
  2. Each matching file is converted to the specified output format, using the Converter
  3. The output file is stored in the new CONVERSION bundle of the same Item.
    • The resulting file also inherits any access restrictions which were placed on the original content file.
    • The resulting file is given a description of "Automatically converted using OpenOffice.org" in order to store some very basic information about the provenance of this new file. (Currently nothing is actually stored to the dc.description.provenance metadata field)

You are able to configure which file formats you wish to perform format conversions on, as well as which output formats you are converting into. You may also configure the Media Filter to store the results into a different bundle (if you do not like the idea of a CONVERSION bundle).

Installing Media Filter for DSpace

If you are currently using DSpace as your repository software of choice, I've created a DSpace Media Filter which calls the Converter to automatically perform format conversions within DSpace.

Install Required Patch for DSpace

info This patch will officially become part of DSpace 1.5 out-of-the-box. So, once you upgrade to DSpace 1.5, no patch will be required!

Before you can successfully install the Media Filter, you must first update your DSpace installation with the following patch:

SourceForge Patch #1589429: "Self-Named" Media Filters (i.e. MediaFilter Plugins!)

I'll make all attempts to keep this patch up-to-date as best I can. But, if you cannot get it working, feel free to contact me, and I'll try to provide help (time-permitting, of course).

warning Be warned that this patch will attempt to update your existing Media Filter configurations within the dspace.cfg file. The structure of these configurations has now changed to allow for a single Media Filter to support multiple format transformations (in the past this was more of a one-to-one association).

Overview of Media Filter Installation

The Media Filter is released under the terms and conditions of the University of Illinois/NCSA Open Source License. The code and documentation are provided "as-is", with no promises of support or maintenance by the University of Illinois.

  1. Download the zipped up source code:
  2. Unzip the source file into the src/ directory within your DSpace source directory (i.e. [dspace-source]/src/)
    • This should only include a single source file:
  3. Copy your oo-converter.jar over into the DSpace source directory's lib/ directory (i.e. [dspace-source]/lib/)
  4. You will need to update the configuration for all DSpace Media Filters in your [dspace-source]/config/dspace.cfg configuration file. See the section on Updating Media Filter configuration in dspace.cfg below for more information on this.
  5. You will need to add all the OpenOffice.org formats to your DSpace Bitstream Format Registry. See the section on adding formats to DSpace Format Registry below.
  6. Recompile DSpace. From [dspace-source]:
    • ant clean
    • ant update
    • Since the OpenOfficeMediaFilter has no user interface, you probably aren't required to restart Tomcat. But, to be safe, you may want to!

Updating Media Filter configuration in dspace.cfg

The required patch (SourceForge Patch #1589429: "Self-Named" Media Filters (i.e. MediaFilter Plugins!)) slightly changes the way that Media Filters are configured within the dspace.cfg configuration file. The change is "for the better" (I believe), since it allows you to create Media Filters which are a little more dynamic in nature (you'll see what I mean in a moment).

Add OpenOfficeMediaFilter Configuration to dspace.cfg

You'll want to add the configuration settings specific to the OpenOfficeMediaFilter to your dspace.cfg (preferably under the Media Filter configuration section):

Configuration Notes:

  • Notice that the OpenOfficeMediaFilter is only defined once (in the first uncommented line).
  • However, I've "named" many different conversion plugins that this single Media Filter supports. For each "named" conversion plugin, I've defined acceptable inputFormats, as well as the conversion outputFormat.
    • e.g. the conversion plugin named "OO2Writer" accepts "Microsoft Word or RTF" as inputs, and produces an "OpenOffice Writer (2.x)" output file.
    • e.g. the conversion plugin named "OO2PDF" accepts Word, PPT, Impress, or Writer as inputs, and produces a PDF output file.
  • By default the documents which are the result of the conversion are saved to a new CONVERSION bundle in a DSpace Item. However, you can specify that a conversion be saved to a different bundle in a DSpace item, by specifying the following:
    • = MYBUNDLE
    • The above would cause the PDF output of the OO2PDF conversion to be saved to a new bundle called MYBUNDLE
    • warning It is highly recommended not to save the output of a conversion to the ORIGINAL bundle, since this may cause confusion as to which format of the file was the true original provided by the submitter!

info Each of the formats listed as inputFormats or outputFormat above MUST exist within your DSpace Bitstream Format Registry (available at http://your-dspace-url/dspace-admin/format-registry). Many of these formats may not already exist there, so you will need to add them exactly as they appear in your above configuration! See the section on adding formats to DSpace Format Registry below.

Enable the OpenOfficeMediaFilter plugin(s) you wish to use

Once you have the entire configuration for the OpenOfficeMediaFilter in your dspace.cfg, you need to determine which conversion plugins you wish to enable.

"Enabling" a conversion plugin means that plugin will run automatically whenever you schedule the DSpace bin/filter-media script to run. In general the following processing will occur:

  • First, the conversion plugin searches for all DSpace Items which have a file in the ORIGINAL bundle which matches one of the inputFormats for that plugin
    • e.g. the "OO2PDF" plugin will find all DSpace Items which have a file in the ORIGINAL bundle which is either Word, PowerPoint, Writer or Impress
  • The conversion plugin then performs the conversion using the Converter, and stores the result in the CONVERSION bundle by default.

You enable plugins by adding the plugin name to the end of the existing mediafilter.plugins field in dspace.cfg (If you do not see the mediafilter.plugins field, you most likely did not finish installing the "Self-Named" Media Filters (i.e. MediaFilter Plugins!) patch!)

For example, if you want to enable the "OO2PDF" and "OO2TXT" plugins, you would append them onto the existing list of mediafilter.plugins:

Adding formats to DSpace Format Registry

In order for DSpace to recognize OpenOffice.org formats properly, you will need to add the following formats to your DSpace Bitstream Format Registry (available at http://your-dspace-url/dspace-admin/format-registry).

| Name | MIME Type | Long Description | Support Level | Extensions |
| OpenOffice Writer (1.x) | application/ | OpenOffice.org Writer | Known | sxg, sxw |
| OpenOffice Writer (2.x) | application/vnd.oasis.opendocument.text | OpenOffice.org Writer | Known | odt, odm |
| OpenOffice Calc (1.x) | application/vnd.sun.xml.calc | OpenOffice.org Calc | Known | sxc |
| OpenOffice Calc (2.x) | application/vnd.oasis.opendocument.spreadsheet | OpenOffice.org Calc | Known | ods |
| OpenOffice Impress (1.x) | application/vnd.sun.xml.impress | OpenOffice.org Impress | Known | sxi |
| OpenOffice Impress (2.x) | application/vnd.oasis.opendocument.presentation | OpenOffice.org Impress | Known | odp |

Running the Media Filter for DSpace

Run using DSpace filter-media script

By default, the Media Filter works the same as all other DSpace Media Filters. You only need to schedule or manually run the following:

[dspace]/bin/filter-media

When filter-media is run, only enabled conversion plugins will execute. At this time, each enabled plugin will do the following:

  • First, each conversion plugin searches for all DSpace Items which have a file in the ORIGINAL bundle which matches one of the inputFormats for that plugin
    • e.g. the "OO2PDF" plugin will find all DSpace Items which have a file in the ORIGINAL bundle which is either Word, PowerPoint, Writer or Impress
  • The conversion plugin then performs the conversion using the Converter, and stores the result in the CONVERSION bundle by default.

Run a specific plugin

If you would like to manually execute a specific conversion plugin, you can call it directly via a new feature of filter-media. After applying the "Self-Named" Media Filters (i.e. MediaFilter Plugins!) patch, the filter-media script now has an option to call plugins by name using the -p flag.

For example, the following runs the "OO2PDF" plugin ONLY:

[dspace]/bin/filter-media -p OO2PDF

In addition, you can also provide a list of plugins! So, the following executes the "OO2PDF" and "OO2TXT" plugins ONLY:

[dspace]/bin/filter-media -p "OO2PDF","OO2TXT"

The double quotes in the above example are not required, unless the plugin has a space in the name!
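
If you call filter-media from a wrapper script, one way to handle plugin names safely is to join them with commas and pass the result as a single quoted argument, so that a name containing a space still arrives intact. This is only a sketch: the function names and the default install path in FILTER_MEDIA are my own placeholders.

```shell
# Hypothetical location of the script; adjust to your own [dspace] directory.
FILTER_MEDIA=${FILTER_MEDIA:-/dspace/bin/filter-media}

# Join all arguments into one comma-separated string.
join_plugins() {
  list=
  for name in "$@"; do
    list=${list:+$list,}$name
  done
  printf '%s' "$list"
}

# Run filter-media for the named plugins only, using the -p flag.
run_plugins() {
  "$FILTER_MEDIA" -p "$(join_plugins "$@")"
}

# Example usage:
#   run_plugins OO2PDF OO2TXT
```

Quoting the joined list as one argument has the same effect as the double quotes in the example above, without having to quote each name by hand.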

Topic attachments
| Attachment | Size | Date | Who | Comment |
| (zip file) | 1630.5 K | 16 Mar 2009 - 13:13 | TimDonohue | |
| (zip file) | 5.7 K | 23 Apr 2007 - 12:00 | TimDonohue | OpenOfficeMediaFilter for DSpace |
| open-office-server | 5.4 K | 16 Mar 2009 - 13:14 | TimDonohue | OpenOffice.org init.d startup script |
| open-office-server-debian | 6.2 K | 25 Oct 2007 - 09:24 | TimDonohue | OpenOffice.org init.d startup script for Debian |
Topic revision: r15 - 21 Oct 2009 - 14:00:48 - BillIngram
Copyright 2015 by University of Illinois at Urbana-Champaign.