Document Metadata

Along with the HTML5 document file itself, Print2HTML5 can create a metadata file that describes the document. You may use the information from this file, for example, to build an index of your documents for a search engine.

You can turn on metadata file creation:

  • when converting documents manually - by setting the Create Metadata File option and choosing the file format in the Metadata File Format field on the Metadata tab of the Document Options window;
  • when converting documents programmatically - with the CreateMetaDataFile, MetaDataFileName and MetaDataFileFormat properties of the Profile object or with the CreateMetaDataFile, MetaDataFileName and MetaDataFileFormat options of Enhanced Batch Processing.

Metadata file formats

Two metadata file formats are supported:

  • XML format;
  • Plain text format.

The format can be chosen with the Metadata File Format field or the MetaDataFileFormat property mentioned above. Each format is described below.

XML format

The XML format provides a convenient and extensible way to describe the document. Print2HTML5 produces XML documents that follow this sample:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE Print2HTML5Doc>
<Print2HTML5Doc xmlns="http://print2html5.com">
<pages pagenum="10">
<page num="1" width="1632" height="2112" resolution="192">
  <text>Text of page 1</text>
</page>
<page num="2" width="1587" height="2245" resolution="192">
  <text>Text of page 2</text>
</page>
  ...
</pages>
</Print2HTML5Doc>

The generated XML file is encoded in UTF-8. You may parse this file with your own code or with third-party libraries or components. For example, on the Windows platform you may use Microsoft XML Core Services (MSXML). Each tag of the XML metadata file is described below.

Print2HTML5Doc tag

The root element of the XML document is the Print2HTML5Doc tag.

pages tag

Nested inside the root element is the pages tag, which encloses the descriptions of all document pages. The pages tag has a pagenum attribute that contains the total number of pages in the document.

page tag

Nested inside the pages tag are one or more page tags, each corresponding to a single document page. The page tag has the following attributes:

  • num - ordinal page number;
  • width - page width in pixels (dots);
  • height - page height in pixels (dots);
  • resolution - resolution at which the page is rendered, in dots per inch (DPI). You may calculate the physical page dimensions (in inches) by dividing the page width and height by this resolution value.
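
The division described for the resolution attribute can be sketched in a few lines of Python. This is only an illustration: the function name page_size_inches is made up for this example, and the numbers come from the first page of the sample XML above.

```python
def page_size_inches(width_px, height_px, resolution_dpi):
    """Convert page dimensions in pixels to inches by dividing by the DPI."""
    return width_px / resolution_dpi, height_px / resolution_dpi


# Values from the first sample page: 1632 x 2112 pixels at 192 DPI.
width_in, height_in = page_size_inches(1632, 2112, 192)
print(width_in, height_in)  # 8.5 11.0
```

For the sample page this gives 8.5 by 11 inches.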

text tag

The text tag is nested in a page tag. It contains the text that appears on the page. The order of the text corresponds to the order in which the printing application printed it. Note that printing applications may send some text as images rather than as text. Such text will not be present within the text tag.
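
As a rough sketch of how the tags described above might be read, the following Python code uses the standard xml.etree.ElementTree module. The function name read_metadata and the dictionary layout are assumptions made for this example, and the embedded sample is a shortened two-page copy of the sample XML above. Because the root element declares the http://print2html5.com namespace, element lookups need a namespace mapping.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of the sample metadata file.
NS = {"p": "http://print2html5.com"}

# Shortened stand-in for a metadata file (two pages only).
SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE Print2HTML5Doc>
<Print2HTML5Doc xmlns="http://print2html5.com">
<pages pagenum="2">
<page num="1" width="1632" height="2112" resolution="192">
  <text>Text of page 1</text>
</page>
<page num="2" width="1587" height="2245" resolution="192">
  <text>Text of page 2</text>
</page>
</pages>
</Print2HTML5Doc>
"""


def read_metadata(xml_bytes):
    """Return (total page count, list of per-page dictionaries)."""
    root = ET.fromstring(xml_bytes)
    pages_el = root.find("p:pages", NS)
    total = int(pages_el.get("pagenum"))
    pages = []
    for page_el in pages_el.findall("p:page", NS):
        text_el = page_el.find("p:text", NS)
        # A missing or empty text tag is treated as an empty string.
        text = (text_el.text or "") if text_el is not None else ""
        pages.append({
            "num": int(page_el.get("num")),
            "width": int(page_el.get("width")),
            "height": int(page_el.get("height")),
            "resolution": int(page_el.get("resolution")),
            "text": text,
        })
    return total, pages


total, pages = read_metadata(SAMPLE.encode("utf-8"))
print(total)             # 2
print(pages[0]["text"])  # Text of page 1
```

To read a file from disk instead of a string, ET.parse(path).getroot() can replace the ET.fromstring call.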

Plain text format

This format is simpler than the XML format but less flexible. A file generated in this format is a plain Unicode (UTF-16) text file in little-endian byte order. It contains the text of all document pages merged together, from the first page to the last. This format gives no way to tell which text belongs to which page. If you need that information, use the XML format.

At the beginning of the text metadata file, Print2HTML5 writes a two-byte byte order mark (0xFF followed by 0xFE). This mark designates little-endian byte order.
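
A minimal Python sketch of reading such a file follows. The function name read_text_metadata and the file name mydoc_md.txt are assumptions made for this example. The demonstration does not use a real Print2HTML5 file; it builds a stand-in with the byte layout described above.

```python
import os
import tempfile


def read_text_metadata(path):
    """Return the contents of a plain text metadata file as a str."""
    # The "utf-16" codec reads the byte order mark, picks the byte order
    # from it, and drops the mark from the returned text.
    with open(path, encoding="utf-16", newline="") as f:
        return f.read()


# Stand-in file: the bytes 0xFF 0xFE followed by UTF-16 little-endian text.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "mydoc_md.txt")
    with open(demo_path, "wb") as f:
        f.write(b"\xff\xfe" + "Text of page 1".encode("utf-16-le"))
    print(read_text_metadata(demo_path))  # Text of page 1
```

Passing newline="" keeps line endings exactly as they are stored in the file.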

Metadata file naming

The metadata file is named according to a name template. For example, you may create a metadata file with this name:

mydoc_md.xml

In programmatic conversion, file naming is controlled by the MetaDataFileName property of the Profile object or by the MetaDataFileName option of Enhanced Batch Processing. The file name may contain two placeholders, one for the output document file name and one for the file extension. For example, to create the file mentioned above when the output document name is mydoc.xml, use this value for the MetaDataFileName property:

%name%_md.%ext%
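
The placeholder substitution can be pictured with a short Python sketch. This is not how Print2HTML5 itself is implemented. The function expand_name_template is hypothetical, and the name and extension values are passed in by hand to match the example above.

```python
def expand_name_template(template, name, ext):
    """Replace the %name% and %ext% placeholders in a name template."""
    return template.replace("%name%", name).replace("%ext%", ext)


print(expand_name_template("%name%_md.%ext%", "mydoc", "xml"))  # mydoc_md.xml
```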

When converting documents manually, the file is named in a similar way and stored in the folder you specify when saving the document in the Print2HTML5 Application with the Save All button.