Binary and text documents

Introduction

In Orbeon Forms, XPL and pipelines only deal with XML documents. This means that only pure XML infosets circulate between processor outputs and processor inputs in a pipeline. However, there is often a need to handle non-XML data in pipelines, in particular:

  • Binary documents: any document that can be represented as a stream of bytes. In general this is the case of any document, but some document formats are almost always represented this way: images, sounds, PDF documents, etc.

  • Text documents: any document that can be represented as a stream of characters. Some documents are better looked at this way, like plain text files, HTML files, and even the textual representation of XML.

Orbeon Forms addresses this question by defining two standard XML document formats to embed binary and text documents within an XML infoset. This solution has the benefit of keeping XPL simple by limiting it to pure XML infosets, while allowing XPL to conveniently manipulate any binary and text document.

Binary documents

A binary document consists of a document root element containing character data encoded with Base64. The following attributes are supported:

  • xsi:type: mandatory, specifies the content as xs:base64Binary

  • content-type: optional, provides a content-type which may be used by the consumer

  • last-modified: optional, provides a last modification date which may be used by the consumer

  • status-code: optional, provides a status code which may be used by the consumer

  • filename: optional, provides a file name which may be used by the consumer

  • disposition-type: [SINCE Orbeon Forms 2017.1] optional, when filename is specified:

    • attachment: the default; the browser should download the document

    • inline: the browser should display the document inline

Example:

<document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xsi:type="xs:base64Binary"
    content-type="image/jpeg"
    last-modified="Sun, 23 Mar 2008 07:51:07 GMT">
/9j/4AAQSkZJRgABAQEBygHKAAD/2wBDAAQDAwQDAwQEBAQFBQQFBwsHBwYGBw4KCggLEA4R
...
KKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooA//2Q==
</document>

NOTE: For the curious, the Base64 encoding is documented in RFC 2045. This encoding represents binary data by mapping it to a set of 64 ASCII characters.

Such documents are not meant to be read by users, in the same way that regular binary files are not meant to be examined by users. Binary documents are generated by Orbeon Forms processors, like the URL generator and converters. They are consumed by processors like the HTTP serializer, the Email processor, and converters.
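To make the format concrete, the following Python sketch builds and reads a binary document outside of Orbeon Forms. It is an illustration only, not how Orbeon Forms processors are implemented: the function names make_binary_document and read_binary_document are hypothetical, and only the xsi:type, content-type, and filename attributes are handled.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XS_NS = "http://www.w3.org/2001/XMLSchema"


def make_binary_document(data, content_type=None, filename=None):
    """Wrap raw bytes in a <document> element typed as xs:base64Binary.

    Hypothetical helper, for illustration only.
    """
    attrs = {
        "xmlns:xsi": XSI_NS,
        "xmlns:xs": XS_NS,
        "xsi:type": "xs:base64Binary",
    }
    if content_type is not None:
        attrs["content-type"] = content_type
    if filename is not None:
        attrs["filename"] = filename
    attr_text = " ".join(
        f"{name}={quoteattr(value)}" for name, value in attrs.items()
    )
    # encodebytes wraps the output in lines of at most 76 characters.
    # The Base64 alphabet has no XML-special characters, so no escaping is needed.
    body = base64.encodebytes(data).decode("ascii")
    return f"<document {attr_text}>\n{body}</document>"


def read_binary_document(xml_text):
    """Return (bytes, content type or None) from a binary document."""
    root = ET.fromstring(xml_text)
    # By default b64decode discards characters outside the Base64 alphabet,
    # which removes the line breaks used to wrap the encoded text.
    return base64.b64decode(root.text or ""), root.get("content-type")
```

A consumer would typically look at content-type first to decide what to do with the decoded bytes, for example which HTTP header to send.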

Text documents

A text document consists of a document root element containing the text. The following attributes are supported:

  • xsi:type: mandatory, specifies the content as xs:string

  • content-type: optional, provides a content-type which may be used by the consumer

  • last-modified: optional, provides a last modification date which may be used by the consumer

Example:

<document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xsi:type="xs:string"
    content-type="text/plain"
    last-modified="Sun, 23 Mar 2008 07:51:07 GMT">
    This is line one of the input document!
    This is line two of the input document!
    This is line three of the input document!
</document>
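Because the text is carried as XML character data, characters such as < and & must be escaped when the document is written as textual XML. The following Python sketch shows one way to do this outside of Orbeon Forms. The helper names make_text_document and read_text_document are hypothetical, and the sketch is not taken from the Orbeon Forms implementation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XS_NS = "http://www.w3.org/2001/XMLSchema"


def make_text_document(text, content_type="text/plain"):
    """Wrap a string in a <document> element typed as xs:string.

    Hypothetical helper, for illustration only.
    """
    attrs = {
        "xmlns:xsi": XSI_NS,
        "xmlns:xs": XS_NS,
        "xsi:type": "xs:string",
        "content-type": content_type,
    }
    attr_text = " ".join(
        f"{name}={quoteattr(value)}" for name, value in attrs.items()
    )
    # escape() replaces &, < and > so the text survives as XML character data.
    return f"<document {attr_text}>{escape(text)}</document>"


def read_text_document(xml_text):
    """Return the text carried by a text document."""
    root = ET.fromstring(xml_text)
    return root.text or ""
```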

The content-type attribute may have a charset parameter providing a hint for the character encoding, for example:

<document
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xsi:type="xs:string"
    content-type="text/plain; charset=iso-8859-1"
    last-modified="Sun, 23 Mar 2008 07:51:07 GMT">
    This is line one of the input document!
    This is line two of the input document!
    This is line three of the input document!
</document>

Because XML character data itself is represented in Unicode (in other words, it is designed to allow representing all the characters specified by the Unicode specification in the same document), there is no requirement for specifying a character encoding in XML pipelines. However, when an XML infoset is read or written as a textual XML document, specifying a character encoding may be a useful hint. For example, a URL generator can use this mechanism to communicate to an HTTP serializer the preferred character encoding obtained when the document was read. The serializer may then use that hint, but it is by no means authoritative.

In general, XML documents can be read and written using the utf-8 character encoding, which allows representing all the Unicode characters.
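The following Python sketch illustrates how a consumer might treat the charset parameter as a hint rather than a requirement. It is an assumption about one reasonable policy, not a description of what the HTTP serializer actually does: the function names charset_hint and encode_text are hypothetical, and the fallback to utf-8 is a choice made for this example.

```python
from email.message import Message


def charset_hint(content_type, default="utf-8"):
    """Extract the charset parameter from a content-type value, if any."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


def encode_text(text, content_type):
    """Encode text with the hinted charset, falling back to utf-8.

    Returns (encoded bytes, name of the charset actually used).
    """
    charset = charset_hint(content_type)
    try:
        return text.encode(charset), charset
    except (LookupError, UnicodeEncodeError):
        # Unknown charset, or the text has characters the charset cannot
        # represent: the hint is not authoritative, so use utf-8 instead.
        return text.encode("utf-8"), "utf-8"
```

With this policy, text containing a character that iso-8859-1 cannot represent, such as the euro sign, is written as utf-8 even when the hint says iso-8859-1.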

Unlike binary documents, text documents can easily be examined by users. They can also be easily manipulated by languages such as XSLT. Like binary documents, they are generated by Orbeon Forms processors, like the URL generator and converters. They are consumed by processors like the HTTP serializer, the Email processor, and converters.

Streaming

Processors can stream binary and text documents by issuing a number of short character SAX events. It is therefore possible to generate "infinitely" long binary and text documents with a constant amount of memory, assuming both the sender and the receiver of the document are able to perform streaming. This is the case, for example, with the URL generator and the HTTP serializer.
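The idea can be sketched in Python with the standard SAX interfaces. This is a minimal illustration under stated assumptions, not the Orbeon Forms implementation: the function name stream_binary_document and the chunk size are hypothetical. The sender reads the source in fixed-size chunks and emits one short characters event per chunk, so memory use does not depend on the size of the document.

```python
import base64

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XS_NS = "http://www.w3.org/2001/XMLSchema"

# Arbitrary chunk size chosen for this example.
CHUNK_SIZE = 3 * 1024


def stream_binary_document(source, handler, content_type=None):
    """Send a binary document to a SAX ContentHandler chunk by chunk.

    `source` is any object with a read(n) method returning bytes.
    """
    attrs = {
        "xmlns:xsi": XSI_NS,
        "xmlns:xs": XS_NS,
        "xsi:type": "xs:base64Binary",
    }
    if content_type is not None:
        attrs["content-type"] = content_type

    handler.startDocument()
    handler.startElement("document", attrs)
    pending = b""
    while True:
        chunk = source.read(CHUNK_SIZE)
        if not chunk:
            break
        pending += chunk
        # Only encode a multiple of 3 bytes at a time, so that no "="
        # padding appears before the very end of the stream.
        usable = len(pending) - len(pending) % 3
        if usable:
            encoded = base64.b64encode(pending[:usable]).decode("ascii")
            handler.characters(encoded)
        pending = pending[usable:]
    if pending:
        # Final 1 or 2 leftover bytes, encoded with padding.
        handler.characters(base64.b64encode(pending).decode("ascii"))
    handler.endElement("document")
    handler.endDocument()
```

For example, passing an open file and an xml.sax.saxutils.XMLGenerator writing to another file serializes the document without ever holding it entirely in memory. A receiver that decodes each characters event as it arrives gets the same benefit on the consuming side.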