URL generator
Introduction
Generators are a special category of processors that have no XML data inputs, only outputs. They are generally used at the top of an XML pipeline to generate XML data from a Java object or other non-XML source.
The URL generator fetches a document from a URL and produces an XML output document. The protocols supported are http:, https:, and file:, as well as the Orbeon Forms resource protocol (oxf:). See Resource Managers for more information about the oxf: protocol.
Content type
The URL generator operates in several modes depending on the content type of the source document. The content type is determined according to the following priorities:
1. Use the content type in the content-type element of the configuration if force-content-type is set to true.
2. Use the content type set by the connection (for example, the content type sent with the document by an HTTP server), if any. Note that when using the oxf: or file: protocol, the connection content type is never available. When using the http: protocol, the connection content type may or may not be available depending on the configuration of the HTTP server.
3. Use the content type in the content-type element of the configuration, if specified.
4. Use application/xml.
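As a minimal sketch of the highest-priority rule, the configuration below forces a content type regardless of what the connection reports. Only content-type and force-content-type are named in this section; the config root element, the url element, and the sample URL are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical URL, for illustration only -->
    <url>http://example.org/data/report</url>
    <!-- Content type to apply to the fetched document -->
    <content-type>application/xml</content-type>
    <!-- Makes the configured content type win over the one sent by the server -->
    <force-content-type>true</force-content-type>
</config>
```

Without force-content-type, the same content-type element would only act as a fallback when the connection provides no content type.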
In addition, it is possible to force the mode using the <mode> configuration element:
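A minimal sketch of forcing the mode follows. The mode values used here (xml, html, text, json, binary) are the ones named in the mode sections of this page; the config root, the url element, and the sample URL are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical URL, for illustration only -->
    <url>http://example.org/data/report</url>
    <!-- Forces XML mode whatever content type is determined;
         other values described on this page: html, text, json, binary -->
    <mode>xml</mode>
</config>
```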
XML mode
The XML mode is selected when:
- the content type is text/xml, application/xml, or ends with +xml according to the selection algorithm above
- the xml mode is forced using the <mode> configuration element
The generator fetches the specified URL and parses the XML document.
The following options are available:
- validating: if set to true, a validating parser (using a DTD) is used, otherwise a non-validating parser is used. Default: false.
- handle-xinclude: if set to true, handle XInclude inclusions during parsing. Default: true.
- external-entities: if set to true, external entities are processed. Default: false.
- handle-lexical: if set to true, propagate XML comments present in the input. Default: true.
Example:
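The following is a sketch of an XML-mode declaration inside a pipeline, not a definitive configuration. The four parsing options come from the list above; the processor name oxf:url-generator, the namespace URIs, the url element, the data output, and the resource path are assumptions based on usual Orbeon pipeline conventions.

```xml
<p:processor name="oxf:url-generator"
             xmlns:p="http://www.orbeon.com/oxf/pipeline"
             xmlns:oxf="http://www.orbeon.com/oxf/processors">
    <p:input name="config">
        <config>
            <!-- Hypothetical resource path -->
            <url>oxf:/urlgen/note.xml</url>
            <content-type>application/xml</content-type>
            <!-- Default is false: no DTD validation -->
            <validating>false</validating>
            <!-- Default is true; disabled here so XInclude elements pass through untouched -->
            <handle-xinclude>false</handle-xinclude>
            <!-- Default is false; keep it false for untrusted documents -->
            <external-entities>false</external-entities>
            <!-- Default is true: XML comments are propagated -->
            <handle-lexical>true</handle-lexical>
        </config>
    </p:input>
    <p:output name="data" id="xml"/>
</p:processor>
```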
If the URL is an HTTP or HTTPS URL and the server returns a non-success status code, an exception is raised.
NOTE: The URL must point to a well-formed XML document. If it doesn't, an exception is raised.
NOTE: Be careful when setting external-entities to true, as non-trusted documents with external entities could be used by malicious users to inject content into your XML document.
HTML mode
The HTML mode is selected when:
- the content type is text/html according to the selection algorithm above
- the html mode is forced using the <mode> configuration element
In this mode, the URL generator uses HTML Tidy to transform HTML into XML. This feature is useful to later extract information from HTML using XPath.
Examples:
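The sketch below shows two possible HTML-mode declarations in one pipeline: the first relies on the content type, the second forces the mode. The tidy-options and quiet elements are named on this page; the processor name, namespace URIs, url element, output ids, and URLs are assumptions for illustration.

```xml
<p:config xmlns:p="http://www.orbeon.com/oxf/pipeline"
          xmlns:oxf="http://www.orbeon.com/oxf/processors">

    <!-- Example 1: HTML mode selected through the content type -->
    <p:processor name="oxf:url-generator">
        <p:input name="config">
            <config>
                <!-- Hypothetical URL -->
                <url>http://example.org/index.html</url>
                <!-- Used only if the server does not send a content type -->
                <content-type>text/html</content-type>
                <tidy-options>
                    <!-- Suppresses HTML Tidy console messages -->
                    <quiet>true</quiet>
                </tidy-options>
            </config>
        </p:input>
        <p:output name="data" id="page-by-content-type"/>
    </p:processor>

    <!-- Example 2: HTML mode forced with the mode element -->
    <p:processor name="oxf:url-generator">
        <p:input name="config">
            <config>
                <!-- Hypothetical URL -->
                <url>http://example.org/legacy-page</url>
                <mode>html</mode>
                <tidy-options>
                    <quiet>true</quiet>
                </tidy-options>
            </config>
        </p:input>
        <p:output name="data" id="page-by-mode"/>
    </p:processor>

</p:config>
```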
The <tidy-options> part of the configuration in the two examples above is optional. However, by default quiet is set to false, which causes HTML Tidy to output messages to the console when it finds invalid HTML. To prevent this, add a <tidy-options> section to the configuration with quiet set to true.
Although HTML Tidy has some tolerance for malformed HTML, you should use well-formed HTML whenever possible.
If the URL is an HTTP or HTTPS URL and the server returns a non-success status code, an exception is raised.
Text mode
The text mode is selected when:
- the content type according to the selection algorithm above starts with text/ and is different from text/html or text/xml, for example text/plain
- the text mode is forced using the <mode> configuration element
In this mode, the URL generator reads the input as a text file and produces an XML document containing the text read.
Example:
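A minimal text-mode configuration might look like the sketch below. Because the oxf: protocol never provides a connection content type, the configured content-type is what selects text mode here. The config root, the url element, and the resource path are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical resource path -->
    <url>oxf:/list.txt</url>
    <!-- Starts with text/ and is neither text/html nor text/xml, so text mode applies -->
    <content-type>text/plain</content-type>
</config>
```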
Assume the input document contains the following text:
The resulting document consists of a document root element containing the text according to the text document format. The following attributes are present:
- xsi:type, set to xs:string
- content-type, if known
- status-code, if the resource was retrieved through HTTP or HTTPS
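To make the shape of the output concrete, here is a sketch of what the generated document could look like for a hypothetical two-line text file read through the oxf: protocol. The sample text is invented for illustration, and the namespace declarations are the standard XML Schema ones; no status-code attribute appears because the resource is not retrieved over HTTP or HTTPS.

```xml
<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xsi:type="xs:string"
          content-type="text/plain">This is line one of the input document!
This is line two of the input document!</document>
```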
NOTE: The URL generator performs streaming. It generates a stream of short character SAX events. It is therefore possible to generate an "infinitely" long document with a constant amount of memory, assuming the generator is connected to other processors that do not require storing the entire stream of data in memory, for example the SQL processor (with an appropriate configuration to stream BLOBs) or the HTTP serializer.
JSON mode
[SINCE Orbeon Forms 2016.2]
The JSON mode is selected when:
- the content type is application/json according to the selection algorithm above
- the json mode is forced using the <mode> configuration element
In this mode, the URL generator uses the XForms 2.0 conversion scheme to convert the incoming JSON content to XML.
[SINCE Orbeon Forms 2017.1]
In addition to the application/json mediatype, mediatypes of the form a/b+json are recognized.
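As a sketch, JSON mode can be requested explicitly when a service does not report a JSON content type. The config root, the url element, and the service URL are assumptions for illustration; the shape of the converted XML is defined by the XForms 2.0 conversion scheme and is not reproduced here.

```xml
<config>
    <!-- Hypothetical JSON service URL -->
    <url>http://example.org/api/items</url>
    <!-- Forces JSON mode even if the server sends another content type -->
    <mode>json</mode>
</config>
```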
Binary mode
The binary mode is selected when:
- the content type is neither one of the XML content types nor one of the text/* content types
- the binary mode is forced using the <mode> configuration element
In this mode, the URL generator uses a Base64 encoding to transform binary content into XML according to the binary document format. For example:
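A binary-mode configuration could be sketched as follows. The config root, the url element, and the resource path are assumptions for illustration; the content type is only a fallback, which is sufficient here because the oxf: protocol never supplies one.

```xml
<config>
    <!-- Hypothetical resource path -->
    <url>oxf:/images/logo.png</url>
    <!-- Neither an XML nor a text/* content type, so binary mode applies -->
    <content-type>image/png</content-type>
</config>
```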
The resulting document consists of a document root node containing character data encoded with Base64. The following attributes are present:
- xsi:type, set to xs:base64Binary
- content-type, if known
- status-code, if the resource was retrieved through HTTP or HTTPS
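For illustration, the sketch below shows the output for a hypothetical 13-byte resource containing the ASCII bytes of "Hello, world!", served with the content type application/octet-stream through the oxf: protocol. The payload and content type are invented for the example; the namespace declarations are the standard XML Schema ones.

```xml
<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xsi:type="xs:base64Binary"
          content-type="application/octet-stream">SGVsbG8sIHdvcmxkIQ==</document>
```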
NOTE: The URL generator performs streaming. It generates a stream of short character SAX events. It is therefore possible to generate an "infinitely" long document with a constant amount of memory, assuming the generator is connected to other processors that do not require storing the entire stream of data in memory, for example the SQL processor (with an appropriate configuration to stream BLOBs) or the HTTP serializer.
Character encoding
For text and XML, the character encoding is determined as follows:
1. Use the encoding in the encoding element of the configuration if force-encoding is set to true.
2. Use the encoding set by the connection (for example, the encoding sent with the document by an HTTP server), if any, unless ignore-connection-encoding is set to true (for XML documents, precedence is given to the connection encoding as per RFC 3023). Note that when using the oxf: or file: protocol, the connection encoding is never available. When using the http: protocol, the connection encoding may or may not be available depending on the configuration of the HTTP server. The encoding is specified along with the content type in the content-type header, as a charset parameter.
3. Use the encoding in the encoding element of the configuration, if specified.
4. For XML, the character encoding is determined automatically by the XML parser.
5. For text, including HTML: use the default of iso-8859
When reading XML documents, the preferred method of determining the character encoding is to let either the connection or the XML parser auto detect the encoding. In some instances, it may be necessary to override the encoding. For this purpose, the force-encoding and encoding elements can be used to override this default behavior, for example:
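A sketch of such an override follows. The encoding and force-encoding elements are named above; the config root, the url element, the resource path, and the chosen encoding are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical resource path -->
    <url>oxf:/urlgen/note.xml</url>
    <content-type>application/xml</content-type>
    <!-- Encoding to use instead of the one declared by the document or connection -->
    <encoding>iso-8859-1</encoding>
    <!-- Gives the configured encoding the highest priority -->
    <force-encoding>true</force-encoding>
</config>
```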
This use should be reserved for cases where it is known that a document specifies an incorrect encoding and it is not possible to modify the document.
HTML example:
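The sketch below sets an encoding for an HTML document without forcing it, so it only applies when the connection provides no encoding. The config root, the url element, and the resource path are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical resource path; oxf: never supplies a connection encoding -->
    <url>oxf:/urlgen/example.html</url>
    <content-type>text/html</content-type>
    <!-- One of the encodings supported for HTML documents -->
    <encoding>utf-8</encoding>
</config>
```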
Note that only the following encodings are supported for HTML documents:
iso-8859-1
utf-8
Also note that use of the HTML <meta> tag to specify the encoding from within an HTML document is not supported.
HTTP headers
When retrieving a document from an HTTP server, you can optionally specify the headers sent to the server by adding one or more header elements, as illustrated in the example below:
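A possible configuration with two custom headers is sketched here. This section names only the header element; the name and value child elements, the config root, the url element, and the header values are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical URL -->
    <url>http://example.org/index.html</url>
    <content-type>text/html</content-type>
    <!-- One header element per header to send (child element names are assumed) -->
    <header>
        <name>User-Agent</name>
        <value>Mozilla/5.0</value>
    </header>
    <header>
        <name>Accept-Language</name>
        <value>en-us,fr-fr</value>
    </header>
</config>
```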
In addition, you can provide a list of space-separated header names with the forward-headers element. Any header listed is automatically forwarded if it exists in the incoming request:
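As a sketch, the configuration below forwards two headers from the incoming request when they are present. The config root, the url element, the URL, and the choice of header names are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical URL -->
    <url>http://example.org/service</url>
    <!-- Space-separated names of incoming request headers to forward -->
    <forward-headers>Authorization Accept-Language</forward-headers>
</config>
```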
Headers specified with the header element have precedence over forward-headers.
Cache control
Local cache
It is possible to configure whether the URL generator caches documents locally in the Orbeon Forms cache. By default, it does. To disable caching, use the cache-control/use-local-cache element, for example:
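A minimal sketch of disabling the local cache follows. The cache-control and use-local-cache elements come from the text above; the config root, the url element, and the URL are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical URL -->
    <url>http://example.org/index.html</url>
    <content-type>text/html</content-type>
    <cache-control>
        <!-- The document is fetched every time and never stored in the Orbeon Forms cache -->
        <use-local-cache>false</use-local-cache>
    </cache-control>
</config>
```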
Using the local cache causes the URL generator to check if the document is in the Orbeon Forms cache first. If it is, its validity is checked with the protocol handler (looking at the last modified date for files, the last-modified header for http, etc.). If the cached document is valid, it is used. Otherwise, it is fetched and put in the cache.
When the local cache is disabled, the document is never revalidated and always fetched.
Conditional GET
Usually, the URL generator does forced GET requests. You can enable conditional GETs with the cache-control/conditional-get element.
When conditional-get is set to true, and if the URL generator finds a corresponding resource in its local cache, it sends a conditional HTTP GET using the If-Modified-Since header. If the server responds with a code 304, the URL generator uses the resource it holds in cache, following usual HTTP semantics.
Example of configuration:
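The sketch below enables conditional GET. Because this page states that handle-xinclude set to true overrides conditional-get to false, and handle-xinclude defaults to true, the example turns XInclude handling off explicitly. The config root, the url element, the URL, and the element order are assumptions for illustration.

```xml
<config>
    <!-- Hypothetical URL -->
    <url>http://example.org/feed.xml</url>
    <!-- Must be false, otherwise conditional-get is overridden to false -->
    <handle-xinclude>false</handle-xinclude>
    <cache-control>
        <!-- Sends If-Modified-Since when a cached copy exists; also implies use-local-cache = true -->
        <conditional-get>true</conditional-get>
    </cache-control>
</config>
```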
Relation to other settings:
- When handle-xinclude is set to true, conditional-get is automatically overridden to false.
- When conditional-get is set to true, use-local-cache is automatically overridden to true as well.
Authentication
The simplest way to handle authentication is to embed user names and passwords in the URL:
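For illustration, credentials can be embedded using the standard user-info part of a URL, as sketched below. The config root, the url element, the host, and the credentials are hypothetical.

```xml
<config>
    <!-- Hypothetical host and credentials, in the form scheme://username:password@host/path -->
    <url>http://jdoe:secret@example.org/protected/data.xml</url>
</config>
```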
In that case the default authentication parameters are applied: preemptive authentication is used and forces the HTTP basic scheme.
If you don't want to embed user names and passwords in URLs or need more control over authentication schemes, you can use an authentication element:
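A sketch of such a configuration follows. The username, password, preemptive, and domain elements are the ones this section describes; placing authentication directly under a config root next to a url element is an assumption, and the URL and credentials are hypothetical.

```xml
<config>
    <!-- Hypothetical URL -->
    <url>http://example.org/protected/data.xml</url>
    <authentication>
        <!-- Hypothetical credentials -->
        <username>jdoe</username>
        <password>secret</password>
        <!-- false: let the server request a basic or digest scheme -->
        <preemptive>false</preemptive>
        <!-- A domain element is omitted here; its presence would select NTLM -->
    </authentication>
</config>
```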
- The username and password elements are self-explanatory and contain the username and password.
- When preemptive is set to false, the preemptive mode is switched off and the URL generator will use a basic or digest scheme as requested by the server.
- When the domain element is present, the NTLM authentication scheme is used with this domain name.
Relative URLs
URLs passed to the URL generator can be relative. For example, consider the following pipeline fragment declared in a file called oxf:/my-pipelines/backend/import.xpl:
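The fragment below is a sketch rather than the original example. The relative URL ../../documents/claim.xml is an assumption, chosen because it resolves against oxf:/my-pipelines/backend/import.xpl to the result stated in this section; the processor name, namespace URIs, url element, and output id are also assumed.

```xml
<p:processor name="oxf:url-generator"
             xmlns:p="http://www.orbeon.com/oxf/pipeline"
             xmlns:oxf="http://www.orbeon.com/oxf/processors">
    <p:input name="config">
        <config>
            <!-- Relative to the pipeline file: up from backend/, up from my-pipelines/, then down -->
            <url>../../documents/claim.xml</url>
        </config>
    </p:input>
    <p:output name="data" id="claim"/>
</p:processor>
```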
In this case, the URL resolves to: oxf:/documents/claim.xml.