Isolated Textblock Rendering: AMANA's Approach to the Textblock Extraction Challenge
Approximately 50% of all ESEF reports are created from PDF documents. The reason is clear: PDF far exceeds Word in its ability to produce precise report layouts that meet the detailed requirements of a company’s IR department. Converting these PDF documents to XHTML reliably produces good results that convey the design language and message of a company.
The caveat is that PDF-to-XHTML conversions, for reasons explained further below, create textblock content that can be challenging to display when tagged in InlineXBRL reports. We at AMANA are trying to solve this challenge for everyone who would like to continue using PDF documents by releasing a plugin for the iXBRL Viewer of XBRL International that lets users view textblock content in its original layout, even when it is spread over multiple parts of the document.
Why is this needed?
Textblock extraction is currently one of the major topics within the XBRL community. While textblocks in InlineXBRL documents must be tagged under the ESEF requirement, it is still up for debate how their content (formatted or not) should or will be used. Efforts are underway in the ESEF Best Practice Group of XBRL Europe and in the Specifications Working Group of XBRL International to clarify the usage of textblock tags, and thus to derive the requirements for their content and to standardize how it should be displayed.
The most discussed challenge concerns the HTML styling of the extracted fact content. The source of this problem lies in the nature of HTML itself. Two things are usually necessary for correct rendering: the document tag structure and the corresponding CSS style rules. However, since the CSS is often placed in the head element of the HTML document (as recommended by ESMA), the extracted content of the textblock, which is only a snippet of the entire document, will not pick up any styles from the document's stylesheet when rendered in any software or browser. This situation is further complicated by the fact that stylesheet rules govern not only element styling (fonts, colors, etc.) but also the position of elements.
These problems usually occur with HTML documents that use a “fixed layout” to represent reports, such as reports produced from PDF files. The traditional “semantic layout” does not have this issue, but it is also less precise and therefore not suitable for booklet-quality documents.
Right now, two different approaches are available for proper rendering of textblocks:
1. Make all required styles local with the “style” attribute for the elements
2. Disallow any local styles and use only external class names
In both cases, the extracted content is simply inserted into the “body element” to obtain an isolated HTML file representing the textblock.
The first approach preserves the styling and produces formatting that is closer to the original layout. The second ignores the styling and renders textblocks as simple “flat” text fragments.
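To make the problem concrete, here is a minimal Python sketch of the naive isolation described above: a tagged element is copied into the body of a new, otherwise empty document. The sample XHTML, the id `tb`, the class `t1`, and the helper name `isolate_naively` are illustrative assumptions, not taken from a real report or from any viewer.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)

# Hypothetical fixed-layout source: the rule for class "t1" lives in <head>.
SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><style>.t1 { position: absolute; left: 40px; font-size: 9px; }</style></head>
<body><div class="page"><span id="tb" class="t1">Accounting policies</span></div></body>
</html>"""


def isolate_naively(source: str, element_id: str) -> str:
    """Copy one element into the body of a new, otherwise empty document."""
    root = ET.fromstring(source)
    fragment = next(el for el in root.iter() if el.get("id") == element_id)
    html = ET.Element(f"{{{XHTML_NS}}}html")
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    body.append(fragment)
    return ET.tostring(html, encoding="unicode")


isolated = isolate_naively(SOURCE, "tb")
print(isolated)
```

The isolated document still carries `class="t1"`, but the rule that gave the class its meaning stayed behind in the head of the source, so a browser would fall back to its default styling.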
At first glance, making all styles local (first approach) seems to solve every formatting issue; however, it does not. The position and the layout of elements usually depend on their containers. Therefore, depending on the context, it is sometimes necessary to have different styles for the same text fragments in the main document and in the document that represents the isolated textblock. This is not possible to achieve with local styling.
The image below is an example of incorrect styling that produces unreadable content:
Using external styling (second approach) makes things simpler on the one hand, but on the other it produces poor formatting that is difficult to compare with the original source. This happens, for example, with multi-column page layouts. The screenshot below illustrates this problem: the text is readable, but the formatting is not perfect, and the paragraph breaks are lost.
AMANA is now happy to announce that it has found a solution that will fix this issue. As you can see from the screenshot below, the textblock preserves the original formatting and it can be viewed as intended. This should greatly improve the accessibility of textblock-tags.
The idea is quite simple: instead of just copying the content of the textblock into the “body element”, our implementation changes how the isolated HTML file for the extracted textblock is created.
The following two steps are performed:
1. Extraction and copying of the required styles from the stylesheet of the source document
2. Generation of a hierarchy of the required parent containers before copying the textblock content
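The two steps above can be sketched roughly as follows. This is not the plugin's actual code, only a simplified Python illustration under stated assumptions: the source is well-formed XHTML with its CSS in `style` elements, the textblock is a single element found by a hypothetical id `tb`, and the CSS consists of flat class-based rules (no `@media` blocks, no continuations across pages). All function names and the sample document are invented for this example.

```python
import copy
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the XHTML namespace."""
    return f"{{{XHTML_NS}}}{tag}"


def ancestor_chain(root, target):
    """Return the target's ancestors below <body>, outermost first."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    chain = []
    node = parent_of.get(target)
    while node is not None and node.tag != q("body"):
        chain.append(node)
        node = parent_of.get(node)
    return list(reversed(chain))


def required_rules(css: str, classes: set) -> str:
    """Step 1: keep only the flat rules whose selector names a used class."""
    kept = []
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        if set(re.findall(r"\.([\w-]+)", selector)) & classes:
            kept.append(f"{selector.strip()} {{ {body.strip()} }}")
    return "\n".join(kept)


def isolate_with_context(source: str, element_id: str) -> str:
    root = ET.fromstring(source)
    target = next(el for el in root.iter() if el.get("id") == element_id)
    chain = ancestor_chain(root, target)

    # Classes used by the textblock itself and by its parent containers.
    classes = {c for node in target.iter() for c in node.get("class", "").split()}
    for ancestor in chain:
        classes.update(ancestor.get("class", "").split())

    css = "\n".join(style.text or "" for style in root.iter(q("style")))

    html = ET.Element(q("html"))
    head = ET.SubElement(html, q("head"))
    ET.SubElement(head, q("style")).text = required_rules(css, classes)

    # Step 2: rebuild empty copies of the parent containers, then add the content.
    container = ET.SubElement(html, q("body"))
    for ancestor in chain:
        container = ET.SubElement(container, ancestor.tag, dict(ancestor.attrib))
    fragment = copy.deepcopy(target)
    fragment.tail = None
    container.append(fragment)
    return ET.tostring(html, encoding="unicode")


# Hypothetical two-column page; only the tagged paragraph should survive.
SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><style>
.page { position: relative; width: 595px; height: 842px; }
.col2 { position: absolute; left: 300px; width: 250px; }
.t1 { font-size: 9px; }
.unused { color: red; }
</style></head>
<body><div class="page"><div class="col2"><p id="tb" class="t1">Accounting policies</p></div><p class="unused">Other text</p></div></body>
</html>"""

result = isolate_with_context(SOURCE, "tb")
print(result)
```

In this sketch the parent containers are recreated without their other children, so the textblock keeps the positioning context of its page and column while unrelated content and unused rules are left out. A production implementation would need a real CSS parser and handling for textblocks that continue across several parts of the document.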
To summarize the main points:
No additional software is needed to display the extracted textblock, and no assumptions have to be made about how it should look, because our implementation works as a plugin to the standard viewer
It works for most PDF-based reports, as most use the same PDF-to-XHTML converter
It is open source and freely available
It can be extended by other vendors to fit their output