10.10 Text

MuPDF's central text type is the fz_text structure. The exact definition of this structure has evolved considerably to accommodate the needs of different input formats, and it may well continue to do so in future. Accordingly, we have hidden the implementation behind an interface.

Nonetheless, it is worth mentioning some of the design goals that have influenced the development of this area of the code.

As fz_text objects are the only text objects passed across the device interface, they need to encode several layers of information. For simple rendering devices, they need to be expressive enough to allow us to render exactly the glyphs specified. For text output devices, they need to be expressive enough to allow the Unicode values to be extracted.

Ideally, given any input format we would like to be able to extract any output format from it (including the same format) with no loss of data. This means that our fz_text objects need to be expressive enough to represent the superset of the functionality of all input formats out there, even if we do not currently make use of all the information.

It is attractive to think that a single representation could encapsulate each glyph from the text on the page in turn, but this is not the case. Indeed, it is not even possible to trivially define the order in which glyphs will be sent!

It would be nice to think that text would be held in the source file in the order in which it should be displayed on the page, but this is frequently not the case.

The ‘logical order’ of text can be thought of as being the order in which text would be read out loud, if you were reading from the page. In many cases (such as for EPUB files), this is the order in which the information is stored within the file itself. Sadly, for other formats this is not always the case.

PDF files in particular have no defined ordering in which text is sent. As each glyph is individually positioned on the page, files can (and do) send them in any order they like. While most PDF files containing European languages will tend to send text in the expected logical order, there is no guarantee of this. The likelihood becomes even more remote as we start to deal with right-to-left text, top-to-bottom text, Far Eastern scripts, or text in multiple scripts or languages.

The classic case where logical order may differ noticeably from rendered order is ‘bidirectional’ text. Even if the internal document representation is in logical order, the order in which the text will actually be displayed can be quite different. Consider, for instance, some source text in Hebrew. If the individual glyphs are A, B, C, D, etc., then the right-to-left nature of Hebrew means that these will be displayed in the order ‘DCBA’ on the page.

If, however, we have conventional western (Arabic) numerals on the page, interspersed within the Hebrew text, these are still written left-to-right. So A,B,C,D,1,2,3,4,E,F,G would appear as ‘GFE1234DCBA’.
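The reordering described above can be sketched in a few lines of C. This is only a toy, not the Unicode Bidirectional Algorithm and not MuPDF code: it assumes that every non-digit character is a right-to-left letter and that every run of ASCII digits is a left-to-right number. The function name toy_rtl_visual_order is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Toy visual reordering for a purely right-to-left paragraph.
 * Assumptions (for illustration only): every non-digit is a right-to-left
 * letter, and every run of ASCII digits is a left-to-right number.
 * 'visual' must have room for len + 1 bytes. */
void toy_rtl_visual_order(const char *logical, size_t len, char *visual)
{
    size_t out = 0;
    size_t i = len;

    /* Walk the logical string backwards, so that later characters are
     * emitted first (i.e. end up further to the left on the page). */
    while (i > 0)
    {
        if (isdigit((unsigned char)logical[i - 1]))
        {
            /* Find the start of this run of digits... */
            size_t end = i;
            size_t j;
            while (i > 0 && isdigit((unsigned char)logical[i - 1]))
                i--;
            /* ...and copy it forwards, so the number still reads
             * left-to-right. */
            for (j = i; j < end; j++)
                visual[out++] = logical[j];
        }
        else
        {
            /* A right-to-left letter: emit it as we meet it. */
            visual[out++] = logical[--i];
        }
    }
    visual[out] = '\0';
}
```

Calling toy_rtl_visual_order("ABCD1234EFG", 11, buf) leaves "GFE1234DCBA" in buf: the letters are reversed, while the digits keep their internal order.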

The algorithm dealing with such strings is fairly complex, so the interested reader is referred to the ‘Unicode Bidirectional Algorithm’, defined in Unicode Standard Annex #9 (formerly Technical Report 9) at http://unicode.org/reports/tr9.

The final dose of complexity comes from scripts that require ‘shaping’. While simple western scripts (broadly) have a direct relationship between the character sent (e.g. the letter ‘A’) and the shape used to represent it on the page (e.g. the glyph ‘A’), this does not hold true for all scripts.

The simplest example of this is a ligature. A piece of source text might contain the letter ‘f’ followed by the letter ‘i’ (perhaps in the word ‘file’). When typeset onto a page, rather than displaying the glyphs individually, a single combined glyph, the ‘fi’ ligature, is generally used.

This concept of there being a ‘transformation’ step from the input text to the output rendered form goes much further when dealing with non-western scripts. For Arabic and Indic scripts in particular (and Eastern scripts in general), groups of characters are frequently combined to give increasingly complex glyphs. This process is referred to as shaping, and it is generally applied after the bidirectional algorithm has been run.
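The ligature case, the simplest form of shaping, can be sketched as a substitution pass over the input characters. The sketch below is an illustration only: the name toy_apply_fi_ligature is hypothetical, and it stands in the Unicode presentation form U+FB01 (LATIN SMALL LIGATURE FI) for the combined glyph. A real shaper would instead consult the font's own substitution tables and produce glyph ids.

```c
#include <stddef.h>

#define TOY_LIGATURE_FI 0xFB01 /* U+FB01 LATIN SMALL LIGATURE FI */

/* Toy shaping pass: replace each 'f' followed by 'i' with one ligature.
 * 'in' holds n Unicode code points; 'out' must have room for n entries.
 * Returns the number of entries written to 'out'. */
size_t toy_apply_fi_ligature(const int *in, size_t n, int *out)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (in[i] == 'f' && i + 1 < n && in[i + 1] == 'i')
        {
            /* Two input characters become a single output shape. */
            out[count++] = TOY_LIGATURE_FI;
            i++; /* Skip the 'i' we have just consumed. */
        }
        else
        {
            out[count++] = in[i];
        }
    }
    return count;
}
```

Given the four code points of ‘file’, this returns 3: the ligature followed by ‘l’ and ‘e’. The output is shorter than the input, which is why characters and glyphs cannot be assumed to correspond one-to-one.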

Different source formats cope with this in different ways. The text strings within a PDF file have already had the layout and shaping process applied: they are literally a list of positioned glyphs to be displayed on the page. Each glyph is identified by a ‘glyph-id’, a simple index of the glyph within a font, with no meaning other than that. The Unicode values for the original text are frequently not there at all (and when they are, they require specific work to derive).

Other formats, such as EPUB, take the opposite approach, by specifying the Unicode values directly, and leaving the displaying application (i.e. MuPDF) to do the conversion to glyphs (including the ‘shaping’ operation).

To cope with these different input requirements, and to allow us to translate one format into another, we require fz_text objects to encapsulate both forms of data at the same time.

Accordingly, our fz_text object represents a block of text, including font style and position, together with both Unicode and glyph data (subject to the availability of the information in the original file). Where possible we try to provide this information in logical order, though no guarantee can be made of this.
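To make the idea of carrying both forms of data concrete, here is a minimal sketch of such a record. It is emphatically not the real fz_text layout, which is hidden behind the interface and may change. The type toy_text_item, its fields, and both helper functions are hypothetical names chosen for illustration, as is the convention of using -1 to mean ‘not available’.

```c
#include <stddef.h>

/* Hypothetical per-glyph record; NOT the real fz_text layout. */
typedef struct
{
    float x, y; /* Position of the glyph on the page. */
    int gid;    /* Glyph id within the font, or -1 if not available. */
    int ucs;    /* Unicode code point, or -1 if not available. */
} toy_text_item;

/* What a text output device might do: pull out the Unicode values.
 * Items with no Unicode value become U+FFFD (REPLACEMENT CHARACTER).
 * 'out' must have room for n entries. Returns the number written. */
size_t toy_extract_unicode(const toy_text_item *items, size_t n, int *out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = (items[i].ucs >= 0) ? items[i].ucs : 0xFFFD;
    return n;
}

/* What a rendering device might care about: how many items carry a
 * glyph id that can be drawn directly. */
size_t toy_count_renderable(const toy_text_item *items, size_t n)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (items[i].gid >= 0)
            count++;
    return count;
}
```

In this sketch, an item from a PDF file might have a valid gid but a ucs of -1, while an item from an EPUB file before shaping might have the reverse. Keeping both fields in one record is what allows either kind of device to consume the same object.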

If more information is required, details of the current implementation are given in chapter 25, Text Internals, in Part 2. Otherwise, simply treat fz_text as a black box.