Implementation

A fz_text structure represents a block of text. At the lowest level the constituents of a block are fz_text_items.

$\begin{lstlisting} typedef struct fz_text_item_s fz_text_item; \par struct fz_te... ...pings */ int ucs; /* -1 for one ucs to many gid mappings */ }; \end{lstlisting}$

The items can be thought of as the individual `characters' that make up the display, together with their position. Where possible, we attempt to give both the glyph id (gid) and the unicode value (ucs) for the character, but there are various cases where a 1-1 mapping is not possible.

Some unicode characters can result in a string of glyphs. The glyph ids will be sent in a series of fz_text_items, in which the first ucs value will be the source unicode character, and subsequent ones will be -1.

Some sequences of unicode characters can result in a single glyph. Again, a sequence of fz_text_items will be sent listing the unicode values, but all but the first item will have the gid value set to -1.

In more complex cases, sequences of unicode characters can be transformed into a sequence of glyphs, with no direct correspondence between the source text and the output characters. In this case as many fz_text_items as are required are used, with either the gid or ucs values padded out by -1s as necessary.

Different input formats offer the text in different forms. With PDF, the data within the file is (typically) in the form of glyph ids, and mechanisms are optionally provided to infer unicode values from them. Glyphs are sent in any order, and absolutely positioned on the page.

With XPS the input can be either in the form of unicode or glyph ids, and directionality information is encoded in the file. This means that the logical ordering of the glyphs is well defined.

Some formats, such as EPUB and HTML, send unicode text with even less positioning information, and rely on the interpreter to perform layout. Part of this process involves inferring directional information from the source text, and then using shaping mechanisms embedded within the font to do complex conversions to give the final positioned glyph sequences.

In all such cases MuPDF will preserve the logical ordering of the unicode entries, at the cost of drawing glyphs non-monotonically onto the page.

Sequences of fz_text_items that share the same characteristics are gathered together into fz_text_spans:

$\begin{lstlisting} struct fz_text_span_s { fz_font *font; fz_matrix trm; unsi... ...*/ int len, cap; fz_text_item *items; fz_text_span *next; }; \end{lstlisting}$

Sequences of these spans are then gathered up into a linked list rooted in a fz_text.

$\begin{lstlisting} struct fz_text_s { int refs; fz_text_span *head, *tail; }; \end{lstlisting}$