Module encoding

Per-document /Differences-based encoding planning for the Core 14 Latin fonts.

§Problem

A PDF single-byte font can address at most 256 glyphs. For the Core 14 fonts, WinAnsiEncoding carries only the ~216 Latin-1 and Windows glyphs that every PDF reader ships built in (Annex D.2). Each Core 14 AFM, however, also lists 99 extra glyphs that have no WinAnsi byte: Latin Extended-A (Ł, ł, Ě, …), the Romanian comma-below set, the spacing diacritics, the math operators, and the fi/fl ligatures.

PDF’s escape hatch is the /Encoding dictionary with a /Differences array. It lets us declare “byte 0x7F means /lslash, byte 0x81 means /Lslash, byte 0x90 means /ecaron” and so on, layered on top of WinAnsiEncoding as the base. The glyph outlines still come from the reader’s built-in Helvetica/Times/Courier; we only rearrange which byte addresses which AFM glyph name. No font data ships.
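As an illustration of what such a dictionary looks like on the wire, here is a minimal sketch that renders slot → glyph-name pairs as an /Encoding dictionary. The function name `encoding_dict` is hypothetical and not part of this module; the sketch assumes each slot appears at most once.

```rust
/// Hypothetical helper (not this module's API): renders an /Encoding
/// dictionary with WinAnsiEncoding as the base and a /Differences array.
/// `pairs` maps a byte slot to an AFM glyph name; slots are assumed unique.
fn encoding_dict(pairs: &[(u8, &str)]) -> String {
    let mut sorted = pairs.to_vec();
    sorted.sort_by_key(|&(slot, _)| slot);

    let mut out =
        String::from("<< /Type /Encoding /BaseEncoding /WinAnsiEncoding /Differences [");
    let mut prev: Option<u8> = None;
    for (slot, name) in sorted {
        // In a /Differences array a code is followed by one or more names,
        // which are assigned to code, code + 1, and so on. A new code is
        // therefore only written when the run of consecutive slots breaks.
        let consecutive = prev.map_or(false, |p| p.checked_add(1) == Some(slot));
        if !consecutive {
            out.push_str(&format!(" {}", slot));
        }
        out.push_str(&format!(" /{}", name));
        prev = Some(slot);
    }
    out.push_str(" ] >>");
    out
}
```

For the three remaps quoted above, this yields a /Differences array of `[ 127 /lslash 129 /Lslash 144 /ecaron ]`.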

§Algorithm

For each Latin Core 14 face actually used by some text run:

  1. Walk every char of every run and partition into:

    • WinAnsi natives: have winansi_byte(ch) = Some(b). The byte b is claimed: it can’t be repurposed for a Differences remap because the content stream already uses it.
    • Extended: no winansi_byte, but extended_glyph_name(ch) resolves to an AFM glyph name. Needs a remapped slot.
    • Unmappable: neither lookup succeeds (Cyrillic, CJK, emoji). This should not occur in practice, because the layout engine replaces these with ? upstream; we still handle them defensively by emitting ? here.
  2. Allocate slots for the extended set from a deterministic free pool:

    a. The six WinAnsi gap bytes 0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D first; these are guaranteed unmapped in WinAnsiEncoding and produce stable golden output for the common case (≤ 6 extended glyphs).
    b. Then unused 0x20..=0xFF slots in descending order. Going high-to-low keeps short ASCII-heavy paragraphs from perturbing low-byte slots; documents with rich punctuation at 0xE0..=0xFF still get plenty of room from 0x20..=0x7E.

  3. If we exhaust the pool before placing every extended char, emit MOS0032 and drop the overflow (those chars render as ?).
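The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: `winansi_byte` and `extended_glyph_name` are tiny stand-ins for the real lookups, `Plan` and `plan` are hypothetical names, extended chars are placed in code-point order, and the byte for ? is claimed up front so the fallback always stays valid. The slot order is passed in so that overflow is easy to trigger.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for the module's `winansi_byte` (assumption): only printable
/// ASCII counts as WinAnsi-native in this sketch.
fn winansi_byte(ch: char) -> Option<u8> {
    if (' '..='~').contains(&ch) { Some(ch as u8) } else { None }
}

/// Stand-in for `extended_glyph_name` (assumption): covers three glyphs.
fn extended_glyph_name(ch: char) -> Option<&'static str> {
    match ch {
        'ł' => Some("lslash"),
        'Ł' => Some("Lslash"),
        'ě' => Some("ecaron"),
        _ => None,
    }
}

/// Hypothetical result type for the sketch.
struct Plan {
    differences: BTreeMap<u8, &'static str>, // slot -> AFM glyph name
    byte_for_char: BTreeMap<char, u8>,       // content-stream byte per char
    overflow: Vec<char>,                     // extended chars left without a slot
}

fn plan(chars: &str, order: &[u8]) -> Plan {
    let mut claimed: BTreeSet<u8> = BTreeSet::new();
    claimed.insert(b'?'); // assumption: keep the fallback byte reserved
    let mut extended: BTreeSet<char> = BTreeSet::new();
    let mut byte_for_char: BTreeMap<char, u8> = BTreeMap::new();

    // Step 1: partition into natives, extended, and unmappable.
    for ch in chars.chars() {
        if let Some(b) = winansi_byte(ch) {
            claimed.insert(b);
            byte_for_char.insert(ch, b);
        } else if extended_glyph_name(ch).is_some() {
            extended.insert(ch);
        } else {
            byte_for_char.insert(ch, b'?'); // defensive fallback
        }
    }

    // Step 2: hand out unclaimed slots in the given order.
    let mut free = order.iter().copied().filter(|b| !claimed.contains(b));
    let mut differences = BTreeMap::new();
    let mut overflow = Vec::new();
    for ch in extended {
        match (free.next(), extended_glyph_name(ch)) {
            (Some(slot), Some(name)) => {
                differences.insert(slot, name);
                byte_for_char.insert(ch, slot);
            }
            // Step 3: pool exhausted. The real planner emits MOS0032 here;
            // the sketch only records the char and falls back to '?'.
            _ => {
                overflow.push(ch);
                byte_for_char.insert(ch, b'?');
            }
        }
    }
    Plan { differences, byte_for_char, overflow }
}
```

Calling `plan("łŁě a", &[0x7F, 0x81])` places ě and Ł in the two available slots and reports ł as overflow, because the sketch allocates in code-point order.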

§Output

DocEncoding carries everything the PDF emit code needs: the /Differences pairs (slot → AFM glyph name) for the font dict, byte_for_char for the content-stream encoder, and to_unicode_entries for the /ToUnicode CMap so copy-paste keeps working.
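To make the /ToUnicode part concrete, the following sketch turns (byte, char) pairs into a `beginbfchar` section of a CMap. The function name `bfchar_section` and the tuple shape are assumptions for illustration; the actual layout of `to_unicode_entries` is not shown in this documentation, and the surrounding CMap boilerplate is omitted.

```rust
/// Hypothetical helper: renders (byte, char) pairs as a bfchar section.
/// Each line maps a one-byte code to its Unicode value in UTF-16BE hex.
fn bfchar_section(entries: &[(u8, char)]) -> String {
    let mut out = format!("{} beginbfchar\n", entries.len());
    for &(byte, ch) in entries {
        // Chars outside the BMP become a surrogate pair (two 16-bit units).
        let mut buf = [0u16; 2];
        let hex: String = ch
            .encode_utf16(&mut buf)
            .iter()
            .map(|unit| format!("{:04X}", unit))
            .collect();
        out.push_str(&format!("<{:02X}> <{}>\n", byte, hex));
    }
    out.push_str("endbfchar\n");
    out
}
```

With slot 0x7F remapped to ł, the entry reads `<7F> <0142>`, which is what lets a reader copy the original character back out of the page.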

Structs§

DocEncoding 🔒
Per-font planning output. The PDF emit path consumes this once per document.
EncodingPlanner 🔒
Two-phase encoding planner: the caller streams every (face, ch) pair through Self::observe, then calls Self::finalize to get one DocEncoding per Latin Core 14 face that participated.
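The two-phase shape of EncodingPlanner might look like the sketch below. `SketchPlanner` is a hypothetical stand-in: the real signatures of `observe` and `finalize` are not shown here, faces are keyed by plain strings, and `finalize` returns the collected chars per face instead of a DocEncoding.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the planner: phase one collects chars per
/// face, phase two consumes the planner and yields one result per face.
#[derive(Default)]
struct SketchPlanner {
    seen: BTreeMap<String, BTreeSet<char>>,
}

impl SketchPlanner {
    /// Phase one: record that `face` needs `ch`. Duplicates are absorbed
    /// by the set, so callers can stream every char of every run.
    fn observe(&mut self, face: &str, ch: char) {
        self.seen.entry(face.to_string()).or_default().insert(ch);
    }

    /// Phase two: one entry per face that was observed at least once.
    /// A real planner would compute a per-face encoding plan here.
    fn finalize(self) -> Vec<(String, Vec<char>)> {
        self.seen
            .into_iter()
            .map(|(face, chars)| (face, chars.into_iter().collect()))
            .collect()
    }
}
```

Taking `self` by value in `finalize` is one way to make observing after finalization a compile-time error.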

Functions§

allocation_order 🔒
Preferred slot-allocation order: the six WinAnsi gap bytes first (predictable golden output for ≤ 6 extended glyphs), then 0x20..=0xFF in descending order, excluding those same six bytes. We deliberately skip 0x00..=0x1F: PDF readers tolerate control bytes in /Differences, but content streams that need a Str(...) literal can run afoul of \0/\r/\n escaping. Allocating high-byte slots first keeps short paragraphs from perturbing low-byte slots.
plan_face 🔒
Computes the encoding plan for a single face given the set of chars the document needs from it.
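The slot order described for allocation_order can be sketched in a few lines. The return type (`Vec<u8>`) and the constant name `GAPS` are assumptions; only the ordering itself comes from the description above.

```rust
/// The six bytes that WinAnsiEncoding leaves unmapped.
const GAPS: [u8; 6] = [0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D];

/// Sketch of the preferred order: gap bytes first, then 0xFF down to 0x20
/// with the gap bytes filtered out. Control bytes 0x00..=0x1F never appear.
fn allocation_order() -> Vec<u8> {
    let mut order = GAPS.to_vec();
    order.extend((0x20u8..=0xFF).rev().filter(|b| !GAPS.contains(b)));
    order
}
```

The result has 224 entries: the six gaps followed by the remaining 218 bytes of 0x20..=0xFF from high to low. A planner would still have to skip any byte already claimed by a WinAnsi-native char.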