Per-document /Differences-based encoding planning for the Core 14
Latin fonts.
§Problem
PDF single-byte fonts address at most 256 glyphs. For the Core 14
fonts, WinAnsiEncoding carries only the ~216 Latin-1+Windows glyphs
that every PDF reader ships built-in (Annex D.2). But each Core 14 AFM
also lists 99 extra glyphs that have no WinAnsi byte: Latin
Extended-A (Ł, ł, Ě, …), the Romanian comma-below set, the spacing
diacritics, the math operators, and the fi/fl ligatures.
PDF’s escape hatch is the /Encoding dictionary with a
/Differences array: it lets us declare “byte 0x7F means
/lslash, byte 0x81 means /Lslash, byte 0x90 means /ecaron”
and so on, sitting on top of WinAnsiEncoding as the base. The
glyph outlines still come from the reader’s built-in Helvetica/
Times/Courier; we just rearrange which byte addresses which glyph
name from the AFM. No font data ships.
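As an illustration of the remapping described above, here is a minimal sketch of how such an /Encoding dictionary could be serialized. The function name encoding_dict and its signature are hypothetical and not taken from this module; the sketch assumes the pairs arrive sorted by byte and folds consecutive bytes into one run, which the /Differences syntax allows.

```rust
use std::fmt::Write;

/// Hypothetical helper: serializes (byte, glyph name) pairs as an /Encoding
/// dictionary layered on WinAnsiEncoding. `pairs` is assumed sorted by byte.
fn encoding_dict(pairs: &[(u8, &str)]) -> String {
    let mut out =
        String::from("<< /Type /Encoding /BaseEncoding /WinAnsiEncoding /Differences [");
    let mut prev: Option<u8> = None;
    for &(byte, name) in pairs {
        // In a /Differences array a number starts a run; the names after it
        // are assigned to consecutive codes. Emit the number only when this
        // byte does not directly follow the previous one.
        let contiguous = matches!(prev, Some(p) if p.checked_add(1) == Some(byte));
        if !contiguous {
            write!(out, " {}", byte).unwrap();
        }
        write!(out, " /{}", name).unwrap();
        prev = Some(byte);
    }
    out.push_str(" ] >>");
    out
}
```

For the three example bytes from the text, this yields a /Differences array of the form [ 127 /lslash 129 /Lslash 144 /ecaron ].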
§Algorithm
For each Latin Core 14 face actually used by some text run:
- Walk every char of every run and partition into:
  - WinAnsi natives: have winansi_byte(ch) = Some(b). The byte b is claimed: it can’t be repurposed for a Differences remap because the content stream already uses it.
  - Extended: no winansi_byte, but extended_glyph_name(ch) resolves to an AFM glyph name. Needs a remapped slot.
  - Unmappable: neither (Cyrillic, CJK, emoji). Won’t occur in practice; the layout engine replaces these with ? upstream. We treat them defensively as ? here.
- Allocate slots for the extended set from a deterministic free pool:
  a. The six WinAnsi gap bytes 0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D first; these are guaranteed unmapped in WinAnsiEncoding and produce stable golden output for the common case (≤ 6 extended glyphs).
  b. Then unused 0x20..=0xFF slots in descending order. Going high-to-low keeps short ASCII-heavy paragraphs from perturbing low-byte slots; documents with rich punctuation at 0xE0..0xFF still get plenty of room from 0x20..0x7E.
- If we exhaust the pool before placing every extended char, emit MOS0032 and drop the overflow (those chars render as ?).
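The three steps can be sketched roughly as follows. This is not the module's implementation: winansi_byte and extended_glyph_name below are drastically simplified stand-ins for the real lookups (ASCII plus the Latin-1 high range, and a four-entry glyph table), and Plan and plan are hypothetical names. The sketch also assumes the ? byte is always reserved so the fallback is never remapped.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Simplified stand-in: printable ASCII and 0xA0..=0xFF only (assumption).
fn winansi_byte(ch: char) -> Option<u8> {
    let c = ch as u32;
    if (0x20..=0x7E).contains(&c) || (0xA0..=0xFF).contains(&c) {
        Some(c as u8)
    } else {
        None
    }
}

// Simplified stand-in: a tiny excerpt of the AFM glyph names (assumption).
fn extended_glyph_name(ch: char) -> Option<&'static str> {
    match ch {
        'Ł' => Some("Lslash"),
        'ł' => Some("lslash"),
        'Ě' => Some("Ecaron"),
        'ě' => Some("ecaron"),
        _ => None,
    }
}

// Hypothetical result type for this sketch.
struct Plan {
    differences: BTreeMap<u8, &'static str>, // slot -> AFM glyph name
    byte_for_char: BTreeMap<char, u8>,       // content-stream encoding
    overflow: Vec<char>,                     // chars that found no slot
}

fn plan(chars: &BTreeSet<char>, pool: &[u8]) -> Plan {
    let mut claimed = BTreeSet::new();
    claimed.insert(b'?'); // keep the fallback byte out of the pool
    let mut byte_for_char = BTreeMap::new();
    let mut extended = Vec::new();

    // Step 1: partition into natives, extended, and unmappable.
    for &ch in chars {
        if let Some(b) = winansi_byte(ch) {
            claimed.insert(b);
            byte_for_char.insert(ch, b);
        } else if let Some(name) = extended_glyph_name(ch) {
            extended.push((ch, name));
        } else {
            byte_for_char.insert(ch, b'?');
        }
    }

    // Step 2: hand out free slots in the order given by `pool`.
    let mut free = pool.iter().copied().filter(|b| !claimed.contains(b));
    let mut differences = BTreeMap::new();
    let mut overflow = Vec::new();
    for (ch, name) in extended {
        match free.next() {
            Some(slot) => {
                differences.insert(slot, name);
                byte_for_char.insert(ch, slot);
            }
            // Step 3: pool exhausted; the real planner emits MOS0032 here.
            None => {
                overflow.push(ch);
                byte_for_char.insert(ch, b'?');
            }
        }
    }
    Plan { differences, byte_for_char, overflow }
}
```

Iterating a BTreeSet keeps the assignment deterministic: the same set of chars always receives the same slots, regardless of the order in which runs were visited.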
§Output
DocEncoding carries everything the PDF emit code needs: the
/Differences pairs (slot → AFM glyph name) for the font dict,
byte_for_char for the content-stream encoder, and
to_unicode_entries for the /ToUnicode CMap so copy-paste keeps
working.
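To make the /ToUnicode part concrete, the following sketch renders (byte, char) entries as bfchar sections of a CMap. The function name bfchar_sections and the (u8, char) entry shape are assumptions; the actual type of to_unicode_entries is not shown on this page. The sketch splits output into sections of at most 100 entries, following the usual CMap convention, and encodes each char as UTF-16BE hex.

```rust
use std::fmt::Write;

/// Hypothetical helper: renders (byte, char) pairs as bfchar sections.
fn bfchar_sections(entries: &[(u8, char)]) -> String {
    let mut out = String::new();
    // CMap sections conventionally hold at most 100 entries each.
    for chunk in entries.chunks(100) {
        writeln!(out, "{} beginbfchar", chunk.len()).unwrap();
        for &(byte, ch) in chunk {
            write!(out, "<{:02X}> <", byte).unwrap();
            // The destination is UTF-16BE; chars outside the BMP become a
            // surrogate pair (two 16-bit units).
            let mut buf = [0u16; 2];
            for &unit in ch.encode_utf16(&mut buf).iter() {
                write!(out, "{:04X}", unit).unwrap();
            }
            out.push_str(">\n");
        }
        out.push_str("endbfchar\n");
    }
    out
}
```

A remapped slot such as 0x7F for ł would appear as the line <7F> <0142>, which is what lets a reader map the repurposed byte back to the right character on copy-paste.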
Structs§
- DocEncoding 🔒
- Per-font planning output. The PDF emit path consumes this once per document.
- EncodingPlanner 🔒
- Two-phase encoding planner: the caller streams every (face, ch) in through Self::observe, then calls Self::finalize to get one DocEncoding per Latin Core 14 face that participated.
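The two-phase shape described for EncodingPlanner might look roughly like the sketch below. Everything here is an assumption made for illustration: the Face enum, the Planner type, and the return type of finalize, which simply returns the collected chars per face instead of a full DocEncoding.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical face identifier; the real module presumably covers every
// Latin Core 14 face.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Face {
    Helvetica,
    TimesRoman,
}

// Hypothetical planner: phase one collects, phase two produces the output.
#[derive(Default)]
struct Planner {
    seen: BTreeMap<Face, BTreeSet<char>>,
}

impl Planner {
    /// Phase one: record that `ch` is drawn with `face`. Duplicates collapse.
    fn observe(&mut self, face: Face, ch: char) {
        self.seen.entry(face).or_default().insert(ch);
    }

    /// Phase two: one entry per face that participated, in a stable order.
    /// A real planner would compute a per-face encoding plan from each set.
    fn finalize(self) -> Vec<(Face, Vec<char>)> {
        self.seen
            .into_iter()
            .map(|(face, chars)| (face, chars.into_iter().collect()))
            .collect()
    }
}
```

Consuming self in finalize makes it impossible to observe more chars after the plan has been produced, which matches the once-per-document usage described for DocEncoding.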
Functions§
- allocation_order 🔒
- Preferred slot-allocation order: the six WinAnsi gap bytes first (predictable golden output for ≤ 6 extended glyphs), then 0xFF..=0x20 descending excluding those same six bytes. We deliberately skip 0x00..=0x1F: PDF readers tolerate control bytes in /Differences, but content streams that need a Str(...) literal can run afoul of \0/\r/\n escaping, and using high-byte slots first keeps short paragraphs from perturbing low-byte slots.
- plan_face 🔒
- Computes the encoding plan for a single face given the set of chars the document needs from it.