Subsetting: Latin characters always cause a MacRoman encoding CMAP table, but browsers expect Windows/Unicode CMAP table #111

fschutt · 2025-03-12T16:04:35Z

I'm debugging a simple test here in preparation for adding OS/2 support for browsers. With SubsetProfile::Web the browser no longer complains about a "missing OS/2" table, but it now complains about a broken cmap table. For Latin text the subsetter always uses MacRoman encoding (cmap subtable format 0, the old Apple format), which modern browsers don't support; they expect the cmap table to be in Windows/Unicode format (subtable format 4 or 12):

// tests/cff.rs

#[test]
fn test_subset_with_os2_table() {
    use allsorts::font::Font;
    use allsorts::tables::cmap::CmapSubtable;
    use base64::Engine as _; // brings `encode` into scope for BASE64_STANDARD
    use allsorts::tables::FontTableProvider;
    use std::fs::File;
    use std::io::Write;
    use std::collections::HashSet;

    // Test string to use for the font subset
    let test_string = "hello world";
    
    // Load the font
    let buffer = read_fixture("tests/fonts/opentype/Klei.otf");
    let opentype_file = ReadScope::new(&buffer).read::<OpenTypeFont<'_>>().unwrap();
    let provider = opentype_file.table_provider(0).unwrap();
    let provider2 = opentype_file.table_provider(0).unwrap();

    // Create a font instance to access cmap
    let font = Font::new(provider).unwrap();
    
    // Get the cmap subtable for unicode mapping
    let cmap_data = font.cmap_subtable_data();
    let cmap_subtable = ReadScope::new(cmap_data).read::<CmapSubtable>().unwrap();
    
    // Map characters to glyph IDs
    let mut glyph_ids = vec![0]; // Always include glyph 0 (.notdef)
    
    for c in test_string.chars() {
        if let Ok(Some(glyph_id)) = cmap_subtable.map_glyph(c as u32) {
            if !glyph_ids.contains(&glyph_id) {
                glyph_ids.push(glyph_id);
            }
        }
    }
    
    // Sort and deduplicate glyph IDs
    glyph_ids.sort();
    glyph_ids.dedup();
    
    println!("Using glyph IDs: {:?}", glyph_ids);
    
    // Subset the font
    let subset_buffer = subset(
        &provider2, 
        &glyph_ids, 
        &SubsetProfile::Full
    ).unwrap();
    
    // Validate that the OS/2 table is present in the subsetted font
    let subset_otf = ReadScope::new(&subset_buffer).read::<OpenTypeFont<'_>>().unwrap();
    let subset_provider = subset_otf.table_provider(0).unwrap();
    
    // Check that OS/2 table exists
    assert!(
        subset_provider.has_table(tag::OS_2), 
        "Subset font is missing the OS/2 table. Use Profile::Web for web compatibility."
    );
    
    // Compare tables in original and subset fonts
    let original_tables: HashSet<_> = opentype_file
        .table_provider(0)
        .unwrap()
        .table_tags()
        .unwrap_or_default()
        .into_iter()
        .collect();
        
    let subset_tables: HashSet<_> = subset_provider
        .table_tags()
        .unwrap_or_default()
        .into_iter()
        .collect();
        
    println!("Original font tables: {:?}", original_tables);
    println!("Subset font tables: {:?}", subset_tables);
    
    // Write both fonts to disk for manual inspection
    std::fs::write("./Klei.otf", &buffer).unwrap();
    std::fs::write("./Klei-Subset.otf", &subset_buffer).unwrap();
    // Output an HTML file with the test string using the subsetted font
    let base64_font = base64::prelude::BASE64_STANDARD.encode(&subset_buffer);
    let html = format!(r#"<!DOCTYPE html>
<html>
<head>
    <title>Font Subset Test</title>
    <style>
        @font-face {{
            font-family: 'SubsetFont';
            src: url('data:font/otf;base64,{}') format('opentype');
        }}
        .test-text {{
            font-family: 'SubsetFont', sans-serif;
            font-size: 24px;
        }}
        .fallback {{
            font-family: sans-serif;
            font-size: 24px;
        }}
    </style>
</head>
<body>
    <h1>Font Subset Test</h1>
    <p>The text below should display in the subsetted font:</p>
    <p class="test-text">{}</p>
    <p>This is fallback text:</p>
    <p class="fallback">{}</p>
</body>
</html>"#, base64_font, test_string, test_string);
    
    // Write the HTML to a file
    let output_path = "./subset_font_test.html";
    let mut file = File::create(output_path).unwrap();
    file.write_all(html.as_bytes()).unwrap();
    
    println!("Created {} - open in a browser to verify the font works", output_path);
}

In tables/cmap/subset.rs, I found this block:

impl owned::EncodingRecord {
    pub fn from_mappings(mappings: &MappingsToKeep<NewIds>) -> Result<Self, ParseError> {
        match mappings.plane() {
            CharExistence::MacRoman => {
                // The language field must be set to zero for all 'cmap' subtables whose platform
                // IDs are other than Macintosh (platform ID 1). For 'cmap' subtables whose
                // platform IDs are Macintosh, set this field to the Macintosh language ID of the
                // 'cmap' subtable plus one, or to zero if the 'cmap' subtable is not
                // language-specific. For example, a Mac OS Turkish 'cmap' subtable must set this
                // field to 18, since the Macintosh language ID for Turkish is 17. A Mac OS Roman
                // 'cmap' subtable must set this field to 0, since Mac OS Roman is not a
                // language-specific encoding.
                //
                // — https://docs.microsoft.com/en-us/typography/opentype/spec/cmap#use-of-the-language-field-in-cmap-subtables
                let mut glyph_id_array = [0; 256];
                for (ch, gid) in mappings.iter() {
                    println!("encoding with MacRoman {ch:?} {gid}"); // <------------------
                    let ch_mac = match ch {
                        // NOTE(unwrap): Safe as we verified all chars with `is_macroman` earlier
                        Character::Unicode(unicode) => {
                            usize::from(char_to_macroman(unicode).unwrap())
                        }
                        Character::Symbol(_) => unreachable!("symbol in mac roman"),
                    };
                    // Cast is safe as we determined that all chars are valid in Mac Roman
                    glyph_id_array[ch_mac] = gid as u8;
                }
                let sub_table = owned::CmapSubtable::Format0 {
                    language: 0,
                    glyph_id_array: Box::new(glyph_id_array),
                };
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::MACINTOSH,
                    encoding_id: EncodingId::MACINTOSH_APPLE_ROMAN,
                    sub_table,
                })
            }
            CharExistence::BasicMultilingualPlane => {
                let sub_table = cmap::owned::CmapSubtable::Format4(
                    owned::CmapSubtableFormat4::from_mappings(mappings)?,
                );
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::UNICODE,
                    encoding_id: EncodingId::UNICODE_BMP,
                    sub_table,
                })
            }
            CharExistence::AstralPlane => {
                let sub_table = cmap::owned::CmapSubtable::Format12(
                    owned::CmapSubtableFormat12::from_mappings(mappings),
                );
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::UNICODE,
                    encoding_id: EncodingId::UNICODE_FULL,
                    sub_table,
                })
            }
            CharExistence::DivinePlane => {
                let sub_table = cmap::owned::CmapSubtable::Format4(
                    owned::CmapSubtableFormat4::from_mappings(mappings)?,
                );
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::WINDOWS,
                    encoding_id: EncodingId::WINDOWS_SYMBOL,
                    sub_table,
                })
            }
        }
    }
}

and running it prints:

---- test_subset_with_os2_table stdout ----
Using glyph IDs: [0, 1, 69, 70, 73, 77, 80, 83, 88]
encoding with MacRoman Unicode(' ') 1
encoding with MacRoman Unicode('d') 2
encoding with MacRoman Unicode('e') 3
encoding with MacRoman Unicode('h') 4
encoding with MacRoman Unicode('l') 5
encoding with MacRoman Unicode('o') 6
encoding with MacRoman Unicode('r') 7
encoding with MacRoman Unicode('w') 8

Now, this is wrong: whenever all characters fit into MacRoman, it ALWAYS picks MacRoman encoding, which browsers don't support.

I can fix this and get a properly subsetted font without errors, if I merge the code paths for CharExistence::MacRoman and CharExistence::BasicMultilingualPlane.

CC @wezm - can this be merged, or do any tools you know of require MacRoman encoding? Chrome simply refuses to load fonts with a format 0 cmap subtable, but it works with format 4.
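For anyone who wants to check which cmap subtables a subsetted file actually contains, here is a minimal sketch that reads the encoding records straight from the font bytes. It does not use the allsorts API; it only follows the table directory and cmap header layout from the OpenType spec. The helper names (`be_u16`, `be_u32`, `cmap_encodings`) are made up for this example, and it assumes a single-font sfnt file (no TTC collections, no WOFF).

```rust
/// Read a big-endian u16 at `pos`, or None if out of bounds.
fn be_u16(data: &[u8], pos: usize) -> Option<u16> {
    let b = data.get(pos..pos.checked_add(2)?)?;
    Some(u16::from_be_bytes([b[0], b[1]]))
}

/// Read a big-endian u32 at `pos`, or None if out of bounds.
fn be_u32(data: &[u8], pos: usize) -> Option<u32> {
    let b = data.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Returns (platform_id, encoding_id, subtable_format) for every cmap
/// encoding record, or None if the data is truncated or has no cmap table.
fn cmap_encodings(font: &[u8]) -> Option<Vec<(u16, u16, u16)>> {
    // Offset table: sfntVersion (4 bytes), numTables (2 bytes), then 6 more bytes.
    let num_tables = usize::from(be_u16(font, 4)?);

    // Table records start at byte 12 and are 16 bytes each:
    // tag (4), checksum (4), offset (4), length (4).
    let cmap_offset = (0..num_tables).find_map(|i| {
        let rec = 12 + i * 16;
        if font.get(rec..rec + 4)? == &b"cmap"[..] {
            be_u32(font, rec + 8).map(|offset| offset as usize)
        } else {
            None
        }
    })?;

    // cmap header: version (2), numTables (2), then 8-byte encoding records:
    // platformID (2), encodingID (2), subtable offset (4, relative to cmap start).
    let num_records = usize::from(be_u16(font, cmap_offset + 2)?);
    (0..num_records)
        .map(|i| {
            let rec = cmap_offset + 4 + i * 8;
            let platform_id = be_u16(font, rec)?;
            let encoding_id = be_u16(font, rec + 2)?;
            let subtable_offset = be_u32(font, rec + 4)? as usize;
            // Every cmap subtable begins with its 16-bit format number.
            let format = be_u16(font, cmap_offset + subtable_offset)?;
            Some((platform_id, encoding_id, format))
        })
        .collect()
}
```

Called as `cmap_encodings(&subset_buffer)`, a record of (1, 0, 0) would mean Macintosh / Roman / format 0, while (0, 3, 4) would mean Unicode BMP / format 4.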

wezm · 2025-03-12T23:23:57Z

We use/prefer MacRoman wherever possible in Prince, and thus when embedding fonts in PDFs, so I don't think it can be merged with CharExistence::BasicMultilingualPlane. The reason is that if the font can use MacRoman, the text in the PDF can use an 8-bit encoding. If the font is embedded with one of the Unicode cmaps then the text in the PDF requires 16 bits per character, which can significantly increase the size of a PDF with a lot of text.
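As a rough illustration of the size difference, here is a back-of-the-envelope sketch. It assumes exactly one byte per character for the 8-bit case and two bytes per character for the 16-bit case, and ignores string escaping and stream compression, so real PDFs will differ. The function name `encoded_len` is made up for this example.

```rust
/// Approximate number of bytes needed to store `text` in a PDF when each
/// character takes `bytes_per_char` bytes (1 for an 8-bit encoding,
/// 2 for a 16-bit one). Escaping and compression are deliberately ignored.
fn encoded_len(text: &str, bytes_per_char: usize) -> usize {
    text.chars().count() * bytes_per_char
}
```

For the test string "hello world" (11 characters) this gives 11 bytes with an 8-bit encoding and 22 bytes with a 16-bit one, so the uncompressed text roughly doubles.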

We could use the SubsetProfile enum you introduced in the PR to drive whether MacRoman is used or not. It would also be good to include a link to the code in Chrome or the Chrome font sanitizer that rejects fonts with only a MacRoman cmap, so that we can point to that in the code.
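A minimal sketch of what profile-driven selection could look like. The enums below are local stand-ins that only mirror the names used in this thread, not the real allsorts types, and `CmapChoice` and `choose_cmap` are hypothetical. It assumes that SubsetProfile::Web should avoid MacRoman and that SubsetProfile::Full keeps the current behaviour; which profile keeps MacRoman is still to be decided.

```rust
// Stand-in for the character classification used in tables/cmap/subset.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharExistence {
    MacRoman,
    BasicMultilingualPlane,
    AstralPlane,
    DivinePlane,
}

// Stand-in for the SubsetProfile enum from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SubsetProfile {
    Full,
    Web,
}

// Hypothetical summary of which encoding record would be emitted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CmapChoice {
    MacRomanFormat0,
    UnicodeBmpFormat4,
    UnicodeFullFormat12,
    WindowsSymbolFormat4,
}

/// Pick the cmap encoding from the characters present and the profile.
fn choose_cmap(plane: CharExistence, profile: SubsetProfile) -> CmapChoice {
    match (plane, profile) {
        // Assumption: browsers need a Unicode cmap, so Web never emits format 0.
        (CharExistence::MacRoman, SubsetProfile::Web) => CmapChoice::UnicodeBmpFormat4,
        // Assumption: Full keeps the compact 8-bit encoding preferred for PDFs.
        (CharExistence::MacRoman, SubsetProfile::Full) => CmapChoice::MacRomanFormat0,
        // The remaining cases are the same as in the existing code.
        (CharExistence::BasicMultilingualPlane, _) => CmapChoice::UnicodeBmpFormat4,
        (CharExistence::AstralPlane, _) => CmapChoice::UnicodeFullFormat12,
        (CharExistence::DivinePlane, _) => CmapChoice::WindowsSymbolFormat4,
    }
}
```

In the real code the profile would have to be passed down to owned::EncodingRecord::from_mappings (or the decision made before calling it); how best to thread it through is left open here.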

fschutt added a commit to fschutt/allsorts that referenced this issue Mar 12, 2025

Fix cmap table not working in browsers

9617ca3

Fixes yeslogic#111

fschutt mentioned this issue Mar 12, 2025

Add SubsetProfile to subsetting function to add OS/2 table for browser #112

Open
