Subsetting: Latin characters always cause a MacRoman encoding CMAP table, but browsers expect Windows/Unicode CMAP table #111

Open
fschutt opened this issue Mar 12, 2025 · 1 comment

Comments

@fschutt
Contributor

fschutt commented Mar 12, 2025

I'm debugging a simple test in preparation for adding OS/2 support for browsers. With SubsetProfile::Web the browser no longer complains about a "missing OS/2" table, but it now complains about a broken cmap table. For Latin-only text the subsetter always uses MacRoman encoding (cmap subtable format 0, the old Apple format), which modern browsers don't support; they expect the cmap table in a Windows/Unicode format (format 4 or 12):

// tests/cff.rs

#[test]
fn test_subset_with_os2_table() {
    use allsorts::font::Font;
    use allsorts::tables::cmap::CmapSubtable;
    use base64::Engine as _; // brings `encode` into scope for BASE64_STANDARD below
    use allsorts::tables::FontTableProvider;
    use std::fs::File;
    use std::io::Write;
    use std::collections::HashSet;

    // Test string to use for the font subset
    let test_string = "hello world";
    
    // Load the font
    let buffer = read_fixture("tests/fonts/opentype/Klei.otf");
    let opentype_file = ReadScope::new(&buffer).read::<OpenTypeFont<'_>>().unwrap();
    let provider = opentype_file.table_provider(0).unwrap();
    let provider2 = opentype_file.table_provider(0).unwrap();

    // Create a font instance to access cmap
    let font = Font::new(provider).unwrap();
    
    // Get the cmap subtable for unicode mapping
    let cmap_data = font.cmap_subtable_data();
    let cmap_subtable = ReadScope::new(cmap_data).read::<CmapSubtable>().unwrap();
    
    // Map characters to glyph IDs
    let mut glyph_ids = vec![0]; // Always include glyph 0 (.notdef)
    
    for c in test_string.chars() {
        if let Ok(Some(glyph_id)) = cmap_subtable.map_glyph(c as u32) {
            if !glyph_ids.contains(&glyph_id) {
                glyph_ids.push(glyph_id);
            }
        }
    }
    
    // Sort and deduplicate glyph IDs
    glyph_ids.sort();
    glyph_ids.dedup();
    
    println!("Using glyph IDs: {:?}", glyph_ids);
    
    // Subset the font
    let subset_buffer = subset(
        &provider2, 
        &glyph_ids, 
        &SubsetProfile::Full
    ).unwrap();
    
    // Validate that the OS/2 table is present in the subsetted font
    let subset_otf = ReadScope::new(&subset_buffer).read::<OpenTypeFont<'_>>().unwrap();
    let subset_provider = subset_otf.table_provider(0).unwrap();
    
    // Check that OS/2 table exists
    assert!(
        subset_provider.has_table(tag::OS_2), 
        "Subset font is missing the OS/2 table. Use Profile::Web for web compatibility."
    );
    
    // Compare tables in original and subset fonts
    let original_tables: HashSet<_> = opentype_file
        .table_provider(0)
        .unwrap()
        .table_tags()
        .unwrap_or_default()
        .into_iter()
        .collect();
        
    let subset_tables: HashSet<_> = subset_provider
        .table_tags()
        .unwrap_or_default()
        .into_iter()
        .collect();
        
    println!("Original font tables: {:?}", original_tables);
    println!("Subset font tables: {:?}", subset_tables);
    
    // Write both fonts to disk so they can be inspected with external tools
    std::fs::write("./Klei.otf", &buffer).unwrap();
    std::fs::write("./Klei-Subset.otf", &subset_buffer).unwrap();
    // Output an HTML file with the test string using the subsetted font
    let base64_font = base64::prelude::BASE64_STANDARD.encode(&subset_buffer);
    let html = format!(r#"<!DOCTYPE html>
<html>
<head>
    <title>Font Subset Test</title>
    <style>
        @font-face {{
            font-family: 'SubsetFont';
            src: url('data:font/otf;base64,{}') format('opentype');
        }}
        .test-text {{
            font-family: 'SubsetFont', sans-serif;
            font-size: 24px;
        }}
        .fallback {{
            font-family: sans-serif;
            font-size: 24px;
        }}
    </style>
</head>
<body>
    <h1>Font Subset Test</h1>
    <p>The text below should display in the subsetted font:</p>
    <p class="test-text">{}</p>
    <p>This is fallback text:</p>
    <p class="fallback">{}</p>
</body>
</html>"#, base64_font, test_string, test_string);
    
    // Write the HTML to a file
    let output_path = "./subset_font_test.html";
    let mut file = File::create(output_path).unwrap();
    file.write_all(html.as_bytes()).unwrap();
    
    println!("Created {} - open in a browser to verify the font works", output_path);
}

In tables/cmap/subset.rs, I found this block:

impl owned::EncodingRecord {
    pub fn from_mappings(mappings: &MappingsToKeep<NewIds>) -> Result<Self, ParseError> {
        match mappings.plane() {
            CharExistence::MacRoman => {
                // The language field must be set to zero for all 'cmap' subtables whose platform
                // IDs are other than Macintosh (platform ID 1). For 'cmap' subtables whose
                // platform IDs are Macintosh, set this field to the Macintosh language ID of the
                // 'cmap' subtable plus one, or to zero if the 'cmap' subtable is not
                // language-specific. For example, a Mac OS Turkish 'cmap' subtable must set this
                // field to 18, since the Macintosh language ID for Turkish is 17. A Mac OS Roman
                // 'cmap' subtable must set this field to 0, since Mac OS Roman is not a
                // language-specific encoding.
                //
                // — https://docs.microsoft.com/en-us/typography/opentype/spec/cmap#use-of-the-language-field-in-cmap-subtables
                let mut glyph_id_array = [0; 256];
                for (ch, gid) in mappings.iter() {
                    println!("encoding with MacRoman {ch:?} {gid}"); // <------------------
                    let ch_mac = match ch {
                        // NOTE(unwrap): Safe as we verified all chars with `is_macroman` earlier
                        Character::Unicode(unicode) => {
                            usize::from(char_to_macroman(unicode).unwrap())
                        }
                        Character::Symbol(_) => unreachable!("symbol in mac roman"),
                    };
                    // Cast is safe as we determined that all chars are valid in Mac Roman
                    glyph_id_array[ch_mac] = gid as u8;
                }
                let sub_table = owned::CmapSubtable::Format0 {
                    language: 0,
                    glyph_id_array: Box::new(glyph_id_array),
                };
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::MACINTOSH,
                    encoding_id: EncodingId::MACINTOSH_APPLE_ROMAN,
                    sub_table,
                })
            }
            CharExistence::BasicMultilingualPlane => {
                let sub_table = cmap::owned::CmapSubtable::Format4(
                    owned::CmapSubtableFormat4::from_mappings(mappings)?,
                );
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::UNICODE,
                    encoding_id: EncodingId::UNICODE_BMP,
                    sub_table,
                })
            }
            CharExistence::AstralPlane => {
                let sub_table = cmap::owned::CmapSubtable::Format12(
                    owned::CmapSubtableFormat12::from_mappings(mappings),
                );
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::UNICODE,
                    encoding_id: EncodingId::UNICODE_FULL,
                    sub_table,
                })
            }
            CharExistence::DivinePlane => {
                let sub_table = cmap::owned::CmapSubtable::Format4(
                    owned::CmapSubtableFormat4::from_mappings(mappings)?,
                );
                Ok(owned::EncodingRecord {
                    platform_id: PlatformId::WINDOWS,
                    encoding_id: EncodingId::WINDOWS_SYMBOL,
                    sub_table,
                })
            }
        }
    }
}

and running it prints:

---- test_subset_with_os2_table stdout ----
Using glyph IDs: [0, 1, 69, 70, 73, 77, 80, 83, 88]
encoding with MacRoman Unicode(' ') 1
encoding with MacRoman Unicode('d') 2
encoding with MacRoman Unicode('e') 3
encoding with MacRoman Unicode('h') 4
encoding with MacRoman Unicode('l') 5
encoding with MacRoman Unicode('o') 6
encoding with MacRoman Unicode('r') 7
encoding with MacRoman Unicode('w') 8

Now, this is wrong for web use: whenever every character in the subset is representable in MacRoman, the subsetter picks the MacRoman encoding, which browsers don't support.
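One way to confirm which encoding a subset ended up with is to read the cmap header directly. The following is a minimal sketch and not part of allsorts: `CmapRecord`, `list_cmap_records` and `has_only_mac_roman_format0` are hypothetical names, and it assumes the caller has already extracted the raw bytes of the cmap table. It follows the OpenType cmap header layout (u16 version, u16 numTables, then 8-byte encoding records holding platform ID, encoding ID and a 32-bit subtable offset, with every subtable starting with its u16 format).

```rust
/// One encoding record from a 'cmap' header, plus the format of the
/// subtable it points at. (Hypothetical helper type, not an allsorts type.)
#[derive(Debug, PartialEq, Eq)]
pub struct CmapRecord {
    pub platform_id: u16,
    pub encoding_id: u16,
    pub format: u16,
}

/// Reads a big-endian u16 at `offset`, or `None` if out of bounds.
fn read_u16(data: &[u8], offset: usize) -> Option<u16> {
    let bytes = data.get(offset..offset.checked_add(2)?)?;
    Some(u16::from_be_bytes([bytes[0], bytes[1]]))
}

/// Reads a big-endian u32 at `offset`, or `None` if out of bounds.
fn read_u32(data: &[u8], offset: usize) -> Option<u32> {
    let bytes = data.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Lists the encoding records of a raw 'cmap' table.
/// Returns `None` if the data is truncated.
pub fn list_cmap_records(cmap: &[u8]) -> Option<Vec<CmapRecord>> {
    // Bytes 0..2 hold the table version; numTables follows at offset 2.
    let num_tables = usize::from(read_u16(cmap, 2)?);
    let mut records = Vec::with_capacity(num_tables);
    for i in 0..num_tables {
        let base = 4 + i * 8; // each encoding record is 8 bytes long
        let platform_id = read_u16(cmap, base)?;
        let encoding_id = read_u16(cmap, base + 2)?;
        let subtable_offset = read_u32(cmap, base + 4)? as usize;
        // Every cmap subtable starts with its u16 format number.
        let format = read_u16(cmap, subtable_offset)?;
        records.push(CmapRecord { platform_id, encoding_id, format });
    }
    Some(records)
}

/// True if the table has at least one record and every record is
/// Macintosh (platform 1) / Roman (encoding 0) with a format 0 subtable.
pub fn has_only_mac_roman_format0(records: &[CmapRecord]) -> bool {
    !records.is_empty()
        && records
            .iter()
            .all(|r| r.platform_id == 1 && r.encoding_id == 0 && r.format == 0)
}
```

If `has_only_mac_roman_format0` returns true for the subset's cmap, the font matches the case described in this issue.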

I can fix this and get a properly subsetted font without errors, if I merge the code paths for CharExistence::MacRoman and CharExistence::BasicMultilingualPlane.

CC @wezm - can this be merged or do any tools you know of require MacRoman encoding? Chrome simply refuses to load fonts whose only cmap subtable is format 0, but it works with format 4.

@wezm
Contributor

wezm commented Mar 12, 2025

We use/prefer MacRoman wherever possible in Prince, and thus when embedding fonts in PDFs, so I don't think it can be merged with CharExistence::BasicMultilingualPlane. The reason is that if we can use MacRoman, text in the PDF can use an 8-bit encoding. If the font is embedded with one of the Unicode cmaps then the text in the PDF requires 16 bits per character, which can significantly increase the size of a PDF with a lot of text.
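To make the size difference concrete, here is a rough back-of-the-envelope sketch. `encoded_len` is a hypothetical helper, and the calculation deliberately ignores stream compression and any other PDF overhead; it only counts the raw bytes for the character codes themselves.

```rust
/// Bytes needed to store `text` when every character is written
/// with a fixed-width code of `bytes_per_char` bytes.
pub fn encoded_len(text: &str, bytes_per_char: usize) -> usize {
    text.chars().count() * bytes_per_char
}

/// Extra bytes a 16-bit encoding needs over an 8-bit one for `text`.
pub fn extra_bytes_for_16_bit(text: &str) -> usize {
    encoded_len(text, 2) - encoded_len(text, 1)
}
```

For the test string "hello world" this gives 11 bytes with an 8-bit encoding and 22 bytes with a 16-bit one, so the raw text data doubles before compression.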

We could use the SubsetProfile enum you introduced in the PR to drive whether MacRoman is used or not. It would also be good to include a link to the code in Chrome or the Chrome font sanitizer that rejects fonts with only a MacRoman cmap, so that we can point to that in the code.
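As a rough illustration of that idea, the sketch below picks the cmap encoding from both the character plane and the profile. It is self-contained and uses stand-in types: `Plane`, `Profile`, `EncodingChoice` and `choose_encoding` are hypothetical and only mirror the shape of CharExistence and SubsetProfile from this thread. Which profile keeps MacRoman is an assumption made for the example (here Full keeps it and Web falls through to the BMP path). The numeric platform and encoding IDs are the ones defined in the OpenType cmap specification.

```rust
/// Stand-in for `CharExistence` (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plane {
    MacRoman,
    BasicMultilingualPlane,
    AstralPlane,
    DivinePlane,
}

/// Stand-in for `SubsetProfile` (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Profile {
    Full,
    Web,
}

/// The platform ID, encoding ID and subtable format to emit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EncodingChoice {
    pub platform_id: u16,
    pub encoding_id: u16,
    pub format: u16,
}

pub fn choose_encoding(plane: Plane, profile: Profile) -> EncodingChoice {
    match (plane, profile) {
        // Assumption: only the non-web profile keeps the compact 8-bit
        // Macintosh (1) / Roman (0) format 0 subtable.
        (Plane::MacRoman, Profile::Full) => EncodingChoice {
            platform_id: 1,
            encoding_id: 0,
            format: 0,
        },
        // For the web profile, MacRoman-only text takes the same path as
        // BMP text: Unicode (0) / BMP (3) with a format 4 subtable.
        (Plane::MacRoman, Profile::Web) | (Plane::BasicMultilingualPlane, _) => EncodingChoice {
            platform_id: 0,
            encoding_id: 3,
            format: 4,
        },
        // Unicode (0) / full repertoire (4) with a format 12 subtable.
        (Plane::AstralPlane, _) => EncodingChoice {
            platform_id: 0,
            encoding_id: 4,
            format: 12,
        },
        // Windows (3) / Symbol (0) with a format 4 subtable.
        (Plane::DivinePlane, _) => EncodingChoice {
            platform_id: 3,
            encoding_id: 0,
            format: 4,
        },
    }
}
```

With this shape, PDF embedding keeps the 8-bit encoding by default, while a web subset of "hello world" would get a format 4 subtable instead of format 0.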
