Source format "diff-ability" should be an explicit goal #114
Comments
@alerque thank you for bringing this up: do you have any concrete improvements that could be discussed? |
@benkiel I remember a couple months ago I was looking for moldy vegetables to throw at alien ships after concluding the spec just wasn't designed with any thought to diff-ability, but I don't remember the specifics. I'll dig out that bit of project and try to make a list. The original point of this issue, though, is less that X, Y, or Z is a show-stopper; this is more an instance of death by a thousand paper cuts. No one issue is a make-or-break problem that needs to be fixed, but because this isn't an explicit goal, people have not been weighing other spec decisions with their effect on diff-ability in mind, and a lot of small factors have compounded. Making it an explicit design goal would both keep it from getting worse as new features are added and provide a reason to rehash problematic bits that might otherwise be too small to merit changing. |
@alerque Apologies, I didn't phrase the above very well; I agree with the idea, but there are a bunch of things above to unpack. Some the UFO spec doesn't have control over (tooling for version control, workflow setups) and some that it does. A couple of concrete examples of UFO-specific sticking points would help everyone understand what the impact on the specification might be. One that comes to mind for me is the contents.plist: it's a pain point for merge conflicts, and could possibly be made optional. Having a recommendation for normalization would be another: enforcing a normalization in the specification would be a big issue for tool makers, as they may be using libraries that don't allow for certain formatting, but having a standard for normalization would be possible. I'm sure there are others. |
It's a problem with the UFO format that we know very well at Dalton Maag, as internally all of our font work is stored on Git as designspace + UFOs, and designers collaborate through the Git process of branches and merge requests to work in parallel on the same project, for example one person refining the numerals and another adding accented glyphs to the same UFO. I would love to see a 4th point being added to UFO philosophy that prescribes that the format should be VCS-friendly. Here are our most often encountered problems:
Whitespace diffs
It's not a big issue from a functionality point of view, but it contributes to the "thousand paper cuts". Our current solution is to run
Ideally it would be great if all the tools that work with UFOs could output the same whitespace. I wouldn't mind the spec being prescriptive on this point, in the same way that for example the Python language has a style guide that recommends 4 spaces.
Fontinfo fields that can either be not specified or specified as empty
In the same vein of normalization issues, some fields in
This one is not a big deal either but it generates meaningless diffs when switching between tools with different conventions. We don't really have a workaround. The most zealous designers will take care to not commit those diffs to Git when they're indeed meaningless, but that's wasting their time on annoying details. Maybe the spec could enforce that fields all need to be specified all the time? I don't know if it's worth it or if other people have opinions about this.
|
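For what it's worth, here is a minimal sketch of the kind of fontinfo normalization being discussed. It rests on an assumption that is not in the spec: that an empty string, list, or dict means the same thing as the key being absent. `normalize_fontinfo` is a hypothetical helper built on Python's standard `plistlib`; it is not what ufoNormalizer or any other existing tool does.

```python
import plistlib


def normalize_fontinfo(data: bytes) -> bytes:
    """Return a canonical serialization of a fontinfo.plist payload.

    ASSUMPTION (not from the UFO spec): empty strings, lists and dicts
    are treated as equivalent to the key being absent, so they are dropped.
    """
    info = plistlib.loads(data)
    # Zero and False are kept: they compare unequal to "", [] and {}.
    cleaned = {key: value for key, value in info.items()
               if value not in ("", [], {})}
    # sort_keys=True makes key order independent of the writing tool;
    # plistlib also emits one fixed indentation style.
    return plistlib.dumps(cleaned, sort_keys=True)
```

Two tools that disagree only on key order and on whether to write empty fields would then produce byte-identical files after this step, so Git would show no diff between them.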
@belluzj This is fantastically helpful. Thank you so much for all the detail and examples! There are some great points raised in all of these comments.
With regards to whitespace, attribute ordering, etc.: the XML formatting has been very hard to specify because the various XML writers I've used and studied offer inconsistent or no controls over the formatting of their output. For example, Python's ElementTree didn't even do new lines or indentation and ordered attributes alphabetically. Because of this, I didn't feel comfortable specifying "
I like the idea of having a specification-level recommendation about formatting. That makes compliance optional and it becomes a conversation for a tool developer and their customers instead of a hard-to-implement requirement. We could do this by documenting what ufoNormalizer does (which, even though I tried to be logical, I'll admit that I made up as I went along) and use that as a starting point.
That said… Does anyone know of an XML-based specification that has specific formatting rules? I'd be curious to see how they phrase it.
I got about halfway through your first example and started wondering about this, too. contents.plist was/is an optimization. The reason for its existence is that it's very expensive to load
I don't know what we can do about this. (see idea below)
I have an idea on this:
This would make the mark colors only visible to the designer with
Yeah. That's why it's optional in ufoNormalizer. 😄
Since so much of this is out of our hands due to XML writer limitations, Git not knowing about fonts, etc., I wonder if a big part of the solution would be to build a UFO-specific front end or prep step for Git. I have my own problems with using Git with UFOs and have been working on a solution for myself. As part of that, I did a lot of thinking about diffing UFOs and built a tool to do it with knowledge of what the contents of each file mean rather than just being raw text. A comprehensive merging, resolution, etc. tool would be a big project, but if there is enough interest perhaps resources could be pooled to build it. |
One thought would be to make contents.plist optional, not required. Agree about making a recommendation for formatting, but not an enforced standard: the enforced standard is a burden to developers who might not be able to control formatting. At least a recommendation would give normalizers a standard to follow (the ufoNormalizer does some things differently from the SIL normalizer; this would solve that). |
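To illustrate what "knowledge of what the contents mean" could buy, here is a rough sketch of a three-way merge done on a parsed key/value mapping (for example the glyph name to file name dict from contents.plist) instead of on text lines. `merge_mappings` is a hypothetical helper and is not the tool mentioned above; it only shows that two people adding different keys need not conflict.

```python
_MISSING = object()  # sentinel for "key not present on this side"


def merge_mappings(base, ours, theirs):
    """Three-way merge of flat dicts; returns (merged, conflicting_keys)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:
            value = o          # both sides agree (or both deleted the key)
        elif o == b:
            value = t          # only "theirs" changed this key
        elif t == b:
            value = o          # only "ours" changed this key
        else:
            conflicts.append(key)
            value = o          # provisional choice; needs a human decision
        if value is not _MISSING:
            merged[key] = value
    return merged, conflicts
```

With this approach, one designer adding `one` and another adding `Aacute` merge cleanly even though a line-based merge of the same plist may report a conflict when both insertions land next to each other.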
About Git not knowing about fonts, yes that's really the crux of the problem. I see basically 3 axes to look at this:
|
Another thought (re contents.plist) would be to come up with a glyph name to file name scheme that is reversible: that you can get the glyph name from the file name, at least in most common cases. When that's truly not possible, e.g. for very long glyph names, parsing the glif xml may still be necessary. |
Regarding mark colors: it's really the only tool users have to group glyphs. The real solution will be to devise a better and higher level way to do that. (I cringe every time I see a proud screenshot with happy crayon colored glyph boxes... We should do better.) |
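As a thought experiment, a reversible scheme could look something like the sketch below. This is a made-up encoding for illustration, not the file name algorithm in the UFO spec: it keeps the familiar convention of appending `_` after uppercase letters, and (my assumption) escapes problem characters as `%` plus four hex digits so nothing is lost. The names `ILLEGAL`, `glyph_name_to_file_name` and `file_name_to_glyph_name` are hypothetical. Very long names and reserved file names are deliberately not handled.

```python
# Characters assumed unsafe in file names for this sketch, plus "%" itself
# because it is used as the escape marker.
ILLEGAL = set('"*+/:<>?[\\]|%') | {chr(c) for c in range(0x20)} | {chr(0x7F)}


def glyph_name_to_file_name(name, suffix=".glif"):
    parts = []
    for index, ch in enumerate(name):
        if ch in ILLEGAL or (ch == "." and index == 0):
            parts.append("%{:04X}".format(ord(ch)))  # lossless escape
        elif ch != ch.lower():
            parts.append(ch + "_")  # keeps "A" and "a" distinct on case-insensitive disks
        else:
            parts.append(ch)
    return "".join(parts) + suffix


def file_name_to_glyph_name(file_name, suffix=".glif"):
    stem = file_name[:-len(suffix)] if file_name.endswith(suffix) else file_name
    parts, i = [], 0
    while i < len(stem):
        ch = stem[i]
        if ch == "%":
            parts.append(chr(int(stem[i + 1:i + 5], 16)))
            i += 5
        elif ch != ch.lower():
            parts.append(ch)
            i += 2  # skip the "_" that always follows an uppercase letter
        else:
            parts.append(ch)
            i += 1
    return "".join(parts)
```

Because every uppercase letter is always followed by a marker underscore and every escape has a fixed width, decoding is unambiguous even for names that contain literal underscores.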
It's potentially more than contents.plist though. Any file that stores data about the glyphs outside of GLIF would suffer the same problem, including the cmap file that we want to add. I wonder if Ben's idea of making these files optional could work. If the file is there, use it. If not, quickly skim the XML. We'd have the data duplication problems we're frustrated with now with |
@justvanrossum I agree very much with the idea of a reversible file name mapping! It would also solve the problem that file names for If this is not an agenda item already, it should become one I think. |
How would it solve the cmap issue? |
If I understand correctly the CMAP issue, it's a file outside of GLIF that has a list of mappings "unicode -> glyph name" and the problem is: how do you know which GLIF file corresponds to each unicode? If I understood correctly, you would call that reversible mapping function on the glyph name, and you would get the matching GLIF file name. Is that right? |
I think the problem Tal means is that: if conflicts in contents.plist are frequent, then similar conflicts will happen in the separate cmap file as well. Which is a good point to consider, so we don't fix a problem in one place and create one in another. |
Oh ok, yes indeed. Thanks for explaining! cmap proposal here |
Yes and no, respectively:
|
Plus: • It's weird that a glyph named A on layer "foo" can have a different unicode than a glyph with the same name on layer "bar". |
Note that my first argument is a bit of a slippery slope: to set text properly, you need features; to build features, you need anchor data. FontGoggles has some fast-path glif parsing code that only extracts unicode and anchor data for this purpose. |
Yeah. So, doing my spec-writer-devil's-advocate job, I'll ask: Are these optimizations being added to solve tool limitations (albeit limitations that many tools face)? We try to avoid that. In this case we're presented with two conflicting tool needs:
This, to me, is the crux of the problem. Both are valid. I don't want a script editing a UFO to have to skim 65000 glyphs. I also feel the pain in the stories about contents.plist getting smashed by Git. So, I don't know what to do other than making these files optional. Hm. |
UFO is a source format, not an optimized output format. I think the needs of easily working with the source should be prioritized. Of course some tooling that works with the source will also want fast optimized access to things, but those mappings (such as In the specific case of interacting with VCS, files such as lookup maps that are derivable and only needed for output optimizations should not even be tracked in VCS. Since there could be a split between essential source data that must be tracked and shared and "intermediate" data formats that ease access for tooling, the latter could be regenerated when needed without making life harder on the source control end of things. The opposite should be true for generated binary output formats. These should go out of their way to be consumed by shapers in the most efficient way possible and have most optimizations up-front as it were. These generated output formats of course should not be tracked in VCS. |
Making |
I'm not up on the CMAP proposal so can't speak to it specifically, but specific to this proposal: if diff-ability were an explicit priority, one design decision it would dictate is a separation between true original source material and derived data structures. Any content that can be regenerated on demand (even if it's inefficient to do so) should be stored in different files than data that cannot be regenerated (even if it should stay in sync to be useful). It doesn't mean you can't have data that can be regenerated kicking around on disk, but you need to be able to cleanly separate it from other data. |
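As a concrete instance of "can be regenerated on demand", here is a small sketch that rebuilds the glyph name to file name mapping by skimming each `.glif` file and stopping at the root `<glyph>` element's `name` attribute. `read_glyph_name` and `build_contents` are hypothetical names, and this is not how ufoLib or any other library does it; it only uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_glyph_name(glif_path):
    """Return the name attribute of the root element without building the full tree."""
    with open(glif_path, "rb") as f:
        for _event, element in ET.iterparse(f, events=("start",)):
            # The first "start" event is the root <glyph> element.
            return element.get("name")
    return None


def build_contents(glyphs_dir):
    """Derive the contents mapping from the .glif files themselves."""
    contents = {}
    for path in sorted(Path(glyphs_dir).glob("*.glif")):
        name = read_glyph_name(path)
        if name is not None:
            contents[name] = path.name
    return contents
```

A mapping produced this way could live in an untracked cache file, so it never takes part in a merge.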
I'd like to add two more points about this:
Could this be avoided by making the current file name generation algorithm responsible for mapping back and forth?
I suppose we can run a small performance test of reading the contents of a directory of 10000 glyphs vs. loading the contents.plist file. The list can then be cached in
I'd like to point out Norad again. The original author tried loading a 60k+ glyph CJK UFO on his relatively recent MacBook after he implemented parallel loading and said loading that takes ~2-3.5 seconds (linebender/norad#50). This is also why Norad currently does not implement any lazy loading, it doesn't pay off. I want to look into making a ufoLib-compatible frontend for Norad as part of work, after I'm done with other more urgent matters. |
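The small performance test suggested above could be sketched roughly like this. Everything here is hypothetical scaffolding (`make_fixture`, `list_glyph_files`, `load_contents`, `time_call`), the fixture glyphs are synthetic one-liners, and real numbers will depend heavily on disk, OS, and file system caching, so treat any result as indicative only.

```python
import os
import plistlib
import time


def make_fixture(glyphs_dir, count):
    """Write `count` tiny synthetic .glif files plus a matching contents.plist."""
    contents = {}
    for i in range(count):
        name = "glyph%05d" % i
        file_name = name + ".glif"
        with open(os.path.join(glyphs_dir, file_name), "w", encoding="utf-8") as f:
            f.write('<glyph name="%s" format="2"/>\n' % name)
        contents[name] = file_name
    with open(os.path.join(glyphs_dir, "contents.plist"), "wb") as f:
        plistlib.dump(contents, f)


def list_glyph_files(glyphs_dir):
    """Approach 1: read the directory listing only."""
    with os.scandir(glyphs_dir) as entries:
        return sorted(e.name for e in entries if e.name.endswith(".glif"))


def load_contents(glyphs_dir):
    """Approach 2: parse contents.plist."""
    with open(os.path.join(glyphs_dir, "contents.plist"), "rb") as f:
        return plistlib.load(f)


def time_call(func, *args, repeat=5):
    """Return (best wall-clock seconds over `repeat` runs, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result
```

To try it, call `make_fixture(some_empty_dir, 10000)` and then compare `time_call(list_glyph_files, some_empty_dir)[0]` with `time_call(load_contents, some_empty_dir)[0]`.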
FWIW, I strongly disagree with the idea of making all fontinfo.plist fields mandatory. It would break many, many things. For example, if I'm making a font that will have CFF outlines, there is no need for me to set the TTF specific values. That's unnecessary overhead. Another example, in many cases the naming data and other identifying fields can be automatically built from the
Yes. No
I agree that what ufoNormalizer does needs to be documented. I think it's fine for the behavior to be defined there instead of ufoLib since there are more UFO writers than ufoLib and the point of the normalizer is to abstract away the differences between writers. |
(Just in case this is directed at "participating UFOs have the same interpolating fontinfo fields explicitly defined": to be clear, the check will only complain if e.g. one UFO has |
Possibly naive question: does "diff-ability" mean something different from "patch-ability"? For example, I feel that the diff-ability of contents.plist is fine, but since it's prone to merge conflicts, its patch-ability is rather poor. Or is the term diff-ability meant to cover that, too? |
@justvanrossum I intended the broad sense covering all the related issues of diffing, patching, merging, etc. Perhaps this could be re-worded "play nice with VCS". An actual literal diff is probably the easy part; even binary formats can be diffed. Patching isn't really any different, if you can generate a diff that is a patch. The hard part is patch conflict resolution. For that to work effectively you need to keep your diffs as noise-free as possible. Hence it all starts with changes being represented in consistent, clear, and minimal diffs. |
Got it, thanks for the elaboration. I'm happy to use "diff-ability" with your intended meaning. |
Besides making The lib public.glyphOrder tends to have fewer conflicts when several designers are working at the same time than an alphabetical glyph-keyed list like |
The glyph order is such a trivial detail of the final font (that is often used as a bit of UI state, too) that I would be quite uncomfortable to use it in places like that at all. It also implies that changes in glyph order will be reflected in more than one place. |
Having data structures that change in multiple places for a single discrete change is exactly the kind of thing we need to avoid in order to play nice with VCS. A source format with good diff-ability will have zero such cases. You can certainly make use of them in apps in the form of cached values, but there should be a clear dividing line between canonical source data (stored in VCS) and derived data structures (whether in memory or on disk) that store the same data in a different form. Obviously there are limits to this: if you have data that references other objects you have to have some kind of key. If whatever that key is changes, then both the source and references will need to be updated to match across the whole source. But (for example) a key could be a value in a file or it could be in the filename, but it should never be both at once. If a file has key data in the file name, that same data shouldn't appear in the file. If the data is in the file, the file name should be generic and not need to change when the data does. That's an abstract example, but the principle seems to be eschewed in several places in UFO right now, |
@justvanrossum |
To keep things on track towards making this discussion actionable, is there consensus on these?
Item 2 is one that we would need to carefully deploy since it would be a breaking change. I took a quick look at |
One of the best parts about UFO as a format is that it is much closer than any other format to facilitating lossless collaboration on a font design. That being said, it's not quite there. Being able to collaborate means being able to use different tools to actually edit the font and VCS to track the differences. Frankly, the state of VCS usage across the font industry right now is pretty much a disaster, with the vast majority of projects just using it to store and publish snapshots, something that could just as easily be done with Dropbox or BTRFS. Most of the VCS tooling out there is somewhere between frustrating and useless, and as a result designers are getting little mileage out of it. Realistically, while some nerdy programmer types can figure it out, most designers can't keep design ideas in branches, they can't merge contributions from other people, and so forth. Almost inevitably, and due just to the nature of the data formats, trying to use VCS tooling to merge, rebase, branch, cherry-pick, etc. leads to merge conflicts that have to be dealt with by hand even when the nature of the changes doesn't conflict.
In the case of UFO the nature of those conflicts is much less egregious than in other leading formats I'm familiar with (Glyphs, SFD). It's close to being usable but there are still pain points.
The current 3 major points of the design philosophy indirectly imply most of what's needed.
The data must be human readable and human editable.
Human readable implies plain text, not some binary format. In this case all the competing formats are text based anyway, which is a good starting point for VCS, but it isn't enough. PostScript is a text-based format too; have you ever tried to merge two PostScript data sources by hand?
Human readable/editable helps a lot, but have you ever tried to merge changes from two different editors to the same Markdown file? I do this at a book publishing company regularly and the tooling just doesn't support it well. It is 1000× easier if you mandate additional restrictions above and beyond "valid Markdown". For example, one line per sentence plays much nicer with version control during the editing process later. Forcing a fixed-width wrap during the initial writing process and then making sure edits never change the line wrapping is another way to skin the cat. My point is Markdown is great at being human readable and human editable, but it is terrible at being diff-able. Most Git tooling can't handle it. Tools that can do word or character level diffs are a lot nicer to work with, but almost zero tooling is available for merging at that level.
The data should be application independent.
This helps too, but still leaves room for obnoxious things like datestamps (this is metadata that the file system and/or VCS can better provide) or "generated by" values that add negative value in the context of VCS. If I share a font project with Bob and Alice (did I just give away my background in network security?) and ask Bob to clean up the vowels and Alice to clean up the consonants, and they both do exactly as told and don't touch any other letters, there shouldn't be any reason I can't merge their work in asynchronously. If both file versions coming back have differing datestamp and generator fields, suddenly this becomes a manual merge operation, not an automatic one.
Another example would be key/value order. Right now there are 2 dedicated UFO normalizers plus other ways to normalize (e.g. round-tripping through Python w/ defcon or a UFO editor). Using one of these methods consistently helps, but there is still room in the spec for odds and ends that can and do get implemented differently and thus cause merge errors.
Data duplication should be avoided unless absolutely necessary.
This helps, but going further and making a place for unnecessary data where it doesn't choke up essential operations would be better.
This issue isn't to hash out all the specifics of what should happen, only to suggest that the spec project add an explicit design goal of making collaboration easier using VCS tooling (whether that's Git or DARCS or Mercurial or Bazaar or whatever; the issues generalize to "how well does thing A diff with thing B?").