The USFM parser does not recognize some verse text near \fm markers #155

Open
benjaminking opened this issue Feb 20, 2025 · 7 comments

@benjaminking

When using the USFM parser on text that contains a \fm marker, it fails to recognize some verse text as being verse text. It extracts the verse text segments, but state.is_verse_text is False.

Here is an example of an input that it fails on:
\v 19 Your\f + \fr 19:19 \ft The Hebrew is singular.\f* servant has found favor in your\fm \f + \fr 19:19 \ft The Hebrew is singular.\f* eyes, and you\fm \f + \fr 19:19 \ft The Hebrew is singular.\f* have shown great kindness to me in sparing my life. But I canʼt flee to the mountains; this disaster will overtake me, and Iʼll die. \v 20 Look, here is a town near enough to run to, and it is small. Let me flee to it—it is very small, isnʼt it? Then my life will be spared.”

All segments after the first \fm have a value of False for state.is_verse_text. If further examples are needed, I can provide more.

@benjaminking
Author

I was looking at the USFM stylesheet, and it says that \fm is supposed to be paired with \fm*, which it is not in this translation. I'm not sure if that means we're looking at bad USFM, or if the stylesheet might be wrong (it is a rather obscure tag and this is a very prominent translation).
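A quick way to check whether a given USFM text pairs its \fm markers is to count openers and closers. This is a minimal sketch using only the standard library; `count_fm_markers` is a hypothetical helper name, not part of the parser.

```python
import re

# An opening \fm: the marker must not be followed by another word
# character (a different marker) or by "*" (the closing form).
FM_OPEN = re.compile(r"\\fm(?![\w*])")
# The closing \fm* marker.
FM_CLOSE = re.compile(r"\\fm\*")


def count_fm_markers(usfm: str) -> tuple[int, int]:
    """Return (openers, closers): how often \\fm and \\fm* occur in the text."""
    return len(FM_OPEN.findall(usfm)), len(FM_CLOSE.findall(usfm))
```

If the two counts differ for a book, the text has at least one unpaired \fm, as in the example from the issue description.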

@johnml1135
Collaborator

I am OK with just closing this one. It is a clear error in the USFM, and although it affects the NIV, it does so only minimally.

@ddaspit
Contributor

ddaspit commented Feb 20, 2025

I don't want to close this. Although the USFM is invalid, I think we can handle it better.

@benjaminking
Author

I did some checking on other Paratext projects to see if this marker is misused elsewhere. The only place where it is consistently misused is NIV11 (and its variants: NIV11R, NIV11UK). Unfortunately, since NIV11 is one of our most used source translations, it's probably worth adding some special logic to handle these errors.

The good news is that they make the same error consistently. \fm is supposed to appear only inside a footnote/endnote, but as in the example above, they always put a single \fm immediately before \f. So we might be able to do something as simple as ignoring \fm when it appears outside a footnote.
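One way to read "ignore \fm outside a footnote" is as a preprocessing pass over the raw USFM text, before it reaches the parser. The following is only a sketch under that assumption: `strip_stray_fm` is a hypothetical helper, it treats \f and \fe as the only note openers, and it is not how the parser itself handles the marker.

```python
import re

# Tokens of interest: note closers (\f*, \fe*), note openers (\f, \fe),
# and an opening \fm together with one trailing space, if present.
TOKEN = re.compile(r"\\(fe?\*|fe?(?![\w*])|fm(?![\w*])[ ]?)")


def strip_stray_fm(usfm: str) -> str:
    """Remove \\fm markers that occur outside a footnote or endnote."""
    out = []
    pos = 0
    in_note = False
    for match in TOKEN.finditer(usfm):
        token = match.group(1)
        if token.endswith("*"):
            in_note = False  # \f* or \fe* closes the note
        elif token.startswith("fm"):
            if not in_note:
                # Keep the text before the stray marker, skip the marker.
                out.append(usfm[pos:match.start()])
                pos = match.end()
        else:
            in_note = True  # \f or \fe opens a note
    out.append(usfm[pos:])
    return "".join(out)
```

An \fm that sits inside a note is left alone, so valid uses of the marker are not affected.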

@johnml1135
Collaborator

What would "ignore" mean? One thing it could mean is that when the parser is about to start an embed because of an \fm, it declines and makes no change to its state instead. That is feasible.
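To make the "make no change" idea concrete, here is a toy model of a parser that tracks open embeds on a stack. It is an illustration only: the function name `segments`, the `ignore_stray_fm` flag, and the stack logic are assumptions for this sketch and do not reflect the parser's real internals. It handles just \v, \f, \fe, \fm and their closers.

```python
import re


def segments(usfm: str, ignore_stray_fm: bool):
    """Yield (text, is_verse_text) pairs from a tiny subset of USFM."""
    in_verse = False
    expect_number = False
    embeds: list[str] = []  # currently open notes / character styles
    for part in re.split(r"(\\\w+\*?)", usfm):
        if not part.startswith("\\"):
            if expect_number:
                # Drop the verse number that follows \v.
                part = re.sub(r"^\s*\d+\s?", "", part)
                expect_number = False
            if part.strip():
                yield part.strip(), in_verse and not embeds
            continue
        marker = part[1:]
        if marker == "v":
            in_verse, expect_number = True, True
        elif marker in ("f", "fe"):
            embeds.append(marker)
        elif marker in ("f*", "fe*"):
            # Pop everything opened inside the note, then the note itself.
            while embeds and embeds.pop() != marker[:-1]:
                pass
        elif marker == "fm":
            if ignore_stray_fm and not embeds:
                continue  # stray \fm outside a note: no state change
            embeds.append(marker)
        elif marker == "fm*" and embeds and embeds[-1] == "fm":
            embeds.pop()
        # Other markers (\fr, \ft, ...) are ignored by this toy model.
```

In this model, with `ignore_stray_fm=False` an unclosed \fm stays on the stack after the following \f ... \f* pair closes, so later text is no longer counted as verse text. With `ignore_stray_fm=True` the stray marker never enters the stack and the text after the footnote is verse text again.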

@benjaminking
Author

I looked in Paratext to see how this issue was handled there, but when I download the NIV11 (and variants) as a resource, the USFM is actually different from the version in the Paratext projects folder. And crucially, the resource versions downloaded from Paratext don't contain these errors. Does anyone know if there are different versions of the NIV11 in Paratext?

@ddaspit
Contributor

ddaspit commented Feb 28, 2025

That's interesting. Maybe we have an older version of NIV11R on the S3 bucket. We can update to the version from Paratext.

3 participants