Quotation denormalization #203

benjaminking · 2025-06-28T15:31:02Z

I am creating a PR for quotation denormalization even though it still needs to be rebased. Eli had thought it would be helpful to go through it and make some comments, even if there were some minor changes still coming.

This change is

…translations. These updates make it happen.

Pass marker type (embed, style) to update block

…ests

codecov-commenter · 2025-06-28T20:48:35Z

Codecov Report

Attention: Patch coverage is 98.73251% with 48 lines in your changes missing coverage. Please review.

Project coverage is 90.99%. Comparing base (d8aa497) to head (7e7d577).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
...a/punctuation_analysis/quotation_mark_tabulator.py	81.63%	9 Missing ⚠️
...e/corpora/punctuation_analysis/quote_convention.py	91.01%	8 Missing ⚠️
...achine/corpora/fallback_quotation_mark_resolver.py	90.32%	6 Missing ⚠️
...on_analysis/depth_based_quotation_mark_resolver.py	98.09%	4 Missing ⚠️
...on_analysis/preliminary_quotation_mark_analyzer.py	98.57%	3 Missing ⚠️
...unctuation_analysis/quotation_mark_string_match.py	96.84%	3 Missing ⚠️
...rpora/punctuation_analysis/quote_convention_set.py	96.93%	3 Missing ⚠️
...a/punctuation_analysis/usfm_structure_extractor.py	95.00%	3 Missing ⚠️
.../quote_convention_detection_resolution_settings.py	92.85%	2 Missing ⚠️
.../punctuation_analysis/quote_convention_detector.py	94.73%	2 Missing ⚠️
... and 4 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #203      +/-   ##
==========================================
+ Coverage   88.91%   90.99%   +2.07%     
==========================================
  Files         282      330      +48     
  Lines       17056    21509    +4453     
==========================================
+ Hits        15166    19572    +4406     
- Misses       1890     1937      +47

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Enkidu93

Exciting to see this, Ben! I left just a few comments regarding function names, getters/setters, and unused functions. These are representative of similar situations elsewhere in the code. I think these changes could significantly reduce the amount of boiler-plate code necessary as well as improve readability - both of which will make porting much faster. Thank you for your hard work and my apologies for only now seeing that you'd put this PR in. I usually get notified 🤔.

Reviewed 6 of 49 files at r1, all commit messages.
Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions

machine/corpora/analysis/quotation_mark_metadata.py line 47 at r1 (raw file):

        return self.text_segment

    def get_start_index(self) -> int:

Unused function

machine/corpora/analysis/quotation_mark_tabulator.py line 50 at r1 (raw file):

        self.quotation_counts_by_depth_and_direction[key].count_quotation_mark(quotation_mark)

    def _has_depth_and_direction_been_observed(self, depth: int, direction: QuotationMarkDirection) -> bool:

Would _have_depth_and_direction... be better?

machine/corpora/analysis/text_segment.py line 29 at r1 (raw file):

        if self.immediate_preceding_marker != value.immediate_preceding_marker:
            return False
        return True

Why not also check the other properties?

machine/corpora/analysis/text_segment.py line 31 at r1 (raw file):

        return True

    def get_text(self) -> str:

I don't think many of these helper methods are necessary. In cases where the field is not read-only and there's no other logic in the getters and setters, why not just use the bare variable - i.e., text_segment.previous_segment rather than set_.../get_...? If it is read-only, you can use the @property decorator (you can search for lots of examples in the machine.py codebase). That would cut down the size of this boiler-plate code a lot. It would also make the logic elsewhere more succinct and readable.

machine/corpora/analysis/text_segment.py line 43 at r1 (raw file):

        return self.text[index:]

    def get_immediate_preceding_marker_type(self) -> UsfmMarkerType:

Unused function

machine/corpora/analysis/text_segment.py line 62 at r1 (raw file):

    def replace_substring(self, start_index: int, end_index: int, replacement: str) -> None:
        self.text = self.text[:start_index] + replacement + self.text[end_index:]

What about substring_before(start_index) + replacement + substring_after(start_index)?

machine/corpora/analysis/quotation_mark_string_match.py line 48 at r1 (raw file):

        return regex_pattern.search(self.get_quotation_mark()) is not None

    def does_next_character_match(self, regex_pattern: regex.Pattern) -> bool:

Maybe next_character_matches()? I think <noun>_<verb>(e)s() is cleaner than does_<noun>_<verb>(). It's shorter, it sounds more natural in if statements, and I don't think does_... is used elsewhere in machine.py.

Enkidu93

Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions (waiting on @benjaminking)

machine/corpora/analysis/text_segment.py line 62 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

What about substring_before(start_index) + replacement + substring_after(start_index)?

*Sorry, end_index, but you get the idea :)

benjaminking · 2025-07-01T19:09:51Z

Unused function

Removed

Would _have_depth_and_direction... be better?

I have changed this, along with several other similarly-named functions.

Why not also check the other properties?

In this case, the object has references to previous and next instances (a linked list basically), so including it in the equality function would lead to infinite recursion.

I don't think many of these helper methods are necessary. In cases where the field is not read-only and there's no other logic in the getters and setters, why not just use the bare variable - i.e., text_segment.previous_segment rather than set_.../get_...? If it is read-only, you can use the @property decorator (you can search for lots of examples in the machine.py codebase). That would cut down the size of this boiler-plate code a lot. It would also make the logic elsewhere more succinct and readable.

I have made these changes throughout my code. Thanks for the suggestion.

Unused function

Removed.

What about substring_before(start_index) + replacement + substring_after(start_index)?

Good suggestion. Done.

Maybe next_character_matches()? I think <noun>_<verb>(e)s() is cleaner than does_<noun>_<verb>(). It's shorter, it sounds more natural in if statements, and I don't think does_... is used elsewhere in machine.py.

Changed as part of the naming changes I mentioned earlier.

I also did a fair bit of minor refactoring to make things more consistent (that will be the majority of the changes in the commit).

Enkidu93

Great!

Reviewed 1 of 39 files at r2, all commit messages.
Reviewable status: 2 of 49 files reviewed, 1 unresolved discussion

machine/corpora/analysis/text_segment.py line 25 at r2 (raw file):

        if self._index_in_verse != value._index_in_verse:
            return False
        if self._index_in_verse != value._index_in_verse:

I don't think you meant to check this twice. Maybe you meant to check one or more of the other non-TextSegment properties?

Enkidu93 · 2025-07-01T19:27:59Z

I'll try and give it a more thorough review soon - thank you for the refactor and cleanup! I'll also begin porting more of the code now. I'm sure Damien will have more comments 🤪.

benjaminking · 2025-07-02T15:17:30Z

I don't think you meant to check this twice. Maybe you meant to check one or more of the other non-TextSegment properties?

Yup, I meant to check _num_segments_in_verse. Fixed.

Enkidu93

Sorry - my comments might trickle in now as I notice things. (Also, your replies are showing up in the wrong spot for me in Reviewable 🤔? Do you see that too?)

Reviewable status: 2 of 49 files reviewed, 2 unresolved discussions

machine/corpora/analysis/quote_convention.py line 60 at r3 (raw file):

        return True

    def get_name(self) -> str:

Looks like this method should be replaced with a property. .name is used in QuoteConvetionSet and get_name() elsewhere.

machine/corpora/analysis/quote_convention.py line 63 at r3 (raw file):

        return self.name

    def get_num_levels(self) -> int:

It's your call, but a property might be nice for this as well.

Enkidu93

Reviewable status: 2 of 49 files reviewed, 4 unresolved discussions (waiting on @benjaminking)

machine/corpora/analysis/quote_convention.py line 73 at r3 (raw file):

    def get_expected_quotation_mark(self, depth: int, direction: QuotationMarkDirection) -> str:
        if depth > len(self.levels) or depth < 1:

Could use get_num_levels/num_levels

machine/corpora/analysis/quote_convention.py line 122 at r3 (raw file):

        return QuoteConvention(self.get_name() + "_normalized", [level.normalize() for level in self.levels])

    def print_summary(self) -> None:

Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of _get_summary_message, you could override __repr__ or __str__.

ddaspit

Reviewed 10 of 49 files at r1, 37 of 39 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @benjaminking and @Enkidu93)

machine/corpora/analysis/quote_convention.py line 122 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of _get_summary_message, you could override __repr__ or __str__.

I agree.

machine/corpora/analysis/quotation_mark_tabulator.py line 76 at r3 (raw file):

        return 1 - (num_differences / num_total_quotation_marks)

    def print_summary(self) -> None:

Consider using __str__ or __rep__. If that is inappropriate, you can have the method return a string description instead of calling print directly.

machine/corpora/analysis/quotation_mark_finder.py line 13 at r3 (raw file):

class QuotationMarkFinder:
    quote_pattern = regex.compile(r"(\p{Quotation_Mark}|<<|>>|<|>)", regex.U)

This should be all caps and probably prefixed with an underscore.

machine/corpora/scripture_embed.py line 0 at r3 (raw file):
Is this module used anywhere?

machine/corpora/analysis/quotation_mark_resolution_settings.py line 28 at r3 (raw file):

    @abstractmethod
    def should_rely_on_paragraph_markers(self) -> bool: ...

This could be a property.

machine/corpora/analysis/quote_convention_set.py line 161 at r3 (raw file):

        return (best_quote_convention, best_similarity)

    def print_summary(self) -> None:

Consider using __str__ or __rep__ instead of direct print calls.

machine/corpora/analysis/quote_convention_detector.py line 18 at r3 (raw file):

@dataclass

Consider marking this dataclass as immutable.

machine/corpora/analysis/standard_quote_conventions.py line 4 at r3 (raw file):

from .quote_convention_set import QuoteConventionSet

standard_quote_conventions: QuoteConventionSet = QuoteConventionSet(

The naming convention for constants is all caps.

machine/corpora/analysis/quotation_mark_string_match.py line 16 at r3 (raw file):

    # extra stuff in the regex to handle Western Cham
    letter_pattern: Pattern = regex.compile(r"[\p{L}\U0001E200-\U0001E28F]", regex.U)

You should use all caps for constant names. I think this are only used internally, so they should be prefixed with an underscore.

machine/corpora/analysis/quote_convention.py line 6 at r3 (raw file):

from .quotation_mark_direction import QuotationMarkDirection

quote_normalization_map: Dict[str, str] = {

The naming convention for constants is all caps. If the constant is only used locally in the module, then you should prefix with an underscore.

machine/corpora/analysis/quote_convention.py line 22 at r3 (raw file):

@dataclass

If a dataclass is intended to be immutable, consider using @dataclass(frozen=True).

machine/corpora/analysis/__init__.py line 0 at r3 (raw file):
I think this package name isn't specific enough, unless you were thinking that other kinds of "analysis" classes might end up here in the future. What about punctuation_analysis?

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 104 at r3 (raw file):

class QuotationMarkCategorizer:
    apostrophe_pattern = regex.compile(r"[\'\u2019\u2018]", regex.U)

This should be all caps and prefixed with an underscore.

Enkidu93

Another batch of comments - wanted to get a chunk in so that they aren't all coming in at once.

Reviewed 1 of 39 files at r2.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @benjaminking)

machine/corpora/analysis/quote_convention_set.py line 25 at r3 (raw file):

        opening_quotation_marks: Set[str] = set()
        closing_quotation_marks: Set[str] = set()
        all_quotation_marks: Set[str] = set()

Since this is just the union of the other two, could we just set it as such downstream?

machine/corpora/analysis/quote_convention_set.py line 27 at r3 (raw file):

        all_quotation_marks: Set[str] = set()

        if len(self._conventions) > 0:

I don't think this if statement is needed

machine/corpora/analysis/quote_convention_set.py line 39 at r3 (raw file):

            if len(all_quotation_marks) > 0:
                self._opening_quotation_mark_regex: Pattern = regex.compile(
                    r"[" + "".join(sorted(list(opening_quotation_marks))) + "]"

Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?

machine/corpora/analysis/quote_convention_set.py line 49 at r3 (raw file):

        if len(opening_quotation_marks) == 0:
            self._opening_quotation_mark_regex = regex.compile(r"")

I don't know how computationally expensive it would be to just make these the defaults and avoid these if statements? Or would that not work in some cases?

machine/corpora/analysis/quote_convention_set.py line 55 at r3 (raw file):

            self._all_quotation_mark_regex = regex.compile(r"")

    def _create_quotation_mark_pair_map(self) -> None:

Could this function be combined with the above function without the combined function getting too hairy? Since they are called in the same place as a way of initializing everything and are iterating through the same collection in the same way, it would save a few lines and possibly be cleaner (?)

machine/corpora/analysis/quote_convention_set.py line 88 at r3 (raw file):

    def get_all_quote_convention_names(self) -> List[str]:
        return sorted([qc.name for qc in self._conventions])

More sorting - why is this needed?

machine/corpora/analysis/quotation_mark_string_match.py line 57 at r3 (raw file):

    @property
    def previous_character(self) -> Union[str, None]:
        if self._start_index == 0:

Use is_at_start_of_segment?

machine/corpora/analysis/quotation_mark_tabulator.py line 8 at r3 (raw file):

class QuotationMarkCounts:

I wonder if you could leverage python's Counter class to simplify this or remove it altogether?

machine/corpora/analysis/quotation_mark_tabulator.py line 10 at r3 (raw file):

class QuotationMarkCounts:
    def __init__(self):
        self._string_counts: Dict[str, int] = dict()

Could this name be more specific?

machine/corpora/analysis/quotation_mark_tabulator.py line 35 at r3 (raw file):

    def __init__(self):
        self.quotation_counts_by_depth_and_direction: dict[tuple[int, QuotationMarkDirection], QuotationMarkCounts] = (

Should this be made private, as it were?

machine/corpora/analysis/quotation_mark_tabulator.py line 43 at r3 (raw file):

            self._count_quotation_mark(quotation_mark)

    def _count_quotation_mark(self, quote: QuotationMarkMetadata) -> None:

Throughout the code, there seems to be inconsistent use of quote, quotationMark, and quotation in method names and variables (or at least I haven't figured out the pattern). It would be nice to make this consistent.

machine/corpora/analysis/quotation_mark_tabulator.py line 59 at r3 (raw file):

    def calculate_similarity(self, quote_convention: QuoteConvention) -> float:
        num_differences = 0

Could we qualify this name weighted_count... or something? The name suggests to me that it's an int but it's really a float.

Enkidu93

Reviewable status: all files reviewed, 32 unresolved discussions (waiting on @benjaminking)

machine/corpora/analysis/preliminary_quotation_analyzer.py line 43 at r3 (raw file):

    def count_word_initial_apostrophe(self, quotation_mark: str) -> None:
        if quotation_mark not in self._word_initial_occurrences:

This very same pattern of checking if a key exists then updating the value based on the previous value happens a lot in the code. It might be worth using python's defaultdict to avoid this. I have an extension in C# UpdateValue that I can use, I think, to a similar end.

machine/corpora/analysis/preliminary_quotation_analyzer.py line 76 at r3 (raw file):

        num_initial_marks: int = self._get_word_initial_occurrences(quotation_mark)
        num_total_marks: int = self._get_total_occurrences(quotation_mark)
        return num_total_marks > 0 and num_initial_marks / num_total_marks < 0.1

This class uses a lot of "magic numbers". I don't know if it would be possible to describe them and put them into constants??

machine/corpora/analysis/preliminary_quotation_analyzer.py line 103 at r3 (raw file):

        self._later_quotation_mark_counts: Dict[str, int] = dict()

    def record_earlier_quotation_mark(self, quotation_mark: str) -> None:

Above you use count...() for method names but here you use record...().

machine/corpora/analysis/preliminary_quotation_analyzer.py line 155 at r3 (raw file):

        self._group_quotation_marks(quotation_marks)

    def _group_quotation_marks(self, quotation_marks: List[QuotationMarkStringMatch]) -> None:

You could use functools.reduce or itertools.grouby to make this a one-liner. It looks like we use groupby in thot_smt_model_trainer.py.

Enkidu93

Another handful of comments. I leave it to Damien's discretion if he'd like to leave any of these unaddressed for now in order to expedite getting the code in. I tried to make requested changes that I thought were less important non-blocking comments. I'm nearly through porting the code itself (almost all tests left), so my trickle of comments should be running dry soon. Thank you for doing this & my apologies for all the minor requested changes!

Reviewed 3 of 49 files at r1, 16 of 39 files at r2, 1 of 2 files at r3.
Reviewable status: all files reviewed, 44 unresolved discussions (waiting on @benjaminking)

machine/corpora/analysis/preliminary_quotation_analyzer.py line 273 at r3 (raw file):

    def __init__(self, quote_conventions: QuoteConventionSet):
        self._quote_conventions = quote_conventions

In some places QuoteConventionSet objects are called quoteConventionSets and in others quoteConventions. It would be nice if the variable naming were consistent.

machine/corpora/analysis/text_segment.py line 70 at r3 (raw file):

    # These setters need to be implemented outside the builder to avoid circular dependencies
    def set_previous_segment(self, previous_segment: "TextSegment") -> None:

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 26 at r3 (raw file):

    @property
    def current_depth(self) -> int:
        return self._current_depth + 1

This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1 in the calling code? (...with a comment as needed)

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 49 at r3 (raw file):

        if depth > len(self._quotation_stack):
            raise RuntimeError(
                "get_opening_quotation_mark_at_depth() was called with a depth greater than the quotation stack size."

I don't think you necessarily need to pass the name of the function since it would be available in a stack trace. You may, however, want to pass the stack size and depth.

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 56 at r3 (raw file):

        if not self.has_open_quotation_mark():
            raise RuntimeError(
                "get_deepest_opening_quotation_mark() was called when the stack of quotation marks was empty."

I don't think this error message is correct.

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 77 at r3 (raw file):

    @property
    def current_depth(self) -> int:

See above comment. I might be missing something: Why +1 above but not here?

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 130 at r3 (raw file):

            if quote_match._start_index > 0:
                return False
            if quote_match.quotation_mark != self._quotation_mark_resolver_state.get_opening_quotation_mark_at_depth(

I think this logic can be simplified by taking this check and moving it outside the if/else - something like:

if quote_match.quotation_mark != ...:
    return False

if not _quotation_continuer_state.continuer_has_been_observed():
    ...

return true

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 163 at r3 (raw file):

            if quote_match.quotation_mark != "»":
                return False
            if not self._settings.are_marks_a_valid_pair(

Same thing as above method - I think this can be moved outside the if/else unless one of the bool functions is changing the state somehow.

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 187 at r3 (raw file):

        self,
        quote_match: QuotationMarkStringMatch,
        previous_match: Union[QuotationMarkStringMatch, None],

previous_match and next_match are unused.

machine/corpora/analysis/quotation_mark_resolver.py line 13 at r3 (raw file):

    def __init__(self, settings: QuotationMarkResolutionSettings):
        self._settings = settings

Is a default constructor needed?

machine/corpora/analysis/quotation_mark_resolver.py line 20 at r3 (raw file):

    ) -> Generator[QuotationMarkMetadata, None, None]: ...

    def reset(self) -> None: ...

Why is this method not marked with abstractmethod?

machine/corpora/analysis/usfm_structure_extractor.py line 16 at r3 (raw file):

        self._reset()

    def _reset(self):

Why a separate private reset function? Could we just do this in __init__()?

benjaminking · 2025-07-10T19:03:48Z

Since there are a bunch of comments, I will just skip the ones where my response would just be "Done."

Is this module used anywhere?

It must have been an artefact left over from rebasing with a long history. I deleted it now.

I think this package name isn't specific enough, unless you were thinking that other kinds of "analysis" classes might end up here in the future. What about punctuation_analysis?

I think that early on, John had thought that we should group all the things that analyze USFM documents together, but I don't think we have any plans for other types of analysis right now, so I went ahead and renamed it to punctuation_analysis.

Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?

Yes, it's for testing, since list(some_set) doesn't put the elements in a predictable order.

Could this function be combined with the above function without the combined function getting too hairy? Since they are called in the same place as a way of initializing everything and are iterating through the same collection in the same way, it would save a few lines and possibly be cleaner (?)

I think I prefer to keep these separate. They work similarly and could be combined, but don't necessarily need to work the same way. I think we get better maintainability keeping them separate, in case the requirements for one of them change. And the performance difference should be negligible.

Throughout the code, there seems to be inconsistent use of quote, quotationMark, and quotation in method names and variables (or at least I haven't figured out the pattern). It would be nice to make this consistent.

My intention was to use "quote" to refer to "the words that a person is saying," while "quotation mark" refers to a punctuation mark that indicates the presence of a quote. I've gone through everything to make it more consistent. A couple notes about the terminology:

A "quote convention" is the set of rules used to mark where quotes start and end
A "quote introducer" is combination of punctuation and whitespace that indicates that a quote will follow. For example, the comma and space in Bob said, "<quote>" or the colon and space in Bob said: "<quote>"
A "quote continuer" is used to indicate that a quote is being continued across paragraphs. In English, you don't use any closing quotation marks at the end of the first paragraph, but use new opening quotation marks at the start of the second paragraph. These quotation marks are called "quote continuers."
I think everything else in the code should refer to "quotation marks."

This class uses a lot of "magic numbers". I don't know if it would be possible to describe them and put them into constants??

I have refactored them into constants. Hopefully the names make sense -- it's hard to come up with concise names for some of these things.

You could use functools.reduce or itertools.grouby to make this a one-liner. It looks like we use groupby in thot_smt_model_trainer.py.

It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using defaultdict.

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?

I went ahead and gave them property setters, so they should behave as if they're public.

This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1 in the calling code? (...with a comment as needed)

This is a good call. I changed current_depth to be defined as the length of the stack for both classes.

I don't think this error message is correct.

I'm not sure I see why it's incorrect. Can you expound?

Enkidu93

Thanks! Great. I'll look back through the comments and try to confirm that all have been addressed. If you could please reply in Reviewable within the thread specific to that comment, it would make the back and forth a lot easier as we continue.

I went ahead and ported the changes you've made so far.

This should be my last round of new comments 🤪 🤞.

It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using defaultdict.
itertools is in the poetry lock and groupby is used in thot_smt_model_trainer.py - what do you mean by 'project'? It's not a big difference if it's left as-is, but it would simplify the function.

I'm not sure I see why it's incorrect. Can you expound?
I see now. I misread it and thought it had been copy-pasted from the previous error message. I like the new error message you came up with better 👍.

Reviewed 1 of 49 files at r1.
Reviewable status: 3 of 49 files reviewed, 57 unresolved discussions (waiting on @benjaminking and @ddaspit)

machine/corpora/analysis/text_segment.py line 70 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?

Since there's no custom logic in the getter or setter, I'd just ditch the properties and replace e.g. _previous_segment with previous_segment.

machine/corpora/quotation_mark_update_first_pass.py line 37 at r4 (raw file):

    ) -> bool:
        target_marks_by_source_marks: Dict[str, Set[str]] = {}
        for level in range(1, source_quote_convention.num_levels + 1):

Some places depth is used and elsewhere level. Is there a true difference in meaning?

machine/corpora/quotation_mark_update_first_pass.py line 71 at r4 (raw file):

        return self._choose_best_strategy_based_on_observed_issues(self._quotation_mark_resolver.get_issues())

    def _choose_best_strategy_based_on_observed_issues(self, issues) -> QuotationMarkUpdateStrategy:

Type-hinting for issues

machine/corpora/punctuation_analysis/text_segment.py line 78 at r4 (raw file):

        self._next_segment = next_segment

    def set_index_in_verse(self, index_in_verse: int) -> None:

Same for index_in_verse as the segment fields.

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 69 at r4 (raw file):

    def _apply_fallback_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock:
        for element in block._elements:

Use Elements

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 74 at r4 (raw file):

    def _apply_standard_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock:
        for element in block._elements:

Use Elements

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 111 at r4 (raw file):

    def _create_text_segment(self, token: UsfmToken) -> Union[TextSegment, None]:
        self._next_scripture_text_segment_builder.set_usfm_token(token)

Might be a little cleaner to do something like:

text_segment_to_return = null
if token.text != null:
    ...
return text_segment_to_return

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 132 at r4 (raw file):

            if scripture_ref.chapter_num != self._current_chapter_number:
                self._current_chapter_number = scripture_ref.chapter_num
                self._start_new_chapter(self._current_chapter_number)

No need to pass bare field as parameter

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 147 at r4 (raw file):

            ):
                self._current_verse_number = scripture_ref.verse_num
                self._start_new_verse(self._current_verse_number)

No need to pass bare field as parameter

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 149 at r4 (raw file):

                self._start_new_verse(self._current_verse_number)

    def _start_new_verse(self, new_chapter_number: int) -> None:

new_chapter_number is unused

machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py line 93 at r4 (raw file):

        self._quote_continuer_mark_stack.append(quotation_mark)
        self._continuer_style = quote_continuer_style
        if len(self._quote_continuer_mark_stack) == len(quotation_mark_resolver_state._quotation_stack):

Should you use current_depth + 1?

machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py line 168 at r4 (raw file):

                return False

            # Check the next quotation mark match, since quote continuers must appear consecutively

Very minor, but some comments are starting with a capital letter - others are lowercase

machine/corpora/punctuation_analysis/quote_convention_detector.py line 53 at r4 (raw file):

        self._quotation_mark_tabulator.tabulate(resolved_quotation_marks)

    def detect_quote_convention(self, print_summary: bool) -> Union[QuoteConventionAnalysis, None]:

Throughout the code: Maybe Optional[Type] rather than Union[Type, None]?

machine/corpora/quotation_mark_update_resolution_settings.py line 16 at r4 (raw file):

        self._source_quote_convention = source_quote_convention
        self._quote_convention_singleton_set = QuoteConventionSet([self._source_quote_convention])
        self._target_quote_convention = target_quote_convention

This is unused.

ddaspit

Reviewed 46 of 46 files at r4, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @benjaminking and @Enkidu93)

machine/corpora/punctuation_analysis/quote_convention_detector.py line 53 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Throughout the code: Maybe Optional[Type] rather than Union[Type, None]?

For Python 3.9 and earlier Optional[Type] seems to be the preferred syntax. Once we move to 3.10+, Type | None seems to be preferred.

tests/corpora/test_fallback_quotation_mark_resolver.py line 15 at r4 (raw file):

def test_reset():
    english_quote_convention = standard_quote_conventions.STANDARD_QUOTE_CONVENTIONS.get_quote_convention_by_name(

This constant should probably be exported from the machine.corpora.punctuation_analysis package. It would make using it nicer.

… to headers when stripping paragraphs (#207)

benjaminking · 2025-07-17T15:43:40Z

A few higher-level notes:

I replied to this round of comments in Reviewable. Would it be helpful for me to go and reply to all the older comments there?
I added a new test that demonstrates the full process for performing quotation denormalization (since it involves a couple steps)
I have rebased to the latest revision of the main branch

And concerning itertools and functools, I hadn't realized they were part of the standard library -- my mistake. But I did try out refactoring the function in question to use reduce, and while it was technically reduced to one line, formatting turned it into 8 lines, so I decided to leave it as it was. (and it turns out that the functionality needed to make groupby work in this scenario is only available for later Python versions).

benjaminking

Reviewable status: 21 of 64 files reviewed, 46 unresolved discussions (waiting on @ddaspit and @Enkidu93)

machine/corpora/quotation_mark_update_first_pass.py line 37 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Some places depth is used and elsewhere level. Is there a true difference in meaning?

I've tried to use level in the context of rules and depth in the context of observed quotation marks. Hopefully this commit clarifies that a bit be

machine/corpora/quotation_mark_update_first_pass.py line 71 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Type-hinting for issues

Done.

machine/corpora/quotation_mark_update_resolution_settings.py line 16 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This is unused.

Done.

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 69 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Use Elements

Done.

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 74 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Use Elements

Done.

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 132 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

No need to pass bare field as parameter

Done.

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 147 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

No need to pass bare field as parameter

Done.

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 149 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

new_chapter_number is unused

Done.

machine/corpora/scripture_embed.py line at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Is this module used anywhere?

It seems to have been a relic of rebasing. I've removed it.

ddaspit

@benjaminking Something weird happened with your branch. There seems to be some recent commits from the main branch that have been added to your branch. The preferred way to catch up with the main branch is to rebase on top of the main branch. As a last resort, you can merge the main branch into your branch.

Reviewed 4 of 43 files at r5, all commit messages.
Reviewable status: 25 of 64 files reviewed, 46 unresolved discussions (waiting on @benjaminking and @Enkidu93)

johnml1135 and others added 17 commits June 28, 2025 13:53

Some tests pass

44f2326

Added more test framework

90f3c93

Basic implementation and tests for quote convention detection

8e49f0b

Initial working version of quotation denormalization

101d3e2

I want to process the data in segments that correspond to individual …

f7ba4e5

…translations. These updates make it happen.

Updates for reviewer comments

541ba1c

Respond to reviewer comments

56f1ae5

Pass marker type (embed, style) to update block

Additional denormalization tests (not all passing)

f5bf1ee

Rebase + additional denormalization tests

f0c08cb

Improved handling for NLLB-produced quote errors

3adbb5e

Unit tests for basic quotation mark resolver

3542952

Unit tests for UsfmStructureExtractor

aef202c

Unit tests for several quotation mark analysis classes

a704d61

Fix a bug related to verse markers before quotation marks

7451a72

Refactoring to allow arbitrary quote convention changes + more unit t…

b1d1f28

…ests

Remaining unit tests

a594d75

Change update_usfm_parser_handler.py to match main branch

defe7dc

benjaminking force-pushed the quotation_denormalization branch from 5afb42c to defe7dc Compare June 28, 2025 20:33

Fix linting for test_quote_convention_detector.py

904edad

Enkidu93 requested changes Jun 30, 2025

View reviewed changes

Address reviewer comments + refactor

550d8a9

Enkidu93 requested changes Jul 1, 2025

View reviewed changes

Correct TextSegment equality function

83d89b8

Enkidu93 requested changes Jul 2, 2025

View reviewed changes

ddaspit requested changes Jul 2, 2025

View reviewed changes

Enkidu93 requested changes Jul 7, 2025

View reviewed changes

Damien's requested code-review changes

a24ef2b

Enkidu93 requested changes Jul 9, 2025

View reviewed changes

Ben King added 2 commits July 10, 2025 10:40

Eli's requested code-review changes

f2f929a

One code review change that was left out of the previous commit

38d3f05

Enkidu93 requested changes Jul 14, 2025

View reviewed changes

ddaspit reviewed Jul 15, 2025

View reviewed changes

Enkidu93 and others added 7 commits July 17, 2025 10:52

Pass metadata through update block (#202)

043c802

Add marker behavior to metadata for marker placement, fix bug related…

59fcb8f

… to headers when stripping paragraphs (#207)

Report additional ClearML progress via user properties (#205)

ab6e853

Add support to add tags to the ClearML task via build_options (#208)

bc735e2

Update machine library to 1.7.4 (#209)

f4ad9ea

Fixes for Eli's hopefully final code review comments

ee23dca

Fix typing issue in tests for QuotationMarkUpdateFirstPass

7e7d577

benjaminking commented Jul 17, 2025

View reviewed changes

ddaspit reviewed Jul 18, 2025

View reviewed changes

Uh oh!

Quotation denormalization #203

Are you sure you want to change the base?

Quotation denormalization #203

Uh oh!

Conversation

benjaminking commented Jun 28, 2025 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

benjaminking commented Jul 1, 2025

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 commented Jul 1, 2025

Uh oh!

benjaminking commented Jul 2, 2025

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

benjaminking commented Jul 10, 2025

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

benjaminking commented Jul 17, 2025

Uh oh!

benjaminking left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

benjaminking commented Jun 28, 2025 •

edited by ddaspit

Loading

codecov-commenter commented Jun 28, 2025 •

edited

Loading