Skip to content

Quotation denormalization #203

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 30 commits into
base: main
Choose a base branch
from
Open

Conversation

benjaminking
Copy link
Collaborator

@benjaminking benjaminking commented Jun 28, 2025

I am creating a PR for quotation denormalization even though it still needs to be rebased. Eli had thought it would be helpful to go through it and make some comments, even if there were some minor changes still coming.


This change is Reviewable

@benjaminking benjaminking force-pushed the quotation_denormalization branch from 5afb42c to defe7dc Compare June 28, 2025 20:33
@codecov-commenter
Copy link

codecov-commenter commented Jun 28, 2025

Codecov Report

Attention: Patch coverage is 98.73251% with 48 lines in your changes missing coverage. Please review.

Project coverage is 90.99%. Comparing base (d8aa497) to head (7e7d577).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...a/punctuation_analysis/quotation_mark_tabulator.py 81.63% 9 Missing ⚠️
...e/corpora/punctuation_analysis/quote_convention.py 91.01% 8 Missing ⚠️
...achine/corpora/fallback_quotation_mark_resolver.py 90.32% 6 Missing ⚠️
...on_analysis/depth_based_quotation_mark_resolver.py 98.09% 4 Missing ⚠️
...on_analysis/preliminary_quotation_mark_analyzer.py 98.57% 3 Missing ⚠️
...unctuation_analysis/quotation_mark_string_match.py 96.84% 3 Missing ⚠️
...rpora/punctuation_analysis/quote_convention_set.py 96.93% 3 Missing ⚠️
...a/punctuation_analysis/usfm_structure_extractor.py 95.00% 3 Missing ⚠️
.../quote_convention_detection_resolution_settings.py 92.85% 2 Missing ⚠️
.../punctuation_analysis/quote_convention_detector.py 94.73% 2 Missing ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #203      +/-   ##
==========================================
+ Coverage   88.91%   90.99%   +2.07%     
==========================================
  Files         282      330      +48     
  Lines       17056    21509    +4453     
==========================================
+ Hits        15166    19572    +4406     
- Misses       1890     1937      +47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting to see this, Ben! I left just a few comments regarding function names, getters/setters, and unused functions. These are representative of similar situations elsewhere in the code. I think these changes could significantly reduce the amount of boiler-plate code necessary as well as improve readability - both of which will make porting much faster. Thank you for your hard work and my apologies for only now seeing that you'd put this PR in. I usually get notified 🤔.

Reviewed 6 of 49 files at r1, all commit messages.
Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions


machine/corpora/analysis/quotation_mark_metadata.py line 47 at r1 (raw file):

        return self.text_segment

    def get_start_index(self) -> int:

Unused function


machine/corpora/analysis/quotation_mark_tabulator.py line 50 at r1 (raw file):

        self.quotation_counts_by_depth_and_direction[key].count_quotation_mark(quotation_mark)

    def _has_depth_and_direction_been_observed(self, depth: int, direction: QuotationMarkDirection) -> bool:

Would _have_depth_and_direction... be better?


machine/corpora/analysis/text_segment.py line 29 at r1 (raw file):

        if self.immediate_preceding_marker != value.immediate_preceding_marker:
            return False
        return True

Why not also check the other properties?


machine/corpora/analysis/text_segment.py line 31 at r1 (raw file):

        return True

    def get_text(self) -> str:

I don't think many of these helper methods are necessary. In cases where the field is not read-only and there's no other logic in the getters and setters, why not just use the bare variable - i.e., text_segment.previous_segment rather than set_.../get_...? If it is read-only, you can use the @property decorator (you can search for lots of examples in the machine.py codebase). That would cut down the size of this boiler-plate code a lot. It would also make the logic elsewhere more succinct and readable.


machine/corpora/analysis/text_segment.py line 43 at r1 (raw file):

        return self.text[index:]

    def get_immediate_preceding_marker_type(self) -> UsfmMarkerType:

Unused function


machine/corpora/analysis/text_segment.py line 62 at r1 (raw file):

    def replace_substring(self, start_index: int, end_index: int, replacement: str) -> None:
        self.text = self.text[:start_index] + replacement + self.text[end_index:]

What about substring_before(start_index) + replacement + substring_after(start_index)?


machine/corpora/analysis/quotation_mark_string_match.py line 48 at r1 (raw file):

        return regex_pattern.search(self.get_quotation_mark()) is not None

    def does_next_character_match(self, regex_pattern: regex.Pattern) -> bool:

Maybe next_character_matches()? I think <noun>_<verb>(e)s() is cleaner than does_<noun>_<verb>(). It's shorter, it sounds more natural in if statements, and I don't think does_... is used elsewhere in machine.py.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/text_segment.py line 62 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

What about substring_before(start_index) + replacement + substring_after(start_index)?

*Sorry, end_index, but you get the idea :)

@benjaminking
Copy link
Collaborator Author

Unused function

Removed

Would _have_depth_and_direction... be better?

I have changed this, along with several other similarly-named functions.

Why not also check the other properties?

In this case, the object has references to previous and next instances (a linked list basically), so including it in the equality function would lead to infinite recursion.

I don't think many of these helper methods are necessary. In cases where the field is not read-only and there's no other logic in the getters and setters, why not just use the bare variable - i.e., text_segment.previous_segment rather than set_.../get_...? If it is read-only, you can use the @property decorator (you can search for lots of examples in the machine.py codebase). That would cut down the size of this boiler-plate code a lot. It would also make the logic elsewhere more succinct and readable.

I have made these changes throughout my code. Thanks for the suggestion.

Unused function

Removed.

What about substring_before(start_index) + replacement + substring_after(start_index)?

Good suggestion. Done.

Maybe next_character_matches()? I think <noun>_<verb>(e)s() is cleaner than does_<noun>_<verb>(). It's shorter, it sounds more natural in if statements, and I don't think does_... is used elsewhere in machine.py.

Changed as part of the naming changes I mentioned earlier.

I also did a fair bit of minor refactoring to make things more consistent (that will be the majority of the changes in the commit).

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

Reviewed 1 of 39 files at r2, all commit messages.
Reviewable status: 2 of 49 files reviewed, 1 unresolved discussion


machine/corpora/analysis/text_segment.py line 25 at r2 (raw file):

        if self._index_in_verse != value._index_in_verse:
            return False
        if self._index_in_verse != value._index_in_verse:

I don't think you meant to check this twice. Maybe you meant to check one or more of the other non-TextSegment properties?

@Enkidu93
Copy link
Collaborator

Enkidu93 commented Jul 1, 2025

I'll try and give it a more thorough review soon - thank you for the refactor and cleanup! I'll also begin porting more of the code now. I'm sure Damien will have more comments 🤪.

@benjaminking
Copy link
Collaborator Author

I don't think you meant to check this twice. Maybe you meant to check one or more of the other non-TextSegment properties?

Yup, I meant to check _num_segments_in_verse. Fixed.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - my comments might trickle in now as I notice things. (Also, your replies are showing up in the wrong spot for me in Reviewable 🤔? Do you see that too?)

Reviewable status: 2 of 49 files reviewed, 2 unresolved discussions


machine/corpora/analysis/quote_convention.py line 60 at r3 (raw file):

        return True

    def get_name(self) -> str:

Looks like this method should be replaced with a property. .name is used in QuoteConvetionSet and get_name() elsewhere.


machine/corpora/analysis/quote_convention.py line 63 at r3 (raw file):

        return self.name

    def get_num_levels(self) -> int:

It's your call, but a property might be nice for this as well.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 49 files reviewed, 4 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/quote_convention.py line 73 at r3 (raw file):

    def get_expected_quotation_mark(self, depth: int, direction: QuotationMarkDirection) -> str:
        if depth > len(self.levels) or depth < 1:

Could use get_num_levels/num_levels


machine/corpora/analysis/quote_convention.py line 122 at r3 (raw file):

        return QuoteConvention(self.get_name() + "_normalized", [level.normalize() for level in self.levels])

    def print_summary(self) -> None:

Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of _get_summary_message, you could override __repr__ or __str__.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 49 files at r1, 37 of 39 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @benjaminking and @Enkidu93)


machine/corpora/analysis/quote_convention.py line 122 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of _get_summary_message, you could override __repr__ or __str__.

I agree.


machine/corpora/analysis/quotation_mark_tabulator.py line 76 at r3 (raw file):

        return 1 - (num_differences / num_total_quotation_marks)

    def print_summary(self) -> None:

Consider using __str__ or __rep__. If that is inappropriate, you can have the method return a string description instead of calling print directly.


machine/corpora/analysis/quotation_mark_finder.py line 13 at r3 (raw file):

class QuotationMarkFinder:
    quote_pattern = regex.compile(r"(\p{Quotation_Mark}|<<|>>|<|>)", regex.U)

This should be all caps and probably prefixed with an underscore.


machine/corpora/scripture_embed.py line 0 at r3 (raw file):
Is this module used anywhere?


machine/corpora/analysis/quotation_mark_resolution_settings.py line 28 at r3 (raw file):

    @abstractmethod
    def should_rely_on_paragraph_markers(self) -> bool: ...

This could be a property.


machine/corpora/analysis/quote_convention_set.py line 161 at r3 (raw file):

        return (best_quote_convention, best_similarity)

    def print_summary(self) -> None:

Consider using __str__ or __rep__ instead of direct print calls.


machine/corpora/analysis/quote_convention_detector.py line 18 at r3 (raw file):

@dataclass

Consider marking this dataclass as immutable.


machine/corpora/analysis/standard_quote_conventions.py line 4 at r3 (raw file):

from .quote_convention_set import QuoteConventionSet

standard_quote_conventions: QuoteConventionSet = QuoteConventionSet(

The naming convention for constants is all caps.


machine/corpora/analysis/quotation_mark_string_match.py line 16 at r3 (raw file):

    # extra stuff in the regex to handle Western Cham
    letter_pattern: Pattern = regex.compile(r"[\p{L}\U0001E200-\U0001E28F]", regex.U)

You should use all caps for constant names. I think this are only used internally, so they should be prefixed with an underscore.


machine/corpora/analysis/quote_convention.py line 6 at r3 (raw file):

from .quotation_mark_direction import QuotationMarkDirection

quote_normalization_map: Dict[str, str] = {

The naming convention for constants is all caps. If the constant is only used locally in the module, then you should prefix with an underscore.


machine/corpora/analysis/quote_convention.py line 22 at r3 (raw file):

@dataclass

If a dataclass is intended to be immutable, consider using @dataclass(frozen=True).


machine/corpora/analysis/__init__.py line 0 at r3 (raw file):
I think this package name isn't specific enough, unless you were thinking that other kinds of "analysis" classes might end up here in the future. What about punctuation_analysis?


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 104 at r3 (raw file):

class QuotationMarkCategorizer:
    apostrophe_pattern = regex.compile(r"[\'\u2019\u2018]", regex.U)

This should be all caps and prefixed with an underscore.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another batch of comments - wanted to get a chunk in so that they aren't all coming in at once.

Reviewed 1 of 39 files at r2.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/quote_convention_set.py line 25 at r3 (raw file):

        opening_quotation_marks: Set[str] = set()
        closing_quotation_marks: Set[str] = set()
        all_quotation_marks: Set[str] = set()

Since this is just the union of the other two, could we just set it as such downstream?


machine/corpora/analysis/quote_convention_set.py line 27 at r3 (raw file):

        all_quotation_marks: Set[str] = set()

        if len(self._conventions) > 0:

I don't think this if statement is needed


machine/corpora/analysis/quote_convention_set.py line 39 at r3 (raw file):

            if len(all_quotation_marks) > 0:
                self._opening_quotation_mark_regex: Pattern = regex.compile(
                    r"[" + "".join(sorted(list(opening_quotation_marks))) + "]"

Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?


machine/corpora/analysis/quote_convention_set.py line 49 at r3 (raw file):

        if len(opening_quotation_marks) == 0:
            self._opening_quotation_mark_regex = regex.compile(r"")

I don't know how computationally expensive it would be to just make these the defaults and avoid these if statements? Or would that not work in some cases?


machine/corpora/analysis/quote_convention_set.py line 55 at r3 (raw file):

            self._all_quotation_mark_regex = regex.compile(r"")

    def _create_quotation_mark_pair_map(self) -> None:

Could this function be combined with the above function without the combined function getting too hairy? Since they are called in the same place as a way of initializing everything and are iterating through the same collection in the same way, it would save a few lines and possibly be cleaner (?)


machine/corpora/analysis/quote_convention_set.py line 88 at r3 (raw file):

    def get_all_quote_convention_names(self) -> List[str]:
        return sorted([qc.name for qc in self._conventions])

More sorting - why is this needed?


machine/corpora/analysis/quotation_mark_string_match.py line 57 at r3 (raw file):

    @property
    def previous_character(self) -> Union[str, None]:
        if self._start_index == 0:

Use is_at_start_of_segment?


machine/corpora/analysis/quotation_mark_tabulator.py line 8 at r3 (raw file):

class QuotationMarkCounts:

I wonder if you could leverage python's Counter class to simplify this or remove it altogether?


machine/corpora/analysis/quotation_mark_tabulator.py line 10 at r3 (raw file):

class QuotationMarkCounts:
    def __init__(self):
        self._string_counts: Dict[str, int] = dict()

Could this name be more specific?


machine/corpora/analysis/quotation_mark_tabulator.py line 35 at r3 (raw file):

    def __init__(self):
        self.quotation_counts_by_depth_and_direction: dict[tuple[int, QuotationMarkDirection], QuotationMarkCounts] = (

Should this be made private, as it were?


machine/corpora/analysis/quotation_mark_tabulator.py line 43 at r3 (raw file):

            self._count_quotation_mark(quotation_mark)

    def _count_quotation_mark(self, quote: QuotationMarkMetadata) -> None:

Throughout the code, there seems to be inconsistent use of quote, quotationMark, and quotation in method names and variables (or at least I haven't figured out the pattern). It would be nice to make this consistent.


machine/corpora/analysis/quotation_mark_tabulator.py line 59 at r3 (raw file):

    def calculate_similarity(self, quote_convention: QuoteConvention) -> float:
        num_differences = 0

Could we qualify this name weighted_count... or something? The name suggests to me that it's an int but it's really a float.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 32 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/preliminary_quotation_analyzer.py line 43 at r3 (raw file):

    def count_word_initial_apostrophe(self, quotation_mark: str) -> None:
        if quotation_mark not in self._word_initial_occurrences:

This very same pattern of checking if a key exists then updating the value based on the previous value happens a lot in the code. It might be worth using python's defaultdict to avoid this. I have an extension in C# UpdateValue that I can use, I think, to a similar end.


machine/corpora/analysis/preliminary_quotation_analyzer.py line 76 at r3 (raw file):

        num_initial_marks: int = self._get_word_initial_occurrences(quotation_mark)
        num_total_marks: int = self._get_total_occurrences(quotation_mark)
        return num_total_marks > 0 and num_initial_marks / num_total_marks < 0.1

This class uses a lot of "magic numbers". I don't know if it would be possible to describe them and put them into constants??


machine/corpora/analysis/preliminary_quotation_analyzer.py line 103 at r3 (raw file):

        self._later_quotation_mark_counts: Dict[str, int] = dict()

    def record_earlier_quotation_mark(self, quotation_mark: str) -> None:

Above you use count...() for method names but here you use record...().


machine/corpora/analysis/preliminary_quotation_analyzer.py line 155 at r3 (raw file):

        self._group_quotation_marks(quotation_marks)

    def _group_quotation_marks(self, quotation_marks: List[QuotationMarkStringMatch]) -> None:

You could use functools.reduce or itertools.grouby to make this a one-liner. It looks like we use groupby in thot_smt_model_trainer.py.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another handful of comments. I leave it to Damien's discretion if he'd like to leave any of these unaddressed for now in order to expedite getting the code in. I tried to make requested changes that I thought were less important non-blocking comments. I'm nearly through porting the code itself (almost all tests left), so my trickle of comments should be running dry soon. Thank you for doing this & my apologies for all the minor requested changes!

Reviewed 3 of 49 files at r1, 16 of 39 files at r2, 1 of 2 files at r3.
Reviewable status: all files reviewed, 44 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/preliminary_quotation_analyzer.py line 273 at r3 (raw file):

    def __init__(self, quote_conventions: QuoteConventionSet):
        self._quote_conventions = quote_conventions

In some places QuoteConventionSet objects are called quoteConventionSets and in others quoteConventions. It would be nice if the variable naming were consistent.


machine/corpora/analysis/text_segment.py line 70 at r3 (raw file):

    # These setters need to be implemented outside the builder to avoid circular dependencies
    def set_previous_segment(self, previous_segment: "TextSegment") -> None:

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 26 at r3 (raw file):

    @property
    def current_depth(self) -> int:
        return self._current_depth + 1

This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1 in the calling code? (...with a comment as needed)


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 49 at r3 (raw file):

        if depth > len(self._quotation_stack):
            raise RuntimeError(
                "get_opening_quotation_mark_at_depth() was called with a depth greater than the quotation stack size."

I don't think you necessarily need to pass the name of the function since it would be available in a stack trace. You may, however, want to pass the stack size and depth.


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 56 at r3 (raw file):

        if not self.has_open_quotation_mark():
            raise RuntimeError(
                "get_deepest_opening_quotation_mark() was called when the stack of quotation marks was empty."

I don't think this error message is correct.


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 77 at r3 (raw file):

    @property
    def current_depth(self) -> int:

See above comment. I might be missing something: Why +1 above but not here?


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 130 at r3 (raw file):

            if quote_match._start_index > 0:
                return False
            if quote_match.quotation_mark != self._quotation_mark_resolver_state.get_opening_quotation_mark_at_depth(

I think this logic can be simplified by taking this check and moving it outside the if/else - something like:

if quote_match.quotation_mark != ...:
    return False

if not _quotation_continuer_state.continuer_has_been_observed():
    ...

return true

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 163 at r3 (raw file):

            if quote_match.quotation_mark != "»":
                return False
            if not self._settings.are_marks_a_valid_pair(

Same thing as above method - I think this can be moved outside the if/else unless one of the bool functions is changing the state somehow.


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 187 at r3 (raw file):

        self,
        quote_match: QuotationMarkStringMatch,
        previous_match: Union[QuotationMarkStringMatch, None],

previous_match and next_match are unused.


machine/corpora/analysis/quotation_mark_resolver.py line 13 at r3 (raw file):

    def __init__(self, settings: QuotationMarkResolutionSettings):
        self._settings = settings

Is a default constructor needed?


machine/corpora/analysis/quotation_mark_resolver.py line 20 at r3 (raw file):

    ) -> Generator[QuotationMarkMetadata, None, None]: ...

    def reset(self) -> None: ...

Why is this method not marked with abstractmethod?


machine/corpora/analysis/usfm_structure_extractor.py line 16 at r3 (raw file):

        self._reset()

    def _reset(self):

Why a separate private reset function? Could we just do this in __init__()?

@benjaminking
Copy link
Collaborator Author

Since there are a bunch of comments, I will just skip the ones where my response would just be "Done."

Is this module used anywhere?

It must have been an artefact left over from rebasing with a long history. I deleted it now.

I think this package name isn't specific enough, unless you were thinking that other kinds of "analysis" classes might end up here in the future. What about punctuation_analysis?

I think that early on, John had thought that we should group all the things that analyze USFM documents together, but I don't think we have any plans for other types of analysis right now, so I went ahead and renamed it to punctuation_analysis.

Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?

Yes, it's for testing, since list(some_set) doesn't put the elements in a predictable order.

Could this function be combined with the above function without the combined function getting too hairy? Since they are called in the same place as a way of initializing everything and are iterating through the same collection in the same way, it would save a few lines and possibly be cleaner (?)

I think I prefer to keep these separate. They work similarly and could be combined, but don't necessarily need to work the same way. I think we get better maintainability keeping them separate, in case the requirements for one of them change. And the performance difference should be negligible.

Throughout the code, there seems to be inconsistent use of quote, quotationMark, and quotation in method names and variables (or at least I haven't figured out the pattern). It would be nice to make this consistent.

My intention was to use "quote" to refer to "the words that a person is saying," while "quotation mark" refers to a punctuation mark that indicates the presence of a quote. I've gone through everything to make it more consistent. A couple notes about the terminology:

  • A "quote convention" is the set of rules used to mark where quotes start and end
  • A "quote introducer" is combination of punctuation and whitespace that indicates that a quote will follow. For example, the comma and space in Bob said, "<quote>" or the colon and space in Bob said: "<quote>"
  • A "quote continuer" is used to indicate that a quote is being continued across paragraphs. In English, you don't use any closing quotation marks at the end of the first paragraph, but use new opening quotation marks at the start of the second paragraph. These quotation marks are called "quote continuers."
  • I think everything else in the code should refer to "quotation marks."

This class uses a lot of "magic numbers". I don't know if it would be possible to describe them and put them into constants??

I have refactored them into constants. Hopefully the names make sense -- it's hard to come up with concise names for some of these things.

You could use functools.reduce or itertools.grouby to make this a one-liner. It looks like we use groupby in thot_smt_model_trainer.py.

It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using defaultdict.

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?

I went ahead and gave them property setters, so they should behave as if they're public.

This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1 in the calling code? (...with a comment as needed)

This is a good call. I changed current_depth to be defined as the length of the stack for both classes.

I don't think this error message is correct.

I'm not sure I see why it's incorrect. Can you expound?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Great. I'll look back through the comments and try to confirm that all have been addressed. If you could please reply in Reviewable within the thread specific to that comment, it would make the back and forth a lot easier as we continue.

I went ahead and ported the changes you've made so far.

This should be my last round of new comments 🤪 🤞.

It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using defaultdict.
itertools is in the poetry lock and groupby is used in thot_smt_model_trainer.py - what do you mean by 'project'? It's not a big difference if it's left as-is, but it would simplify the function.

I'm not sure I see why it's incorrect. Can you expound?
I see now. I misread it and thought it had been copy-pasted from the previous error message. I like the new error message you came up with better 👍.

Reviewed 1 of 49 files at r1.
Reviewable status: 3 of 49 files reviewed, 57 unresolved discussions (waiting on @benjaminking and @ddaspit)


machine/corpora/analysis/text_segment.py line 70 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?

Since there's no custom logic in the getter or setter, I'd just ditch the properties and replace e.g. _previous_segment with previous_segment.


machine/corpora/quotation_mark_update_first_pass.py line 37 at r4 (raw file):

    ) -> bool:
        target_marks_by_source_marks: Dict[str, Set[str]] = {}
        for level in range(1, source_quote_convention.num_levels + 1):

Some places depth is used and elsewhere level. Is there a true difference in meaning?


machine/corpora/quotation_mark_update_first_pass.py line 71 at r4 (raw file):

        return self._choose_best_strategy_based_on_observed_issues(self._quotation_mark_resolver.get_issues())

    def _choose_best_strategy_based_on_observed_issues(self, issues) -> QuotationMarkUpdateStrategy:

Type-hinting for issues


machine/corpora/punctuation_analysis/text_segment.py line 78 at r4 (raw file):

        self._next_segment = next_segment

    def set_index_in_verse(self, index_in_verse: int) -> None:

Same for index_in_verse as the segment fields.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 69 at r4 (raw file):

    def _apply_fallback_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock:
        for element in block._elements:

Use Elements


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 74 at r4 (raw file):

    def _apply_standard_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock:
        for element in block._elements:

Use Elements


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 111 at r4 (raw file):

    def _create_text_segment(self, token: UsfmToken) -> Union[TextSegment, None]:
        self._next_scripture_text_segment_builder.set_usfm_token(token)

Might be a little cleaner to do something like:

text_segment_to_return = null
if token.text != null:
    ...
return text_segment_to_return

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 132 at r4 (raw file):

            if scripture_ref.chapter_num != self._current_chapter_number:
                self._current_chapter_number = scripture_ref.chapter_num
                self._start_new_chapter(self._current_chapter_number)

No need to pass bare field as parameter


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 147 at r4 (raw file):

            ):
                self._current_verse_number = scripture_ref.verse_num
                self._start_new_verse(self._current_verse_number)

No need to pass bare field as parameter


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 149 at r4 (raw file):

                self._start_new_verse(self._current_verse_number)

    def _start_new_verse(self, new_chapter_number: int) -> None:

new_chapter_number is unused


machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py line 93 at r4 (raw file):

        self._quote_continuer_mark_stack.append(quotation_mark)
        self._continuer_style = quote_continuer_style
        if len(self._quote_continuer_mark_stack) == len(quotation_mark_resolver_state._quotation_stack):

Should you use current_depth + 1?


machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py line 168 at r4 (raw file):

                return False

            # Check the next quotation mark match, since quote continuers must appear consecutively

Very minor, but some comments are starting with a capital letter - others are lowercase


machine/corpora/punctuation_analysis/quote_convention_detector.py line 53 at r4 (raw file):

        self._quotation_mark_tabulator.tabulate(resolved_quotation_marks)

    def detect_quote_convention(self, print_summary: bool) -> Union[QuoteConventionAnalysis, None]:

Throughout the code: Maybe Optional[Type] rather than Union[Type, None]?


machine/corpora/quotation_mark_update_resolution_settings.py line 16 at r4 (raw file):

        self._source_quote_convention = source_quote_convention
        self._quote_convention_singleton_set = QuoteConventionSet([self._source_quote_convention])
        self._target_quote_convention = target_quote_convention

This is unused.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 46 of 46 files at r4, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @benjaminking and @Enkidu93)


machine/corpora/punctuation_analysis/quote_convention_detector.py line 53 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Throughout the code: Maybe Optional[Type] rather than Union[Type, None]?

For Python 3.9 and earlier Optional[Type] seems to be the preferred syntax. Once we move to 3.10+, Type | None seems to be preferred.


tests/corpora/test_fallback_quotation_mark_resolver.py line 15 at r4 (raw file):

def test_reset():
    english_quote_convention = standard_quote_conventions.STANDARD_QUOTE_CONVENTIONS.get_quote_convention_by_name(

This constant should probably be exported from the machine.corpora.punctuation_analysis package. It would make using it nicer.

@benjaminking
Copy link
Collaborator Author

A few higher-level notes:

  • I replied to this round of comments in Reviewable. Would it be helpful for me to go and reply to all the older comments there?
  • I added a new test that demonstrates the full process for performing quotation denormalization (since it involves a couple steps)
  • I have rebased to the latest revision of the main branch

And concerning itertools and functools, I hadn't realized they were part of the standard library -- my mistake. But I did try out refactoring the function in question to use reduce, and while it was technically reduced to one line, formatting turned it into 8 lines, so I decided to leave it as it was. (and it turns out that the functionality needed to make groupby work in this scenario is only available for later Python versions).

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 21 of 64 files reviewed, 46 unresolved discussions (waiting on @ddaspit and @Enkidu93)


machine/corpora/quotation_mark_update_first_pass.py line 37 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Some places depth is used and elsewhere level. Is there a true difference in meaning?

I've tried to use level in the context of rules and depth in the context of observed quotation marks. Hopefully this commit clarifies that a bit be


machine/corpora/quotation_mark_update_first_pass.py line 71 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Type-hinting for issues

Done.


machine/corpora/quotation_mark_update_resolution_settings.py line 16 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This is unused.

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 69 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Use Elements

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 74 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Use Elements

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 132 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

No need to pass bare field as parameter

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 147 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

No need to pass bare field as parameter

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 149 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

new_chapter_number is unused

Done.


machine/corpora/scripture_embed.py line at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Is this module used anywhere?

It seems to have been a relic of rebasing. I've removed it.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjaminking Something weird happened with your branch. There seems to be some recent commits from the main branch that have been added to your branch. The preferred way to catch up with the main branch is to rebase on top of the main branch. As a last resort, you can merge the main branch into your branch.

Reviewed 4 of 43 files at r5, all commit messages.
Reviewable status: 25 of 64 files reviewed, 46 unresolved discussions (waiting on @benjaminking and @Enkidu93)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants