-
-
Notifications
You must be signed in to change notification settings - Fork 3
Quotation denormalization #203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…translations. These updates make it happen.
Pass marker type (embed, style) to update block
5afb42c
to
defe7dc
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #203 +/- ##
==========================================
+ Coverage 88.91% 90.99% +2.07%
==========================================
Files 282 330 +48
Lines 17056 21509 +4453
==========================================
+ Hits 15166 19572 +4406
- Misses 1890 1937 +47 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting to see this, Ben! I left just a few comments regarding function names, getters/setters, and unused functions. These are representative of similar situations elsewhere in the code. I think these changes could significantly reduce the amount of boiler-plate code necessary as well as improve readability - both of which will make porting much faster. Thank you for your hard work and my apologies for only now seeing that you'd put this PR in. I usually get notified 🤔.
Reviewed 6 of 49 files at r1, all commit messages.
Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions
machine/corpora/analysis/quotation_mark_metadata.py
line 47 at r1 (raw file):
return self.text_segment def get_start_index(self) -> int:
Unused function
machine/corpora/analysis/quotation_mark_tabulator.py
line 50 at r1 (raw file):
self.quotation_counts_by_depth_and_direction[key].count_quotation_mark(quotation_mark) def _has_depth_and_direction_been_observed(self, depth: int, direction: QuotationMarkDirection) -> bool:
Would _have_depth_and_direction...
be better?
machine/corpora/analysis/text_segment.py
line 29 at r1 (raw file):
if self.immediate_preceding_marker != value.immediate_preceding_marker: return False return True
Why not also check the other properties?
machine/corpora/analysis/text_segment.py
line 31 at r1 (raw file):
return True def get_text(self) -> str:
I don't think many of these helper methods are necessary. In cases where the field is not read-only and there's no other logic in the getters and setters, why not just use the bare variable - i.e., text_segment.previous_segment
rather than set_...
/get_...
? If it is read-only, you can use the @property
decorator (you can search for lots of examples in the machine.py codebase). That would cut down the size of this boiler-plate code a lot. It would also make the logic elsewhere more succinct and readable.
machine/corpora/analysis/text_segment.py
line 43 at r1 (raw file):
return self.text[index:] def get_immediate_preceding_marker_type(self) -> UsfmMarkerType:
Unused function
machine/corpora/analysis/text_segment.py
line 62 at r1 (raw file):
def replace_substring(self, start_index: int, end_index: int, replacement: str) -> None: self.text = self.text[:start_index] + replacement + self.text[end_index:]
What about substring_before(start_index) + replacement + substring_after(start_index)
?
machine/corpora/analysis/quotation_mark_string_match.py
line 48 at r1 (raw file):
return regex_pattern.search(self.get_quotation_mark()) is not None def does_next_character_match(self, regex_pattern: regex.Pattern) -> bool:
Maybe next_character_matches()
? I think <noun>_<verb>(e)s()
is cleaner than does_<noun>_<verb>()
. It's shorter, it sounds more natural in if
statements, and I don't think does_...
is used elsewhere in machine.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions (waiting on @benjaminking)
machine/corpora/analysis/text_segment.py
line 62 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
What about
substring_before(start_index) + replacement + substring_after(start_index)
?
*Sorry, end_index
, but you get the idea :)
Removed
I have changed this, along with several other similarly-named functions.
In this case, the object has references to previous and next instances (a linked list basically), so including it in the equality function would lead to infinite recursion.
I have made these changes throughout my code. Thanks for the suggestion.
Removed.
Good suggestion. Done.
Changed as part of the naming changes I mentioned earlier. I also did a fair bit of minor refactoring to make things more consistent (that will be the majority of the changes in the commit). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
Reviewed 1 of 39 files at r2, all commit messages.
Reviewable status: 2 of 49 files reviewed, 1 unresolved discussion
machine/corpora/analysis/text_segment.py
line 25 at r2 (raw file):
if self._index_in_verse != value._index_in_verse: return False if self._index_in_verse != value._index_in_verse:
I don't think you meant to check this twice. Maybe you meant to check one or more of the other non-TextSegment
properties?
I'll try and give it a more thorough review soon - thank you for the refactor and cleanup! I'll also begin porting more of the code now. I'm sure Damien will have more comments 🤪. |
Yup, I meant to check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - my comments might trickle in now as I notice things. (Also, your replies are showing up in the wrong spot for me in Reviewable 🤔? Do you see that too?)
Reviewable status: 2 of 49 files reviewed, 2 unresolved discussions
machine/corpora/analysis/quote_convention.py
line 60 at r3 (raw file):
return True def get_name(self) -> str:
Looks like this method should be replaced with a property. .name
is used in QuoteConvetionSet
and get_name()
elsewhere.
machine/corpora/analysis/quote_convention.py
line 63 at r3 (raw file):
return self.name def get_num_levels(self) -> int:
It's your call, but a property might be nice for this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 49 files reviewed, 4 unresolved discussions (waiting on @benjaminking)
machine/corpora/analysis/quote_convention.py
line 73 at r3 (raw file):
def get_expected_quotation_mark(self, depth: int, direction: QuotationMarkDirection) -> str: if depth > len(self.levels) or depth < 1:
Could use get_num_levels
/num_levels
machine/corpora/analysis/quote_convention.py
line 122 at r3 (raw file):
return QuoteConvention(self.get_name() + "_normalized", [level.normalize() for level in self.levels]) def print_summary(self) -> None:
Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of _get_summary_message
, you could override __repr__
or __str__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 49 files at r1, 37 of 39 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @benjaminking and @Enkidu93)
machine/corpora/analysis/quote_convention.py
line 122 at r3 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of
_get_summary_message
, you could override__repr__
or__str__
.
I agree.
machine/corpora/analysis/quotation_mark_tabulator.py
line 76 at r3 (raw file):
return 1 - (num_differences / num_total_quotation_marks) def print_summary(self) -> None:
Consider using __str__
or __rep__
. If that is inappropriate, you can have the method return a string description instead of calling print
directly.
machine/corpora/analysis/quotation_mark_finder.py
line 13 at r3 (raw file):
class QuotationMarkFinder: quote_pattern = regex.compile(r"(\p{Quotation_Mark}|<<|>>|<|>)", regex.U)
This should be all caps and probably prefixed with an underscore.
machine/corpora/scripture_embed.py
line 0 at r3 (raw file):
Is this module used anywhere?
machine/corpora/analysis/quotation_mark_resolution_settings.py
line 28 at r3 (raw file):
@abstractmethod def should_rely_on_paragraph_markers(self) -> bool: ...
This could be a property.
machine/corpora/analysis/quote_convention_set.py
line 161 at r3 (raw file):
return (best_quote_convention, best_similarity) def print_summary(self) -> None:
Consider using __str__
or __rep__
instead of direct print
calls.
machine/corpora/analysis/quote_convention_detector.py
line 18 at r3 (raw file):
@dataclass
Consider marking this dataclass as immutable.
machine/corpora/analysis/standard_quote_conventions.py
line 4 at r3 (raw file):
from .quote_convention_set import QuoteConventionSet standard_quote_conventions: QuoteConventionSet = QuoteConventionSet(
The naming convention for constants is all caps.
machine/corpora/analysis/quotation_mark_string_match.py
line 16 at r3 (raw file):
# extra stuff in the regex to handle Western Cham letter_pattern: Pattern = regex.compile(r"[\p{L}\U0001E200-\U0001E28F]", regex.U)
You should use all caps for constant names. I think this are only used internally, so they should be prefixed with an underscore.
machine/corpora/analysis/quote_convention.py
line 6 at r3 (raw file):
from .quotation_mark_direction import QuotationMarkDirection quote_normalization_map: Dict[str, str] = {
The naming convention for constants is all caps. If the constant is only used locally in the module, then you should prefix with an underscore.
machine/corpora/analysis/quote_convention.py
line 22 at r3 (raw file):
@dataclass
If a dataclass is intended to be immutable, consider using @dataclass(frozen=True)
.
machine/corpora/analysis/__init__.py
line 0 at r3 (raw file):
I think this package name isn't specific enough, unless you were thinking that other kinds of "analysis" classes might end up here in the future. What about punctuation_analysis
?
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 104 at r3 (raw file):
class QuotationMarkCategorizer: apostrophe_pattern = regex.compile(r"[\'\u2019\u2018]", regex.U)
This should be all caps and prefixed with an underscore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another batch of comments - wanted to get a chunk in so that they aren't all coming in at once.
Reviewed 1 of 39 files at r2.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @benjaminking)
machine/corpora/analysis/quote_convention_set.py
line 25 at r3 (raw file):
opening_quotation_marks: Set[str] = set() closing_quotation_marks: Set[str] = set() all_quotation_marks: Set[str] = set()
Since this is just the union of the other two, could we just set it as such downstream?
machine/corpora/analysis/quote_convention_set.py
line 27 at r3 (raw file):
all_quotation_marks: Set[str] = set() if len(self._conventions) > 0:
I don't think this if statement is needed
machine/corpora/analysis/quote_convention_set.py
line 39 at r3 (raw file):
if len(all_quotation_marks) > 0: self._opening_quotation_mark_regex: Pattern = regex.compile( r"[" + "".join(sorted(list(opening_quotation_marks))) + "]"
Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?
machine/corpora/analysis/quote_convention_set.py
line 49 at r3 (raw file):
if len(opening_quotation_marks) == 0: self._opening_quotation_mark_regex = regex.compile(r"")
I don't know how computationally expensive it would be to just make these the defaults and avoid these if statements? Or would that not work in some cases?
machine/corpora/analysis/quote_convention_set.py
line 55 at r3 (raw file):
self._all_quotation_mark_regex = regex.compile(r"") def _create_quotation_mark_pair_map(self) -> None:
Could this function be combined with the above function without the combined function getting too hairy? Since they are called in the same place as a way of initializing everything and are iterating through the same collection in the same way, it would save a few lines and possibly be cleaner (?)
machine/corpora/analysis/quote_convention_set.py
line 88 at r3 (raw file):
def get_all_quote_convention_names(self) -> List[str]: return sorted([qc.name for qc in self._conventions])
More sorting - why is this needed?
machine/corpora/analysis/quotation_mark_string_match.py
line 57 at r3 (raw file):
@property def previous_character(self) -> Union[str, None]: if self._start_index == 0:
Use is_at_start_of_segment
?
machine/corpora/analysis/quotation_mark_tabulator.py
line 8 at r3 (raw file):
class QuotationMarkCounts:
I wonder if you could leverage python's Counter
class to simplify this or remove it altogether?
machine/corpora/analysis/quotation_mark_tabulator.py
line 10 at r3 (raw file):
class QuotationMarkCounts: def __init__(self): self._string_counts: Dict[str, int] = dict()
Could this name be more specific?
machine/corpora/analysis/quotation_mark_tabulator.py
line 35 at r3 (raw file):
def __init__(self): self.quotation_counts_by_depth_and_direction: dict[tuple[int, QuotationMarkDirection], QuotationMarkCounts] = (
Should this be made private, as it were?
machine/corpora/analysis/quotation_mark_tabulator.py
line 43 at r3 (raw file):
self._count_quotation_mark(quotation_mark) def _count_quotation_mark(self, quote: QuotationMarkMetadata) -> None:
Throughout the code, there seems to be inconsistent use of quote
, quotationMark
, and quotation
in method names and variables (or at least I haven't figured out the pattern). It would be nice to make this consistent.
machine/corpora/analysis/quotation_mark_tabulator.py
line 59 at r3 (raw file):
def calculate_similarity(self, quote_convention: QuoteConvention) -> float: num_differences = 0
Could we qualify this name weighted_count...
or something? The name suggests to me that it's an int but it's really a float.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 32 unresolved discussions (waiting on @benjaminking)
machine/corpora/analysis/preliminary_quotation_analyzer.py
line 43 at r3 (raw file):
def count_word_initial_apostrophe(self, quotation_mark: str) -> None: if quotation_mark not in self._word_initial_occurrences:
This very same pattern of checking if a key exists then updating the value based on the previous value happens a lot in the code. It might be worth using python's defaultdict
to avoid this. I have an extension in C# UpdateValue
that I can use, I think, to a similar end.
machine/corpora/analysis/preliminary_quotation_analyzer.py
line 76 at r3 (raw file):
num_initial_marks: int = self._get_word_initial_occurrences(quotation_mark) num_total_marks: int = self._get_total_occurrences(quotation_mark) return num_total_marks > 0 and num_initial_marks / num_total_marks < 0.1
This class uses a lot of "magic numbers". I don't know if it would be possible to describe them and put them into constants??
machine/corpora/analysis/preliminary_quotation_analyzer.py
line 103 at r3 (raw file):
self._later_quotation_mark_counts: Dict[str, int] = dict() def record_earlier_quotation_mark(self, quotation_mark: str) -> None:
Above you use count...()
for method names but here you use record...()
.
machine/corpora/analysis/preliminary_quotation_analyzer.py
line 155 at r3 (raw file):
self._group_quotation_marks(quotation_marks) def _group_quotation_marks(self, quotation_marks: List[QuotationMarkStringMatch]) -> None:
You could use functools.reduce
or itertools.grouby
to make this a one-liner. It looks like we use groupby
in thot_smt_model_trainer.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another handful of comments. I leave it to Damien's discretion if he'd like to leave any of these unaddressed for now in order to expedite getting the code in. I tried to make requested changes that I thought were less important non-blocking comments. I'm nearly through porting the code itself (almost all tests left), so my trickle of comments should be running dry soon. Thank you for doing this & my apologies for all the minor requested changes!
Reviewed 3 of 49 files at r1, 16 of 39 files at r2, 1 of 2 files at r3.
Reviewable status: all files reviewed, 44 unresolved discussions (waiting on @benjaminking)
machine/corpora/analysis/preliminary_quotation_analyzer.py
line 273 at r3 (raw file):
def __init__(self, quote_conventions: QuoteConventionSet): self._quote_conventions = quote_conventions
In some places QuoteConventionSet
objects are called quoteConventionSet
s and in others quoteConventions
. It would be nice if the variable naming were consistent.
machine/corpora/analysis/text_segment.py
line 70 at r3 (raw file):
# These setters need to be implemented outside the builder to avoid circular dependencies def set_previous_segment(self, previous_segment: "TextSegment") -> None:
Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 26 at r3 (raw file):
@property def current_depth(self) -> int: return self._current_depth + 1
This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1
in the calling code? (...with a comment as needed)
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 49 at r3 (raw file):
if depth > len(self._quotation_stack): raise RuntimeError( "get_opening_quotation_mark_at_depth() was called with a depth greater than the quotation stack size."
I don't think you necessarily need to pass the name of the function since it would be available in a stack trace. You may, however, want to pass the stack size and depth.
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 56 at r3 (raw file):
if not self.has_open_quotation_mark(): raise RuntimeError( "get_deepest_opening_quotation_mark() was called when the stack of quotation marks was empty."
I don't think this error message is correct.
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 77 at r3 (raw file):
@property def current_depth(self) -> int:
See above comment. I might be missing something: Why +1
above but not here?
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 130 at r3 (raw file):
if quote_match._start_index > 0: return False if quote_match.quotation_mark != self._quotation_mark_resolver_state.get_opening_quotation_mark_at_depth(
I think this logic can be simplified by taking this check and moving it outside the if/else
- something like:
if quote_match.quotation_mark != ...:
return False
if not _quotation_continuer_state.continuer_has_been_observed():
...
return true
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 163 at r3 (raw file):
if quote_match.quotation_mark != "»": return False if not self._settings.are_marks_a_valid_pair(
Same thing as above method - I think this can be moved outside the if/else
unless one of the bool functions is changing the state somehow.
machine/corpora/analysis/depth_based_quotation_mark_resolver.py
line 187 at r3 (raw file):
self, quote_match: QuotationMarkStringMatch, previous_match: Union[QuotationMarkStringMatch, None],
previous_match
and next_match
are unused.
machine/corpora/analysis/quotation_mark_resolver.py
line 13 at r3 (raw file):
def __init__(self, settings: QuotationMarkResolutionSettings): self._settings = settings
Is a default constructor needed?
machine/corpora/analysis/quotation_mark_resolver.py
line 20 at r3 (raw file):
) -> Generator[QuotationMarkMetadata, None, None]: ... def reset(self) -> None: ...
Why is this method not marked with abstractmethod
?
machine/corpora/analysis/usfm_structure_extractor.py
line 16 at r3 (raw file):
self._reset() def _reset(self):
Why a separate private reset function? Could we just do this in __init__()
?
Since there are a bunch of comments, I will just skip the ones where my response would just be "Done."
It must have been an artefact left over from rebasing with a long history. I deleted it now.
I think that early on, John had thought that we should group all the things that analyze USFM documents together, but I don't think we have any plans for other types of analysis right now, so I went ahead and renamed it to
Yes, it's for testing, since
I think I prefer to keep these separate. They work similarly and could be combined, but don't necessarily need to work the same way. I think we get better maintainability keeping them separate, in case the requirements for one of them change. And the performance difference should be negligible.
My intention was to use "quote" to refer to "the words that a person is saying," while "quotation mark" refers to a punctuation mark that indicates the presence of a quote. I've gone through everything to make it more consistent. A couple notes about the terminology:
I have refactored them into constants. Hopefully the names make sense -- it's hard to come up with concise names for some of these things.
It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using
I went ahead and gave them
This is a good call. I changed
I'm not sure I see why it's incorrect. Can you expound? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Great. I'll look back through the comments and try to confirm that all have been addressed. If you could please reply in Reviewable within the thread specific to that comment, it would make the back and forth a lot easier as we continue.
I went ahead and ported the changes you've made so far.
This should be my last round of new comments 🤪 🤞.
It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using
defaultdict
.
itertools
is in the poetry lock andgroupby
is used inthot_smt_model_trainer.py
- what do you mean by 'project'? It's not a big difference if it's left as-is, but it would simplify the function.
I'm not sure I see why it's incorrect. Can you expound?
I see now. I misread it and thought it had been copy-pasted from the previous error message. I like the new error message you came up with better 👍.
Reviewed 1 of 49 files at r1.
Reviewable status: 3 of 49 files reviewed, 57 unresolved discussions (waiting on @benjaminking and @ddaspit)
machine/corpora/analysis/text_segment.py
line 70 at r3 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?
Since there's no custom logic in the getter or setter, I'd just ditch the properties and replace e.g. _previous_segment
with previous_segment
.
machine/corpora/quotation_mark_update_first_pass.py
line 37 at r4 (raw file):
) -> bool: target_marks_by_source_marks: Dict[str, Set[str]] = {} for level in range(1, source_quote_convention.num_levels + 1):
Some places depth
is used and elsewhere level
. Is there a true difference in meaning?
machine/corpora/quotation_mark_update_first_pass.py
line 71 at r4 (raw file):
return self._choose_best_strategy_based_on_observed_issues(self._quotation_mark_resolver.get_issues()) def _choose_best_strategy_based_on_observed_issues(self, issues) -> QuotationMarkUpdateStrategy:
Type-hinting for issues
machine/corpora/punctuation_analysis/text_segment.py
line 78 at r4 (raw file):
self._next_segment = next_segment def set_index_in_verse(self, index_in_verse: int) -> None:
Same for index_in_verse
as the segment fields.
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 69 at r4 (raw file):
def _apply_fallback_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock: for element in block._elements:
Use Elements
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 74 at r4 (raw file):
def _apply_standard_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock: for element in block._elements:
Use Elements
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 111 at r4 (raw file):
def _create_text_segment(self, token: UsfmToken) -> Union[TextSegment, None]: self._next_scripture_text_segment_builder.set_usfm_token(token)
Might be a little cleaner to do something like:
text_segment_to_return = null
if token.text != null:
...
return text_segment_to_return
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 132 at r4 (raw file):
if scripture_ref.chapter_num != self._current_chapter_number: self._current_chapter_number = scripture_ref.chapter_num self._start_new_chapter(self._current_chapter_number)
No need to pass bare field as parameter
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 147 at r4 (raw file):
): self._current_verse_number = scripture_ref.verse_num self._start_new_verse(self._current_verse_number)
No need to pass bare field as parameter
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 149 at r4 (raw file):
self._start_new_verse(self._current_verse_number) def _start_new_verse(self, new_chapter_number: int) -> None:
new_chapter_number
is unused
machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py
line 93 at r4 (raw file):
self._quote_continuer_mark_stack.append(quotation_mark) self._continuer_style = quote_continuer_style if len(self._quote_continuer_mark_stack) == len(quotation_mark_resolver_state._quotation_stack):
Should you use current_depth + 1
?
machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py
line 168 at r4 (raw file):
return False # Check the next quotation mark match, since quote continuers must appear consecutively
Very minor, but some comments are starting with a capital letter - others are lowercase
machine/corpora/punctuation_analysis/quote_convention_detector.py
line 53 at r4 (raw file):
self._quotation_mark_tabulator.tabulate(resolved_quotation_marks) def detect_quote_convention(self, print_summary: bool) -> Union[QuoteConventionAnalysis, None]:
Throughout the code: Maybe Optional[Type]
rather than Union[Type, None]
?
machine/corpora/quotation_mark_update_resolution_settings.py
line 16 at r4 (raw file):
self._source_quote_convention = source_quote_convention self._quote_convention_singleton_set = QuoteConventionSet([self._source_quote_convention]) self._target_quote_convention = target_quote_convention
This is unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 46 of 46 files at r4, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @benjaminking and @Enkidu93)
machine/corpora/punctuation_analysis/quote_convention_detector.py
line 53 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Throughout the code: Maybe
Optional[Type]
rather thanUnion[Type, None]
?
For Python 3.9 and earlier Optional[Type]
seems to be the preferred syntax. Once we move to 3.10+, Type | None
seems to be preferred.
tests/corpora/test_fallback_quotation_mark_resolver.py
line 15 at r4 (raw file):
def test_reset(): english_quote_convention = standard_quote_conventions.STANDARD_QUOTE_CONVENTIONS.get_quote_convention_by_name(
This constant should probably be exported from the machine.corpora.punctuation_analysis
package. It would make using it nicer.
… to headers when stripping paragraphs (#207)
A few higher-level notes:
And concerning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 21 of 64 files reviewed, 46 unresolved discussions (waiting on @ddaspit and @Enkidu93)
machine/corpora/quotation_mark_update_first_pass.py
line 37 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Some places
depth
is used and elsewherelevel
. Is there a true difference in meaning?
I've tried to use level
in the context of rules and depth
in the context of observed quotation marks. Hopefully this commit clarifies that a bit be
machine/corpora/quotation_mark_update_first_pass.py
line 71 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Type-hinting for
issues
Done.
machine/corpora/quotation_mark_update_resolution_settings.py
line 16 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
This is unused.
Done.
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 69 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Use
Elements
Done.
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 74 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Use
Elements
Done.
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 132 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
No need to pass bare field as parameter
Done.
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 147 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
No need to pass bare field as parameter
Done.
machine/corpora/quote_convention_changing_usfm_update_block_handler.py
line 149 at r4 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
new_chapter_number
is unused
Done.
machine/corpora/scripture_embed.py
line at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Is this module used anywhere?
It seems to have been a relic of rebasing. I've removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benjaminking Something weird happened with your branch. There seems to be some recent commits from the main branch that have been added to your branch. The preferred way to catch up with the main branch is to rebase on top of the main branch. As a last resort, you can merge the main branch into your branch.
Reviewed 4 of 43 files at r5, all commit messages.
Reviewable status: 25 of 64 files reviewed, 46 unresolved discussions (waiting on @benjaminking and @Enkidu93)
I am creating a PR for quotation denormalization even though it still needs to be rebased. Eli had thought it would be helpful to go through it and make some comments, even if there were some minor changes still coming.
This change is