Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support human readable slugs for non-English locales #9892

Merged
merged 2 commits into from
Feb 6, 2025

Conversation

ChrisBAshton
Copy link
Contributor

@ChrisBAshton ChrisBAshton commented Feb 5, 2025

This PR:

  • No longer sets slug to nil when the locale is non-English. Every locale is considered "sluggable".
  • This also means we no longer need the sluggable_locale?, slug_eligible? methods and ASCII checks. (The latter has caused a number of Zendesk tickets, whereby even English HTML Attachments have been given a content-ID based URL, because they contain a non-ASCII character like a fancy single quote).
  • sluggable_string doesn't appear to have been in use, so it was deleted. (sluggable_string is used in the Document class: friendly_id :sluggable_string. But the Attachment class uses friendly_id :title, so overriding the sluggable_string method had no effect).
  • Adds a test for friendly_id's native "uniqueness" feature (which we've opted into via use: :scoped, scope: :attachable on the Attachment class), so in theory we should be protected from clashes in the attachment slug namespace. Therefore the issue alluded to in Bump friendly #5321 (which reverts the last time this was attempted) should not happen again.
  • Adds a basic safeguard that the slug-generation results in a slug that contains at least 4 a-z or A-Z characters. This is to protect against generating a slug like -, which could happen from sluggifying a title like {chinese}-{chinese}.
  • That said, if we somehow DO get into a state whereby an attachment has what we'd call an invalid slug (<4 azAZ chars), we don't want to fix it if the attachment is already live. So we check for safely_resluggable?, introduced in Re-slug HTML attachments if they are unpublished #5942.

Result:

  • Most HTML attachments should now have human-readable slugs. Only languages that are non-ASCII and cannot be converted to ASCII, such as Chinese, will continue to fall back to the content-ID based URLs.

Next steps:

  • Now that all Whitehall tables use the UTF8MB4 charset (Convert all tables to use UTF8MB4 charset #9767), we could potentially use CJK character sets natively in slugs - but it may not always show nicely in the browser. Compare https://example.com/你好 with https://example.com/%E4%BD%A0%E5%A5%BD.
  • The option we'd likely want to go with is conversion to their Pinyin representations, e.g. https://example.com/ni-hao, probably using the pinyin or ruby-pinyin gem.

Trello: https://trello.com/c/YGsEG9YV/3418-improve-html-attachment-slug-creation

⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

Follow these steps if you are doing a Rails upgrade.

@ChrisBAshton ChrisBAshton force-pushed the html-attachment-slugs branch from 5308c32 to 210b738 Compare February 5, 2025 07:47
@ChrisBAshton ChrisBAshton marked this pull request as ready for review February 5, 2025 07:49
@ChrisBAshton ChrisBAshton force-pushed the html-attachment-slugs branch from 210b738 to 47cdbf4 Compare February 5, 2025 09:19
assert_equal "foo", attachment.slug
end

# This behaviour is built into the friendly_id gem
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: I did spot in the docs:

Previous versions of FriendlyId appended a numeric sequence to make slugs unique, but this was removed to simplify using FriendlyId in concurrent code.

So I'd expect this test to need an update when we update finder_id. I did try updating finder_id locally but got Bundler attempted to update friendly_id but its version stayed the same and it wasn't immediately clear why.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting 🤔

Copy link
Contributor

@ryanb-gds ryanb-gds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have questions. Nice work though!

app/models/html_attachment.rb Outdated Show resolved Hide resolved
app/models/html_attachment.rb Outdated Show resolved Hide resolved
app/models/html_attachment.rb Show resolved Hide resolved
This PR:

- No longer sets `slug` to `nil` when the locale is non-English.
  Every locale is considered "sluggable".
- This also means we no longer need the `sluggable_locale?`,
  `slug_eligible?` methods and ASCII checks. (The latter has
  caused a number of Zendesk tickets, whereby even English HTML
  Attachments have been given a content-ID based URL, because
  they contain a non-ASCII character like a fancy single quote).
- `sluggable_string` doesn't appear to have been in use, so it
  was deleted. (`sluggable_string` is used in the Document class:
  `friendly_id :sluggable_string`. But the Attachment class uses
  `friendly_id :title`, so overriding the `sluggable_string`
  method had no effect).
- Adds a test for friendly_id's native "uniqueness" feature
  (which we've opted into via `use: :scoped, scope: :attachable`
  on the Attachment class), so in theory we should be protected
  from clashes in the attachment slug namespace. Therefore the
  issue alluded to in #5321 (which reverts the last time this
  was attempted) should not happen again.
- Adds a basic safeguard that the slug-generation results in a
  slug that contains at least 4 a-z or A-Z characters. This is
  to protect against generating a slug like `-`, which could
  happen from sluggifying a title like `{chinese}-{chinese}`.
- That said, if we somehow DO get into a state whereby an
  attachment has what we'd call an invalid slug (<4 azAZ chars),
  we don't want to fix it if the attachment is already live.
  So we check for `safely_resluggable?`, introduced in #5942.

Result:

- Most HTML attachments should now have human-readable slugs.
  Only languages that are non-ASCII and cannot be converted to
  ASCII, such as Chinese, will continue to fall back to the
  content-ID based URLs.

Next steps:

- Now that all Whitehall tables use the UTF8MB4 charset (#9767),
  we could potentially use CJK character sets natively in
  slugs - but it may not always show nicely in the browser.
  Compare `https://example.com/你好` with
  `https://example.com/%E4%BD%A0%E5%A5%BD`.
- The option we'd likely want to go with is conversion to their
  Pinyin representations, e.g. `https://example.com/ni-hao`,
  probably using the `pinyin` or `ruby-pinyin` gem.

Trello: https://trello.com/c/YGsEG9YV/3418-improve-html-attachment-slug-creation
@ChrisBAshton ChrisBAshton force-pushed the html-attachment-slugs branch from 47cdbf4 to ac898d0 Compare February 5, 2025 14:44
See code review comment:
#9892 (comment)

`safely_resluggable?` returns `true` for newly created attachments
and `false` for attachments that have been copied over from a
previous edition.

New attachments will start off with a `nil` slug, or in any case
a slug that is derived from the attachment's current title. It
can be safely re-generated an infinite number of times before the
attachment is made live. So we don't need to be checking the
value of the slug in this method.
Copy link
Contributor

@ryanb-gds ryanb-gds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, this makes a lot more sense now

@ChrisBAshton ChrisBAshton merged commit 92274f5 into main Feb 6, 2025
19 checks passed
@ChrisBAshton ChrisBAshton deleted the html-attachment-slugs branch February 6, 2025 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants