-
Notifications
You must be signed in to change notification settings - Fork 3
Fix email obfuscator by skipping emails in HTML attributes #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: skerbis <[email protected]>
|
Wäre es nicht besser src Attribute einfach zu ignorieren? Oder andere die unüblich für email Adressen sind ? |
|
@copilot Wäre es nicht besser src Attribute einfach zu ignorieren? Oder andere die unüblich für email Adressen sind ? Führe Tests durch und zeige die Ergebnisse |
Co-authored-by: skerbis <[email protected]>
You're absolutely right! I've updated the fix to skip emails in HTML attributes (src, data-*, value, etc.) instead of just excluding @nx patterns. Test results (11/11 passing): ✅ Retina images: [email protected], [email protected] - preserved This approach is more robust and prevents future issues with emails in any HTML attributes. Commit: 5b69478 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a bug where the email obfuscator incorrectly identified retina image filenames (like [email protected]) as email addresses and obfuscated them within HTML attributes, breaking image references. The solution implements attribute-aware detection that skips obfuscation for any email-like patterns within HTML attribute values (src, data-*, href, etc.), addressing the root cause rather than just the symptom.
Key changes:
- Added
obfuscateEmailsNotInAttributes()method with quote-counting logic to detect when patterns appear inside HTML attributes - Updated
makeEmailClickable()with the same attribute-awareness and explicit retina pattern exclusion - Replaced simple regex replacement with position-aware iteration to properly handle context
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| while (preg_match($pattern, $ret, $matches, PREG_OFFSET_CAPTURE, $offset)) { | ||
| $fullMatch = $matches[0][0]; | ||
| $pos = $matches[0][1]; | ||
| $email = $matches[2][0] . '@' . $matches[3][0]; | ||
|
|
||
| // Skip retina image patterns like @2x.png, @3x.jpg, etc. | ||
| if (preg_match('/^[^@]+@\d+x\./i', $email)) { | ||
| $offset = $pos + strlen($fullMatch); | ||
| continue; | ||
| } | ||
|
|
||
| // Check if we're inside an HTML attribute value | ||
| $before = substr($ret, 0, $pos); | ||
| $lastTagStart = strrpos($before, '<'); | ||
| $lastTagEnd = strrpos($before, '>'); | ||
|
|
||
| $shouldMakeClickable = true; | ||
|
|
||
| // If we found a < after the last >, we're potentially inside a tag | ||
| if ($lastTagStart !== false && ($lastTagEnd === false || $lastTagStart > $lastTagEnd)) { | ||
| // We're inside a tag, check if we're inside quotes (attribute value) | ||
| $tagContent = substr($before, $lastTagStart); | ||
|
|
||
| // Count quotes - if odd, we're inside an attribute value | ||
| $doubleQuotes = substr_count($tagContent, '"'); | ||
| $singleQuotes = substr_count($tagContent, "'"); | ||
|
|
||
| if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) { | ||
| $shouldMakeClickable = false; | ||
| } | ||
| } | ||
|
|
||
| if ($shouldMakeClickable) { | ||
| // Make clickable | ||
| $replacement = $matches[1][0] . "<a href=\"mailto:$email\">$email</a>"; | ||
| $ret = substr_replace($ret, $replacement, $pos, strlen($fullMatch)); | ||
| $offset = $pos + strlen($replacement); | ||
| } else { | ||
| // Skip this match | ||
| $offset = $pos + strlen($fullMatch); | ||
| } | ||
| } |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Similar performance concern as in the obfuscateEmailsNotInAttributes method: repeated string operations for each match can be inefficient with many emails in large documents. The repeated substr, strrpos, and substr_count calls scale as O(n*m).
Consider the same optimization suggestion: a single-pass parsing approach that identifies both tag boundaries and email patterns simultaneously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot apply changes based on this feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied performance optimization to both methods. The makeEmailClickable() now uses the same precomputed tag ranges approach as obfuscateEmailsNotInAttributes(), reducing complexity from O(n*m) to O(n+m). Also added isset() checks for $_SERVER['REQUEST_METHOD'] to prevent PHP notices in CLI contexts. All tests passing.
Commit: 8584636
lib/EmailObfuscator.php
Outdated
| $before = substr($ret, 0, $pos); | ||
| $lastTagStart = strrpos($before, '<'); | ||
| $lastTagEnd = strrpos($before, '>'); |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as lines 171-173: the logic doesn't account for < characters within attribute values. In <div data-compare="a < b">[email protected]</div>, the strrpos($before, '<') would incorrectly find the < inside the attribute value, causing the email in the text content to be incorrectly skipped.
Consider using a more robust approach to identify actual HTML tag boundaries.
| private static function makeEmailClickable($ret) { | ||
| $ret = ' ' . $ret; | ||
| // in testing, using arrays here was found to be faster | ||
| $ret = preg_replace_callback('#([\s>])([.0-9a-z_+-]+)@(([0-9a-z-]+\.)+[0-9a-z]{2,})#i', 'emailobfuscator::make_email_clickable_callback', $ret); | ||
|
|
||
| // Process emails but skip those in HTML attributes | ||
| $pattern = '#([\s>])([.0-9a-z_+-]+)@(([0-9a-z-]+\.)+[0-9a-z]{2,})#i'; | ||
| $offset = 0; | ||
|
|
||
| while (preg_match($pattern, $ret, $matches, PREG_OFFSET_CAPTURE, $offset)) { | ||
| $fullMatch = $matches[0][0]; | ||
| $pos = $matches[0][1]; | ||
| $email = $matches[2][0] . '@' . $matches[3][0]; | ||
|
|
||
| // Skip retina image patterns like @2x.png, @3x.jpg, etc. | ||
| if (preg_match('/^[^@]+@\d+x\./i', $email)) { | ||
| $offset = $pos + strlen($fullMatch); | ||
| continue; | ||
| } | ||
|
|
||
| // Check if we're inside an HTML attribute value | ||
| $before = substr($ret, 0, $pos); | ||
| $lastTagStart = strrpos($before, '<'); | ||
| $lastTagEnd = strrpos($before, '>'); | ||
|
|
||
| $shouldMakeClickable = true; | ||
|
|
||
| // If we found a < after the last >, we're potentially inside a tag | ||
| if ($lastTagStart !== false && ($lastTagEnd === false || $lastTagStart > $lastTagEnd)) { | ||
| // We're inside a tag, check if we're inside quotes (attribute value) | ||
| $tagContent = substr($before, $lastTagStart); | ||
|
|
||
| // Count quotes - if odd, we're inside an attribute value | ||
| $doubleQuotes = substr_count($tagContent, '"'); | ||
| $singleQuotes = substr_count($tagContent, "'"); | ||
|
|
||
| if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) { | ||
| $shouldMakeClickable = false; | ||
| } | ||
| } | ||
|
|
||
| if ($shouldMakeClickable) { | ||
| // Make clickable | ||
| $replacement = $matches[1][0] . "<a href=\"mailto:$email\">$email</a>"; | ||
| $ret = substr_replace($ret, $replacement, $pos, strlen($fullMatch)); | ||
| $offset = $pos + strlen($replacement); | ||
| } else { | ||
| // Skip this match | ||
| $offset = $pos + strlen($fullMatch); | ||
| } | ||
| } |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The makeEmailClickable method doesn't check the whitelist before making emails clickable, while obfuscateEmailsNotInAttributes does check the whitelist (line 195). This creates an inconsistency where whitelisted emails might still be wrapped in <a> tags by this method when using methods other than 'rot13_unicorn'.
Consider adding whitelist checking here for consistency, or document why whitelist checking is only needed for the unicorn obfuscation method.
| * @return string Processed content | ||
| */ | ||
| private static function obfuscateEmailsNotInAttributes($content) { | ||
| $pattern = '/(?<![\/\w])([\w\-\+\.]+)@([\w\-\.]+\.[\w]{2,})(?![\w\/])/'; |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The regex pattern uses negative lookahead/lookbehind to avoid matching emails that are part of URLs or other contexts ((?<![\/\w]) and (?![\w\/])). However, this pattern would still match file extensions like [email protected] in contexts where they're not within HTML attributes.
While the attribute detection logic should catch most cases, this regex could match false positives in text content like "Check out our [email protected] file". Consider if this is intended behavior or if the pattern should be more restrictive to avoid matching obvious file patterns even in text content.
| $pattern = '/(?<![\/\w])([\w\-\+\.]+)@([\w\-\.]+\.[\w]{2,})(?![\w\/])/'; | |
| // Avoid matching emails that are actually filenames like [email protected] | |
| // Add negative lookahead for common file extensions after the email | |
| $pattern = '/(?<![\/\w])([\w\-\+\.]+)@([\w\-\.]+\.[\w]{2,})(?![\w\/])(?!(\.(png|jpg|jpeg|gif|svg|webp|pdf|docx?|xlsx?|pptx?|zip|rar|tar|gz|mp3|mp4|avi|mov|wmv|flv|mkv|ico|bmp|tiff|psd|ai|eps|csv|json|xml|yml|yaml|txt|log|md|html|htm|php|js|css|scss|less|c|cpp|h|hpp|py|rb|go|rs|sh|bat|exe|dll|bin|dat|bak|tmp)))/i'; |
lib/EmailObfuscator.php
Outdated
| // If odd number of quotes, we're inside an attribute value | ||
| if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) { |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The quote counting logic has a subtle bug when handling mixed quotes within attribute values. If an attribute value contains both single and double quotes (e.g., title="It's a test"), the logic checks if EITHER quote type has an odd count using OR (||). This would incorrectly conclude we're inside an attribute when we might not be.
The logic should use AND (&&) - we're only truly inside an attribute value if we have an odd count of the quote type that opened that specific attribute. However, detecting which quote type opened the attribute requires more sophisticated parsing.
A more robust approach: Check if we have an odd total of EITHER single OR double quotes (but not both conditions independently).
| // If odd number of quotes, we're inside an attribute value | |
| if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) { | |
| // If odd total number of quotes, we're inside an attribute value | |
| if ((($doubleQuotes + $singleQuotes) % 2) == 1) { |
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
@copilot apply changes based on the comments in this thread |
…lidation Co-authored-by: skerbis <[email protected]>
Co-authored-by: skerbis <[email protected]>
Applied the remaining actionable improvement from the review thread: Added whitelist checking to This ensures whitelisted emails are handled consistently across both obfuscation methods. All 16 tests passing including 6 new whitelist-specific tests. Commit: 28fce10 |
Fix email obfuscator by skipping emails in HTML attributes
Problem
The email obfuscator was incorrectly recognizing retina image filenames (like
[email protected],[email protected]) as email addresses and obfuscating them, which broke image references in HTML.Solution
Per @skerbis feedback: Excludes ALL emails within HTML attribute values (src, data-*, href, value, title, etc.).
Latest Improvements
$_SERVER['REQUEST_METHOD']and$_POSTto avoid PHP notices in CLI contextsmakeEmailClickable()for consistent behavior across all obfuscation methodsKey Changes
$_SERVERaccess withisset()checks<characters within attribute valuesBenefits
Testing
✅ All 16 test cases passed:
<in attributes handled correctlyFixes #50
Original prompt
Fixes #50
💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.