Skip to content

Conversation

Copy link

Copilot AI commented Oct 15, 2025

Fix email obfuscator by skipping emails in HTML attributes

Problem

The email obfuscator was incorrectly recognizing retina image filenames (like [email protected], [email protected]) as email addresses and obfuscating them, which broke image references in HTML.

Solution

Per @skerbis feedback: Excludes ALL emails within HTML attribute values (src, data-*, href, value, title, etc.).

Latest Improvements

  1. Performance optimization: Both methods use precomputed tag ranges approach (O(n+m) instead of O(n*m) complexity)
  2. Robustness: Added checks for $_SERVER['REQUEST_METHOD'] and $_POST to avoid PHP notices in CLI contexts
  3. Consistency: Added whitelist checking to makeEmailClickable() for consistent behavior across all obfuscation methods
  4. Bug fixes: Removed redundant variables, restored retina pattern check for edge cases

Key Changes

  • Both methods now use precomputed HTML tag ranges for better performance
  • Safer $_SERVER access with isset() checks
  • Consistent whitelist handling across unicorn and clickable methods
  • Handles edge cases like < characters within attribute values

Benefits

Feature Pattern Exclusion Attribute Exclusion (This PR)
Fixes retina images
Fixes emails in form inputs
Fixes emails in data attributes
Fixes emails in ANY attributes
Performance optimized
Handles edge cases
Consistent whitelist support

Testing

✅ All 16 test cases passed:

  • Retina images preserved in attributes
  • Emails in attributes preserved
  • Emails in text correctly obfuscated
  • Edge case with < in attributes handled correctly
  • Both unicorn and clickable methods working correctly
  • Whitelist functionality consistent across both methods

Fixes #50

Original prompt

This section details on the original issue you should resolve

<issue_title>"Erkennt" [email protected] als E-Mail</issue_title>
<issue_description>Der Obfuscator erkennt [email protected] als E-Mail-Adresse und macht dies draus:

<img src="/Logo&lt;span class=unicorn&gt;&lt;span&gt;_at_&lt;/span&gt;&lt;/span&gt;2x.png" alt="Logo" class="img-fluid" loading="lazy">

REDAXO | 5.17.1
PHP: 8.1.33
E-Mail Obfuscator 3.0.0

Methode: ROT13 Einhorn-Markup (Standard, benötigt JQuery)</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #50


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI changed the title [WIP] Fix obfuscator misinterpreting logo as email Fix email obfuscator incorrectly matching retina image filenames Oct 15, 2025
Copilot AI requested a review from skerbis October 15, 2025 16:43
Copilot finished work on behalf of skerbis October 15, 2025 16:43
@skerbis
Copy link
Member

skerbis commented Oct 15, 2025

Wäre es nicht besser src Attribute einfach zu ignorieren? Oder andere die unüblich für email Adressen sind ?

@skerbis
Copy link
Member

skerbis commented Oct 15, 2025

@copilot Wäre es nicht besser src Attribute einfach zu ignorieren? Oder andere die unüblich für email Adressen sind ? Führe Tests durch und zeige die Ergebnisse

Copy link
Author

Copilot AI commented Oct 15, 2025

@copilot Wäre es nicht besser src Attribute einfach zu ignorieren? Oder andere die unüblich für email Adressen sind ? Führe Tests durch und zeige die Ergebnisse

You're absolutely right! I've updated the fix to skip emails in HTML attributes (src, data-*, value, etc.) instead of just excluding @nx patterns.

Test results (11/11 passing):

Retina images: [email protected], [email protected] - preserved
Emails in attributes: data-email, input values, title - preserved
Emails in text: paragraphs, divs, plain text - obfuscated correctly

This approach is more robust and prevents future issues with emails in any HTML attributes.

Commit: 5b69478

Copilot AI changed the title Fix email obfuscator incorrectly matching retina image filenames Fix email obfuscator by skipping emails in HTML attributes Oct 15, 2025
Copilot finished work on behalf of skerbis October 15, 2025 17:04
@skerbis skerbis requested review from eaCe and removed request for skerbis October 15, 2025 17:06
@skerbis skerbis marked this pull request as ready for review November 19, 2025 11:42
Copilot AI review requested due to automatic review settings November 19, 2025 11:42
Copilot finished reviewing on behalf of skerbis November 19, 2025 11:45
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug where the email obfuscator incorrectly identified retina image filenames (like [email protected]) as email addresses and obfuscated them within HTML attributes, breaking image references. The solution implements attribute-aware detection that skips obfuscation for any email-like patterns within HTML attribute values (src, data-*, href, etc.), addressing the root cause rather than just the symptom.

Key changes:

  • Added obfuscateEmailsNotInAttributes() method with quote-counting logic to detect when patterns appear inside HTML attributes
  • Updated makeEmailClickable() with the same attribute-awareness and explicit retina pattern exclusion
  • Replaced simple regex replacement with position-aware iteration to properly handle context

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 248 to 289
while (preg_match($pattern, $ret, $matches, PREG_OFFSET_CAPTURE, $offset)) {
$fullMatch = $matches[0][0];
$pos = $matches[0][1];
$email = $matches[2][0] . '@' . $matches[3][0];

// Skip retina image patterns like @2x.png, @3x.jpg, etc.
if (preg_match('/^[^@]+@\d+x\./i', $email)) {
$offset = $pos + strlen($fullMatch);
continue;
}

// Check if we're inside an HTML attribute value
$before = substr($ret, 0, $pos);
$lastTagStart = strrpos($before, '<');
$lastTagEnd = strrpos($before, '>');

$shouldMakeClickable = true;

// If we found a < after the last >, we're potentially inside a tag
if ($lastTagStart !== false && ($lastTagEnd === false || $lastTagStart > $lastTagEnd)) {
// We're inside a tag, check if we're inside quotes (attribute value)
$tagContent = substr($before, $lastTagStart);

// Count quotes - if odd, we're inside an attribute value
$doubleQuotes = substr_count($tagContent, '"');
$singleQuotes = substr_count($tagContent, "'");

if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) {
$shouldMakeClickable = false;
}
}

if ($shouldMakeClickable) {
// Make clickable
$replacement = $matches[1][0] . "<a href=\"mailto:$email\">$email</a>";
$ret = substr_replace($ret, $replacement, $pos, strlen($fullMatch));
$offset = $pos + strlen($replacement);
} else {
// Skip this match
$offset = $pos + strlen($fullMatch);
}
}
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Similar performance concern as in the obfuscateEmailsNotInAttributes method: repeated string operations for each match can be inefficient with many emails in large documents. The repeated substr, strrpos, and substr_count calls scale as O(n*m).

Consider the same optimization suggestion: a single-pass parsing approach that identifies both tag boundaries and email patterns simultaneously.

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied performance optimization to both methods. The makeEmailClickable() now uses the same precomputed tag ranges approach as obfuscateEmailsNotInAttributes(), reducing complexity from O(n*m) to O(n+m). Also added isset() checks for $_SERVER['REQUEST_METHOD'] to prevent PHP notices in CLI contexts. All tests passing.

Commit: 8584636

Comment on lines 260 to 262
$before = substr($ret, 0, $pos);
$lastTagStart = strrpos($before, '<');
$lastTagEnd = strrpos($before, '>');
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as lines 171-173: the logic doesn't account for < characters within attribute values. In <div data-compare="a < b">[email protected]</div>, the strrpos($before, '<') would incorrectly find the < inside the attribute value, causing the email in the text content to be incorrectly skipped.

Consider using a more robust approach to identify actual HTML tag boundaries.

Copilot uses AI. Check for mistakes.
Comment on lines 180 to 289
private static function makeEmailClickable($ret) {
$ret = ' ' . $ret;
// in testing, using arrays here was found to be faster
$ret = preg_replace_callback('#([\s>])([.0-9a-z_+-]+)@(([0-9a-z-]+\.)+[0-9a-z]{2,})#i', 'emailobfuscator::make_email_clickable_callback', $ret);

// Process emails but skip those in HTML attributes
$pattern = '#([\s>])([.0-9a-z_+-]+)@(([0-9a-z-]+\.)+[0-9a-z]{2,})#i';
$offset = 0;

while (preg_match($pattern, $ret, $matches, PREG_OFFSET_CAPTURE, $offset)) {
$fullMatch = $matches[0][0];
$pos = $matches[0][1];
$email = $matches[2][0] . '@' . $matches[3][0];

// Skip retina image patterns like @2x.png, @3x.jpg, etc.
if (preg_match('/^[^@]+@\d+x\./i', $email)) {
$offset = $pos + strlen($fullMatch);
continue;
}

// Check if we're inside an HTML attribute value
$before = substr($ret, 0, $pos);
$lastTagStart = strrpos($before, '<');
$lastTagEnd = strrpos($before, '>');

$shouldMakeClickable = true;

// If we found a < after the last >, we're potentially inside a tag
if ($lastTagStart !== false && ($lastTagEnd === false || $lastTagStart > $lastTagEnd)) {
// We're inside a tag, check if we're inside quotes (attribute value)
$tagContent = substr($before, $lastTagStart);

// Count quotes - if odd, we're inside an attribute value
$doubleQuotes = substr_count($tagContent, '"');
$singleQuotes = substr_count($tagContent, "'");

if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) {
$shouldMakeClickable = false;
}
}

if ($shouldMakeClickable) {
// Make clickable
$replacement = $matches[1][0] . "<a href=\"mailto:$email\">$email</a>";
$ret = substr_replace($ret, $replacement, $pos, strlen($fullMatch));
$offset = $pos + strlen($replacement);
} else {
// Skip this match
$offset = $pos + strlen($fullMatch);
}
}
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The makeEmailClickable method doesn't check the whitelist before making emails clickable, while obfuscateEmailsNotInAttributes does check the whitelist (line 195). This creates an inconsistency where whitelisted emails might still be wrapped in <a> tags by this method when using methods other than 'rot13_unicorn'.

Consider adding whitelist checking here for consistency, or document why whitelist checking is only needed for the unicorn obfuscation method.

Copilot uses AI. Check for mistakes.
* @return string Processed content
*/
private static function obfuscateEmailsNotInAttributes($content) {
$pattern = '/(?<![\/\w])([\w\-\+\.]+)@([\w\-\.]+\.[\w]{2,})(?![\w\/])/';
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The regex pattern uses negative lookahead/lookbehind to avoid matching emails that are part of URLs or other contexts ((?<![\/\w]) and (?![\w\/])). However, this pattern would still match file extensions like [email protected] in contexts where they're not within HTML attributes.

While the attribute detection logic should catch most cases, this regex could match false positives in text content like "Check out our [email protected] file". Consider if this is intended behavior or if the pattern should be more restrictive to avoid matching obvious file patterns even in text content.

Suggested change
$pattern = '/(?<![\/\w])([\w\-\+\.]+)@([\w\-\.]+\.[\w]{2,})(?![\w\/])/';
// Avoid matching emails that are actually filenames like [email protected]
// Add negative lookahead for common file extensions after the email
$pattern = '/(?<![\/\w])([\w\-\+\.]+)@([\w\-\.]+\.[\w]{2,})(?![\w\/])(?!(\.(png|jpg|jpeg|gif|svg|webp|pdf|docx?|xlsx?|pptx?|zip|rar|tar|gz|mp3|mp4|avi|mov|wmv|flv|mkv|ico|bmp|tiff|psd|ai|eps|csv|json|xml|yml|yaml|txt|log|md|html|htm|php|js|css|scss|less|c|cpp|h|hpp|py|rb|go|rs|sh|bat|exe|dll|bin|dat|bak|tmp)))/i';

Copilot uses AI. Check for mistakes.
Comment on lines 186 to 187
// If odd number of quotes, we're inside an attribute value
if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) {
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quote counting logic has a subtle bug when handling mixed quotes within attribute values. If an attribute value contains both single and double quotes (e.g., title="It's a test"), the logic checks if EITHER quote type has an odd count using OR (||). This would incorrectly conclude we're inside an attribute when we might not be.

The logic should use AND (&&) - we're only truly inside an attribute value if we have an odd count of the quote type that opened that specific attribute. However, detecting which quote type opened the attribute requires more sophisticated parsing.

A more robust approach: Check if we have an odd total of EITHER single OR double quotes (but not both conditions independently).

Suggested change
// If odd number of quotes, we're inside an attribute value
if (($doubleQuotes % 2) == 1 || ($singleQuotes % 2) == 1) {
// If odd total number of quotes, we're inside an attribute value
if ((($doubleQuotes + $singleQuotes) % 2) == 1) {

Copilot uses AI. Check for mistakes.
@skerbis
Copy link
Member

skerbis commented Nov 19, 2025

@copilot apply changes based on the comments in this thread

Copilot AI requested a review from skerbis November 19, 2025 15:18
Copilot finished work on behalf of skerbis November 19, 2025 15:18
Copy link
Author

Copilot AI commented Nov 19, 2025

@copilot apply changes based on the comments in this thread

Applied the remaining actionable improvement from the review thread: Added whitelist checking to makeEmailClickable() for consistency with obfuscateEmailsNotInAttributes().

This ensures whitelisted emails are handled consistently across both obfuscation methods. All 16 tests passing including 6 new whitelist-specific tests.

Commit: 28fce10

Copilot finished work on behalf of skerbis November 19, 2025 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"Erkennt" [email protected] als E-Mail

2 participants