Commit e24b19a

refactor & doc : UrlArea
1 parent 119a5e5 commit e24b19a

11 files changed (+501 -404 lines changed)

Diff for: README.md (+29 -9)
@@ -1,38 +1,58 @@
 # Url-knife [![NPM version](https://img.shields.io/npm/v/url-knife.svg)](https://www.npmjs.com/package/url-knife) [![](https://data.jsdelivr.com/v1/package/gh/patternknife/url-knife/badge)](https://www.jsdelivr.com/package/gh/patternknife/url-knife) [![](https://badgen.net/bundlephobia/minzip/url-knife)](https://bundlephobia.com/result?p=url-knife)
 ## Overview
-Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with robust patterns.
-
+Extract and decompose (fuzzy) URLs (including emails, which are conceptually a part of URLs) in texts with ``Area-Pattern-based modularity``.
 - This library is currently being refactored into TypeScript, as it was originally developed in JavaScript.
 
 #### URL knife
 <a href="https://jsfiddle.net/AndrewKang/xtfjn8g3/" target="_blank">LIVE DEMO</a>
 
 
+## Area-Pattern-Based Modularity
+
+The **Area** represents a designated section of content, such as general text, XML (HTML) areas, URL areas, or EMAIL areas. Each **Area** is associated with a specific set of **Patterns** (regular expressions) tailored to its context.
+
+### Example:
+
+1. In a **TextArea** (general plain text), the system applies a URL-specific regular expression to extract potential URLs.
+2. Once the area is narrowed down to contain URLs, **UrlArea** logic is used, applying URL-specific patterns to decompose the URL into its components (e.g., protocol, domain, path, query parameters).
+
+### Enhanced Accuracy with Regular Expression Indexes:
+To further improve accuracy, the system leverages the **index** (or **offset**) values from regular expressions. These indexes help pinpoint exact locations of matches within the text, ensuring precise extraction and minimizing false positives.
+
+For example:
+- If a **CommentArea** is processed using its specific patterns, the system identifies indexes for matches within that area.
+- These indexes can then be used to exclude matched URLs from a broader **TextArea**, ensuring only relevant URLs are processed and avoiding redundant or incorrect extractions.
+
+### Key Benefits:
+This modular approach ensures that each **Area** is processed efficiently with the most relevant and optimized regular expressions. By incorporating index-based matching, it enables robust, scalable, and highly accurate parsing for various content types while preventing conflicts between overlapping patterns.
+
+
 ## Installation
 
-For ES5 users,
+For ES5 users, refer to ``public/index.html``.
 
 ``` html
 <html>
 <body>
 <script src="../dist/url-knife.bundle.js"></script>
 <--! OR !-->
-<script src="https://cdn.jsdelivr.net/gh/patternknife/[email protected]/dist/url-knife.bundle.min.js"></script>
-
-<script type="text/javascript">
-</script>
+<script src="https://cdn.jsdelivr.net/gh/patternknife/[email protected]/dist/url-knife.bundle.min.js"></script>
 </body>
 </html>
 ```
 
-For ES6 npm users, run 'npm install --save url-knife' on console.
+For ES6 npm users, run 'npm install --save url-knife' in the console.
 (**Requred Node v18.20.4**)
-
 ``` html
 import {TextArea, UrlArea, XmlArea} from 'url-knife';
 ```
+For ES5, add Pattern before usage:
+```javascript
+Pattern.UrlArea...
+````
 
 ## Syntax & Usage
+
 [Chapter 1. Normalize or parse one URL](#chapter-1-normalize-or-parse-one-url)
 
 [Chapter 2. Extract all URLs or emails](#chapter-2-extract-all-urls-or-emails)
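
Reviewer note on the README hunk: the new "Area-Pattern-Based Modularity" text describes two ideas, narrowing a text area down to URL candidates before decomposing them, and using regular-expression match indexes to exclude candidates that fall inside another area (such as a comment). The sketch below illustrates both under loud assumptions: `COMMENT_RX`, `URL_RX`, `findCommentSpans`, `extractUrlsOutsideComments`, and `decomposeUrl` are hypothetical names, the two patterns are deliberately simplistic stand-ins for url-knife's fuzzy patterns, and the decomposition step uses the built-in WHATWG `URL` class rather than the library's own `UrlArea` logic.

```javascript
// Hypothetical, simplified patterns; url-knife's real patterns are far more permissive.
const COMMENT_RX = /<!--[\s\S]*?-->/g;      // stands in for a "CommentArea" pattern
const URL_RX = /https?:\/\/[^\s<>"']+/g;    // stands in for a "TextArea" URL pattern

// Collect [start, end) offsets of every comment, using each match's index.
function findCommentSpans(text) {
    return Array.from(text.matchAll(COMMENT_RX), (m) => [m.index, m.index + m[0].length]);
}

// Step 1 (TextArea): extract URL candidates, skipping any that start inside a comment span.
function extractUrlsOutsideComments(text) {
    const spans = findCommentSpans(text);
    const found = [];
    for (const m of text.matchAll(URL_RX)) {
        const insideComment = spans.some(([start, end]) => m.index >= start && m.index < end);
        if (!insideComment) {
            found.push({ value: m[0], index: m.index });
        }
    }
    return found;
}

// Step 2 (UrlArea): decompose one extracted URL into its components.
function decomposeUrl(url) {
    const parsed = new URL(url); // WHATWG URL, built into Node.js and browsers
    return {
        protocol: parsed.protocol.replace(/:$/, ''),
        domain: parsed.hostname,
        port: parsed.port || null,
        path: parsed.pathname,
        params: parsed.search || null,
    };
}

const sample = 'See https://example.com:8080/docs?a=1 <!-- old: http://stale.example.org/x --> for details.';
const urls = extractUrlsOutsideComments(sample);
console.log(urls.map((u) => decomposeUrl(u.value)));
```

Only the URL outside the comment survives step 1. Keeping the offsets alongside the matched values is what makes the exclusion possible without re-scanning the text.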

Diff for: dist/bo/UrlNormalizer.js (+59 -28)
@@ -9,17 +9,40 @@ const FuzzyPartialUrlPatterns_1 = require("../pattern/FuzzyPartialUrlPatterns");
 const BasePatterns_1 = require("../pattern/BasePatterns");
 const ProtocolPatterns_1 = require("../pattern/ProtocolPatterns");
 const DomainPatterns_1 = require("../pattern/DomainPatterns");
+const valid_1 = __importDefault(require("../valid"));
 exports.UrlNormalizer = {
-    modifiedUrl: null,
+    sacrificedUrl: null,
+    currentStep: 0,
+    /**
+     * Initializes the UrlNormalizer with a given URL.
+     * @param url - The URL to normalize.
+     */
+    initializeSacrificedUrl(url) {
+        this.sacrificedUrl = util_1.default.Text.removeAllSpaces(valid_1.default.validateAndTrimString(url));
+        if (!this.sacrificedUrl) {
+            throw new Error("modifiedUrl cannot be null or empty");
+        }
+        this.currentStep = 1;
+    },
+    /**
+     * Check if the required previous step is completed.
+     * @param requiredStep - The step that should have been completed.
+     */
+    ensureStepCompleted(requiredStep) {
+        if (this.currentStep != requiredStep) {
+            throw new Error(`Step ${requiredStep} must be completed before this step ${this.currentStep}`);
+        }
+    },
     extractAndNormalizeProtocolFromSpacesRemovedUrl() {
-        if (this.modifiedUrl == undefined) {
-            throw new Error("modifiedUrl cannot be null");
+        this.ensureStepCompleted(1);
+        if (!this.sacrificedUrl) {
+            throw new Error("modifiedUrl cannot be null or empty");
         }
         let protocol = null;
         let rx = new RegExp('^(' + FuzzyPartialUrlPatterns_1.FuzzyPartialUrlPatterns.getFuzzyProtocolsRxStr + '|' + FuzzyPartialUrlPatterns_1.FuzzyPartialUrlPatterns.fuzzierProtocol + ')' + FuzzyPartialUrlPatterns_1.FuzzyPartialUrlPatterns.fuzzierProtocolDomainDelimiter);
         let match;
         let isMatched = false;
-        while ((match = rx.exec(this.modifiedUrl)) !== null) {
+        while ((match = rx.exec(this.sacrificedUrl)) !== null) {
             if (match && match[1]) {
                 isMatched = true;
                 if (match[1] === 'localhost') {
@@ -37,11 +60,13 @@ exports.UrlNormalizer = {
                 break;
             }
         }
-        this.modifiedUrl = this.modifiedUrl.replace(rx, '');
+        this.sacrificedUrl = this.sacrificedUrl.replace(rx, '');
+        this.currentStep = 2;
         return protocol;
     },
     extractAndNormalizeDomainFromProtocolRemovedUrl() {
-        if (this.modifiedUrl == undefined) {
+        this.ensureStepCompleted(2);
+        if (this.sacrificedUrl == undefined) {
             throw new Error("modifiedUrl cannot be null");
         }
         let result = {
@@ -51,7 +76,7 @@ exports.UrlNormalizer = {
         let rx1 = new RegExp('(' + FuzzyPartialUrlPatterns_1.FuzzyPartialUrlPatterns.getFuzzyDomainBody + '.*?)(' + FuzzyPartialUrlPatterns_1.FuzzyPartialUrlPatterns.optionalFuzzyPort +
             FuzzyPartialUrlPatterns_1.FuzzyPartialUrlPatterns.optionalFuzzyUrlParams + ')$', 'gi');
         let match1;
-        while ((match1 = rx1.exec(this.modifiedUrl)) !== null) {
+        while ((match1 = rx1.exec(this.sacrificedUrl)) !== null) {
             // remaining full url
             let domain_temp = match1[0];
             // domain
@@ -141,46 +166,49 @@ exports.UrlNormalizer = {
             else {
                 result.domain = domain_temp2;
             }
-            this.modifiedUrl = domain_temp3;
+            this.sacrificedUrl = domain_temp3;
         }
         //console.log("before : " + this.modifiedUrl)
         // This sort of characters should NOT be located at the start.
-        this.modifiedUrl = this.modifiedUrl.replace(new RegExp('^(?:' + BasePatterns_1.BasePatterns.twoBytesNum + '|' + BasePatterns_1.BasePatterns.langChar + ')+', 'i'), '');
-        //console.log("after : " + this.modifiedUrl)
+        this.sacrificedUrl = this.sacrificedUrl.replace(new RegExp('^(?:' + BasePatterns_1.BasePatterns.twoBytesNum + '|' + BasePatterns_1.BasePatterns.langChar + ')+', 'i'), '');
+        this.currentStep = 3;
         return result;
     },
     extractAndNormalizePortFromDomainRemovedUrl() {
+        this.ensureStepCompleted(3);
         let port = null;
         let rx = new RegExp('^' + FuzzyPartialUrlPatterns_1.FuzzyPartialUrlPatterns.mandatoryFuzzyPort, 'gi');
         let match;
-        if (this.modifiedUrl == undefined) {
+        if (this.sacrificedUrl == undefined) {
             throw new Error("modifiedUrl cannot be null");
         }
-        while ((match = rx.exec(this.modifiedUrl)) !== null) {
+        while ((match = rx.exec(this.sacrificedUrl)) !== null) {
             port = match[0].replace(/^\D+/g, '');
-            if (this.modifiedUrl != undefined) {
-                this.modifiedUrl = this.modifiedUrl.replace(rx, '');
+            if (this.sacrificedUrl != undefined) {
+                this.sacrificedUrl = this.sacrificedUrl.replace(rx, '');
             }
         }
+        this.currentStep = 4;
         return port;
     },
-    finalizeNormalization(protocol, port, domain) {
-        if (this.modifiedUrl == undefined) {
+    extractNormalizedUrl(protocol, port, domain) {
+        this.ensureStepCompleted(4);
+        if (this.sacrificedUrl == undefined) {
             throw new Error("modifiedUrl cannot be null");
         }
         /* Now, only the end part of a domain is left */
         /* Consecutive param delimiters should be replaced into one */
-        this.modifiedUrl = this.modifiedUrl.replace(/[#]{2,}/gi, '#');
-        this.modifiedUrl = this.modifiedUrl.replace(/[/]{2,}/gi, '/');
-        this.modifiedUrl = this.modifiedUrl.replace(/(.*?)[?]{2,}([^/]*?(?:=|$))(.*)/i, function (match, $1, $2, $3) {
+        this.sacrificedUrl = this.sacrificedUrl.replace(/[#]{2,}/gi, '#');
+        this.sacrificedUrl = this.sacrificedUrl.replace(/[/]{2,}/gi, '/');
+        this.sacrificedUrl = this.sacrificedUrl.replace(/(.*?)[?]{2,}([^/]*?(?:=|$))(.*)/i, function (match, $1, $2, $3) {
             //console.log(modified_url + ' a :' + $1 + '?' + $2 + $3);
             return $1 + '?' + $2 + $3;
         });
         /* 'modified_url' must start with '/,?,#' */
         let rx_modified_url = new RegExp('(?:\\/|\\?|\\#)', 'i');
         let match_modified_url;
-        if ((match_modified_url = rx_modified_url.exec(this.modifiedUrl)) !== null) {
-            this.modifiedUrl = this.modifiedUrl.replace(new RegExp('^.*?(' + util_1.default.Text.escapeRegex(match_modified_url[0]) + '.*)$', 'i'), function (match, $1) {
+        if ((match_modified_url = rx_modified_url.exec(this.sacrificedUrl)) !== null) {
+            this.sacrificedUrl = this.sacrificedUrl.replace(new RegExp('^.*?(' + util_1.default.Text.escapeRegex(match_modified_url[0]) + '.*)$', 'i'), function (match, $1) {
                 return $1;
             });
         }
@@ -202,42 +230,45 @@ exports.UrlNormalizer = {
         if (!onlyDomain_str) {
             onlyDomain_str = '';
         }
-        return protocol_str + onlyDomain_str + port_str + this.modifiedUrl;
+        this.currentStep = 5;
+        return protocol_str + onlyDomain_str + port_str + this.sacrificedUrl;
     },
     extractAndNormalizeUriParamsFromPortRemovedUrl() {
-        if (this.modifiedUrl == undefined) {
+        this.ensureStepCompleted(5);
+        if (this.sacrificedUrl == undefined) {
             throw new Error("modifiedUrl cannot be null");
         }
         let result = {
             uri: null,
             params: null
         };
-        if (!this.modifiedUrl || this.modifiedUrl.trim() === '') {
+        if (!this.sacrificedUrl || this.sacrificedUrl.trim() === '') {
             result.params = null;
             result.uri = null;
         }
         else {
             // PARAMS
             let rx3 = new RegExp('\\?(?:.)*$', 'gi');
             let match3;
-            while ((match3 = rx3.exec(this.modifiedUrl)) !== null) {
+            while ((match3 = rx3.exec(this.sacrificedUrl)) !== null) {
                 result.params = match3[0];
             }
-            this.modifiedUrl = this.modifiedUrl.replace(rx3, '');
+            this.sacrificedUrl = this.sacrificedUrl.replace(rx3, '');
             if (result.params === "?") {
                 result.params = null;
             }
             // URI
             let rx4 = new RegExp('[#/](?:.)*$', 'gi');
             let match4;
-            while ((match4 = rx4.exec(this.modifiedUrl)) !== null) {
+            while ((match4 = rx4.exec(this.sacrificedUrl)) !== null) {
                 result.uri = match4[0];
             }
-            this.modifiedUrl = this.modifiedUrl.replace(rx4, '');
+            this.sacrificedUrl = this.sacrificedUrl.replace(rx4, '');
             if (result.uri === "/") {
                 result.uri = null;
             }
         }
+        this.currentStep = 6;
         return result;
     }
 };
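
Reviewer note on the UrlNormalizer hunks: the commit adds a `currentStep` counter and an `ensureStepCompleted` guard so that each extraction method can only run after the previous one has consumed its part of `sacrificedUrl`. The following is a minimal, self-contained sketch of that ordering guard, not the library's code. `createSteppedParser`, `takeProtocol`, `takeHost`, and `rest` are hypothetical names, and the two regular expressions are simplified assumptions rather than the patterns from `FuzzyPartialUrlPatterns`.

```javascript
// Hypothetical stand-in for the step guard added to UrlNormalizer in this diff.
function createSteppedParser(input) {
    let remaining = input;  // the "sacrificed" string, consumed piece by piece
    let currentStep = 1;    // 1 = initialized, nothing extracted yet

    // Throw unless the last completed step is exactly the one required.
    function ensureStepCompleted(requiredStep) {
        if (currentStep !== requiredStep) {
            throw new Error(`Expected step ${requiredStep} to be the last completed step, but it was ${currentStep}`);
        }
    }

    return {
        takeProtocol() {
            ensureStepCompleted(1);
            const match = /^([a-z][a-z0-9+.-]*):\/\//i.exec(remaining);
            if (match) {
                remaining = remaining.slice(match[0].length);
            }
            currentStep = 2;
            return match ? match[1].toLowerCase() : null;
        },
        takeHost() {
            ensureStepCompleted(2);
            const match = /^[^/?#:]+/.exec(remaining);
            if (match) {
                remaining = remaining.slice(match[0].length);
            }
            currentStep = 3;
            return match ? match[0] : null;
        },
        rest() {
            ensureStepCompleted(3);
            return remaining;
        },
    };
}

const parser = createSteppedParser('HTTP://example.com/a?b=1');
console.log(parser.takeProtocol(), parser.takeHost(), parser.rest());
```

Because every method mutates the shared remainder, calling them out of order would silently parse the wrong substring. The guard turns that mistake into an immediate error.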

Diff for: dist/service/EmailAreaService.js (+19 -19)
@@ -10,7 +10,7 @@ const BasePatterns_1 = require("../pattern/BasePatterns");
 const DomainPatterns_1 = require("../pattern/DomainPatterns");
 exports.EmailAreaService = {
     parseEmail(email) {
-        let obj = {
+        let parsedEmailComponents = {
             email: null,
             removedTailOnEmail: null,
             type: null
@@ -21,49 +21,49 @@ exports.EmailAreaService = {
             if (!valid_1.default.isEmailPattern(email)) {
                 throw new Error('This is not an email pattern');
             }
-            obj.email = email;
+            parsedEmailComponents.email = email;
             if (new RegExp('@' + BasePatterns_1.BasePatterns.everything + '*' + DomainPatterns_1.DomainPatterns.ipV4, 'i').test(email)) {
-                obj.type = 'ipV4';
+                parsedEmailComponents.type = 'ipV4';
             }
             else if (new RegExp('@' + BasePatterns_1.BasePatterns.everything + '*' + DomainPatterns_1.DomainPatterns.ipV6, 'i').test(email)) {
                 //console.log('r : ' + url);
-                obj.type = 'ipV6';
+                parsedEmailComponents.type = 'ipV6';
             }
             else {
-                obj.type = 'domain';
+                parsedEmailComponents.type = 'domain';
             }
             // If no uris no params, we remove suffix in case that it is a meta character.
-            if (obj.email) {
-                if (obj.type !== 'ipV6') {
+            if (parsedEmailComponents.email) {
+                if (parsedEmailComponents.type !== 'ipV6') {
                     // removedTailOnUrl
-                    let rm_part_matches = obj.email.match(new RegExp(BasePatterns_1.BasePatterns.noLangCharNum + '+$', 'gi'));
+                    let rm_part_matches = parsedEmailComponents.email.match(new RegExp(BasePatterns_1.BasePatterns.noLangCharNum + '+$', 'gi'));
                     if (rm_part_matches) {
-                        obj.removedTailOnEmail = rm_part_matches[0];
-                        obj.email = obj.email.replace(new RegExp(BasePatterns_1.BasePatterns.noLangCharNum + '+$', 'gi'), '');
+                        parsedEmailComponents.removedTailOnEmail = rm_part_matches[0];
+                        parsedEmailComponents.email = parsedEmailComponents.email.replace(new RegExp(BasePatterns_1.BasePatterns.noLangCharNum + '+$', 'gi'), '');
                     }
                 }
                 else {
                     // removedTailOnUrl
-                    let rm_part_matches = obj.email.match(new RegExp('[^\\u005D]+$', 'gi'));
+                    let rm_part_matches = parsedEmailComponents.email.match(new RegExp('[^\\u005D]+$', 'gi'));
                     if (rm_part_matches) {
-                        obj.removedTailOnEmail = rm_part_matches[0];
-                        obj.email = obj.email.replace(new RegExp('[^\\u005D]+$', 'gi'), '');
+                        parsedEmailComponents.removedTailOnEmail = rm_part_matches[0];
+                        parsedEmailComponents.email = parsedEmailComponents.email.replace(new RegExp('[^\\u005D]+$', 'gi'), '');
                     }
                 }
                 // If no uri no params, we remove suffix in case that it is non-alphabets.
                 // The regex below means "all except for '.'". It is for extracting all root domains, so non-domain types like ip are excepted.
-                let onlyEnd = obj.email.match(new RegExp('[^.]+$', 'gi'));
+                let onlyEnd = parsedEmailComponents.email.match(new RegExp('[^.]+$', 'gi'));
                 if (onlyEnd && onlyEnd.length > 0) {
                     // this is a root domain only in English like com, ac
                     // but the situation is like com가, ac나
                     if (/^[a-zA-Z]+/.test(onlyEnd[0])) {
-                        if (/[^a-zA-Z]+$/.test(obj.email)) {
+                        if (/[^a-zA-Z]+$/.test(parsedEmailComponents.email)) {
                             // remove non alphabets
-                            const matchedEmail = obj.email.match(/[^a-zA-Z]+$/);
+                            const matchedEmail = parsedEmailComponents.email.match(/[^a-zA-Z]+$/);
                             if (matchedEmail && matchedEmail.length > 0) {
-                                obj.removedTailOnEmail = matchedEmail[0] + obj.removedTailOnEmail;
+                                parsedEmailComponents.removedTailOnEmail = matchedEmail[0] + parsedEmailComponents.removedTailOnEmail;
                             }
-                            obj.email = obj.email.replace(/[^a-zA-Z]+$/, '');
+                            parsedEmailComponents.email = parsedEmailComponents.email.replace(/[^a-zA-Z]+$/, '');
                         }
                     }
                 }
@@ -73,7 +73,7 @@ exports.EmailAreaService = {
             console.log(e);
         }
         finally {
-            return obj;
+            return parsedEmailComponents;
         }
     },
     strictTest(email) {
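
Reviewer note on the EmailAreaService hunks: apart from the `obj` to `parsedEmailComponents` rename, the logic shown is a tail-trimming pass that moves trailing characters which cannot belong to the address into `removedTailOnEmail`. A rough sketch of that idea follows. `splitEmailTail` is a hypothetical helper, and the character class is an assumption that treats only ASCII letters and digits as valid final characters, which is much stricter than the `BasePatterns.noLangCharNum` pattern used in the diff.

```javascript
// Hypothetical simplification: only ASCII letters and digits may end an address.
function splitEmailTail(candidate) {
    // Find the run of disallowed characters anchored at the end of the string.
    const match = /[^a-zA-Z0-9]+$/.exec(candidate);
    if (!match) {
        return { email: candidate, removedTail: null };
    }
    return {
        email: candidate.slice(0, match.index), // everything before the tail
        removedTail: match[0],                  // the stripped trailing characters
    };
}

console.log(splitEmailTail('user@example.com).'));
console.log(splitEmailTail('user@example.com가'));
```

Returning the removed tail, rather than discarding it, lets a caller reinsert the punctuation when it rewrites the surrounding text.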
