Duplicate URL's #23

jaikishantulswani · 2020-10-25T14:23:08Z

@j3ssie Is there any way to avoid duplicate URLs? On some domains the crawl never ends: it keeps requesting the same URLs over and over, which stretches the crawl to hours of identical requests. For example:
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/

It also needs a switch to filter out or skip URLs that return a particular status code.

StasonJatham · 2021-10-28T15:12:16Z

Hey buddy,

I figured I would post it here too:
#21 (comment)

They actually try to get rid of duplicates with their "stringset" implementation. The funny thing is that they don't need that code at all, because colly handles this for them.
The issue seems to be that URLs are only deduplicated within the same kind of source: URLs found inside forms are checked against URLs found in other forms, but not against URLs found in a tags. So if the same URL shows up in both a form and an a tag, it is visited again.
Long story short: you can modify the crawler to use colly's built-in filter, and then it works.
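To illustrate the idea of one shared check for every source, here is a minimal sketch using only the standard library. It is not gospider's or colly's code: `seenSet`, `newSeenSet` and `firstVisit` are hypothetical names made up for this example, and it compares URLs as exact strings. With colly itself, already-visited URLs are skipped by default as long as `AllowURLRevisit` is left disabled on the collector.

```go
package main

import (
	"fmt"
	"sync"
)

// seenSet is a hypothetical, concurrency-safe set of URLs that have
// already been queued, shared by every source (forms, a tags, ...).
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// firstVisit returns true only the first time a given URL string is offered.
func (s *seenSet) firstVisit(rawURL string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[rawURL]; ok {
		return false
	}
	s.seen[rawURL] = struct{}{}
	return true
}

func main() {
	seen := newSeenSet()

	// The same URL discovered in an a tag and later in a form.
	fromAnchor := "https://example.com/"
	fromForm := "https://example.com/"

	fmt.Println(seen.firstVisit(fromAnchor)) // true: visit it
	fmt.Println(seen.firstVisit(fromForm))   // false: skip the duplicate
}
```

The point is only that both code paths consult the same set before visiting, instead of each source keeping its own.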

I can't really share the code because I use it as a library, not as a command-line tool, so I removed cobra and start it with a config file.

I don't really care about status codes, but implementing that filter is easy:

// They give you a status code as such 
response.StatusCode

You can then add an if statement before .Visit() is run and check for that status code.
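A rough sketch of that check, again with the standard library only. `skipStatus` and `shouldFollow` are hypothetical names and the codes in the set are just examples; in a real crawler the status code would come from `response.StatusCode` and the guarded call would be `.Visit()`.

```go
package main

import "fmt"

// skipStatus is a hypothetical set of status codes whose pages
// should not be crawled any further.
var skipStatus = map[int]bool{
	403: true,
	404: true,
}

// shouldFollow reports whether links from a response with the given
// status code should still be visited.
func shouldFollow(statusCode int, skip map[int]bool) bool {
	return !skip[statusCode]
}

func main() {
	for _, code := range []int{200, 404} {
		if !shouldFollow(code, skipStatus) {
			fmt.Printf("[skip] - [code-%d]\n", code)
			continue
		}
		// This is where the crawler would call .Visit() on the found URL.
		fmt.Printf("[visit] - [code-%d]\n", code)
	}
}
```

The set could be filled from a command-line flag or config entry, which would give the switch requested in the original post.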
