Duplicate URL's #23

Open
jaikishantulswani opened this issue Oct 25, 2020 · 1 comment

@jaikishantulswani

@j3ssie is there any way to avoid duplicate URLs? On some domains the crawl never ends: it keeps revisiting the same URLs, which stretches the crawl to hours of repeated requests.
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/

It also needs a switch to filter / skip URLs that return a particular status code.

@StasonJatham

Hey buddy,

I figured I would post it in here too:
#21 (comment)

They actually try to get rid of duplicates with their "stringset" implementation. Funny thing is, they don't need that code at all, because colly handles this for them.
The issue seems to be that URLs are only deduplicated within one kind of source: a URL found inside a form is checked against URLs found in other forms, but not against URLs found in a tags. So if the same URL shows up in both a form and an a tag, it gets visited again.
Long story short: you can modify the crawler to use colly's built-in filter, and then it works.
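To make the forms-versus-a-tags point concrete, here is a minimal sketch of the idea in plain Go: one visited set shared by every URL source. This is not the project's code and not colly's internals; `seenSet`, `normalize` and `firstVisit` are hypothetical names made up for illustration. As far as I know colly does the equivalent by itself as long as URL revisits are not enabled on the collector, so with colly you should not need a helper like this.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// seenSet is a hypothetical helper: a single visited set that every
// URL source (a tags, forms, scripts, ...) goes through.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// normalize drops the fragment and lowercases scheme and host so that
// trivially different spellings of the same URL compare equal.
func normalize(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	return u.String(), nil
}

// firstVisit reports true only the first time a URL is offered,
// no matter which source it came from.
func (s *seenSet) firstVisit(raw string) bool {
	key, err := normalize(raw)
	if err != nil {
		return false // unparsable URLs are skipped
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[key]; ok {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

func main() {
	seen := newSeenSet()
	fromLinks := []string{"https://example.com/", "https://example.com/login"}
	fromForms := []string{"https://example.com/login", "https://EXAMPLE.com/#top"}

	// Both sources are checked against the same set.
	for _, u := range append(fromLinks, fromForms...) {
		if seen.firstVisit(u) {
			fmt.Println("visit:", u)
		}
	}
}
```

In this sketch both URLs from the form list are rejected, because they normalize to URLs already seen in the link list.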

I can't really share the code because I use it as a library, not a command-line tool, so I took out cobra and start it with a config file.

I don't really care about status codes, but implementing that filter is easy.

// They give you a status code as such 
response.StatusCode

You can then add an if statement before .Visit() is run and check for that status code.
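A rough sketch of that check, kept to the standard library so it runs on its own. `skipStatus` and `shouldFollow` are hypothetical names, and the set of codes is only an example. In a colly callback you would pass the response's `StatusCode` into something like this before calling `.Visit()` on the links you found; where exactly that goes in the crawler is an assumption on my side.

```go
package main

import (
	"fmt"
	"net/http"
)

// skipStatus is a hypothetical set of status codes whose pages
// should not be crawled any further.
var skipStatus = map[int]bool{
	http.StatusForbidden: true,
	http.StatusNotFound:  true,
}

// shouldFollow reports whether links found on a response with this
// status code should be queued for a visit.
func shouldFollow(statusCode int) bool {
	return !skipStatus[statusCode]
}

func main() {
	for _, code := range []int{200, 403, 404} {
		if !shouldFollow(code) {
			fmt.Printf("skip links from %d response\n", code)
			continue
		}
		// This is the point where the crawler would call .Visit().
		fmt.Printf("follow links from %d response\n", code)
	}
}
```

A command-line switch would then only need to fill `skipStatus` from user input.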
