They actually try to get rid of duplicates with their own "stringset" implementation. The funny thing is that they don't need that code at all, because colly already handles de-duplication for them.
The issue seems to be that if the same URL appears in, for example, both a form and an `<a>` tag, it is not deduplicated: URLs found inside forms are only checked against URLs from other forms, not against URLs found in `<a>` tags.
Long story short: you can modify the crawler to use colly's built-in filter instead, and then it works.
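To illustrate the point, here is a minimal sketch of the visited-URL check that colly performs internally when you call `Visit()` (colly's real implementation hashes URLs into a storage backend; the `visitFilter` type and its methods here are illustrative, not colly's API):

```go
package main

import (
	"fmt"
	"sync"
)

// visitFilter mimics the de-duplication colly does internally:
// each URL is recorded on first sight and refused afterwards.
type visitFilter struct {
	mu      sync.Mutex
	visited map[string]bool
}

func newVisitFilter() *visitFilter {
	return &visitFilter{visited: make(map[string]bool)}
}

// shouldVisit returns true the first time a URL is seen, false afterwards.
func (f *visitFilter) shouldVisit(url string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.visited[url] {
		return false
	}
	f.visited[url] = true
	return true
}

func main() {
	f := newVisitFilter()
	urls := []string{
		"https://example.com/",      // found in a form
		"https://example.com/login", // found in a form
		"https://example.com/",      // same URL, found in an <a> tag
	}
	for _, u := range urls {
		fmt.Printf("%s visit=%v\n", u, f.shouldVisit(u))
	}
}
```

The key property is that the filter is global across all tag types, so a URL discovered in a form and again in an `<a>` tag is still only visited once.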
I can't really share my code, because I use it as a library rather than a command-line tool, so I stripped out cobra and start it with a config file.
I don't really care about the status code myself, but implementing that filter is easy.
```go
// They give you a status code on the response object:
response.StatusCode
```
You can then add an `if` statement before `.Visit()` runs and check for that status code.
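A self-contained sketch of that pattern, using only the Go standard library so it runs without colly (the `skipStatus` helper and the test server are assumptions for illustration; in a real colly crawler you would read `r.StatusCode` in an `OnResponse` callback instead):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// skipStatus reports whether a status code is in the skip list,
// i.e. the "if statement before .Visit()" suggested above.
func skipStatus(code int, skip []int) bool {
	for _, s := range skip {
		if code == s {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical local server: /missing returns 404, everything else 200.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	skip := []int{403, 404}
	for _, path := range []string{"/", "/missing"} {
		resp, err := http.Get(srv.URL + path)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		if skipStatus(resp.StatusCode, skip) {
			// Skip instead of queuing further Visit() calls.
			fmt.Printf("[skip] - [code-%d] - %s\n", resp.StatusCode, path)
			continue
		}
		fmt.Printf("[url] - [code-%d] - %s\n", resp.StatusCode, path)
	}
}
```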
@j3ssie is there any way to avoid duplicate URLs? On some domains the crawl never ends and keeps repeating duplicate URLs, which adds hours of crawling spent on the same requests:
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
It would also need a switch to filter/skip URLs that return a particular status code.
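Such a switch could be a comma-separated list of codes. A minimal sketch of the parsing side (the flag value format is hypothetical; the tool has no such switch yet):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSkipCodes turns a hypothetical flag value such as "404,500"
// into a set of status codes whose URLs should be skipped.
func parseSkipCodes(value string) (map[int]bool, error) {
	codes := make(map[int]bool)
	if value == "" {
		return codes, nil
	}
	for _, part := range strings.Split(value, ",") {
		code, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil {
			return nil, fmt.Errorf("invalid status code %q: %w", part, err)
		}
		codes[code] = true
	}
	return codes, nil
}

func main() {
	codes, err := parseSkipCodes("404, 500")
	if err != nil {
		panic(err)
	}
	fmt.Println(codes[404], codes[500], codes[200]) // true true false
}
```

The resulting set can then be consulted before printing or re-queuing a URL, as in the status-code check discussed earlier in the thread.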