Webcrawler #22
Heyy, I would like to test this to get a general idea of how it works. Are there any major bugs in this, or just minor ones?
Only minor bugs, and the sections they affect are disabled, so everything that is enabled runs. Word of warning: it takes about 5k-6k seconds to complete due to rate limiting. Press Ctrl+C a second time to force it to stop. How to install and run it is in the readme file. The code that scrapes the pages is in the spider folder; it generates items, which are processed by the classes in Pipelines.py.
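For readers new to Scrapy, the spider-to-pipeline flow described above can be sketched roughly as follows. This is not the code in Pipelines.py: the class name, output path, and field names are hypothetical, and only the `open_spider` / `process_item` / `close_spider` method names follow Scrapy's item pipeline convention.

```python
import csv


class CsvExportPipeline:
    """Hypothetical pipeline: writes every item the spider yields to one CSV file."""

    def __init__(self, path="matches.csv", fields=("player_id", "match_url", "date")):
        # path and fields are assumptions made for this sketch
        self.path = path
        self.fields = list(fields)

    def open_spider(self, spider):
        # Called once when the crawl starts
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(
            self.file, fieldnames=self.fields, extrasaction="ignore"
        )
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Called for every item; returning it passes it on to the next pipeline
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends
        self.file.close()
```

Scrapy only calls a pipeline like this if it is listed under `ITEM_PIPELINES` in the project settings.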
Heyy, just one thing: the crawler wastes a lot of time going through matches from the 1850s to the 1990s. When you think about it, nobody from that era plays anymore. Is it possible to add some kind of filter for that?
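A minimal sketch of such a filter, assuming each row ends with an ISO date (as in the sample rows posted in this PR). The function name, the 1990 cutoff, and the second sample row are made up for illustration; this is not the crawler's actual code.

```python
import csv
import io
from datetime import date


def filter_recent_matches(rows, cutoff_year=1990):
    """Keep only rows whose last column is an ISO date after cutoff_year."""
    recent = []
    for row in rows:
        if not row:
            continue  # ignore blank lines
        try:
            match_date = date.fromisoformat(row[-1].strip())
        except ValueError:
            continue  # skip rows without a parseable date
        if match_date.year > cutoff_year:
            recent.append(row)
    return recent


# First row is taken from the sample data; the second is an invented old match.
sample = io.StringIO(
    "5011,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18\n"
    "9999,Matches/Hypothetical.asp?MatchCode=0001,1895-06-01\n"
)
rows = list(csv.reader(sample))
recent_rows = filter_recent_matches(rows)
print(recent_rows)
```

Filtering the output like this only trims the data. To actually save crawl time, the same date check would have to run before the request for an old scorecard is scheduled.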
It looks good. The only issue is the full names + country, as it is not able to find them.
Sorry, I'm very short on time at the moment.
@scientes No problem! Do it at your own pace. Since this is an enhancement, it's no problem if it takes some time.
@scientes As you requested, I have shifted everything to a single CSV in the branch.
@scientes Heyy, it would be great if you could give me an update on when you will resume.
In a week or so at the latest.
My current solution is not the best: at the moment I'm not filtering the old games out, but I found a way to get the player names working.
That's great! You can push those changes to your branch and I'll test it out. We can make the web crawler a WIP project if needed, since you seem to be short on time.
My current status: Missing:
I believe this is good enough. Once this is merged I'll make new issues for some of them and start working on them. EDIT:
I've put some sample data in the output folders. It is not the complete data from a crawl.
5011,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
5013,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
5529,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
5530,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
I might be wrong, but does the name column contain player_ids?
oh yea
Should I fix that?
No, not necessary as of now. Until we filter old matches, let it be there.
Looks good. We will probably need to set up a DB server in the future.
@scientes Wonderful job! Thanks for contributing!! 😄 🎉
I think the problem with #20 is that you need to contribute a certain amount of code to the default branch to be counted. I think 4 SLOC or so is the minimum.
* initial crawler commit * Added readme * Bugfixes * removed Junk * Bugfix * bugfixes, added player data, sample data Co-authored-by: Bastian Große <[email protected]> Co-authored-by: scientes <[email protected]> Co-authored-by: Royston E Tauro <[email protected]>
resolves #3
It's not completely finished. There is still one bug I'm not able to fix; it lies within the crawling framework, and I can't figure out why it's not working. I have disabled that part for now. It is only needed if we want the long name of the player and the country he is playing for. (For people who know Scrapy: every request.meta dictionary entry I make vanishes by the time the request reaches its parse function, which for now stops me from passing the player's short name to the parse function that determines the long name and the country he's playing for.)
But for the rest here's an example output:
U G Dowe.csv