This repository has been archived by the owner on Jun 22, 2024. It is now read-only.

Webcrawler #22

Merged: 7 commits, Nov 13, 2020

Conversation

Contributor

@scientes scientes commented Oct 5, 2020

resolves #3

It's not completely finished. There is still one bug that I'm not able to fix; it lies within the crawling framework and I can't figure out why it's not working. I have disabled that part for now, and it is only needed if we need the long name of the player and the country he is playing for. (For people who know Scrapy: every request.meta dictionary entry I make vanishes when reaching the parse function of the request, which for now stops me from passing the short name of the player to the parse function that determines the long name and the country he's playing for.)

But for the rest, here's an example output:

U G Dowe.csv

matchid,date
Matches/MatchScorecard.asp?MatchCode=0714,1973-02-16
Matches/MatchScorecard.asp?MatchCode=0693,1972-02-16
Matches/MatchScorecard.asp?MatchCode=0684,1971-04-13
Matches/MatchScorecard.asp?MatchCode=0683,1971-04-01

@roysti10
Member

roysti10 commented Oct 7, 2020

Hey, I would like to test this to get a general idea of how it works. Are there any major bugs in this, or just minor ones?

@scientes
Contributor Author

scientes commented Oct 7, 2020

Only minor bugs, and the affected sections are disabled, so everything that is enabled runs. Word of warning: it takes about 5k-6k seconds to complete due to rate limiting. Press Ctrl+C to end it, and press it a second time to force it. How to install and run it is in the readme file. The code that scrapes the pages is in the spider folder; it generates items which are processed by the classes in Pipelines.py.

@roysti10
Member

roysti10 commented Oct 8, 2020

Hey, just one thing: the crawler spends a lot of time going through matches from the 1850s to the 1990s. When you think about it, nobody from that era plays anymore. Is it possible to add some kind of filter for it?
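A rough sketch of what such a filter could look like, using the `matchid,date` rows from the example output above. The cutoff date and the function name `is_recent` are assumptions made for illustration and are not part of the crawler.

```python
from datetime import date

# Assumed cutoff; pick whatever era is still relevant.
CUTOFF = date(1990, 1, 1)


def is_recent(match_date: str, cutoff: date = CUTOFF) -> bool:
    """Return True if an ISO date string (YYYY-MM-DD) is on or after the cutoff."""
    return date.fromisoformat(match_date) >= cutoff


rows = [
    ("Matches/MatchScorecard.asp?MatchCode=0714", "1973-02-16"),
    ("Matches/MatchScorecard_T20.asp?MatchCode=0964", "2019-10-18"),
]

# Keep only the rows whose match date passes the filter.
recent_rows = [row for row in rows if is_recent(row[1])]
```

Filtering rows after the crawl only trims the output. To actually save crawl time, the same check would have to run before the crawler follows a match link, which assumes the date is already visible on the listing page.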

@roysti10
Member

roysti10 commented Oct 8, 2020

It looks good. The only issue is the full names and countries, as it is not able to find them.
I initially scraped them from http://www.howstat.com/cricket/Statistics/Players/PlayerListCurrent.asp
Each name in it had a player ID attached to it. Adding that player ID to http://www.howstat.com/cricket/Statistics/Players/
took me to the page of that player, from which I took the name.
Hope this helps!
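To make the idea concrete, here is a small sketch of turning a link from the player list into a full player-page URL and reading the ID back out of it. The base URL is the one given above; the relative link `ExamplePlayerPage.asp?PlayerID=1234` and the query parameter name `PlayerID` are placeholders I made up, since the exact link format is not stated here.

```python
from typing import Optional
from urllib.parse import parse_qs, urljoin, urlsplit

PLAYERS_BASE = "http://www.howstat.com/cricket/Statistics/Players/"


def player_page_url(relative_link: str) -> str:
    """Join a relative link taken from the player list onto the players base URL."""
    return urljoin(PLAYERS_BASE, relative_link)


def player_id(url: str, param: str = "PlayerID") -> Optional[str]:
    """Return the ID from the query string, or None if the parameter is absent.

    The parameter name is an assumption; adjust it to the real links.
    """
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


# Hypothetical link, for illustration only.
url = player_page_url("ExamplePlayerPage.asp?PlayerID=1234")
```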

@scientes
Contributor Author

Sorry, I'm very short on time at the moment.

@roysti10
Member

@scientes No problem! Do it at your own pace. Since this is an enhancement, it's no problem if it takes some time.

@roysti10
Member

roysti10 commented Oct 19, 2020

@scientes As you requested, I have moved everything into a single CSV in the master branch.
As for long names, I would actually prefer them, because there are a lot of players with the same initials and that might cause errors. The same goes for countries.

@roysti10
Member

roysti10 commented Nov 2, 2020

@scientes Hey, it would be great if you could give me an update on when you will resume.
Thanks

@scientes
Contributor Author

scientes commented Nov 2, 2020 via email

@scientes
Contributor Author

My current solution is not the best: at the moment I'm not filtering the old games out, but I found a way to get the player names working.

@roysti10
Member

roysti10 commented Nov 13, 2020

My current solution is not the best: at the moment I'm not filtering the old games out, but I found a way to get the player names working.

That's great! You can push those changes to your branch. I'll test it out. We can make the web crawler a WIP project if needed, since you seem to be short on time.
I created a new branch for this PR; please redirect it there and I'll merge it there. You can continue to help once you are free.
Once the conflicts are resolved, we can merge it.
Thank you for this! I really appreciate your help. This really is an important part of our project 😄

@roysti10 roysti10 marked this pull request as ready for review November 13, 2020 08:25
@scientes
Contributor Author

My current status:
I get the player data in one file (id,name,gametype,retired)
I get the match IDs in another file (grouped in folders by ODI/T20/TEST)

Missing:

  • bowling

  • batting

  • wicketkeeping

  • keeping track across different runs so that we can crawl incrementally

  • a small pandas script which maybe cleans and reformats the data
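One possible shape for the run-to-run tracking item, sketched with the standard library only. The state file name, the JSON format, and the function names are assumptions for illustration; nothing like this exists in the PR yet.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_seen(state_file: Path) -> Set[str]:
    """Return the match IDs recorded by earlier runs (empty on the first run)."""
    if state_file.exists():
        return set(json.loads(state_file.read_text(encoding="utf-8")))
    return set()


def save_seen(state_file: Path, seen: Set[str]) -> None:
    """Persist the seen match IDs as a sorted JSON list."""
    state_file.write_text(json.dumps(sorted(seen)), encoding="utf-8")


def select_new(match_ids: Iterable[str], seen: Set[str]) -> List[str]:
    """Keep only the match IDs that no earlier run has recorded, in input order."""
    return [match_id for match_id in match_ids if match_id not in seen]
```

A run would call `load_seen`, crawl only what `select_new` returns, add those IDs to the set, and finish with `save_seen`, so the next run skips everything already fetched.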

@roysti10
Member

roysti10 commented Nov 13, 2020

My current status:
I get the player data in one file (id,name,gametype,retired)
I get the match IDs in another file (grouped in folders by ODI/T20/TEST)

Missing:

  • bowling
  • batting
  • wicketkeeping
  • keeping track across different runs so that we can crawl incrementally
  • a small pandas script which maybe cleans and reformats the data

I believe this is good enough. Once this is merged, I'll make new issues for some of them and start working on them.
The only conflicting file seems to be .gitignore, which should be easy to resolve.
I recommend you sync your webcrawler with the repo's webcrawler branch, and then we can merge 🎉

EDIT:
Since it looks like you have succeeded in segregating records, I'm guessing issues #7, #17, and partly #5 also get fixed.
I'll be closing those issues as well, then.

@roysti10 roysti10 added the data and enhancement labels Nov 13, 2020
@roysti10
Member

Also @scientes, even though I merged #20, I didn't see your name in the contributors list for some reason.
Just wanted to let you know; I don't want that to happen in this case too.

This was linked to issues Nov 13, 2020
@scientes
Contributor Author

I've put some sample data in the output folders. It's not the complete data from the crawl.

@roysti10 roysti10 changed the base branch from master to feature-webcrawler November 13, 2020 10:09
5011,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
5013,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
5529,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
5530,Matches/MatchScorecard_T20.asp?MatchCode=0964,2019-10-18
Member

I might be wrong, but does the name column contain player_ids?

Contributor Author

oh yea

Contributor Author

Should I fix that?

Member

No, not necessary as of now. Until we filter old matches, let it be there.

@roysti10
Member

I've put some sample data in the output folders. It's not the complete data from the crawl.

Looks good. We will probably need to set up a DB server in the future.
I'll be merging this soon, then.

@roysti10
Member

@scientes Wonderful job! Thanks for contributing!! 😄 🎉

@roysti10 roysti10 merged commit ad5fc79 into HackerSpace-PESU:feature-webcrawler Nov 13, 2020
@scientes
Contributor Author

I think the problem with #20 is that you need to contribute a certain amount of code to the default branch to be counted. I think 4 SLOC or so is the minimum.

roysti10 added a commit that referenced this pull request Nov 13, 2020
* initial crawler commit

* Added readme

* Bugfixes

* removed Junk

* Bugfix

* bugfixes, added player data, sample data

Co-authored-by: Bastian Große <[email protected]>
Co-authored-by: scientes <[email protected]>
Co-authored-by: Royston E Tauro <[email protected]>
roysti10 added a commit that referenced this pull request Nov 14, 2020
roysti10 added a commit that referenced this pull request Nov 14, 2020
roysti10 added a commit that referenced this pull request Nov 14, 2020
roysti10 added a commit that referenced this pull request Nov 15, 2020
roysti10 added a commit that referenced this pull request Nov 15, 2020
roysti10 added a commit that referenced this pull request Nov 15, 2020
roysti10 added a commit that referenced this pull request Nov 15, 2020
Labels
data, enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More player records for ODI Test Player Records Required Setup Web-Crawler for daily updates
2 participants