-
Notifications
You must be signed in to change notification settings - Fork 1
Fixed adress parsing and page extraction #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! There are some minor things that I'd like to see fixed or don't understand, otherwise this is great. However, I still find some error messages in my logs when I run this:
Traceback (most recent call last):
File "./scrape.py", line 409, in <module>
get_new_listings(db)
File "./scrape.py", line 352, in get_new_listings
listings = extract_listings(page)
File "./scrape.py", line 159, in extract_listings
street_span = entry.find('div', class_='result-list-entry__address').find('span').contents[0]
AttributeError: 'NoneType' object has no attribute 'contents'
Looks like we either need to check whether that element we're looking for is really there (in case it's optional) or we need to fix the way we're looking for it (in case it's always there but our selector sometimes fails).
@@ -47,7 +47,7 @@ | |||
# Immobilienscout24 URLs for listings in Karlsruhe | |||
BASE_URL = 'http://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe' | |||
PAGE_URL = 'http://www.immobilienscout24.de/Suche/S-T/P-%d/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe?pagerReporting=true' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess these URLs need to be changed as well? The best way would be if one could set just the city using a single variable, perhaps we can use a generic search URL (if one exists)?
content = unicode(dd.string).strip() | ||
if content.endswith('€'): | ||
if content.endswith(' €'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new check covers less cases then the old (it needs an additional space). Why is that necessary?
rent = parse_german_float(content.split()[0]) | ||
elif content.endswith('m²'): | ||
elif content.endswith(' m²'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
@@ -173,15 +174,16 @@ def extract_listings(soup): | |||
'rent': rent, | |||
'area': area, | |||
} | |||
print(listings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now the scraper doesn't output anything to STDOUT, all output goes into the database or into the log file. I'd like to keep it that way for the moment.
@@ -346,7 +348,7 @@ def get_new_listings(db): | |||
while (not num_pages) or (page_index <= num_pages): | |||
logger.info("Fetching page %d" % page_index) | |||
page = get_page(page_index) | |||
num_pages = num_pages or extract_number_of_pages(page) | |||
num_pages = extract_number_of_pages(page) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change necessary?
@@ -47,7 +47,7 @@ | |||
# Immobilienscout24 URLs for listings in Karlsruhe | |||
BASE_URL = 'http://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe' | |||
PAGE_URL = 'http://www.immobilienscout24.de/Suche/S-T/P-%d/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe?pagerReporting=true' | |||
|
|||
CITY = 'Karlsruhe' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And can we please make this a command line parameter while we're at it?
No description provided.