Unable to fetch filters #11

Open
MarcSeebold opened this issue Jan 24, 2023 · 7 comments

Comments

@MarcSeebold

query/filters.py:get_addl_filters is unable to crawl the page. The HTML returned by search_html = next(sessions.yield_html(url)) contains no listings, only this script:

window.cl.specialCurtainMessages = { unsupportedBrowser: [ "We've detected you are using a browser that is missing critical features.", "Please visit craigslist from a modern browser." ], unrecoverableError: [ "There was an error loading the page." ] };

I guess Craigslist put some new anti-crawling features in place.
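
A scraper could at least notice when it has been served this placeholder instead of real results. The following is a minimal sketch, not part of the library: the helper name `looks_like_curtain_page` is hypothetical, and it assumes the `window.cl.specialCurtainMessages` script quoted above is a usable signal. That marker may also be present on pages that render normally in a real browser, so treat a match as a hint rather than proof.

```python
# Marker string taken from the script quoted in the report above.
CURTAIN_MARKER = "window.cl.specialCurtainMessages"


def looks_like_curtain_page(html):
    """Return True if the HTML contains the curtain-message script.

    Heuristic only: a match means the marker is present, not that the
    request was definitely blocked.
    """
    return CURTAIN_MARKER in html
```

In get_addl_filters, a check like this could run on search_html before parsing, so the failure surfaces as a clear error instead of an empty filter list.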

@irahorecka
Owner

Hey @MarcSeebold - yes they did. I'm (passively) trying to figure out a way to bypass this. The obvious next step, Selenium, doesn't work. I also tried other JS libraries similar to Selenium, but to no avail.

@MarcSeebold
Author

MarcSeebold commented Jan 24, 2023

curl 'https://sapi.craigslist.org/web/v7/postings/search/full?batch=13-0-360-0-0&cc=US&lang=en&searchPath=cta' works for me

(that's the first JSON response when loading the page in Firefox)

Screenshot_2023-01-24_10-23-05
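
For reference, the same request can be sketched in Python with only the standard library. This is a rough sketch under assumptions: the endpoint and query values are copied from the curl command above, the function and parameter names are hypothetical, and since the endpoint is not documented here, the meaning of `batch` and the shape of the returned JSON are unknown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl command above.
SEARCH_ENDPOINT = "https://sapi.craigslist.org/web/v7/postings/search/full"


def build_search_url(batch="13-0-360-0-0", cc="US", lang="en", search_path="cta"):
    """Rebuild the URL from the curl command; defaults mirror that command."""
    params = {"batch": batch, "cc": cc, "lang": lang, "searchPath": search_path}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def fetch_search_json(url, timeout=10):
    """Fetch the URL and decode the body as JSON (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_search_json(build_search_url()) should issue the same GET as the curl command, although the server may treat Python's default request headers differently from curl's.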

@irahorecka
Owner

@MarcSeebold Ah that is interesting - works in Python as well. However, it looks like a lot of the critical attributes of a CL post (price, number of cylinders, car make, car transmission, etc.) are omitted. I'm not sure how much useful information I could pull out of these entries.

@MarcSeebold
Author

I think Selenium is the right way to go. We just have to figure out how CL detects that it's not a "supported" browser. In my experience, changing the User-Agent has sometimes helped.

@MarcSeebold
Author

MarcSeebold commented Jan 24, 2023

Also, try it without --headless first. I've noticed that headless mode sometimes behaves differently.
Here are the Selenium settings I used to crawl KBB:

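    from selenium import webdriver

    # Assumption: `option` is a Chrome options object (the User-Agent below is Chrome's).
    option = webdriver.ChromeOptions()
    # Adjacent string literals are concatenated, so keep the space before "(KHTML".
    user_agent = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36"
    )
    option.add_argument("user-agent={0}".format(user_agent))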
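
The two suggestions here (a custom User-Agent, and toggling --headless) can be combined into one small helper. This is only a sketch: the names `build_chrome_args` and `make_driver` are hypothetical, it assumes Chrome with a matching chromedriver is installed, and nothing in this thread confirms that these settings get past Craigslist's check.

```python
# Same User-Agent as in the settings above, with a space before "(KHTML".
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36"
)


def build_chrome_args(user_agent=DEFAULT_USER_AGENT, headless=False):
    """Return the Chrome command-line arguments for the chosen settings."""
    args = ["user-agent={0}".format(user_agent)]
    if headless:
        args.append("--headless")
    return args


def make_driver(headless=False):
    """Start Chrome through Selenium with the arguments above."""
    # Imported lazily so build_chrome_args works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Keeping the argument list separate from driver creation makes it easy to compare a headless and a non-headless run with otherwise identical settings.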

@irahorecka
Owner

Ok - I could give the non --headless option a go; the --headless route is detected by CL.
