Unable to fetch filters #11

MarcSeebold · 2023-01-24T01:08:17Z

get_addl_filters in query/filters.py is unable to crawl the page. The HTML returned by search_html = next(sessions.yield_html(url)) contains:

window.cl.specialCurtainMessages = { unsupportedBrowser: [ "We've detected you are using a browser that is missing critical features.", "Please visit craigslist from a modern browser." ], unrecoverableError: [ "There was an error loading the page." ] };

I guess Craigslist put some new anti-crawling features in place.
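A minimal sketch of how the crawler could recognize this stub page instead of failing later while parsing filters. The helper name `looks_like_curtain_page` and the marker list are hypothetical; the marker strings are simply taken from the script quoted above, so this is a heuristic that may break whenever Craigslist changes the page.

```python
# Heuristic only: these marker strings come from the response quoted above.
# Craigslist may rename or remove them at any time.
CURTAIN_MARKERS = (
    "specialCurtainMessages",
    "unsupportedBrowser",
)


def looks_like_curtain_page(html: str) -> bool:
    """Return True if the HTML appears to be the 'unsupported browser' stub."""
    return any(marker in html for marker in CURTAIN_MARKERS)


# Shortened stand-in for the blocked response, plus an ordinary page for contrast.
blocked = 'window.cl.specialCurtainMessages = { unsupportedBrowser: [ "..." ] };'
normal = "<html><body><ul class='rows'></ul></body></html>"

print(looks_like_curtain_page(blocked))
print(looks_like_curtain_page(normal))
```

Checking for the stub up front would let get_addl_filters raise a clear error rather than silently returning no filters.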


irahorecka · 2023-01-24T01:43:47Z

Hey @MarcSeebold - yes they did. I'm (passively) trying to figure out a way to bypass this. The obvious next step, Selenium, doesn't work. I also tried other JS-rendering libraries similar to Selenium, but to no avail.

MarcSeebold · 2023-01-24T17:22:32Z

curl 'https://sapi.craigslist.org/web/v7/postings/search/full?batch=13-0-360-0-0&cc=US&lang=en&searchPath=cta' works for me

(that's the first JSON response when loading the page in Firefox)
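For reference, a rough Python equivalent of the curl call, using only the standard library. The endpoint and query values are copied from the command above; the function names are hypothetical, and the meaning of the `batch` value is not known here, so it is passed through unchanged. Whether Craigslist accepts the default urllib headers is untested, so treat this as a sketch rather than a working client.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint copied verbatim from the curl command above.
SAPI_SEARCH_URL = "https://sapi.craigslist.org/web/v7/postings/search/full"


def build_search_url(search_path="cta", batch="13-0-360-0-0", cc="US", lang="en"):
    """Assemble the search URL with the same query parameters as the curl call."""
    query = urlencode(
        {"batch": batch, "cc": cc, "lang": lang, "searchPath": search_path}
    )
    return f"{SAPI_SEARCH_URL}?{query}"


def fetch_search_json(url, timeout=10.0):
    """GET the URL and decode the JSON body (performs a network request)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access; fetch_search_json(url) does.
url = build_search_url()
print(url)
```

Calling `fetch_search_json(url)` would then return the decoded JSON as Python objects for inspection.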

irahorecka · 2023-01-24T22:49:07Z

@MarcSeebold Ah that is interesting - works in Python as well. However, it looks like a lot of the critical attributes of a CL post (price, number of cylinders, car make, car transmission, etc.) are omitted. I'm not sure how much useful information I could pull out of these entries.

MarcSeebold · 2023-01-24T22:51:23Z

I think Selenium is the right way to go. We just have to figure out how CL detects that it's not a "supported" browser. From my experience, changing the User-Agent sometimes helped.

MarcSeebold · 2023-01-24T22:53:16Z

Also, try it without --headless first; I noticed that it sometimes behaves differently.
These are the Selenium settings I used to crawl KBB:

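    from selenium import webdriver

    option = webdriver.ChromeOptions()  # assumed: the snippet targets Chrome
    # Adjacent string literals are joined as-is, so the first part needs a trailing space.
    user_agent = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36"
    )
    option.add_argument("user-agent={0}".format(user_agent))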

irahorecka · 2023-01-24T22:54:33Z

Ok - I could give the non --headless option a go; the --headless route is detected by CL.
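To compare the headless and non-headless runs side by side, something like the following sketch could help. It assumes Chrome is the target browser; `chrome_arguments` and `make_driver` are hypothetical helper names, and the User-Agent is the one from the comment above with a space inserted between its two literal parts. Selenium is imported lazily so the argument-building part can be checked without a browser installed.

```python
from typing import List, Optional

# User-Agent from the comment above, with a space between the two literal parts.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36"
)


def chrome_arguments(
    headless: bool = False,
    user_agent: Optional[str] = DEFAULT_USER_AGENT,
) -> List[str]:
    """Build the Chrome command-line arguments for one crawl attempt."""
    args = []
    if headless:
        args.append("--headless")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def make_driver(headless: bool = False):
    """Start Chrome via Selenium (needs selenium and Chrome installed)."""
    from selenium import webdriver  # imported lazily on purpose

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


# Inspect the arguments for the non-headless attempt without launching a browser.
print(chrome_arguments(headless=False))
```

Running `make_driver(headless=False)` and `make_driver(headless=True)` against the same URL would show whether only the headless session gets the "unsupported browser" page.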

MarcSeebold · 2023-02-20T20:06:30Z

Maybe that will do the trick: https://antoinevastel.com/bot%20detection/2023/02/19/new-headless-chrome.html
