sess.visit() sometimes hangs #7
Comments
I can reproduce this. I think it has to do with AJAX calls not finishing; the same problem occurs when visiting Gmail with the original webkit-server, for example. I'll have to investigate further to find the root cause and maybe work around it.
This should be fixed in version 0.9.
Hi niklasb, sorry to comment on a closed issue, but this still seems to be happening for me in 0.9.1. I don't have any error/log messages to show yet; it just seems to be stuck on visit(). I can't interrupt the process to check, because it's part of my init script: I start a server, the init script runs a Python script that uses dryscrape to scrape some data, and then it shuts down. I'll update if I have more info.
@Wysie Does this happen on any website or just a particular one? If the latter, can you give an example URL? If the former, that's weird.
@niklasb It seems to happen on a particular site, but it's inconsistent: sometimes it works, sometimes it doesn't. Basically the page gets some new info at a particular time, so I wake my server up and loop every 2 minutes, and once the new data is in, it scrapes it. From some logging it seems the issue is with at_xpath and not with visit; I'll let you know if I have more info. Thanks for your reply.
Can you give an example of a site where this happens sometimes?
@niklasb @Wysie This exact thing happens to me at http://stats.nba.com/league/team/#!/advanced/?DateTo=11%2F3%2F2014. I can usually get about 10 successful scrapes as I work my way through all the dates, and then it starts getting hung up. I've been trying to do a sess.reset() and then sleeping before looping back and trying again, but it doesn't seem to help; the pattern I've been trying is sketched below.
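Roughly what I've been trying (sess is my existing dryscrape session; the attempt count and delay are arbitrary). Note that this can't rescue a visit() that never returns, since no exception is raised in that case, which may be exactly why it doesn't help:

```python
import time

def try_scrape(sess, url, attempts=3, delay=10):
    # Reset the session and sleep between attempts, as described above.
    # This does NOT interrupt a visit() that blocks forever; it only
    # clears webkit state between loop iterations.
    for _ in range(attempts):
        try:
            sess.visit(url)
            sess.wait()
            return True
        except Exception:
            sess.reset()       # drop current webkit_server state
            time.sleep(delay)  # give the server process a breather
    return False
```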
I hit the same problem. Maybe a wait_timeout parameter could be added, like Ghost.py's session.wait_timeout = 20: if the maximum wait time is exceeded, it throws an error. Something similar can be approximated outside the library in the meantime, as sketched below.
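A minimal sketch of that idea using SIGALRM (Unix only, main thread only; dryscrape itself has no such parameter, and visit_with_timeout and the 20-second limit are made up for illustration):

```python
import signal

class VisitTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise VisitTimeout('visit() exceeded the time limit')

def visit_with_timeout(sess, url, timeout=20):
    # Arrange for SIGALRM to fire if visit() blocks too long,
    # then cancel the alarm once it returns.
    signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(timeout)
    try:
        sess.visit(url)
    finally:
        signal.alarm(0)  # cancel any pending alarm
```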
Hello, I solved this issue by adding return None at the end of the function that uses the session, and by defining the session variable inside that same function. return None (same as a bare return) ends the function as if it hadn't run and discards the session object. A sketch is below; tell me if it helped.
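A minimal sketch of what I mean (dostuff and the URL are placeholders):

```python
import dryscrape

def dostuff():
    # The session lives only inside this function, so it is discarded
    # (and eligible for garbage collection) every time we return.
    sess = dryscrape.Session(base_url='http://example.com')
    sess.visit('/')
    # ... scrape here ...
    return None  # same as a bare return

dostuff()
```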
webkit_server.InvalidResponseError: {"class":"InvalidResponseError","message":"Unable to load URL: https://www.instagram.com/jessiej/ because of error loading https://www.instagram.com/jessiej/: Unknown error"}
When I visit a website over HTTPS, this exception is raised.
I have a program which cycles through a list of several thousand URLs on different domains, calling sess.visit() for each without creating a new session object. Usually, after visiting several hundred of these URLs, a visit() does not return.
Waiting several hours has no effect: the operation has hung in visit().
When the process is interrupted, it displays this trace:

```
  File "/home/user1/projects/MyBot/MyScraper.py", line 50, in Scrape
    sess.visit(site_url)
  File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 35, in visit
    return self.driver.visit(self.complete_url(url))
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 211, in visit
    self.conn.issue_command("Visit", url)
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 429, in issue_command
    return self._read_response()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 433, in _read_response
    result = self._readline()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 467, in _readline
    c = self._sock.recv(1)
```
If the URL that caused the problem is then visited on its own, visit() returns successfully. So the problem does not seem to be related to the URL being visited, but rather to some internal webkit state.
The number of iterations before the hang seems random: sometimes it occurs after fewer than 100 visits, sometimes after several hundred.
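Since the trace shows the client blocked in self._sock.recv(1), one experiment worth trying (not verified; it only works if webkit_server opens its client socket without an explicit timeout of its own) is setting a process-wide default socket timeout before creating the session, so the read eventually raises instead of blocking forever:

```python
import socket
import dryscrape

# Experimental: make blocking socket reads fail after 60 seconds
# instead of hanging indefinitely. Only effective if webkit_server
# creates its socket without an explicit timeout.
socket.setdefaulttimeout(60)

sess = dryscrape.Session()
```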
Here's a script that visits the same site 1000 times and will probably demonstrate the problem at some point:

```python
from dryscrape import Session
from dryscrape.driver.webkit import Driver
from webkit_server import InvalidResponseError

link = 'http://insert-some-site-here-that-doesnt-mind-being-hammered.com'

sess = Session(driver=Driver())
sess.set_error_tolerant(True)  # don't abort on page load errors

for i in range(1, 1001):  # 1000 visits
    try:
        sess.visit(link)
        sess.wait()
        print('Success iteration %d' % i)
    except InvalidResponseError as e:
        print('InvalidResponseError: %s' % e)
```