
sess.visit() sometimes hangs #7

Open
pommygranite opened this issue Apr 4, 2012 · 10 comments

@pommygranite

I have a program that cycles through a list of several thousand URLs on different domains, calling sess.visit() for each without creating a new session object. Usually, after visiting several hundred of these URLs, there is a visit() that does not return.
Waiting several hours has no effect; the operation has hung in visit().
When the process is interrupted, it displays this traceback:

File "/home/user1/projects/MyBot/MyScraper.py", line 50, in Scrape
sess.visit(site_url)
File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 35, in visit
return self.driver.visit(self.complete_url(url))
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 211, in visit
self.conn.issue_command("Visit", url)
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 429, in issue_command
return self._read_response()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 433, in _read_response
result = self._readline()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 467, in _readline
c = self._sock.recv(1)

If the URL that caused the problem is then visited on its own, visit() returns successfully, so the problem does not seem to be related to the URL being visited but to some internal webkit state.
The number of iterations before it hangs seems random; sometimes it happens after fewer than 100 visits, sometimes after several hundred.

Here's a script that visits the same site 1000 times and will probably demonstrate the problem at some point:

from dryscrape import Session
from dryscrape.driver.webkit import Driver
from webkit_server import InvalidResponseError

link = 'http://insert-some-site-here-that-doesnt-mind-being-hammered.com'
sess = Session(driver=Driver())
sess.set_error_tolerant(True)
for i in range(1, 1000):
    try:
        sess.visit(link)
        sess.wait()
        print 'Success iteration', i
    except InvalidResponseError as e:
        print 'InvalidResponseError:', e

@niklasb
Owner

niklasb commented Apr 27, 2012

I can reproduce this. I think it has to do with AJAX calls not finishing. The same problem occurs when visiting Gmail with the original webkit-server, for example. I have to do more investigation to find the root cause and maybe work around it.

@niklasb
Owner

niklasb commented May 10, 2014

This should be fixed in version 0.9

@niklasb niklasb closed this as completed May 10, 2014
@Wysie

Wysie commented Nov 21, 2014

Hi niklasb, sorry to comment on a closed issue, but this still seems to be happening for me in 0.9.1. I don't have any error/log messages to show yet; it just seems to be stuck on visit(). (I can't/don't know how to interrupt the process, as it's part of my init script: I start a server, the init script runs a Python script that uses dryscrape to scrape some data, and then it shuts down.) I'll update if I have more info.

@niklasb
Owner

niklasb commented Nov 21, 2014

@Wysie Does this happen on any website or just a particular one? If the latter, can you give an example URL? If the former, that's weird.

@niklasb niklasb reopened this Nov 21, 2014
@Wysie

Wysie commented Nov 24, 2014

@niklasb It seems to happen on a particular site, but it's inconsistent: sometimes it works, sometimes it doesn't. (Basically the page gets some new info at a particular time; I wake my server up and loop every 2 minutes, and once the new data is in, it scrapes it.) From some logging it seems that the issue is with at_xpath and not with visit; I'll let you know if I have more info. Thanks for your reply.
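
A minimal sketch of polling at_xpath with an explicit deadline instead of one blocking call, in case that helps narrow things down; the poll_xpath helper, the 30-second limit and the assumption that at_xpath itself returns promptly when the element is absent are my own placeholders, not anything confirmed in dryscrape:

import time

# Sketch only: poll at_xpath ourselves so a missing element surfaces as None
# after the deadline instead of an indefinite hang. Assumes at_xpath returns
# promptly when the element is not (yet) on the page.
def poll_xpath(sess, xpath, deadline=30, interval=2):
    start = time.time()
    while time.time() - start < deadline:
        node = sess.at_xpath(xpath)
        if node is not None:
            return node
        time.sleep(interval)
    return None  # give up after the deadline instead of blocking forever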

@niklasb
Owner

niklasb commented Nov 24, 2014

Can you give an example of a site where this happens sometimes?


@sjsnider

@niklasb @Wysie This exact thing happens to me at http://stats.nba.com/league/team/#!/advanced/?DateTo=11%2F3%2F2014

I can usually get about 10 successful scrapes as I work my way through all the dates, and then it starts getting hung up. I've been trying to do a sess.reset() and then sleeping before looping back and trying again, but it doesn't seem to help...
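
For reference, the reset-and-sleep workaround described above might look roughly like this; the URL list, the 10-second pause and the body-length print are placeholders, and there is no evidence yet that this actually avoids the hang:

import time
import dryscrape

# Sketch of the workaround: reset webkit state and sleep between scrapes.
urls = ['http://stats.nba.com/league/team/#!/advanced/?DateTo=11%2F3%2F2014']

sess = dryscrape.Session()
for url in urls:
    sess.visit(url)
    sess.wait()
    print(len(sess.body()))  # stand-in for the real scraping step
    sess.reset()             # drop accumulated webkit state, as tried above
    time.sleep(10)           # pause before the next visit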

@kaiwang0112006

I'm hitting the same problem. Maybe a wait_timeout parameter could be added; Ghost.py, for example, has session.wait_timeout = 20, and if the maximum wait_timeout is exceeded it throws an error.
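
Until something like that exists in dryscrape, a caller-side stand-in for wait_timeout is to put a hard alarm around visit(). This is only a sketch (Unix-only, must run in the main thread); the 20-second limit echoes the Ghost.py example above, and none of the names below are dryscrape API:

import signal

class VisitTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise VisitTimeout('visit() exceeded the time limit')

def visit_with_timeout(sess, url, seconds=20):
    # Interrupt a hung visit() with SIGALRM after `seconds` seconds.
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return sess.visit(url)
    finally:
        signal.alarm(0)                           # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)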

@YasserAntonio

YasserAntonio commented Aug 3, 2016

Hello, I solved this issue by adding "return None" at the end of the function that uses the session and by putting the session variable definition inside the same function:

import dryscrape
import time

def dostuff():
    Session = dryscrape.Session()
    Session.visit('url')
    response = Session.body()
    print(response)
    return None

dostuff()
time.sleep(120)
dostuff()  # new data will be printed

return None (same as a bare return) lets the function finish and discards the session object, so each call starts with a fresh session as if the function had never run before. Tell me if it helped.

@for-nia

for-nia commented Mar 9, 2017

webkit_server.InvalidResponseError: {"class":"InvalidResponseError","message":"Unable to load URL: https://www.instagram.com/jessiej/ because of error loading https://www.instagram.com/jessiej/: Unknown error"}

When I visit a website over HTTPS, this exception is raised.
