
sess.visit() sometimes hangs #7

Open
pommygranite opened this issue Apr 4, 2012 · 10 comments

@pommygranite

I have a program that cycles through a list of several thousand URLs on different domains, calling sess.visit() for each without creating a new session object. Usually, after visiting several hundred of these URLs, there is a visit() that does not return.
Waiting several hours has no effect; the operation has hung in visit().
When the process is interrupted, it displays this traceback:

File "/home/user1/projects/MyBot/MyScraper.py", line 50, in Scrape
sess.visit(site_url)
File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 35, in visit
return self.driver.visit(self.complete_url(url))
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 211, in visit
self.conn.issue_command("Visit", url)
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 429, in issue_command
return self._read_response()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 433, in _read_response
result = self._readline()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 467, in _readline
c = self._sock.recv(1)

If the URL that caused the problem is then visited on its own, visit() returns successfully, so the problem does not seem to be related to the URL being visited but to some internal webkit state.
The number of iterations before it hangs seems random; sometimes it happens after fewer than 100 visits, sometimes after several hundred.

Here's a script that visits the same site 1000 times and will probably demonstrate the problem at some point:

from dryscrape import Session
from dryscrape.driver.webkit import Driver
from webkit_server import InvalidResponseError

link = 'http://insert-some-site-here-that-doesnt-mind-being-hammered.com'
sess = Session(driver=Driver())
sess.set_error_tolerant(True)
for i in range(1, 1000):
    try:
        sess.visit(link)
        sess.wait()
        print 'Success iteration', i
    except InvalidResponseError as e:
        print 'InvalidResponseError:', e

@niklasb
Owner

niklasb commented Apr 27, 2012

I can reproduce this. I think it has to do with AJAX calls not finishing. The same problem occurs when visiting Gmail with the original webkit-server, for example. I have to do more investigation to find the root cause and maybe work around it.

@niklasb
Owner

niklasb commented May 10, 2014

This should be fixed in version 0.9

@niklasb niklasb closed this as completed May 10, 2014
@Wysie

Wysie commented Nov 21, 2014

Hi niklasb, sorry to comment on a closed issue, but this still seems to be happening for me in 0.9.1. I don't have any error/log messages to show yet; it just seems to be stuck on visit(). (I can't/don't know how to interrupt the process, as it's part of my init script: I start a server, the init script runs a Python script that uses dryscrape to scrape some data, and then it shuts down.) I'll update if I have more info.

@niklasb
Owner

niklasb commented Nov 21, 2014

@Wysie Does this happen on any website or just a particular one? If the latter, can you give an example URL? If the former, that's weird.

@niklasb niklasb reopened this Nov 21, 2014
@Wysie

Wysie commented Nov 24, 2014

@niklasb It seems to happen on a particular site, but it's inconsistent: sometimes it works, sometimes it doesn't. (Basically the page gets some new info at a particular time; I wake my server up and loop every 2 minutes, and once the new data is in, it scrapes it.) From some logging it seems that the issue is with at_xpath and not with visit; I'll let you know if I have more info. Thanks for your reply.
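
A minimal sketch of polling at_xpath with an explicit deadline instead of one blocking call, in case that helps narrow things down; the poll_xpath helper, the 30-second limit and the assumption that at_xpath itself returns promptly when the element is absent are my own placeholders, not anything confirmed in dryscrape:

import time

# Sketch only: poll at_xpath ourselves so a missing element surfaces as None
# after the deadline instead of an indefinite hang. Assumes at_xpath returns
# promptly when the element is not (yet) on the page.
def poll_xpath(sess, xpath, deadline=30, interval=2):
    start = time.time()
    while time.time() - start < deadline:
        node = sess.at_xpath(xpath)
        if node is not None:
            return node
        time.sleep(interval)
    return None  # give up after the deadline instead of blocking forever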

@niklasb
Owner

niklasb commented Nov 24, 2014

Can you give an example of a site where this happens sometimes?


@sjsnider

@niklasb @Wysie This exact thing happens to me at http://stats.nba.com/league/team/#!/advanced/?DateTo=11%2F3%2F2014

I can usually get about 10 successful scrapes as I work my way through all the dates, and then it starts getting hung up. I've been trying to do a sess.reset() and then sleeping before looping back and trying again, but it doesn't seem to help...
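
For reference, the reset-and-sleep workaround described above might look roughly like this; the URL list, the 10-second pause and the body-length print are placeholders, and there is no evidence yet that this actually avoids the hang:

import time
import dryscrape

# Sketch of the workaround: reset webkit state and sleep between scrapes.
urls = ['http://stats.nba.com/league/team/#!/advanced/?DateTo=11%2F3%2F2014']

sess = dryscrape.Session()
for url in urls:
    sess.visit(url)
    sess.wait()
    print(len(sess.body()))  # stand-in for the real scraping step
    sess.reset()             # drop accumulated webkit state, as tried above
    time.sleep(10)           # pause before the next visit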

@kaiwang0112006

I'm hitting the same problem. Maybe a wait_timeout parameter could be added; Ghost.py, for example, has session.wait_timeout = 20, and if the maximum wait_timeout is exceeded it throws an error.
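
Until something like that exists in dryscrape, a caller-side stand-in for wait_timeout is to put a hard alarm around visit(). This is only a sketch (Unix-only, must run in the main thread); the 20-second limit echoes the Ghost.py example above, and none of the names below are dryscrape API:

import signal

class VisitTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise VisitTimeout('visit() exceeded the time limit')

def visit_with_timeout(sess, url, seconds=20):
    # Interrupt a hung visit() with SIGALRM after `seconds` seconds.
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return sess.visit(url)
    finally:
        signal.alarm(0)                           # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)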

@YasserAntonio

YasserAntonio commented Aug 3, 2016

Hello, I solved this issue by adding "return None" at the end of the function that uses the session and by putting the session variable definition inside the same function:

import dryscrape
import time

def dostuff():
    Session = dryscrape.Session()
    Session.visit('url')
    response = Session.body()
    print(response)
    return None

dostuff()
time.sleep(120)
dostuff()  # new data will be printed

return None (same as a bare return) lets the function finish and discards the session object, so each call starts with a fresh session as if the function had never run before. Tell me if it helped.

@for-nia

for-nia commented Mar 9, 2017

webkit_server.InvalidResponseError: {"class":"InvalidResponseError","message":"Unable to load URL: https://www.instagram.com/jessiej/ because of error loading https://www.instagram.com/jessiej/: Unknown error"}

When I visit a website over HTTPS, this exception is raised.
