iramykytyn/webcrawler


import urllib

def get_page(url):
    """Fetch `url` and return its raw contents, or "" if the fetch fails."""
    # Live replacement for the simulated get_page procedure that served
    # canned copies of "http://xkcd.com/353" and "http://xkcd.com/554".
    try:
        return urllib.urlopen(url).read()
    except IOError:
        # An unreachable or malformed URL should not abort the whole crawl.
        return ""
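The source targets Python 2, where `urllib.urlopen` exists. As a hedged sketch only, a Python 3 equivalent might look like the following; the `timeout` parameter and the UTF-8 decoding are assumptions that do not appear in the original.

```python
from urllib.request import urlopen
from urllib.error import URLError


def get_page(url, timeout=10):
    """Return the decoded body of `url`, or "" if it cannot be fetched.

    `timeout` and the UTF-8 decoding are illustrative assumptions.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            # Replace undecodable bytes instead of failing on odd encodings.
            return response.read().decode("utf-8", errors="replace")
    except (URLError, ValueError, OSError):
        # ValueError covers strings that are not URLs at all.
        return ""
```

Returning an empty string on failure mirrors the `return ""` fallback of the simulated version, so the link extractor simply finds no links on a page that could not be loaded.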

'''
    Simulated get_page body, kept for offline testing (page HTML abbreviated):

    try:
        if url == "http://xkcd.com/353":
            return '<title>xkcd: Python</title> ...'
        elif url == "http://xkcd.com/554":
            return '<title>xkcd: Not Enough Work</title> ...'
    except:
        return ""
    return ""
'''

def get_next_target(page):
    """Return (url, end_quote) for the first '<a href=' link in `page`.

    Returns (None, 0) when `page` contains no further links.
    """
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote
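To illustrate how `get_next_target` walks a page, here is a small worked example. The function is restated so the block runs on its own, and the `sample` string and `example.com` URLs are made up for illustration.

```python
def get_next_target(page):
    """Return (url, end_quote) for the first '<a href=' link, else (None, 0)."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


# Hypothetical page fragment with two links.
sample = ('intro <a href="http://example.com/a">A</a> and '
          '<a href="http://example.com/b">B</a>')

# First call finds the first link and the index of its closing quote.
first_url, first_end = get_next_target(sample)

# Slicing from that index and calling again finds the next link.
second_url, _ = get_next_target(sample[first_end:])

# A page without links yields the (None, 0) sentinel.
no_url, no_end = get_next_target("no anchors here")
```

The returned position is what lets a caller resume the search after each link instead of rescanning the page from the start.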

def union(p, q):
    """Append to `p`, in place, each element of `q` not already in `p`."""
    for e in q:
        if e not in p:
            p.append(e)
    return p
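A short usage sketch of `union`, restated here so it runs standalone. The list contents are placeholder values.

```python
def union(p, q):
    """Append to `p`, in place, each element of `q` not already in `p`."""
    for e in q:
        if e not in p:
            p.append(e)
    return p


tocrawl = ["a", "b"]
# "b" is already present and the second "c" is a duplicate, so only one
# "c" is appended.
result = union(tocrawl, ["b", "c", "c"])
```

Because `p` is mutated in place and also returned, `result` and `tocrawl` refer to the same list. The membership test `e not in p` scans the list each time, which is fine for a small crawl but grows costly as the frontier gets large.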

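def get_all_links(page):
    """Return every link target found in `page`, in document order."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        # Compare against None: an empty href="" is a falsy URL and would
        # otherwise end the scan before the remaining links are read.
        if url is None:
            break
        links.append(url)
        page = page[endpos:]
    return links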
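The following sketch shows link extraction on a made-up HTML fragment. It restates both helpers so it is self-contained, and it checks `url is None` rather than truthiness so that an empty `href=""` does not stop the scan early; the `html` string is purely illustrative.

```python
def get_next_target(page):
    """Return (url, end_quote) for the first '<a href=' link, else (None, 0)."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every link target in `page`, in document order."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:  # sentinel: no more links
            break
        links.append(url)
        page = page[endpos:]  # continue after the closing quote
    return links


# Hypothetical fragment; the middle anchor has an empty href.
html = '<a href="/one">1</a> <a href="">empty</a> <a href="/two">2</a>'
links = get_all_links(html)
```

Here `links` ends up as `["/one", "", "/two"]`. With a plain truthiness check the empty string would end the loop after the first link.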

def crawl_web(seed):
    """Crawl outward from `seed` and return the list of pages visited."""
    print(seed)
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
        # Progress output after each iteration.
        print("TO CRAWL ->>>>>>>>>")
        print(tocrawl)
        print("CRAWLED ->>>>>>>>>>>")
        print(crawled)
    return crawled
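To see the crawl order without touching the network, the loop can be run against an in-memory set of pages, in the spirit of the simulated `get_page`. In this sketch `FAKE_WEB`, `fake_get_page`, the `example.test` URLs, and the extra `get_page` parameter on `crawl_web` are all assumptions added for illustration; the original function takes only `seed`.

```python
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]
    return links


def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)
    return p


def crawl_web(seed, get_page):
    """Same loop as the original, with the fetcher passed in (assumption)."""
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()  # pops the most recently added URL first
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled


# Hypothetical three-page site: a links to b and c, b links back to a.
FAKE_WEB = {
    "http://example.test/a": '<a href="http://example.test/b">b</a> '
                             '<a href="http://example.test/c">c</a>',
    "http://example.test/b": '<a href="http://example.test/a">back</a>',
    "http://example.test/c": '',
}


def fake_get_page(url):
    """Stand-in fetcher: unknown URLs behave like empty pages."""
    return FAKE_WEB.get(url, "")


order = crawl_web("http://example.test/a", fake_get_page)
```

Since `tocrawl.pop()` takes from the end of the list, page `c` is visited before `b`, and the back-link to `a` is skipped because `a` is already in `crawled`. That check is what keeps cycles from looping forever.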

if __name__ == "__main__":
    url = "http://www.iramykytyn.com/blog"
    links = crawl_web(url)
    print(links)
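One limitation worth noting: the crawler queues `href` values exactly as they appear, so relative links such as `/about` are later passed to `get_page` without a host. A possible remedy, sketched here in Python 3 under the assumption that links should be resolved against the page they came from, uses the standard `urllib.parse.urljoin`. The helper name `absolutize` and the sample links are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Resolve each link against `base_url` (hypothetical helper)."""
    return [urljoin(base_url, link) for link in links]


resolved = absolutize(
    "http://www.iramykytyn.com/blog",
    ["/about", "post-1", "http://xkcd.com/353"],
)
```

Absolute URLs pass through unchanged, while root-relative and path-relative links are rewritten against the base, so each entry in `resolved` is something a fetcher can open directly.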
