import urllib
def get_page(url):
    # Originally a simulated get_page procedure for testing against the two
    # canned pages "http://xkcd.com/353" and "http://xkcd.com/554"; this
    # version actually grabs the page from the web.
    page = urllib.urlopen(url).read()
    return page
def get_next_target(page):
    # Find the next '<a href=' tag in page and return the quoted URL together
    # with the position of its closing quote, or (None, 0) if no link remains.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote
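# Quick sanity check for get_next_target (a minimal sketch; the HTML snippet
# and the example.com URL below are invented for illustration, not content
# from any page this script fetches):
sample = 'before <a href="http://example.com/a">a</a> after'
sample_url, sample_endpos = get_next_target(sample)
assert sample_url == "http://example.com/a"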
def union(p, q):
    # Extend p in place with every element of q that p does not already contain.
    for e in q:
        if e not in p:
            p.append(e)
    return p
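# Quick sanity check for union (a minimal sketch with invented lists): the
# first list is mutated in place, gaining only the elements it was missing.
first = [1, 2]
union(first, [2, 3])
assert first == [1, 2, 3]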
def get_all_links(page):
    # Collect every link URL on the page by repeatedly scanning for the next
    # '<a href=' tag until none is found.
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links
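# Quick sanity check for get_all_links (a minimal sketch; the two-link HTML
# snippet and the example.com URLs are invented for illustration):
two_links = '<a href="http://example.com/a">a</a> <a href="http://example.com/b">b</a>'
assert get_all_links(two_links) == ["http://example.com/a", "http://example.com/b"]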
def crawl_web(seed):
    # Crawl outward from the seed URL: pop a page from the to-crawl list,
    # merge its links into the list (without duplicates), and record the page
    # as crawled.
    print seed
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
            print "TO CRAWL ->>>>>>>>>"
            print tocrawl
            print "CRAWLED ->>>>>>>>>>>"
            print crawled
    return crawled
url = "http://www.iramykytyn.com/blog" page = urllib.urlopen(url).read() #print page links = crawl_web(url) print links