simple link crawler [PYTHON]

01-10-2014, 11:48 PM #1

Complete Speed

This is a simple crawler for web pages. It's not very advanced, but it's a good starting point for learning if you want to build a search engine, index a page of your own, or whatever else you can think of.

This will go through a web page you give it and print all the links, pretty much anything inside an <a href="...">.

    

# Python 2 (uses urllib2, raw_input and the print statement)
import urllib2


def get_next_target(page):
    # Find the next <a href= and return (url, position of its closing quote).
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote


def print_all_links(page):
    # Print each link, then keep searching the rest of the page.
    while True:
        url, endpos = get_next_target(page)
        if url:
            print url
            page = page[endpos:]
        else:
            break


ans = 'y'
while ans.lower() != 'n':
    user_page = raw_input("Enter a url to attempt to crawl: ")
    try:
        # The address is typed without the scheme, e.g. www.example.com
        user_content = urllib2.urlopen("https://" + user_page)
        user_page = user_content.read()
        print_all_links(user_page)
    except urllib2.URLError, e:
        # URLError also covers HTTPError, which is a subclass of it.
        print "Error can't crawl"
    ans = raw_input('Search another site?(y/n)')
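For comparison, here is a rough sketch of the same link extraction using the standard library's HTML parser instead of string searching. It is written for Python 3 (the code in the post is Python 2), and the names LinkCollector and extract_links are made up for this example. Unlike page.find('<a href='), it should also pick up single-quoted href values and anchors where href is not the first attribute.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive already lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    """Return all non-empty href values in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/a">A</a> <a class="x" href=\'/b\'>B</a></p>'
    for link in extract_links(sample):
        print(link)
```

The page text would come from the same kind of urlopen(...).read() call as in the post, decoded to a string first.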

It can crawl some sites but not others. This would be great for a Django project, and it would be good to style the output onto a web page.
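The post doesn't say why some sites fail. Two things visible in the code are that it always puts "https://" in front of whatever is typed, and that it gives up on any error. Below is a minimal Python 3 sketch of one way to handle input and errors a bit more carefully. The helper names (normalize_url, fetch, absolute_links) and the User-Agent string are made up for illustration, and sending a custom User-Agent is only a guess at what some sites might want, not a confirmed fix.

```python
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


def normalize_url(user_input, default_scheme="https"):
    """Add a scheme only when the user did not type one (hypothetical helper)."""
    text = user_input.strip()
    if urlparse(text).scheme in ("http", "https"):
        return text
    return default_scheme + "://" + text


def fetch(url, timeout=10):
    """Return the page text, or None if the request fails."""
    # Assumption: some servers may reject the default Python User-Agent.
    request = Request(url, headers={"User-Agent": "simple-link-crawler/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as error:
        # URLError and HTTPError are subclasses of OSError in Python 3;
        # ValueError covers malformed URLs.
        print("Error can't crawl:", error)
        return None


def absolute_links(base_url, links):
    """Resolve relative links such as '/about' against the page they came from."""
    return [urljoin(base_url, link) for link in links]
```

With this, typing either example.com or http://example.com works as input, and a relative link like /about found on https://example.com/dir/page.html is turned into https://example.com/about so it could be crawled next.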