[Python] Email Spider/Harvester .::Source::.

05-10-2011, 05:00 PM #1

CodingNation


I made this because I was bored. It crawls a given URL for eMails (inside href="mailto:...") and then starts a new thread for every URL found on that page, and each thread searches its page for eMails in the same way.
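To show what "inside href="mailto:..."" means in practice, here is a rough sketch that pulls mailto addresses and ordinary links out of a page. It uses the standard library's html.parser instead of the regexes in the script, so it is an alternative and not what the script itself does. The class name HrefCollector and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects mailto addresses and other link targets from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.mails = set()   # addresses found in mailto: links
        self.links = []      # every other href, in page order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." part after the address
            address = href[len("mailto:"):].split("?", 1)[0]
            self.mails.add(address)
        else:
            self.links.append(href)


# Made-up page, only for illustration
sample = ('<a href="mailto:bob@example.com?subject=Hi">Mail</a> '
          '<a href="/about.html">About</a>')
collector = HrefCollector()
collector.feed(sample)
print(sorted(collector.mails), collector.links)
```

A parser like this also copes with single-quoted attributes, which a regex keyed on href=" would miss.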

Example Algo:

Url
-- Find eMails
-- For every URL in found URLs
-- -- Url
-- -- Find eMails
-- -- For every URL in found URLs of that URL
-- -- -- Find eMails
....

It's a HUGE loop searching for eMails.
It's called a spider because it works like a spider web, getting bigger and bigger.
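The loop can be tried out without touching the network. The sketch below walks a made-up in-memory "site" (FAKE_SITE) breadth-first. It is not the script from this post: it swaps the one-thread-per-URL approach for a queue, and it adds a visited set and a page limit (max_pages) so the web stops growing at some point. All names in it are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "site": url -> (emails on that page, links on that page)
FAKE_SITE = {
    "a": ({"x@example.com"}, ["b", "c"]),
    "b": ({"y@example.com"}, ["a", "d"]),
    "c": (set(), ["d"]),
    "d": ({"x@example.com", "z@example.com"}, []),
}


def fake_fetch(url):
    """Stand-in for downloading and parsing a page."""
    return FAKE_SITE.get(url, (set(), []))


def crawl(start, fetch, max_pages=100):
    """Breadth-first walk: collect emails from each page, then queue its links."""
    seen = {start}            # pages already queued or visited
    queue = deque([start])
    found = set()             # a set removes duplicate addresses for free
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        emails, links = fetch(url)
        found |= emails
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return found


emails = crawl("a", fake_fetch)
print(sorted(emails))
```

Without the seen set, pages "a" and "b" would keep queuing each other forever, which is the same thing that happens when two real pages link to each other.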

Credits to the author of the eMail validation regex recipe (the link to it was not preserved in this copy of the post).
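For a feel of what that validation regex accepts, here is a small sketch. It assumes the dot after each domain label is meant as a literal dot (written \. here); the helper name is_valid and the sample addresses are made up.

```python
import re

# Validation pattern from the script, with the domain dot escaped (assumption)
VALID = re.compile(r"^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$")


def is_valid(address):
    """True if the whole string looks like a lowercase eMail address."""
    return VALID.match(address) is not None


for candidate in ["bob@example.com", "not-an-email", "Bob@example.com",
                  "bob@x.com", "bob@example.museum"]:
    print(candidate, is_valid(candidate))
```

The pattern is strict: it only allows lowercase letters, needs at least two characters in every domain label, and limits the last part to 2-4 letters, so some real addresses get rejected.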

    

#!/usr/bin/env python3
# Email Spider/Harvester
# Made by : BlueMelon
# Project-Melon.com
# Credit if used
import re
import threading
import time
import urllib.request

r = re.compile(r'(?<=href="mailto:)[^"]+@[^"]+(?=")')  # Mails
r1 = re.compile(r'(?<=href=")[^"]*(?=")')  # Links
valid = re.compile(r'^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$')  # Valid eMail check

count = 0
lock = threading.Lock()  # Guards count and log.txt, since many threads touch them


class Crawl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        global count
        try:
            # Get page source (bytes in Python 3, so decode to text)
            source = urllib.request.urlopen(self.url).read().decode('utf-8', 'replace')
            mails = set(r.findall(source))  # Get all eMails, remove dupes if found
            with lock:
                with open('log.txt', 'a+') as log:
                    log.seek(0)
                    saved = set(log.read().splitlines())  # eMails already in the file
                    for i in mails:  # For every eMail found, append it to the log
                        if valid.match(i) and i not in saved:  # Valid and not saved yet
                            print('Saved:', i)
                            log.write(i + '\n')  # Append it
                            count += 1
            urls = r1.findall(source)  # Find all urls on that page
            for url in urls:
                Crawl(url).start()  # Start a crawl for every url found
        except Exception:  # Error (bad URL, timeout, ...): skip this page
            pass


Crawl("https://www.ianr.unl.edu/internet/mailto.html").start()  # Starting URL

while True:
    time.sleep(1)
    print('Threads:', threading.active_count(), 'Saved:', count)
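One thing to keep in mind: the link regex returns whatever sits inside href, so the list can contain relative paths and mailto: links, and urlopen fails on those. A possible way to clean the list before crawling, sketched with urllib.parse from the standard library (the function name absolute_links and the example.com URLs are made up):

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve hrefs against the page they came from and keep only http(s) targets."""
    result = []
    for href in hrefs:
        full = urljoin(base_url, href)   # turns "/about.html" into a full URL
        if urlparse(full).scheme in ("http", "https"):
            result.append(full)          # drops mailto:, javascript:, ftp:, ...
    return result


links = absolute_links(
    "http://example.com/dir/page.html",
    ["/about.html", "other.html", "mailto:bob@example.com", "https://example.org/"],
)
print(links)
```

This is only a sketch of the idea; the script above does not do this and relies on its try/except to skip anything it cannot open.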


This could be used to build email lists, for mass mailing, and for all kinds of other things.

[size=large]IF USED PLEASE GIVE CREDITS[/size]

Enjoy.