Post: [Python] Email Spider/Harvester .::Source::.
05-10-2011, 05:00 PM #1
I made this because I was bored. It crawls a given URL for eMails (inside href="mailto:...") and then starts a new thread for every URL found on that page, searching each of those pages for eMails in turn.

Example Algo:

Url
-- Find eMails
-- For every URL in found URLs
-- -- Url
-- -- Find eMails
-- -- For every URL in found URLs of that URL
-- -- -- Find eMails
....

It's a HUGE loop searching for eMails.
It's called a spider because it works like a spider web, getting bigger and bigger.
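
The crawl order from the Example Algo can be sketched without any network access. This is only a rough illustration in Python 3, not the script itself: the `PAGES` dictionary, the `crawl` function, the `max_depth` limit and the `seen` set are all made up for the example (the script in this post uses threads and has neither a depth limit nor a visited set).

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (eMails on the page, links on the page).
PAGES = {
    "http://a.example/": (["root@a.example"], ["http://a.example/x", "http://a.example/y"]),
    "http://a.example/x": (["x@a.example"], ["http://a.example/", "http://a.example/z"]),
    "http://a.example/y": (["root@a.example"], []),
    "http://a.example/z": (["z@a.example"], []),
}

def crawl(start, max_depth):
    """Visit pages level by level: collect eMails, then follow each page's links."""
    seen = {start}              # URLs already queued, so loops between pages end
    emails = []                 # unique eMails in the order they were found
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        page_emails, links = PAGES.get(url, ([], []))
        for mail in page_emails:
            if mail not in emails:
                emails.append(mail)
        if depth == max_depth:  # assumed limit: do not follow links any deeper
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return emails

print(crawl("http://a.example/", 1))
```

With `max_depth` set to 1 only the start page and the pages it links to are searched; raising it to 2 also reaches `http://a.example/z`.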

Credits to the original author of the eMail validation regex recipe.

    
#!/usr/bin/env python
# Email Spider/Harvester (Python 2)
# Made by : BlueMelon
# Project-Melon.com
# Credit if used
import urllib, threading, re, time

r = re.compile(r'(?<=href="mailto:).*?@.*?\.\w{2,4}(?=")') # Mails
r1 = re.compile(r'(?<=href=").*?(?=")') # Links
valid = re.compile(r'^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$') # Valid eMail

count = 0
lock = threading.Lock() # Guards count and log.txt, which every thread shares

class Crawl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        global count
        try:
            source = urllib.urlopen(self.url).read() # Get page source
            mails = set(r.findall(source)) # Get all eMails, remove dupes if found
            with lock: # Only one thread at a time may touch the log
                log = open('log.txt', 'a') # Creates the file on first use
                saved = open('log.txt', 'r').read().splitlines()
                for i in mails: # For every eMail found, append it to log
                    if valid.match(i) and i not in saved: # Valid and not in file yet
                        print 'Saved: ', i
                        log.write(i + '\n') # Append it
                        count += 1
                log.close()
            urls = r1.findall(source) # Find all urls on that page
            for url in urls:
                Crawl(url).start() # Start a crawl for every url found
        except Exception: # Error (bad url, network failure, ...): skip this page
            pass

Crawl("https://www.ianr.unl.edu/internet/mailto.html").start() # Starting URL

while True:
    time.sleep(1)
    print 'Threads: ', threading.activeCount(), 'Saved: ', count
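
To try the extraction and validation step on its own, here is a small sketch in Python 3 (the script above is Python 2). The names `MAILTO`, `VALID`, `extract_emails` and the `sample` HTML are made up for this example, and the mailto pattern is a simplified one that grabs everything up to the closing quote before the validation regex from the script is applied.

```python
import re

# Assumed, simplified pattern: everything between href="mailto: and the closing quote.
MAILTO = re.compile(r'(?<=href="mailto:)[^"]+(?=")')
# The validation regex from the script, with the dot escaped. It is lowercase only.
VALID = re.compile(r'^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$')

def extract_emails(html):
    """Return the unique, valid-looking eMails found in mailto links, in page order."""
    found = []
    for candidate in MAILTO.findall(html):
        if VALID.match(candidate) and candidate not in found:
            found.append(candidate)
    return found

# Made-up page source: one good address (twice), one broken mailto, one normal link.
sample = (
    '<a href="mailto:info@example.com">Mail us</a> '
    '<a href="mailto:not-an-address">Broken</a> '
    '<a href="mailto:info@example.com">Again</a> '
    '<a href="/contact.html">Contact</a>'
)
print(extract_emails(sample))
```

The broken mailto and the duplicate are dropped, so only `info@example.com` is left. Because the validation regex is lowercase only, an address written with capital letters would be rejected unless the text is lowercased first.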




This could be used to build email lists, for mass mailing, and for all kinds of other things.

[size=large]IF USED PLEASE GIVE CREDITS[/size]

Enjoy.

