Pages

Saturday, August 17, 2013

Recursively Find Hyperlinks In A Website

I was trying to write a script to crawl a website and fetch all the hyper links pointing to all the a particular file type e.g. .pdf or .mp3. Somehow the following command did not work for me.
wget -r -A .pdf <URL>

It did not go recursively and download all PDF files. I may have to ask in  stackoverflow.

Anyway I wrote my script in python and it worked well. At least for the site I was trying crawl. The following scripts give all the absolute URLs pointing to the desired type of files in the whole website. You may have to add few more strings in excludeList configuration variable to suite your target site else you have end up infinite loop.

[code language="python"]
import re
import urllib2
import urllib

## Configurations
# The starting point
baseURL = <home page url>
maxLinks = 1000
excludeList = ["None","/","./","#top"]
fileType = ".pdf"
outFile = "links.txt"

#Gloab list of links already visited , don't want to get into loop
vlinks = []
#This is where output is stored the list of files
files = []

# A recursive function which takes a url and adds the outpit links in the global
# output list.

def findFiles( baseURL ):
#URL encoding
baseURL = urllib.quote(baseURL, safe="/:=&?#+!$,;'@()*[]")
print "Scanning URL "+baseURL

#Check maximum number of links you want to store
print "Number of link stored - " + str(len(files))
if(len(files) > maxLinks):
return

# the current page
website = ""
try:
website = urllib2.urlopen(baseURL)
except urllib2.HTTPError, e:
print baseURL + " NOT FOUND"
return
# HTML content of the current page
html = website.read()
# fetch the anchor tags using regular expression from the html
# Beautifull Soup does it wonderfully in one go
links = re.findall('(?<=href=["']).*?(?=["'])', html)
#
for link in links:
#print link
url = str(link)
# Found the file type, then store and move to the next link
if(url.endswith(fileType)):
print "file link stored" + url
files.append(url)
f = open(outFile, 'a')
f.write(url+"n")
f.close
continue
# Exlude external links and self links , else it will keep looping
if not (url.startswith("http") or ( url in excludeList ) ):
#Build the absolute URL and show it !
print "abs url = " + baseURL.partition('?')[0].rpartition('/')[0]+"/"+url
absURL = baseURL.partition('?')[0].rpartition('/')[0]+"/"+ url
#Do not revisit the URL
if not (absURL in vlinks):
vlinks.append(absURL)
findFiles(absURL)
return

#Finally call the function
findFiles(baseURL)
print files
[/code]