Recursively Find Hyperlinks In A Website

I was trying to write a script to crawl a website and fetch all the hyperlinks pointing to a particular file type, e.g. .pdf or .mp3. Somehow the following command did not work for me:
wget -r -A .pdf <URL>

It did not recursively download all the PDF files. I may have to ask on Stack Overflow.

Anyway, I wrote my script in Python and it worked well, at least for the site I was trying to crawl. The following script gives all the absolute URLs pointing to the desired file type in the whole website. You may have to add a few more strings to the excludeList configuration variable to suit your target site, or else you may end up in an infinite loop.

[code language="python"]
import re
import urllib2
import urllib

## Configurations
# The starting point
baseURL = "<home page url>"
maxLinks = 1000
excludeList = ["None", "/", "./", "#top"]
fileType = ".pdf"
outFile = "links.txt"

# Global list of links already visited, so we don't get into a loop
vlinks = []
# This is where the output is stored: the list of file links
files = []

# A recursive function which takes a URL and adds the matching links to the
# global output list.
def findFiles(baseURL):
    # URL encoding
    baseURL = urllib.quote(baseURL, safe="/:=&?#+!$,;'@()*[]")
    print "Scanning URL " + baseURL

    # Stop once the maximum number of links has been stored
    print "Number of links stored - " + str(len(files))
    if len(files) > maxLinks:
        return

    # Fetch the current page
    try:
        website = urllib2.urlopen(baseURL)
    except urllib2.HTTPError, e:
        print baseURL + " NOT FOUND"
        return

    # HTML content of the current page
    html = website.read()

    # Fetch the href values using a regular expression from the HTML.
    # Beautiful Soup does it wonderfully in one go.
    links = re.findall(r'(?<=href=["\']).*?(?=["\'])', html)

    for link in links:
        url = str(link)
        # Found the file type: store it and move to the next link
        if url.endswith(fileType):
            print "file link stored " + url
            files.append(url)
            f = open(outFile, 'a')
            f.write(url + "\n")
            f.close()
            continue
        # Exclude external links and self links, else it will keep looping
        if not (url.startswith("http") or (url in excludeList)):
            # Build the absolute URL and show it!
            absURL = baseURL.partition('?')[0].rpartition('/')[0] + "/" + url
            print "abs url = " + absURL
            # Do not revisit the URL
            if not (absURL in vlinks):
                vlinks.append(absURL)
                findFiles(absURL)

# Finally call the function
findFiles(baseURL)
print files
[/code]
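To see what the href regular expression actually captures, here is a small self-contained sketch. The sample HTML string is made up for illustration; the pattern is the same one the script uses (a lookbehind for `href="` or `href='`, a lazy match, and a lookahead for the closing quote):

```python
import re

# A tiny hypothetical page, with both double- and single-quoted hrefs.
html = '<a href="notes.pdf">Notes</a> <a href="#top">Top</a> <a href=\'song.mp3\'>Song</a>'

# Same pattern as the crawler: grab whatever sits between the quotes of an
# href attribute.
links = re.findall(r'(?<=href=["\']).*?(?=["\'])', html)
print(links)  # ['notes.pdf', '#top', 'song.mp3']

# Keep only the desired file type, as the crawler does before storing a link.
pdfs = [u for u in links if u.endswith('.pdf')]
print(pdfs)  # ['notes.pdf']
```

Note that `#top` is also captured; that is exactly why excludeList is needed, so self links don't get crawled again.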

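The absolute-URL construction in the script (strip any query string, drop the last path segment, then append the relative link) can be illustrated with a hypothetical base URL:

```python
# Hypothetical page URL and a relative link found on that page.
baseURL = "http://example.com/papers/index.html?page=2"
url = "report.pdf"

# Strip the query string, drop the last path segment, append the link.
absURL = baseURL.partition('?')[0].rpartition('/')[0] + "/" + url
print(absURL)  # http://example.com/papers/report.pdf
```

The standard library's urljoin (urlparse.urljoin in Python 2) is a more robust alternative, since it also handles `../` paths and links that are already absolute.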

