
Saturday, August 17, 2013

Recursively Find Hyperlinks In A Website

I was trying to write a script to crawl a website and fetch all the hyperlinks pointing to files of a particular type, e.g. .pdf or .mp3. Somehow the following command did not work for me.
wget -r -A .pdf <URL>

It did not recurse through the site and download all the PDF files. I may have to ask on Stack Overflow.

Anyway, I wrote my script in Python and it worked well, at least for the site I was trying to crawl. The following script gives all the URLs pointing to files of the desired type in the whole website. You may have to add a few more strings to the excludeList configuration variable to suit your target site, or else you may end up in an infinite loop.

[code language="python"]
# Python 2 script (uses urllib2 and print statements)
import re
import urllib2
import urllib

## Configurations
# The starting point (replace the placeholder with the site's home page URL)
baseURL = "<home page url>"
maxLinks = 1000
excludeList = ["None", "/", "./", "#top"]
fileType = ".pdf"
outFile = "links.txt"

# Global list of links already visited; we don't want to get into a loop
vlinks = []
# This is where the output is stored: the list of files
files = []

# A recursive function which takes a URL and adds the output links to the
# global output list.

def findFiles(baseURL):
    # URL encoding
    baseURL = urllib.quote(baseURL, safe="/:=&?#+!$,;'@()*[]")
    print "Scanning URL " + baseURL

    # Check the maximum number of links you want to store
    print "Number of links stored - " + str(len(files))
    if len(files) > maxLinks:
        return

    # The current page
    try:
        website = urllib2.urlopen(baseURL)
    except urllib2.HTTPError:
        print baseURL + " NOT FOUND"
        return
    # HTML content of the current page
    html = website.read()
    # Fetch the href values from the HTML using a regular expression
    # (Beautiful Soup does it wonderfully in one go)
    links = re.findall('(?<=href=["\']).*?(?=["\'])', html)

    for link in links:
        #print link
        url = str(link)
        # Found the file type, then store and move to the next link
        if url.endswith(fileType):
            print "file link stored " + url
            files.append(url)
            f = open(outFile, 'a')
            f.write(url + "\n")
            f.close()
            continue
        # Exclude external links and self links, else it will keep looping
        if not (url.startswith("http") or (url in excludeList)):
            # Build the absolute URL and show it
            absURL = baseURL.partition('?')[0].rpartition('/')[0] + "/" + url
            print "abs url = " + absURL
            # Do not revisit the URL
            if not (absURL in vlinks):
                vlinks.append(absURL)
                findFiles(absURL)
    return

# Finally call the function
findFiles(baseURL)
print files
[/code]
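
The script above builds absolute URLs by string concatenation, which can go wrong for hrefs such as ../file.pdf or root-relative paths like /docs/file.pdf. As a minimal sketch of an alternative, assuming Python 3 and only the standard library, the link handling for a single page could look roughly like this. The names HrefCollector and classify_links and the example.com URLs are made up for illustration, and this is not the script used for the original crawl.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html, file_type=".pdf"):
    """Return (file_urls, page_urls) found in html, both as absolute URLs.

    page_urls only contains pages on the same host as page_url.
    """
    parser = HrefCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    file_urls, page_urls = [], []
    for href in parser.hrefs:
        # Resolve relative links and drop fragments such as #top
        absolute = urljoin(page_url, href).split("#")[0]
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript: and similar schemes
        if urlparse(absolute).path.lower().endswith(file_type):
            file_urls.append(absolute)
        elif urlparse(absolute).netloc == host:
            page_urls.append(absolute)
    return file_urls, page_urls


# Made-up page content, just to show the behaviour without touching the network
sample = '''
<a href="docs/report.pdf">Report</a>
<a href="/about.html#top">About</a>
<a href="http://other.example.org/page.html">External</a>
<a href="mailto:someone@example.com">Mail</a>
'''
files, pages = classify_links("http://example.com/index.html", sample)
print(files)
print(pages)
```

A crawler built on this would still need the same safeguards as the script above: a collection of already visited page URLs and an upper limit on the number of stored links.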

Sunday, August 11, 2013

Getting started with XBMC

XBMC is a free and open source media player for various OS platforms, especially mobile. It is very useful for turning TV dongles, e.g. an Android PC, Apple TV or Raspberry Pi, into a media center for your TV. It can not only organize and play your local media but also stream movies and TV series. I downloaded the Android APK and installed it on my Rockchip MK806 Android TV.



To begin with, I scanned all the MP3s and videos on my SD card using XBMC, and all of them were ready to play. I found a couple of plugins which list almost all recent movies and TV series, very well organized by season and episode. As of now I have installed:

  1. Mash Up ( Installation Steps )

  2. 1 Channel ( Installation Steps )




Making it Full screen : You will notice that the Android navigation bar at the bottom always appears (even during movie playback). This is sometimes distracting, so I found an app, Fullscreen, which can help you get rid of this navigation bar.

Start on Boot : If XBMC is the only app you are going to use every time you boot your Android TV, then it makes sense to have it launched automatically on boot. I found the app Android Startup Manager for this. In fact, you could disable all the user and system apps which you think are of no use to you on the device. For me, the only app I want Android to run is XBMC, because I use all other apps on my phone.

Remote Control : Having a remote control app for XBMC on your mobile is very important, because some features do not work with a wireless mouse. The official remote app is good, but I use the app Yatse because it has a seek bar for the video currently being played. However, I have not yet been able to make it auto-discover the IP of the XBMC device.
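
When auto-discovery fails, the address has to be entered by hand, so it helps to check whether a candidate IP really answers as XBMC. The following is a rough sketch in Python 3, assuming that control over HTTP has been enabled in XBMC's settings and that its JSON-RPC interface is served at /jsonrpc. The address 192.168.1.50 and port 8080 are assumptions for illustration, and the helper names are made up.

```python
import json
import urllib.request


def build_ping_request(host, port=8080):
    """Build a JSON-RPC 2.0 'JSONRPC.Ping' request for an XBMC web server."""
    payload = {"jsonrpc": "2.0", "method": "JSONRPC.Ping", "id": 1}
    return urllib.request.Request(
        "http://{}:{}/jsonrpc".format(host, port),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def xbmc_is_reachable(host, port=8080, timeout=3):
    """Return True if the device answers the ping with "pong"."""
    try:
        request = build_ping_request(host, port)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            reply = json.loads(response.read().decode("utf-8"))
            return reply.get("result") == "pong"
    except (OSError, ValueError):
        # Connection problems, HTTP errors or a reply that is not JSON
        return False


# Hypothetical address; replace it with the IP shown on the XBMC device
request = build_ping_request("192.168.1.50")
print(request.full_url)
```

Calling xbmc_is_reachable("192.168.1.50") from a machine on the same network would then tell you whether that address is worth typing into the remote app.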

Tuesday, August 6, 2013

Moving the blog again

A year ago I moved from free shared Linux hosting to a paid plan at GoDaddy. Everything was good; there was no downtime like with free hosting solutions. When I was on free hosting, my site sometimes got blocked by antivirus software, probably because somebody else had hosted suspicious content on the same server. However, I did not earn any revenue from this blog, so paying for hosting was not really my favorite idea.

I read several articles on the internet about why GoDaddy is not a very good choice for hosting blogs. I had also used GoDaddy for hosting the website of the local chapter of the IEEE section (ieeehyd.org) but faced a lot of problems during renewal. First of all, only the first-year hosting price was attractive; the renewal charge was almost four times what I initially paid. I had my credit card in GoDaddy's payment methods, and they wouldn't let me remove it until I gave details of another card. This inspired me to close the account itself. Also, lately my site got blocked by Websense several times, probably because of their blacklisted servers.

I had considered moving to wordpress.com (the PaaS solution), but they were charging a lot for assigning the domain name, and they did not have a domain transfer facility that would let me delete my GoDaddy account entirely and pay only WordPress for the domain name. I was also not happy dealing with GoDaddy customer support, who kept replying with templates from their manual instead of looking into my hosting problem.

I came across an old post by one of my friends which shows how a WordPress blog can be installed on Heroku, even in the free tier. Of course there were some limitations, but I wanted to get rid of my current hosting. Apart from the steps mentioned in that blog post, I had to take some additional steps to overcome the limitations of Heroku and get my blog up and running.

  1. Heroku now has ClearDB for MySQL, so I did not have to use Heroku's PostgreSQL service as mentioned in the blog post.

  2. Because mine is not a new blog, I was moving all the content from the Linux host. I used the WordPress export file and the wordpress-importer plugin to migrate the database. I downloaded the images stored in wp-content via FTP and pushed them using git from my local directory. I was also able to reduce the size of the content by using an orphan image checker plugin and deleting unattached files.

  3. I had to create the .htaccess file in the root directory of my blog on Heroku to make sure the permalinks of posts and pages are rewritten to the appropriate query strings. This file is normally created by the WordPress installer itself, but the Heroku file system cannot be altered permanently unless a storage service is attached, which is paid.

  4. The DB size limit was small, so I added a plugin to optimize the WordPress database once in a while. It also made the website faster. Initially I saw that my export file was huge, but I later realized that there was a post into which I had dragged and dropped a lot of images from my desktop. All of those were stored as data URIs (plenty of junk characters in the post text itself) instead of separate image files.

  5. The MySQL user created by default by ClearDB did not have insert and update grants on the database, so the database upgrade did not go through after I manually upgraded WordPress to 3.6.

  6. GoDaddy's DNS manager did not have the option to forward the domain name (with masking) to the Heroku URL where my blog was hosted. It accepted only an IP address, which Heroku did not give me. So I forwarded the domain to a subdomain (www.neilghosh.com) and created a CNAME record so that the subdomain in turn points to the Heroku URL.
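
The oversized export file mentioned in step 4 is the kind of thing that can be spotted before importing. As a minimal sketch in Python 3, assuming the embedded images are base64 data URIs inside the post text, a short scan could report how many bytes each one carries. The function name find_embedded_images and the sample post are made up for illustration; a real check would read the WordPress export file instead.

```python
import base64
import re

# Matches base64 image data URIs such as data:image/png;base64,iVBORw0...
DATA_URI = re.compile(r"data:(image/[\w.+-]+);base64,([A-Za-z0-9+/=]+)")


def find_embedded_images(export_text):
    """Return a list of (mime_type, decoded_size_in_bytes), one per data URI."""
    found = []
    for mime, payload in DATA_URI.findall(export_text):
        try:
            size = len(base64.b64decode(payload))
        except ValueError:  # malformed or truncated base64
            continue
        found.append((mime, size))
    return found


# Tiny made-up post body with one embedded image and one normal image file
fake_image = base64.b64encode(b"\x89PNG" + b"\x00" * 96).decode("ascii")
post = (
    '<p>Hello</p><img src="data:image/png;base64,{}" />'
    '<img src="/wp-content/a.png" />'
).format(fake_image)
print(find_embedded_images(post))
```

Posts that show up in such a scan are candidates for replacing the embedded data with separate image files before the database is migrated.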


Following are the challenges I am ready to face during maintenance of the site, but I think it's worth it if I can save some money. On the bright side, you get used to git commands because of frequent usage :) .

  1. Every plugin/theme needs to be added via the local git repository to the wp-content/plugins directory, because anything uploaded via the admin portal will not be persisted in storage. It may seem to be persisted temporarily, but eventually the files will go missing when Heroku moves the app as part of load balancing. Paid Amazon S3 storage is recommended, but as of now I am pushing everything via git. I will try to see if I can use Dropbox for this.

  2. Similarly, pictures in posts should be uploaded through git; otherwise they can be hosted on some third-party site and embedded in the blog. The latter is a better option because it saves the blog's bandwidth, and if the blog is moved, the HTML code still refers to the same image.


Thanks to my colleague Sridhar, who initially gave me the idea of Heroku. I think I will be happy with it for some time, till my hunger to play around with it stops. If I could generate some money out of the blog, Google Compute Engine and AWS/VPS are definitely good areas to play around in. I also like the idea of static blogs, so that I don't waste computing (DB operations and PHP interpretation) every time some user requests a page from my blog.