Pages

Monday, July 22, 2013

Screen scraping using YQL

I have been using Yahoo pipes for a long time now. I have done some screen scraping mash up using them. While yahoo pipe provides a component to fetch the HTML content from a URL, it is bit difficult to cut a specific part because it totally relies on a string to match. I came across YQL console where I could write SQL like queries and fetch the HTML content of any URL. The best part was that it supports XPath expressions for selecting the exact node of the HTML to extract data. For example I write the following query to get the stock price from the web page

[code language="sql"]
select *
from html
where url ="http://getquote.icicidirect.com/NewSiteTrading/trading/equity/includes/trading_stock_quote.asp?Symbol=BSES"
and xpath='//td[p/text()="LAST TRADE PRICE"]/following-sibling::td[2]/p'
[/code]

See above code running in YQL Console.

Similarly this query can be made little bit complex and parametrized for the stock symbol to form the appropriate url

[code language="sql"]
select *
from html
where url in (
select url
from uritemplate
where template="http://getquote.icicidirect.com/NewSiteTrading/trading/equity/includes
/trading_stock_quote.asp?Symbol={item}" and item=@item)
and xpath='//td[p/text()="LAST TRADE PRICE"]/following-sibling::td[2]/p | //td[p/text()="LAST TRADED TIME"]/following-sibling::td[1]/p'
[/code]

See the above code in YQL console. However this will not directly run from the console. One could just create a query alias and pass the required query string like the following.

http://query.yahooapis.com/v1/public/yql/neilghosh/liveQuote?item=INFTEC

This could have been done using the built in YQL component in the Yahoo Pipes itself but it would be an extra layer if you just need to get the required content from the HTML instead of having to play around any feed (for which Pipes is still the best choice). Of course there some limits/quota while using such YQL queries, which I need to explore in coming days.

For screen scraping I could directly use Google App Engine's URLFetch or curl in PHP servers but this would unnecessarily transfer the whole content consuming quota and leading to time lag.