Downloading and Storing Financial Data with Google Cloud Scheduler, Cloud Storage and BigQuery

Financial data from the stock market is usually available through APIs like the Bloomberg API. Almost 12 years ago I worked for a company that needed to download and store historical data using the C SDK, and also run a periodic cron job to do the same every day for that day's data (e.g. day high, low, open and close prices, and volumes). This was mostly run after market hours, when the data for that day was supposed to be available. Ultimately the data was stored in PostgreSQL and catered to many applications, like option simulation (predicting the premium in the days leading up to option expiry using the volatility calculated by the Black-Scholes model) and plotting bar charts (.NET or Adobe Flex desktop charts). The simulation software was all written in PHP (in an object-oriented way), and some business users also used connectors from Microsoft Excel to fetch data from PostgreSQL directly.

While all these financial features were ahead of their time, had all of this been done now, the obvious choice would probably be to do everything in the cloud and make it accessible to users (maybe even earn from subscriptions). I remember that, because we did not have a static IP, some internal users accessed the tools and charts via GoToMyPC :).

Google Cloud Scheduler lets you trigger a specific endpoint at a specific interval. The following configuration calls the GET endpoint over HTTP every day at midnight.
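If you create the job with the gcloud CLI instead of the console, a roughly equivalent command looks like this (the job name, function URL and time zone below are placeholders; the cron expression 0 0 * * * means every day at midnight):

gcloud scheduler jobs create http nse-daily-download \
    --schedule="0 0 * * *" \
    --uri="https://REGION-PROJECT.cloudfunctions.net/downloadNseData" \
    --http-method=GET \
    --time-zone="Asia/Kolkata"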

The above endpoint is served by a Google Cloud Function written in NodeJS, which pulls a zip file from the National Stock Exchange and saves it in a bucket in Google Cloud Storage. As of now, Cloud Functions supports NodeJS as GA, and NodeJS also has a very easy-to-use client SDK for interacting with Cloud Storage.
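A minimal sketch of such an HTTP-triggered function is shown below; the NSE URL, the bucket name, and the use of axios and adm-zip are assumptions for illustration (the linked repository has the actual implementation). It downloads the daily zip, extracts the CSV and hands it to the writeToBucket helper shown next.

// index.js - hypothetical HTTP entry point for the download function
const axios = require('axios');
const AdmZip = require('adm-zip');
const { Storage } = require('@google-cloud/storage');

// Bucket name is a placeholder; the real bucket is configured in the deployed function
const gcsBucket = new Storage().bucket('nse-historical-data');

exports.downloadNseData = async (req, res) => {
  const date = new Date();
  // Indicative URL only; the real bhavcopy URL depends on the NSE archive layout and the date
  const url = 'https://archives.nseindia.com/.../bhavcopy.zip';

  // Fetch the zip archive as a raw buffer
  const response = await axios.get(url, { responseType: 'arraybuffer' });

  // The daily archive contains a single CSV; pull it out of the zip
  const zip = new AdmZip(Buffer.from(response.data));
  const csv = zip.getEntries()[0].getData().toString('utf8');

  await writeToBucket(date, csv);
  res.status(200).send('Saved');
};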

The following code snippet writes the raw data to Google Cloud Storage:

async function writeToBucket(date, data) {
  // gcsBucket is a Cloud Storage bucket handle created elsewhere, e.g. new Storage().bucket(...)
  const fileName = date.toISOString().split('T')[0] + '.csv';
  console.log('Saving in GCS as ' + fileName);
  const file = gcsBucket.file(fileName);

  // Wrap the stream in a promise so callers can await the upload and see failures
  return new Promise((resolve, reject) => {
    file.createWriteStream({
      metadata: {
        contentType: 'text/csv'
      }
    })
      .on('error', reject)
      .on('finish', () => {
        console.log('Uploaded');
        resolve();
      })
      .end(data);
  });
}

Full Code.

So now, every day as new data becomes available in the market, a new file is created in the bucket.


Data in Cloud Storage is not very easy to query, so it needs to go into a database system. Google BigQuery is a natural choice if you want to run some ad-hoc queries and analysis, especially as the volume of data grows, i.e. years of stock data.

Google Cloud Storage lets you trigger a Cloud Function automatically when a file is created (or changed) in a bucket. Using this feature we can invoke another Cloud Function which keeps appending data to BigQuery as and when data arrives in Google Cloud Storage in the form of a new file. We basically write another snippet of NodeJS code which loads the newly arrived CSV file into the existing BigQuery table. This way we do not have to schedule another cron job and coordinate the schedules; the two events (downloading the data and loading it into BigQuery) simply chain one after the other.


// https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
const metadata = {
  sourceFormat: 'CSV',
  skipLeadingRows: 1,
  autodetect: true,
  location: 'US',
};

// Load data from a Google Cloud Storage file into the table
// (datasetId, tableId, bucketName and filename come from the function's configuration and the trigger event)
const [job] = await bigquery
  .dataset(datasetId)
  .table(tableId)
  .load(storage.bucket(bucketName).file(filename), metadata);

// load() waits for the job to finish
console.log(`Job ${job.id} completed.`);

Full Code.
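For context, in a background function triggered by Cloud Storage the new object's details arrive as the event payload, which is where the bucketName and filename used above would come from. A hypothetical wrapper (the exported name is a placeholder) could look like this:

// Hypothetical entry point for the GCS-triggered loader; the load job shown above runs inside it
exports.loadToBigQuery = async (file, context) => {
  const bucketName = file.bucket; // bucket that received the new object
  const filename = file.name;     // name of the new CSV (the dated file created by the downloader)
  // ... run the BigQuery load job shown above ...
};

The function would then be deployed with a Cloud Storage trigger (for example, gcloud functions deploy with the --trigger-bucket flag pointing at the same bucket) so that it fires for every new file.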

Note that DML is expensive in BigQuery, but data can be loaded multiple times to append into the table. Since the load is submitted as a job, we could very well choose not to wait for it to complete. However, for data integrity, we wanted to wait and catch any failure. Also, since the size of the daily data is predictable, we can be fairly sure it will complete before the function times out.
 

Now that we have the data in BigQuery, we can run ad hoc queries to analyse it. Google Data Studio is a great tool for this. You can plot charts or filter by various conditions to analyse time-series data like historical prices.
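For instance, an ad hoc query over the loaded table might look like the sketch below; the dataset, table and column names are placeholders, since the actual schema is autodetected from the CSV headers.

const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function topSymbolsByAverageClose() {
  // Dataset, table and column names are assumptions for illustration
  const query = `
    SELECT SYMBOL, AVG(CLOSE) AS avg_close
    FROM \`nse_data.daily_bhavcopy\`
    GROUP BY SYMBOL
    ORDER BY avg_close DESC
    LIMIT 10`;

  const [rows] = await bigquery.query({ query, location: 'US' });
  rows.forEach((row) => console.log(row.SYMBOL, row.avg_close));
}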

If you want this data to be used by a live application like a portfolio management system or a charting application, it may make sense to store it in an RDBMS like Cloud SQL.

The above code downloads data from NSE for demo/tutorial purposes only. Commercial use of that data might be prohibited.

The code is originally hosted in the GitHub repo nse-historical-data. However, the deployed Cloud Functions use the code from a Cloud Source Repository (https://cloud.google.com/source-repositories/), which in turn mirrors the above GitHub repository.