Drugs prescribed by the NHS

Hi All,

As always, apologies for the length of time between posts – think this is a record. I was working on stuff I’m not allowed to share (work stuff, then Kaggle stuff) and got minorly derailed by Game of Thrones. Finished all the TV series, and one of the books. And Life of Pi. And I’m working my way through the Book Thief. So yeah, a tad derailed.

However, I’ve got a lazy Sunday and I saw an advert on TV that really annoyed me. It was basically along the lines of ‘did your doctor mess up? Why not sue them?’ and I think that it takes us towards where the Americans are. Which, when it comes to healthcare, is not where we want to go. Given that in the UK, if you’re being treated by a doctor it’s almost certainly an NHS doctor it seems pretty sucky that people are being encouraged to sue them. Puts the price up for everybody and makes people less likely to become doctors and whatnot. Anyway, this isn’t a political rant blog – I thought I’d have a look and see what data is available on the NHS to see if I could show the effects of an increase in litigation on the standard on medical care provided/costs (and ideally, contrast with America). In short, I couldn’t. The data that I wanted just wasn’t there. However, there was data on the drugs that the GPs for the NHS prescribes (at least between Jan and June in 2012) by practice with cost data. That seemed pretty interesting and so here we are.

I’ve been playing around with Google’s coLaboratory (check it out here) and would have loved to use this to do this particular bit of analysis. However, after playing round with it for a bit and struggling with external documents, finding my Google docs and various libraries I wanted I’ve decided to leave it for a while until coLaboratory becomes a tad more mature. Lots of promise there and with good Google Analytics API integration we could transform analytics practices at my company. Certainly one to watch.

Anyway, without that I’ll think of the questions I want to answer first and then pick my tool. Firstly, the data…

The data

As ever, data.gov.uk to the rescue – head here and download yourself a nice copy of the data:

http://data.gov.uk/dataset/gp-practice-prescribing-data

Once you’ve got a copy of all the data and the list of practices in England (I used the most recent one) we’re ready to start asking some questions of the data…

The Analysis

First off, let’s pick something simple – which drug costs the NHS most in each of the months of our test set, and overall.

Total Drug Cost

This actually seems to lend itself to the mapreduce paradigm pretty nicely – the mapper seems pretty unnecessary and as I’ve not got a cluster to hand (and this doesn’t warrant me spinning one up with AWS) I’ll just write a quick reducer in Python and use the Unix sort. For what it’s worth, I think I might write something about spinning up a quick cluster on AWS in the near future. It’s a fairly useful skill to have and given the increasing reluctance of my computer to perform the most basic of tasks, I think a fair bit of my future data analysis might have to happen in the cloud. Anyway, this is what my command will look like once I’m done:

awk 'FNR>1{print}' T201202PDP\ IEXT.CSV | sort -t , -k5 | python spending_reducer.py

Nothing too complicated there. I’m ignoring the first line (the awk command), then sorting the whole file based on the 4th column (the drug id) and piping the whole thing into the following reducer:

#!/usr/bin/python

import sys

current_drug = None
current_cost = 0.0

for line in sys.stdin:
    authority, trust, practice, drug_code, drug_name, number_bought, ni_cost, act_cost, period = line.strip().split(',')
    act_cost = float(act_cost.lstrip('0'))
    if drug_name == current_drug:
        current_cost += act_cost
    else:
        if current_drug:
            print current_drug + "\t" + str(current_cost)
        current_drug = drug_name
        current_cost = act_cost

print current_drug + "\t" + str(current_cost)

Nothing too difficult there – we’re just keeping track of the drug we’re on and adding up as we go. If I had no fear for the amount of RAM I had we could’ve accomplished the same thing without the laborious sorting step using associative arrays in awk. But for the next stage – the total across all 6 months, I am very afraid (RAM wise) and so we can run the same query with a bit of wildcarding:

awk 'FNR>1{print}' T20120[1-6]P*.CSV | sort -t , -k5 | python spending_reducer.py > drugs_by_spend.txt

This runs the same calculation over every file matching that wildcard pattern (all the data between 2012/01 and 2012/06. Note that this’ll take a little time – that sort is reasonably expensive over the 4 or so gigabytes of data we’ve got. Now we’ll pull together a few graphs and for this I think we’ll use R…

my_frame <- data.frame(read.csv('drugs_by_spend.txt', header=F, sep="\t", colnames=c("Drug", "Spend")))
my_frame <- my_frame[order(-my_frame$Spend),]
head(my_frame)

The top 5 drugs, by spend, in the first half of 2012 were:

  1. Fluticasone Propionate (Inh)
  2. Atorvastatin
  3. Enteral Nutrition
  4. Pregabalin
  5. Budesonide

Between them these cost: £738,620,789

Wow – that’s a hell of a lot. In 6 months, the actual costs of these drugs alone was more than £700million!

The total cost of drugs prescribed in that time period: £2,785,540,256

So I think we can surmise that lots of money is spent by the NHS – OK, I suppose that’s no surprise. For the non-doctors amongst us (that includes me) that list features two anti-asthmatic treatments i.e. those inhalers that I’m sure a lot of you have (also includes me). A quick Wikipedia shows that Pfizer holds the patent to at least a couple of those drugs (or at least did, Atorvastatin has expired) – it might be interesting to stick the patent holder next to these drugs. Maybe later…

Right, there’s lots that we could do here but I’m going to call it a day for now. In the future I think I’ll try and get more months of data and then start to look at evolving trends. To do that, I’ll use an AWS cluster and so will write something and using that.

Until then.

Advertisements

Kaggle Digit Recognizer Problem with Adaboost

Hi everybody,

Hi Dr Nick. But enough of that – today I’m going to be working through a Kaggle problem. For those of you who don’t know Kaggle, I can’t advise in favour of it strongly enough. It’s a great place to have a go at using real data sets to apply various machine learning techniques. There’s a leaderboard, discussions on methods and some non-too shabby prizes. I’ll come clean at this point – I’m not a natural salesperson.

I believe there’s something of a taboo against posting solutions/methods for Kaggle – however, I think I’m good to write about a method of solving this particular problem. The digit recognizer problem seems to be a rolling competition with a bunch of already published results and a few training classes on how to solve it. Let me know if you think this is overstepping the mark.

So, the problem:
Given a big set (42,000) of labelled training data (28 x 28 black and white images) of handwritten images (0-9) are we able to correctly identify other (identically dimensioned) handwritten digits.

There are a whole bunch of ways of doing this and the method I’ve had best success with is Support Vector Machines (using LibSVM). I may post an example of how to run that for this particular example but today I’d like to look at Adaboost (using the MultiBoost package)…

Until fairly recently I was entirely ignorant of Adaboost – I came across it on a different Kaggle problem (the Higgs one). There, a number of ‘out of the box’ methods were showcased – the most successful of which was Adaboost. A bit of reading on Adaboost suggests that it’s a fairly well-regarded, and successful method of performing a range of machine learning tasks. It’s also sometimes cited as being the best ‘out of the box’ (not specifically designed for the task at hand) algorithm in machine learning.

My current intuition on Adaboost is that it’s basically a ‘rule of thumb’ algorithm. It takes a lot of very simple decision boundaries and uses them to create a more complicated decision space. I say rule of thumb because I imagine a car mechanic or a doctor trying to diagnose a fault. The patient presents with symptom x, that makes a whole bunch of things less likely. However, if the patient falls into this age bucket and this ethnic group, some of the previously discounted things become more likely. I don’t know if that sort of explanation helps you but I quite like it. Basically, you create a simple rule that’s more often right than wrong. However, you can then update it with as many exceptions as you’ve got other bits of data. I think that’s a lot how the human decision-making process goes.

Anyway, all this talking isn’t getting us closer to a juicy set of predictions. Mad props to whoever first generated this particular procedure for the Higgs problem – I’ve shamelessly ripped it off, only making changes where necessary for this problem.

#!/usr/bin/python

import random
import csv
import subprocess
import numpy as np

def DataToArff(dataset, labels, header, title, filename):
    """
    With this data structure we're able to turn an arbitrary string of data into a .arff file
    These files allow us to import the data into Multiboost or Weka (amongst other machine learning libraries
    """
    with open(filename + ".arff", 'w') as f:
        f.write('@RELATION ' + title + '\n\n')
        for feature in header:
            f.write('@ATTRIBUTE ' + feature + ' NUMERIC\n')
        f.write('@ATTRIBUTE class {0,1,2,3,4,5,6,7,8,9}\n')
        f.write('\n@DATA\n')
        ## We could do this using all_data - however, we need the labels for further work
        ## Additionally, if the labels were numeric variables we'd be able to leave the rest of our work unchanged and handle them here
        for datarow, label in zip(dataset, labels):
            for value in datarow:
                f.write(str(value) + ',')
            f.write(str(label) + '\n')

all_data = list(csv.reader(open('train.csv', 'rb'), delimiter=','))
header = np.array(all_data[0][1:])
dataset = np.array([map(float, row[1:]) for row in all_data[1:]])
(numpoints, numfeatures) = dataset.shape

# Labels on the first column of the line
labels = np.array([row[0] for row in all_data[1:]])

randomPermutation = random.sample(range(len(dataset)), len(dataset))
## If this breaks halfway through, we'll be glad to be able to load our random permutation
np.savetxt('randomPermutation.csv', randomPermutation, fmt='%d', delimiter=',')

## I'll change the proportion of the train set and see how we get on.
numpointsTrain = int(numpoints*0.75)
numpointsValidatin = numpoints - numpointsTrain

## Because we've got a random permutation there's no problem taking slices of the total set to sort into train and validation
datasetTrain = dataset[randomPermutation[:numpointsTrain]]
datasetValidation = dataset[randomPermutation[numpointsTrain:]]

labelsTrain = labels[randomPermutation[:numpointsTrain]]
labelsValidation = labels[randomPermutation[numpointsTrain:]]

DataToArff(datasetTrain, labelsTrain, header, 'DigitsML_challenge_train', 'training')
DataToArff(datasetValidation, labelsValidation, header, 'DigitsML_challenge_validation', 'validation')

## Our Adaboost parameters are wholly contained in the relevant config files
p1 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'config.txt'])
p1.wait()
p2 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresValidation.txt'])
p2.wait()

testText = list(csv.reader(open('test.csv', 'rb'), delimiter=','))
datasetTest = np.array([map(float, row[:]) for row in testText[1:]])
labelsTest = np.repeat('0', len(testText) - 1)

DataToArff(datasetTest, labelsTest, header, 'Digit_challenge_test', 'test')

p3 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresTest.txt'])
p3.wait()

testScoresText = list(csv.reader(open('scoresTest.txt', 'rb'), delimiter=','))
with open('submission.csv', 'w') as f:
    f.write('ImageId,Label\n')
    for index, entry in enumerate(testScoresText):
        ## Take the index of the maximum value for a given row - this is the most likely value
        f.write(str(index + 1) + "," + str(np.argmax(entry)) + '\n')

I’d like to think that that bit of code is reasonably transparent and clear on what it’s doing. If I’m wrong, a basic explanation:
1.) Randomly split the data into a training and validation set
2.) Create .arff files for both of these sets
3.) Run Multiboost (our Adaboost implementation) on the training set and validation set
4.) Using the files created from our train/validation Adaboost, get the test set and generate predictions (again using Multiboost)
5.) Generate a submission file in the required format

Easy does it. Now for the configuration files that I’m using:

config.txt

fileformat arff
verbose 2
learnertype TreeLearner
constant
seed 50
weightpolicy balanced
baselearnertype SingleStumpLearner 8
outputinfo results.dta e01w01auc
traintest training.arff validation.arff 5000
shypname shyp.xml

configScoresValidation.txt

posteriors validation.arff shyp.xml scoresValidation.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

configScoresTest.txt

posteriors test.arff shyp.xml scoresTest.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

I’ll be honest and say I’ve not really found the ideal set-up for this problem here. I’m able to get a score of around 0.965 using the ones here but if you look at the leader board you’ll see that’s not all that good. Certainly the LibSVM method performed much better (something like 0.99). Not to worry, it’s doing the right thing, generating good results and is another tool in our arsenal.

The World Cup may stymie my blog posts for a bit – then again, supporting England, it might only be 3 games.

Football’s coming home.

Bayesian AB Testing using Python

Hi all,

A reasonably big part of the job I do involves running AB tests and I think that’s usually something that falls under the remit of data scientists. Now I could go on and on about the difficulties around tagging tests, choosing test groups, choosing goals and the like. I won’t, but I’ll make at least one point. All of the aforementioned points are very important and are very non-trivial. If you don’t absolutely understand all of the steps a user might take, and how your test will handle these, work a bit harder on that. And outsourcing this problem to your AB testing framework provider…well, I’d not advise it.

Anyway, I recently read this blog post by the engineers over at Lyst and thought it was a really interesting idea. I’m really not a fan of p-values and think that a probability distribution is a much nicer way of communicating a result. With that in mind, the basic logic behind Lyst’s post (if you’ve not got the time/inclination to read it):

1.) You presumably have a reasonable idea about the distribution of whatever metric you’re plotting – for instance, you’re reasonably confident conversion % lies somewhere between 0 and 25% with the most likely value to be around 15% (as an example). You assume both test groups follow this distribution until you start getting data in to corroborate/contradict that.

Beta Distribution

Beta Distribution with alpha parameter 8 and beta parameter 40.

2.) On getting data in, you update your distribution to take into account these new data points – the distributions evolve with the new data.
3.) Your end result (see below) gives you a probability distribution for the conversion of both test groups.

Bayesian Test Result

The distribution of conversion for two (randomly generated) test datasets

I’d argue the above picture is much clearer than even the best explained set of p-values. It also really clearly lends itself to calculations along the lines of ‘What is the probability that test group A is better than test group B?’ or ‘how sure are you that test group A is 2% better than test group B?’

Enamoured with this new way of reporting test results, I figured I mayerswell build something to do so. Instead of writing something where the test set-up is tied in with the result plotter I wrote my plotter to take input from stdin.

First things first then, I want something to generate a stream of conversion events:

import random
import time

for _ in range(2000):
    group = 'test' if random.random() > 0.3 else 'control'
    if group == 'test':
        convert = 1 if random.random() < 0.16 else 0
    else:
        convert = 1 if random.random() < 0.10 else 0
    print '%s:%d' % (group, convert)

Easy does it – we’ll look at varying those numbers later. For the uninitiated, that’ll give us fairly uneven test groups with reasonably different conversion percentages.

Now for the plotter. I’ll confess, this is still a work in progress. It currently doesn’t assume anything about the test groups, including the names (taking those from the input stream). However, in future I’m hoping to extend the program to be able to perform multivariate Bayesian AB tests. If it appears messy in places, that’s either because I’m expecting the poor coding practices to lead to me having an easier time extending the code to allow multivariate testing, or because I’m a messy coder.

At this point, massive props to this book: it forms the basis of almost all of this code.

import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
import sys

results_dictionary = {}

## Store all our test results in memory - doesn't allow real-time updating and could get a bit serious if we've got a big set of results
for line in sys.stdin:
    if line == '':
        break
    group, conversion = line.strip().split(':')
    try:
        results_dictionary[group].append(int(conversion))
    except:
        results_dictionary[group] = [int(conversion)]

test_group_a, test_group_b = results_dictionary.keys()

## We'll run this twice, once with uniform prior and once with a beta prior
prior_dict = dict((group, pm.Uniform(group, 0, 1)) for group in results_dictionary.keys())
prior_dict_beta = dict((group, pm.Beta(group, 3, 50)) for group in results_dictionary.keys())

@pm.deterministic
def delta(p_A = prior_dict[test_group_a], p_B = prior_dict[test_group_b]):
    return p_A - p_B

@pm.deterministic
def beta_delta(p_A = prior_dict_beta[test_group_a], p_B = prior_dict_beta[test_group_b]):
    return p_A - p_B

## Bernoulli distribution with the events we've got
observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict[group], value=events, observed=True)) for group, events in results_dictionary.items())
beta_observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict_beta[group], value=events, observed=True)) for group, events in results_dictionary.items())

## Markov chain Monte-Carlo methods - returning samples from our updated distributions
mcmc = pm.MCMC([prior_dict[test_group_a], prior_dict[test_group_b], delta, observations[test_group_a], observations[test_group_b]])
mcmc_beta = pm.MCMC([prior_dict_beta[test_group_a], prior_dict_beta[test_group_b], beta_delta, observations[test_group_a], observations[test_group_b]])
mcmc.sample(20000,1000)
mcmc_beta.sample(20000,1000)

## Grab all the samples we need
samples = dict((key, mcmc.trace(key)[:]) for key in results_dictionary.keys())
delta_samples = mcmc.trace('delta')[:]
beta_samples = dict((key, mcmc_beta.trace(key)[:]) for key in results_dictionary.keys())
beta_delta_samples = mcmc_beta.trace('beta_delta')[:]

## It's this easy to work out probabilities of success
prob_a_better = (delta_samples < 0).mean()
prob_a_better_beta = (beta_delta_samples < 0).mean()

### Plotting
ax = plt.subplot(321)
plt.hist(samples[test_group_a], histtype='stepfilled', bins=50, alpha=0.85, label='Uniform posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.suptitle('Posterior distributions of %s, %s, and $\Delta$ unknowns' % (test_group_a, test_group_b))
plt.title('Uniform posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(323)
plt.hist(samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Uniform posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(325)
plt.hist(delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of $\Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.title('Uniform posterior of $\Delta$')
plt.autoscale()
plt.annotate('Probability %s \nis greater \nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better), (0,30))
#
ax = plt.subplot(322)
plt.hist(beta_samples[test_group_a], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(324)
plt.hist(beta_samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(326)
plt.hist(beta_delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of $\Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.autoscale()
plt.annotate('Probability %s \nis greater \nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better_beta), (0,30))
plt.title('Beta posterior of $\Delta$')
#
plt.tight_layout()
plt.show()

Giving us:

Bayesian AB Testing Graphs

Graphs comparing the conversion of one test group against another, using a Beta distribution and a Uniform distribution as a prior.

First things first, why are there 6 graphs? Well, realistically, given that this is designed to model website conversion, I know what the distribution is likely to look like. Therefore, I say that my initial priors are beta distributions with parameters alpha = 10 and beta = 30. However, I’m well aware that naysayers might quibble with the idea of making such strong assumptions before running the test. Alongside that prior, I’ve included the completely uninformative uniform prior. That basically says that the conversion is equally likely to fall anywhere between 0 and 100%. Another reason for doing this is to show what difference it makes – when we’re looking at < 10 data points, you’ll see fairly big differences between the different prior assumptions. Increase the number of events up past 1000 and the two prior assumptions converge to the same value.

Additionally, we’ve fed in information about the test and control group – where has this delta come from and what’s it all about? That’s simply the difference between the test groups, as a probability distribution. How good is that? Really, that’s what people are interested in and, instead of making them compare two distributions and look at overlaps, we’ve done that for them and presented it as a probability distribution. Well done us.

Move along there.

Finding and Trading Volatile Stocks in Python

Hi all,

Can’t promise that this post won’t be bitty – I’m trying to simultaneously run an SVM and a random forest on a bunch of particle physics data for the Kaggle competition. Check it out, it’s pretty cool. Anyway, my computer is straining under the weight of those calculations and so while that was happening I decided to have a look at stock prices using Python/Pandas again.

After chatting with co-blogger Sean, and based on my (limited, and hilariously bad) experiences of stock trading we decided it’d be interesting to identify volatile stocks that don’t seem to have greatly varying fundamental value. We’re basically looking for the position of a harmonic oscillator in a stock. I’m not graphing that – look it up yourself. The logic being, there’ll be a point at which it it’s profitable to buy a stock like this on a down and sell again when it’s back up. Of course, this requires the assumption that the stock itself isn’t having a fundamental value shift – it’s just suffering from cyclicity. I don’t really know if/how this’ll work but that’s half the fun…

Right, back to it (I’ve caught up with Game of Thrones – get it watched). I’ve thought a reasonable amount about this and have decided our first job is to look at maximizing the following quantity:

\frac{Volatility}{Change_{daily}^n Change_{weekly}}

I might also throw in an additional concern – I’d like to be able to enter and exit the market whenever I want – I don’t see this being a big problem for me (I’m not going to be using a lot of money) but it’ll certainly be a concern for bigger players. Let’s cross that bridge if we need to.

So, to start off with, my daily volatility I’m going to define as
\frac{\sum_{i={day_1}}^{Today} \frac{HighPrice_i - LowPrice_i}{ClosePrice_i}}{NumberOfDays}

Hopefully nothing earth-shattering there, just want to see how much it varies over a day. Now while I want the stock price to vary a lot, I want it to head back to where it started. A rapidly increasing/decreasing stock is going to have wildly varying days. However, it’s also going to have a large overall trend. That’s no good for the purposes of finding stocks to buy/sell on a short time period.

Change_{daily} = \sqrt{\frac{\sum_{i={day_1}}^{Today} (\frac{ClosePrice_i - OpenPrice_i}{OpenPrice_i})^2}{NumberOfDays}}

Change_{weekly} = \sqrt{\frac{\sum_{i={week_1}}^{Today} (\frac{ClosePrice_i - OpenPrice_i}{OpenPrice_i})^2}{NumberOfWeeks}}

Easy does it – the reason I’ve squared the result is basically that I don’t care whether the stock is rising or falling. I’m trying to minimize the overall long-term variations from the mean.

So, how easy is this in Python? Remarkably so. Let’s start off by plotting a scatter graph of some of the more promising stocks.

import numpy as np
from pandas.io.data import DataReader
import pandas as pd
from datetime import datetime
from pylab import savefig

## A list of American Stock Symbols
company_information = pd.read_csv('allcompany.csv')

volatility_measure = []
daily_change_measure = []
weekly_change_measure = []
labels = []

## Let's start out with the biggest 10 companies in my list

for company in company_information.sort(['MarketCap'], ascending=False).head(10)['Symbol']:
    try:
        company_frame = DataReader(company.strip(), 'yahoo', datetime(2013,1,1), datetime.now().date())
        company_frame['Volatility'] = (company_frame['High'] - company_frame['Low'])/company_frame['Close']
        volatility_measure.append(company_frame['Volatility'].mean())
        company_frame['Daily_Change'] = company_frame['Close'].diff()
        daily_change_measure.append(np.sqrt(np.mean(company_frame['Daily_Change']**2)))
        ## Take every 5th row
        weekly_company_frame = company_frame[::5]
        weekly_company_frame['Change'] = weekly_company_frame['Close'].diff()
        weekly_change_measure.append(np.sqrt(np.mean(weekly_company_frame['Change']**2)))
        labels.append(company.strip())
    except:
        print "Problem parsing %s" % company.strip()

for i in range(1,7):
    change_metric = [daily * (weekly ** (1./i)) for daily, weekly in zip(daily_change_measure, weekly_change_measure)]
    ax = plt.subplot(3,2,i)
    plt.xlabel('Log of overall change metric')
    plt.ylabel('Volatility metric')
    plt.title('Weekly power %.2f' % float(1./i))
    plt.scatter(change_metric, volatility_measure, c = volatility_measure, cmap=plt.get_cmap('Spectral'), label='Weekly power %.2f' % float(1./i))
    for label, x, y in zip(labels, change_metric, volatility_measure):
        plt.annotate(label, (x,y), xytext=(0,8), textcoords='offset points')

    plt.gca().set_xscale('log')
    plt.gca().legend_ = None

plt.suptitle('Daily Volatility of Stocks versus Overall Trend')
plt.tight_layout()
plt.show()
savefig('StockVolatility.png')

OK – it’s not especially pretty but it gives us the following picture:

Stock Volatility

The 10 biggest US stocks – their daily & weekly change versus their daily volatility

You could also make a fair point that I’ve formatted it poorly. Open it up as big as your browser will let you and you’ll be able to see it nicely. Or, just run the code and create your own picture. It’s dead easy. I promise.

So what can we infer from that picture? I’m going to go ahead and say not a huge deal. Apple & Google have made some crazy ups and downs over the last year or two (mostly ups) and hence I’ve been forced to use a log plot. Other than that, we can see a cluster of the remaining companies with GE seeming the most stable all round. One point I’d like to make now: by defining my metrics in such a way that they don’t really match to anything in reality, I’ve lost the ability to understand exactly what I’ve plotted. What I’m trying to say, is that the log of an overall change metric isn’t an intuitive quantity. Usually, it’s a good idea to pick metrics that have a fairly firm grounding in the real world unless you’ve got a really good reason not to. In my case, my reason is that all I’m trying to do here is identify stocks in the upper left most corner – I don’t care what their values are yet.

I’d also like to make the point here that for this data set, the change of power associated with the weekly metric seems to make no difference. I put it there to express the idea that we’re likely to want a different weighting on the daily and weekly variability depending on how often we want to trade the stock. As I’m hoping to trade multiple times daily, the daily variability is more important to me than the weekly variability (hence my choice of fractional powers of the weekly variable). If you’re looking at trading less regularly, change your parameters accordingly.

Now I’m going to go out on a limb and say that, when looking for daily volatility, the biggest companies in America aren’t the place to go looking. I’m sure that the algorithmic trading people are all over this kind of trade with fancy-pants C++ code designed to execute multiple trades/second. To do this at a reasonably large scale (and to overcome transaction/infrastructure costs) I’m going to say those guys will play with these big companies where a purchase of £1 million+ of shares isn’t going to be such a big deal. Playing in those markets must be the equivalent of going barracuda fishing with a thumb tack and a tie. I think we should start our search towards the lower market caps and work our way up until we’ve got a few hopefuls.

volatility_measure = []
daily_change_measure = []
weekly_change_measure = []
labels = []

for company in company_information[company_information['MarketCap'] > 10000000].sort(['MarketCap']).head(25)['Symbol']:
    try:
        company_frame = DataReader(company.strip(), 'yahoo', datetime(2013,1,1), datetime.now().date())
        company_frame['Volatility'] = (company_frame['High'] - company_frame['Low'])/company_frame['Close']
        volatility_measure.append(company_frame['Volatility'].mean())
        company_frame['Daily_Change'] = company_frame['Close'].diff()
        daily_change_measure.append(np.sqrt(np.mean(company_frame['Daily_Change']**2)))
        ## Take every 5th row
        weekly_company_frame = company_frame[::5]
        weekly_company_frame['Change'] = weekly_company_frame['Close'].diff()
        weekly_change_measure.append(np.sqrt(np.mean(weekly_company_frame['Change']**2)))
        labels.append(company.strip())
    except:
        print "Problem parsing %s" % company.strip()

for i in range(1,7):
    change_metric = [daily * (weekly ** (1./i)) for daily, weekly in zip(daily_change_measure, weekly_change_measure)]
    ax = plt.subplot(3,2,i)
    plt.xlabel('Log of overall change metric')
    plt.ylabel('Volatility metric')
    plt.title('Weekly power %.2f' % float(1./i))
    plt.scatter(change_metric, volatility_measure, c = volatility_measure, cmap=plt.get_cmap('Spectral'), label='Weekly power %.2f' % float(1./i))
    for label, x, y in zip(labels, change_metric, volatility_measure):
        plt.annotate(label, (x,y), xytext=(0,8), textcoords='offset points')

    plt.gca().set_xscale('log')
    plt.gca().legend_ = None
    plt.autoscale(tight=True)

plt.suptitle('Daily Volatility of Stocks versus Overall Trend')
plt.tight_layout()
plt.show()
savefig('SmallerCompanies.png')
Small Companies Volatility

Volatility versus overall change for American companies with Market Caps > $10,000,000

Well bugger me. I don’t know about you but that looks pretty cool to me. Ignore all the gumph in the middle and look at the outliers – AMCO, GRVY, PRLS, DGLY and EEME. These are great examples of companies that are going to be either maximums or minimums for our given metric.

OK – I’m going to call it a night for now but just think of the possibilities open to us now! We can change our date ranges, play around with our metrics and loop through as many stocks as we can find symbols for (harder than you’d think!) until we’ve got a reasonable amount of stocks that we think are great candidates for regularly buying and selling.

Next time, I’ll finalize my list of stocks and hopefully start to gain an idea of when one of these stocks becomes a buy, and when it becomes a sell. That sounds fairly difficult actually. Ah well, that’s the fun.

Winter is coming.

Hadoop wordcount in Python

Hi all,

There’ll be a follow up post to this detailing how to run a mapreduce using Eclipse and Java but, as I’ve found myself in permissions hell in running that, I’ll go with the easy one first. Hadoop comes with a streaming jar that allows you to write your mappers and reducers in any language you like – just take input from stdin and output to stdout and you’re laughing. I’ll show you how to achieve this using Python.

Cluster Set-up

I’m going to assume you’ve followed a tutorial and have got Hadoop installed and working – if you haven’t, follow one (maybe even mine) and then come back here. Make sure you’ve got HDFS and Yarn running by executing the following commands:


su - hduser ## Only need this if you created a user called hduser to interface with Hadoop
cd /usr/local/hadoop ## If you followed the tutorial - otherwise, wherever your Hadoop home directory is
sbin/start-all.sh

Let’s see about putting a text file into HDFS for us to perform a word count on – I’m going to use The Count of Monte Cristo because it’s amazing. Honestly, get it read if you haven’t. It’s really really good. Anywho, enough fandom – this little command will download the whole book and stick it into whichever directory you happen to be in when you run the command.

cd ~
wget -O 'count_of_monte_cristo.txt' http://www.gutenberg.org/cache/epub/1184/pg1184.txt

Now we’ve got the file in our home directory (really, it was that easy, check it out if you don’t believe me – then read the book). However, that’s not in HDFS – we need to explicitly put it there. I’m going to create a directory in HDFS called input and then put the file in there:


/usr/local/hadoop/bin/hadoop fs -mkdir /input
/usr/local/hadoop/bin/hadoop fs -put ~/count_of_monte_cristo.txt /input

Has it worked?

Run this command:

/usr/local/hadoop/bin/hadoop fs -ls /input | grep count_of_monte_cristo | awk -F '\/' '{print $3}' | cut -d '.' -f1

If it returns a warning followed by ‘count_of_monte_cristo’ then you’re in the money. If you don’t understand the commands above, don’t worry. But do find out about them.

Otherwise, drop me a comment and I’ll see what can be done.

The Mapper

With this bit of code we’re going to go over every line in the text file and output the word and the number of instances of that word (one, for now) – easy does it:

#!/usr/bin/python

import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)

Save that file as something sensible at a sensible location – I’m going to use /home/hduser/word_mapper.py.
Also, make sure it’s executable:

chmod +x /home/hduser/word_mapper.py

Has it worked?
Run this little beaut’ of a command:

/usr/local/hadoop/bin/hadoop fs -cat /input/count_of_monte_cristo.txt | /home/hduser/word_mapper.py

If you’ve gone maverick and used a different filename or file location then that’s fine – just substitute that in where I’ve used /home/hduser/word_mapper.py. If you’ve gone maverick but don’t really know what you’re doing and don’t know what I’ve just said, that’s basically on you. Keep trooping on, you’ll get there.

Either way, don’t stop until that code outputs a stream of words followed by the number 1. Don’t worry – you can stop it by pressing Ctrl and C together.

The Reducer

We’ve got ourselves a nice stream of words. The Hadoop streaming jar will take care of the sorting for us (though we can override the default behaviour should we choose) so we just need to decide what to do with that stream of words. I’m going to propose this:

#!/usr/bin/python

import sys

current_word = None
current_count = 1

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if current_word:
        if word == current_word:
            current_count += int(count)
        else:
            print "%s\t%d" % (current_word, current_count)
            current_count = 1

    current_word = word

if current_count > 1:
    print "%s\t%d" % (current_word, current_count)

Follow the code through and try to think of the different cases it’s trying to catch. The first and last lines are tricky but play around with it – what happens if I just feed a file containing one word? What about a file with no duplicate words? Think about all the different cases and hopefully – the above code handles them all as you’d expect. If not, please let me know. That’d be real embarrassing.

Has it worked?

Make sure that file is executable:
chmod +x /home/hduser/word_reducer.py

Run this:

/usr/local/hadoop/bin/hadoop fs -cat /input/count_of_monte_cristo.txt | /home/hduser/word_mapper.py | head -n 100 | sort | /home/hduser/word_reducer.py

If everything’s gone to plan you should see a bunch of lines and associated counts – some of them should be non-one.

Super.

Run the Mapreduce

This is what you’ve been waiting for. Well – it’s what I’ve been waiting for at least. Run this command and you’ll basically be a Hadoop hero:

cd /usr/local/hadoop
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -files /home/hduser/word_mapper.py,/home/hduser/word_reducer.py -mapper /home/hduser/word_mapper.py -reducer /home/hduser/word_reducer.py -input /input/count_of_monte_cristo.txt -output /output

And off it goes – enjoy watching your mapreduce race through at what I’m sure is a barely tolerable crawl.

Has it worked?

Run this beauty:

/usr/local/hadoop/bin/hadoop fs -cat /output/part-00000

If you see a stream of likely looking results – you’re golden. If you want to get the file out of HDFS for inspection run something like this:

/usr/local/hadoop/bin/hadoop fs -get /output/part-00000 /home/hduser/monte_cristo_counted.txt
less /home/hduser/monte_cristo_counted.txt

Hope that’s worked well for you – it’s not the most glamorous of Hadoop jobs but it’s a good stepping stone. In a post coming to you soon I should be able to show you how to get Eclipse set up to run Hadoop jobs and give you an example or two in Java.

(Pseudo) Distributed Wishes

Installing Eclipse 4.2.3 (Kepler) on Ubuntu 14.04

Hi all,

If you’ve followed any of my other posts you’ll know I recently wiped my OS (something I do alarmingly regularly) – as such, I’m reinstalling a bunch of stuff. Given that I do this so often, it makes sense for me to have a repository of tutorials for doing so!

Today I’m on Eclipse – I’m by no means an Eclipse regular. I do most of my coding in Python and find Vim works well enough for most of what I need to do. RStudio for R, Vim/IPython for Python, but when I’m doing anything in Java (inc. Android stuff) I’ll go with Eclipse. Now installing Eclipse on Ubuntu is really easy – there’s a version in the software centre that’ll work just fine. However, if you want a more up to date version then there’s a degree of hacking about required. Let’s go:

Check you’ve got Java installed

Give this command a cheeky little run (from terminal – press Ctrl + shift + t to get terminal on the screen):


which java
java -version

If they both worked you should have seen the path to where your Java is installed (mine was /usr/bin/java) and the version of Java you have installed (1.7 OpenJDK for me). If both commands worked (gave you a path and a version) then excellent, carry on wayward son. It doesn’t matter if your results are different to mine, just keep trooping on. If they didn’t work, you’ll want to install Java before continuing. I’ll not deal with that here but I mention how to do so in my post on installing Hadoop on Ubuntu.

Download Eclipse

As with installing Hadoop, there are a couple of ways to do this. I’m going to advise heading on over to the Eclipse download link:

Eclipse

Pick the top version (Eclipse standard) and decide which bit version you want (32 or 64). If you don’t know which you should be using, I’d advise running the following command:

uname -a | awk '{print $12}'

If the output is x86_64 you’ll likely want 64 bit, i386 tells me you’re after 32 bit. If it says something else entirely then I’m flummoxed – have a Google around to find out what bit your Ubuntu installation is.

If you’ve not got a GUI then I’m going to decide you shouldn’t be installing Eclipse. Wget the .tar.gz file if that’s you but really, what are you doing? Actually, maybe you’re setting up a bunch of computers over SSH which will have monitors in the future but don’t now. OK – wget this link if you’re on 64 bit:

http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz&r=1

and this link if you’re not:

http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-standard-kepler-SR2-linux-gtk.tar.gz&r=1

Installing Eclipse

At this point I’m assuming you’ve got a tarred Eclipse in your downloads folder and you know where Java is installed.

Open up a terminal, head over to your downloads directory and untar the Eclipse file:


cd ~/Downloads
tar -xzf eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz

Next we’re going to put it into a more sensible directory and make it easily launchable from terminal:


sudo mv eclipse /usr/local/eclipse
sudo ln -s /usr/local/eclipse/eclipse /usr/bin/eclipse

At this point, if you could kindly run the following command from terminal:

eclipse

I’d expect you to see Eclipse pop up and for you to be able to start developing.

Estimating Pi using the Monte Carlo Method in Python

Hi all,

If you were especially upset then I’m sorry it’s been a while since I posted – I discovered Game of Thrones. In a later post I’ll chart the effect of Game of Thrones on overall productivity. I think there’ll be some unsurprising results. Anyway, I spend a reasonable amount of time on the train with my (oft abused) laptop each morning/evening; I don’t have the internet and I don’t have any textbooks so it’s basically a question of what I can work on given only the documentation on my computer and whatever I can remember about programming/maths/stuff.

I was having a think and remembered that you could estimate Pi using a Monte Carlo method and thought like that sounded like the sort of thing I should do. The logic is basically as follows:

Let’s draw a square of side length 2r and a circle centred exactly in the middle of the square with radius r. A well organised blogger would show you a diagram of this set-up, screw it, this is the code to do it and this is what it looks like:

import matplotlib.pyplot as plt
fig = plt.figure()
axis = fig.add_subplot(1,1,1)
circle = plt.Circle((0,0), 1)
axis.add_patch(circle)
axis.set_xlim([-1,1])
axis.set_ylim([-1,1])
axis.set_title('A Circle in a Square')
plt.show()
A Circle in a Square

A Circle in a Square

Brilliant – was it worth it? Probably not. But there you have it – with that set up we can now start the Monte Carlo bit. We’ll throw darts at that picture randomly; you’d expect the number of darts in the circle to be proportional to the area of the circle and the number of darts in the square to be proportional to the area of the square. Using that fact and the formulae for the areas of a circle and a square you can estimate Pi using the ratio of darts in the circle and in the square.

Sound good? It’s fairly easy to run this in Python and graph the output using Matplotlib. You’ll see I’ve used Object Oriented Python for this particular exercise, I don’t really know why. Especially because I had a chance to use inheritance and didn’t. Well done me. I’ve let everybody down. Anyway – this is the code I came up with and the graph below shows what I ended up with:

#!/usr/bin/python

import numpy as np
import math
import matplotlib.pyplot as plt

"""
Calculate pi using Monte-Carlo Simulation
"""

"""
First - the maths:
A circle has area Pi*r^2
A square wholly enclosing above circle has area 4r^2
If we randomly generate points in that square we'd expect the ratio of points in the square/points in the circle to equal the area of the square divided by the circle.
By that logic n_in_sq/n_in_cir = 4/Pi and so Pi = (4 * n_in_cir)/n_in_sq
"""

class pi_calculator(object):

    def __init__(self, radius, iterations):
        self.radius = radius
        self.iterations = iterations
        self.square = square(radius)
        self.circle = circle(radius)

    def scatter_points(self):
        for _ in range(self.iterations):
            point_x, point_y = ((2*self.radius) * np.random.random_sample(2)) - self.radius
            self.square.increment_point_count(point_x, point_y)
            self.circle.increment_point_count(point_x, point_y)

    def return_pi(self):
        return (4.0*self.circle.return_point_count())/self.square.return_point_count()

    def calculate_accuracy(self, calc_pi):
        absolute_error = math.pi - calc_pi
        percent_error = 100*(math.pi - calc_pi)/math.pi
        return (absolute_error, percent_error)

    def return_iterations(self):
        return self.iterations

class square(object):

    def __init__(self, radius):
        self.radius = radius
        self.lower_x = -radius
        self.lower_y = -radius
        self.upper_x = radius
        self.upper_y = radius
        self.point_count = 0


    def in_square(self, point_x, point_y):
        return (self.upper_x > point_x > self.lower_x) and (self.upper_y > point_y > self.lower_y)


    def increment_point_count(self, point_x, point_y, increment = 1):
        if self.in_square(point_x, point_y):
            self.point_count += increment

    def return_point_count(self):
        return self.point_count

class circle(object):

    def __init__(self, radius):
        self.radius = radius
        self.point_count = 0

    def in_circle(self, point_x, point_y):
        return point_x**2 + point_y**2 < self.radius**2

    def increment_point_count(self, point_x, point_y, increment=1):
        if self.in_circle(point_x, point_y):
            self.point_count += increment

    def return_point_count(self):
        return self.point_count


if __name__ == '__main__':
    axis_values = []
    pi_values = []
    absolute_error_values = []
    percent_error_values = []
    for _ in range(1,3000,30):
        pi_calc = pi_calculator(1, _)
        pi_calc.scatter_points()
        print "Number of iterations: %d    Accuracy: %.5f" % (pi_calc.return_iterations(), math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())[0]))
        axis_values.append(_)
        pi_values.append(pi_calc.return_pi())
        absolute_error_values.append(math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())[0]))
        percent_error_values.append(math.fabs(pi_calc.calculate_accuracy(pi_calc.return_pi())[1]))

    improvement_per_iteration = [absolute_error_values[index] - absolute_error_values[index-1] for index, value in enumerate(absolute_error_values) if index > 0]
    fig = plt.figure()
    fig.suptitle('Calculating Pi - Monte Carlo Method')
    ax1 = fig.add_subplot(2,2,1)
    ax2 = fig.add_subplot(2,2,2)
    ax3 = fig.add_subplot(2,2,3)
    ax4 = fig.add_subplot(2,2,4)
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    ax1.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax1.set_xlabel('Iterations')
    ax1.set_ylabel('Calculated value of Pi')
    ax1.plot(pi_values, 'k')
    ax1.plot([math.pi for entry in axis_values], 'r')
    ax2.set_ylabel('Absolute error')
    ax2.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax2.set_xlabel('Iterations')
    ax2.plot(absolute_error_values, 'k', label="Total Error")
    ax3.set_ylabel('Absolute percentage error (%)')
    ax3.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax3.set_xlabel('Iterations')
    ax3.plot(percent_error_values, 'k', label="Percent Error")
    ax4.set_ylabel('Absolute improvement per iteration')
    ax4.set_xticklabels([str(entry) for entry in axis_values[::len(axis_values)/5]], rotation=30, fontsize='small')
    ax4.set_xlabel('Iterations')
    ax4.plot(improvement_per_iteration, 'k', label="Absolute change")
    plt.savefig('pi_calculation.png')
    plt.show()


giving us:

Monte Carlo estimation of Pi - an Investigation

Monte Carlo estimation of Pi – an Investigation

I can only apologise for any dodgy code in there – in my defence, it was early in the morning. As you can see, it only takes around 100 ‘darts thrown at the board’ to start to see a reasonable value for Pi. I ran it up to about 10,000 iterations without hitting any significant calculation time. The fourth graph doesn’t really show anything interesting – I just couldn’t think of anything to put there.

That’ll do for now – I built something that’ll stream tweets on the Scottish Independence Referendum but don’t know what to do with it yet; there’ll likely be some sort of blog post. There’s a chance I’ll do some sentiment analysis but I’m not sure yet.

When you play the Game of Thrones, you win or you die.