Finding Topics in Harry Potter using K-Means Clustering.

I’ll open up with the money-shot – these are all of the clusters that I was able to find using the whole Harry Potter series, grouping the text by chapter:

all_clusters_subplot

Every cluster plotted separately.

That’s far too messy to be of any practical use so let’s have a look at a couple of those clusters in more detail:

Privet Drive Cluster

One of the clusters – a Dursley/Privet Drive heavy cluster!

and

Cluster-23

This is a pretty Griphook/goblin-heavy cluster centred on one of the storylines in book 7.

Hopefully that’s piqued your interest enough to keep scrolling and see how we got these clusters – and see the words that tie them all together! The code for generating these is on my Github (https://github.com/Kali89/HarryPotterClusters) and all the graphs and documents are contained there.

I went to a really interesting talk at PyData that was about Latent Dirichlet Allocation, a topic entirely new to me. I thought I’d try applying it to my favourite book series – Harry Potter. However, that didn’t happen… instead you get this: a heavy rip-off of an excellent post (http://brandonrose.org/clustering) that walks through how to cluster documents using a bunch of techniques, including K-Means.

Step 1

Get plain text copies of all the Harry Potter books and make sure they’re all formatted in roughly the same way. As is often the case, this step took bloody ages.

Step 2

I want a few different documents – more than 7 (the number of Harry Potter books, for the heathens out there) but substantially fewer than the number of sentences in all 7 books. Treating each chapter as a separate document seemed to make the most sense and so here I initialise everything I need and split my books into chapters.

So at this point we’ve nicely got ourselves a list of chapter titles and a list of the associated text (as a string).
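
If you’re not following along with the repo, the splitting step might look something like the sketch below – the directory layout, the file naming and the assumption that chapter headings sit on their own line starting with ‘CHAPTER’ are all mine rather than lifted from the actual code.

import glob
import re

chapter_titles = []
chapter_texts = []
chapter_books = []

for book_file in glob.glob('books/*.txt'):
    book_name = book_file.split('/')[-1].replace('.txt', '')
    with open(book_file) as f:
        text = f.read()
    # Split on chapter headings, keeping each heading as the chapter title.
    chunks = re.split(r'\n(CHAPTER [^\n]+)\n', text)
    for heading, body in zip(chunks[1::2], chunks[2::2]):
        chapter_titles.append(heading.strip().lower())
        chapter_texts.append(body)
        chapter_books.append(book_name)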

Step 3

Now we’ve got our chapters, we’re going to want to tokenise the text in them. Basically this means converting a string into discrete tokens – what we’d think of as words. The reason it’s got a fancy name, and the reason I’m being a bit careful about my terminology, is that tokenisation also takes care of things like punctuation and isn’t as simple as just splitting a string into separate words. Having said that, I’ve taken the path of least resistance and gone with a very simple tokenisation scheme. I’m also going to skip over the pain of utf-8 encoding/decoding/recoding – I’ve basically just dropped any character that I’ve found in the least bit complicated.
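
For what it’s worth, a path-of-least-resistance tokeniser along those lines might be nothing more than this (a sketch, not the exact function from the repo):

import re

def tokenise(text):
    # Lower-case everything, drop anything that isn't a plain letter, split on whitespace.
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    return text.split()

tokenise("'Yer a wizard, Harry,' said Hagrid.")
# ['yer', 'a', 'wizard', 'harry', 'said', 'hagrid']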

Step 4

Next I’m going to perform TF-IDF on my chapters. Here, I score each token by how many times it appears in a given document and by how many documents it appears in overall. So in this instance I’m looking to see how often a given word appears in a particular chapter and in how many chapters, throughout all 7 books, the token appears. This gives us an idea of how important/prominent a word is in a given chapter, taking into account how common the word is throughout all the chapters. As an example, ‘Harry’ is likely to feature a lot in a given chapter but is also likely to feature in every chapter and so probably isn’t especially important to any given chapter’s classification. ‘Nicolas Flamel’, however, is going to appear a reasonable amount in a few chapters but not at all in the rest. We therefore know that Nicolas Flamel is important in the chapters he does appear in.

Luckily, I don’t have to worry about the implementation of TF-IDF – sklearn has got it.

This gives me a large sparse matrix with the TF-IDF score for each word in each document as the entries. Pay close attention to the parameters I’ve passed along to TfidfVectorizer as they are ripe for the changing. Firstly, `max_df=0.75` is saying that I don’t care about words that appear in more than 75% of the chapters. `min_df=0.05` is saying that I don’t care about words that appear in fewer than 5% of chapters. You can see I passed along my tokeniser and that I’m using English stop-words (that is, I’m removing the most common English words). Finally, I’m generating n-grams between 1 and 4.
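
Pulled together, the vectoriser call with those parameters looks roughly like this – I’m reusing the tokenise function and chapter_texts list from the sketches above, so treat it as a sketch rather than a copy of the repo code:

from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(max_df=0.75, min_df=0.05,
                             stop_words='english',
                             tokenizer=tokenise,
                             ngram_range=(1, 4))
tfidf_matrix = vectoriser.fit_transform(chapter_texts)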

For those uninitiated with n-grams they’re basically a way of splitting text up into handy little chunks. As an example, 3-grams of the following sentence:

“This is not the greatest song in the world”

would be:

“This is not”, “is not the”, “not the greatest”, “the greatest song”, etc.

This allows me to pick out common phrases such as “Snape said” and “wizarding world”. Again, that’s a setting that is begging to be played about with.

Step 5
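
The clustering itself is only a few lines with sklearn – a minimal sketch is below. The number of clusters is my guess (the output clearly uses at least 24 of them) rather than a value taken from the repo, and on newer sklearn versions you’d use get_feature_names_out() instead of get_feature_names().

from sklearn.cluster import KMeans

num_clusters = 25
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

terms = vectoriser.get_feature_names()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for cluster in range(num_clusters):
    top_terms = [terms[i] for i in order_centroids[cluster, :10]]
    print("Cluster %d words: %s" % (cluster, ", ".join(top_terms)))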

Performing K-means clustering we get output like so:


Cluster 18 words:
Top words: maxime,madame maxime,karkaroff,madame,hagrid,cedric,moody,krum,champions,tournament

Chapter: the hungarian horntail, Book: gobletOfFire
Chapter: the goblet of fire, Book: gobletOfFire
Chapter: the four champions, Book: gobletOfFire
Chapter: beauxbatons and durmstrang, Book: gobletOfFire
Chapter: the beginning, Book: gobletOfFire
Chapter: the yule ball, Book: gobletOfFire

which is obviously quite a handy little cluster. It’s successfully managed to take chapters from only one book (given that we’ve not allowed K-means access to the book information, that’s a bit of a triumph). For those familiar with Harry Potter you’ll see this is a Triwizard Tournament-heavy cluster. Another example:


Cluster 10 words:
Top words: wormtail,cold voice,voldemort,lord,cauldron,riddle,cedric,man,master,faithful

Chapter: the riddle house, Book: gobletOfFire
Chapter: flesh, blood, and bone, Book: gobletOfFire
Chapter: the death eaters, Book: gobletOfFire


Again, all the chapters are from one book – I’m counting that as a result. Added bonus, they’re not sequential chapters! I happen to know (by being a massive Harry Potter geek) that these chapters are towards the start and end of book 4 and focus heavily on Voldemort and Peter Pettigrew.

So far so good and in fact I could stop here but wouldn’t it be nice to visualise those clusters so we can see the topics we’ve picked out graphically? Even if you said no then it doesn’t really matter. I’m still going to do it.

Step 6

First things first, let’s plot all of the chapters on one graph and colour code them with the book from which they came. It’s not a great way of visualising clusters but it is a great way of seeing how everything is laid out:

all_clusters

A messy picture of all the chapters projected into 2d space.

Again, this section is shamelessly copied from the aforementioned blog but ultimately we’re projecting the cosine distances between the chapters’ tf-idf vectors into 2-dimensional space. I declare a colour dictionary for each of the books and then rattle through the chapters plotting them.
I’m sure you’ll agree that’s far too messy for anybody to really do anything with. If you’re following along with the code you’ll see that next I generate the subplot figure shown at the top.
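
If you just want the gist of that projection step without opening the repo, it’s roughly this (reusing tfidf_matrix and chapter_books from the earlier sketches – the colours are placeholders):

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

# Cosine distance between every pair of chapters, squashed into 2d with multidimensional scaling.
distances = 1 - cosine_similarity(tfidf_matrix)
mds = MDS(n_components=2, dissimilarity='precomputed')
positions = mds.fit_transform(distances)

book_colours = {'philosophersStone': '#1b9e77', 'gobletOfFire': '#d95f02'}  # ...and so on for all 7 books
for (x, y), book in zip(positions, chapter_books):
    plt.scatter(x, y, c=book_colours.get(book, 'grey'))
plt.show()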

Finally, I create plots for each of the clusters – a few examples of which are:


Cluster 3 words:
Top words: umbridge,professor,dont,professor umbridge,snape,sirius,im,said hermione,harrys,youre

Chapter: the muggle born registration commission, Book: deathlyHallows
Chapter: the hogwarts high inquisitor, Book: orderOfThePhoenix
Chapter: o.w.l.s, Book: orderOfThePhoenix
Chapter: the second war begins, Book: orderOfThePhoenix
Chapter: educational decree number twenty four, Book: orderOfThePhoenix
Chapter: the centaur and the sneak, Book: orderOfThePhoenix
Chapter: percy and padfoot, Book: orderOfThePhoenix
Chapter: detention with dolores, Book: orderOfThePhoenix
Chapter: occlumency, Book: orderOfThePhoenix
Chapter: in the hogs head, Book: orderOfThePhoenix
Chapter: out of the fire, Book: orderOfThePhoenix
Chapter: professor umbridge, Book: orderOfThePhoenix
Chapter: fight and flight, Book: orderOfThePhoenix
Chapter: seen and unforeseen, Book: orderOfThePhoenix
Chapter: career advice, Book: orderOfThePhoenix
Chapter: snapes worst memory, Book: orderOfThePhoenix

generates:

Umbridge's cluster

Umbridge’s cluster

Everybody’s favourite professor – Dolores Umbridge!

And another:

Top words: hagrid,yeh,ter,said hagrid,professor,said hermione,malfoy,o,professor trelawney,trelawney

Chapter: professor trelawney's prediction, Book: prisonerOfAzkaban
Chapter: talons and tea leaves, Book: prisonerOfAzkaban
Chapter: the firebolt, Book: prisonerOfAzkaban
Chapter: diagon alley, Book: philosophersStone
Chapter: the forbidden forest, Book: philosophersStone
Chapter: the keeper of the keys, Book: philosophersStone
Chapter: norbert the norwegian ridgeback, Book: philosophersStone
Chapter: hagrids tale, Book: orderOfThePhoenix
Chapter: the eye of the snake, Book: orderOfThePhoenix
Chapter: grawp, Book: orderOfThePhoenix
Chapter: hermione's helping hand, Book: halfBloodPrince
Chapter: rita skeeter's scoop, Book: gobletOfFire


and visually:

Grawp's brother

Keeper of the Keys

The anti-Umbridge – it’s Hagrid’s cluster!

I’ll stop the random copy/pasting of the clusters and stick them all on my Github – I think you’ve got the idea! All in all I’m pretty happy with how this has worked but I am very dependent on individual character names. I tried just looking at 2-grams but it usually just gave me “ said” with a few exceptions (‘wizarding world’, ‘said softly’, ‘death eater’, ‘godrics hollow’). I’ve also put almost zero effort into formatting the images – you know roughly what they’re meant to look like: having the pictures look good is an exercise I’m leaving to the reader’s imagination.

There’s loads more stuff I could do but I’m going to eat a chicken and go swimming so it’ll have to wait.

All was well.


Emailing multiple inline images in Python

Hi all,

Yet again, sorry for the lack of blog posts. In my defence, I’ve been keeping busy what with the Mining of Massive Data Sets course on Coursera and a few Kaggle problems amongst other distractions. I legit have at least double-digit subscriber numbers so to those people, thanks! Ignoring all that, today I’m going to share with you something I had to puzzle out myself due to lack of information online about it. In sharing, hopefully the next poor schmuck who has to do this will just be able to copy what I’ve done.

We’ll look at building a fairly robust email; one which will feature multiple inline images and multiple attachments. We’ll do the whole thing using Gmail because it’s good, it’s easy to use and everybody can create an account. The original version I wrote used Outlook and integrated Windows security but not everybody has that so Gmail it is. As a bit of background, I used this as part of a reporting pipeline that queries the Google Analytics API. I grab a whole bunch of data, filter and arrange it, combine it with another data source and produce some tables/graphs. These get saved down into a directory every day and my emailer will come along, pick up all the relevant images and hey presto – you’ve got a fairly cool reporting pipeline.

First things first, this is how you send lots of basic plain-text emails.


import smtplib

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('smtp.gmail.com', 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg)
    mailServer.quit()

for i in range(100):
    send_email('Hello everybody', gmail_user, gmail_pwd, ['test_recipient@email.com'])

Would the above code send 100 of the same email to the same person? Yes. Does that have the potential to be abused for perhaps-hilarious purposes? Of course. I’ll leave that up to you.

Next, let’s take a fairly easy step and attach a few files.


from email.mime.multipart import MIMEMultipart
from email.header         import Header
import smtplib
from email.MIMEBase import MIMEBase
from email import Encoders
import os

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def attach_file(filename):
    part = MIMEBase('application', 'octect-stream')
    part.set_payload(open(filename, 'rb').read())
    Encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment; filename=%s' % os.path.basename(filename))
    return part

def generate_email(gmail_user, to_list, data_path):
    msg = MIMEMultipart('related')
    msg['Subject'] = Header(u'Test Attachment Email', 'utf-8')
    msg['From'] = gmail_user
    msg['To'] = ','.join(to_list)
    msg.attach(attach_file(data_path))
    return msg

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('smtp.gmail.com', 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg.as_string())
    mailServer.quit()

email_msg = generate_email(gmail_user, ['recipient@email.com'], 'test_data.txt')
send_email(email_msg, gmail_user, gmail_pwd, ['recipient@email.com'])

All you need to do to allow multiple attachments is call the msg.attach(attach_file()) method a few times – well, as many times as you’d like attachments, ideally.

Glorious – so we can email multiple people with multiple attachments. You could argue at this point that, if we can attach the images we’re looking to send then surely we don’t need to create inline images and so we can cease this madness. If you can argue that point successfully then you’re better at arguing than me. Something to do with Blackberry compatibility, looking nicer and “can’t you just not argue and do it for once – is it possible or is it not?”

So we’ll be embedding images to the above email – the way to do so is…


import cgi
import uuid
from email.mime.multipart import MIMEMultipart
from email.mime.text      import MIMEText
from email.mime.image     import MIMEImage
from email.header         import Header
import os
import smtplib
from email.MIMEBase import MIMEBase
from email import Encoders

with open('credentials.csv', 'rb') as f:
    gmail_user = f.readline().strip().split(',')[1]
    gmail_pwd = f.readline().strip().split(',')[1]

def attach_image(img_dict):
    with open(img_dict['path'], 'rb') as image_file:
        msg_image = MIMEImage(image_file.read(), name=os.path.basename(img_dict['path']))
    msg_image.add_header('Content-ID', '<{}>'.format(img_dict['cid']))
    return msg_image

def attach_file(filename):
    part = MIMEBase('application', 'octect-stream')
    part.set_payload(open(filename, 'rb').read())
    Encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment; filename=%s' % os.path.basename(filename))
    return part

def generate_email(gmail_user, to_list, data_path_1, data_path_2,img1,img2):
    msg = MIMEMultipart('related')
    msg['Subject'] = Header(u'Images and Words', 'utf-8')
    msg['From'] = gmail_user
    msg['To'] = ','.join(to_list)
    msg_alternative = MIMEMultipart('alternative')
    msg_text = MIMEText(u'Image not working - maybe next time', 'plain', 'utf-8')
    msg_alternative.attach(msg_text)
    msg.attach(msg_alternative)
    msg_html = u'<h1>Some images coming up</h1>'
    msg_html += u'<h3>Image 1</h3><div dir="ltr">''<img src="cid:{cid}" alt="{alt}"><br></div>'.format(alt=cgi.escape(img1['title'], quote=True), **img1)
    msg_html += u'<h3>Image 2</h3><div dir="ltr">''<img src="cid:{cid}" alt="{alt}"><br></div>'.format(alt=cgi.escape(img2['title'], quote=True), **img2)
    msg_html = MIMEText(msg_html, 'html', 'utf-8')
    msg_alternative.attach(msg_html)
    msg.attach(attach_image(img1))
    msg.attach(attach_image(img2))
    msg.attach(attach_file(data_path_1))
    msg.attach(attach_file(data_path_2))
    return msg

def send_email(msg, gmail_user, gmail_pwd, to_list):
    mailServer = smtplib.SMTP('smtp.gmail.com', 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to_list, msg.as_string())
    mailServer.quit()

img1 = dict(title = 'Image 1', path = 'test_image_1.png', cid = str(uuid.uuid4()))
img2 = dict(title = 'Image 2', path = 'test_image_2.png', cid = str(uuid.uuid4()))

email_msg = generate_email(gmail_user, ['recipient@email.com'], 'test_data.txt', 'test_data_2.txt', img1, img2)
send_email(email_msg, gmail_user, gmail_pwd, ['recipient@email.com'])

So all I need to do is make sure I’m downloading the right images and data files every day and putting them in the correct directory. Ordinarily I’d do that using wget on Unix but as I’ve done the rest in Python I’ll include the download in the script. This way the script should work irrespective of OS – a handy little bonus.
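
For completeness, the download step needn’t be anything fancier than this – Python 2’s urllib to match the rest of the code, placeholder URLs, and file names matching the ones used above:

import urllib

images_to_fetch = {
    'http://example.com/todays_graph.png': 'test_image_1.png',
    'http://example.com/todays_table.png': 'test_image_2.png',
}

# Grab today's images into the directory the emailer reads from.
for url, filename in images_to_fetch.items():
    urllib.urlretrieve(url, filename)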

Get that run and you’ll see a lovely set of inline images, all the attachments as they were and we’re done. Nothing too complicated I hope and we’ve made a nice little ’email image aggregation program’ thingy.

As a possible extension I think this could be better used to annoy/amuse and so you could create an ’email digest’ – make something that automatically goes on a few websites, downloads an image and then sticks them all inline in an email. It’ll also download a few data files and attach them. Let’s say I really like the NASA Astronomy picture of the day (I do), the XKCD comic (I do), the Google doodle (I do) and a CSV of the week’s weather forecast (I do not). I could easily set up a bash script (using wget) and a cronjob that would get the images/csv files each day and put them into a directory. Equally, if we wanted to play nicely with Windows as well as *nix, we could use Python to download the images. I’m not going to build this as it’s not really a problem that needs solving (as far as I see it).

Peace out yo.

Building a Recommendation Service on AWS with Mahout

Hi all,

Those (two) of my regular readers will know I frequently curse my fool-of-a-took laptop. With this in mind I’ve been meaning to write a tutorial on working with Amazon’s web services for when you don’t have the hardware you need. This isn’t that post but I will assume you’ve already got a cluster to hand with all the bits and bobs you need installed.

If you don’t, don’t worry about it (for now) – I’ll do a post on how to get one of those set up at a later date. For now I’d like to talk about collaborative filtering. This is one of the most prevalent techniques for generating recommendations and with good reason – it’s pretty good. It’s quite a common sense method:

I know that person A likes items 1, 2 and 3 but not item 4.
I know that person B likes item 4 and 3 but not 1 or 2.
I know that person C likes item 1 but not 2, 3 or 4.
I know that person D doesn’t like item 4 – which of the items 1, 2 or 3 should I recommend them?

I’m sure most people reading this will say we should recommend them item 1, then probably item 2 then 3? Certainly item 1 with highest preference. Now if you deconstruct your thought process in coming up with that recommendation I’d wager (not a lot) that it’d be akin to the collaborative filtering algorithm.

Based on the behaviour of people A, B and C we can start to get an idea about which items are rated favourably by the same people versus those that aren’t. User-based collaborative filtering is about finding similar users to person D and looking at what they liked. Item-based collaborative filtering is about finding similar items to the ones person D has rated positively and recommending those.
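
To make that concrete, here’s a toy item-based version of the A/B/C/D example above – nothing like what Mahout actually does under the hood, and the +1/-1/0 encoding of likes/dislikes/unrated is my own choice:

import numpy as np

# Rows are people A-D, columns are items 1-4: +1 = likes, -1 = doesn't like, 0 = unrated.
ratings = np.array([
    [ 1,  1,  1, -1],   # person A
    [-1, -1,  1,  1],   # person B
    [ 1, -1, -1, -1],   # person C
    [ 0,  0,  0, -1],   # person D - has only rated item 4
], dtype=float)

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Item-based scoring for person D: weight each unrated item's similarity to item 4
# by D's (negative) rating of item 4.
item_vectors = ratings.T
scores = {}
for item in range(3):                      # items 1, 2 and 3 are unrated by D
    similarity = cosine(item_vectors[item], item_vectors[3])
    scores[item + 1] = similarity * ratings[3, 3]

for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("Item %d: %.3f" % (item, score))
# Item 1 scores highest, then item 2, then item 3 - matching the intuition above.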

There are lots of nice articles on collaborative filtering – more often than not when it comes to machine learning I end up recommending Andrew Ng material and this is no exception – check out his machine learning course on Coursera. There’s a whole lecture on collaborative filtering and it’s boss. If my explanation wasn’t enough (if you’re serious about this sort of thing, it shouldn’t be) check out his lectures on the subject and read around on the mathematical details of it all. Make sure you polish up on your linear algebra – you’ll have very large sparse matrices heading your way!

Also, please note that this whole post is just a re-imagining of this post by Amazon: Recommendations using EMR – I liked it and thought I’d give it a go myself with a different data set and technologies I was more familiar with. Now on with the show…

The Data Set

I’m going to be using a publicly available data set containing user ratings of books gathered on Amazon – dataset available here: Book Data Set

There are three files in there – one with all the details of the books, one with all the details of the users and finally one with all the user ratings. Don’t worry about downloading them yet if you’re following this as a tutorial/example – we’ll stick that bit in a script later on.

The Cluster Set-Up

Given that I’m on the free tier and I don’t use AWS at work (we’ve got a nice cluster that handles all our data just fine thank you) I’m a bit ‘over-compute’ happy at the moment and so I’m going to use a 5 node cluster (1 master node and 4 slaves). This isn’t really necessary but then nor is this as a piece of work so I’m not that bothered by practicality at this point. The benefit of using Amazon’s Elastic Mapreduce is that the nodes all come with Hadoop (& Hive/Pig/Mahout/whatever) installed and set up so you don’t have to bother with any of that.

Building a Service

I want to demonstrate how easy it is to turn this sort of thing into a recommendations service (instead of a one-off bit of analysis). Now I’m going to stress that I’m setting this up as an offline batch machine learning method and that there are a few steps I’d change if I were actually setting this up as a real, ‘production’, service.

I’m going to use a Redis database because I’ve got a reasonable amount of memory on these boxes (Redis is an in-memory key-value store), because Redis is really really quick and because I’m used to using Redis with Python. Redis will store the output of my collaborative filtering and I’ll stick a lightweight web framework (Flask) on top of it. This will allow a client to retrieve recommendations for a given user as well as delete them or overwrite them. I’ve decided to add the latter two functions because we don’t live in a perfect world.

I’ll also write a script that’ll update the books, users and ratings files and stick it on a cronjob to run every day. That way, by copying this method you’ll end up with a reasonably scalable, reasonably fast and reasonably up-to-date recommendation service. It’s not the best one out there (heck, it might be the worst) but it goes to show how simple these things are to set up given all the tools that are already written…

The Code

Firstly, we’ll create the cluster using the Elastic Mapreduce CLI:

./elastic-mapreduce --create --alive --name recommendation-service --num-instances 5 --master-instance-type m1.large --slave-instance-type m2.xlarge --ami-version 3.1 --ssh

The --ssh means that once the cluster has been created and all the necessary bits and bobs installed, we ssh straight into the master node.

Extracting the data set

wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
unzip BX-CSV-Dump.zip
sed '1d;' BX-Book-Ratings.csv | tr ';' ',' | tr -d '"' > Book-Ratings.csv

Unfortunately for us, we’ve got a bit of work to do – the ‘out of the box’ Mahout recommendation algorithm requires user and book ids to be integers and ISBNs are not. To get around this, we use this nifty little script:

#!/usr/bin/python

import sys
import re

isbn_dictionary = {}

counter = 0

with open('Book-Ratings.csv', 'rb') as f:
    with open('New-Book-Ratings.csv', 'w') as g:
        with open('Book-Mappings.csv', 'w') as h:
            for line in f:
                try:
                    user, book, rating = line.split(',')
                except:
                    continue
                try:
                    index = isbn_dictionary[book]
                except:
                    isbn_dictionary[book] = counter
                    index = counter
                    h.write("%s,%d\n" % (book, counter))
                    counter += 1
                g.write("%d,%d,%d\n" % (int(user), index, int(rating)))

Now we’ve got data in the format our Mahout recommender is expecting, let’s put it in HDFS and run our algorithm:
hadoop fs -put New-Book-Ratings.csv /New-Book-Ratings.csv
mahout recommenditembased --input /New-Book-Ratings.csv --output recommendations --numRecommendations 10 --similarityClassname SIMILARITY_COSINE

Now we play the waiting game…
Actually, while that’s running, open up another SSH session to the master node and run these commands:

sudo easy_install redis flask
wget http://download.redis.io/releases/redis-2.8.13.tar.gz
tar xzf redis-2.8.13.tar.gz
cd redis-2.8.13
make
./src/redis-server &
cd

Now Redis is installed and running and we’ve got the necessary Python libraries, we can run the following script to load our recommendations into the database…

import redis
import os

isbn_dictionary = {}

with open('Book-Mappings.csv', 'rb') as f:
    for line in f:
        isbn, counter = line.split(',')
        isbn_dictionary[int(counter)] = isbn

r = redis.StrictRedis(host='localhost', port=6379, db=0)

p = os.popen('hadoop fs -cat recommendations/part*')

for recommendation in p:
    user, recommendations = recommendation.split('\t')
    recommendations = [entry.split(':')[0] for entry in recommendations.replace('[', '').replace(']','').split(',')]
    recommendations = [isbn_dictionary[int(entry)] for entry in recommendations]
    r.set(int(user), recommendations)

Now we’ve populated our database with all the recommendations we generated with our Mahout job, we’ll set something up (called webserver.py) to return these items to a client (as well as allow modifications to be made).


import redis
from flask import Flask, request
r = redis.StrictRedis(host='localhost', port=6379, db=0)

app = Flask(__name__)

@app.route('/user/<user_id>', methods = ['GET', 'POST', 'PUT', 'DELETE'])
def restful_api(user_id):
    if request.method == 'GET':
        recommendations = r.get(int(user_id))
        return recommendations if recommendations else "User %d does not exist" % int(user_id)
    elif request.method == 'POST':
        if request.headers['Content-Type'] == 'application/json':
            r.set(int(user_id), request.json)
            return "Successfully set recommendations for user %d" % int(user_id)
        else:
            return "Require JSON POST"
    elif request.method == 'DELETE':
        if r.exists(int(user_id)):
            r.delete(int(user_id))
            return "Successfully deleted recommendations for user %d" % int(user_id)
        else:
            return "User %d does not exist" % int(user_id)
    elif request.method == 'PUT':
        if request.headers['Content-Type'] == 'application/json':
            r.set(int(user_id), request.json)
            return "Successfully set recommendations for user %d" % int(user_id)
        else:
            return "Require JSON put"
    else:
        return "No idea what you've done here, but don't do it again"

if __name__ == '__main__':
    app.run()

Now we’ve built our client-facing recommendations service, let’s ‘deploy’ it and give it a test:

python webserver.py &
curl -X GET localhost:5000/user/277965
curl -X DELETE localhost:5000/user/277965
curl -X GET localhost:5000/user/277965

Summary

We’ve not done anything that tricky but in a fairly short time we’ve managed to build something that’d be at least part way useful to a small business somewhere looking for recommendations for their users. It’s not a great leap from this to a genuine production-ready system – you don’t need 5 Hadoop nodes running all the time for a 20 minute Mahout job and I’m uncomfortable using Flask and local files for a serious system. As such, you’d likely put together a couple of nodes, each with their own copy of the recommendations set in memory. One of them would create a cluster, get any new data and generate the data set as often as you saw fit – communicating it to the other node. Finally, you’d add a few more actions to the server and make sure it could be called by whichever clients need it (and block everybody else).

As I say, nothing earth-shattering but a solid base on which to build a recommendations system.

I am become death.

 

Bayesian Testing of Conversion Rate

Hi all,

I’m on a train to the Fringe festival and I’ve managed to bag myself first class tickets! What this means is that I’ve got about 3 hours to kill, unlimited free wine (who knew that was a thing!?) and a fairly flaky internet connection. All of that goes together to make now the perfect time to share with you a bit of work I’ve been up to recently.

Firstly, massive props to this particular book: Probabilistic Programming & Bayesian Methods for Hackers

It’s really really good and this testing is just a minor reworking of one of the examples in the book – they say imitation is the sincerest form of flattery no?

Anyway, enough of the wine-induced babbling, on with the Bayesian testing…

The setting of the scene

What I’ll be looking to do here is identify changes in conversion rate (it can be any kind of conversion – I’ll use orders/visitors but it doesn’t really matter) in historical data. The reason I say historical data is this: it’s a good technique for identifying and visualizing changes but it isn’t as good as running a legitimate A/B test. Ideally, we’d implement the change (whatever it is) for one group of customers and not for the rest of the customers and measure relative performance. However, let’s assume that, for whatever reason, you’re looking over a bunch of (conversion) data trying to identify a change.

I’ll artificially generate data so you can run similar examples and get an idea of what our data source looks like:

#!/usr/bin/python

import random
import numpy as np

total_points = 50

## Generate a list containing the number of trials
trials = [random.randint(20,100) for _ in range(total_points)]
results = [np.random.binomial(value, 0.4) if index > total_points/2 else np.random.binomial(value, 0.3) for index, value in enumerate(trials)]

for trial, result in zip(trials, results):
    print "%d\t%d" % (trial, result)

where in our example, trials is going to be the number of visitors on successive days and results is going to be the number of orders.

Pipe that into a file (python generate_conversions.py > conversion_data.txt) and we’ve got ourselves a nice list of conversion data with a change in the rate at some point (halfway in this example) through the data.

The Bayesian Bit

So the idea behind Bayesian statistics revolves around priors and posteriors – your prior is going to be a distribution that represents your (shocker) prior ideas about the result. You’re going to update this prior hypothesis with data as you get it and when you do so the resulting distribution is called the posterior. This is great for a number of reasons – my favourite two are:

  1. You get to set a prior that influences the final outcome. If I’ve got fairly strong ideas about what the conversion rate is (say I think it very likely lies between 80% and 90%) I can reflect that in my prior. When we start any frequentist (non-Bayesian) calculations we assume every probability between 0% and 100% is equally likely – generally that’s not the case.
  2. The final result is a distribution. Distributions are great for visualization, allow for easy comparisons against other distributions and are really easy to show uncertainty on. Do away with all the talk of p-values and show a graph with two ‘conversion distributions’ on and you’re on to a winner.

Shut up and Calculate

Code first – then commentary:

#!/usr/bin/python

import pymc as pm
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import datetime

basket_list = []
conversion_list = []

with open('conversion_data.txt', 'rb') as f:
    for line in f:
        baskets, conversions = line.strip().split()
        basket_list.append(int(baskets))
        conversion_list.append(int(conversions))

n_percent_list = len(basket_list)

uniform_one_samples = []
uniform_two_samples = []
tau_samples = []
uniform_one = pm.Uniform('uniform_one', 0, 1)
uniform_two = pm.Uniform('uniform_two', 0, 1)

tau = pm.DiscreteUniform('tau', lower=0, upper=n_percent_list)

@pm.deterministic
def lambda_(tau = tau, uniform_one = uniform_one, uniform_two = uniform_two):
    out = np.zeros(n_percent_list)
    out[:tau] = uniform_one
    out[tau:] = uniform_two
    return out

observations = pm.Binomial('obs', n=basket_list, p=lambda_, value=conversion_list, observed=True)

model = pm.Model([observations, uniform_one, uniform_two, tau])
mcmc = pm.MCMC(model)
mcmc.sample(10000, 2500, 1)

uniform_one_samples = mcmc.trace('uniform_one')[:]
uniform_two_samples = mcmc.trace('uniform_two')[:]
tau_samples = mcmc.trace('tau')[:]

N = tau_samples.shape[0]

conversion_rate = np.zeros(n_percent_list)
for day in range(0, n_percent_list):
    ix = day < tau_samples
    conversion_rate[day] = (uniform_one_samples[ix].sum() + uniform_two_samples[~ix].sum()) / N

plt.subplot(411)
plt.tight_layout()

plt.plot(range(n_percent_list), 100.*conversion_rate, lw=4, color='#E24A33', label='Expected conversion rate')
plt.xlim([0, n_percent_list])
plt.ylim([0,100])
plt.xlabel('Day')
plt.title("Changes in the probability of conversion")
plt.ylabel('Expected conversion rate')
plt.bar(np.arange(n_percent_list), [100.0*convert/basket for convert,basket in zip(conversion_list, basket_list)], color='#348ABD', alpha=0.65, label='Observed conversion rate')
plt.legend(loc='upper right')

ax = plt.subplot(412)

#
plt.hist(uniform_one_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of initial conversion probability", color='#A60628', normed=True)
plt.legend(loc='upper left')

plt.xlim([min(uniform_one_samples),max(uniform_one_samples)])
plt.xlabel("Probability of Conversion")
#
ax = plt.subplot(413)
#
plt.hist(uniform_two_samples, histtype='stepfilled', bins=50, alpha=0.85, label="posterior of later conversion probability", color='#7A68A6', normed=True)
plt.legend(loc='upper left')
plt.xlim([min(uniform_two_samples),max(uniform_two_samples)])
plt.xlabel("Probability of Conversion")
#
#
plt.subplot(414)
w = 1.0/tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_percent_list, alpha=1, label="posterior of conversion change date", color='#467821', weights=w, rwidth=2.)
plt.xticks(np.arange(n_percent_list))
#
plt.legend(loc='upper left')
plt.xlim([0, n_percent_list])
plt.xlabel("Day of change")
plt.ylabel('P(change occurred)')
#


plt.show()

That’s a fair bit of code – what I’m doing isn’t that complicated. After importing, defining and grabbing the conversion data that we generated in the first script, we declare uniform_one and uniform_two.

We’re going to say that our data ultimately comes from a Binomial distribution where people convert with probability p. However, we are going to say that at some time (given in this example as tau) the value of p changed. p is the conversion probability of a Binomial distribution, so saying that p changes at some point means we’re saying our conversion rate changed.

Firstly, I declare my priors – I say that uniform_one and uniform_two, the conversion before and after the change we’re trying to identify, are both drawn from a uniform distribution that runs between 0 and 1. I’m saying I think the conversion is equally likely to be anywhere between 0 and 100%. If you’re doing this properly I’m sure you can improve on this (I’d advise looking into the Beta distribution) but with sufficient data the uniform distribution should work fine.
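
If you did fancy something more opinionated, swapping in a Beta prior is a one-line change per group with pymc – the 2 and 8 below are purely illustrative (they centre the prior belief around a 20% conversion rate) and aren’t values I’ve actually used:

import pymc as pm

# Drop-in replacements for the two Uniform priors in the script above.
uniform_one = pm.Beta('uniform_one', alpha=2, beta=8)
uniform_two = pm.Beta('uniform_two', alpha=2, beta=8)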

So, at this point we’ve got the values of p before and after the change each described by a uniform distribution – our first pair of priors.

Next, I declare tau – the time at which our conversion rate changed. I’m using a discrete uniform distribution and saying that it could have happened with equal probability at any time between the first day and the last day. If you’re looking to identify when the biggest change in conversion occurred I’d advise using a discrete uniform distribution. If you’re trying to identify the effect a change on a particular day had then you can be more creative with this prior.

Now we declare lambda_. For the eagle-eyed pseudo-coders amongst you, I’m sure it’s clear, but this represents our complete prior belief about the conversion rate (the p of a Binomial). We say we think the value of p follows uniform_one up until tau and then switches to uniform_two.

Then we let pymc take over – we specify that our distribution is a Binomial with as many trials as we generated in the first script, and with the probabilities given in lambda_. We tell pymc that we’ve observed the values we’re passing to it and create our distribution. I’m not going to go into the wizardry that follows but check out chapter 3 of the book I mentioned at the start if you’re interested. What we end up with (when we grab the traces) is a few thousand samples (10,000 minus the burn-in) drawn from the posterior distributions we’ve generated for uniform_one, uniform_two and tau.

The final calculation builds, for each day of the data set, a vector of Booleans specifying whether that day comes before each of the samples drawn from the tau distribution. Then we go along that boolean vector and build a sum – where the day number was less than the tau sample we take whatever is in the uniform_one trace at that point, otherwise we take what is in the uniform_two trace. Finally, dividing by the number of samples we drew gives us the expected conversion rate for that day (the expected value of lambda_).

I get that that section is a bit complicated but if you get it:
a.) I’ve not had enough wine
b.) You’ve got the idea behind the testing so far.

Finally, there’s a lot of Matplotlib plotting stuff all leading us towards the following beautiful picture:

https://dogdogfish.files.wordpress.com/2014/08/simple_bayesian_testing.png

Basic Bayesian Conversion Testing/Change Recognition

I hope that shows more clearly what I’ve been trying to explain. Firstly, the top graph shows our expected and our observed conversions. The bars show our ‘actual conversion rates’ on given days – these were the values we artificially generated. The red line is the output of our model discussed in the final point above. The second and third graphs show the distributions of the conversion rate from before the change and after the change. If you’re looking to paint a compelling picture, I’d advise putting them both on the same axis, drawing them as densities (not histograms) and adding lines at the 95th percentiles. Finally, the bottom graph shows us when the conversion rate likely changed. As you can see, it’s fairly heavily concentrated around the middle (good) but there’s quite a wide range of days around there where it could be.

I’ve found that in practice, it makes sense to run these simulations multiple times (they don’t take very long) and concatenate the results together when working out when a change occurred. I’ll leave that as an exercise given how long this post already is – there’s a picture of what it looks like at the bottom – the posteriors of the conversion probabilities get a bit messy.

So there we have it – it’s been a bit of a slog and the woman who brings the wine round the train has started laughing when she pours me a new glass (surely a troubling sign). However, we’re now able to run Bayesian conversion tests like a boss and visualize them in a really funky kind of way.

As always, comments are welcome, questions too.

The Lannisters send their regards.

p.s.

Picture when run with multiple iterations…

Multiple Iteration Bayesian Conversion Plots

Multiple Iterations of Bayesian Conversion Testing.

Drugs prescribed by the NHS

Hi All,

As always, apologies for the length of time between posts – think this is a record. I was working on stuff I’m not allowed to share (work stuff, then Kaggle stuff) and got minorly derailed by Game of Thrones. Finished all the TV series, and one of the books. And Life of Pi. And I’m working my way through the Book Thief. So yeah, a tad derailed.

However, I’ve got a lazy Sunday and I saw an advert on TV that really annoyed me. It was basically along the lines of ‘did your doctor mess up? Why not sue them?’ and I think that it takes us towards where the Americans are. Which, when it comes to healthcare, is not where we want to go. Given that in the UK, if you’re being treated by a doctor it’s almost certainly an NHS doctor, it seems pretty sucky that people are being encouraged to sue them. Puts the price up for everybody and makes people less likely to become doctors and whatnot. Anyway, this isn’t a political rant blog – I thought I’d have a look and see what data is available on the NHS to see if I could show the effects of an increase in litigation on the standard of medical care provided/costs (and ideally, contrast with America). In short, I couldn’t. The data that I wanted just wasn’t there. However, there was data on the drugs that NHS GPs prescribed (at least between Jan and June in 2012) by practice, with cost data. That seemed pretty interesting and so here we are.

I’ve been playing around with Google’s coLaboratory (check it out here) and would have loved to use this to do this particular bit of analysis. However, after playing round with it for a bit and struggling with external documents, finding my Google docs and various libraries I wanted I’ve decided to leave it for a while until coLaboratory becomes a tad more mature. Lots of promise there and with good Google Analytics API integration we could transform analytics practices at my company. Certainly one to watch.

Anyway, without that I’ll think of the questions I want to answer first and then pick my tool. Firstly, the data…

The data

As ever, data.gov.uk to the rescue – head here and download yourself a nice copy of the data:

http://data.gov.uk/dataset/gp-practice-prescribing-data

Once you’ve got a copy of all the data and the list of practices in England (I used the most recent one) we’re ready to start asking some questions of the data…

The Analysis

First off, let’s pick something simple – which drug costs the NHS most in each of the months of our test set, and overall.

Total Drug Cost

This actually seems to lend itself to the mapreduce paradigm pretty nicely – the mapper seems pretty unnecessary and as I’ve not got a cluster to hand (and this doesn’t warrant me spinning one up with AWS) I’ll just write a quick reducer in Python and use the Unix sort. For what it’s worth, I think I might write something about spinning up a quick cluster on AWS in the near future. It’s a fairly useful skill to have and given the increasing reluctance of my computer to perform the most basic of tasks, I think a fair bit of my future data analysis might have to happen in the cloud. Anyway, this is what my command will look like once I’m done:

awk 'FNR>1{print}' T201202PDP\ IEXT.CSV | sort -t , -k5 | python spending_reducer.py

Nothing too complicated there. I’m ignoring the first line of each file (that’s the awk command), then sorting the whole file based on the 5th column (the drug name) and piping the whole thing into the following reducer:

#!/usr/bin/python

import sys

current_drug = None
current_cost = 0.0

for line in sys.stdin:
    authority, trust, practice, drug_code, drug_name, number_bought, ni_cost, act_cost, period = line.strip().split(',')
    act_cost = float(act_cost.lstrip('0'))
    if drug_name == current_drug:
        current_cost += act_cost
    else:
        if current_drug:
            print current_drug + "\t" + str(current_cost)
        current_drug = drug_name
        current_cost = act_cost

print current_drug + "\t" + str(current_cost)

Nothing too difficult there – we’re just keeping track of the drug we’re on and adding up as we go. If I had no fear for the amount of RAM I had, we could’ve accomplished the same thing without the laborious sorting step using associative arrays in awk (there’s a quick Python sketch of that idea towards the end of this post). But for the next stage – the total across all 6 months – I am very afraid (RAM wise) and so we can run the same query with a bit of wildcarding:

awk 'FNR>1{print}' T20120[1-6]P*.CSV | sort -t , -k5 | python spending_reducer.py > drugs_by_spend.txt

This runs the same calculation over every file matching that wildcard pattern (all the data between 2012/01 and 2012/06). Note that this’ll take a little time – that sort is reasonably expensive over the 4 or so gigabytes of data we’ve got. Now we’ll pull together a few graphs and for this I think we’ll use R…

my_frame <- data.frame(read.csv('drugs_by_spend.txt', header=F, sep="\t", col.names=c("Drug", "Spend")))
my_frame <- my_frame[order(-my_frame$Spend),]
head(my_frame)

The top 5 drugs, by spend, in the first half of 2012 were:

  1. Fluticasone Propionate (Inh)
  2. Atorvastatin
  3. Enteral Nutrition
  4. Pregabalin
  5. Budesonide

Between them these cost: £738,620,789

Wow – that’s a hell of a lot. In 6 months, the actual costs of these drugs alone was more than £700million!

The total cost of drugs prescribed in that time period: £2,785,540,256

So I think we can surmise that lots of money is spent by the NHS – OK, I suppose that’s no surprise. For the non-doctors amongst us (that includes me), that list features two anti-asthmatic treatments, i.e. those inhalers that I’m sure a lot of you have (also includes me). A quick Wikipedia search shows that Pfizer holds the patent to at least a couple of those drugs (or at least did – the Atorvastatin one has expired) – it might be interesting to stick the patent holder next to these drugs. Maybe later…
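
Before wrapping up – earlier on I mentioned that you could skip the sort entirely by keeping running totals in an associative array. The Python-dictionary version of that would look something like the sketch below (same comma-separated field layout as the reducer above, a made-up script name, and the RAM warning very much still stands for the bigger runs):

#!/usr/bin/python
# Usage: awk 'FNR>1{print}' T201202PDP\ IEXT.CSV | python spending_no_sort.py
import sys
from collections import defaultdict

totals = defaultdict(float)

for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) < 9:
        continue
    drug_name, act_cost = fields[4], fields[7]
    try:
        totals[drug_name] += float(act_cost)
    except ValueError:
        continue   # header rows or malformed lines

for drug, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print "%s\t%.2f" % (drug, cost)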

Right, there’s lots that we could do here but I’m going to call it a day for now. In the future I think I’ll try and get more months of data and then start to look at evolving trends. To do that, I’ll use an AWS cluster and so will write something about using that.

Until then.

Kaggle Digit Recognizer Problem with Adaboost

Hi everybody,

Hi Dr Nick. But enough of that – today I’m going to be working through a Kaggle problem. For those of you who don’t know Kaggle, I can’t advise in favour of it strongly enough. It’s a great place to have a go at using real data sets to apply various machine learning techniques. There’s a leaderboard, discussions on methods and some none-too-shabby prizes. I’ll come clean at this point – I’m not a natural salesperson.

I believe there’s something of a taboo against posting solutions/methods for Kaggle – however, I think I’m good to write about a method of solving this particular problem. The digit recognizer problem seems to be a rolling competition with a bunch of already published results and a few training classes on how to solve it. Let me know if you think this is overstepping the mark.

So, the problem:
Given a big set (42,000 examples) of labelled training data (28 x 28 black and white images) of handwritten digits (0-9), are we able to correctly identify other (identically dimensioned) handwritten digits?

There are a whole bunch of ways of doing this and the method I’ve had best success with is Support Vector Machines (using LibSVM). I may post an example of how to run that for this particular example but today I’d like to look at Adaboost (using the MultiBoost package)…

Until fairly recently I was entirely ignorant of Adaboost – I came across it on a different Kaggle problem (the Higgs one). There, a number of ‘out of the box’ methods were showcased – the most successful of which was Adaboost. A bit of reading on Adaboost suggests that it’s a fairly well-regarded, and successful method of performing a range of machine learning tasks. It’s also sometimes cited as being the best ‘out of the box’ (not specifically designed for the task at hand) algorithm in machine learning.

My current intuition on Adaboost is that it’s basically a ‘rule of thumb’ algorithm. It takes a lot of very simple decision boundaries and uses them to create a more complicated decision space. I say rule of thumb because I imagine a car mechanic or a doctor trying to diagnose a fault. The patient presents with symptom x, that makes a whole bunch of things less likely. However, if the patient falls into this age bucket and this ethnic group, some of the previously discounted things become more likely. I don’t know if that sort of explanation helps you but I quite like it. Basically, you create a simple rule that’s more often right than wrong. However, you can then update it with as many exceptions as you’ve got other bits of data. I think that’s a lot how the human decision-making process goes.
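
If you want to see that ‘rule of thumb’ idea in action without setting up MultiBoost, sklearn’s AdaBoost implementation gives you the flavour in a few lines. This is purely an illustrative sketch rather than the pipeline I actually used, and on older sklearn versions train_test_split lives in sklearn.cross_validation instead:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# train.csv as described: first column is the label, the rest are the 784 pixel values.
data = np.loadtxt('train.csv', delimiter=',', skiprows=1)
labels, pixels = data[:, 0], data[:, 1:]

train_X, valid_X, train_y, valid_y = train_test_split(pixels, labels, test_size=0.25)

# The default base estimator is a depth-1 decision tree - exactly the sort of simple
# decision boundary ("stump") that gets boosted into something stronger.
model = AdaBoostClassifier(n_estimators=200)
model.fit(train_X, train_y)
print("Validation accuracy: %.3f" % model.score(valid_X, valid_y))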

Anyway, all this talking isn’t getting us closer to a juicy set of predictions. Mad props to whoever first generated this particular procedure for the Higgs problem – I’ve shamelessly ripped it off, only making changes where necessary for this problem.

#!/usr/bin/python

import random
import csv
import subprocess
import numpy as np

def DataToArff(dataset, labels, header, title, filename):
    """
    With this data structure we're able to turn an arbitrary string of data into a .arff file
    These files allow us to import the data into Multiboost or Weka (amongst other machine learning libraries).
    """
    with open(filename + ".arff", 'w') as f:
        f.write('@RELATION ' + title + '\n\n')
        for feature in header:
            f.write('@ATTRIBUTE ' + feature + ' NUMERIC\n')
        f.write('@ATTRIBUTE class {0,1,2,3,4,5,6,7,8,9}\n')
        f.write('\n@DATA\n')
        ## We could do this using all_data - however, we need the labels for further work
        ## Additionally, if the labels were numeric variables we'd be able to leave the rest of our work unchanged and handle them here
        for datarow, label in zip(dataset, labels):
            for value in datarow:
                f.write(str(value) + ',')
            f.write(str(label) + '\n')

all_data = list(csv.reader(open('train.csv', 'rb'), delimiter=','))
header = np.array(all_data[0][1:])
dataset = np.array([map(float, row[1:]) for row in all_data[1:]])
(numpoints, numfeatures) = dataset.shape

# Labels on the first column of the line
labels = np.array([row[0] for row in all_data[1:]])

randomPermutation = random.sample(range(len(dataset)), len(dataset))
## If this breaks halfway through, we'll be glad to be able to load our random permutation
np.savetxt('randomPermutation.csv', randomPermutation, fmt='%d', delimiter=',')

## I'll change the proportion of the train set and see how we get on.
numpointsTrain = int(numpoints*0.75)
numpointsValidation = numpoints - numpointsTrain

## Because we've got a random permutation there's no problem taking slices of the total set to sort into train and validation
datasetTrain = dataset[randomPermutation[:numpointsTrain]]
datasetValidation = dataset[randomPermutation[numpointsTrain:]]

labelsTrain = labels[randomPermutation[:numpointsTrain]]
labelsValidation = labels[randomPermutation[numpointsTrain:]]

DataToArff(datasetTrain, labelsTrain, header, 'DigitsML_challenge_train', 'training')
DataToArff(datasetValidation, labelsValidation, header, 'DigitsML_challenge_validation', 'validation')

## Our Adaboost parameters are wholly contained in the relevant config files
p1 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'config.txt'])
p1.wait()
p2 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresValidation.txt'])
p2.wait()

testText = list(csv.reader(open('test.csv', 'rb'), delimiter=','))
datasetTest = np.array([map(float, row[:]) for row in testText[1:]])
labelsTest = np.repeat('0', len(testText) - 1)

DataToArff(datasetTest, labelsTest, header, 'Digit_challenge_test', 'test')

p3 = subprocess.Popen(['/home/matt/Downloads/multiboost-1.2.02-x64-linux', '--configfile', 'configScoresTest.txt'])
p3.wait()

testScoresText = list(csv.reader(open('scoresTest.txt', 'rb'), delimiter=','))
with open('submission.csv', 'w') as f:
    f.write('ImageId,Label\n')
    for index, entry in enumerate(testScoresText):
        ## Take the index of the maximum value for a given row - this is the most likely value
        f.write(str(index + 1) + "," + str(np.argmax(entry)) + '\n')

I’d like to think that that bit of code is reasonably transparent and clear on what it’s doing. If I’m wrong, a basic explanation:
1.) Randomly split the data into a training and validation set
2.) Create .arff files for both of these sets
3.) Run Multiboost (our Adaboost implementation) on the training set and validation set
4.) Using the files created from our train/validation Adaboost, get the test set and generate predictions (again using Multiboost)
5.) Generate a submission file in the required format

Easy does it. Now for the configuration files that I’m using:

config.txt

fileformat arff
verbose 2
learnertype TreeLearner
constant
seed 50
weightpolicy balanced
baselearnertype SingleStumpLearner 8
outputinfo results.dta e01w01auc
traintest training.arff validation.arff 5000
shypname shyp.xml

configScoresValidation.txt

posteriors validation.arff shyp.xml scoresValidation.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

configScoresTest.txt

posteriors test.arff shyp.xml scoresTest.txt 5000
fileformat arff
verbose 2
learnertype TreeLearner
baselearnertype SingleStumpLearner 8

I’ll be honest and say I’ve not really found the ideal set-up for this problem here. I’m able to get a score of around 0.965 using the settings here but if you look at the leaderboard you’ll see that’s not all that good. Certainly the LibSVM method performed much better (something like 0.99). Not to worry – it’s doing the right thing, generating good results and is another tool in our arsenal.

The World Cup may stymie my blog posts for a bit – then again, supporting England, it might only be 3 games.

Football’s coming home.

Bayesian AB Testing using Python

Hi all,

A reasonably big part of the job I do involves running AB tests and I think that’s usually something that falls under the remit of data scientists. Now I could go on and on about the difficulties around tagging tests, choosing test groups, choosing goals and the like. I won’t, but I’ll make at least one point. All of the aforementioned points are very important and are very non-trivial. If you don’t absolutely understand all of the steps a user might take, and how your test will handle these, work a bit harder on that. And outsourcing this problem to your AB testing framework provider…well, I’d not advise it.

Anyway, I recently read this blog post by the engineers over at Lyst and thought it was a really interesting idea. I’m really not a fan of p-values and think that a probability distribution is a much nicer way of communicating a result. With that in mind, the basic logic behind Lyst’s post (if you’ve not got the time/inclination to read it):

1.) You presumably have a reasonable idea about the distribution of whatever metric you’re plotting – for instance, you’re reasonably confident conversion % lies somewhere between 0 and 25% with the most likely value to be around 15% (as an example). You assume both test groups follow this distribution until you start getting data in to corroborate/contradict that.

Beta Distribution

Beta Distribution with alpha parameter 8 and beta parameter 40.

2.) On getting data in, you update your distribution to take into account these new data points – the distributions evolve with the new data.
3.) Your end result (see below) gives you a probability distribution for the conversion of both test groups.

Bayesian Test Result

The distribution of conversion for two (randomly generated) test datasets

I’d argue the above picture is much clearer than even the best explained set of p-values. It also really clearly lends itself to calculations along the lines of ‘What is the probability that test group A is better than test group B?’ or ‘how sure are you that test group A is 2% better than test group B?’
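
As a taste of how cheap that calculation can be: with a Beta prior and simple converted/didn’t-convert data, the posterior is just another Beta, so you can answer ‘what’s the probability A beats B?’ by sampling the closed-form posteriors directly – no MCMC required. The counts below are invented for illustration, and the 3/50 prior matches the one used in the full code further down:

import numpy as np

prior_alpha, prior_beta = 3, 50          # prior belief about conversion
a_conversions, a_visitors = 130, 1000    # test group A (made-up numbers)
b_conversions, b_visitors = 100, 1000    # test group B (made-up numbers)

# Beta prior + Binomial data gives a Beta posterior: Beta(alpha + successes, beta + failures).
samples_a = np.random.beta(prior_alpha + a_conversions,
                           prior_beta + a_visitors - a_conversions, 100000)
samples_b = np.random.beta(prior_alpha + b_conversions,
                           prior_beta + b_visitors - b_conversions, 100000)

print("P(A converts better than B): %.3f" % (samples_a > samples_b).mean())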

Enamoured with this new way of reporting test results, I figured I mayerswell build something to do so. Instead of writing something where the test set-up is tied in with the result plotter I wrote my plotter to take input from stdin.

First things first then, I want something to generate a stream of conversion events:

import random
import time

for _ in range(2000):
    group = 'test' if random.random() > 0.3 else 'control'
    if group == 'test':
        convert = 1 if random.random() < 0.16 else 0
    else:
        convert = 1 if random.random() < 0.10 else 0
    print '%s:%d' % (group, convert)

Easy does it – we’ll look at varying those numbers later. For the uninitiated, that’ll give us fairly uneven test groups with reasonably different conversion percentages.

Now for the plotter. I’ll confess, this is still a work in progress. It currently doesn’t assume anything about the test groups, including the names (taking those from the input stream). However, in future I’m hoping to extend the program to be able to perform multivariate Bayesian AB tests. If it appears messy in places, that’s either because I’m expecting the poor coding practices to lead to me having an easier time extending the code to allow multivariate testing, or because I’m a messy coder.

At this point, massive props to this book: it forms the basis of almost all of this code.

import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
import sys

results_dictionary = {}

## Store all our test results in memory - doesn't allow real-time updating and could get a bit serious if we've got a big set of results
for line in sys.stdin:
    if line == '':
        break
    group, conversion = line.strip().split(':')
    try:
        results_dictionary[group].append(int(conversion))
    except:
        results_dictionary[group] = [int(conversion)]

test_group_a, test_group_b = results_dictionary.keys()

## We'll run this twice, once with uniform prior and once with a beta prior
prior_dict = dict((group, pm.Uniform(group, 0, 1)) for group in results_dictionary.keys())
prior_dict_beta = dict((group, pm.Beta(group, 3, 50)) for group in results_dictionary.keys())

@pm.deterministic
def delta(p_A = prior_dict[test_group_a], p_B = prior_dict[test_group_b]):
    return p_A - p_B

@pm.deterministic
def beta_delta(p_A = prior_dict_beta[test_group_a], p_B = prior_dict_beta[test_group_b]):
    return p_A - p_B

## Bernoulli distribution with the events we've got
observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict[group], value=events, observed=True)) for group, events in results_dictionary.items())
beta_observations = dict((group, pm.Bernoulli('obs_%s' % str(group), prior_dict_beta[group], value=events, observed=True)) for group, events in results_dictionary.items())

## Markov chain Monte-Carlo methods - returning samples from our updated distributions
mcmc = pm.MCMC([prior_dict[test_group_a], prior_dict[test_group_b], delta, observations[test_group_a], observations[test_group_b]])
mcmc_beta = pm.MCMC([prior_dict_beta[test_group_a], prior_dict_beta[test_group_b], beta_delta, beta_observations[test_group_a], beta_observations[test_group_b]])
mcmc.sample(20000,1000)
mcmc_beta.sample(20000,1000)

## Grab all the samples we need
samples = dict((key, mcmc.trace(key)[:]) for key in results_dictionary.keys())
delta_samples = mcmc.trace('delta')[:]
beta_samples = dict((key, mcmc_beta.trace(key)[:]) for key in results_dictionary.keys())
beta_delta_samples = mcmc_beta.trace('beta_delta')[:]

## It's this easy to work out probabilities of success
prob_a_better = (delta_samples > 0).mean()
prob_a_better_beta = (beta_delta_samples > 0).mean()

### Plotting
ax = plt.subplot(321)
plt.hist(samples[test_group_a], histtype='stepfilled', bins=50, alpha=0.85, label='Uniform posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.suptitle('Posterior distributions of %s, %s, and $\Delta$ unknowns' % (test_group_a, test_group_b))
plt.title('Uniform posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(323)
plt.hist(samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Uniform posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(325)
plt.hist(delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Uniform posterior of $\Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.title('Uniform posterior of $\Delta$')
plt.autoscale()
plt.annotate('Probability %s \nis greater \nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better), (0,30))
#
ax = plt.subplot(322)
plt.hist(beta_samples[test_group_a], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_a, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_a)
plt.autoscale()
#
ax = plt.subplot(324)
plt.hist(beta_samples[test_group_b], histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of %s' % test_group_b, color='#A60628', normed=True)
plt.title('Beta posterior of %s' % test_group_b)
plt.autoscale()
#
ax = plt.subplot(326)
plt.hist(beta_delta_samples, histtype='stepfilled', bins=25, alpha=0.85, label='Beta posterior of $\Delta$', color='#A60628', normed=True)
plt.vlines(0, 0, 50, linestyle='--', color='black')
plt.autoscale()
plt.annotate('Probability %s \nis greater \nthan %s: %.2f' % (test_group_a, test_group_b, prob_a_better_beta), (0,30))
plt.title('Beta posterior of $\Delta$')
#
plt.tight_layout()
plt.show()

Giving us:

Bayesian AB Testing Graphs

Graphs comparing the conversion of one test group against another, using a Beta distribution and a Uniform distribution as a prior.

First things first, why are there 6 graphs? Well, realistically, given that this is designed to model website conversion, I know what the distribution is likely to look like. Therefore, I say that my initial priors are beta distributions with parameters alpha = 3 and beta = 50. However, I’m well aware that naysayers might quibble with the idea of making such strong assumptions before running the test. Alongside that prior, I’ve included the completely uninformative uniform prior. That basically says that the conversion is equally likely to fall anywhere between 0 and 100%. Another reason for doing this is to show what difference it makes – when we’re looking at fewer than 10 data points, you’ll see fairly big differences between the different prior assumptions. Increase the number of events up past 1000 and the two prior assumptions converge to the same value.

Additionally, we’ve fed in information about the test and control group – where has this delta come from and what’s it all about? That’s simply the difference between the test groups, as a probability distribution. How good is that? Really, that’s what people are interested in and, instead of making them compare two distributions and look at overlaps, we’ve done that for them and presented it as a probability distribution. Well done us.

Move along there.