This post is a continuation of Posts 14 and 15, where we retrieved data from the web, stored it in Python data structures, and then queried it. In this post we will plot that data and visualize it with a few Python charts.
3) Plot Data
Write the commands below to load the data into a Python list and a Counter, as described in the previous post. Also run %pylab inline so that the plots render inline in the notebook.
import json
from collections import Counter

file = 'usagov_data.json'  # one JSON record per line
with open(file, 'rb') as f:
    data = f.readlines()
records = map(json.loads, data)  # parse each line into a dict
hashes = [record['h'] for record in records if 'h' in record]  # keep only records with a URL hash
c = Counter(hashes)  # count occurrences of each hash
%pylab inline
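Before moving on, it's worth a quick sanity check that the Counter is populated (the exact output depends on your copy of the data):
>> print len(hashes)
>> print c.most_common(3)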
The first thing we want to do now is repack the (hash, count) tuples into separate sequences. We do this with a combination of three techniques that are commonly used together in idiomatic Python.
First, we unpack the elements of the most_common() list as arguments to a function call, using the * operator:
>> top_urls, top_counts = zip(*c.most_common())
The zip function grabs the i-th element of each input tuple and groups them together, starting from index 0. Since each (hash, count) tuple contains exactly two entries, we get back a list of two tuples.
Finally, we use unpacking again, this time on the left-hand side of the assignment, to bind those two tuples to top_urls and top_counts.
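To see the idiom in isolation, here is a tiny self-contained sketch; the (hash, count) pairs are hypothetical, not from the dataset:
>> pairs = [('a1b2c3', 10), ('d4e5f6', 7), ('g7h8i9', 3)]
>> urls, counts = zip(*pairs)
>> print urls    # ('a1b2c3', 'd4e5f6', 'g7h8i9')
>> print counts  # (10, 7, 3)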
We can now look at the top URL hashes using:
>> print top_urls[:5]
and their associated top counts:
>> print top_counts[:5]
Let's make a histogram showing the frequency distribution of the URLs.
We can import matplotlib.pyplot, if it has not been imported already, using:
>> import matplotlib.pyplot as plt
Then we can plot a histogram of all the counts by passing in the whole sequence:
>> plt.hist(top_counts)
To see the first 100 values in 10 bins:
>> plt.hist(top_counts[:100], bins=10);
To see the distribution of the counts beyond the top 100:
>> plt.hist(top_counts[100:], bins=10);
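The histograms above come out unlabeled; as an optional touch (a sketch, assuming the same top_counts as above), you can add a title and axis labels:
>> plt.hist(top_counts[:100], bins=10)
>> plt.title('Frequency distribution of the top 100 URL hashes')
>> plt.xlabel('Click count')
>> plt.ylabel('Number of URLs')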
Bar chart showing 5 most common links and their frequency:
>> plt.bar(range(5), top_counts[:5], align='center')
>> plt.xticks(range(5), top_urls[:5], rotation=90)
>> plt.title('Top 5 URL hashes')
>> plt.xlabel('URL hash')
>> plt.ylabel('Frequency')
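Since the hash labels are rotated 90 degrees, they can get clipped at the bottom of the figure; if your matplotlib version supports it, tight_layout makes room for them:
>> plt.tight_layout()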
Pie chart of the same:
>> plt.pie(top_counts[:5], labels=top_urls[:5], autopct='%1.1f%%')
>> plt.title('Top 5 URL hashes in Pie Chart');
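Depending on your matplotlib version, the pie may be drawn as an ellipse; forcing an equal aspect ratio keeps it circular:
>> plt.axis('equal')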