This post is a continuation of Post 14, where we retrieved data from the web and stored it in a Python data structure. In this post we will query that data and extract more information from it.
2) Search Data
a) To check whether the time zone 'Pacific/Honolulu' is present in the data set, we first build a list of all the time zones:
>> tz = [record.setdefault('tz', '') for record in records]
Note that the setdefault method supplies the default value '' whenever the 'tz' key is absent from a record (it also inserts that key into the record as a side effect).
We can inspect the first 5 entries of this list:
>> tz[:5]
The time zone can now be looked up in the tz list:
>> 'Pacific/Honolulu' in tz
The index of the first record with 'Pacific/Honolulu' can be found with:
>> tz.index('Pacific/Honolulu')
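Putting the steps above together, here is a minimal runnable sketch. The three-record `records` list is made-up sample data standing in for the data set from Post 14:

```python
# A small sample mimicking the web-access records from Post 14;
# the real data set comes from that post.
records = [
    {'tz': 'America/New_York'},
    {},  # some records lack the 'tz' key entirely
    {'tz': 'Pacific/Honolulu'},
]

# setdefault returns the existing value, or inserts '' and returns it if absent
tz = [record.setdefault('tz', '') for record in records]

print(tz[:5])                        # first few time zones
print('Pacific/Honolulu' in tz)      # membership test -> True
print(tz.index('Pacific/Honolulu'))  # index of first occurrence -> 2
```

Note that list.index returns only the first match and raises a ValueError when the value is absent, which is why the membership test is worth running first.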
b) If we are interested in the northernmost location currently accessing the data, we first need to sort the data. A natural attempt is to import itemgetter from the operator module and sort on the latitude-longitude key:
>> from operator import itemgetter
>> records.sort(key=itemgetter('ll'))
But since not every record contains the latitude-longitude key, 'll', this sort fails with a KeyError. To correct this, we need to supply a default value for the 'll' key whenever it is absent from a record. Let's define the default latitude and longitude as:
>> wh = [38.8977, -77.0366]
Now, when we sort, we use a lambda function instead so that the default is filled in:
>> records.sort(key = lambda record: record.setdefault('ll', wh))
We can inspect the first 20 sorted coordinate pairs with:
>> print [record['ll'] for record in records][:20]
To see how many records carry the default value (wh), we can locate its first and last occurrences in the sorted list. First, the index of the first occurrence:
>> ll = [record['ll'] for record in records]
>> ll.index(wh)
To find the last occurrence of wh in the sorted list, we reverse the list and call the index method again:
>> ll[::-1].index(wh)
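The sort-with-default and first/last-occurrence steps can be sketched as follows; the four-record `records` list and its coordinates are invented sample data:

```python
# Sample records, some missing the 'll' (latitude, longitude) key.
wh = [38.8977, -77.0366]  # default coordinates used when 'll' is missing
records = [
    {'ll': [21.3069, -157.8583]},
    {},                            # no 'll' key -> gets the default
    {'ll': [61.2181, -149.9003]},
    {},
]

# setdefault inside the key function fills in wh for records missing 'll',
# so the sort never raises a KeyError. Lists compare elementwise, so this
# sorts by latitude first.
records.sort(key=lambda record: record.setdefault('ll', wh))

ll = [record['ll'] for record in records]
first = ll.index(wh)                     # first occurrence of the default
last = len(ll) - 1 - ll[::-1].index(wh)  # last occurrence, via the reversed list
print(ll)
print(first, last, last - first + 1)     # count of default entries
```

Since the list is sorted, all copies of wh sit in one contiguous run, so `last - first + 1` counts how many records received the default.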
Now that the list is sorted, finding the extrema is easy: the most southern and most northern web accessors sit at the two ends of the list.
The most southern accessor in the data set:
>> print json.dumps(records[:1], indent=2)
The most northern accessor in the data set:
>> print json.dumps(records[-1:], indent=2)
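As a small self-contained sketch of the extrema lookup (the three records and their time zones are invented, and are assumed to be already sorted by 'll'):

```python
import json

# Records already sorted by 'll'; with latitude ascending, the first record
# is the southernmost accessor and the last is the northernmost.
records = [
    {'ll': [-33.8688, 151.2093], 'tz': 'Australia/Sydney'},
    {'ll': [40.7128, -74.0060], 'tz': 'America/New_York'},
    {'ll': [64.1466, -21.9426], 'tz': 'Atlantic/Reykjavik'},
]

print(json.dumps(records[:1], indent=2))   # southernmost accessor
print(json.dumps(records[-1:], indent=2))  # northernmost accessor
```

Slicing with `[:1]` and `[-1:]` (rather than indexing with `[0]` and `[-1]`) keeps the result a list, which json.dumps renders as a JSON array.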
c) Suppose we want to find out what our most popular links are. First we import the Counter class from the collections module, then build a list of all the URLs in the data set:
>> from collections import Counter
>> urls = [record['u'] for record in records if 'u' in record]
Now we can construct our counter object from the urls list as:
>> c = Counter(urls)
This gives a list of (url, count) 2-tuples for all of the URLs, sorted from most to least common:
>> most_common = c.most_common()
We can compare the number of raw URL hits with the number of distinct URLs:
>> len(urls)
>> len(most_common)
So the most common URL is:
>> most_common[0][0]
And to find the number of times it was visited:
>> most_common[0][1]
And the least common URL is:
>> most_common[-1:]
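The whole Counter workflow can be run end to end on a few made-up records (the example.com URLs are placeholders, not real data):

```python
from collections import Counter

# Invented sample records; only some carry the 'u' (URL) key.
records = [
    {'u': 'http://example.com/a'},
    {'u': 'http://example.com/b'},
    {'u': 'http://example.com/a'},
    {},  # no URL in this record
    {'u': 'http://example.com/a'},
]

# Keep only records that actually have a URL
urls = [record['u'] for record in records if 'u' in record]
c = Counter(urls)
most_common = c.most_common()  # (url, count) pairs, most frequent first

print(len(urls), len(most_common))  # 4 raw hits vs 2 distinct URLs
print(most_common[0])               # ('http://example.com/a', 3)
print(most_common[-1:])             # least common URL, as a one-element list
```

Counter.most_common() with no argument returns every distinct element, so `most_common[0]` and `most_common[-1:]` give the most and least frequently visited URLs respectively.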