Python: Most frequent words in a string

This post is my practice on getting the most frequent words in a string in Python. Here is the code

#import the necessary packages
from collections import Counter
import pandas as pd
#open the text file. Here in this case, it is named "text"
with open('text.txt') as fin:
    counter = Counter(fin.read().strip().split())

numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]

counter.most_common() is the function to get all the words and their respective count from the string. If you put a number, let’s say 10, in the brackets, it means that you want to get only the first 10 elements of the array. Here is how counter.most_common(10) looks:

[(‘to’, 27), (‘the’, 26), (‘of’, 20), (‘and’, 19), (‘in’, 16), (‘a’, 15), (‘is’, 12), (‘for’, 11), (‘it’, 9), (‘new’, 9)]

numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]

The above code is to get all the words and sort them in the descending order according to the words’ frequency. top15 is to get the first 15 elements of the sorted array. Here is how the top15 looks:

[(‘to’, 27), (‘the’, 26), (‘of’, 20), (‘and’, 19), (‘in’, 16), (‘a’, 15), (‘is’, 12), (‘for’, 11), (‘it’, 9), (‘new’, 9), (‘are’, 9), (‘their’, 8), (‘video’, 7), (‘has’, 7), (‘by’, 7)]

After we get the top 15, we should put them into a data frame so that data processing can be easier. Here is how

text = [] #an array for the words
number = [] #an array for the frequency
for i in top15: #iterate through the top 15
   text.append(i[0])
   number.append(i[1])

#create the data frame
rawdata = {'words': text, 'frequency': number}
df = pd.DataFrame(rawdata, columns = ['words', 'frequency'])

This is the final data frame

    words  frequency
0      to         27
1     the         26
2      of         20
3     and         19
4      in         16
5       a         15
6      is         12
7     for         11
8      it          9
9     new          9
10    are          9
11  their          8
12  video          7
13    has          7
14     by          7

Data Analytics: Klay Thompson’s Performance

This is my data analytics practice by analyzing Klay Thompson’s performance so far in the 2018-2019 season up to 22nd Dec 2018. Klay Thompson is the shooting guard of Golden State Warriors. He is a three time world champion and I am a big fan of his playing style and deadly explosiveness. This post features my findings by analyzing his shot data this season from NBA website here. My code is available on my personal GitHub for your reference.

Findings

  • Klay made about 44% of his shots so far
  • Klay’s successful shots’ average distance to the basket is 15.92m
  • He made more shots in the first half than he did in the second half
  • 67% of Klay’s made shots are two pointers. The rest are three pointers
  • Living up to his name, Klay’s favorite play type is “catch and shoot jump shot”
  • Regarding Klay’s made two-pointers, below is the distribution by distance. He seems to be more effective within 10 feet of the basket and from 15 to 20 feet.
  • In regards to Klay’s three pointers, the distribution by distance to the basket is as follows: (no surprise that the farther he is from the basket, the less lethal he is)

  • As one of the best three point shooters in the league, Klay seems to be equally good throughout the periods of a game, except for the first quarter

Technical lessons I learned from this practice:Pie chart in Python with Matplot

Pie chart in Python

Let’s say you have two variables: TwoPT and ThreePT that stand for the shooting percentage of Klay’s two and three pointers respectively. Here is the code to draw a pie chart

labels = '2PT Field Goal', '3PT Field Goal'
sizes = [TwoPT, ThreePT]
colors = ['green', 'gold']
explode = (0, 0)  # explode 1st slice
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
 
plt.axis('equal')
plt.title("Klay's made shots by shot types")
plt.show()

Nunique function

Imagine if you have a data frame as the following

If you want to count how many events (whether it’s a missed or made shot) by Klay by period, instead of using SQL, the alternative is to use Nunique function. An advantage of using the aggregate is that the outcome is automatically a data frame. The code is as follows:

periodstats = madeshot.groupby(by='period', as_index=False).agg({"game_date": pd.Series.nunique, 'time_remaining': pd.Series.nunique}) #the data frame's name is madeshot. Pd is the abbreviation of Pandas

The result is:

Sort and get the top 10 of a data frame

If your data frame looks like the one below and your intention is to get the top 10 records in terms of “times”, what will you do?


The code I used is pretty straightforward and simple. (The data frame’s name is shotdistance

shotdistance = shotdistance.sort_values(by='times', ascending=False)
shotdistance_top10 = shotdistance.head(10)

Categorize a data frame by bins

If you want to categorize Klay’s shot by distance in terms of “less than 10 feet”, “from 10 to 15 feet” and “from 15 to 20 feet”, for instance, what will you do? The code to turn the distance to categories is:

df1 = pd.cut(TwoPTtype['shot_distance'], bins=[0, 10, 15, 20, 23], include_lowest=True, labels=['Less than 10 feet', 'From 10 to 15 feet', 'From 15 to 20 feet', 'From 20 to 23 feet'])

#pd stands for Pandas
#TwoPTtype is the name of the data frame in question

The result is:

If you merge that data frame with the frequencies in the original data frame:

df1 = pd.cut(TwoPTtype['shot_distance'], bins=[0, 10, 15, 20, 23], include_lowest=True, labels=['Less than 10 feet', 'From 10 to 15 feet', 'From 15 to 20 feet', 'From 20 to 23 feet'])

newdf = pd.concat([df1, TwoPTtype['times']], axis=1)

Checking if a point is in a polygon in Python and Javascript

This post will detail how to determine if a point with a longitude and a latitude resides within a polygon – an area comprised of multiple coordinates. There are two ways to accomplish the task, either in Python or in Javascript. 

In Javacript – Turf package

In order to accomplish the task in Javacript, I used turf package. The first thing to do is to add this line to the HTML where the task takes place

<script src='https://npmcdn.com/@turf/turf/turf.min.js'></script>

Then, this is how the task is completed:

point = turf.point([longitude,latitude]); #the point in question

polygon = turf.polygon(#the polygon in question); #the polygon should come in the form of an array of points

turf.booleanPointInPolygon(point,poly) #return either True or False 

In Python

First, install the following:

from shapely.geometry import shape, Point

Then, this is how the task is accomplished:

coord = Point([longitude, latitude]) #this is the point in question

polygon = shape(property['geometry']) #the geometry part of a typical GEOJSON

polygon.contains(coord) #should return either True or False

Hope it is helpful to whoever is looking for an answer to this issue 

Geocoding & reverse-geocoding with Google Map

Since I don’t think the documentation from Google is very clear on how to actually implement the Geocoding API, especially to a newbie like I am, here are the steps to geocode an address or reverse-geocode a set of coordinates and to get the key you want

Reverse-geocoding

Here is a short piece of code in Python

import googlemaps
gmaps = googlemaps.Client(key='<enter your Google API key here>’)

result = gmaps.reverse_geocode((40.714224, -73.961452))
print(result)

If you run that code, what you will get is a very long json. It is important to use an online tool to have a good understanding of your code’s result. Here is how it looks

full result

Go through each number to see what keys interest you. Let’s say if you are interested in [0] in the json. Here is the code to get it:

import googlemaps
gmaps = googlemaps.Client(key='<enter your Google API key here>’)

result = gmaps.reverse_geocode((40.714224, -73.961452))
print(result[0])

Here is how it looks

partial result

Continue to parse through the JSON to get the key you want such as result[0][‘geometry’]…

Geocoding

import googlemaps
gmaps = googlemaps.Client(key='<enter your Google API key here>’)

result = gmaps.reverse_geocode(‘1600 Amphitheatre Parkway, Mountain View, CA’)
print(result)

The rest is similar to Reverse-coding

Concatenate strings to form a full address

If you have address, city, state, zip code and countries in different columns, one quick way to form a full address is

df[‘fulladdress’] = df[[‘Address’, ‘City’, ‘State’, ‘Zip’]].apply(lambda x: ‘ ‘.join(x.astype(str)), axis=1)

Df stands for the current data frame. df[‘fulladdress’] refers to the creation of a new column called “full address”. The whole line of code means to add values on the same row in the address, city, state and zip columns to form a full address accordingly.

Hope it helps

Turn Excel into GEOJSON

My Capstone project requires me to turn Excel in GEOJSON for mapping purposes. Handling and preparing data is 50% of the whole process. I’d like to share what I did step by step, hoping that it will be useful to some who are learning the ropes like I am.

I am using Python as the programming language of choice and Pycharm as IDE. Create a folder on your computer and store the Excel file in question in it. Open the folder in Python and create a new Python file. Here is how it looks on my Pycharm

ExceltoGEOJSON_1

Before we move forward, it’s important to know what a GEOJSON is and how it looks. This website offers a great review on GEOJSON. In terms of structure, a GEOJSON file looks like this

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [0, 0]
      },
      "properties": {
        "name": "null island"
      }
    }
  ]
}

I am pretty sure we can have as many variables under “properties” as we want. The rest should be standard to be followed as possible.

This is how the Excel file looks. Notice that there are coordinates available already. In the future, I’ll work on geocoding an address into coordinates.

ExceltoGEOJSON_2

Let’s start working on the Python code.

import pandas as pd

df = pd.read_excel('CommunityPartner.xls')

Import the “pandas” package. Shorten the package’s name as pd because who would want to repeat a long name many times in the code?

The following line is to read the Excel file into a data frame called df. You can name it however you want. Since the Excel file and the Python code are in the same folder, there is no need to have a directory. Otherwise, it’s necessary to have a full directory.

collection = {'type': 'FeatureCollection', 'features': []}

The next step is to create a shell dictionary. Refer back to the sample structure of a GEOJSON file above to see why I structure the collection variable like that.

df['description'] = df[['Address', 'City', 'State', 'Zip']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

Since we don’t have a full address, the above line is to combine four columns together to form a full address string. The next step is to populate the dictionary

def feature_from_row(CommunityPartner, latitude, longitude, fulladdress, Primary, Website, Phone):
    feature = {'type': 'Feature', 'properties': {'PartnerName': '', 'Address': '', 'marker-color': '',
                                                 'Website': '', 'PrimaryMission': '', 'Phone': ''},
               'geometry': {'type': 'Point', 'coordinates': []}
               }
    feature['geometry']['coordinates'] = [longitude, latitude]
    feature['properties']['PartnerName'] = CommunityPartner
    feature['properties']['Address'] = fulladdress
    feature['properties']['Website'] = Website
    feature['properties']['PrimaryMission'] = Primary
    feature['properties']['Phone'] = Phone
    if Primary == "Economic Sufficiency":
        feature['properties']['marker-color'] = "FF5733"
    elif Primary == "Social Justice":
        feature['properties']['marker-color'] = "FFF033"
    elif Primary == "Health and Wellness":
        feature['properties']['marker-color'] = "74FF33"
    elif Primary == "Environmental Stewardship":
        feature['properties']['marker-color'] = "338DFF"
    elif Primary == "Educational Support":
        feature['properties']['marker-color'] = "CE33FF"
    else:
        feature['properties']['marker-color'] = "FF3374"
    collection['features'].append(feature)
    return feature

Create a function that will undertake the data processing. Between the brackets are the input variables.

feature = {'type': 'Feature', 'properties': {'PartnerName': '', 'Address': '', 'marker-color': '',
                                             'Website': '', 'PrimaryMission': '', 'Phone': ''},
           'geometry': {'type': 'Point', 'coordinates': []}
           }

Create a “feature”variable as above. Try to mirror it in “type” and “geometry” agains the standard GEOJSON (see above) as much as possible. Leave the “coordinate” value as empty to fill in later. Under “properties”, list the keys you want.

feature['geometry']['coordinates'] = [longitude, latitude]
feature['properties']['PartnerName'] = CommunityPartner
feature['properties']['Address'] = fulladdress
feature['properties']['Website'] = Website
feature['properties']['PrimaryMission'] = Primary
feature['properties']['Phone'] = Phone

Time to populate the keys. Remember to key the names of the keys and input variables consistent with what was already posted so far.

You must wonder: what about “marker-color”. You can use the conditional argument to assign values to the variable as follows:

if Primary == "Economic Sufficiency":
    feature['properties']['marker-color'] = "FF5733"
elif Primary == "Social Justice":
    feature['properties']['marker-color'] = "FFF033"
elif Primary == "Health and Wellness":
    feature['properties']['marker-color'] = "74FF33"
elif Primary == "Environmental Stewardship":
    feature['properties']['marker-color'] = "338DFF"
elif Primary == "Educational Support":
    feature['properties']['marker-color'] = "CE33FF"
else:
    feature['properties']['marker-color'] = "FF3374"

If you wonder about the HTML color code, just Google “HTML Color Code” and you’ll see it.

collection['features'].append(feature)
return feature

The first line of the block above dictates that we add every single row of the Excel file to the “features” key of the collection variable. “Return” is a mandatory feature of every function.

geojson_series = df.apply(
    lambda x: feature_from_row(x['CommunityPartner'], x['Lat'], x['Longitude'], x['description'], x['Primary'],
                               x['Website'], x['Phone']),
    axis=1)

jsonstring = pd.io.json.dumps(collection)

The first line is to add every single row of the Excel file to the function so that we can create the string needed for the GEOJSON. The second line is to turn it into json file.

output_filename = 'CommunityPartner.geojson' 
with open(output_filename, 'w') as output_file:
    output_file.write(format(jsonstring))

Name the file however you want and use the second line to write it into GEOJSON. The file product will look like this:

import pandas as pd

df = pd.read_excel('CommunityPartner.xls') #Get the Excel file from static/Excel

collection = {'type': 'FeatureCollection', 'features': []}

df['description'] = df[['Address', 'City', 'State', 'Zip']].apply(lambda x: ' '.join(x.astype(str)), axis=1)


def feature_from_row(CommunityPartner, latitude, longitude, fulladdress, Primary, Website, Phone):
    feature = {'type': 'Feature', 'properties': {'PartnerName': '', 'Address': '', 'marker-color': '',
                                                 'Website': '', 'PrimaryMission': '', 'Phone': ''},
               'geometry': {'type': 'Point', 'coordinates': []}
               }
    feature['geometry']['coordinates'] = [longitude, latitude]
    feature['properties']['PartnerName'] = CommunityPartner
    feature['properties']['Address'] = fulladdress
    feature['properties']['Website'] = Website
    feature['properties']['PrimaryMission'] = Primary
    feature['properties']['Phone'] = Phone
    if Primary == "Economic Sufficiency":
        feature['properties']['marker-color'] = "FF5733"
    elif Primary == "Social Justice":
        feature['properties']['marker-color'] = "FFF033"
    elif Primary == "Health and Wellness":
        feature['properties']['marker-color'] = "74FF33"
    elif Primary == "Environmental Stewardship":
        feature['properties']['marker-color'] = "338DFF"
    elif Primary == "Educational Support":
        feature['properties']['marker-color'] = "CE33FF"
    else:
        feature['properties']['marker-color'] = "FF3374"
    collection['features'].append(feature)
    return feature


geojson_series = df.apply(
    lambda x: feature_from_row(x['CommunityPartner'], x['Lat'], x['Longitude'], x['description'], x['Primary'],
                               x['Website'], x['Phone']),
    axis=1)

jsonstring = pd.io.json.dumps(collection)

output_filename = 'CommunityPartner.geojson' #The file will be saved under static/GEOJSON
with open(output_filename, 'w') as output_file:
    output_file.write(format(jsonstring))

ExceltoGEOJSON_3

This is how the GEOJSON looks:

 
   “type”:“FeatureCollection”,
   “features”: 
       
         “type”:“Feature”,
         “properties”: 
            “PartnerName”:“75 North”,
            “Address”:“4383 Nicholas St Suite 24 Omaha NE 68131.0”,
            “marker-color”:“FF5733”,
            “Website”:null,
            “PrimaryMission”:“Economic Sufficiency”,
            “Phone”:“402-502-2770”
         },
         “geometry”: 
            “type”:“Point”,
            “coordinates”: 
               -95.957309,
               41.276479
            ]
         }
      },
       
         “type”:“Feature”,
         “properties”: 
            “PartnerName”:“A Time to Heal”,
            “Address”:“6001 Dodge St CEC 216 Suite 219C  Omaha NE 68182.0”,
            “marker-color”:“74FF33”,
            “Website”:null,
            “PrimaryMission”:“Health and Wellness”,
            “Phone”:“402-401-6083” …

One important note. If you are a fan of Jupyter Notebook, beware that there may be a problem when it comes to the last step of the process. Here is how the collection variable looks before being dumped into the GEOJSON file.

ExceltoGEOJSON_4

But I ran into errors in the last step. I spent quite some time on fixing it but I couldn’t.

ExceltoGEOJSON_5

Creating the Python code in Pycharm is much easier and produces the same result. It’s even more convenient if you are in the middle of an application development project.

Hope this post helps. Much thanks to appendto and geoffboeing for inspiration.

 

Some basic SQL & Plotly in Python

In this post, I’ll document how to use SQL and plotly to create a basic bar chart in Python, specifically in Jupyter Notebook. I’ll try to be as detailed and visual as possible.

The dataset I am going to use is Nike Manufacturing Map:

The first step is to click download the dataset in Excel. Once the file is downloaded, open the file and remove the first row. The first row just has the name of the file, not the columns’ names we need. Since I haven’t figured out how to remove the first row, the fast way is to do it before starting the code to make our life easier. Click save and the data is ready to go

Open Jupyter Notebook. In my case, I use Anaconda. The first thing to do on a new Jupyter Notebook is to import necessary packages. The “as ABC” is to create an alias for packages with long names so that we won’t have a hard time calling those packages later in the code. Imagine having to type “plotly.graph_objs” 10 times instead of “go”.

import plotly.plotly as py #this package is for visualization
import plotly.graph_objs as go #this package is for visualization
import pandas as pd #this package is for handling the dataframe
from pandasql import sqldf #This is to allow users to use SQL queries to create a new dataframe
pysqldf = lambda q: sqldf(q, globals())

Packages

Then, load the file. To check if the file is loaded correctly, use df.head(). The command will show a snapshot of the data frame. If you want to see the whole data frame, just use df

df =pd.read_excel(‘/Users/camapcon/Documents/Github/Nike_Factory_Locations/export.xls’) #read the file in
df.head()

Read the file in

We need to change the names of the two columns highlighted in the screenshot. The way that they are named will make it tricky to process SQL queries. I tried it. It’s better to have more straightforward column names with no blank space in between. It’s pretty easy:

df = df.rename(columns={‘Nike, Inc. Brand(s)’: ‘Brands’, ‘% Female Workers’:’FemaleWorkers’, ‘Total Workers’:’TotalWorkers’})
df.head()

Column name change

The next step is to use SQL to create a summary table whose data will be used to draw bar charts. What we will analyze includes:

  • The total number of workers by brands
  • The number of factories by brands
  • Average number of workers at each factory by brands
  • Average percentage of female workers at each factory by brands

The code to use SQL to create a subset is below:

Primary = “””SELECT Brands, Sum(TotalWorkers) as TotalWorkers, Count(Brands) as Count, AVG(TotalWorkers) as AvgTotalWorkers, AVG(FemaleWorkers) as FemaleWorkers From df Group By Brands””” #create a data frame using a SQL
Primary_df = pysqldf(Primary)
Primary_df

Here is the result:

SQL 1

Notice that the figures aren’t pretty, right? Let’s round them up and make them look prettier

Second = “””SELECT Brands, TotalWorkers, Count, Round(AvgTotalWorkers, 0) as AvgTotalWorkers, Round(FemaleWorkers,0) as FemaleWorkers From Primary_df”””
Primary_df = pysqldf(Second)
Primary_df

Here is how it looks:

SQL 2

It’s time to create the bar charts:

trace1 = go.Bar( #create x and y axes
x=Primary_df[‘Brands’], #to use column Brands as the X axis
y=Primary_df[‘TotalWorkers’], #to use column TotalWorkers as the Y axis
name=’Total Global Workers by Nike Brands’
)
data1 = [trace1]
layout1 = go.Layout( #starting to plot a grouped bar chart
title=’Total Global Workers by Nike Brands’,
barmode=’group’
)

fig = go.Figure(data=data1, layout=layout1)
py.iplot(fig, filename=’grouped-bar’)

Here is how it looks

newplot

You can change the Y axis to another column as you wish. Here is an example:

trace2 = go.Bar( #Project Count by Mission Area
x=Primary_df[‘Brands’],
y=Primary_df[‘FemaleWorkers’],
name=’Average Percentage of Female Workers by Nike Brands’
)
data2 = [trace2]
layout2 = go.Layout( #starting to plot a grouped bar chart
title=’Average Percentage of Female Workers by Nike Brands’,
barmode=’group’
)

fig = go.Figure(data=data2, layout=layout2)
py.iplot(fig, filename=’grouped-bar’)

Try to label different charts differently to avoid confusion and errors. It would make debugging easier as well.

newplot-2

That’s all I have for this post. Just some basic SQL queries and Python code to handle and visualize data with Plotly and Pandas. I still need to practice and explore more. And I urge you to do the same, if you are as inexperienced and interested in data analysis like I am.