How to create Rolling average in Power BI without data tailing off in the end

Today, I will talk about how to do rolling average in Power BI and how to address the tailing off in the end.

Rolling average is a technique used to address the short term volatility and fluctuation of data. In this example, I used the Apple Mobility Data for the state of New York. The data is from 1st March 2020 to 14th June 2020, which is the latest date when the data is available as of this writing. My dataset has only two columns: date and mobility. Here is what it looks like in Power BI

Apple Mobility for 11th and 12th May 2020 is not available, but you don’t see on the graph above because I smoothed it out by setting the date field to “continuous”.

To create a rolling average in Power BI, it’s actually quite simple. Click on “New Quick Measure” and choose “Rolling Average”, then you just need to fill in the details

Depending on what you are trying to do, you can set an appropriate period as well as the “periods before” and “periods after” fields. In this example, I am looking at 30-day rolling average. Therefore, I set it up that way.

Your data will look like this. There are a few things worth pointing out:

  • To make a new quick measure work, the time variable on the X-axis has to be a date hierarchy
  • That’s why you no longer see the whole graph on the screen. Instead, to see every data point, you now have to scroll horizontally
  • Even though the data stops on 14th June 2020, the graph doesn’t stop until 14th July 2020. It is because right now, you don’t tell Power BI when to stop projecting the data. The value on 14th July 2020 is exactly the value of 14th June 2020 and the upward trend is misleading because the more the graph moves to the right, the fewer data points there are.

If you look at the code for the New Quick Measure, here is what it looks like

To fix the tailing off issue, you just need to modify the code a little bit in the “Return” part, as follows:

The result will look like this

You can see the original data with its own short-term fluctuation in orange and the 30-day rolling average in the blue line. After the code is modified, the blue line now stops on 14th June 2020.

To enhance user experience by eliminating the need to roll horizontally, make sure that the date hierarchy also has “Year” as you can see below

Don’t you think the line chart looks smoother and better now?

Hope this little tutorial helps

Common SQL date functions in Visual Studio

I use Visual Studio as a code editor a lot at work to connect with our data warehouse. It has its own downsides, but I have enjoyed it so far and in some aspects, it is better than SAS. One of the reasons why I prefer Visual Studio is that it has far simpler date formats than SAS. Due to a much smaller number of date formats, Visual Studio poses a lesser challenge in handling date functions compared to SAS. Today, I’ll discuss a few common SQL date functions in Visual Studio. This entry is not for those who are really experienced, but for someone who are like me a while ago and just started to code in Visual Studio.

Declare, set and print a date

These functions are pretty straightforward. There is no need to explain further. As you can see, “date” as a format is way simpler than SAS’ formats such as date9. or ddmmyy10. The print function helps to double check whether you declare and set the right format. You can either hard-code a value or use built-in functions like getdate()

Add/subtract days from a specific date

What if you want to go forward or backwards from a specific date? Dateadd function is your friend

Get to the end or beginning of a month

What if you want to get the end or beginning of a month? To get month end of a date, use EOMONTH. You can also get the end of a month in the future or in the past. See below for examples

Let’s say today is 14th June 2020, if you want to get the first month of June and May, you can follow the below

Convert variables between integers and dates

Sometimes, a date attribute is in integer format. How to convert back and forth?

Convert a date into formats like 2020Q2 or 2020-06

Continue to build on today as of 14th Jun 2020, what if you want to convert that to a variable looking like this: 2020Q2 or 2020-06?

I hope those examples are helpful. They may look easy, but there is one thing I have learned from coding and blogging for more than 2 years: there is always someone out there who needs to see specific examples in addition to instructions on what functions to use. I wrote about how to create a map with Mapbox and how to create a hover effect on Mapbox. It’s nothing sophisticated, but it has generated quite a bit of traffic for me even after 2 years.

Good practices in coding

Like many things in our society, there is also recommended etiquette in coding. There are two practices, in particular, that I find important and useful.

First, it’s beneficial to painstakingly document your code. At the beginning of any program, jog down some lines on what the program is about. Then, before any function, write something about it. If you give aliases to variables or tables that have long names, put down some notes as well. If there is any logic behind the code, make it visible to others too. Often times, folks may understand the mechanics of the code, but don’t understand what the code actually does since they don’t understand the logic.

Below is an excerpt from a document in one of my first coding classes. In our assignments, if we forgot to document our code, we would have 5-10% of our grade taken away.

As highlighted in the screenshot, a detailed documentation is very helpful to not only others looking at your code, but also yourself later on. If a program is complex and there is no documentation, you’ll find it more difficult than it should be to refresh your memory on the code. I have been there and I don’t even write complex code!

Above is an example I had from my programming class. In practice, it doesn’t need to be that detailed, but the description section and the date are necessary in my opinion.

The second practice that I think is useful is to format the code. Normally, we tend to get carried away while coding and neglect how the whole program actually looks. Lines are not aligned. Blocks of code are nested and difficult to read. Brackets are all over the place, making it challenging to debug and understand the code. What I usually do is that after I am sure my program works as expected, I search for a website to help with the formatting of code (it’s easy, just google, for instance, HTML formatter) and have the website re-format the code so that it’s easier to digest.

Easy guide on how to push code to your GitHub Master branch

This post is a simple guide on how to push your code to GitHub from a command line. I am using a Mac, so it will be a little different for Windows users.

Let’s say if I have a folder called MinhDuong in this directory Documents/GitHub/MinhDuong

Step 1: on your command line, go to the same directory. In my case, it will look like this on my command line

Step 2: set up the remote URL

Basically, you want to make sure you will push the code to the right place. If you have a new repository, run this code:

git remote set-url origin "your repo URL goes here" (without the quotation marks)

If you are updating an existing repo, run “git remote -v” to figure out which repo you’re currently connected to. Here is how it looks on mine, exactly where I want it to be

Step 3: run “git status” to see if there are pending files to be pushed. If you’re pushing to a new repo, this shouldn’t matter much. However, if you are updating an existing repo, this will show the difference between the current folder on your local and the repo.

Step 4: now is the time to add files. If you want to add all files, run “git add .”. Remember the space and the dot after the word “add”.

Step 5: run

git commit -m "whatever comment you want"

Step 6: run

git push origin master

Then you are done.

Python: Most frequent words in a string

This post is my practice on getting the most frequent words in a string in Python. Here is the code

#import the necessary packages
from collections import Counter
import pandas as pd
#open the text file. Here in this case, it is named "text"
with open('text.txt') as fin:
    counter = Counter(fin.read().strip().split())

numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]

counter.most_common() is the function to get all the words and their respective count from the string. If you put a number, let’s say 10, in the brackets, it means that you want to get only the first 10 elements of the array. Here is how counter.most_common(10) looks:

[(‘to’, 27), (‘the’, 26), (‘of’, 20), (‘and’, 19), (‘in’, 16), (‘a’, 15), (‘is’, 12), (‘for’, 11), (‘it’, 9), (‘new’, 9)]

numbers = sorted(counter.most_common(), key=lambda student: student[1], reverse=True)
top15 = numbers[0:15]

The above code is to get all the words and sort them in the descending order according to the words’ frequency. top15 is to get the first 15 elements of the sorted array. Here is how the top15 looks:

[(‘to’, 27), (‘the’, 26), (‘of’, 20), (‘and’, 19), (‘in’, 16), (‘a’, 15), (‘is’, 12), (‘for’, 11), (‘it’, 9), (‘new’, 9), (‘are’, 9), (‘their’, 8), (‘video’, 7), (‘has’, 7), (‘by’, 7)]

After we get the top 15, we should put them into a data frame so that data processing can be easier. Here is how

text = [] #an array for the words
number = [] #an array for the frequency
for i in top15: #iterate through the top 15
   text.append(i[0])
   number.append(i[1])

#create the data frame
rawdata = {'words': text, 'frequency': number}
df = pd.DataFrame(rawdata, columns = ['words', 'frequency'])

This is the final data frame

    words  frequency
0      to         27
1     the         26
2      of         20
3     and         19
4      in         16
5       a         15
6      is         12
7     for         11
8      it          9
9     new          9
10    are          9
11  their          8
12  video          7
13    has          7
14     by          7

Data Analytics: Klay Thompson’s Performance

This is my data analytics practice by analyzing Klay Thompson’s performance so far in the 2018-2019 season up to 22nd Dec 2018. Klay Thompson is the shooting guard of Golden State Warriors. He is a three time world champion and I am a big fan of his playing style and deadly explosiveness. This post features my findings by analyzing his shot data this season from NBA website here. My code is available on my personal GitHub for your reference.

Findings

  • Klay made about 44% of his shots so far
  • Klay’s successful shots’ average distance to the basket is 15.92m
  • He made more shots in the first half than he did in the second half
  • 67% of Klay’s made shots are two pointers. The rest are three pointers
  • Living up to his name, Klay’s favorite play type is “catch and shoot jump shot”
  • Regarding Klay’s made two-pointers, below is the distribution by distance. He seems to be more effective within 10 feet of the basket and from 15 to 20 feet.
  • In regards to Klay’s three pointers, the distribution by distance to the basket is as follows: (no surprise that the farther he is from the basket, the less lethal he is)

  • As one of the best three point shooters in the league, Klay seems to be equally good throughout the periods of a game, except for the first quarter

Technical lessons I learned from this practice:Pie chart in Python with Matplot

Pie chart in Python

Let’s say you have two variables: TwoPT and ThreePT that stand for the shooting percentage of Klay’s two and three pointers respectively. Here is the code to draw a pie chart

labels = '2PT Field Goal', '3PT Field Goal'
sizes = [TwoPT, ThreePT]
colors = ['green', 'gold']
explode = (0, 0)  # explode 1st slice
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
 
plt.axis('equal')
plt.title("Klay's made shots by shot types")
plt.show()

Nunique function

Imagine if you have a data frame as the following

If you want to count how many events (whether it’s a missed or made shot) by Klay by period, instead of using SQL, the alternative is to use Nunique function. An advantage of using the aggregate is that the outcome is automatically a data frame. The code is as follows:

periodstats = madeshot.groupby(by='period', as_index=False).agg({"game_date": pd.Series.nunique, 'time_remaining': pd.Series.nunique}) #the data frame's name is madeshot. Pd is the abbreviation of Pandas

The result is:

Sort and get the top 10 of a data frame

If your data frame looks like the one below and your intention is to get the top 10 records in terms of “times”, what will you do?


The code I used is pretty straightforward and simple. (The data frame’s name is shotdistance

shotdistance = shotdistance.sort_values(by='times', ascending=False)
shotdistance_top10 = shotdistance.head(10)

Categorize a data frame by bins

If you want to categorize Klay’s shot by distance in terms of “less than 10 feet”, “from 10 to 15 feet” and “from 15 to 20 feet”, for instance, what will you do? The code to turn the distance to categories is:

df1 = pd.cut(TwoPTtype['shot_distance'], bins=[0, 10, 15, 20, 23], include_lowest=True, labels=['Less than 10 feet', 'From 10 to 15 feet', 'From 15 to 20 feet', 'From 20 to 23 feet'])

#pd stands for Pandas
#TwoPTtype is the name of the data frame in question

The result is:

If you merge that data frame with the frequencies in the original data frame:

df1 = pd.cut(TwoPTtype['shot_distance'], bins=[0, 10, 15, 20, 23], include_lowest=True, labels=['Less than 10 feet', 'From 10 to 15 feet', 'From 15 to 20 feet', 'From 20 to 23 feet'])

newdf = pd.concat([df1, TwoPTtype['times']], axis=1)

Checking if a point is in a polygon in Python and Javascript

This post will detail how to determine if a point with a longitude and a latitude resides within a polygon – an area comprised of multiple coordinates. There are two ways to accomplish the task, either in Python or in Javascript. 

In Javacript – Turf package

In order to accomplish the task in Javacript, I used turf package. The first thing to do is to add this line to the HTML where the task takes place

<script src='https://npmcdn.com/@turf/turf/turf.min.js'></script>

Then, this is how the task is completed:

point = turf.point([longitude,latitude]); #the point in question

polygon = turf.polygon(#the polygon in question); #the polygon should come in the form of an array of points

turf.booleanPointInPolygon(point,poly) #return either True or False 

In Python

First, install the following:

from shapely.geometry import shape, Point

Then, this is how the task is accomplished:

coord = Point([longitude, latitude]) #this is the point in question

polygon = shape(property['geometry']) #the geometry part of a typical GEOJSON

polygon.contains(coord) #should return either True or False

Hope it is helpful to whoever is looking for an answer to this issue