Stats on retail store size & revenue per square foot

I have been doing some industry research for work, specifically on the retail industry. One trend that CBInsights mentioned in their report was that retail stores are shrinking in size. CBInsights argued that retailers want to be more conscious of how they make use of their retail space. The competition is so fierce that retailers cannot afford to do everything, be everything and sell everything. They tend to get more nimble in operations and more conscious of what they have on display. However, the latest year in CBInsights’ data was 2015, so I went through the financial reports of several retailers to find out whether retail stores are actually shrinking in size. Before I go through the findings, below are a few important notes:

  • The list of retailers comes from this article by WSJ. Several retailers were left out because of the way they structure their data, which made it time-consuming to retrieve and complicated to explain. Hence, I decided to omit those retailers
  • Retail is a complex industry. The data is for reference only and may represent the players or trends in the industry only to some extent. By no means do I believe that the data represents the retail industry 100%
  • Data from 2015 to 2017 came from the chosen retailers’ annual reports. Data for 2018 came from the latest quarterly reports. Only Walmart has already filed its 2018 annual report
  • Apart from Walmart and Sam’s Club, I didn’t retrieve revenue data for any other company. It doesn’t make sense to analyze revenue per square foot with only 3 quarters’ data recorded
  • Data is for the retailers’ US segment only
  • Revenue by Sam’s Club excludes fuel revenue

Number of stores

Among the surveyed companies, only Best Buy, JCPenney and Sam’s Club lowered their store count

Average Store Size

Best Buy, JCPenney and Sam’s Club increased their average store size. The rest decreased theirs, except Dick’s, which has kept its average store size more or less the same for the past 4 years

Revenue per square foot

Only Target saw its revenue per square foot decrease in 2017, compared with 2016 and 2015. As the chart shows, 2018 looks to be a good year for Walmart: both Walmart (the brand) and Sam’s Club increased their revenue per square foot, especially the latter.
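For context, the metric is simply revenue divided by selling space. Here is a minimal pandas sketch of the calculation; the figures below are made up for illustration and are not the actual numbers from the filings:

```python
import pandas as pd

# Illustrative figures only -- not the actual numbers from the filings
stores = pd.DataFrame({
    "retailer": ["Walmart US", "Sam's Club", "Target"],
    "revenue_musd": [318477, 57839, 71879],        # annual revenue, in $M
    "selling_space_msqft": [704.3, 80.2, 240.6],   # selling space, in M sq ft
})

# Revenue per square foot = revenue / selling space
stores["rev_per_sqft"] = stores["revenue_musd"] / stores["selling_space_msqft"]
print(stores)
```

Because both columns are in millions, the units cancel and the result comes out directly in dollars per square foot.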


The majority of the surveyed companies reduced their total retail space but managed to make the most of their selling space. This is in line with what CBInsights observed (as I noted above).

The data I collected is available here on my Tableau profile

Data Analytics: Klay Thompson’s Performance

This is a data analytics practice piece, analyzing Klay Thompson’s performance in the 2018-2019 season so far, up to 22nd Dec 2018. Klay Thompson is the shooting guard of the Golden State Warriors. He is a three-time world champion, and I am a big fan of his playing style and deadly explosiveness. This post features my findings from analyzing his shot data this season, taken from the NBA website here. My code is available on my personal GitHub for your reference.


  • Klay made about 44% of his shots so far
  • Klay’s made shots average 15.92 feet from the basket
  • He made more shots in the first half than in the second half
  • 67% of Klay’s made shots are two-pointers; the rest are three-pointers
  • Living up to his name, Klay’s favorite play type is the “catch and shoot jump shot”
  • For Klay’s made two-pointers, below is the distribution by distance. He seems to be most effective within 10 feet of the basket and from 15 to 20 feet.
  • As for Klay’s three-pointers, the distribution by distance to the basket is as follows (no surprise that the farther he is from the basket, the less lethal he is):

  • As one of the best three point shooters in the league, Klay seems to be equally good throughout the periods of a game, except for the first quarter
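Most of the bullet points above boil down to simple pandas aggregations on the raw shot log. Here is a sketch of the kind of computation involved, on a small made-up shot log (the column names are my own for illustration, not necessarily those in the NBA data):

```python
import pandas as pd

# Made-up shot log for illustration; the real one comes from the NBA stats site
shots = pd.DataFrame({
    "shot_made_flag": [1, 0, 1, 1, 0, 1, 0, 1],
    "shot_type":      ["2PT", "3PT", "2PT", "3PT", "2PT", "2PT", "3PT", "2PT"],
    "shot_distance":  [5, 26, 18, 24, 12, 8, 27, 16],  # feet
})

# Overall field-goal percentage
fg_pct = shots["shot_made_flag"].mean()

# Average distance of made shots only
made = shots[shots["shot_made_flag"] == 1]
avg_made_distance = made["shot_distance"].mean()

# Share of made shots that are two-pointers
two_pt_share = (made["shot_type"] == "2PT").mean()

print(fg_pct, avg_made_distance, two_pt_share)
```

The same pattern (filter, then `mean()` or `groupby`) covers the field-goal percentage, average make distance and two-point share above.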

Technical lessons I learned from this practice:

Pie chart in Python with Matplotlib

Let’s say you have two variables, TwoPT and ThreePT, that stand for the shooting percentages of Klay’s two- and three-pointers respectively. Here is the code to draw a pie chart

import matplotlib.pyplot as plt

labels = '2PT Field Goal', '3PT Field Goal'
sizes = [TwoPT, ThreePT]
colors = ['green', 'gold']
explode = (0, 0)  # no slice pulled out; use e.g. (0.1, 0) to explode the 1st slice
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.title("Klay's made shots by shot types")
plt.show()

Nunique function

Imagine you have a data frame like the following

If you want to count how many distinct events (whether missed or made shots) Klay had per period, instead of using SQL, the alternative is to use the nunique function. An advantage of using the aggregate is that the outcome is automatically a data frame. The code is as follows:

# madeshot is the name of the data frame; pd is the abbreviation of pandas
periodstats = madeshot.groupby(by='period', as_index=False).agg(
    {'game_date': pd.Series.nunique, 'time_remaining': pd.Series.nunique})

The result is:

Sort and get the top 10 of a data frame

If your data frame looks like the one below and your intention is to get the top 10 records in terms of “times”, what will you do?

The code I used is pretty straightforward and simple. (The data frame’s name is shotdistance.)

shotdistance = shotdistance.sort_values(by='times', ascending=False)
shotdistance_top10 = shotdistance.head(10)

Categorize a data frame by bins

If you want to categorize Klay’s shots by distance, for instance into “less than 10 feet”, “from 10 to 15 feet” and “from 15 to 20 feet”, what will you do? The code to turn the distances into categories is:

# pd stands for pandas; TwoPTtype is the name of the data frame in question
df1 = pd.cut(TwoPTtype['shot_distance'],
             bins=[0, 10, 15, 20, 23], include_lowest=True,
             labels=['Less than 10 feet', 'From 10 to 15 feet',
                     'From 15 to 20 feet', 'From 20 to 23 feet'])

The result is:

If you then want to pair those categories with the frequencies in the original data frame, concatenate the categorized series with the “times” column:

newdf = pd.concat([df1, TwoPTtype['times']], axis=1)

Data Visualization: Hotel Statistics from Singapore Tourism Board

The data from this Story is from 2014 to October 2018 and taken from Singapore Tourism Board website. You can find my Tableau dashboard here.

Cumulative figures for available room nights, booked room nights, revenue, average room rates and RevPar are on a monthly and yearly basis.

Drilling down further, hotel statistics are categorized into four hotel tiers: luxury, upscale, mid-tier and economy. Since segment data is available only for RevPar, average room rates and occupancy rate, it’s not possible to compute cumulative figures for the segments.

Basic terminologies:

Available room nights refer to the total possible number of room nights in a period of time.

Booked room nights refer to the total number of room nights booked by guests in a period of time.

Average room rates refer to the average revenue per booked room night in a period of time. They are calculated as revenue divided by the number of booked room nights

Occupancy rates refer to the percentage of booked room nights out of the available room nights. They are calculated as booked room nights divided by available room nights

Revenue per Available Room, or RevPar, refers to the revenue per available room night. It’s one of the most important metrics in the industry, showing how well a property fills its room inventory and how much it can charge per room. It’s calculated as Occupancy Rate multiplied by Average Room Rate over the same period of time.
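To make the relationships between these terms concrete, here is a quick numerical check; the figures are made up for illustration:

```python
# Made-up monthly figures for a single hotel segment
available_room_nights = 30_000
booked_room_nights = 25_500
revenue = 5_100_000  # SGD

occupancy_rate = booked_room_nights / available_room_nights  # booked / available
average_room_rate = revenue / booked_room_nights             # per booked room night
rev_par = occupancy_rate * average_room_rate                 # per available room night

# Equivalently, RevPar is revenue divided by available room nights
print(occupancy_rate, average_room_rate, rev_par)
print(revenue / available_room_nights)
```

The last line shows why RevPar blends both demand (occupancy) and pricing (room rate) into one number.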


Yearly Statistics

From 2015 to October 2018, the Singaporean tourism industry has averaged more than 80% in occupancy rate, which is pretty positive

The industry also seems to be expanding, with increasingly bigger yearly available room nights (growing by almost 3 million room nights from 2014 to 2017) & booked room nights

While the number of rooms and the occupancy rate get bigger year after year, the average room rate and RevPar go in the opposite direction

The rise in revenue signals that the increase in available & booked room nights outweighs the drop in average room rates and RevPar

Data for 2018 doesn’t include figures for November and December. Given past performance, it’s difficult to predict how the average room rates and RevPar will fare. It seems, however, that the average room rate will go up in December, compared to the rate in November.

Also, December seems to be a slow month for Singapore’s tourism, as the historical figures below show. On the other hand, July and August are the busiest months of the year for Singapore.

Regression Model

The yearly regression models show that there seems to be a correlation between occupancy rate and RevPar. Every year’s model comes with a p-value below 0.05 and an R-squared between 58% and 77%
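The regressions themselves are straightforward to reproduce. Here is a sketch with scipy, using invented monthly data points (occupancy rate vs RevPar; the numbers are for illustration only, not the actual Singapore figures):

```python
from scipy import stats

# Invented monthly data points for illustration: occupancy rate (%) and RevPar (SGD)
occupancy = [78, 81, 83, 85, 86, 88, 90, 84, 80, 87, 89, 82]
revpar = [150, 158, 165, 171, 175, 182, 190, 168, 155, 178, 186, 161]

# Simple linear regression of RevPar on occupancy rate
result = stats.linregress(occupancy, revpar)
print(f"slope={result.slope:.2f}, p-value={result.pvalue:.4f}, R^2={result.rvalue ** 2:.2f}")
```

A p-value below 0.05 with a decent R-squared, as in the yearly models above, suggests occupancy rate is a meaningful predictor of RevPar.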


Regarding the seasonality of Singapore’s tourism industry, occupancy rate takes a sharp dive every September, compared with July and August of the same year. It’s an interesting phenomenon, given that the Singapore Formula 1 Grand Prix takes place every September. Surprisingly, the luxury and upscale segments tended to fare better than the mid-tier and economy segments from 2014 to 2016 and in 2018. My theory is that September may be a traditionally slow month for Singapore and, as a result, sees a low occupancy rate. However, hotels either push prices higher on their own or bundle room rates with Formula 1 weekend tickets. The average room rate may therefore be driven up despite a low occupancy rate.

In 2014

In 2015

In 2016

In 2017

In 2018

It’s very interesting, since when the demand is not there, businesses don’t usually drive up prices; as a matter of fact, prices are usually lowered to stimulate demand. My theory for this phenomenon, regarding the Formula 1 Grand Prix in Singapore, is cited above.

Luxury Segment

Upscale Segment

Mid-Tier Segment

Economy Segment

Data Visualization: Artificial Intelligence Landscape in 2018

This is a little personal project, using Tableau to visualize the data from Artificial Intelligence Landscape in 2018 by Asgard. All copyrights related to the data are owned by Asgard. I just applied some simple analytics and visualization.

I tried to embed the code from Tableau, but I’d have to upgrade my WordPress plan to Business in order to do so. A bit of a bummer. Nonetheless, here are a few insights

Above is a diagram of the top 10 industries by the number of AI companies. Communications, Sales & Marketing and Healthcare make up the top 3.

On a global level, the US dwarfs every other country in terms of the number of AI companies. It has more AI companies than the next 6 countries combined.

Above are the top 10 cities by the number of AI companies. San Francisco, London and New York make up the top 3. The numbers by city on my dashboard differ from those presented by Asgard itself. The difference, I suspect, comes from the way cities are grouped: I didn’t do any grouping, and my visualizations come straight from the data, while Asgard must have done some grouping of its own. Nonetheless, the figures by country show no difference.

On my dashboard, users can click any country on the map to filter out the list of individual companies within that country. Furthermore, users can see the top 10 industries by country as well.

My Tableau dashboard can be found here.

Checking if a point is in a polygon in Python and Javascript

This post details how to determine whether a point with a longitude and a latitude resides within a polygon, an area defined by multiple coordinates. There are two ways to accomplish the task: in Python or in JavaScript.

In JavaScript – Turf package

In order to accomplish the task in JavaScript, I used the turf package. The first thing to do is to add this line to the HTML where the task takes place

<script src=''></script>

Then, this is how the task is completed:

point = turf.point([longitude, latitude]); // the point in question

polygon = turf.polygon(/* the polygon in question */); // the polygon should come in the form of an array of rings, each an array of points

turf.booleanPointInPolygon(point, polygon) // returns either true or false

In Python

First, install Shapely (pip install shapely) and import the following:

from shapely.geometry import shape, Point

Then, this is how the task is accomplished:

coord = Point([longitude, latitude]) #this is the point in question

polygon = shape(property['geometry']) #the geometry part of a typical GeoJSON feature (here the variable property holds one feature)

polygon.contains(coord) #should return either True or False
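For reference, the classic ray-casting point-in-polygon test behind such functions can be sketched in plain Python with no dependencies (a minimal version; real libraries also handle edge cases like points exactly on a boundary):

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from
    (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y-level,
        # and does the crossing lie to the right of the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

square = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(point_in_polygon(0.5, 0.5, square))  # True
print(point_in_polygon(2.0, 2.0, square))  # False
```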

Hope it is helpful to whoever is looking for an answer to this issue 

Mapbox – Two ways to show a map

For the past few months, I have been charged with visualizing data onto maps using either Google Maps or Mapbox. I chose the latter. After days and nights of struggle, I am pretty close to the finishing line and have gained quite a bit of experience in Mapbox that I want to share. 

Long story short, to map data onto maps, you need to organize your data in a specific format called GeoJSON. It looks like this:

You can put anything in the “properties” key, but the rest essentially has to follow the above format. The coordinates are used to locate markers or data on the map.

Let’s say the data that I want to map has 6 different mission areas (see the screenshot above) and my job is to map them onto the map in 6 different colors. 

I have an array that contains the 6 different mission areas, and another with 6 different colors representing those areas

var missionarea = [area1, area2, area3, area4, area5, area6]
var colorcode = [color1, color2, color3, color4, color5, color6]

One layer approach

// create a map canvas

var map = new mapboxgl.Map({
    container: 'map',
    style: 'mapbox://styles/mapbox/light-v9',
    center: [-96.797885, 39.363438],
    // initial zoom
    zoom: 3.3
});

// load data and create the layer

map.on("load", function() {

    map.addSource('communityData', {   // name the source communityData
        type: 'geojson',
        data: communityData, // this is the name of your GeoJSON
    });

    map.addLayer({      // add the layer
        "id": "commMap", // your single layer's name
        "type": "circle", // meaning that each item will be a dot
        "source": "communityData", // the name of the source assigned above
        "layout": {},
        "paint": {
            "circle-radius": 8,
            "circle-opacity": 1,
            "circle-color": {
                "property": "Mission Area",
                "type": "categorical",
                "stops": [ // assign colors to the respective mission areas
                    [missionarea[0], colorcode[0]],
                    [missionarea[1], colorcode[1]],
                    [missionarea[2], colorcode[2]],
                    [missionarea[3], colorcode[3]],
                    [missionarea[4], colorcode[4]],
                    [missionarea[5], colorcode[5]]
                ]
            }
        }
    });
});

Multiple Layers

In this example, I’ll label each map layer as “show0”, “show1″…”show5”

var showlist = [] // the array of layer IDs, used to categorize layers
var base = "show"
for (var i = 0; i < missionarea.length; i++) { // populate the showlist array
    var text = base + i
    showlist.push(text)
}

var map = new mapboxgl.Map({
    container: 'map',
    style: 'mapbox://styles/mapbox/light-v9',
    center: [-95.957309, 41.276479],
    // initial zoom
    zoom: 6
});

map.on("load", function() {

    map.addSource('communityData', {
        type: 'geojson',
        data: communityData,
    });

    //*********************************** Load partners *****************************************************

    communityData.features.forEach(function(feature) {
        var primary = feature.properties["Mission Area"];
        // iterate through the mission area array
        for (var i = 0; i < missionarea.length; i++) {
            if (primary == missionarea[i]) {
                // assign a name to each layer
                var layerID = showlist[i];
                if (!map.getLayer(layerID)) {
                    map.addLayer({
                        "id": layerID, // layer's name
                        "type": "circle",
                        "source": "communityData",
                        "paint": {
                            "circle-radius": 8,
                            "circle-opacity": 1,
                            "circle-color": colorcode[i] // color
                        },
                        "filter": ["all", ["==", "Mission Area", primary]]
                    });
                }
            }
        }
    });
});

Which approach is better?

If the intention is just to show the dots, there is no difference and it comes down to personal preference. However, if your code gets more complicated (in my case, I had to create at least 6 filters on the map), things can get messy, and one of the approaches may no longer let you do what you want. Unfortunately, I don’t have enough experience yet to tell you more; I personally believe it’s a case-by-case thing.

Geocoding & reverse-geocoding with Google Map

Since I don’t think Google’s documentation is very clear on how to actually implement the Geocoding API, especially for a newbie like me, here are the steps to geocode an address or reverse-geocode a set of coordinates and get the key you want


Here is a short piece of code in Python

import googlemaps
gmaps = googlemaps.Client(key='<enter your Google API key here>')

result = gmaps.reverse_geocode((40.714224, -73.961452))

If you run that code, what you get back is a very long JSON. It is important to use an online JSON viewer to get a good understanding of your code’s result. Here is how it looks

full result

Go through each numbered element to see which keys interest you. Say you are interested in element [0] of the JSON. Here is the code to get it:

import googlemaps
gmaps = googlemaps.Client(key='<enter your Google API key here>')

result = gmaps.reverse_geocode((40.714224, -73.961452))
result[0]

Here is how it looks

partial result

Continue to parse through the JSON to get the key you want, such as result[0]['geometry']…
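Since the response is just a list of nested dicts, once you know the path you can drill down with plain indexing. Here is a sketch on a trimmed-down stand-in for the real response (the values are hard-coded for illustration):

```python
# A trimmed-down stand-in for an actual reverse_geocode() response
result = [
    {
        "formatted_address": "277 Bedford Ave, Brooklyn, NY 11211, USA",
        "geometry": {
            "location": {"lat": 40.714224, "lng": -73.961452},
        },
    }
]

# Drill down step by step, the same way you would with the real response
location = result[0]["geometry"]["location"]
print(result[0]["formatted_address"])
print(location["lat"], location["lng"])
```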


To geocode an address (the forward direction), use gmaps.geocode:

import googlemaps
gmaps = googlemaps.Client(key='<enter your Google API key here>')

result = gmaps.geocode('1600 Amphitheatre Parkway, Mountain View, CA')

The rest is similar to reverse-geocoding

Concatenate strings to form a full address

If you have the address, city, state, zip code and country in different columns, one quick way to form a full address is

df['fulladdress'] = df[['Address', 'City', 'State', 'Zip']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

df is the current data frame, and df['fulladdress'] creates a new column called “fulladdress”. The whole line joins the values on the same row of the address, city, state and zip columns to form the full address.
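Here is a self-contained version of the same idea on a toy data frame (the address is just an example):

```python
import pandas as pd

df = pd.DataFrame({
    "Address": ["1600 Amphitheatre Parkway"],
    "City": ["Mountain View"],
    "State": ["CA"],
    "Zip": [94043],
})

# Cast every column to string, then join each row's values with spaces
df["fulladdress"] = df[["Address", "City", "State", "Zip"]].apply(
    lambda x: " ".join(x.astype(str)), axis=1
)
print(df["fulladdress"].iloc[0])
# → 1600 Amphitheatre Parkway Mountain View CA 94043
```

The astype(str) cast matters because numeric columns like the zip code can’t be joined as-is.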

Hope it helps