Challenges to Data Analysts – What I learned from my job

This job is the first time I have worked with large amounts of data and predictive methods to estimate the effectiveness of future marketing efforts. Below are a few challenges that I think data analysts may face:

Documentation and Universal Truth

Data and analyses are only helpful when they are accepted by everyone, or at least almost everyone. Different departments need to be on the same page about definitions and calculation methods. Finance can’t have a different understanding of profit and revenue than Marketing does.

Also, the more data there is, the more important extensive, meticulous documentation becomes. Everybody in an organization needs to know what each schema means, what each table means, what each column means, what each value in a column means, and how the tables are connected to one another. Without careful, detailed documentation and that shared understanding, an organization will run into a lot of waste and inefficiency in its operations.

Appropriate Tools

A large amount of data requires a powerful machine to retrieve it from a data warehouse. Underpowered equipment, such as a dated computer with 4GB of RAM, means hours of lost time and inefficiency. Data warehouses also have to be reliable: an offline data warehouse renders data analysts almost useless.

After data is retrieved, the next step is to clean, process and present it. Tools such as Tableau are awesome, but security concerns from compliance or IT can be a hindrance to adopting them. Plus, applications such as Tableau are expensive. If only a few individuals in an organization have access, the usefulness will be limited and the investment will not be as fruitful as it could be.

Hence, Excel is still a popular choice for data analysis. However, when it comes to processing a large dataset or using pivot tables with many slices and filters, Excel is notorious for crashing more often than you can stomach. Furthermore, Excel isn’t a great visualization tool in my opinion. Presenting data to management or leadership usually demands sleek visuals to aid understanding, and easy preparation of those visuals to save time. Unfortunately, Excel isn’t strong at either.

Connecting data to an outsourced application

Not every application that is useful to your job is internal. Sometimes an external application is necessary. For instance, you may want a predictive analytics tool that can’t be built in-house. For the tool to work, you need to feed it fresh data regularly, since predictions stem from historical data. However, getting data outside an organization’s walls can be a daunting task because of compliance and security concerns. Plus, ensuring that the data sent to the external application is correct and updated regularly is a time-consuming challenge. Data needs to be verified carefully, and files may have to be sent manually, on time, and in the format the vendor requests. Automating the data feed is ideal, but it involves some programming and collaboration with IT and compliance.
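
To make the verification step concrete, here is a minimal sketch of what a pre-send check might look like in Python with pandas. The column names, validation rules and file name are all invented for illustration; a real feed would use whatever schema the vendor requests.

```python
import pandas as pd

# Hypothetical sketch: validate a daily extract before it leaves the building.
# Column names, rules and the output file name are made up for illustration.
def validate_and_export(df, out_path):
    # Refuse to export obviously bad data
    if df["customer_id"].isna().any():
        raise ValueError("missing customer IDs")
    if (df["revenue"] < 0).any():
        raise ValueError("negative revenue values")
    # Write in the (assumed) vendor-requested format: plain CSV, no index
    df.to_csv(out_path, index=False)
    return out_path

# Toy run with made-up data
demo = pd.DataFrame({"customer_id": [1, 2], "revenue": [10.0, 0.0]})
exported = validate_and_export(demo, "vendor_feed.csv")
```

A script like this could then be scheduled to run daily, which is the "automation" piece that needs IT and compliance buy-in.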

Working at my job so far has been an eye-opener, especially in terms of working with huge datasets. A high schooler shadowed me on the job a few days ago, and I explained to her what my job entails and what we do every day. I hope this post sheds some light on the challenges data analysts face. There are likely more, but I think these three are common and among the biggest.

Weekly readings 25th May 2019

Inside Google’s Civil War. An interesting story on the internal rift between employees and management over controversial projects.

What makes ramen noodles so special? As a fanatical fan of Japanese food, I found this a very interesting read on one of the better-known Japanese dishes.

Lower pay and higher costs: The downside of Lyft’s car rental program. The ugly truth about the ride-sharing business.

Skift Analysis: Amazon’s Travel Strategy Comes Into Focus

Shark Tank deep dive: A data analysis of all 10 seasons. I am not a huge fan of the show, but it’s cool to look at it from a data perspective.

Carmageddon Sinks Tesla’s Bonds. I have been pretty bearish on Tesla and this article doesn’t do much to change my opinion.

An incredible story about a woman who was brave enough to go out on her own terms.

How Data (and Some Breathtaking Soccer) Brought Liverpool to the Cusp of Glory. A fascinating read on how Liverpool used data to enable performance on the pitch.

The Legal Argument That Could Destroy Uber. A really interesting read on what can be a serious legal threat to Uber.

Recommendation to Vietnamese Tourism Board: Make data accessible and easy to process

I love my country, and I want to promote it as much and as honestly as I can. Given my past experience in the hospitality industry, I am drawn to reading and writing about it. I really want to do some analysis on arrivals to Vietnam, but the government body responsible for recording the data makes it annoyingly challenging to work with.

Problem 1 – No Excel files

First of all, there is no feature on the website to download the data as an Excel file. You have to copy the data and put it into an Excel file yourself. The Singapore Tourism Board, by contrast, makes it super easy to store data on an annual basis, as you can see below.

Source: Singapore Tourism Board
Each file stores data by month
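
One workaround worth mentioning: pandas can parse HTML tables directly with `read_html`, which saves the copy-paste step. Since I'm not reproducing the real page here, the sketch below runs on a tiny inline stand-in table with invented numbers; against the live site you would pass the page URL instead.

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the arrivals table on the tourism site; with the real
# page you would call pd.read_html("<page URL>") instead.
html = """
<table>
  <tr><th>Country</th><th>Arrivals</th></tr>
  <tr><td>Cambodia</td><td>35000</td></tr>
  <tr><td>Japan</td><td>80000</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html))  # returns a list of DataFrames
arrivals = tables[0]
```

From there, saving to Excel is one line: `arrivals.to_excel("arrivals.xlsx", index=False)`.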

Problem 2 – Inconsistent naming and order of entries

Copying data from an HTML table wouldn’t be so bad if the order of entries stayed the same across tables. However, that isn’t the case. The order is all over the place, as you can see below: countries are ordered differently from one month to another.

Even so, VLOOKUP can still overcome the challenge. However, VLOOKUP requires consistent variable names, and in the screenshot above, Cambodia is spelled differently in the April and March 2019 reports.
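
The same problem bites in pandas: a join (the VLOOKUP equivalent) silently drops rows whose names don't match. A minimal sketch, with invented arrival figures and an invented alternative spelling, of normalizing names through an explicit alias map before joining:

```python
import pandas as pd

# Invented data: the same country spelled differently in two monthly reports
march = pd.DataFrame({"Country": ["Cambodia", "Japan"], "Arrivals": [30000, 75000]})
april = pd.DataFrame({"Country": ["Campuchia", "Japan"], "Arrivals": [32000, 78000]})

# Normalize the names with an explicit alias map before joining
aliases = {"Campuchia": "Cambodia"}
april["Country"] = april["Country"].replace(aliases)

# Now the join lines the two months up correctly
merged = march.merge(april, on="Country", suffixes=("_mar", "_apr"))
```

Without the `replace` step, Cambodia would simply vanish from the merged result.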

Problem 3 – Duplicate variable names

Duplicate variable names like those in the screenshot above violate data integrity. If you use VLOOKUP, the results will be duplicated and inaccurate.
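
The pandas version of the same failure, on invented data: when the lookup key appears twice, a join fans out and silently duplicates rows.

```python
import pandas as pd

# Invented data: "Cambodia" appears twice in the arrivals table
arrivals = pd.DataFrame({"Country": ["Cambodia", "Cambodia", "Japan"],
                         "Arrivals": [30000, 32000, 75000]})
targets = pd.DataFrame({"Country": ["Cambodia", "Japan"],
                        "Target": [40000, 70000]})

# The duplicate key fans out the join: Cambodia now shows up twice
joined = targets.merge(arrivals, on="Country", how="left")
```

Two input rows in `targets` become three output rows, so any totals computed on `joined` would double-count.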

Given how they display the data online, I don’t have much faith that things are different internally. My bet is that there is no data-centric approach, and even where data is used, it must be a time-consuming, laborious and primitive endeavor.

Some basic SQL & Plotly in Python

In this post, I’ll document how to use SQL and plotly to create a basic bar chart in Python, specifically in Jupyter Notebook. I’ll try to be as detailed and visual as possible.

The dataset I am going to use is Nike Manufacturing Map:

The first step is to download the dataset as an Excel file. Once the file is downloaded, open it and remove the first row: it holds only the name of the file, not the column names we need. Since I haven’t figured out how to remove the first row in code, the fast way is to do it before starting, to make our life easier. Click save, and the data is ready to go.
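
For what it’s worth, pandas can skip that title row at load time with the `skiprows` parameter, and `pd.read_excel` accepts it just like `pd.read_csv` does. A minimal sketch on an inline stand-in CSV (the file contents here are invented to mimic the title-row problem):

```python
import pandas as pd
from io import StringIO

# Stand-in for the downloaded file: the first row is a title, not headers
raw = StringIO("Nike Manufacturing Map\nFactory,TotalWorkers\nA,1200\nB,800\n")

# skiprows=1 drops the title row before parsing; pd.read_excel
# takes the same parameter for the real .xls file
demo_df = pd.read_csv(raw, skiprows=1)
```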

Open Jupyter Notebook. In my case, I use Anaconda. The first thing to do in a new Jupyter Notebook is to import the necessary packages. The “as ABC” part creates an alias for packages with long names so that we won’t have a hard time calling those packages later in the code. Imagine having to type “plotly.graph_objs” 10 times instead of “go”.

import plotly.plotly as py #this package is for visualization
import plotly.graph_objs as go #this package is for visualization
import pandas as pd #this package is for handling the dataframe
from pandasql import sqldf #This is to allow users to use SQL queries to create a new dataframe
pysqldf = lambda q: sqldf(q, globals())

Packages

Then, load the file. To check whether the file loaded correctly, use df.head(). The command shows a snapshot of the data frame. If you want to see the whole data frame, just use df.

df = pd.read_excel('/Users/camapcon/Documents/Github/Nike_Factory_Locations/export.xls') #read the file in
df.head()

Read the file in

We need to change the names of the columns highlighted in the screenshot. The way they are named makes it tricky to process SQL queries; I tried it. It’s better to have more straightforward column names with no blank spaces in them. It’s pretty easy:

df = df.rename(columns={'Nike, Inc. Brand(s)': 'Brands', '% Female Workers': 'FemaleWorkers', 'Total Workers': 'TotalWorkers'})
df.head()

Column name change

The next step is to use SQL to create a summary table whose data will be used to draw bar charts. What we will analyze includes:

  • The total number of workers by brand
  • The number of factories by brand
  • The average number of workers at each factory by brand
  • The average percentage of female workers at each factory by brand

The code to use SQL to create a subset is below:

Primary = """SELECT Brands, Sum(TotalWorkers) as TotalWorkers, Count(Brands) as Count, AVG(TotalWorkers) as AvgTotalWorkers, AVG(FemaleWorkers) as FemaleWorkers From df Group By Brands""" #create a data frame using a SQL query
Primary_df = pysqldf(Primary)
Primary_df

Here is the result:

SQL 1

Notice that the figures aren’t pretty. Let’s round them to make them look nicer.

Second = """SELECT Brands, TotalWorkers, Count, Round(AvgTotalWorkers, 0) as AvgTotalWorkers, Round(FemaleWorkers, 0) as FemaleWorkers From Primary_df"""
Primary_df = pysqldf(Second)
Primary_df

Here is how it looks:

SQL 2

It’s time to create the bar charts:

trace1 = go.Bar( #create x and y axes
    x=Primary_df['Brands'], #use the Brands column as the X axis
    y=Primary_df['TotalWorkers'], #use the TotalWorkers column as the Y axis
    name='Total Global Workers by Nike Brands'
)
data1 = [trace1]
layout1 = go.Layout( #set up a grouped bar chart
    title='Total Global Workers by Nike Brands',
    barmode='group'
)

fig = go.Figure(data=data1, layout=layout1)
py.iplot(fig, filename='grouped-bar')

Here is how it looks

newplot

You can change the Y axis to another column as you wish. Here is an example:

trace2 = go.Bar( #average percentage of female workers by brand
    x=Primary_df['Brands'],
    y=Primary_df['FemaleWorkers'],
    name='Average Percentage of Female Workers by Nike Brands'
)
data2 = [trace2]
layout2 = go.Layout( #set up a grouped bar chart
    title='Average Percentage of Female Workers by Nike Brands',
    barmode='group'
)

fig = go.Figure(data=data2, layout=layout2)
py.iplot(fig, filename='grouped-bar')

Name the variables for different charts differently (trace1, trace2, and so on) to avoid confusion and errors. It makes debugging easier as well.

newplot-2

That’s all I have for this post: some basic SQL queries and Python code to handle and visualize data with Plotly and pandas. I still need to practice and explore more, and I urge you to do the same if you are as new to and interested in data analysis as I am.