Having a lot of data is powerful. Handling it is NOT easy

Brands’ blunders in their emails

Over the past few weeks, a couple of brands sent me pretty awkward emails. The first incident was with the online pet store 1800PetMeds. A few weeks ago, I found out that my cat had ringworm, so I went to 1800PetMeds to buy an oral solution that his vet prescribed. The transaction took place on 21st February 2021 and went smoothly, which I was thankful for. The bottle's capacity is 52ml. My cat is supposed to take 1ml per day for 7 consecutive days on alternate weeks, meaning that we can't use up the bottle in 30 days. Even if he took it every day, it would take around 50 days to finish the bottle. Nonetheless, 1800PetMeds waited only 30 days before sending me an email encouraging me to buy another bottle. Their email headline read: "Kimi's next Intrafungol order is ready." While I appreciate the initiative, I'd much rather have my cat healed completely than have to buy another bottle.

Another incident was with Hulu. It sent me an email at 10 AM with a one-month free trial as a gift for my birthday. While the note was late by some margin, I still appreciated it. Less than 12 hours later, at 7 PM, it sent me another email asking me to become a subscriber. If that's not awkward enough, here is the kicker: I had already received the same trial offer a few days earlier and taken it! I doubt that Hulu has a system in place that coordinates the delivery of its marketing emails.

Handling a lot of data isn’t easy

1800PetMeds and Hulu aren't mom-and-pop shops that lack the resources to acquire and analyze data. On the contrary, they are Internet companies that should be experts in data analytics. Yet they still have blunders like the examples above. To be clear, there is no human being sitting at a desk sending out these emails from Outlook. They are all automated through email tools such as Mailchimp. Hence, they are a product of my information being stored in these companies' databases and of how they operationalize it.

This post isn't meant to ridicule them. Since I have first-hand experience dealing with data and know how difficult it is, I feel for them. At work, I deal with credit card data. There are many partners in our portfolio, some of which have hundreds of thousands of accounts. Many accounts have hundreds of thousands of transactions every year. The sheer volume of transactions, coupled with their randomness in frequency, makes it a monumentally challenging task to figure out each person's purchase pattern so that we can offer personalized marketing. To make matters worse, depending on how good your payment processor and internal data systems are, the problem can be compounded by irregularities in merchant descriptions. Below is a sense of what it takes to categorize purchases into merchants; think about doing this for the countless merchants out there.
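As a minimal illustration only (not my actual process or data at work), the pandas sketch below maps messy, hypothetical merchant descriptions to clean merchant names with regular expressions; the descriptions, patterns and names are all made up.

import re

import pandas as pd

# Hypothetical transaction data: raw merchant descriptions are rarely clean
transactions = pd.DataFrame({
    "description": [
        "AMZN Mktp US*2K4T31",
        "AMAZON.COM*HG12QW",
        "WM SUPERCENTER #1234",
        "WAL-MART #0987",
        "STARBUCKS STORE 05512",
    ]
})

# Illustrative mapping of regex patterns to clean merchant names
PATTERNS = {
    r"AMZN|AMAZON": "Amazon",
    r"WM SUPERCENTER|WAL-?MART": "Walmart",
    r"STARBUCKS": "Starbucks",
}

def categorize(description: str) -> str:
    """Return the first merchant whose pattern matches, else 'Unknown'."""
    for pattern, merchant in PATTERNS.items():
        if re.search(pattern, description, flags=re.IGNORECASE):
            return merchant
    return "Unknown"

transactions["merchant"] = transactions["description"].apply(categorize)
print(transactions)

In reality the pattern list runs into the thousands and never stops growing, which is exactly why this is such a grind.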

This is just one of the many aspects of my job. We also have to look at how attributes such as FICO, balance and credit limit change over time; how an account is engaged digitally (whether it is enrolled online, in mobile, e-statement, billpay or auto-pay, or whether it is connected to a digital wallet); how profitable an account is and where that profitability comes from; and how we can be more efficient in acquiring accounts (whether direct mail, the Internet, our retail branches, our Financial Institution partners or our Cobrand partners' stores are the most efficient channels).

When I first joined my current team, my boss told me that it would take at least six months, if not a year, to be comfortable, let alone proficient, with what we do on a daily basis. He wasn't wrong. The learning curve is very steep. Plus, when dealing with a large amount of data, you have to take the infrastructure into account. Here are just a few questions, at a high level:

  • Is your current data infrastructure set up to assist fast data retrieval?
  • Does your data warehouse have high availability? Or does it crash a lot?
  • Is it easy to get the data you need or does it take hours to run complicated SQL queries?
  • Is there a set of universal definitions of metrics and fields?
  • Do you have a data visualization tool that can help present complex data? Is it connected directly to the data warehouse? Does it require your team to learn a new coding language?
  • Do you have a machine learning capability in-house to create proprietary models?
  • Is there a tool that can help eliminate biases to create apples-to-apples comparisons? If yes, how is data transferred from internal data warehouses to that tool?

I don't believe my company is elite in data analytics. Not even close. I don't know for sure, but companies like Netflix or Google should be exciting places to work because you'd see how they handle an ocean of data at their fingertips. For many companies such as my employer, even though data-driven operations are a worthy vision, they are highly difficult to realize. Well, so is making money, I guess.

Get to know Snowflake

What is this company about?

Every business needs data to be competitive nowadays, and the challenges related to data are daunting. First, data grows exponentially every day. Companies need to invest in and maintain the right infrastructure to store a large amount of data. Second, the data infrastructure needs to support the business use cases. It must enable data staff to access data quickly and efficiently in order to build data-driven applications as well as carry out data analytics. Because business leaders have to make quick decisions in response to changing environments, they need fast and accurate analytics.

For instance, I work at a bank, and at the beginning of the pandemic we needed to build monitoring dashboards quickly so that the management team could keep a finger on the pulse of the business and make quick decisions. It was a challenge because our data warehouse's structure made complex queries very time-consuming, and when multiple users tried to execute queries at the same time, the problem was exacerbated. I often had to work late at night to make sure my queries finished faster. Our bank is just a regional one; I imagine our data problems would be magnified if we had a bigger operation.

Snowflake is built to solve data problems. It offers cloud-based data storage and analytics services through which businesses can have a single source of truth, consistency, resiliency, flexibility and fast access to data. Snowflake's products run on the major public cloud providers: AWS, Azure and GCP. On top of Snowflake, companies can build out a proper data warehouse, enable data analytics, power data applications and facilitate data exchange with other entities.

Figure 1 – Overview of Snowflake. Source: Snowflake S-1
Figure 2 – Use Cases from Snowflake. Source: Snowflake

As a Software-as-a-Service offering, Snowflake allows customers to pay only for the resources and services they use. Since Snowflake relies on public cloud providers, its unit prices mirror those of the providers in that prices vary depending on the region you are located in. Snowflake's customer base is impressive: its clients include corporations such as Akamai, Capital One, Neiman Marcus, AXA, McKesson and Hubspot. In terms of technology partners, Snowflake's website lists 17 major partners, including GCP, AWS, Azure, Salesforce and Looker.

How has the company been doing?

In its S-1, Snowflake revealed financials going back to the quarter ending in October 2018. The company hasn't been operationally profitable; it recorded losses in every quarter. Revenue increased from $29 million in the quarter ending October 2018 to $133 million in the quarter ending July 2020, while the operating loss stood at $78 million in July 2020. Before January 2020, Snowflake's operating loss was bigger than its revenue every quarter, but since then the loss has narrowed while revenue has accelerated.

While the company hasn't been profitable, the good news is that its gross and operating margins have been improving. If the company can maintain its gross margin and increase operating leverage (spend less on acquiring customers and be more efficient), it's slated to be profitable in the near future. The signs are already there: 94% of the company's revenue in the six months ending July 2020 came from existing customers that expanded their usage. That implies customer stickiness, satisfaction with the products, and cost savings on acquiring new customers to grow the top line.

The past 12 months were positive for Snowflake. The company doubled its customer base from 1,547 in July 2019 to 3,117 in July 2020, while the number of customers whose trailing 12-month product revenue was greater than $1 million increased from 22 to 56 over the same time frame. Remaining Performance Obligation (RPO), a metric that captures contracted revenue not yet recognized, grew from $221 million in July 2019 to $688 million a year later. The net revenue retention rate has always been higher than 150%, with the latest quarter at 158%. Snowflake expects this rate to decrease in the near future.

Regarding platform usage and customer acquisition, Snowflake has been fairly impressive. Its platform is used by 7 of the Fortune 10 (4% of revenue in July 2020) and 146 of the Fortune 500 (26% of total revenue). In the month of July 2020, the company reported an average of 507 million daily queries, compared to 254 million daily queries in July 2019. Its Net Promoter Score (NPS), a metric that indicates how willing customers are to recommend a brand, stands at a very good 71.

Capital One is an important client for Snowflake. The bank made up 17% ($16 million) and 11% ($29 million) of Snowflake's annual revenue in the fiscal years ending January 2019 and January 2020, respectively. While its share of revenue fell below 10% in the quarter ending July 2020, it still made up 22% of Snowflake's accounts receivable ($33 million). The trend lowers the risk for Snowflake and investors, as the company is now less reliant on this one particular customer.

What are the tailwinds behind Snowflake?

In its S-1, Snowflake reported a Total Addressable Market of $137 billion as of 2020. Given an annual revenue run rate of just $520 million, Snowflake has a lot of room to grow. Furthermore, only 12% of the company's revenue came from customers outside the US, which indicates a big international opportunity. Since the company is built on top of public cloud providers whose infrastructure spans the globe, it already has the base infrastructure to expand internationally.

More importantly, the digital transformation underway in the corporate world is positive for Snowflake. As companies become more agile and digital, they need data to make informed decisions and stay competitive. Hence, products and solutions that live in the cloud, like Snowflake's, are well-positioned to capture this trend. From a personal perspective, my company still hosts data on legacy infrastructure, which is a bottleneck for our data analytics and operations. I can't tell you how many times I sat idle for hours because our data warehouse went offline, or how many hours I wasted waiting for complex queries to run. My developer colleagues are happy about the upcoming transition to Snowflake. This gives me a bit of confidence in the company, as I know my employer isn't the only one running on legacy data infrastructure.

Risks

Among the risk factors listed in its S-1, there is one that I think stands out for Snowflake: its reliance on public clouds. Here is what it wrote:

We currently only offer our platform on the public clouds provided by AWS, Azure, and GCP, which are also some of our primary competitors. Currently, a substantial majority of our business is run on the AWS public cloud. There is risk that one or more of these public cloud providers could use their respective control of their public clouds to embed innovations or privileged interoperating capabilities in competing products, bundle competing products, provide us unfavorable pricing, leverage its public cloud customer relationships to exclude us from opportunities, and treat us and our customers differently with respect to terms and conditions or regulatory requirements than it would treat its similarly situated customers. Further, they have the resources to acquire or partner with existing and emerging providers of competing technology and thereby accelerate adoption of those competing technologies. All of the foregoing could make it difficult or impossible for us to provide products and services that compete favorably with those of the public cloud providers.

Source: Snowflake’s S-1

This is a genuine concern. I don't see Snowflake building out its own underlying infrastructure in the near future; matching the scale and sophistication of these public providers, especially if it wants to expand overseas, would be expensive and resource-consuming. While such reliance presents a risk, particularly since AWS, Azure and GCP all have products competing with Snowflake, the company also brings a lot of revenue to these cloud providers. That creates an interesting dynamic in which I doubt the Big Three will do anything to harm their startup customer.

In summary, my personal experience inclines me toward investing in this company. I have also seen folks lauding the business on Twitter. The company has shown impressive growth, has a lot of room to grow, and has tailwinds behind it. I personally look forward to Snowflake's IPO debut.

Challenges to Data Analysts – What I learned from my job

This job is the first time I have ever worked with a large amount of data and predictive methods to estimate the effectiveness of future marketing efforts. Below are a few challenges that I think data analysts may face:

Documentation and Universal Truth

The data and analyses can only be helpful when accepted by everyone, or at least almost everyone. Different departments need to be on the same page on definitions and calculation methods. Finance can’t have a different understanding of profit and revenue than Marketing does.

Also, the more data there is, the more important extensive and meticulous documentation becomes. Everybody in an organization needs to be aware of what each schema, table, column and value means, and how the tables are connected with one another. Without careful, detailed documentation and that shared knowledge, an organization will encounter a lot of waste and inefficiency in its operations.

Appropriate Tools

A large amount of data requires a powerful machine to retrieve it from a data warehouse. An insufficient piece of equipment, such as a dated computer with 4GB of RAM, means hours of lost time and inefficiency. Data warehouses also have to be reliable: an offline data warehouse renders data analysts almost useless.

After data is retrieved, the next step is to clean, process and present it. Tools such as Tableau are awesome, but security concerns from Compliance or IT can be a hindrance to adopting them. Plus, applications such as Tableau are expensive. If only a few individuals in an organization have access, the usefulness will be limited and the investment will not be as fruitful as it could be.

Hence, Excel is still a popular choice for data analysis. However, when it comes to processing a large dataset or using pivot tables with a lot of slices and filters, Excel is notorious for crashing more often than you can stomach. Furthermore, Excel isn't a great visualization tool in my opinion. Presenting data to management or leadership teams usually demands sleek visuals to aid understanding, and easy preparation of such visuals to save time. Unfortunately, Excel isn't strong at either.

Connecting data to an outsourced application

Not every application that is useful to your job is internal. Sometimes an external application is necessary. For instance, you may want a predictive analytics tool that can't be built in-house. For the tool to work, you need to feed it real data as regularly as possible, since predictions often stem from historical data. However, getting data outside an organization's walls can be a daunting task because of compliance and security concerns. Plus, ensuring that the data sent to the external application is correct and updated regularly is a time-consuming challenge. The data needs to be verified carefully, and files have to be sent on time and in the format requested by the vendor. Automating the data feed is ideal, but it involves some programming and collaboration with IT and Compliance.
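As a rough sketch only (the columns, checks, file layout and delimiter here are hypothetical, not any vendor's actual requirements), such a feed often comes down to validating a data frame and exporting it in the agreed format before it is sent:

import pandas as pd

# Hypothetical extract destined for an external predictive analytics tool
feed = pd.DataFrame({
    "account_id": [1001, 1002, 1003],
    "balance": [2500.75, 310.00, 14020.10],
    "last_purchase_date": ["2021-02-21", "2021-03-02", "2021-03-05"],
})

# Basic sanity checks before anything leaves the organization's walls
required_columns = {"account_id", "balance", "last_purchase_date"}
assert required_columns.issubset(feed.columns), "Feed is missing required columns"
assert feed["account_id"].is_unique, "Feed contains duplicate account IDs"
assert not feed.isnull().any().any(), "Feed contains missing values"

# Export in the (made-up) format the vendor asked for: pipe-delimited with a dated file name
feed["last_purchase_date"] = pd.to_datetime(feed["last_purchase_date"])
feed.to_csv("vendor_feed_2021-03-31.txt", sep="|", index=False, date_format="%Y%m%d")

In practice, the validation rules, the transport mechanism and the schedule would all be dictated by Compliance, IT and the vendor.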

Working at my job so far has been an eye-opener, especially in terms of working with huge datasets. A high schooler shadowed me on the job a few days ago, and I explained to her what my job entails and what we do every day. I hope this post sheds some light on the challenges data analysts face. There are likely more, but I think these three are among the most common and the biggest.

Data Analytics: Klay Thompson’s Performance

This is a data analytics exercise analyzing Klay Thompson's performance in the 2018-2019 season, up to 22nd December 2018. Klay Thompson is the shooting guard of the Golden State Warriors. He is a three-time champion, and I am a big fan of his playing style and deadly explosiveness. This post features my findings from analyzing his shot data this season, taken from the NBA website here. My code is available on my personal GitHub for your reference.

Findings

  • Klay made about 44% of his shots so far
  • Klay's made shots have an average distance to the basket of 15.92 feet
  • He made more shots in the first half than he did in the second half
  • 67% of Klay's made shots are two-pointers; the rest are three-pointers
  • Living up to his name, Klay’s favorite play type is “catch and shoot jump shot”
  • Regarding Klay’s made two-pointers, below is the distribution by distance. He seems to be more effective within 10 feet of the basket and from 15 to 20 feet.
  • As for Klay's three-pointers, the distribution by distance to the basket is as follows (no surprise that the farther he is from the basket, the less lethal he is):

  • As one of the best three point shooters in the league, Klay seems to be equally good throughout the periods of a game, except for the first quarter

Technical lessons I learned from this practice

Pie chart in Python with Matplotlib

Let's say you have two variables, TwoPT and ThreePT, that stand for the shooting percentages of Klay's two-pointers and three-pointers respectively. Here is the code to draw a pie chart:

import matplotlib.pyplot as plt

labels = '2PT Field Goal', '3PT Field Goal'
sizes = [TwoPT, ThreePT]
colors = ['green', 'gold']
explode = (0, 0)  # keep both slices attached; (0.1, 0) would pull out the first slice

# Plot the pie chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.title("Klay's made shots by shot type")
plt.show()

The nunique function

Imagine you have a data frame like the following:

If you want to count how many events (whether a missed or a made shot) Klay had in each period, instead of using SQL you can use the nunique function. An advantage of this aggregation is that the result is automatically a data frame. The code is as follows:

# madeshot is the data frame of made shots; pd is the standard pandas alias
periodstats = madeshot.groupby(by='period', as_index=False).agg(
    {'game_date': pd.Series.nunique, 'time_remaining': pd.Series.nunique})

The result is:

Sort and get the top 10 of a data frame

If your data frame looks like the one below and your intention is to get the top 10 records in terms of “times”, what will you do?


The code I used is pretty straightforward. (The data frame's name is shotdistance.)

shotdistance = shotdistance.sort_values(by='times', ascending=False)
shotdistance_top10 = shotdistance.head(10)

Categorize a data frame by bins

If you want to categorize Klay's shots by distance into, for instance, "less than 10 feet", "from 10 to 15 feet" and "from 15 to 20 feet", what will you do? The code to turn the distances into categories is:

# pd is the pandas alias; TwoPTtype is the data frame in question
df1 = pd.cut(TwoPTtype['shot_distance'],
             bins=[0, 10, 15, 20, 23],
             include_lowest=True,
             labels=['Less than 10 feet', 'From 10 to 15 feet',
                     'From 15 to 20 feet', 'From 20 to 23 feet'])

The result is:

If you merge that data frame with the frequencies in the original data frame:

# df1 is the binned series from the pd.cut call above
newdf = pd.concat([df1, TwoPTtype['times']], axis=1)

Data Visualization: Hotel Statistics from Singapore Tourism Board

The data in this Story covers 2014 through October 2018 and was taken from the Singapore Tourism Board website. You can find my Tableau dashboard here.

Cumulative figures for available room nights, booked room nights, revenue, average room rates and RevPar are presented on a monthly and yearly basis.

Drilling down further, the hotel statistics are categorized into four hotel tiers: luxury, upscale, mid-tier and economy. Since segment-level data is available only for RevPar, average room rates and occupancy rate, it's not possible to compute cumulative figures for the segments.

Basic terminology:

Available room nights refer to the total possible number of room nights in a period of time.

Booked room nights refer to the total number of room nights booked by guests in a period of time.

Average room rates refer to the average revenue per booked room night in a period of time. They are calculated as revenue divided by the number of booked room nights.

Occupancy rates refer to the percentage of booked room nights against available room nights. They are calculated as booked room nights divided by available room nights.

Revenue per Available Room, or RevPar, refers to the revenue per available room night. It's one of the most important metrics in the industry, showing how well a property fills its room inventory and how much it can charge per room. It's calculated as the occupancy rate multiplied by the average room rate over the same period of time.
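To make the relationships between these metrics concrete, here is a tiny worked example in Python with made-up numbers (not actual Singapore Tourism Board figures):

# Hypothetical monthly figures for a single hotel
available_room_nights = 9000    # room nights the hotel could have sold
booked_room_nights = 7650       # room nights actually sold
revenue = 1530000.0             # room revenue for the month

occupancy_rate = booked_room_nights / available_room_nights   # 0.85, i.e. 85%
average_room_rate = revenue / booked_room_nights              # 200.0 per booked room night
revpar = revenue / available_room_nights                      # 170.0 per available room night

# RevPar can equivalently be derived from the other two metrics
assert abs(revpar - occupancy_rate * average_room_rate) < 1e-9
print(f"Occupancy: {occupancy_rate:.0%}, Average rate: {average_room_rate:.2f}, RevPar: {revpar:.2f}")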

Insights

Yearly Statistics

From 2015 to October 2018, the Singaporean tourism industry averaged more than 80% in occupancy rate, which is pretty positive.

The industry also seems to be expanding, with increasingly larger yearly available room nights (up almost 3 million room nights from 2014 to 2017) and booked room nights.


While the number of room nights and the occupancy rate get bigger year after year, the average room rate and RevPar move in the opposite direction.

The rise in revenue signals that the increase in available and booked room nights outweighs the drop in average room rates and RevPar.


Data for 2018 doesn't include figures for November and December. Given past performance, it's difficult to predict how the average room rates and RevPar will fare. It seems, however, that the average room rate will go up in December compared to November.

Also, December seems to be a low month for Singapore's tourism, as the historical figures below show. On the other hand, July and August are the busiest months of the year for Singapore.

Regression Model

Regression models for each year suggest a correlation between occupancy rate and RevPar. Every year's model comes with a p-value of less than 0.05 and an R-squared between 58% and 77%.
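The regressions above came from Tableau's trend lines, but a comparable model could be reproduced in Python. The sketch below uses statsmodels on a hypothetical data frame; the monthly occupancy and RevPar values are stand-ins, not the actual Singapore Tourism Board data.

import pandas as pd
import statsmodels.api as sm

# Hypothetical monthly figures for one year (illustrative stand-ins)
monthly = pd.DataFrame({
    "occupancy_rate": [0.82, 0.84, 0.86, 0.85, 0.83, 0.88, 0.90, 0.89, 0.80, 0.84, 0.83, 0.78],
    "revpar": [178, 183, 190, 187, 181, 196, 204, 201, 172, 184, 182, 168],
})

# Ordinary least squares: RevPar explained by occupancy rate
X = sm.add_constant(monthly["occupancy_rate"])
model = sm.OLS(monthly["revpar"], X).fit()

print(model.rsquared)                   # analogous to the R-squared reported for each year
print(model.pvalues["occupancy_rate"])  # p-value on the occupancy rate coefficient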

Seasonality

Regarding the seasonality of Singapore's tourism industry, every September sees a sharp decline in occupancy rate compared to July and August of the same year. It's an interesting phenomenon given that the Singapore Formula 1 Grand Prix takes place every September. Surprisingly, the luxury and upscale segments fared better than the mid-tier and economy segments from 2014 to 2016 and in 2018. My theory is that September may traditionally be a low month for Singapore and, as a result, sees a low occupancy rate. However, hotels either push room prices higher on their own or package room rates with Formula 1 weekend tickets. The average room rate, therefore, may be driven up despite the low occupancy rate.

In 2014

In 2015

In 2016

In 2017

In 2018

It's very interesting, since when demand is not there, businesses don't usually raise prices. As a matter of fact, prices are usually lowered to stimulate demand. My theory for this phenomenon, regarding the Formula 1 Grand Prix in Singapore, is outlined above.

Luxury Segment

Upscale Segment

Mid-Tier Segment

Economy Segment