
Seasonal Trend Decomposition

Elvis
Published in DataDrivenInvestor · 8 min read · Mar 28, 2021


It's a pretty common problem for businesses to stare at data plotted against time (e.g., sales over time) and try to separate seasonality from the underlying trend. More often than not one can sort of see the trend, but it's buried within a strong seasonal component. One typical strategy for dealing with the problem is to track Year-over-Year (YoY) changes expressed as a percentage, but this can be just as difficult because the YoY change can swing from low single digits to high double digits over the course of a few days. Fortunately there is a simple mathematical algorithm that can help: Seasonal and Trend decomposition using Loess, commonly known by its acronym STL. In case you are wondering why people don't use the acronym for Seasonal Trend Decomposition… well, consider the connotations of an acronym like STD. :-)

So how does STL help you better understand your business data? STL decomposes a time-series into a seasonal component, a trend component and “noise”:

Y = Y_seasonal + Y_trend + Y_noise

The seasonal component Y_seasonal will repeat identically every period (e.g., every year). It is the part that we want to isolate so that we may better see the underlying trend, which is in Y_trend. The last component Y_noise is the portion of the data that can't be attributed to either seasonality or a smoothly varying trend; in practice it typically looks like a Gaussian-distributed variable with a mean of zero. Its variance measures how "volatile" the time-series is.
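To make the additivity concrete, here is a minimal R illustration using the built-in co2 dataset (monthly atmospheric CO2 readings, a strongly seasonal series); stl() returns the three components in its time.series matrix, and they sum back to the input:

# Decompose a seasonal series and verify that the components add up
fit <- stl(co2, s.window = "periodic")
head(fit$time.series)                                  # columns: seasonal, trend, remainder
all.equal(as.numeric(co2), rowSums(fit$time.series))   # TRUE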

I want to avoid the problems with using confidential business data, so let's use a data set that is openly available: stock market pricing data. This is not an ideal dataset because stock prices do not exhibit strong seasonality, but it will allow us to illustrate and understand what STL does, how it does it, and why you should care. Let's use the stock price for Amazon.com from 1 Jan 2015 through 23 Oct 2020, roughly six years of pricing data. I am a subscriber to Quandl, which I use to get financial data. They provide an API that I use to download the data by day and store it in a MySQL database in this format:

Image by Author

There was some data preparation I had to do to make the market data amenable to analysis with STL. In particular, here are the issues I had to resolve:

  1. Fill in missing dates when the market was closed, which I did by copying the pricing data from the most recent prior day that had data (i.e., assume the price did not change from whatever it previously was). There are better ways to do this, but it is sufficiently easy to illustrate how STL works; a simplified sketch of the idea follows below.
  2. Make every year 365 days, which I did by eliminating any pricing data for February 29th during leap years.
Image by Author
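Here is a minimal sketch of the gap-filling idea from step 1 on a toy data frame (the full version, wired to my MySQL data, appears at the end of the article):

# Toy example: Jan 6-7 are missing and get the most recent prior close
px <- data.frame(time = as.Date(c("2021-01-04", "2021-01-05", "2021-01-08")),
                 close = c(100, 101, 103))
all.days <- data.frame(time = seq(min(px$time), max(px$time), by = "day"))
filled <- merge(all.days, px, all = TRUE)
gaps <- which(is.na(filled$close))
prev <- sapply(gaps, function(i) max(which(!is.na(filled$close[1:i]))))
filled$close[gaps] <- filled$close[prev]
filled   # closes: 100, 101, 101, 101, 103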

I will provide the full R code at the end but for now let me highlight just the relevant analytical lines:

# Convert the data into a time series structure
start_date <- merged.data$time[1]
end_date <- merged.data$time[length(merged.data$close)]
data.ts <- ts(merged.data$close, frequency = 365,
              start = c(as.numeric(strftime(start_date, format = "%Y")),
                        as.numeric(strftime(start_date, format = "%j"))),
              end = c(as.numeric(strftime(end_date, format = "%Y")),
                      as.numeric(strftime(end_date, format = "%j")) - 1))

# Apply the seasonal trend decomposition
data.stl <- stl(data.ts, s.window = 365)

The first two lines (start_date and end_date) select the start and end dates for the time series, which in this case are the first and last rows of data. The third line (data.ts) converts the R data frame into the time-series (ts) object that the STL algorithm requires as input. The final line (data.stl) applies the STL algorithm to the time series using a 365-sample seasonal window (i.e., we are saying the seasonality repeats every 365 days) to extract the seasonal component. The output is the chart below. Let's talk about each row.
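For completeness, the four-row chart below comes from simply plotting the stl object:

plot(data.stl)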

Image by Author

The first row, labeled 'data', is just the input data to the STL algorithm; in this case it is the closing price for Amazon.com stock. The second row, labeled 'seasonal', is the seasonal component of the time series; here it oscillates between roughly -$150 and +$150 over the course of a year. The third row, labeled 'trend', is the trend component. Stock market pricing data isn't the best example here because the trend was already obvious from the input data, but it does illustrate that the STL algorithm can easily isolate the underlying trend. The fourth row, labeled 'remainder', is the "noise" component of the data. Now let's dive more deeply into each one.

The seasonal component repeats identically every year, so let's look at just 2015 to see the complete seasonality.

Image by Author
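If you want to reproduce this view, the seasonal component can be extracted from the stl object (the components live in its time.series matrix) and windowed down to a single year:

# Pull out the seasonal component and plot one full cycle
seasonal <- data.stl$time.series[, "seasonal"]
seasonal.2015 <- window(seasonal, start = c(2015, 1), end = c(2015, 365))
plot(seasonal.2015, xlab = "2015", ylab = "Seasonal component ($)")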

Amazon's seasonality is such that the maximum (e.g., +$100 to +$150) occurs between July and September, and the minimum (e.g., -$100 to -$150) occurs between December and April. The trough-to-peak spread in the seasonality is about $300 over the course of a year, which is roughly 10% of the price of Amazon.com stock today at about $3,000 per share (i.e., about +/-5% around the trend). So, based purely on this seasonal pattern, the best time to buy Amazon.com stock is between December and April, and the best time to sell is between July and September.

The trend component pretty much speaks for itself. If you have owned Amazon.com stock since 2015, you must be very happy.

Image by Author

Between 2015 and the end of 2020 the price of Amazon stock went up by more than a factor of 6x. So, if you had $100,000 in Amazon stock in 2015, by the end of 2020 that investment was worth more than $600,000. That is the equivalent of roughly a 43% Compounded Annual Growth Rate (CAGR) sustained for five years (i.e., a 5-year CAGR), which far exceeds the overall stock market return, as you will see further below, and is far above inflation.
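The arithmetic behind that CAGR figure is a one-liner:

# A 6x return over 5 years implies roughly a 43% compounded annual growth rate
6^(1/5) - 1   # ~0.43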

The remainder or noise component is interesting. This is the portion of the time-series that is neither part of the seasonality nor part of the trend.

Image by Author

I think of this component as a measure of the volatility of the time series. The larger the standard deviation of the remainder, the more risk there is in the underlying data. I will show you an example in a few moments of why I think of this component as a measure of risk. For AMZN, the remainder has a standard deviation of $106.88 and a mean of -$10.10. The remainder is distributed approximately as a Gaussian random variable, as the following chart illustrates.

Image by Author
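For reference, these statistics (and a histogram like the one above) can be computed directly from the remainder column of the stl output:

# Summarize the noise component of the decomposition
remainder <- data.stl$time.series[, "remainder"]
mean(remainder)   # about -10.10 for AMZN
sd(remainder)     # about 106.88 for AMZN
hist(remainder, breaks = 50, main = "AMZN remainder", xlab = "Remainder ($)")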

So why do I think of the remainder as a measure of volatility or risk? Let me answer that with what we see when we apply the STL algorithm to an Exchange Traded Fund (ETF), which is a type of index fund that contains many different stocks. The point of an index fund is to reduce risk by mixing many stocks together, thereby damping volatility. I will use Vanguard's Total Stock Market ETF (VTI), which invests across the entire stock market… more precisely, it invests in more than 3,600 different individual stocks. This is what the STL algorithm returns:

Image by Author

If we focus on just the remainder, we see that it is bounded by roughly -$30 and +$10, unlike Amazon, which is bounded roughly by +/-$400. So VTI's noise is generally far smaller than what we see in AMZN, as you would expect from an index fund.

Image by Author
Image by Author

The remainder for VTI is distributed approximately as a Gaussian random variable with mean -$0.18 and standard deviation $6.05 (versus $106.88 for AMZN). Since VTI is composed of more than 3,600 individual stocks, its volatility is roughly 1/18 that of AMZN, which is what you'd expect from an ETF. In return for far lower risk (i.e., volatility), the 5-year CAGR for VTI was 9.7%, roughly one quarter of Amazon's 43% 5-year CAGR.
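A quick sanity check on that ratio:

# Ratio of remainder standard deviations, AMZN vs. VTI
106.88 / 6.05   # ~17.7, i.e., roughly 18x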

If you think about it for a while, you will realize you can use the STL algorithm to map stocks (and other financial instruments) onto a four-quadrant chart, with one axis being "volatility" and the other being "N-year CAGR". Such a chart would enable you to decide which stocks to invest in by balancing risk against return. In fact, the STL algorithm can be used to make many other interesting everyday decisions, such as "What is the best month of the year to sign a new apartment rental agreement?".
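As a sketch of that idea, here is how such a chart could be built in R from the two data points computed in this article (the remainder standard deviations and 5-year CAGRs for AMZN and VTI); any additional ticker would contribute another point computed the same way:

# Sketch of a risk/return quadrant chart using the values computed above;
# risk = standard deviation of the STL remainder, ret = 5-year CAGR
risk <- c(AMZN = 106.88, VTI = 6.05)
ret  <- c(AMZN = 0.43,   VTI = 0.097)
plot(risk, ret, pch = 19,
     xlab = "Volatility (sd of STL remainder, $)", ylab = "5-year CAGR")
text(risk, ret, labels = names(risk), pos = 3)
abline(v = median(risk), h = median(ret), lty = 2)  # quadrant boundaries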

Before I lose you to studying R-code, if you want to read something a little less technical, I recommend my article on best practices for prioritizing analytical work. If you are a bit more statistically curious today, try this article about A/B testing or this article about the three (3) most important statistical tests. Finally, if you want a very fast 1–2 minute read, try this article on eight (8) tips for improving communications between data science and business users.

As promised, here is the complete R code I used. I removed confidential information like my passwords and server URLs, but otherwise it's (almost) everything I used, tailored to how I have the data stored in my MySQL instance.

library('DBI', quietly = TRUE)
library('lubridate')

user <- 'XYZ'
password <- 'XYZ'
dbname <- 'XYZ'
host <- 'XYZ'

mydb <- dbConnect(RMariaDB::MariaDB(), host = host, user = user,
                  password = password, dbname = dbname)
rs <- dbSendQuery(mydb, "SELECT * FROM SEP WHERE ticker = 'AMZN' AND date >= '2015-01-01'")
data <- dbFetch(rs)
dbClearResult(rs)

names(data)[names(data) == 'date'] <- 'time'
data$time <- as.Date(data$time)
sorted.data <- data[order(data$time),]
data.length <- length(sorted.data$time)
time.min <- sorted.data$time[1]
time.max <- sorted.data$time[data.length]
all.dates <- seq(time.min, time.max, by = "day")
all.dates.frame <- data.frame(list(time = all.dates))
merged.data <- merge(all.dates.frame, sorted.data, all = TRUE)

# Identify all the dates without data (e.g., the market is closed)
x <- which(is.na(merged.data$close))
y <- which(!is.na(merged.data$close))
z <- lapply(x, function(z) y[max(which(y < z))])

# Set missing prices to the most recent price; interpolation would be better
merged.data$volume[x] <- 0
merged.data$ticker[x] <- 'AMZN'
merged.data$close[x] <- merged.data$close[unlist(z)]

# Remove any leap year data to have every year be 365 days
merged.data <- merged.data[-c(which(strftime(merged.data$time, format = "%m") == "02" &
                                    strftime(merged.data$time, format = "%d") == "29")),]

# Convert the data into a time series structure
start_date <- merged.data$time[1]
end_date <- merged.data$time[length(merged.data$close)]
data.ts <- ts(merged.data$close, frequency = 365,
              start = c(as.numeric(strftime(start_date, format = "%Y")),
                        as.numeric(strftime(start_date, format = "%j"))),
              end = c(as.numeric(strftime(end_date, format = "%Y")),
                      as.numeric(strftime(end_date, format = "%j")) - 1))

# Apply the seasonal trend decomposition
data.stl <- stl(data.ts, s.window = 365)



Written by Elvis

An Amazonian, academically trained in Physics and Electrical Engineering, experienced in Data Science, Data Engineering, Analytics, and Business Intelligence.
