Time Series Analysis: An Introduction

We love data - and one form of data is likely to exist in every company: the time series. Today we want to give a brief introduction to time series analysis, using corporate insolvencies as an example. As is so often the case in data science, this article is aimed at readers who want to get a first overview.

Time series

The basic idea is very simple: a time series records events that occur at more or less regular intervals, one after the other, and therefore over time. What does a company have that fits this description? Pretty much everything: sales, expenses, production figures ... simply everything that occurs repeatedly over time.

The exciting thing about time series is that they usually contain far more information than you might think at first glance: trends, seasonality and, of course, the basis for forecasts. Reason enough to get to the bottom of time series analysis. The goal for today is to show roughly what can be done quickly and easily to get a feel for a time series. We don't want to impart deep technical skills, we just want to show you how to proceed.

Of course we will again use publicly available data. This time it's about bankruptcies. The data is quite suitable, because on the one hand it is available more or less continuously, and on the other hand it illustrates nicely where forecast models run into difficulties.

Develop a feel for the data

The Federal Statistical Office provides monthly data on insolvency proceedings opened against companies since January 2003. For Germany, up to January 2017, it looks like this:

The line in the middle is the 3-month moving average. But what does this little graph tell us? Overall, somewhere between 1,200 and 2,300 bankruptcy filings per month. But it shows even more. Maybe this is easier to see if we pick out a single year:

The years seem to have a seasonal pattern: rising into April, then falling, and rising again in June/July? At the very least, the individual years don't look completely different from one another.
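A chart like the one above, with a 3-month moving average as the line in the middle, takes only a few lines of pandas. A minimal sketch - the series here is synthetic stand-in data, since the real monthly counts from the Federal Statistical Office are not bundled with this article:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the monthly insolvency counts
# (the real series runs January 2003 to January 2017).
rng = np.random.default_rng(0)
months = pd.date_range("2003-01", "2017-01", freq="MS")
insolvencies = pd.Series(
    rng.integers(1200, 2300, size=len(months)).astype(float),
    index=months,
    name="insolvencies",
)

# The 3-month moving average drawn as the line in the middle of the chart.
rolling_3m = insolvencies.rolling(window=3).mean()

print(insolvencies.head())
print(rolling_3m.dropna().head())
```

With a real data file you would replace the synthetic series with `pd.read_csv(...)` and plot both series on top of each other.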

So we now know two things. There seems to be a general trend - or maybe three: before 2008/2009, until 2011/12, and afterwards. And there are seasonal patterns, both within each year and across the entire period.

The simplest of all time series analyses: ARIMA

This type of information is easy to exploit. To do this, we use a simple time series method: the so-called ARIMA model ... Autoregressive Integrated Moving Average ... oh man, yet another rather dry definition. Let's skip the formalities.

Understanding when to use an ARIMA model is more important than knowing exactly what it does: namely in situations where mean and variance do not vary over time - i.e. where the series is stationary. Is that the case here? The mean in 2005 is 1,937.25 filings with a standard deviation of 152.65. In 2009, however, it is 2,026.25 with a standard deviation of 177.46.

Even without formal testing, we can assume that the time series is not stationary - its mean and variance drift over time. So ARIMA doesn't actually fit ...
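The rough check described above - comparing mean and standard deviation year by year - can be sketched like this (again with synthetic stand-in data, so the printed numbers will differ from the 2005 and 2009 figures quoted in the text):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly insolvency counts indexed by month.
rng = np.random.default_rng(1)
months = pd.date_range("2003-01", "2016-12", freq="MS")
insolvencies = pd.Series(
    rng.integers(1200, 2300, size=len(months)).astype(float), index=months
)

# Mean and standard deviation per calendar year - if these wander around
# noticeably, the series is very likely not stationary.
by_year = insolvencies.groupby(insolvencies.index.year).agg(["mean", "std"])
print(by_year.loc[[2005, 2009]])
```

For a more formal answer you would reach for a unit-root test such as the augmented Dickey-Fuller test, but the eyeball check is enough for our purposes here.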

We can, however, manipulate the data a little - bend it until it looks stationary. Stationary data looks roughly like white noise, i.e. like static flicker. But how do we do that? Let's proceed as unscientifically as possible, but quickly and, above all, graphically. No heavy math needed.

Data transformation

First, let's try to tame the biggest fluctuations. To do this, we can take the logarithm of the time series - outliers then carry less weight, because even a large number becomes quite small under the natural logarithm. By the way, it looks like this:

Our maximum of 2,329 insolvencies in 2006 has now shrunk to 7.75 (because ln(2329) ≈ 7.75). That already looks a bit more like white noise - but everything from around 2010 onwards still shows a clear trend. There are more options: if the logarithm trick is not enough, it can be combined with the first-order difference trick. That means you take the difference from one period to the next. It looks like this:

Shown is the first-order difference, with the average as the line in the middle. And that's pretty much what we're looking for! Nice white noise - everything looks similar, hardly any trend is visible, it seems arbitrary. Importantly, we did not "destroy" the data: we transformed it, and the transformation can be reversed.
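Both transformations - and the way back - fit in a few lines. A sketch with a synthetic series standing in for the monthly counts:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monthly insolvency counts.
rng = np.random.default_rng(2)
insolvencies = pd.Series(rng.integers(1200, 2300, size=60).astype(float))

log_series = np.log(insolvencies)         # e.g. ln(2329) is roughly 7.75
diff_series = log_series.diff().dropna()  # first-order differences

# The transformation is reversible: cumulate the differences, add the
# first log value back, and exponentiate to recover the original series.
recovered = np.exp(log_series.iloc[0] + diff_series.cumsum())
```

This reversibility is exactly why the data is not "destroyed": a forecast made on the transformed scale can be mapped back to actual filing counts.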

The actual forecast and its problems

Now we can get to work with the ARIMA model. We still have to give it a helping hand, though. The "AR" in ARIMA stands for "autoregressive", which means something like "one period influences the next". If that is the case, ARIMA wants to know about it from us - and we can check whether it is. For this we use a pretty neat graphic, a so-called correlogram, and it looks like this:

The little spikes show how strongly the time series is correlated with itself over time - its autocorrelation from one month to the next. At lag = 1 you see the correlation between the number of bankruptcy filings in one month and the previous month. In our example it is around -0.38 and is also significant, because the spike reaches outside the grey 95% confidence band. Neat graphic, huh?
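What the correlogram plots at lag 1 can be computed by hand: correlate the series with itself shifted by one month, and compare against the approximate 95% band. A sketch on a synthetic stand-in series (on the real log-differenced data, this is where the value of about -0.38 comes from):

```python
import numpy as np

# Stand-in for the log-differenced insolvency series.
rng = np.random.default_rng(3)
x = rng.normal(size=168)

# Lag-1 autocorrelation: correlation of the series with itself,
# shifted by one period.
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]

# Approximate 95% band, the grey area in the correlogram.
band = 1.96 / np.sqrt(len(x))
print(lag1, "significant" if abs(lag1) > band else "not significant")
```

Libraries like statsmodels draw the full correlogram for you (e.g. `plot_acf`), but the manual version shows what the spikes actually mean.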

But what do we do with it? We tell our ARIMA model that at least the first three lags have correlations significantly different from zero and ask it to take this into account.

What's the point? We need this information to estimate an ARIMA model using tools like Python, R or Stata. We help these tools price in the dependencies over time - and once we do that, we can make forecasts. The result looks like this, for example:

This graph shows a scatter plot of the bankruptcy filings actually recorded over time - the green dots. The line is our prediction model. Of course it's far from perfect, but it's not really bad either: we hit some points quite well. We have also taken the liberty of simply extending the forecast into the future. We think it looks "halfway natural". While that is by no means proof that fewer than 1,000 bankruptcy filings will actually be recorded in mid-2018, it gives us a rough feeling for where things are headed.

When can ARIMA be used?

The problem lies precisely in the simplicity of the ARIMA approach - and it becomes clear when we pretend we had made the same forecast in September 2007 and now compare it with reality:

Well, shit happens. The financial crisis wrecked our calculation ... and, of course, our forecast no longer looks right. For all the charm that ARIMA brings with it, there is the risk of completely losing sight of the underlying processes. ARIMA forecasts are therefore particularly useful when

  1. the underlying, driving forces can be well understood and explained - this is often the case with production data, for example.
  2. seasonality and trend can be substantiated in terms of content. For this you need a lot of data points/periods - from about 50 it usually works, and anything over 100 is better. In most cases, however, you should use two to three forecast methods and think carefully about why their results differ.
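Comparing methods doesn't have to be elaborate. A minimal sketch of the idea: hold out the last year, score two very simple baselines against it, and use ARIMA (or anything else) only if it clearly beats them. Everything here - series, split, metric - is illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monthly insolvency counts.
rng = np.random.default_rng(5)
series = pd.Series(rng.integers(1200, 2300, size=168).astype(float))
train, test = series[:-12], series[-12:]

# Two trivial baselines: repeat the last observed value, and repeat
# the mean of the last three months.
naive = np.full(12, train.iloc[-1])
moving_avg = np.full(12, train.iloc[-3:].mean())

# Mean absolute error on the held-out year.
mae_naive = np.abs(test.values - naive).mean()
mae_ma = np.abs(test.values - moving_avg).mean()
print(f"naive MAE: {mae_naive:.1f}, moving-average MAE: {mae_ma:.1f}")
```

If a complex model can't beat baselines like these, the extra machinery isn't earning its keep - and large gaps between methods are exactly the differences worth thinking carefully about.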

Do you already have an idea where you could examine time series in your company? Please let us know. We always look forward to feedback!