Time-Series Forecasting in Microsoft Azure Automated Machine Learning (AutoML) PART 1.
A step-by-step guide to time-series forecasting and model deployment in Microsoft Azure AutoML.
In this article, I will show how to do time-series forecasting in Microsoft Azure Automated Machine Learning using the Kaggle Store Item Demand Forecasting Challenge. After building our model, we will deploy it for testing as well. The link to the open dataset is given below; download the dataset from Kaggle.
Store Item Demand Forecasting Challenge | Kaggle
When I started learning Microsoft Azure AutoML, I struggled to find a good step-by-step tutorial that shows how to do prediction or forecasting in Microsoft AutoML. There are a few, but when you step into something new, you need a simple tutorial to understand the basics. This blog will give you a good understanding of how to forecast in Microsoft Azure AutoML.
At first glance, Azure AutoML looks complicated and puzzling, but as you start playing around, it becomes a friendlier and easier tool for forecasting. It has its pros and cons, but it is a powerful tool, especially for people who are not comfortable with coding and machine learning. If you want to work manually instead, you can also use Python and R with the SDK.
In this part of the blog, I will show you how to build and deploy a time-series forecasting model automatically, without writing any code.
Introduction
Before starting with the forecasting, we need to understand what time-series forecasting is and when it is used.
Second, we will look at what Microsoft Azure AutoML is and a few of its features.
Time-Series Forecasting
Time-series forecasting is a machine learning technique that analyzes time-ordered data to predict future events. It provides reasonably accurate estimates of future trends based on historical time-series data.
Data is time-series data only when it is recorded or collected over a period of time at regular intervals, such as hours, days, months, or years. Not every dataset that happens to contain features such as a date-time column and sales can be considered a time-series dataset.
By using time-series forecasting we can analyze major patterns such as trends, seasonality, cyclicity, and irregularity. Time-series forecasting can be used in stock market analysis, weather prediction, pattern recognition, earthquake prediction, economic forecasting, census analysis, and so on.
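These patterns are easy to see in code. The following is a minimal Python sketch (plain pandas and NumPy, nothing Azure-specific) that builds a synthetic daily sales series with an upward trend and weekly seasonality, then recovers the trend with a 7-day rolling mean; all names and numbers here are illustrative, not from the Kaggle dataset:

```python
import numpy as np
import pandas as pd

# Synthetic daily "sales" series: upward trend + weekly seasonality + noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=365, freq="D")
trend = np.linspace(50, 100, len(dates))                    # slow upward trend
seasonality = 10 * np.sin(2 * np.pi * dates.dayofweek / 7)  # weekly cycle
noise = rng.normal(0, 2, len(dates))
sales = pd.Series(trend + seasonality + noise, index=dates, name="sales")

# A 7-day centered rolling mean averages out the weekly cycle,
# leaving (approximately) the trend component.
trend_estimate = sales.rolling(window=7, center=True).mean()

print(trend_estimate.dropna().iloc[0])   # near the start of the trend (~50)
print(trend_estimate.dropna().iloc[-1])  # near the end of the trend (~100)
```

This is exactly the kind of structure (trend plus seasonality) that AutoML's forecasting algorithms model for you automatically.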
Remember, forecasting relies heavily on historical and current data points. Without them, there is no basis for projecting possible future trends.
Microsoft AzureML
Microsoft AzureML is a cloud-based environment that helps you train, deploy, automate, manage, and track machine learning models. AzureML can be used for almost all kinds of machine learning algorithms, whether supervised or unsupervised. I believe it has around 18 different algorithms, such as AutoArima, Prophet, XGBoost, etc. It is also a very good and powerful tool for image classification. If you want to read the official documentation and introduction, the link is given below:
Getting Started
Launch the Azure Machine Learning studio. After launching, this is how it looks:
Note: There are three ways we can forecast and build a machine learning model in AutoML: the first is Notebooks, the second is Automated ML, and the third is the Designer.
Notebooks: Here you can do the coding using R or Python with the SDK.
Automated ML: Here you can do automatic forecasting without writing any code. It will automatically do the forecasting and build a model for you.
Designer: Here you can forecast and build a machine learning model by visually connecting the datasets and modules.
Datasets: Here you can import and store your datasets.
Experiments: Before doing any forecasting, you have to give a name to your experiment. The Notebook and AutoML experiments are listed here.
Pipelines: The experiments built with the Designer are listed here.
Models: Here you see the list of registered models.
Endpoints: Here you can see your deployed models and you can also test them after deploying.
Compute: To train a machine learning model using any of the three approaches, you need to create a compute cluster to run the experiments. More information is given in the link below.
What are compute targets — Azure Machine Learning | Microsoft Docs
Datastores: You can also ingest and fetch data using Azure Blob Storage, Data Lake, etc.
Data Labeling: This is used for labelling the images for image classification or object detection.
Linked Services: Linked Services is a collection of external services you can connect with the workspace.
Now let’s start by importing our dataset.
Import Dataset
Download the Store Item Demand Forecasting dataset from Kaggle. We are using this dataset because it is large, with about 913,000 observations, and it covers multiple stores and items. The dataset looks like this:
We will use the training dataset to predict future sales. It consists of 5 years of store-item sales data: daily sales for 10 different stores and 50 different items from 2013-01-01 until 2017-12-31. The objective is to predict the sales of 50 different items at 10 different stores. The store locations could differ, and the stores might behave differently. We don't have to predict total sales; we need to predict sales per store and per item.
Note: The frequency of sales could be hourly, daily, weekly, monthly and yearly.
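As a quick sanity check on those numbers: 2013-01-01 through 2017-12-31 is 1,826 days (2016 is a leap year), and 10 stores × 50 items gives 500 separate daily series, which works out to exactly the 913,000 data rows in the training file:

```python
import pandas as pd

# Dates covered by the training data, inclusive on both ends.
days = pd.date_range("2013-01-01", "2017-12-31", freq="D")
n_stores, n_items = 10, 50

n_rows = len(days) * n_stores * n_items
print(len(days))  # 1826
print(n_rows)     # 913000
```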
The time-series forecasting method requires some additional information, such as the seasonality, the frequency, and the forecast horizon. In our case, the frequency is daily, since we have one sales value per day. If the dataset has no frequency, AutoML won't work. If you have some other dataset without a consistent frequency, that is a problem: you first need to clean the data by resampling it to some frequency, such as hourly, daily, or weekly. If there is no frequency, or the frequency is not consistent, Azure AutoML will throw an error.
Time-series forecasting is a real pain if the frequency is missing. Cleaning and resampling the data is an additional task that is not easy, because resampling can lose data. I won't go deep into this at the moment; we will continue with our store-item demand forecasting.
Missing values and duplicates: There should be no missing values or duplicates in the dataset. If there are, please get rid of them first and then import the data.
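All three cleanup steps (dropping duplicates, enforcing a daily frequency, handling the gaps that resampling creates) can be done in a few lines of pandas before you upload. This is a sketch on a tiny stand-in frame with the same columns as the Kaggle file (date, store, item, sales); the same steps apply to the real file:

```python
import pandas as pd

# Small stand-in for the Kaggle file, with one duplicate row (Jan 1)
# and one missing day (Jan 3) to illustrate the cleanup.
df = pd.DataFrame({
    "date":  ["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-04"],
    "store": [1, 1, 1, 1],
    "item":  [1, 1, 1, 1],
    "sales": [13, 13, 11, 14],
})
df["date"] = pd.to_datetime(df["date"])

# 1. Drop exact duplicate rows.
df = df.drop_duplicates()

# 2. Resample each (store, item) series to a strict daily frequency,
#    so every calendar day has exactly one row; inserted days get NaN.
df = (df.set_index("date")
        .groupby(["store", "item"])["sales"]
        .resample("D")
        .mean()
        .reset_index())

# 3. Fill the missing values created by the gaps (forward-fill here;
#    the right strategy depends on your data).
df["sales"] = df["sales"].ffill()

print(len(df))                   # 4 rows: one per day, Jan 1-4
print(df["sales"].isna().sum())  # 0
```

After this, every (store, item) pair has one row per day with no gaps, which is the consistent frequency AutoML expects.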
Go to Datasets and create a dataset from local files. We first need to register our dataset by uploading it into Datasets. You can also use the web or a datastore; in our case, I imported the data from local files.
After that, give a name to the dataset. Keep the type Tabular, because AutoML uses tabular datasets. In the description, you can describe your dataset. Click Next.
Upload the training dataset and click next when it is loaded.
Next, you will see the settings and preview. There is no need to change anything except the column headers: we need to use the headers, so change "No headers" to use the headers from the first line. Click Next and keep the schema unchanged, because it is fetched automatically from the dataset. Click Next and finally create the dataset.
When it’s done, you can see your uploaded dataset in Datasets. If you click on the dataset you will see the description of the uploaded dataset.
You can go to explore to see the preview of the data. You can click on Profile to check the dataset profile.
To delete the dataset, you need to unregister it and it will get deleted.
To explore more about the dataset, you can Generate a Profile by using your compute target. We will create the compute target in the next steps.
We need to create a compute target to run our experiments. This compute target is mandatory, by using this compute target you can define your compute power. The link below explains what compute target is.
What are compute targets — Azure Machine Learning | Microsoft Docs
Create a new Automated ML run
After importing the dataset successfully, let's start by creating our first new AutoML experiment to forecast the sales. In a few steps, you will create your first machine-learning model to predict future sales.
Go to Automated ML; first, we need to select the dataset. I have selected the store dataset. After selecting the dataset, click Next.
Now we have to configure the run. Here we give a name to our experiment: click Create new and enter a name. Second, select your target column. The target column is what you want to predict. In our case, we are predicting sales, so our target column is sales.
The next step is to select a compute cluster. You have to create a compute cluster to run the experiment. You can create one or more clusters, depending on how much power you need for your experiments, and you only have to create a cluster once. Click Create compute cluster.
Here you can select your preferences. I have selected the default, which is Dedicated, and I am going to use a CPU for time-series forecasting. Next, it will show the recommended options below.
I have chosen Dedicated, CPU, and Standard_DS3_v2, which has 4 cores and costs about $0.27 per hour. It is a classical machine learning compute size and I would recommend it, but it depends on your requirements; you can choose a cluster size of your choice. Creating a cluster takes some time.
After successfully creating a cluster, select a compute and click next.
Next, select the kind of task you want to perform: Classification, Regression, or Time-Series Forecasting. We are going to do time-series forecasting, so we will select that.
After selecting time-series forecasting, we need to select the time column; ours is date. Next is the time series identifier: here we select the columns that uniquely identify a time series in data that has multiple rows with the same timestamp.
In our case, we have multiple stores and items, so our time series identifiers are store and item. If you only wanted to forecast total sales for all stores and items combined, there would be no need to put store and item there. But we have to predict the sales for every store and item.
Every store behaves differently depending on its geographic location. One store could be in Delhi, one in Mumbai, one in Bangalore, etc. Different stores have different seasonality and trends, so we will do the time-series forecasting for multiple stores and items. AutoML will automatically do the grouping and create a time-series forecast for each group individually. This is the best feature of AutoML. For more, click on the link below.
Auto-train a time-series forecast model — Azure Machine Learning | Microsoft Docs
After selecting store and item, the next settings are Frequency and Forecast Horizon. We know our frequency is daily. You can manually select Day, or you can let AutoML auto-detect the frequency. If there is no frequency in the dataset, you will get an error whether you set it manually or use autodetect. We will let it autodetect the frequency.
The forecast horizon defines how many periods forward you would like to forecast. The horizon is in units of the time-series frequency: the units are based on the time interval of your training data (for example, daily or monthly) that the forecaster should predict out.
Our time interval is daily, but we will keep it on autodetect; it will automatically detect the time interval of our dataset. Next, you can choose additional configuration settings.
The primary metric is the metric you want to use to optimize your model. When you run your experiment, AutoML will try multiple algorithms, such as AutoArima, Prophet, etc., and the primary metric decides how the resulting models are ranked. Also tick the Explain best model checkbox.
The model with the best metric score will be at the top of the list. Explain best model will give you an explanation of the best model among all of them; we will see that at the end.
You can choose from Normalized RMSE, R2 score, or Normalized MAE; I have kept Normalized RMSE. There are around 18 different algorithms, and you can block the ones you don't want to use. If you want to use only one algorithm out of the 18, you can block the other 17 and it will run only the unblocked one.
We are not using Target lags or Target rolling window size. Target lags are used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. Target rolling window size is specified when you only want to consider a certain amount of history when training the model. You can read about both by clicking the information (i) symbol next to them.
There is an important feature called Country or region for holidays.
This is important when using time-series forecasting. Holiday information for the selected country or region will be added to the training data. Holiday information includes the holiday name and whether most people have paid time off for the holiday. This will make your model better and will improve the forecasting accuracy as well. If you don’t know the region of the dataset, keep it blank.
Training job time: The maximum amount of time, in hours, for an experiment to train the model. The default is 3 hours, and I kept it at that, but you can choose how long to train; 1 hour should be enough.
Validation type: The type of validation; for forecasting it supports k-fold cross-validation. I set the number of cross-validations to 3.
Concurrency: The maximum number of iterations to execute in parallel. This should not exceed the number of nodes of the compute target or training cluster. Save your settings and click Finish.
View featurization settings lets you select or deselect the features used for training the model. We are not touching it.
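For reference, the settings the studio has collected in these steps can also be expressed with the Python SDK (azureml v1), which we will use in a later part. This is a hedged sketch, not a tested script: train_dataset and compute_target stand for the registered tabular dataset and the compute cluster created earlier, and the horizon value is just an example.

```python
from azureml.train.automl import AutoMLConfig
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(
    time_column_name="date",                        # our time column
    forecast_horizon=90,                            # example: forecast 90 days ahead
    time_series_id_column_names=["store", "item"],  # one series per store/item pair
    freq="D",                                       # daily; omit to let AutoML detect it
)

automl_config = AutoMLConfig(
    task="forecasting",
    primary_metric="normalized_root_mean_squared_error",  # Normalized RMSE
    training_data=train_dataset,        # assumed: the registered store dataset
    label_column_name="sales",          # target column
    compute_target=compute_target,      # assumed: the cluster created earlier
    experiment_timeout_hours=1,
    n_cross_validations=3,              # k-fold cross-validation with k = 3
    max_concurrent_iterations=4,        # keep at or below the cluster's node count
    model_explainability=True,          # "Explain best model"
    forecasting_parameters=forecasting_parameters,
)
```

Submitting this config to an Experiment does the same thing as clicking Finish in the studio.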
After finish, you will see your experiment is running in AutoML. It will take time to finish the experiments.
You will see the Run, Run ID, Experiment name, etc. A green tick mark means the run completed; a red mark means there was an error or it did not complete. If there is a problem, you will usually get a red mark or error within the first 10–15 minutes; otherwise, it will complete with a green mark.
When you click on the Run ID, you will see the screen like this below.
Here you see the details of the experiment. Click on Models and you will see all the models. When it's finished and you get a green mark, congratulations!
After completion, I checked and the StackEnsemble model was the best followed by ProphetModel based on their metric score.
I had clicked on the explain model, so it automatically created that explanation. For Prophet, I did it later to check the difference between the two.
When you click on the StackEnsemble model (or your best model), you will see its details. You can go to the Explanations preview to check how our model performed, its most important features, etc. After that, go to Metrics to check the scores and the accuracy; as you can see, our model performed very well, with good scores. You can also check the Predicted vs. True graph, which shows the relationship between predicted values and their corresponding true values for a regression problem. As you can see, our model's predictions are very close to the actual values: the dotted line is the actual values and the solid line is what our model predicted.
Deployment
In the next blog, we will deploy our model for testing. We will not do a full web-service deployment; we will deploy the best model just to test how it works and see the results.
Later, we will also do the store-item demand forecasting using the Python SDK. This will give you a good understanding of both AutoML and Notebooks (Python SDK).
We will also test it in Postman with the REST endpoint. I hope you liked this step-by-step tutorial. Keep playing around and you will learn more.
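As a small preview of that endpoint test, the request body you would send (from Postman or Python) is plain JSON. The shape below is a hedged sketch: scoring scripts generated by AutoML typically accept a "data" list of input rows with the same columns as the training data minus the target, but check the schema shown on your own endpoint's page. The URI and key in the comment are placeholders, not real values.

```python
import json

# Hypothetical request body for the forecasting endpoint: rows to predict,
# with the training columns (date, store, item) but without the target (sales).
payload = {
    "data": [
        {"date": "2018-01-01", "store": 1, "item": 1},
        {"date": "2018-01-02", "store": 1, "item": 1},
    ]
}
body = json.dumps(payload)

# To call a deployed endpoint, you would POST this body (sketch only):
#
#   import requests
#   headers = {"Content-Type": "application/json",
#              "Authorization": "Bearer <your-endpoint-key>"}
#   response = requests.post("<your-scoring-uri>", data=body, headers=headers)
#   print(response.json())

print(body)
```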
If you liked my blog and my tutorial helped you to understand AutoML, please click the 👏 button and share it with your friends and others. I’d love to hear your thoughts, so feel free to leave comments below.
Thank you so much :)
References:
The Best Guide to Time Series Forecasting in R (simplilearn.com)
Store Item Demand Forecasting Challenge | Kaggle
What are compute targets — Azure Machine Learning | Microsoft Docs
Use AutoML to create models & deploy — Azure Machine Learning | Microsoft Docs