How to Scale Machine Learning Data From Scratch With Python

Many machine learning algorithms expect data to be scaled consistently.

There are two popular methods that you should consider when scaling your data for machine learning: normalization and standardization.

In this tutorial, you will discover how you can rescale your data for machine learning. After reading this tutorial you will know:

  • How to normalize your data from scratch.
  • How to standardize your data from scratch.
  • When to normalize as opposed to standardize data.

Let’s get started.


Description

Many machine learning algorithms expect the scale of the input and even the output data to be equivalent.

Scaling can help in methods that weight inputs in order to make a prediction, such as linear regression and logistic regression.

It is practically required in methods that combine weighted inputs in complex ways, such as artificial neural networks and deep learning.

In this tutorial, we are going to practice rescaling one standard machine learning dataset in CSV format: the Pima Indians diabetes dataset.

It contains 768 rows and 9 columns. All of the values in the file are numeric and can be treated as floating point values. We will first learn how to load the file, then how to convert the loaded strings to numeric values.

You can learn more about this dataset on the UCI Machine Learning Repository.

Tutorial

This tutorial is divided into 3 parts:

  1. Normalize Data.
  2. Standardize Data.
  3. When to Normalize and Standardize.

These steps will provide the foundations you need to handle scaling your own data.

1. Normalize Data

Normalization can refer to different techniques depending on context.

Here, we use normalization to refer to rescaling an input variable to the range between 0 and 1.

Normalization requires that you know the minimum and maximum values for each attribute.

This can be estimated from training data or specified directly if you have deep knowledge of the problem domain.

You can easily estimate the minimum and maximum values for each attribute in a dataset by enumerating through the values.

The snippet of code below defines the dataset_minmax() function that calculates the min and max value for each attribute in a dataset, then returns a list of these minimum and maximum values, one [min, max] pair per column.
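A minimal sketch of such a function might look like the following, assuming the dataset is a list of rows (a list of lists) in which every value is already numeric:

```python
def dataset_minmax(dataset):
    """Return a [min, max] pair for each column of a list-of-lists dataset."""
    minmax = []
    for i in range(len(dataset[0])):
        # Collect every value in column i, then record its extremes
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax
```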



We can contrive a small dataset for testing as follows:
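The values here are arbitrary and chosen only for illustration: two columns (called x1 and x2 here) and two rows, stored as a list of lists.

```python
# Contrived dataset: each inner list is one row of [x1, x2]
dataset = [[50, 30], [20, 90]]
print(dataset)
```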




With this contrived dataset, we can test our function for calculating the min and max for each column.
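As a sketch, assuming the illustrative two-row dataset [[50, 30], [20, 90]] and a dataset_minmax() function written along the lines described earlier (it is redefined here so the example runs on its own):

```python
def dataset_minmax(dataset):
    """Return a [min, max] pair for each column of a list-of-lists dataset."""
    minmax = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax

# Contrived dataset (illustrative values only)
dataset = [[50, 30], [20, 90]]
print(dataset)

# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)
# Prints:
# [[50, 30], [20, 90]]
# [[20, 50], [30, 90]]
```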



Running the example produces the following output.

First, the dataset is printed in a list of lists format, then the min and max for each column is printed in the format column1: min,max and column2: min,max.

For example:



Once we have estimates of the minimum and maximum allowed values for each column, we can normalize the raw data to the range 0 to 1.

The calculation to normalize a single value for a column is:
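Expressed as a short sketch, with variable names and numbers chosen here purely for illustration:

```python
# Normalize one value given the known minimum and maximum of its column
value, minimum, maximum = 35.0, 20.0, 50.0
scaled_value = (value - minimum) / (maximum - minimum)
print(scaled_value)  # 0.5
```

A value equal to the column minimum maps to 0, and a value equal to the column maximum maps to 1.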



Below is an implementation of this in a function called normalize_dataset() that normalizes values in each column of a provided dataset.
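One possible implementation is sketched here. It assumes that minmax holds one [min, max] pair per column, that the dataset is modified in place, and that no column is constant (a column whose min equals its max would cause a division by zero).

```python
def normalize_dataset(dataset, minmax):
    """Rescale every column of dataset in place to the range 0 to 1."""
    for row in dataset:
        for i in range(len(row)):
            col_min, col_max = minmax[i]
            # Assumption: col_max > col_min, otherwise this divides by zero
            row[i] = (row[i] - col_min) / (col_max - col_min)
```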



We can tie this function together with the dataset_minmax() function and normalize the contrived dataset.
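A self-contained sketch of the combined example follows, again assuming the illustrative dataset [[50, 30], [20, 90]] and in-place normalization:

```python
def dataset_minmax(dataset):
    """Return a [min, max] pair for each column."""
    minmax = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax

def normalize_dataset(dataset, minmax):
    """Rescale every column of dataset in place to the range 0 to 1."""
    for row in dataset:
        for i in range(len(row)):
            col_min, col_max = minmax[i]
            row[i] = (row[i] - col_min) / (col_max - col_min)

# Contrived dataset (illustrative values only)
dataset = [[50, 30], [20, 90]]
print(dataset)

# Estimate min and max, then normalize
minmax = dataset_minmax(dataset)
normalize_dataset(dataset, minmax)
print(dataset)
# Prints:
# [[50, 30], [20, 90]]
# [[1.0, 0.0], [0.0, 1.0]]
```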



Running this example prints the output below, including the normalized dataset.



We can combine this code with code for loading a CSV dataset and load and normalize the Pima Indians diabetes dataset.

Download the Pima Indians dataset from the UCI Machine Learning Repository and place it in your current directory with the name pima-indians-diabetes.csv. Open the file and delete any empty lines at the bottom.

The example first loads the dataset and converts the values for each column from string to floating point values. The minimum and maximum values for each column are estimated from the dataset, and finally, the values in the dataset are normalized.
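A sketch of such an example is given below. The helper names load_csv() and str_column_to_float() are assumptions introduced for this illustration, and the file is assumed to have no header row. The script only runs the normalization if the file is found in the current directory.

```python
import os
from csv import reader

def load_csv(filename):
    """Load a CSV file as a list of rows of strings, skipping empty lines."""
    dataset = []
    with open(filename, "r", newline="") as file:
        for row in reader(file):
            if row:
                dataset.append(row)
    return dataset

def str_column_to_float(dataset, column):
    """Convert one column of string values to floats, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())

def dataset_minmax(dataset):
    """Return a [min, max] pair for each column."""
    minmax = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax

def normalize_dataset(dataset, minmax):
    """Rescale every column of dataset in place to the range 0 to 1."""
    for row in dataset:
        for i in range(len(row)):
            col_min, col_max = minmax[i]
            row[i] = (row[i] - col_min) / (col_max - col_min)

filename = "pima-indians-diabetes.csv"
if os.path.exists(filename):
    dataset = load_csv(filename)
    print("Loaded data file {0} with {1} rows and {2} columns".format(
        filename, len(dataset), len(dataset[0])))
    # Convert every column from string to float
    for i in range(len(dataset[0])):
        str_column_to_float(dataset, i)
    print(dataset[0])
    # Estimate min and max for each column, then normalize
    minmax = dataset_minmax(dataset)
    normalize_dataset(dataset, minmax)
    print(dataset[0])
else:
    print("File not found: " + filename)
```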



Running the example produces the output below.

The first record from the dataset is printed before and after normalization, showing the effect of the scaling.



2. Standardize Data

Standardization is a rescaling technique that centers the distribution of the data on the value 0 and scales it so that the standard deviation is 1.

Together, the mean and the standard deviation can be used to summarize a normal distribution, also called the Gaussian distribution or bell curve.

It requires that the mean and standard deviation of the values for each column be known prior to scaling. As with normalizing above, we can estimate these values from training data, or use domain knowledge to specify their values.

Let’s start with creating functions to estimate the mean and standard deviation statistics for each column from a dataset.

The mean describes the middle or central tendency for a collection of numbers. The mean for a column is calculated as the sum of all values for a column divided by the total number of values.
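For a single column of values, this calculation can be sketched as follows (the values are illustrative only):

```python
# Mean of one column: sum of the values divided by how many there are
values = [50, 20, 30]
mean = sum(values) / float(len(values))
print(round(mean, 3))  # 33.333
```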



The function below named column_means() calculates the mean values for each column in the dataset.
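A minimal sketch of such a function, assuming a list-of-lists dataset with numeric values:

```python
def column_means(dataset):
    """Return the mean of each column in a list-of-lists dataset."""
    means = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means.append(sum(col_values) / float(len(dataset)))
    return means
```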



The standard deviation describes the average spread of values from the mean. It can be calculated by summing the squared differences between each value and the mean, dividing that sum by the number of values minus 1, and taking the square root of the result.
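For a single column, the sample standard deviation can be sketched like this (illustrative values; the variable names are assumptions of this example):

```python
from math import sqrt

values = [50, 20, 30]
mean = sum(values) / float(len(values))
# Sum of squared differences from the mean, divided by (n - 1)
variance = sum((x - mean) ** 2 for x in values) / float(len(values) - 1)
stdev = sqrt(variance)
print(round(stdev, 3))  # 15.275
```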



The function below named column_stdevs() calculates the standard deviation of values for each column in the dataset and assumes the means have already been calculated.
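One way to write it is sketched below, assuming means is a list with one precomputed mean per column and the dataset has at least two rows:

```python
from math import sqrt

def column_stdevs(dataset, means):
    """Return the sample standard deviation of each column."""
    stdevs = []
    for i in range(len(dataset[0])):
        # Squared difference of every value in column i from that column's mean
        squared_diffs = [(row[i] - means[i]) ** 2 for row in dataset]
        stdevs.append(sqrt(sum(squared_diffs) / float(len(dataset) - 1)))
    return stdevs
```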



Again, we can contrive a small dataset to demonstrate the estimate of the mean and standard deviation from a dataset.
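The values below are arbitrary and used only for illustration: three rows with two columns each, stored as a list of lists.

```python
# Contrived dataset: each inner list is one row of [x1, x2]
dataset = [[50, 30], [20, 90], [30, 50]]
print(dataset)
```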




Using an Excel spreadsheet, we can calculate the mean and standard deviation for each column as follows:



Using the contrived dataset, we can estimate the summary statistics.
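A self-contained sketch, assuming the illustrative dataset [[50, 30], [20, 90], [30, 50]] and rounding the printed statistics to three decimal places for readability:

```python
from math import sqrt

def column_means(dataset):
    """Return the mean of each column."""
    means = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means.append(sum(col_values) / float(len(dataset)))
    return means

def column_stdevs(dataset, means):
    """Return the sample standard deviation of each column."""
    stdevs = []
    for i in range(len(dataset[0])):
        squared_diffs = [(row[i] - means[i]) ** 2 for row in dataset]
        stdevs.append(sqrt(sum(squared_diffs) / float(len(dataset) - 1)))
    return stdevs

# Contrived dataset (illustrative values only)
dataset = [[50, 30], [20, 90], [30, 50]]
print(dataset)

# Estimate the mean and standard deviation of each column
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
print([round(m, 3) for m in means])
print([round(s, 3) for s in stdevs])
# Prints:
# [[50, 30], [20, 90], [30, 50]]
# [33.333, 56.667]
# [15.275, 30.551]
```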



Executing the example provides the following output, matching the numbers calculated in the spreadsheet.



Once the summary statistics are calculated, we can easily standardize the values in each column.

The calculation to standardize a given value is as follows:
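As a short sketch, with variable names and numbers chosen here purely for illustration:

```python
# Standardize one value given the mean and standard deviation of its column
value, mean, stdev = 50.0, 40.0, 5.0
standardized_value = (value - mean) / stdev
print(standardized_value)  # 2.0
```

The result reads as "this value is 2 standard deviations above the mean".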



Below is a function named standardize_dataset() that implements this equation:
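A minimal sketch of such a function, assuming means and stdevs each hold one value per column, that the dataset is modified in place, and that no standard deviation is zero:

```python
def standardize_dataset(dataset, means, stdevs):
    """Standardize every column of dataset in place."""
    for row in dataset:
        for i in range(len(row)):
            # Assumption: stdevs[i] != 0, otherwise this divides by zero
            row[i] = (row[i] - means[i]) / stdevs[i]
```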



Combining this with the functions to estimate the mean and standard deviation summary statistics, we can standardize our contrived dataset.
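A self-contained sketch of the combined example, assuming the illustrative dataset [[50, 30], [20, 90], [30, 50]] and rounding the printed values to three decimal places:

```python
from math import sqrt

def column_means(dataset):
    """Return the mean of each column."""
    means = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means.append(sum(col_values) / float(len(dataset)))
    return means

def column_stdevs(dataset, means):
    """Return the sample standard deviation of each column."""
    stdevs = []
    for i in range(len(dataset[0])):
        squared_diffs = [(row[i] - means[i]) ** 2 for row in dataset]
        stdevs.append(sqrt(sum(squared_diffs) / float(len(dataset) - 1)))
    return stdevs

def standardize_dataset(dataset, means, stdevs):
    """Standardize every column of dataset in place."""
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]

# Contrived dataset (illustrative values only)
dataset = [[50, 30], [20, 90], [30, 50]]
print(dataset)

# Estimate the summary statistics, then standardize
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
standardize_dataset(dataset, means, stdevs)
print([[round(v, 3) for v in row] for row in dataset])
# Prints:
# [[50, 30], [20, 90], [30, 50]]
# [[1.091, -0.873], [-0.873, 1.091], [-0.218, -0.218]]
```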



Executing this example produces the following output, showing standardized values for the contrived dataset.



Again, we can demonstrate the standardization of a machine learning dataset.

The example below demonstrates how to load and standardize the Pima Indians diabetes dataset, assumed to be in the current working directory as in the previous normalization example.
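A sketch of such an example follows. The helper names load_csv() and str_column_to_float() are assumptions introduced for this illustration, and the file is assumed to have no header row. The script only runs the standardization if the file is found in the current directory.

```python
import os
from csv import reader
from math import sqrt

def load_csv(filename):
    """Load a CSV file as a list of rows of strings, skipping empty lines."""
    dataset = []
    with open(filename, "r", newline="") as file:
        for row in reader(file):
            if row:
                dataset.append(row)
    return dataset

def str_column_to_float(dataset, column):
    """Convert one column of string values to floats, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())

def column_means(dataset):
    """Return the mean of each column."""
    means = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means.append(sum(col_values) / float(len(dataset)))
    return means

def column_stdevs(dataset, means):
    """Return the sample standard deviation of each column."""
    stdevs = []
    for i in range(len(dataset[0])):
        squared_diffs = [(row[i] - means[i]) ** 2 for row in dataset]
        stdevs.append(sqrt(sum(squared_diffs) / float(len(dataset) - 1)))
    return stdevs

def standardize_dataset(dataset, means, stdevs):
    """Standardize every column of dataset in place."""
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]

filename = "pima-indians-diabetes.csv"
if os.path.exists(filename):
    dataset = load_csv(filename)
    print("Loaded data file {0} with {1} rows and {2} columns".format(
        filename, len(dataset), len(dataset[0])))
    # Convert every column from string to float
    for i in range(len(dataset[0])):
        str_column_to_float(dataset, i)
    print(dataset[0])
    # Estimate mean and standard deviation, then standardize
    means = column_means(dataset)
    stdevs = column_stdevs(dataset, means)
    standardize_dataset(dataset, means, stdevs)
    print(dataset[0])
else:
    print("File not found: " + filename)
```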



Running the example prints the first row of the dataset, first in the raw format as loaded and then standardized, which lets us compare the two.



3. When to Normalize and Standardize

Standardization is a scaling technique that assumes your data conforms to a normal distribution.

If a given data attribute is normal or close to normal, this is probably the scaling method to use.

It is good practice to record the summary statistics used in the standardization process, so that you can apply them when standardizing data in the future that you may want to use with your model.

Normalization is a scaling technique that does not assume any specific distribution.

If your data is not normally distributed, consider normalizing it prior to applying your machine learning algorithm.

It is good practice to record the minimum and maximum values for each column used in the normalization process, again, in case you need to normalize new data in the future to be used with your model.
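One way to follow this practice is sketched below: estimate the minimum and maximum from training rows, keep them, and reuse them on a new row. The normalize_row() helper and all of the values are hypothetical, introduced only for this illustration.

```python
def dataset_minmax(dataset):
    """Return a [min, max] pair for each column."""
    minmax = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax

def normalize_row(row, minmax):
    """Return a normalized copy of one row using previously recorded min/max."""
    return [(value - lo) / (hi - lo) for value, (lo, hi) in zip(row, minmax)]

# Statistics are estimated once, from the training data only
train = [[50, 30], [20, 90]]
minmax = dataset_minmax(train)

# Later, a new row is scaled with the recorded statistics
new_row = [35, 60]
print(normalize_row(new_row, minmax))  # [0.5, 0.5]
```

Note that a new value outside the recorded minimum and maximum will scale to a value outside the range 0 to 1.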

Extensions

There are many other data transforms you could apply.

The idea of data transforms is to best expose the structure of your problem in your data to the learning algorithm.

It may not be clear what transforms are required upfront. A combination of trial and error and exploratory data analysis (plots and stats) can help tease out what may work.

Below are some additional transforms you may want to consider researching and implementing:

  • Normalization that permits a configurable range, such as -1 to 1 and more.
  • Standardization that permits a configurable spread, such as 1, 2 or more standard deviations from the mean.
  • Exponential transforms such as logarithm, square root and exponents.
  • Power transforms such as Box-Cox for reducing skew so that the data is closer to normally distributed.

Review

In this tutorial, you discovered how to rescale your data for machine learning from scratch.

Specifically, you learned:

  • How to normalize data from scratch.
  • How to standardize data from scratch.
  • When to use normalization or standardization on your data.

Do you have any questions about scaling your data or about this post?
Ask your question in the comments below and I will do my best to answer.