How to Evaluate Gradient Boosting Models with XGBoost in Python

The goal of predictive modeling is to develop a model that is accurate on unseen data.

This can be achieved using statistical techniques that carefully use the training dataset to estimate the performance of the model on new and unseen data.

In this tutorial you will discover how you can evaluate the performance of your gradient boosting models with XGBoost in Python.

After completing this tutorial, you will know:

  • How to evaluate the performance of your XGBoost models using train and test datasets.
  • How to evaluate the performance of your XGBoost models using k-fold cross validation.

Let’s get started.

Evaluate XGBoost Models With Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets.

We can take our original dataset and split it into two parts. Train the algorithm on the first part, then make predictions on the second part and evaluate the predictions against the expected results.

The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train.

A downside of this technique is that it can have a high variance. This means that differences in the training and test dataset can result in meaningful differences in the estimate of model accuracy.

We can split the dataset into a train and test set using the train_test_split() function from the scikit-learn library. For example, we can split the dataset into a 67% and 33% split for training and test sets as follows:
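A minimal sketch of the split, assuming a recent scikit-learn where train_test_split() lives in sklearn.model_selection. The arrays X and y are random stand-ins rather than the tutorial's dataset, and the seed value of 7 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 8 input features, binary labels.
rng = np.random.RandomState(7)
X = rng.rand(100, 8)
y = rng.randint(0, 2, size=100)

# test_size=0.33 holds back about 33% of the rows for testing;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)
print(X_train.shape, X_test.shape)
```

Fixing random_state matters when comparing models, because every model is then evaluated on exactly the same held-out rows.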

The full code listing is provided below using the Pima Indians onset of diabetes dataset, assumed to be in the current working directory. An XGBoost model with default configuration is fit on the training dataset and evaluated on the test dataset.

Running this example summarizes the performance of the model on the test set.

Evaluate XGBoost Models With k-Fold Cross Validation

Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split.

It works by splitting the dataset into k-parts (e.g. k=5 or k=10). Each split of the data is called a fold. The algorithm is trained on k-1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set.
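To make the procedure concrete, here is a small pure-Python sketch of how the folds can be laid out. The function kfold_indices is a hypothetical helper written for illustration; it uses contiguous, unshuffled folds, whereas library implementations usually offer shuffling as well.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # held-back fold
        train_idx = indices[:start] + indices[start + size:]  # other k-1 folds
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print("fold %d: train=%s test=%s" % (i, train_idx, test_idx))
```

Every row index lands in the held-back fold exactly once and in the training portion the other k-1 times.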

After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.

The result is a more reliable estimate of the performance of the algorithm on new data. It is more reliable because the algorithm is trained and evaluated multiple times on different subsets of the data, so no single split dominates the estimate.

The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest-sized datasets in the thousands or tens of thousands of observations, k values of 3, 5 and 10 are common.

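We can use the k-fold cross validation support provided in scikit-learn. First we create the KFold object, specifying the number of folds. (Older scikit-learn releases also required the size of the dataset here; since version 0.18 the KFold class in sklearn.model_selection takes the number of folds through its n_splits argument and no longer needs the dataset size.) We can then use this scheme with the specific dataset. The cross_val_score() function from scikit-learn allows us to evaluate a model using the cross validation scheme and returns an array of scores, one for the model trained on each fold.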

The full code listing for evaluating an XGBoost model with k-fold cross validation is provided below for completeness.

Running this example summarizes the performance of the default model configuration on the dataset including both the mean and standard deviation classification accuracy.

If you have many classes for a classification type predictive modeling problem or the classes are imbalanced (there are a lot more instances for one class than another), it can be a good idea to create stratified folds when performing cross validation.

This has the effect of enforcing the same distribution of classes in each fold as in the whole training dataset when performing the cross validation evaluation. The scikit-learn library provides this capability in the StratifiedKFold class.
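A quick way to see the effect is to count the minority class in each test fold. This sketch uses made-up labels with a 90/10 imbalance; the feature matrix is a placeholder because only the labels influence how stratified folds are drawn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 rows of class 0 and 10 rows of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Count how many class-1 rows fall into each held-back fold.
minority_counts = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_counts)
```

With 10 minority rows and 5 folds, each fold should receive 2 of them, which matches the 10% rate in the full dataset. A plain KFold gives no such guarantee.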

Below is the same example modified to use stratified cross validation to evaluate an XGBoost model.

Running this example prints the mean and standard deviation of the classification accuracy across the stratified folds.

What Techniques to Use When

  • Generally, k-fold cross validation with k set to 3, 5, or 10 is the gold standard for evaluating the performance of a machine learning algorithm on unseen data.
  • Use stratified cross validation to enforce class distributions when there are a large number of classes or an imbalance in instances for each class.
  • Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.

The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions.

If in doubt, use 10-fold cross validation for regression problems and stratified 10-fold cross validation on classification problems.

In this tutorial you discovered how you can evaluate your XGBoost models by estimating how well they are likely to perform on unseen data.

Specifically, you learned:

  • How to split your dataset into train and test subsets for training and evaluating the performance of your model.
  • How you can create k XGBoost models on different subsets of the dataset and average the scores to get a more robust estimate of model performance.
  • Heuristics to help choose between train-test split and k-fold cross validation for your problem.

Do you have any questions on how to evaluate the performance of XGBoost models or about this post? Ask your questions in the comments below and I will do my best to answer.