The goal of predictive modeling is to develop a model that is accurate on unseen data.
This can be achieved using statistical techniques where the training dataset is carefully used to estimate the performance of the model on new and unseen data.
In this tutorial you will discover how you can evaluate the performance of your gradient boosting models with XGBoost in Python.
After completing this tutorial, you will know:
 How to evaluate the performance of your XGBoost models using train and test datasets.
 How to evaluate the performance of your XGBoost models using k-fold cross validation.
Let’s get started.
Evaluate XGBoost Models With Train and Test Sets
The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets.
We can take our original dataset and split it into two parts. Train the algorithm on the first part, then make predictions on the second part and evaluate the predictions against the expected results.
The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.
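To make the mechanics concrete, here is a minimal sketch of a 67%/33% split done by hand with only the standard library. The helper name split_indices and the 100-row dataset size are assumptions for illustration; this is not how scikit-learn implements its own splitting.

```python
import random

def split_indices(n_rows, test_fraction=0.33, seed=7):
    """Shuffle the row indices, then carve off a test portion (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded so the split is repeatable
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train rows, test rows)

# Assumed dataset size of 100 rows, purely for illustration
train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 67 33
```

Every row lands in exactly one of the two parts, so the model is never evaluated on a row it was trained on.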
This algorithm evaluation technique is fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train.
A downside of this technique is that it can have a high variance. This means that differences in the training and test dataset can result in meaningful differences in the estimate of model accuracy.
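The variance is easy to see in a small experiment. The sketch below is a simplification under stated assumptions: the labels are synthetic, the "model" is a fixed list of predictions that is right about 75% of the time, and only the choice of test rows changes between runs. A real experiment would also retrain the model on each split.

```python
import random
import statistics

def holdout_accuracy(labels, predictions, test_fraction, seed):
    """Accuracy on a randomly chosen test portion (hypothetical helper)."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(idx) * test_fraction))
    test_idx = idx[:n_test]
    correct = sum(labels[i] == predictions[i] for i in test_idx)
    return correct / n_test

# Synthetic labels and fixed predictions, assumed for illustration only
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(300)]
predictions = [y if rng.random() < 0.75 else 1 - y for y in labels]

# Same data, same predictions, 20 different random test sets
scores = [holdout_accuracy(labels, predictions, 0.33, seed) for seed in range(20)]
print("min=%.3f max=%.3f std=%.3f" % (min(scores), max(scores), statistics.pstdev(scores)))
```

The gap between the minimum and maximum is the kind of spread that a single train-test split hides.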
We can split the dataset into a train and test set using the train_test_split() function from the scikit-learn library. For example, we can split the dataset into a 67% and 33% split for training and test sets as follows:

    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
The full code listing is provided below using the Pima Indians onset of diabetes dataset, assumed to be in the current working directory. An XGBoost model with default configuration is fit on the training dataset and evaluated on the test dataset.

    # train-test split evaluation of xgboost model
    from numpy import loadtxt
    from xgboost import XGBClassifier
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    # load data
    dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
    # split data into X and y
    X = dataset[:,0:8]
    Y = dataset[:,8]
    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
    # fit model on training data
    model = XGBClassifier()
    model.fit(X_train, y_train)
    # make predictions for test data
    y_pred = model.predict(X_test)
    predictions = [round(value) for value in y_pred]
    # evaluate predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
Running this example summarizes the performance of the model on the test set.
Evaluate XGBoost Models With k-Fold Cross Validation
Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test split.
It works by splitting the dataset into k parts (e.g. k=5 or k=10). Each split of the data is called a fold. The algorithm is trained on k-1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set.
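The fold rotation described above can be sketched in a few lines. This is a minimal illustration that assumes contiguous, unshuffled folds and a toy dataset of 10 rows; the generator name k_fold_indices is hypothetical and is not scikit-learn's implementation.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (hypothetical helper)."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

# Toy example: 10 rows, k=5, so each held back fold has 2 rows
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row index appears in exactly one test fold and in the training portion of the other k-1 rounds.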
After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.
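As a small worked example, the summary step looks like this. The five fold scores are made-up values for illustration, not results from any model. The sketch uses the population standard deviation because that is what NumPy's std() computes by default.

```python
import statistics

# Hypothetical per-fold accuracies from a 5-fold run (not real results)
fold_scores = [0.70, 0.80, 0.75, 0.85, 0.65]

mean_score = statistics.mean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population std, like numpy's default std()

print("Accuracy: %.2f%% (%.2f%%)" % (mean_score * 100, std_score * 100))
# Accuracy: 75.00% (7.07%)
```

A large standard deviation relative to the mean suggests the estimate depends heavily on which rows fall into which fold.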
The result is a more reliable estimate of the performance of the algorithm on new data. It is more reliable because the algorithm is trained and evaluated multiple times on different data.
The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest sized datasets in the thousands or tens of thousands of observations, k values of 3, 5 and 10 are common.
We can use the k-fold cross validation support provided in scikit-learn. First we must create the KFold object, specifying the number of folds. We can then use this scheme with the specific dataset. The cross_val_score() function from scikit-learn allows us to evaluate a model using the cross validation scheme and returns an array of the scores for each model trained on each fold.
Note that in scikit-learn 0.18 and later, KFold is imported from sklearn.model_selection and takes n_splits rather than the dataset size, and random_state is only used when shuffle=True.

    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    results = cross_val_score(model, X, Y, cv=kfold)
The full code listing for evaluating an XGBoost model with k-fold cross validation is provided below for completeness.

    # k-fold cross validation evaluation of xgboost model
    from numpy import loadtxt
    import xgboost
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    # load data
    dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
    # split data into X and y
    X = dataset[:,0:8]
    Y = dataset[:,8]
    # CV model
    model = xgboost.XGBClassifier()
    # random_state is only used when shuffle=True
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    results = cross_val_score(model, X, Y, cv=kfold)
    print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Running this example summarizes the performance of the default model configuration on the dataset including both the mean and standard deviation classification accuracy.
If you have many classes for a classification type predictive modeling problem or the classes are imbalanced (there are a lot more instances for one class than another), it can be a good idea to create stratified folds when performing cross validation.
This has the effect of enforcing the same distribution of classes in each fold as in the whole training dataset when performing the cross validation evaluation. The scikit-learn library provides this capability in the StratifiedKFold class.
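To see what stratification does, here is a minimal sketch that deals the rows of each class into folds in round-robin fashion. The helper stratified_fold_ids and the 80/20 label mix are assumptions for illustration; StratifiedKFold's own algorithm differs in its details.

```python
from collections import Counter, defaultdict

def stratified_fold_ids(labels, k):
    """Assign each row a fold id so every fold gets a similar share of each class."""
    fold_of = [None] * len(labels)
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)
    for rows in rows_by_class.values():
        for position, row in enumerate(rows):
            fold_of[row] = position % k  # deal this class's rows round-robin
    return fold_of

# Hypothetical imbalanced labels: 80 rows of class 0, 20 rows of class 1
labels = [0] * 80 + [1] * 20
fold_of = stratified_fold_ids(labels, 5)

for fold in range(5):
    counts = Counter(labels[i] for i, f in enumerate(fold_of) if f == fold)
    print(fold, dict(counts))
```

With these labels, every fold receives 16 rows of class 0 and 4 rows of class 1, the same 80/20 ratio as the full dataset.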
Below is the k-fold example from above, modified to use stratified cross validation to evaluate an XGBoost model.

    # stratified k-fold cross validation evaluation of xgboost model
    from numpy import loadtxt
    import xgboost
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    # load data
    dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
    # split data into X and y
    X = dataset[:,0:8]
    Y = dataset[:,8]
    # CV model
    model = xgboost.XGBClassifier()
    # labels are passed to cross_val_score, not to the StratifiedKFold constructor
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    results = cross_val_score(model, X, Y, cv=kfold)
    print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Running this example reports the mean and standard deviation of classification accuracy across the stratified folds.
What Techniques to Use When
 Generally, k-fold cross validation is the gold standard for evaluating the performance of a machine learning algorithm on unseen data, with k set to 3, 5, or 10.
 Use stratified cross validation to enforce class distributions when there are a large number of classes or an imbalance in instances for each class.
 Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions.
If in doubt, use 10-fold cross validation for regression problems and stratified 10-fold cross validation on classification problems.
Summary
In this tutorial you discovered how you can evaluate your XGBoost models by estimating how well they are likely to perform on unseen data.
Specifically, you learned:
 How to split your dataset into train and test subsets for training and evaluating the performance of your model.
 How you can create k XGBoost models on different subsets of the dataset and average the scores to get a more robust estimate of model performance.
 Heuristics to help choose between train-test split and k-fold cross validation for your problem.
Do you have any questions on how to evaluate the performance of XGBoost models or about this post? Ask your questions in the comments below and I will do my best to answer.