Tree Models for Data Imputation

In our research we frequently need memory-efficient algorithms that scale to big data and are fast to compute. In the big data setting, trees are one of the most popular classes of models: for a dataset of n observations and p features, a single tree can be constructed in O(pn\log n) time.

Decision trees are robust to outliers and handle missing data gracefully through surrogate splits, which is a particularly useful feature. Suppose we train a decision tree to predict a variable Y from features A, B, and C. If new data arrive with valid values for only A and B, we would be in trouble and might not be able to predict Y. Tree models can be trained to include surrogate splits, so that if variable C is missing in a new data point, the decision defers to another variable C' that is highly correlated with C and the prediction proceeds. An additional advantage of splits is that the prediction is defined piecewise. In time series data where the variance is not constant over time, or where there are interventions, it would be nice to identify these regions and fit different models to each; trees give us that flexibility.
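
To make this concrete, here is a minimal sketch of surrogate splits in action using the rpart package; the variables and data are invented purely for illustration, with B standing in for a missing A at prediction time.

library(rpart)                         # CART implementation; surrogate splits are kept by default

set.seed(1)
n <- 500
A <- runif(n)
B <- A + rnorm(n, sd = 0.05)           # B is highly correlated with A
C <- runif(n)
Y <- 2*A + C + rnorm(n, sd = 0.1)
d <- data.frame(Y, A, B, C)

fit <- rpart(Y ~ A + B + C, data = d)
summary(fit)                           # the summary lists the surrogate variables at each split

d.new   <- d[1:5, ]
d.new$A <- NA                          # A is missing in the new data
predict(fit, newdata = d.new)          # B acts as a surrogate for A and prediction proceeds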

For a data matrix, we can use trees to impute missing data with the following simple strategy:

X = data matrix
Repeat:
    In parallel:
        For each column j, train a tree model that predicts column j using the other p-1 columns
        Predict the missing values in column j, but do not write them to X yet
    Sync threads
    Fill X with the predicted values

It’s important to repeat the procedure for stability, just as in SVD imputation. For the actual implementation there are two popular choices. A simple one would just use CART; for a time series matrix with T time points, n observations, and p features, the runtime would be O(np^2T\log T).
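
For reference, here is an illustrative sketch of what one sweep of the strategy above might look like with CART, using rpart for the per-column trees and parallel::mclapply for the parallel fits. The function name cartImputeOnce and the assumption of a numeric matrix are ours; in practice this sweep would be wrapped in the outer Repeat loop.

library(rpart)
library(parallel)

# One sweep of the imputation loop above: fit the per-column trees in parallel,
# then write all predictions back to x in a single pass ("sync threads").
cartImputeOnce <- function(x, cores = 2) {
    miss <- is.na(x)
    cols <- which(colSums(miss) > 0)
    preds <- mclapply(cols, function(j) {
        good <- !miss[, j]
        d    <- data.frame(y = x[good, j], x[good, -j, drop = FALSE])
        fit  <- rpart(y ~ ., data = d)   # NAs in the other columns are handled by surrogate splits
        predict(fit, newdata = data.frame(x[!good, -j, drop = FALSE]))
    }, mc.cores = cores)                 # mclapply forks; on Windows use mc.cores = 1
    for (k in seq_along(cols)) x[miss[, cols[k]], cols[k]] <- preds[[k]]
    x
}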

A more powerful technique is gradient boosting, which uses an ensemble of weak decision trees. We use the R package gbm to build our imputation function.

First, let’s create some data and see how gbm works:

set.seed(100)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
 
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
 
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
head(data)
 
            Y         X1        X2 X3 X4 X5         X6
1 -0.09117769 0.30776611 0.1480966  d  a  c 2.31015454
2 -0.32203092 0.25767250 0.2237327  d  a  b 1.98513625
3  1.64481361 0.55232243 1.2478880  d  e  b 0.04731843
4  3.68823135 0.05638315 1.3421636  a  c  c 1.81793991
5  2.07303574 0.46854928 0.7317884  c  e  a 2.38320102
6  0.90415066 0.48377074 0.3663628  c  c  a 1.67086893

Now we use gbm to fit a model to the data and compute the sum of squared errors.

library(gbm)                             # load the gbm package

gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
    data=data,                   # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 #  0: no monotone restrictions
    distribution="gaussian",     # bernoulli, adaboost, gaussian,
                                 # poisson, coxph, and quantile available
    n.trees=3000,                # number of trees
    shrinkage=0.005,             # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    cv.folds = 5,                # do 5-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=TRUE)                # print out progress
 
# check performance using an out-of-bag estimator
# OOB underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1,method="OOB")
print(best.iter)
 
data.predict = predict(gbm1, n.trees = best.iter)
SSE = sum((Y - data.predict)^2)
print(SSE)
 
168.2677

Now let’s see how gbm handles missing values:

X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
 
data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
 
gbm2 <- gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
    data=data2,                   # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 #  0: no monotone restrictions
    distribution="gaussian",     # bernoulli, adaboost, gaussian,
                                 # poisson, coxph, and quantile available
    n.trees=3000,                # number of trees
    shrinkage=0.005,             # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    cv.folds = 5,                # do 5-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=TRUE)                # print out progress
 
# check performance using an out-of-bag estimator
# OOB underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm2,method="OOB")
print(best.iter)
 
data.predict = predict(gbm2, n.trees = best.iter)
SSE = sum((Y - data.predict)^2)
print(SSE)
 
225.7908

Great! Even with half of X1 and 300 values of X4 missing, gbm still fits the data reasonably well. We’ll build our imputation function by looping over the columns of the data matrix and fitting a gbm for each column that has missing values.

# gbmImpute: iteratively impute the missing entries of a numeric matrix x by
# fitting one gbm per column with missing values and predicting those entries.
gbmImpute = function(x, max.iters = 2, cv.fold = 5, verbose=T) {
 
    missing.matrix = is.na(x)
    numMissing = sum(missing.matrix)
    if(verbose) {
        print(paste("imputing on", numMissing, "missing values with matrix size",
        nrow(x)*ncol(x), sep=" "))
    }
    if(numMissing == 0) {
        return (x)
    }
 
    missing.cols.indices = which(apply(missing.matrix, 2, function(i) {
        any(i)
    }))
 
    for (i in 1:max.iters) {
        if (verbose) print (paste("Begin iteration: ", i))
        x[,missing.cols.indices] = sapply(missing.cols.indices, function(j) {
            good.data = which(!missing.matrix[,j])
            gbm1 <- gbm(x[good.data,j] ~ .,
                data = as.data.frame(x[good.data,-j]),
                var.monotone = rep(0, ncol(x)-1), # -1: monotone decrease,
                                             # +1: monotone increase,
                                             #  0: no monotone restrictions
                distribution="gaussian",     # bernoulli, adaboost, gaussian,
                                             # poisson, coxph, and quantile available
                n.trees=3000,                # number of trees
                shrinkage=0.005,             # shrinkage or learning rate,
                                             # 0.001 to 0.1 usually work
                interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
                bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
                train.fraction = 0.5,        # fraction of data for training,
                                             # first train.fraction*N used for training
                n.minobsinnode = 10,         # minimum total weight needed in each node
                cv.folds = cv.fold,          # number of cross-validation folds (5 by default)
                keep.data=TRUE,              # keep a copy of the dataset with the object
                verbose=F)                # print out progress
            best.iter <- gbm.perf(gbm1,method="OOB", plot.it = F)
            data.predict = predict(gbm1, newdata = as.data.frame(x[-good.data,-j]), n.trees = best.iter)
            x[-good.data,j] = data.predict
            x[,j]
        })
    }
 
    return ( list (
        x=x,
        missing.matrix=missing.matrix
    ))
}
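
Here is a quick, illustrative call on the data2 example from above. The column selection is only because the function as written expects numeric data, and the object names are arbitrary.

# impute the holes we punched in X1, using only the numeric columns of data2
x.numeric <- as.matrix(data2[, c("Y", "X1", "X2", "X6")])
result    <- gbmImpute(x.numeric, max.iters = 2, cv.fold = 5)
sum(is.na(result$x))           # no NAs should remain after the imputation passes
head(result$x)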
