---
title: "Complex Model Simulation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Complex Model Simulation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Package Loading

Before loading the package, we need to allocate enough memory for Java. Here we allocate 10GB.

```{r}
set.seed(123)
# Allocate 10GB of memory for Java. Must be called before library(iBART)
options(java.parameters = "-Xmx10g")
library(iBART)
```

## Complex Model

In this vignette, we run iBART on the complex model described in Section 3.4 of the paper, i.e., the data-generating model is
$$y = 15\{\exp(x_1)-\exp(x_2)\}^2 + 20\sin(\pi x_3x_4) + \varepsilon, \qquad \varepsilon\sim \mathcal{N}_n(0, \sigma^2 I).$$
The primary features are $X = (x_1,\ldots,x_p)$, where $x_1,\ldots,x_p \overset{\text{iid}}{\sim} \text{Unif}_n(-1,1)$. We use the following setting: $n = 250$, $p = 10$, and $\sigma = 0.5$. The goal in OIS is to identify the 2 true descriptors, $f_1(X) = \{\exp(x_1)-\exp(x_2)\}^2$ and $f_2(X) = \sin(\pi x_3x_4)$, using only $(y, X)$ as input. Let's generate the primary features $X$ and the response variable $y$.

```{r iBART}
#### Simulation Parameters ####
n <- 250 # Change n to 100 here to reproduce the result in Supplementary Materials A.2.3
p <- 10  # Number of primary features

#### Generate Data ####
X <- matrix(runif(n * p, min = -1, max = 1), nrow = n, ncol = p)
colnames(X) <- paste("x.", seq(from = 1, to = p, by = 1), sep = "")
y <- 15 * (exp(X[, 1]) - exp(X[, 2]))^2 + 20 * sin(pi * X[, 3] * X[, 4]) + rnorm(n, mean = 0, sd = 0.5)
```

Note that the input data to iBART are only $y$ and $X = (x_1,\ldots,x_{10})$, and iBART needs to

1. generate the correct descriptors $f_1(X)$ and $f_2(X)$, and
2. select the correct descriptors.

At Iteration 0 (the base iteration), iBART determines which of the primary features, $(x_1,\ldots,x_{10})$, are useful and applies operators only to the useful ones. If successful, iBART should keep $(x_1,\ldots,x_4)$ in the loop and discard $(x_5,\ldots,x_{10})$. Let's run iBART for 1 iteration (Iteration 0 + Iteration 1) and examine its outputs.

```{r iBART short}
#### iBART ####
iBART_sim <- iBART(X = X, y = y,
                   name = colnames(X),
                   num_burn_in = 5000,                  # lower value for faster run
                   num_iterations_after_burn_in = 1000, # lower value for faster run
                   num_permute_samples = 20,            # lower value for faster run
                   opt = c("unary"),                    # only apply unary operators after the base iteration
                   sin_cos = TRUE,
                   apply_pos_opt_on_neg_x = FALSE,
                   Lzero = TRUE,
                   K = 4,
                   standardize = FALSE,
                   seed = 123)
```

`iBART()` returns a list object that contains many useful outputs; see `?iBART::iBART` for a full list of return values. Here we focus on 2 return values:

* `iBART_sim$descriptor_names`: the descriptors selected by iBART
* `iBART_sim$iBART_model`: the selected model, a `cv.glmnet` object

We can use the iBART model the same way we would use a `glmnet` model. For instance, we can print out the coefficients using `coef()`.

```{r iBART result}
# iBART selected descriptors
iBART_sim$descriptor_names

# iBART model
coef(iBART_sim$iBART_model, s = "lambda.min")
```

`iBART_sim$descriptor_names` contains the names of the descriptors selected at the last iteration (Iteration 1), and `coef(iBART_sim$iBART_model, s = "lambda.min")` shows the input descriptors at the last iteration (Iteration 1) and their coefficients.
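Because `iBART_sim$iBART_model` is an ordinary `cv.glmnet` fit, we can also list just the active (nonzero) descriptors programmatically. The sketch below is plain `glmnet` usage, not an iBART-specific helper, and is not evaluated here:

```{r nonzero coef, eval=FALSE}
# Minimal sketch: keep only the nonzero coefficients at lambda.min.
# coef() on a cv.glmnet object returns a sparse matrix, so densify it first.
beta <- as.matrix(coef(iBART_sim$iBART_model, s = "lambda.min"))
active <- which(beta[, 1] != 0)
data.frame(descriptor  = rownames(beta)[active],
           coefficient = beta[active, 1])
```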
Notice that the first 4 descriptors in `coef(iBART_sim$iBART_model, s = "lambda.min")` are $x_1,\ldots,x_4$. This indicates that iBART discarded $x_5,\ldots,x_{10}$ and kept $x_1,\ldots,x_4$ in the loop at Iteration 0. At Iteration 1, iBART applied unary operators to $x_1,\ldots,x_4$, yielding
$$x_i,\; x_i^2,\; \exp(x_i),\; \sin(\pi x_i),\; \cos(\pi x_i),\; x_i^{-1},\; |x_i|, \qquad\text{for } i = 1,2,3,4.$$
Among them, iBART selected 2 active intermediate descriptors, $\exp(x_1)$ and $\exp(x_2)$, which are needed to generate $f_1(X) = \{\exp(x_1)-\exp(x_2)\}^2$. This is very promising. Note that $\sqrt{x_i}$ and $\log(x_i)$ do not appear here because $\sqrt{\cdot}$ and $\log(\cdot)$ are only defined for positive $x_i$. We can override this by setting `apply_pos_opt_on_neg_x = TRUE`, which effectively generates $\sqrt{|x_i|}$ and $\log(|x_i|)$ instead.

To save time, we cached the result of a complete run in `data("iBART_sim", package = "iBART")`, which can be replicated using the following code.

```{r iBART full, eval=FALSE}
iBART_sim <- iBART(X = X, y = y,
                   name = colnames(X),
                   opt = c("unary", "binary", "unary"),
                   sin_cos = TRUE,
                   apply_pos_opt_on_neg_x = FALSE,
                   Lzero = TRUE,
                   K = 4,
                   standardize = FALSE,
                   seed = 123)
```

Let's load the full result and see how iBART did.

```{r load result}
load("../data/iBART_sim.rda")                  # load full result
iBART_sim$descriptor_names                     # iBART selected descriptors
coef(iBART_sim$iBART_model, s = "lambda.min")  # iBART model
```

Here iBART generated 145 descriptors in the last iteration, and it correctly identified the true descriptors $f_1(X)$ and $f_2(X)$ without selecting any false positives. This is very reassuring, especially since some of these descriptors are highly correlated with $f_1(X)$ or $f_2(X)$. For instance, $\tilde{f}_1(X) = |\exp(x_1) - \exp(x_2)|$ in the descriptor space is highly correlated with $f_1(X)$.

```{r cor}
f1_true <- (exp(X[, 1]) - exp(X[, 2]))^2
f1_cor <- abs(exp(X[, 1]) - exp(X[, 2]))
cor(f1_true, f1_cor)
```

`iBART()` also returns other useful outputs, such as `iBART_sim$iBART_gen_size` and `iBART_sim$iBART_sel_size`, which store the dimensions of the newly generated and selected descriptor spaces at each iteration. Let's plot them and see how **iBART** uses nonparametric variable selection for dimension reduction. In each iteration, the dimension of the intermediate descriptor space is kept under $\mathcal{O}(p^2)$, leading to a progressive dimension reduction.

```{r size_plot, fig.width=7, fig.height=3.5}
library(ggplot2)
df_dim <- data.frame(dim = c(iBART_sim$iBART_sel_size, iBART_sim$iBART_gen_size),
                     iter = rep(0:3, 2),
                     type = rep(c("Selected", "Generated"), each = 4))
ggplot(df_dim, aes(x = iter, y = dim, colour = type, group = type)) +
  theme(text = element_text(size = 15), legend.title = element_blank()) +
  geom_line(size = 1) +
  geom_point(size = 3, shape = 21, fill = "white") +
  geom_text(aes(label = dim, y = dim + 10, group = type),
            position = position_dodge(0), size = 5, colour = "blue") +
  labs(x = "Iteration", y = "Number of descriptors") +
  scale_x_continuous(breaks = c(0, 1, 2, 3))
```

## R Session Info

```{r sessioninfo}
sessionInfo()
```