The single-atom catalysis data is stored in
data/single_atom_catalysis.RData
, and the raw data is
available at this GitHub
repo. Here we model the metal/oxide binding energy (the response
variable y) using p = 59 physical properties of the
transition metals and the oxide supports (the primary features X). The response variable y and the primary features X are treated as continuous
variables, and we aim to use iBART to find an
interpretable model with high predictive performance for the metal/oxide
binding energy.
A total of 13 transition metals (Cu, Ag, Au, Ni, Pd, Pt, Co, Rh, Ir, Fe, Ru, Mn, V) and 7 oxide supports (CeO2(111), MgO(100), CeO2(110), TbO2(111), ZnO(100), TiO2(011), α-Al2O3(0001)) were studied in the dataset, making a total of n = 13 × 7 = 91 metal/oxide pairs. The primary feature matrix X contains various physical properties of the transition metals and the oxide supports, including Pauling Electronegativity (χP), (n − 1)th and nth Ionization Energies (IEn − 1, IEn), Electron Affinity (EA), HOMO Energy, LUMO Energy, Heat of Sublimation (ΔHsub), Oxidation Energy of oxide support (ΔHf,ox,bulk), Oxide Formation Enthalpy (ΔHf,ox), Zunger Orbital Radius (r), Atomic Number (Z), Miedema Parameters of metal atoms (η1/3, φ), Valence Electrons (Nval), Oxygen Vacancy Energy of oxide support (ΔEvac), Work Function of oxide support (WF), Surface Energy (γ), Coordination Number (CN), and Bond Valence of surface metal atom (BV). Most of these physical properties are defined for both the transition metals and the oxide supports, while a few are defined only for the transition metals or only for the oxide supports. A detailed description of the 59 primary features X can be found on pages 11–14 of the data supplementary materials published by O'Connor et al.
Before loading the iBART package, we must allocate enough memory to Java to avoid out-of-memory errors.
# Allocate 10GB of memory for Java. Must be called before library(iBART)
options(java.parameters = "-Xmx10g")
library(iBART)
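Optionally, we can verify that the JVM actually received the requested heap. This is a small sketch using rJava (a dependency of iBART, so it is already loaded); note that maxMemory() typically reports slightly less than the -Xmx value.

# Query the JVM's maximum heap size in GB (should be close to 10)
rt <- rJava::.jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
rJava::.jcall(rt, "J", "maxMemory") / 1024^3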
Next, we load the real data set and examine what data are needed to run iBART.
load("../data/catalysis.rda") # load data
summary(catalysis)
#> Length Class Mode
#> X 5369 -none- numeric
#> y 91 -none- numeric
#> head 59 -none- character
#> unit 708 -none- numeric
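As a quick sanity check against the data description above (this assumes X is stored as an n × p matrix, as described below):

dim(catalysis$X)    # should be 91 x 59
length(catalysis$y) # should be 91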
The data set consists of 4 objects:

- X: a matrix of physical properties of the transition metals and the oxide supports described in Data Description. These are our primary features (predictors).
- y: a numeric vector of the metal/oxide binding energies described in Data Description. This is our response variable.
- head: a character vector storing the column names of X.
- unit: an (optional) matrix whose columns store the unit information of the primary features X. This can be generated using the helper function generate_unit(X, unit, dimension). See ?iBART::generate_unit for more detail.

Now let's apply iBART to the data set. Besides the usual regression data (X, y), we need to specify the descriptor generating strategy through opt. Here we set opt = c("binary", "unary", "binary"), the descriptor generating strategy described in (8) of our paper. That is, we let iBART run for 3 descriptor generating iterations: binary operators $\mathcal{O}_b$ in the 1st iteration, unary operators $\mathcal{O}_u$ in the 2nd, and binary operators $\mathcal{O}_b$ again in the 3rd.
We can also define other descriptor generating strategies. The available operator sets at each iteration are:

- all: all operators $\mathcal{O} = \{+, -, \times, /, |-|, I, \exp, \log, |\cdot|, \sqrt{\cdot}, {}^{-1}, {}^2, \sin(\pi\cdot), \cos(\pi\cdot)\}$
- binary: binary operators $\mathcal{O}_b = \{+, -, \times, /, |-|\}$
- unary: unary operators $\mathcal{O}_u = \{I, \exp, \log, |\cdot|, \sqrt{\cdot}, {}^{-1}, {}^2, \sin(\pi\cdot), \cos(\pi\cdot)\}$

For example, opt = c("all", "all") will apply all operators $\mathcal{O}$ for 2 iterations. Here we use the same tuning parameters discussed in Section 3.4 of our paper.
iBART_real_data <- iBART(X = catalysis$X, y = catalysis$y,
                         name = catalysis$head,  # colnames of X
                         unit = catalysis$unit,  # units of X
                         opt = c("binary", "unary", "binary"), # binary operators first
                         out_sample = FALSE,     # fit in-sample (no train/test split)
                         Lzero = TRUE,           # also run iBART + l0 best-subset regression
                         K = 5,                  # maximum number of descriptors in l-zero model
                         standardize = FALSE,
                         seed = 888)
To save time, we cached the result of a complete run in
data("iBART_real_data", package = "iBART")
.
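To use the cached result in a fresh session, run the call named above instead of refitting:

data("iBART_real_data", package = "iBART")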
iBART()
returns many interesting outputs. For example,
iBART_real_data$iBART_gen_size
and
iBART_real_data$iBART_sel_size
return the dimensions of the
generated and selected descriptor space for each iteration,
respectively. Let’s plot them and see how iBART uses nonparametric
variable selection to achieve effective dimension reduction.
library(ggplot2)
df_dim <- data.frame(dim = c(iBART_real_data$iBART_sel_size, iBART_real_data$iBART_gen_size),
                     iter = rep(0:3, 2),
                     type = rep(c("Selected", "Generated"), each = 4))
ggplot(df_dim, aes(x = iter, y = dim, colour = type, group = type)) +
  theme(text = element_text(size = 15), legend.title = element_blank()) +
  geom_line(linewidth = 1) +
  geom_point(size = 3, shape = 21, fill = "white") +
  geom_text(data = df_dim, aes(label = dim, y = dim + 40, group = type),
            position = position_dodge(0), size = 5, colour = "blue") +
  labs(x = "Iteration", y = "Number of descriptors") +
  scale_x_continuous(breaks = c(0, 1, 2, 3))
Due to the iterative nonparametric screening framework of iBART, the dimension of the selected space is significantly smaller than that of the generated space at each iteration. This ensures that only a sparse subset of the intermediate descriptors is used to generate the descriptor space of the next iteration, thus achieving progressive dimension reduction.
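For reference, the raw counts plotted above can be printed directly from the two fields introduced earlier:

iBART_real_data$iBART_gen_size # generated descriptor space sizes, iterations 0-3
iBART_real_data$iBART_sel_size # selected descriptor space sizes, iterations 0-3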
The full selected model is stored in iBART_real_data$iBART_model, which is a cv.glmnet object since LASSO is used in the last iteration. This means that we can use all of the glmnet functionality. For instance, we can obtain the model coefficients at λ = lambda.min using coef().
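A minimal example, using standard cv.glmnet functionality rather than an iBART-specific API (output omitted):

# Coefficients of the final LASSO model at the CV-optimal lambda
coef(iBART_real_data$iBART_model, s = "lambda.min")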
To view the selected descriptors only, we can do the following instead:
iBART_real_data$descriptor_names # Symbolic syntax of the selected descriptors
#> [1] "(s_EA*Hf)" "(m_n13*m_Chi_MB_un)"
#> [3] "(Hf*Oxv)" "((Hfo*s_EA)*abs((Oxv-s_ion_4)))"
#> [5] "(Hfo/Hf)" "(m_n13/s_phi)"
#> [7] "(Hf/Oxv)" "((Hfo*s_EA)/(m_N_val/CMS))"
#> [9] "(s_ion_3/Hf)" "((m_n13/m_N_val)/(m_N_val/CMS))"
#> [11] "((m_n13/m_N_val)/(m_N_val/s_ion_4))" "abs((Hfo/Oxv))"
#> [13] "abs((m_n13/Oxv))" "abs((Hf/Oxv))"
#> [15] "abs((CMS/Hf))" "abs((s_ion_3/Hf))"
#> [17] "abs(((Hfo*s_EA)/Oxv))" "abs(((m_n13/m_N_val)/Oxv))"
iBART_real_data$coefficients # Coefficients of the selected descriptors
#> Intercept (s_EA*Hf)
#> 2.2667688754 0.0641417595
#> (m_n13*m_Chi_MB_un) (Hf*Oxv)
#> -0.2171066384 -0.0102495636
#> ((Hfo*s_EA)*abs((Oxv-s_ion_4))) (Hfo/Hf)
#> -0.0005013356 0.0015104293
#> (m_n13/s_phi) (Hf/Oxv)
#> -3.3558600086 -0.2006366861
#> ((Hfo*s_EA)/(m_N_val/CMS)) (s_ion_3/Hf)
#> -0.1284768423 0.0002178887
#> ((m_n13/m_N_val)/(m_N_val/CMS)) ((m_n13/m_N_val)/(m_N_val/s_ion_4))
#> -1.0673535163 -0.7634151724
#> abs((Hfo/Oxv)) abs((m_n13/Oxv))
#> -0.3578873442 -0.8955616423
#> abs((Hf/Oxv)) abs((CMS/Hf))
#> -0.1567905671 0.0052517211
#> abs((s_ion_3/Hf)) abs(((Hfo*s_EA)/Oxv))
#> 0.0010605477 -0.0133917672
#> abs(((m_n13/m_N_val)/Oxv))
#> -1.7098848392
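A quick count confirms how many descriptors survive the final LASSO; from the output above this should be 18:

length(iBART_real_data$descriptor_names) # 18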
If Lzero = TRUE
, iBART+ℓ0 will be run and K
controls the maximum number of descriptors in a model. Here we set
K=5
so iBART+ℓ0
will return 5 models: the best 1-descriptor model, 2-descriptor model,
and so on. We can access the descriptors of the best k-descriptor model via iBART_real_data$Lzero_names, and the corresponding regression models via iBART_real_data$Lzero_models. For instance, the best 3-descriptor model is
iBART_real_data$Lzero_names[[3]]
#> [1] "(s_EA*Hf)" "abs((Hfo/Oxv))"
#> [3] "abs(((m_n13/m_N_val)/Oxv))"
summary(iBART_real_data$Lzero_models[[3]])
#>
#> Call:
#> lm(formula = y_train ~ ., data = dat_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.70871 -0.42326 0.05825 0.44715 1.97315
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.01707 0.12675 -0.135 0.893
#> `(s_EA*Hf)` 0.40427 0.04441 9.104 2.75e-14 ***
#> `abs((Hfo/Oxv))` -0.58838 0.09857 -5.969 5.05e-08 ***
#> `abs(((m_n13/m_N_val)/Oxv))` -19.62963 4.25098 -4.618 1.33e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6378 on 87 degrees of freedom
#> Multiple R-squared: 0.9534, Adjusted R-squared: 0.9518
#> F-statistic: 593.9 on 3 and 87 DF, p-value: < 2.2e-16
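All K = 5 models can be examined the same way. As a small sketch using the Lzero_names field described above, the following counts the descriptors in each model (the best k-descriptor model should contain exactly k descriptors):

sapply(iBART_real_data$Lzero_names, length) # 1 2 3 4 5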
The OIS model differs from the non-OIS model in that the former is built on nonlinear descriptors (compositions of operators in $\mathcal{O}$ applied to X) while the latter is built on the primary features X. The OIS model has many advantages. In particular, it reveals an interpretable nonlinear relationship between y and X, and it improves prediction accuracy over a simple linear regression model (or non-OIS model). Here we showcase the improved accuracy over the non-OIS model by reproducing Figure 7 of the paper.
# Train a non-OIS model with 3 predictors
set.seed(123)
model_no_OIS <- k_var_model(X_train = catalysis$X, y_train = catalysis$y, k = 3, parallel = FALSE)

#### Figure 7 ####
library(ggpubr)
model_OIS <- iBART_real_data$Lzero_models[[3]]

# Prepare data for plotting
data_OIS <- data.frame(y = catalysis$y, y_hat = model_OIS$fitted.values)
data_no_OIS <- data.frame(y = catalysis$y, y_hat = model_no_OIS$models$fitted.values)

p1 <- ggplot(data_OIS, aes(x = y_hat, y = y)) +
  geom_point() +
  geom_abline() +
  xlim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) +
  ylim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) +
  xlab("") +
  ylab("") +
  annotate("text", x = -12, y = -3, parse = TRUE,
           label = paste("R^{2} ==", round(summary(model_OIS)$r.squared, 4)))
p2 <- ggplot(data_no_OIS, aes(x = y_hat, y = y)) +
  geom_point() +
  geom_abline() +
  xlim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) +
  ylim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) +
  xlab("") +
  ylab("") +
  annotate("text", x = -12, y = -3, parse = TRUE,
           label = paste("R^{2} ==", round(summary(model_no_OIS$models)$r.squared, 4)))
fig <- ggarrange(p1, p2,
                 labels = c("OIS", "non-OIS"),
                 ncol = 2, nrow = 1)
annotate_figure(fig,
                bottom = text_grob("Predicted binding energy from descriptors (eV)"),
                left = text_grob("DFT binding energy (eV)", rot = 90))
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
#> [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggpubr_0.6.0 ggplot2_3.5.1 iBART_1.0.2 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] tidyr_1.3.1 sass_0.4.9 utf8_1.2.4
#> [4] generics_0.1.3 bartMachineJARs_1.2.1 rstatix_0.7.2
#> [7] shape_1.4.6.1 lattice_0.22-6 digest_0.6.37
#> [10] magrittr_2.0.3 evaluate_1.0.1 grid_4.4.2
#> [13] iterators_1.0.14 fastmap_1.2.0 foreach_1.5.2
#> [16] jsonlite_1.8.9 glmnet_4.1-8 Matrix_1.7-1
#> [19] missForest_1.5 backports_1.5.0 Formula_1.2-5
#> [22] survival_3.7-0 gridExtra_2.3 bartMachine_1.3.4.1
#> [25] purrr_1.0.2 fansi_1.0.6 doRNG_1.8.6
#> [28] itertools_0.1-3 scales_1.3.0 codetools_0.2-20
#> [31] jquerylib_0.1.4 abind_1.4-8 cli_3.6.3
#> [34] rlang_1.1.4 cowplot_1.1.3 munsell_0.5.1
#> [37] splines_4.4.2 withr_3.0.2 cachem_1.1.0
#> [40] yaml_2.3.10 tools_4.4.2 parallel_4.4.2
#> [43] ggsignif_0.6.4 dplyr_1.1.4 colorspace_2.1-1
#> [46] rngtools_1.5.2 broom_1.0.7 buildtools_1.0.0
#> [49] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
#> [52] randomForest_4.7-1.2 car_3.1-3 pkgconfig_2.0.3
#> [55] rJava_1.0-11 bslib_0.8.0 pillar_1.9.0
#> [58] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.13-1
#> [61] tidyselect_1.2.1 xfun_0.49 tibble_3.2.1
#> [64] sys_3.4.3 knitr_1.49 farver_2.1.2
#> [67] htmltools_0.5.8.1 carData_3.0-5 labeling_0.4.3
#> [70] maketools_1.3.1 compiler_4.4.2