In this vignette we fit a Bayesian multivariate Gaussian mixture with a prior on the number of components K to the thyroid data set available in the mclust package. We use the prior specification for model selection and the telescoping sampler to perform MCMC sampling as used in Frühwirth-Schnatter, Malsiner-Walli, and Grün (2021).
First, we load the package.
We load the data set from package mclust and extract
the variables for clustering omitting the variable indicating the known
classification available in column named "Diagnosis"
.
We extract the dimension of the data set and indicate the distribution of the known classification.
## z
## Hypo Normal Hyper
## 30 150 35
For multivariate observations y1, …, yN the following model with hierachical prior structure is assumed:
For MCMC sampling we need to specify Mmax
, the maximum
number of iterations, thin
, the thinning imposed to reduce
auto-correlation in the chain by only recording every
thin
ed observation, and burnin
, the number of
burn-in iterations not recorded.
The specifications of Mmax
and thin
imply
M
, the number of recorded observations.
For MCMC sampling, we need to specify Kmax
, the maximum
number of components possible during sampling, and Kinit
,
the initial number of filled components.
We use a static specification for the weights with a gamma prior G(1, 20) on e0.
We need to select the prior on K
. We use the prior K − 1 ∼ BNB(1, 4, 3)
as suggested in Frühwirth-Schnatter,
Malsiner-Walli, and Grün (2021) for a model-based clustering
context.
We select the hyperparameters for the priors on the component specific parameters as in Frühwirth-Schnatter, Malsiner-Walli, and Grün (2021).
R <- apply(y, 2, function(x) diff(range(x)))
b0 <- apply(y, 2, median)
B_0 <- rep(1, r)
B0 <- diag((R^2) * B_0)
c0 <- 2.5 + (r-1)/2
g0 <- 0.5 + (r-1)/2
G0 <- 100 * g0/c0 * diag((1/R^2), nrow = r)
C0 <- g0 * chol2inv(chol(G0))
We use kmeans()
with the specified initial number of
filled components Kinit
to determine an initial partition
S0 of the
observations as well as initial values for the component-specific means
μ0.
set.seed(1234)
cl_y <- kmeans(y, centers = Kinit, nstart = 100, iter.max = 30)
S_0 <- cl_y$cluster
mu_0 <- t(cl_y$centers)
The component sizes are initially set to be equal and the
component-specific variance-covariance matrices are diagonal matrices
with half the value of C0
in the diagonal.
Using this prior specification as well as initialization and MCMC settings, we draw samples from the posterior using the telescoping sampler.
The first argument of the sampling function is the data followed by
the initial partition and the initial parameter values for
component-specific means, variances and sizes. The next set of arguments
correspond to the hyperparameters of the prior setup (c0
,
g0
, G0
, C0
, b0
,
B0
). Then the setting for the MCMC sampling are specified
using M
, burnin
, thin
and
Kmax
. Finally the prior specification for the weights and
the prior on the number of components are given (G
,
priorOnK
, priorOnE0
).
estGibbs <- sampleMultNormMixture(
y, S_0, mu_0, Sigma_0, eta_0,
c0, g0, G0, C0, b0, B0,
M, burnin, thin, Kmax,
G, priorOnK, priorOnE0)
The sampling function returns a named list where the sampled
parameters and latent variables are contained. The list includes the
component means Mu
, the weights Eta
, the
assignments S
, the number of observations Nk
assigned to components, the number of components K
, the
number of filled components Kplus
, parameter values
corresponding to the mode of the nonnormalized posterior
nonnormpost_mode_list
, the acceptance rate in the
Metropolis-Hastings step when sampling α or e0, α and e0. These values can be
extracted for post-processing.
We inspect the acceptance rate when sampling e0 and the trace plot of the sampled e0:
## [1] 0.416
We also inspect the distribution using a histogram as well as by determining some quantiles:
## 25% 50% 75%
## 0.05655384 0.09077177 0.14352504
To further assess convergence, we also inspect the trace plots for the number of components K and the number of filled components K+.
We determine the posterior distribution of the number of filled components K+, approximated using the telescoping sampler. We visualize the distribution using a barplot.
Kplus <- rowSums(Nk != 0, na.rm = TRUE)
p_Kplus <- tabulate(Kplus, nbins = max(Kplus))/M
quantile(Kplus, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 3 3 4
barplot(p_Kplus/sum(p_Kplus), xlab = expression(K["+"]), names = 1:length(p_Kplus),
col = "red3", ylab = expression("p(" ~ K["+"] ~ "|" ~ bold(y) ~ ")"))
The distribution is also characterized using the 1st and 3rd quartile as well as the median.
## 25% 50% 75%
## 3 3 4
We obtain a point estimate for K+ by taking the mode and determine the number of MCMC draws where exactly K+ components were filled.
## [1] 3
## [1] 314
We also determine the posterior distribution of the number of components K directly drawn using the telescoping sampler.
## 25% 50% 75%
## 4 5 7
barplot(p_K/sum(p_K), names = 1:length(p_K), xlab = "K",
ylab = expression("p(" ~ K ~ "|" ~ bold(y) ~ ")"))
## [1] 4
First we select those draws where the number of non-empty groups was exactly K̂+:
In the following we extract the cluster means, data cluster sizes and cluster assignments for the draws where exactly K̂+ components were filled.
Mu_inter <- Mu[index, , , drop = FALSE]
Mu_Kplus <- array(0, dim = c(M0, r, Kplus_hat))
for (j in 1:r) {
Mu_Kplus[, j, ] <- Mu_inter[, j, ][Nk_Kplus]
}
Eta_inter <- Eta[index, ]
Eta_Kplus <- matrix(Eta_inter[Nk_Kplus], ncol = Kplus_hat)
Eta_Kplus <- sweep(Eta_Kplus, 1, rowSums(Eta_Kplus), "/")
w <- which(index)
S_Kplus <- matrix(0, M0, N)
for (i in seq_along(w)) {
m <- w[i]
perm_S <- rep(0, Kmax)
perm_S[Nk[m, ] != 0] <- 1:Kplus_hat
S_Kplus[i, ] <- perm_S[S[m, ]]
}
For model identification, we cluster the draws of the means where exactly K̂+ components were filled in the point process representation using k-means clustering.
Func_init <- t(nonnormpost_mode_list[[Kplus_hat]]$mu)
identified_Kplus <- identifyMixture(
Mu_Kplus, Mu_Kplus, Eta_Kplus, S_Kplus, Func_init)
We inspect the non-permutation rate to assess how well separated the data clusters are and thus how easily one can obtain a suitable relabeling of the draws. Low values of the non-permutation rate, i.e., close to zero, indicate that the solution can be easily identified pointing to a good clustering solution being obtained.
## [1] 0.003184713
The relabeled draws are also returned which can be used to determine posterior mean values for data cluster specific parameters.
## [,1] [,2] [,3]
## [1,] 95.60734860 122.844708 109.949946
## [2,] 17.44237377 3.754055 9.038163
## [3,] 4.17673271 1.056005 1.719776
## [4,] 0.97889075 13.936542 1.300131
## [5,] 0.01393632 18.783948 2.503710
## [1] 0.1681265 0.1294430 0.6992356
A final partition is also obtained based on the relabeled cluster assignments by assigning each observation to the cluster it has been assigned most often during sampling.
## z_sp
## 1 2 3
## 37 28 150
The final partition can also be compared to the known classification.
## z_sp
## z 1 2 3
## Hypo 0 26 4
## Normal 2 2 146
## Hyper 35 0 0
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
## [1] 0.0372093
## [1] 0.8779971