Nonlinear latent variable models

ML inference in non-linear SEMs is complex: computationally intensive methods based on numerical integration are needed, and the results are sensitive to distributional assumptions.

In a recent paper: A two-stage estimation procedure for non-linear structural equation models by Klaus Kähler Holst & Esben Budtz-Jørgensen (https://doi.org/10.1093/biostatistics/kxy082), we consider two-stage estimators as a computationally simple alternative to MLE. Here both steps are based on linear models: first we predict the non-linear terms and then these are related to latent outcomes in the second step.

Introduction: the measurement error problem

Measurement error in covariates is a common source of bias in regression analyses. In a simple linear regression setting, let $Y$ denote the response variable, $\xi$ the true exposure, for which we only observe replicated proxy measurements $X_1,\dots,X_q$, and $Z$ an additional confounder:

$$Y=\mu+\beta\xi+\gamma Z+\delta, \qquad \xi=\alpha+\kappa Z+\zeta, \qquad X_j=\nu_j+\xi+\epsilon_j,\quad j=1,\dots,q,$$

where the error terms $\delta,\zeta,\epsilon_j$ are assumed to be independent of the true exposure $\xi$. Standard techniques, e.g. linear regression (MLE) of $Y$ on $X_j$ and $Z$, fail to provide consistent estimates. To see this, observe that

$$\widehat{\beta}_{\mathrm{MLE}}=\frac{\operatorname{Cov}(Y,X_j\mid Z)}{\operatorname{Var}(X_j\mid Z)}=\beta\,\frac{\operatorname{Var}(\xi)}{\operatorname{Var}(\xi)+\operatorname{Var}(\epsilon_j)}<\beta.$$
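The attenuation can be checked numerically. A minimal sketch in base R (all parameter values below are arbitrary illustrations, not taken from the paper):

```r
## Simulate the measurement error model and compare the naive estimate
## to the attenuated limit beta * Var(zeta) / (Var(zeta) + Var(eps)).
set.seed(1)
n  <- 1e5
Z  <- rnorm(n)
xi <- 0.5 + 0.8 * Z + rnorm(n)         # true exposure, Var(zeta) = 1
X1 <- xi + rnorm(n, sd = sqrt(2))      # proxy measurement, Var(eps) = 2
Y  <- 1 + 2 * xi + 0.5 * Z + rnorm(n)  # true beta = 2

beta_naive <- unname(coef(lm(Y ~ X1 + Z))["X1"])
c(naive = beta_naive, limit = 2 * 1 / (1 + 2))  # naive approx 2/3, far below 2
```
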

In the following, we consider generalizations of this problem to general latent variable models with non-linear effects of the exposure variable.

Statistical model

For subjects $i=1,\dots,n$, a latent response variable $\eta_i=(\eta_{i1},\dots,\eta_{ip})^t$ is non-linearly related to a latent predictor $\xi_i=(\xi_{i1},\dots,\xi_{iq})^t$ after adjustment for covariates $Z_i=(Z_{i1},\dots,Z_{ir})^t$:

$$\eta_i=\alpha+B\varphi(\xi_i)+\Gamma Z_i+\zeta_i,\tag{1}$$

where $\varphi\colon\mathbb{R}^q\to\mathbb{R}^l$ is a known measurable function such that $\varphi(\xi_i)=(\varphi_1(\xi_i),\dots,\varphi_l(\xi_i))^t$ has finite variance (see Figure 1). The target parameter of interest, $B\in\mathbb{R}^{p\times l}$, describes the non-linear relation between $\xi_i$ and $\eta_i$. Note that $\varphi$ may also depend on some of the covariates, thereby allowing the introduction of interaction terms.

The latent predictors (ξi) are related to each other and the covariates in a linear structural equation:

$$\xi_i=\widetilde{\alpha}+\widetilde{B}\xi_i+\widetilde{\Gamma}Z_i+\widetilde{\zeta}_i,\tag{2}$$

where the diagonal elements of $\widetilde{B}$ are zero.

Figure 1: Non-linear structural equation model (SEM).

The observed variables $X_i=(X_{i1},\dots,X_{ih})^t$ and $Y_i=(Y_{i1},\dots,Y_{im})^t$ are linked to the latent variables in the two measurement models

$$Y_i=\nu+\Lambda\eta_i+KZ_i+\varepsilon_i,\tag{3}$$
$$X_i=\widetilde{\nu}+\widetilde{\Lambda}\xi_i+\widetilde{K}Z_i+\widetilde{\varepsilon}_i,\tag{4}$$

where the error terms $\varepsilon_i$ and $\widetilde{\varepsilon}_i$ are assumed to be independent with mean zero and covariance matrices $\Omega$ and $\widetilde{\Omega}$, respectively. The parameters are collected into $\theta=(\theta_1^t,\theta_2^t)^t$, where $\theta_1=(\widetilde{\alpha},\widetilde{B},\widetilde{\Gamma},\widetilde{\nu},\widetilde{\Lambda},\widetilde{K},\widetilde{\Omega},\widetilde{\Psi})$ are the parameters of the linear SEM describing the conditional distribution of $X_i$ given $Z_i$. The remaining parameters are collected into $\theta_2$.

Two-stage estimator (2SSEM)

  1. The linear SEM given by equations (2) and (4) is fitted (e.g., by MLE) to $(X_i,Z_i),\ i=1,\dots,n$, to obtain a consistent estimate of $\theta_1$.
  2. The parameter $\theta_2$ is estimated via a linear SEM with measurement model given by equation (3) and structural model given by equation (1), where the non-linear term $\varphi(\xi_i)$ is replaced by the Empirical Bayes plug-in estimate $\varphi_n(\xi_i)=E_{\widehat{\theta}_1}[\varphi(\xi_i)\mid X_i,Z_i]$.
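For a single latent predictor with replicate measurements and a quadratic effect, the two stages can be mimicked in a few lines of base R. This is only an illustrative sketch with made-up parameter values (not the lava implementation): stage 1 is a moment-based fit of the normal posterior of $\xi_i$, and stage 2 plugs the conditional moments into a linear model.

```r
## Stage 1: normal posterior of xi given replicates X and covariate Z.
## Stage 2: linear model with plug-in predictions of xi and xi^2.
set.seed(2)
n <- 5e4; q <- 3
Z  <- rnorm(n)
xi <- 0.3 * Z + rnorm(n)                  # structural model, Var = 1
X  <- replicate(q, xi + rnorm(n))         # q replicate measurements
Y  <- 1 + xi - 0.5 * xi^2 + rnorm(n)      # quadratic outcome model

Xbar <- rowMeans(X)
s2e  <- mean(apply(X, 1, var))            # Var(eps_j), here close to 1
fit1 <- lm(Xbar ~ Z)                      # stage 1 (moment-based)
tau2 <- var(resid(fit1)) - s2e / q        # Var(xi | Z)
lam  <- tau2 / (tau2 + s2e / q)           # shrinkage factor
m    <- fitted(fit1) + lam * resid(fit1)  # E(xi | X, Z)
v    <- tau2 * (1 - lam)                  # Var(xi | X, Z)

fit2 <- lm(Y ~ m + I(m^2 + v))            # stage 2, E(xi^2 | X, Z) = m^2 + v
coef(fit2)                                # approx (1, 1, -0.5)
```

Note that naive use of $m^2$ alone (regression calibration) would bias the intercept; adding the conditional variance $v$ gives the correct plug-in prediction of $\xi_i^2$.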

Prediction of non-linear latent terms

For important classes of non-linear functions $\varphi$, closed-form expressions can be derived for $E[\varphi(\xi_i)\mid X_i,Z_i]$. Under Gaussian assumptions the conditional distribution of $\xi_i$ given $X_i$ and $Z_i$ is normal with mean and variance

$$m_{x,z}=E(\xi_i\mid X_i,Z_i)=\widetilde{\alpha}+\widetilde{\Gamma}Z_i+\Sigma_{X\xi}^t\Sigma_X^{-1}(X_i-\mu_X),$$
$$v=\operatorname{Var}(\xi_i\mid X_i,Z_i)=\widetilde{\Psi}-\Sigma_{X\xi}^t\Sigma_X^{-1}\Sigma_{X\xi},$$

where $\mu_X=\widetilde{\nu}+\widetilde{\Lambda}(I-\widetilde{B})^{-1}\widetilde{\alpha}+\widetilde{\Lambda}(I-\widetilde{B})^{-1}\widetilde{\Gamma}Z_i+\widetilde{K}Z_i$, $\Sigma_X=\widetilde{\Lambda}(I-\widetilde{B})^{-1}\widetilde{\Psi}(I-\widetilde{B})^{-t}\widetilde{\Lambda}^t+\widetilde{\Omega}$, and $\Sigma_{X\xi}=\widetilde{\Lambda}(I-\widetilde{B})^{-1}\widetilde{\Psi}(I-\widetilde{B})^{-t}$. Note that the conditional variance does not depend on $X_i$ or $Z_i$.

Polynomials

$$\eta_i=\alpha+\sum_{m=1}^{k}\beta_m\xi_i^m+\zeta_i,$$

Here $\varphi(\xi_i)=\xi_i^m$ $(m\in\mathbb{N})$, and the conditional moments are given by

$$E(\xi_i^m\mid X_i,Z_i)=\sum_{k=0}^{\lfloor m/2\rfloor} m_{x,z}^{m-2k}\,v^k\,\frac{m!}{2^k\,k!\,(m-2k)!}.$$
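The moment formula is easy to check against numerical integration. A base-R sketch (`moment_poly` is a hypothetical helper name, not part of lava):

```r
## Closed-form E(xi^m | X, Z) as a function of m_{x,z} and v.
moment_poly <- function(m, mxz, v) {
  k <- 0:(m %/% 2)
  sum(mxz^(m - 2 * k) * v^k * factorial(m) /
        (2^k * factorial(k) * factorial(m - 2 * k)))
}

## Check: third moment of a N(0.7, 1.3^2) distribution.
num <- integrate(function(x) x^3 * dnorm(x, 0.7, 1.3), -Inf, Inf)$value
c(closed_form = moment_poly(3, 0.7, 1.3^2), numerical = num)
```
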

Exponentials

For the exponential function $\varphi(\xi_i)=\exp(\xi_i)$, an expression can be obtained by noting that $\exp(\xi_i)$ follows a log-normal distribution with mean

$$E[\exp(\xi_i)\mid X_i,Z_i]=\exp\left(m_{x,z}+\tfrac{1}{2}v\right).$$

The conditional mean of functions of the form $\varphi(\xi_i)=\exp(\xi_i)^m=\exp(m\xi_i)$ is equally straightforward to calculate, as this variable again follows a log-normal distribution.
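A quick Monte Carlo sanity check of the log-normal mean formula (base R, illustrative values):

```r
## E[exp(xi) | X, Z] = exp(m_{x,z} + v/2) when xi | X, Z ~ N(m_{x,z}, v).
set.seed(3)
mxz <- 0.4; v <- 0.6
xi  <- rnorm(1e6, mean = mxz, sd = sqrt(v))
c(closed_form = exp(mxz + v / 2), monte_carlo = mean(exp(xi)))
```
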

Interactions

Product-interaction model

$$\eta_i=\alpha+\beta_1\xi_{1i}+\beta_2\xi_{2i}+\beta_{12}\xi_{1i}\xi_{2i}+\zeta_i,$$

and now $E(\xi_{1i}\xi_{2i}\mid X_i,Z_i)=\operatorname{Cov}(\xi_{1i},\xi_{2i}\mid X_i,Z_i)+E(\xi_{1i}\mid X_i,Z_i)\,E(\xi_{2i}\mid X_i,Z_i)$, where the terms on the right-hand side are directly available from the bivariate normal distribution of $(\xi_{1i},\xi_{2i})$ given $X_i,Z_i$. Regression calibration leads to the correct mean except that the intercept will be biased, as this method does not include the covariance term above.
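The covariance correction can be verified by simulation from a bivariate normal with illustrative posterior moments (base R):

```r
## E(xi1 * xi2) = Cov(xi1, xi2) + E(xi1) E(xi2) for a bivariate normal.
set.seed(4)
m1 <- 0.5; m2 <- -0.2; s1 <- 1; s2 <- 0.8; rho <- 0.3
z1  <- rnorm(1e6)
xi1 <- m1 + s1 * z1                                   # N(m1, s1^2)
xi2 <- m2 + s2 * (rho * z1 + sqrt(1 - rho^2) * rnorm(1e6))
c(closed_form = rho * s1 * s2 + m1 * m2, monte_carlo = mean(xi1 * xi2))
```
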

Splines

A natural cubic spline with $k$ knots $t_1<t_2<\dots<t_k$ is given by

$$\eta_i=\alpha+\beta_0\xi_i+\sum_{j=1}^{k-2}\beta_j f_j(\xi_i)+\zeta_i,$$

with
$$f_j(\xi_i)=g_j(\xi_i)-\frac{t_k-t_j}{t_k-t_{k-1}}\,g_{k-1}(\xi_i)+\frac{t_{k-1}-t_j}{t_k-t_{k-1}}\,g_k(\xi_i),\quad j=1,\dots,k-2,$$
and $g_j(\xi_i)=(\xi_i-t_j)^3\,1_{\{\xi_i>t_j\}}$, $j=1,\dots,k$. Thus, the predictions $E[f_j(\xi_i)\mid X_i,Z_i]$ are linear functions of $E[g_j(\xi_i)\mid X_i,Z_i]$, and

$$E[g_j(\xi_i)\mid X_i,Z_i]=\frac{s}{\sqrt{2\pi}}\left(2s^2+(m_{x,z}-t_j)^2\right)\exp\left(-\left[\frac{m_{x,z}-t_j}{s\sqrt{2}}\right]^2\right)+(m_{x,z}-t_j)\left[(m_{x,z}-t_j)^2+3s^2\right]p_{x,z,j},$$

where $s=\sqrt{v}$ and $p_{x,z,j}=P(\xi_i>t_j\mid X_i,Z_i)=1-\Phi\left(\frac{t_j-m_{x,z}}{s}\right)$.
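The truncated third-moment formula can also be checked against numerical integration (base R; `Eg` is a hypothetical helper name):

```r
## Closed-form E[g_j(xi) | X, Z] with g_j(xi) = (xi - t_j)^3 * 1{xi > t_j}
## for xi | X, Z ~ N(mxz, v).
Eg <- function(t, mxz, v) {
  s <- sqrt(v)
  p <- 1 - pnorm((t - mxz) / s)                 # p_{x,z,j}
  s / sqrt(2 * pi) * (2 * s^2 + (mxz - t)^2) *
    exp(-((mxz - t) / (s * sqrt(2)))^2) +
    (mxz - t) * ((mxz - t)^2 + 3 * s^2) * p
}

num <- integrate(function(x) (x - 0.5)^3 * dnorm(x, 1, 1.2),
                 lower = 0.5, upper = Inf)$value
c(closed_form = Eg(0.5, 1, 1.2^2), numerical = num)
```
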

Relaxing distributional assumptions

Let $G_i\sim\operatorname{multinom}(\pi)$ be a class indicator, $G_i\in\{1,\dots,K\}$, and $\xi_i$ the $q$-dimensional latent predictor from the mixture

$$\xi_i=\sum_{k=1}^{K}I(G_i=k)\,\xi_{ki}$$

where $\xi_{ki}=\widetilde{\alpha}_k+\widetilde{B}\xi_{ki}+\widetilde{\Gamma}Z_i+\widetilde{\zeta}_{ki}$ with $\widetilde{\zeta}_{ki}\sim\mathcal{N}(0,\widetilde{\Psi}_k)$. Assuming $G_i$ to be independent of $(\widetilde{\zeta}_{1i},\dots,\widetilde{\zeta}_{Ki})$ and $(\varepsilon_i,\widetilde{\varepsilon}_i)$,

$$\begin{aligned}E[\varphi(\xi_i)\mid X_i,Z_i]&=E\{E[\varphi(\xi_i)\mid X_i,Z_i,G_i]\mid X_i,Z_i\}\\&=\sum_{k=1}^{K}P(G_i=k\mid X_i,Z_i)\,E[\varphi(\xi_{ki})\mid X_i,Z_i,G_i=k]\\&=\sum_{k=1}^{K}P(G_i=k\mid X_i,Z_i)\,E[\varphi(\xi_{ki})\mid X_i,Z_i].\end{aligned}$$

The results can be extended to the case where $\pi$ depends on covariates. Under regularity conditions (Redner 1984) the MLE of the stage 1 mixture model is regular and asymptotically linear (RAL).
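For a scalar latent predictor and a single proxy $X=\xi+\varepsilon$, the mixture prediction reduces to a posterior-probability-weighted average of class-specific normal posterior means. A base-R sketch (`mixture_pred` and all values are hypothetical illustrations):

```r
## E(xi | X = x) under a K-class normal mixture prior for xi,
## with measurement X = xi + eps, Var(eps) = s2e.
mixture_pred <- function(x, prob, mu, tau2, s2e) {
  ## marginal density of X within class k is N(mu_k, tau2_k + s2e)
  w <- prob * dnorm(x, mu, sqrt(tau2 + s2e))  # prop. to P(G = k | X = x)
  w <- w / sum(w)
  lam <- tau2 / (tau2 + s2e)                  # class-specific shrinkage
  sum(w * (mu + lam * (x - mu)))              # weighted posterior means
}
mixture_pred(1.5, prob = c(0.6, 0.4), mu = c(-1, 2), tau2 = c(1, 1), s2e = 0.5)
```
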

Asymptotics

Theorem. Under a correctly specified model, 2SSEM yields consistent estimates of all parameters $\theta$ except the residual covariance, $\Psi$, of the latent variables in step 2.

Proof (intuitive version, via linear models and Berkson error):

Let $Y=\alpha+\beta X+\epsilon$ with Berkson error $X=W+U$ and $\operatorname{Cov}(W,U)=0$, and consider the regression $Y=\alpha+\beta W+\widetilde{\epsilon}$. Berkson error does not bias $\beta$, but the residual variance is too high. The theorem follows by noting that we have Berkson error in the second step, $\eta_i=\alpha+B\varphi(\xi_i)+\Gamma Z_i+\zeta_i$, since by iterated conditional means $\operatorname{Cov}\{E[\varphi(\xi_i)\mid X_i,Z_i],\,E[\varphi(\xi_i)\mid X_i,Z_i]-\varphi(\xi_i)\}=0$.
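The Berkson-error argument is easy to illustrate by simulation (base R, illustrative values): the slope is recovered, but the residual variance picks up the extra term $\beta^2\operatorname{Var}(U)$.

```r
## Berkson error: regress Y on the predicted exposure W instead of X.
set.seed(5)
n <- 1e5
W <- rnorm(n)                 # "predicted" exposure
U <- rnorm(n, sd = 0.7)       # Berkson error, independent of W
X <- W + U                    # true exposure
Y <- 1 + 2 * X + rnorm(n)     # true slope beta = 2, Var(eps) = 1
fit <- lm(Y ~ W)
c(slope  = unname(coef(fit)["W"]),   # approx 2 (unbiased)
  resvar = summary(fit)$sigma^2,     # approx 1 + 2^2 * 0.7^2 = 2.96, not 1
  target = 1 + 4 * 0.49)
```
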

Assume that we have i.i.d. observations $(Y_i,X_i,Z_i),\ i=1,\dots,n$, and restrict attention to the consistent parameter estimates, i.e., $\theta_2$ does not contain any of the parameters belonging to $\Psi$.

We will assume that the stage 1 model estimator is obtained as the solution to the following score equation:

$$U_1(\theta_1;X,Z)=\sum_{i=1}^{n}U_1(\theta_1;X_i,Z_i)=0,$$

and that

  1. The estimator of the stage 1 model is consistent, regular, and asymptotically linear.
  2. $U_2$ is twice continuously differentiable in a neighbourhood of the true (limiting) parameters $(\theta_{01}^t,\theta_{02}^t)^t$. Further, $n^{-1}\sum_{i=1}^{n}U_2(Y_i,X_i,Z_i;\theta_1,\theta_2)$ converges uniformly to $E[U_2(Y_i,X_i,Z_i;\theta_1,\theta_2)]$ in a neighbourhood of $(\theta_{01}^t,\theta_{02}^t)^t$,
  3. and, when evaluated at this point, $E[\nabla_{\theta_2}U_2]$ is positive definite.

This implies the following i.i.d. decomposition of the two-stage estimator

$$\begin{aligned}\sqrt{n}\,(\widehat{\theta}_2(\widehat{\theta}_1)-\theta_2)&=n^{-1/2}\sum_{i=1}^{n}\xi_2(Y_i,X_i,Z_i;\theta_2)\\&\quad+E[\nabla_{\theta_2}U_2(Y,X,Z;\theta_2,\theta_1)]^{-1}\,E[\nabla_{\theta_1}U_2(Y,X,Z;\theta_2,\theta_1)]\;n^{-1/2}\sum_{i=1}^{n}\xi_1(X_i,Z_i;\theta_1)+o_p(1)\\&=n^{-1/2}\sum_{i=1}^{n}\xi_3(Y_i,X_i,Z_i;\theta_2,\theta_1)+o_p(1),\end{aligned}$$

where $\xi_1$ and $\xi_2$ are the influence functions from the stage 1 and stage 2 models.

Implementation

The estimator is implemented in the lava R package. Loading the required packages:

  library(lava)
  library(magrittr)

Simulation

      f <- function(x) cos(1.25*x) + x - 0.25*x^2
      m <- lvm(x1+x2+x3 ~ eta1,
               y1+y2+y3 ~ eta2,
               eta1+eta2 ~ z,
               latent=~eta1+eta2)
      functional(m, eta2~eta1) <- f

      # Default: all parameter values are 1. Here we change the covariate effect on eta1
      d <- sim(m, n=200, seed=1, p=c('eta1~z'=-1))
      head(d)
          x1         x2         x3       eta1         y1         y2         y3
1 -1.3117148 -0.2758592  0.3891800 -0.6852610  0.6565382  2.8784121  0.1864112
2  1.6733480  3.1785780  3.3853595  1.4897047 -0.9867733  1.9512415  2.7624733
3  0.5661292  2.9883463  0.7987605  1.4017578  0.5694039 -1.2966555 -2.2827075
4  1.7946719 -0.1315167 -0.1914767  0.1993911  0.7221921  0.9447854 -1.3720646
5  0.3702222 -2.2445211 -0.3755076  0.0407144 -0.3144152  0.3546089  0.9828617
6 -2.8786355  0.4394945 -2.4338245 -2.0581671 -4.1310534 -5.6157543 -3.0456611
        eta2           z
1  1.7434470  0.34419403
2  0.8393097  0.01271984
3 -0.4258779 -0.87345013
4  0.7340538  0.34280028
5  0.2852132 -0.17738775
6 -3.9531055  0.92143325

Specification of stage 1

    m1 <- lvm(x1+x2+x3 ~ eta1, eta1 ~ z, latent = ~ eta1)

Specification of stage 2:

      m2 <- lvm() %>%
	  regression(y1+y2+y3 ~ eta2) %>%
	  regression(eta2 ~ z) %>%
	  latent(~ eta2) %>%
	  nonlinear(type="quadratic", eta2 ~ eta1)

Estimation:

(mm <- twostage(m1, m2, data=d))
                    Estimate Std. Error  Z-value   P-value
Measurements:
   y2~eta2           0.98628    0.04612 21.38536    <1e-12
   y3~eta2           0.97614    0.05203 18.76166    <1e-12
Regressions:
   eta2~z            1.08890    0.17735  6.13990 8.257e-10
   eta2~eta1_1       1.13932    0.15548  7.32770    <1e-12
   eta2~eta1_2      -0.38404    0.05770 -6.65582 2.817e-11
Intercepts:
   y2               -0.09581    0.11350 -0.84413    0.3986
   y3                0.01440    0.10849  0.13273    0.8944
   eta2              0.50414    0.17550  2.87264  0.004071
Residual Variances:
   y1                1.27777    0.18980  6.73206
   y2                1.02924    0.13986  7.35895
   y3                0.82589    0.14089  5.86181
   eta2              1.94918    0.26911  7.24305
  pf <- function(p) p["eta2"]+p["eta2~eta1_1"]*u + p["eta2~eta1_2"]*u^2
  plot(mm,f=pf,data=data.frame(u=seq(-2,2,length.out=100)),lwd=2)