Comparison of models for under-reporting in Covid-19 and lung cancer cases

In the US by State

May 2, 2024


Project overview

  • Compare how models perform on potentially under-reported data.
    • Data are under-reported when the observed value falls short of the true value because of some form of measurement error.
  • Under-reported data can bias analyses and the decisions that follow from them.
  • The project compares three models on two responses: Covid-19 and lung cancer counts.

Covid count data

Covid-19 count data is likely under-reported for many reasons:

  • The disease manifests at different severity levels,
  • Diagnostic challenges,
  • Stigma or social implications,
  • Some people are apprehensive about getting tested,
  • etc.

Under-reported data can bias estimates downward and can be thought of as unintentional missing data.
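To make the direction of the bias concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that each true case is reported independently with a fixed probability. The true counts and the reporting probability of 0.8 are made-up values, not estimates from this project.

```python
import random

def thin_counts(true_counts, report_prob, rng):
    """Keep each true case independently with probability report_prob."""
    return [sum(rng.random() < report_prob for _ in range(n)) for n in true_counts]

rng = random.Random(1)
true_counts = [1200, 850, 4300, 560]   # hypothetical true case counts
report_prob = 0.8                      # hypothetical reporting probability

observed = thin_counts(true_counts, report_prob, rng)
ratio = sum(observed) / sum(true_counts)   # share of true cases that were observed
print(observed, round(ratio, 3))
```

Every observed count is at most its true count, so any total or rate computed from the observed data is pulled below the truth by roughly the reporting probability.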

Lung cancer count data

Lung cancer, as a more serious disease, may not be as under-reported and might not benefit from an under-reporting model.

It is reasonable to assume there is low under-reporting of lung cancer in the US:

  • Serious illness,
  • Robust disease surveillance infrastructure, and
  • Test maturity.

Exploratory Data Analysis

Covid data

  • Covid-19 counts for the lower 48 states and DC from April 2020
  • 23 variables
  • Response: Positive cases (count)
  • Spatial Component: State (lattice)

Distribution of response

Summary statistics for variables
Characteristic        N = 49¹
Positive tests        7,562 (3,618, 21,742)
Total tests           81,465 (42,667, 161,181)
Testing Rate          0.018 (0.015, 0.027)
Population Density    106 (52, 231)
Air Pollution         7.40 (6.80, 8.20)
Obesity               30.9 (28.7, 34.4)
Smoking               16.10 (14.50, 19.00)
Excessive Drinking    18.20 (16.40, 19.40)

¹ Median (IQR)

Non-exhaustive list of variables and summary statistics.

Lung cancer data

  • Same covariates, new response variable.
  • Nevada and Indiana did not meet the U.S. Cancer Statistics (USCS) publication criteria





Models

  • Non-spatial naive model
  • Spatial model
  • Under-reported spatial hierarchical model
  • All models implemented in Nimble (de Valpine et al. 2017)

Model comparison methods

  • Watanabe-Akaike information criterion (WAIC)
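As a reference point for how WAIC is usually computed from posterior draws, the sketch below uses the common definition WAIC = -2 (lppd - p_WAIC), with p_WAIC taken as the sum of the posterior variances of the pointwise log-likelihoods. The function name and the `log_lik[s][i]` layout are assumptions made for illustration; this is not the Nimble code used in the project.

```python
import math
from statistics import fmean, variance

def waic(log_lik):
    """WAIC from log_lik[s][i]: log-likelihood of observation i under posterior draw s."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0     # log pointwise predictive density
    p_waic = 0.0   # effective number of parameters
    for i in range(n_obs):
        col = [log_lik[s][i] for s in range(n_draws)]
        m = max(col)
        # log of the posterior-mean likelihood, computed with log-sum-exp for stability
        lppd += m + math.log(fmean([math.exp(v - m) for v in col]))
        # sample variance of the log-likelihood across draws
        p_waic += variance(col)
    return -2.0 * (lppd - p_waic)

# Tiny made-up example: two draws, two observations
print(waic([[-1.0, -2.0], [-1.2, -1.8]]))
```

Lower values indicate better expected out-of-sample fit, which is how the WAIC values in the results are read.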

Non-spatial model (naive)

  • Regression model that ignores the spatial component
  • Multivariable Poisson regression model
  • Covariates chosen by model selection
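For reference, a non-spatial Poisson regression with a log link has a simple log-likelihood. The sketch below evaluates it for given coefficients; the function name, argument layout, and toy numbers are assumptions for illustration and do not reproduce the project's fitted model.

```python
import math

def poisson_loglik(y, X, alpha, beta):
    """Log-likelihood of counts y under log(lambda_i) = alpha + sum_k beta_k * X[i][k]."""
    total = 0.0
    for y_i, x_i in zip(y, X):
        eta = alpha + sum(b * x for b, x in zip(beta, x_i))  # linear predictor
        lam = math.exp(eta)                                   # log link: lambda = exp(eta)
        total += y_i * eta - lam - math.lgamma(y_i + 1)       # log Poisson pmf
    return total

# Hypothetical data: three areas, two covariates
y = [3, 0, 7]
X = [[0.1, 1.0], [-0.5, 0.2], [0.8, 1.5]]
print(poisson_loglik(y, X, alpha=0.5, beta=[0.4, 0.6]))
```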

Spatial model

  • Spatial Poisson regression
    • Using a log link on the Poisson mean
    • ICAR normal prior on structured spatial effects

\[\begin{align*} y_i \sim \text{Poisson}(&\lambda_i) \\ &\downarrow \\ \log(&\lambda_i) = \alpha + \sum_{k=1}^{8} \beta_k x_{ik} + \phi_i \\ &\phi_i \sim \text{CAR}(0, \tau) \end{align*}\]

Here \(i\) indexes states and \(k\) indexes the eight covariates.
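In its common pairwise-difference form, the ICAR prior penalizes differences between neighboring areas: up to an additive term that does not depend on \(\phi\), the log density is \(-\tfrac{\tau}{2}\sum_{i \sim j} (\phi_i - \phi_j)^2\). The sketch below evaluates that kernel on a hypothetical four-area lattice; the edge list and effect values are invented for illustration.

```python
def icar_log_kernel(phi, edges, tau):
    """Unnormalized ICAR log density: -(tau / 2) * sum of squared neighbor differences."""
    return -0.5 * tau * sum((phi[i] - phi[j]) ** 2 for i, j in edges)

# Hypothetical lattice of four areas in a line: 0 - 1 - 2 - 3 (each edge listed once)
edges = [(0, 1), (1, 2), (2, 3)]
smooth = [0.1, 0.2, 0.3, 0.4]    # neighboring effects are similar
rough = [0.4, -0.3, 0.5, -0.2]   # neighboring effects differ sharply

smooth_lp = icar_log_kernel(smooth, edges, tau=1.0)
rough_lp = icar_log_kernel(rough, edges, tau=1.0)
print(smooth_lp, rough_lp)
```

The spatially smooth vector receives the higher log density, which is the sense in which the prior encourages neighboring states to share similar effects.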

Under-reporting hierarchical model

Hierarchical model

Let \(z_s\) be the observed (under-reported) counts, \(y_s\) be the true unknown counts, \(\pi_s\) be the reporting probability (the share of true cases that are observed), and \(\lambda_s\) be the Poisson mean.

The hierarchical model can be written as \[\begin{align*} z_{s} | y_{s} \sim \text{Binomial}(\pi_s, &y_{s}) \\ &\downarrow \\ &y_{s} \sim \text{Poisson}(\lambda_{s}) \end{align*}\] where \(\pi_s\) is modeled through a logit link and \(\lambda_s\) through a log link on their respective linear predictors.
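The two-level structure can be read as a generative recipe: draw the true count from a Poisson, then thin it with a Binomial. The sketch below simulates that recipe for a single pair of linear-predictor values. The helper names and the values log(20) and 0 are assumptions chosen for illustration, and the Poisson sampler is a simple textbook method suited only to small means.

```python
import math
import random

def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_area(eta_lambda, eta_pi, rng):
    """Draw (true count y, observed count z) for one area."""
    lam = math.exp(eta_lambda)                     # log link for the Poisson mean
    pi = inv_logit(eta_pi)                         # logit link for the reporting probability
    y = sample_poisson(lam, rng)                   # y ~ Poisson(lambda)
    z = sum(rng.random() < pi for _ in range(y))   # z | y ~ Binomial(pi, y)
    return y, z

rng = random.Random(42)
draws = [simulate_area(math.log(20.0), 0.0, rng) for _ in range(2000)]
mean_y = sum(y for y, _ in draws) / len(draws)
mean_z = sum(z for _, z in draws) / len(draws)
print(round(mean_y, 2), round(mean_z, 2))
```

With these hypothetical values, \(\lambda_s = 20\) and \(\pi_s = 0.5\), so the observed counts average about half of the true counts. Marginally \(z_s\) is Poisson with mean \(\pi_s \lambda_s\), which is why separating the two parameters relies on covariates or prior information rather than on the counts alone.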


Simple model

  • Bayesian Poisson regression
  • Model selected by smallest WAIC
  • Covid WAIC: 1,962,286.00
    • Model includes: uninsured, smoking, and unemployment
  • Lung cancer WAIC: 106,727.2
    • Model includes: unemployment, population density, uninsured, air pollution, and drug deaths

Spatial model

  • Covid WAIC: 628.1434
  • Lung cancer WAIC: 539.3873

Hierarchical model (Covid)

  • WAIC: 616.0229
  • Estimated cases at 5%, 50%, and 95% quantiles of under-reporting

Model-estimated total Covid cases

Under-reporting    Cases
Observed           1,071,003.00
Predicted 5%       1,127,881.50
Predicted 50%      1,237,461.00
Predicted 95%      1,458,510.20
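The totals in this table can be turned into two quantities that are easier to interpret: the share of cases implied beyond those observed, and the implied share of cases that were reported. The sketch below does that arithmetic on the table values; the function name is hypothetical.

```python
def implied_reporting(observed, predicted):
    """Return {label: (excess share over observed, implied reported share)}."""
    return {
        label: (total / observed - 1.0, observed / total)
        for label, total in predicted.items()
    }

observed = 1_071_003.00
predicted = {"5%": 1_127_881.50, "50%": 1_237_461.00, "95%": 1_458_510.20}

summary = implied_reporting(observed, predicted)
for label, (excess, reported) in summary.items():
    print(f"{label}: {excess:.1%} more cases than observed, {reported:.1%} reported")
```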

Hierarchical model (Lung cancer)

  • WAIC: 535.589
  • Estimated cases at 5%, 50%, and 95% quantiles of under-reporting
    • A 75% quantile is used for state-level counts because of their increased variance

Model-estimated total lung cancer cases

Under-reporting    Cases
Observed           196,370.00
Predicted 5%       195,568.65
Predicted 50%      196,374.75
Predicted 95%      197,219.53

Model comparison

WAIC for all models of Covid and lung cancer cases

Response    Method             WAIC
Covid       Simple             1,962,286.00
Covid       Spatial            628.14
Covid       Under-reporting    616.02
Cancer      Simple             106,727.20
Cancer      Spatial            539.39
Cancer      Under-reporting    535.59
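Because WAIC is only meaningful in relative terms, it can help to express each model as a difference from the best (lowest) score within its response. The sketch below does this for the values in the table; the dictionary layout and function name are assumptions for illustration.

```python
waic_table = {
    "Covid": {"Simple": 1_962_286.00, "Spatial": 628.14, "Under-reporting": 616.02},
    "Cancer": {"Simple": 106_727.20, "Spatial": 539.39, "Under-reporting": 535.59},
}

def delta_waic(scores):
    """Difference between each model's WAIC and the lowest WAIC in the group."""
    best = min(scores.values())
    return {name: round(value - best, 2) for name, value in scores.items()}

deltas = {response: delta_waic(scores) for response, scores in waic_table.items()}
for response, scores in deltas.items():
    print(response, scores)
```

For both responses the under-reporting model has the lowest WAIC, and its gap to the spatial model is larger for Covid than for lung cancer.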

Thank You!


de Valpine, P., D. Turek, C. J. Paciorek, C. Anderson-Bergman, D. Temple Lang, and R. Bodik. 2017. “Programming with Models: Writing Statistical Algorithms for General Model Structures with NIMBLE.” Journal of Computational and Graphical Statistics 26: 403–17.
Stoner, Oliver, Theo Economou, and Gabriela Drummond Marques da Silva. 2019. “A Hierarchical Framework for Correcting Under-Reporting in Count Data.” Journal of the American Statistical Association 114 (528): 1481–92.

Extra slides

Spatial effects (Covid)

Spatial effects (Cancer)

Hierarchical trace plots (Covid)

Hierarchical trace plots (Cancer)

Spatial model (Covid)

Spatial model (Cancer)