Thanks to CIDRAP for alerting us to this important study in PNAS: Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys. Excerpt, with my bolding:
Significance
The novel coronavirus SARS-CoV-2 has infected over 33 million people in the United States. Nationwide, over 600,000 have died in the COVID-19 pandemic, which has necessitated shutdowns of schools and sectors of the economy. The extent of the virus’ spread remains uncertain due to biases in test data.
We combine multiple data sources to estimate the true number of infections in all US states. These data include representative random testing surveys from Indiana and Ohio, which provide potentially unbiased prevalence estimates.
We find that approximately 60% of infections have gone unreported. Even so, only about 20% of the United States had been infected as of early March 2021, suggesting that the country was far from herd immunity at that point.
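As a rough illustration of how the two figures above relate, the arithmetic can be sketched as follows. This is not the authors' calculation. The population figure of 331 million is an assumption (it is not in the excerpt), and the function names are hypothetical.

```python
# Back-of-envelope arithmetic relating the unreported share to infection counts.
# ASSUMPTION: a US population of about 331 million (not stated in the excerpt).
US_POPULATION = 331_000_000


def implied_true_infections(reported, unreported_share):
    """Scale reported infections up by the share that went unreported."""
    if not 0 <= unreported_share < 1:
        raise ValueError("unreported_share must be in [0, 1)")
    return reported / (1 - unreported_share)


def infections_from_share(infected_share, population):
    """Convert a fraction of the population infected into a head count."""
    return infected_share * population


# About 20% infected as of early March 2021, per the excerpt.
total_infected = infections_from_share(0.20, US_POPULATION)
# If about 60% went unreported, about 40% of those would have been reported.
implied_reported = total_infected * (1 - 0.60)
print(f"implied total infections: {total_infected:,.0f}")
print(f"implied reported infections: {implied_reported:,.0f}")
```

Under these assumptions, 20% of the population corresponds to roughly 66 million infections, and a 60% unreported share means each reported infection stands for about 2.5 actual ones.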
Abstract
There are multiple sources of data giving information about the number of SARS-CoV-2 infections in the population, but all have major drawbacks, including biases and delayed reporting. For example, the number of confirmed cases largely underestimates the number of infections, and deaths lag infections substantially, while test positivity rates tend to greatly overestimate prevalence. Representative random prevalence surveys, the only putatively unbiased source, are sparse in time and space, and the results can come with big delays. Reliable estimates of population prevalence are necessary for understanding the spread of the virus and the effectiveness of mitigation strategies.
We develop a simple Bayesian framework to estimate viral prevalence by combining several of the main available data sources. It is based on a discrete-time Susceptible–Infected–Removed (SIR) model with time-varying reproductive parameter. Our model includes likelihood components that incorporate data on deaths due to the virus, confirmed cases, and the number of tests administered on each day.
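To make the idea of a discrete-time SIR model with a time-varying reproductive parameter concrete, here is a minimal deterministic sketch. It is not the authors' model, which is Bayesian and includes likelihood components for deaths, cases, and tests. The removal rate `gamma`, the function name, and the mapping from r(t) to a daily transmission rate are all assumptions made for illustration.

```python
def simulate_sir(population, initial_infected, r_t, gamma=0.1):
    """Deterministic discrete-time SIR with a daily reproductive number.

    ASSUMPTIONS (not from the paper): a fixed removal rate `gamma` per day,
    and a daily transmission rate of beta_t = r_t * gamma.
    Returns the daily new infections and the final (S, I, R) state.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    new_infections = []
    for rt in r_t:
        beta = rt * gamma
        # New infections cannot exceed the remaining susceptibles.
        new_inf = min(s, beta * s * i / population)
        removed = gamma * i
        s -= new_inf
        i += new_inf - removed
        r += removed
        new_infections.append(new_inf)
    return new_infections, (s, i, r)


# Hypothetical scenario: 30 days with r(t) = 2.0, then 30 days with r(t) = 0.5.
daily, (s_end, i_end, r_end) = simulate_sir(1_000_000, 100, [2.0] * 30 + [0.5] * 30)
print(f"total infections over 60 days: {sum(daily):,.0f}")
```

In this sketch new infections grow while r(t) is above 1 and shrink once it drops below 1, which is the behavior a time-varying reproductive parameter is meant to capture.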
We anchor our inference with data from random-sample testing surveys in Indiana and Ohio. We use the results from these two states to calibrate the model on positive test counts and proceed to estimate the infection fatality rate and the number of new infections on each day in each state in the United States.
We estimate the extent to which reported COVID cases have underestimated true infection counts, which was large, especially in the first months of the pandemic. We explore the implications of our results for progress toward herd immunity.
Discussion
To craft and implement effective policy and mitigation strategies, policymakers need reliable assessments of the impact of previous nonpharmaceutical interventions on the transmission rate of the disease. We have developed a simple Bayesian model of the dynamics of SARS-CoV-2 transmission incorporating readily available time-series data tracking the virus, as well as statewide representative point-prevalence surveys conducted in Indiana and Ohio, which are the highest-quality random testing surveys carried out to date. We present estimates of the IFR and the time-varying viral prevalence and reproductive number r(t) in each US state on each day.
Our results indicate that a large majority of COVID infections go unreported. Even so, we find that the United States was still far from reaching herd immunity to the virus in early March 2021 from infections alone. This suggests that continued mitigation and an aggressive vaccination effort are necessary to surpass the herd-immunity threshold without incurring many more deaths due to the disease. This work demonstrates the value of random-sample testing in response to this and future pandemics.
By incorporating testing and case data aggregated over any period of time, our additive model for positive tests in Eq. 2 allows us to avoid using data at the daily level, which can be very unreliable. For example, the reported cumulative number of tests administered in a state may not be updated for up to 2 wk at a time, or it may decrease from one day to the next as data are deduplicated upon further review. The latter scenario frequently occurs with reported cases as well. Working with data at the daily level generally requires using some kind of moving average, which washes out stochasticity in the data and leads to oversmoothing inconsistent with the high overdispersion of SARS-CoV-2 transmission.
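The benefit of aggregating over periods can be illustrated with a small sketch. This is not the paper's Eq. 2, which is not reproduced in the excerpt. It only shows how differencing a cumulative series at period endpoints sidesteps stale or downward-revised daily values. The function name, the seven-day period, and the sample numbers are hypothetical.

```python
def period_totals(cumulative, period_days=7):
    """Totals per period, taken as differences of the cumulative count
    at period endpoints. ASSUMPTION: the count is zero before day 0."""
    totals = []
    prev = 0
    for end in range(period_days - 1, len(cumulative), period_days):
        totals.append(cumulative[end] - prev)
        prev = cumulative[end]
    return totals


# Hypothetical cumulative test counts: stale for several days at a time,
# with one downward revision (40 -> 38) after deduplication.
cumulative = [10, 10, 10, 40, 38, 50, 70, 70, 70, 70, 70, 70, 70, 140]

# Daily differences include zeros and a negative value.
daily = [b - a for a, b in zip([0] + cumulative[:-1], cumulative)]
print(daily)

# Weekly totals are unaffected by the irregularities within each week.
print(period_totals(cumulative))  # [70, 70]
```

The daily series here contains a negative count and long runs of zeros, while the weekly totals are well behaved without any moving-average smoothing.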
Our inference relies on daily reported deaths due to COVID in each state, as opposed to excess deaths. Because of the possibility of death misclassification, excess-death data represent a mix of confirmed COVID deaths and deaths from other causes. Nevertheless, relying on reported deaths is a potential source of bias, as they are affected by the accuracy of cause-of-death determinations. Their numbers can fall significantly below excess-death counts and may undershoot the true number of deaths due to the disease. Ascertainment of COVID deaths may vary between states, with the cumulative excess-death count since the start of the pandemic exceeding reported COVID deaths by upwards of 50% in some states, according to a New York Times analysis of Centers for Disease Control and Prevention (CDC) mortality data. Consequently, our results may underestimate viral incidence in those states.
The CDC estimated a total of 83 million infections in the United States through December 2020, which is substantially larger than our estimate of 50 million infections in that period. Their numbers are based on the work of Reese et al., who infer COVID incidence in the United States using a multiplier model to account for underdetection in the number of confirmed cases. Beyond the limitations of our study discussed above, there are a few possible explanations for the difference in our estimates. Reese et al. base their estimates on nationally reported laboratory-confirmed cases, which do not constitute a probabilistic sample of the population. To this point, the authors remark that “…some infections, such as those among healthcare workers or from outbreaks in congregate residential settings, may be more likely to be tested and nationally reported compared with the general population, and could overestimate nonhospitalized cases and infections.”
Furthermore, the multiplier in their model relies on documented rates of test administration and care-seeking among symptomatic COVID patients. Reese et al. note that data on rates of test administration in this group are limited, especially at the local level. As such, Reese et al. do not account for geographic variation in testing, which is a potential source of bias.