Missing Data

How we can know, plan, and act in the face of incomplete information

Sep 12, 2021

Covid data are never perfect. Georgia at least has a nationwide reporting system - unlike the US, which has a patchwork of state reporting resulting in maps like this one, from the New York Times:

Covid in the U.S.: Latest Map and Case Count - The New York Times

Note the giant hole in the middle of the country where Nebraska should be. Nebraska does not report covid case numbers the way other states do. The Nebraskan government claims this is due to privacy concerns - in counties with 20,000 people or fewer, reporting case counts could theoretically allow analysts to discover specifically which people have had covid, or something. Critics say this is nonsense. In any case, the result is that - best case - you have maps like this, with a giant void in America’s heartland. Worst case, you get maps which mistake absence of evidence for evidence of absence, and just falsely report these counties as having low or no covid transmission.

However, Georgia (the country) has had a few data hiccups, and this week we saw one of them. On Tuesday, we were expecting to see about 3600 cases - per my projection, based on my exponential decay model; Georgian NCDC head Amiran Gamkrelidze said they were expecting 3500 - 3700 cases based on their projections. Instead we saw 1965. Gamkrelidze explained that this wasn’t a “real” number and that there was a backlog of cases due to technical problems.

On Wednesday I thought if the backlog cleared we’d see about 4000 - 4500 cases; instead we got 2571. This was a feasible number, but lower than expected. The rest of the week proceeded normally, with 2455 cases on Thursday, 2262 on Friday, and 2533 on Saturday.

Except, on Saturday, the numbers didn’t add up.

I entered the official case total into my spreadsheet - 579031 - and Excel calculated the new cases by subtracting the previous day’s total - 575210. The difference was 3821. As I said, the official new case count said 2533. The discrepancy was 1288 cases.

Could I have made some kind of data entry error? I could understand getting a single digit wrong - or transposing two digits - but what kind of typo would result in a discrepancy of 1288 cases?

I went back to the previous day’s data on the NCDC website and found this:

The first bullet is the total number of cases as of September 10th - 576498. The second bullet is the number of new cases - 2262. Could I really have entered 575210 instead of 576498 by mistake? That’s clearly not a transcription error. Where did that number come from?

I went back another day:

Okay, so on the 9th we had 572948 cases total. Add 2262 and you get - 575210! Wait - did the NCDC just quietly change their case count at some point after I’d entered the data into my spreadsheet on Friday?

Here’s a news report from Friday (via 1tv.ge and Google Translate):

Yup - the news reported 575,210 as well, with a timestamp of 10:46 am.

So it looks like what happened was that 1288 cases were added to the official total at some point late in the day on Friday, September 10th. The new case count and regional case counts were not updated to reflect these new cases. The positive test rates for the last 7-day and 14-day periods were also not updated - the reported 7-day rate is about half a percentage point lower than it should be. That is, unless these new cases were also not counted in the total number of tests performed? This whole thing is a mess.
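The reconciliation described above can be sketched in a few lines of Python. This is only an illustration using the totals quoted in this post; the function and variable names are hypothetical, and it says nothing about how the NCDC actually maintains its figures.

```python
# Cumulative totals and daily counts quoted in this post (names are illustrative).
TOTAL_SEP_9 = 572948
TOTAL_SEP_10_MORNING = 575210   # as first published and reported by 1tv.ge
TOTAL_SEP_10_REVISED = 576498   # as shown later on the NCDC site
TOTAL_SEP_11 = 579031
REPORTED_NEW_SEP_10 = 2262
REPORTED_NEW_SEP_11 = 2533


def unexplained_cases(previous_total, current_total, reported_new):
    """Cases implied by the change in cumulative total but not covered by the reported daily count."""
    return (current_total - previous_total) - reported_new


# Friday's original total is consistent with Friday's reported new cases.
friday_gap = unexplained_cases(TOTAL_SEP_9, TOTAL_SEP_10_MORNING, REPORTED_NEW_SEP_10)

# Saturday's total does not match when measured from Friday's original total...
saturday_gap = unexplained_cases(TOTAL_SEP_10_MORNING, TOTAL_SEP_11, REPORTED_NEW_SEP_11)

# ...but it does match when measured from Friday's revised total.
saturday_gap_after_revision = unexplained_cases(
    TOTAL_SEP_10_REVISED, TOTAL_SEP_11, REPORTED_NEW_SEP_11
)

# The size of the quiet revision to Friday's total.
revision = TOTAL_SEP_10_REVISED - TOTAL_SEP_10_MORNING

print(friday_gap, saturday_gap, saturday_gap_after_revision, revision)
# 0 1288 0 1288
```

The unexplained Saturday gap and the size of Friday's revision are the same number, which is what points to a silent change in the cumulative total rather than a typo.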

Anyway, I do not have official confirmation that these 1288 cases of mysterious provenance were in fact the missing cases from Tuesday. The circumstantial evidence strongly suggests they are. Adding the 1288 to Tuesday’s total gives the weekly case curve exactly the shape I was expecting it to have based on reporting patterns in previous weeks: 3253 cases on Tuesday being very slightly under projection. My r calculation (the ratio I use in my exponential model) hovers between .75 and .7 for the week, which again is on the low end of my expectations but not remarkably so; without the missing 1288 cases, I would have been writing a post trying to explain why r had dropped to .65 for two days before rebounding to .74. So assigning these 1288 cases to Tuesday makes the week fit projections and expectations based on previous data to such an extent that I’m comfortable making the assumption that these are the missing cases from Tuesday, unless some official comes out and states otherwise.

Missing Data in Policy Debates

Two weeks ago, I constructed three different projections and assigned probabilities to the four scenarios bounded by them:

I wrote the following about the yellow, 20% scenario:

In order to drop below the yellow curve, either restrictions would need to be tightened, or we’d need more immunity than my estimates, or there could be some kind of seasonal effect (e.g. people go outside a lot because the weather is cooling off), or some other kind of population dynamic effect, like the one I hypothesized might be generally driving down cases after peaks, would need to take hold. This is slightly more likely than the above, so I’ll call it 20%.

It turns out we landed squarely in this zone. We did see some slight tightening and extension of restrictions - restaurant closing times were moved earlier, the public transit shutdown was extended - and this probably contributed to r falling below where I’d expected it to fall. As of September 10th, we’re here:

I consider this good news: restrictions were tightened and extended, and it looks to me like the result was that our pandemic numbers ended up in a more favorable scenario than I would have expected without these interventions. Of course this could be a giant lucky coincidence, and maybe we’d have the same falling numbers without the restrictions. But to me it looks like what we’re doing is working, and we’re better off than we’d be without these interventions.

Others are not so happy with this scenario. Several political figures have specifically complained that the public transit closures disproportionately affect poor urban residents who cannot get to work without public transportation. This is true, and I have long supported government assistance to the poor to compensate for the impact of lockdowns or restrictions. However, we continue to face the dilemma where on the one hand we want measures that save lives and preserve the capacity of the medical system to treat patients, and on the other hand we want measures that are economically and socially painless. We want to have our cake and eat it too.

I think it’s unfortunate that public health officials are criticized no matter what - early on they were criticized for not doing enough; later they were criticized for doing too much. People never seem to consider that political pressure against life-saving restrictions might cause officials to delay those restrictions until the country has literally the worst outbreak in the world. The higher the political cost of interventions, the more likely we are to see delayed action in the future. I wish we had a public that would have embraced reasonable interventions in May or June rather than a public that complained bitterly about every intervention for the last year, celebrated the end of curfew with massive gatherings, acted like they were invulnerable while cases steadily climbed all summer, woke up on August 21st and said “holy crap, 74 deaths - is anyone doing anything about this???”, and are now back to complaining bitterly about restrictions as though they are incapable of drawing inferences about causal relationships between past and present, or present and future.

Of course, the government could have - and should have - done better. But I think the substance and timing of the criticism matters. There are people claiming that transit closures are “pointless” without offering any alternative proposals at all. Like - okay, transit closures are painful for the urban poor; what should we do instead? What would you have done in the first two weeks of August when Georgia had more cases per capita than any other country on Earth? If you can’t answer that, your criticism is not constructive. One person told me they would have done stricter lockdowns, sooner. Great! I agree with that. The public would have been dead set against it, though, and it’s likely that everyone who is criticizing Georgian Dream now for its late and light restrictions would have been criticizing you twice as much for your early, heavier restrictions.

I’m all for public debate, but the terms of the debate should be honest and clear about the underlying reality. Our trajectory before transit closures was uncontrolled exponential growth. Since transit closures we’ve had a mostly steady decrease in r, although towards the end of this week it looks like it may be levelling off at around .74. I think if someone is going to say this is a big coincidence and our interventions aren’t actually doing anything, the burden is on them to overcome the circumstantial evidence that the interventions are working. And I think that policy criticisms that don’t acknowledge the tradeoffs and political costs of various policies are fundamentally dishonest.

In other words, there’s data here to be explained, and if your policy position relies on ignoring that data, it’s not a policy - it’s a fantasy.

Opening Schools: The Missing Benchmark

I suspect, once transit reopens tomorrow, we’ll see another uptick by the end of next week, and we might be looking at going back up to .8 or more - although, as I’ve previously noted, it does sort of seem like lower transmission rates are “stickier” than high rates - my impression is that high rates seem more likely to spontaneously drop than low rates are to spontaneously increase.

So let’s imagine, for a moment, that the current weekly ratio - .738 - holds for the next three weeks. As of today, we have had 16032 cases reported nationwide in the last 7 days. Multiply that by .738 once per week and we get 11832 cases in the 7 days ending September 19th, 8732 in the 7 days ending September 26th, and 6444 in the 7 days ending October 3rd.
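The arithmetic above can be sketched in Python. This is a minimal illustration that assumes, as the paragraph does, a constant weekly ratio; the function name and its parameters are hypothetical.

```python
from datetime import date, timedelta


def project_weekly_cases(start_cases, weekly_ratio, start_date, weeks):
    """Return (week_ending_date, projected_7_day_cases) pairs for a constant weekly ratio."""
    projections = []
    cases = start_cases
    for week in range(1, weeks + 1):
        cases *= weekly_ratio  # each week's 7-day total is the previous one times the ratio
        projections.append((start_date + timedelta(weeks=week), round(cases)))
    return projections


# 16032 cases in the 7 days ending September 12th, weekly ratio .738, three weeks ahead.
projection = project_weekly_cases(16032, 0.738, date(2021, 9, 12), 3)
for week_ending, cases in projection:
    print(week_ending.isoformat(), cases)
# 2021-09-19 11832
# 2021-09-26 8732
# 2021-10-03 6444
```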

Georgian officials have issued two benchmarks for opening schools. One is that 80% of teachers and staff should be vaccinated. The other is that community transmission rates should be low before schools are opened, as indicated by a positive test rate of 4% or lower. As stated in my previous covid post, my estimate is that a 4% positive test rate would correspond to about 7000 cases per week nationwide. So based on the very simple projection above, we would not hit this benchmark by September 26th, but we would hit the benchmark by October 3rd.

Schools are scheduled to open for in-person instruction on October 4th. On the surface, at least, it looks like the government has the timing almost exactly right. If the weekly ratio creeps back up to .8, we probably won’t hit the 4% benchmark by October 4th, but we’d be close, and we’d hit it at some point that week.
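To compare the two ratios, here is a small Python sketch that counts how many weeks it takes for the 7-day case total to fall to the roughly 7000 cases per week that I estimate corresponds to a 4% positive test rate. It assumes the ratio stays constant, and the function name is hypothetical.

```python
from datetime import date, timedelta


def weeks_until_at_or_below(start_cases, weekly_ratio, threshold):
    """Count whole weeks until the 7-day total is at or below the threshold."""
    if not 0 < weekly_ratio < 1:
        raise ValueError("weekly_ratio must be between 0 and 1 for cases to decline")
    weeks = 0
    cases = start_cases
    while cases > threshold:
        cases *= weekly_ratio
        weeks += 1
    return weeks


START = date(2021, 9, 12)   # date of the 16032-case 7-day total
THRESHOLD = 7000            # estimated weekly cases at a 4% positive test rate

results = {}
for ratio in (0.738, 0.8):
    weeks = weeks_until_at_or_below(16032, ratio, THRESHOLD)
    results[ratio] = (weeks, START + timedelta(weeks=weeks))
    print(ratio, weeks, results[ratio][1].isoformat())
# 0.738 3 2021-10-03
# 0.8 4 2021-10-10
```

At .738 the 7-day total drops under the threshold in the week ending October 3rd; at .8 it takes one more week, which lands in the first week of in-person school.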

As of today, the positive test rate for the last 7 days is 8.03%. The government is reporting 7.48%, but as noted above, this does not take into account the 1288 cases which mysteriously appeared at some point between Friday and Saturday morning. We’re probably not going to hit the 4% benchmark this week.

There’s a petition circulating on social media (scroll down for English) to open schools for in-person instruction as soon as they can demonstrate that 80% of “parents, teachers, and staff” are vaccinated.

As a parent, I sympathize. I wish my kids could go to in-person instruction rather than starting school online. I’d almost say it would be better to just wait until October 4th, on the theory that 13 days of online instruction will be largely pointless. But I can’t get behind the idea that we should just throw out (or, in this case, completely ignore) the benchmark we don’t like. Community transmission is the more important of the two benchmarks since the transmission rate in the community will determine the likelihood of an infected student coming to school and spreading the illness to classmates.

Of course, there are some who would say that schools are perfectly safe, and that the threat to children is vastly overblown. That children are more likely to get killed in a car accident on the way to school than to die of covid. I haven’t checked this statistic for Georgia but it sounds plausible - hundreds of people die per year in car accidents here. Of course, these risks aren’t mutually exclusive, so by sending your kids to school you’re adding covid risk *on top of* flu risks, car accident risks, and whatever other risks you’ve decided to tolerate by sending your kids to school in a normal year. But I get the point.

My response to this point is twofold. One, when I look at the US, it looks to me like opening schools in places with high community transmission rates has led to outbreaks, which have led to lots of kids in hospitals. Even if the kids survive - as most will - being in a hospital is an unpleasant experience. Two, there is an unknown risk from long covid. Maybe the long covid risk is the same as flu risk - I don’t know - but I can get my kids flu shots; as yet I can’t get them covid shots.

In other words, again, we have missing data. Too many unknowns. What are our options? We could become paralyzed by fear and keep our kids home forever. We could ignore any new information and pretend covid isn’t real and send our kids to school without masks or any precautions. Or, we could try to find some kind of middle ground. We can take some precautions. We can keep kids home under some circumstances. We can anchor on benchmarks - as the Georgian government has done - and then adjust these benchmarks as more data comes in. Perhaps the 4% PTR was too cautious and led to lost learning. Perhaps it was too liberal and led to outbreaks. We’ll see, and then we’ll adjust for the future - because there will be future waves.

Personally, I prefer the middle ground. Anchoring on benchmarks which are transparent, objective, and adjustable in the medium-long term seems like the best approach for balancing reasonable caution with a reasonable desire to get back to normal. Maybe we get the balance wrong, but we’ll still be less wrong than we’d be if we ignored the tradeoffs and just defaulted to one extreme or another.

Life is a series of decisions taken based on incomplete information. How we cope with missing data determines our success or failure. We need to notice when data is missing and make reasonable assumptions to compensate for it. We need to base our arguments and our policies on an open acknowledgment that we don’t have all the facts, but we do have some of them. We need to make reasonable guesses about how to balance one good against another, and then review those guesses to see how they went and adjust accordingly. We’ll always be missing data and we’ll never have all the answers, but we can always act reasonably based on the data that we do have.

As usual, numbers in this post are courtesy of stopcov.ge and ncdc.ge, unless noted otherwise. Projections are mine. I am not a medical expert but I am a trained forecaster, so take my forecasts with however many grains of salt you think are appropriate.
