This algorithm could keep your census data ‘future-proof’

The World
Business-card advertisements for census 2020 jobs are pinned to a bulletin board

Twenty years ago, Cynthia Dwork started thinking about what privacy would mean today — in an age when people’s personal information is collected in any number of massive databases.

“The census was always our primary motivating scenario,” says Dwork, who is a professor of computer science at Harvard University, and a distinguished scientist at Microsoft Research. “This is the data of the people that's going to be analyzed for the benefit of the people and to distribute the people's resources.”

The federal government uses census data to allocate money to state and local governments, as well as to create legislative districts. Information has to be accurate, but it can’t be so specific that individual people can be identified. By law, the Census Bureau is required to protect individuals' privacy. That has been a big issue for the 2020 census because the Trump administration has pushed to include a question about citizenship status. In January, a New York federal judge blocked the question, but the Justice Department has appealed. The Supreme Court plans to rule later this year on whether the question can be included.


In the early 2000s, it became clear to Dwork and others that it was going to become more difficult for the US to maintain the crucial balance between privacy and accuracy in the census. As technology has developed, there are more and more ways to reconstruct databases, or cross-reference multiple databases, making it easier to identify people.

“I think we're limited only by our imagination,” says Dwork, speaking of potential threats to privacy from the census.

Dwork spearheaded an effort to find ways the Census Bureau could release usable information without risking privacy breaches. She and her collaborators eventually developed a solution called differential privacy.

“It’s a dream of mine to learn how to really explain this so that it’s widely accessible,” Dwork says.

To put it crudely, differential privacy applies an algorithm to a dataset. In the simplest example, the algorithm flips a coin for each individual data item to decide whether to use the value in the dataset or instead to use a random value.
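
For readers who want to see the coin-flip idea in action, here is a minimal sketch in Python of the classic technique usually called randomized response. It is an illustration only, not the Census Bureau's actual system: the function names, the fair-coin probabilities and the made-up population of 10,000 people (30 percent of whom would truthfully answer "yes") are all assumptions chosen for the example.

```python
import random


def randomized_response(true_value: bool, rng: random.Random) -> bool:
    """Report the real answer on heads; on tails, report a random answer."""
    if rng.random() < 0.5:       # first coin flip: heads, keep the true value
        return true_value
    return rng.random() < 0.5    # tails: a second coin flip picks the answer


def estimate_true_fraction(reports: list[bool]) -> float:
    """Undo the noise in aggregate.

    If p is the true share of "yes" answers, the expected share of
    reported "yes" answers is 0.5 * p + 0.25, so p = 2 * observed - 0.5.
    """
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5


# Hypothetical population: 10,000 people, 30 percent truthfully "yes".
rng = random.Random(2020)
true_answers = [i < 3000 for i in range(10000)]

reports = [randomized_response(answer, rng) for answer in true_answers]
estimate = estimate_true_fraction(reports)

print(f"Estimated share of 'yes' answers: {estimate:.3f}")
```

No single reported answer can be taken at face value, because any given "yes" may have come from a coin flip. The overall share can still be estimated closely once enough people are counted. With fair coins, a person whose true answer is "yes" reports "yes" with probability 0.75 and a person whose true answer is "no" does so with probability 0.25, and that ratio of 3 is what bounds how much any one report can reveal.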

“Differential privacy absolutely protects everybody, including the outliers and the unusual people,” Dwork says.

And that protection is greatly needed. At a recent meeting of the Census Scientific Advisory Committee, the bureau’s chief scientist, John Abowd, said he led internal tests that showed the 2010 census could have been compromised: It was possible to identify individuals’ age, gender, location, race and ethnicity.

“These experiments confirm that you can very accurately reconstruct what we call the ‘100 percent detail file,’ which is a confidential version of the census, using just the tables we published in 2010 at the person level,” Abowd said.

He said that is why the Census Bureau plans to use differential privacy for the first time for the 2020 census.

“As far as I know, this is the first time anyone has tried to do this,” Abowd said.

The theory is that differential privacy will guard against threats that exist now and even ones that might be invented in the future.

“They’re future-proof, which means it makes absolutely no difference what happens from the point of computing power, or future releases of data from other sources. They can’t be compromised,” Abowd said, adding, “as long as they’re implemented correctly.”

And that’s a hint at a big vulnerability with regard to privacy and the census: the possibility of internal missteps. The scientific advisors seemed especially concerned about the Census Bureau sharing information with other federal agencies, and whether those agencies, like the Department of Homeland Security, will respect privacy.


Kevin Smith, the Census Bureau’s chief information officer, responded that only a small group of staff would have access to sensitive information, and any data that is shared with other agencies would be encrypted and very limited.

“It is things like internet addresses,” he said. “It is not answers and responses to surveys.”

For her part, Dwork says she knows if a citizenship question is included on the census, it might frighten people. But she’s confident that threats to privacy from the census — even those within the federal government — are being addressed.

“I think we're in the fortunate situation where we have a Census Bureau that we can trust,” Dwork says. “As I understand it, the people at the census do not support the inclusion of the citizenship question. This is being imposed on them externally.”
