Why digital government records are so hard to preserve

In May, a federal judge ordered White House staff to comply with the Presidential Records Act, the 1978 law that makes a president’s official records public property and governs their preservation and eventual release.

A month earlier, the Justice Department had argued the law exceeds Congress’s constitutional authority. The American Historical Association and the watchdog group American Oversight sued, warning that the department’s opinion could let the White House abandon policies meant to restrict officials from conducting government business through personal email or encrypted messages. The risk, they argued, was an immediate loss of accountability and a permanent gap in the historical record.

Judge John D. Bates has so far found the law “likely constitutional.” But the court fight is just one part of a much broader challenge. The records that reveal how governments and public figures make decisions are now born in email, chat apps and cloud documents, often inside proprietary systems whose lifespans are measured in product cycles. Preserving them long enough for the public to see them has become a technical problem in its own right, one that grows harder as the volume climbs. The National Archives added 463 terabytes of electronic records to its permanent collection in 2024 alone.

“The world is creating digital records at a pace no organization anticipated,” says Mike Quinn, CEO of digital preservation company Preservica.

Before archivists can preserve a record, the record must survive long enough to make it into their hands. Public-records laws can require preservation, and the technology exists to capture and store messages even from some encrypted platforms when accounts or devices are configured to retain them. The digital preservation company Smarsh, for instance, advertises that it can capture data from more than 100 communications channels. But recent incidents show how easily significant records can still vanish, from U.S. Cabinet officials discussing military plans via the encrypted app Signal to UK Prime Minister Keir Starmer’s reported use of disappearing WhatsApp messages.

Private archives are just as fragile. Even when individuals such as politicians or artists, or their estates, donate physical papers to a university library, the digital material that once sat alongside them can be overlooked and lost, says Thorsten Ries, an assistant professor at the University of Texas at Austin who applies digital-forensics techniques to archival work.

Pulling the data off a hard drive or USB drive without altering files or metadata like timestamps also takes skill, Ries says. Different software versions, and even different storage media, can preserve different file fragments and automatic backups. Those offer valuable clues to how a document was drafted and how its creators thought, but recovering and interpreting them is painstaking, specialized work. “This kind of knowledge and expertise is actually still very sparse,” he says.
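One small part of that work, confirming that a copied file matches the original byte for byte, can be illustrated with a checksum comparison. The Python sketch below is only an illustration, not the forensic workflow Ries describes: the function names are hypothetical, and real acquisitions typically use write blockers and disk imaging rather than file-by-file copies.

```python
import hashlib
import os
import shutil

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_with_fixity(src, dst):
    """Copy src to dst (hypothetical helper) and report whether the copy is faithful."""
    before = sha256_of(src)
    src_mtime = os.stat(src).st_mtime_ns
    # copy2 attempts to carry over metadata such as the modification time.
    shutil.copy2(src, dst)
    after = sha256_of(dst)
    return {
        "source_sha256": before,
        "copy_sha256": after,
        "content_matches": before == after,
        # Some filesystems store timestamps at coarser resolution,
        # so this can be False even when the copy is otherwise fine.
        "mtime_preserved": os.stat(dst).st_mtime_ns == src_mtime,
    }
```

A matching checksum shows only that the contents are identical. It says nothing about deleted fragments or automatic backups elsewhere on the drive, which is where the specialized expertise comes in.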

Cloud-based systems such as Google Docs can hold the most detailed file histories of all, but extracting files from them without the original passwords and two-factor authentication is its own challenge, he adds.

Survival is just the first step; the material also must remain readable as software changes. “All these types of digital content don’t age like paper,” Quinn says. “They become unreadable when formats become obsolete.”

That often requires regularly migrating material like word processing documents, spreadsheets and computer-aided design files to current file formats while keeping a careful log of exactly what’s been done. If handled carelessly, those conversions can misrepresent the original, says Christopher J. Prom of the University of Illinois Urbana-Champaign library. That appears to be what happened when the Justice Department released emails tied to the late financier and sex offender Jeffrey Epstein that were marred by rendering errors.
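The “careful log” of a migration could take many forms. As a minimal sketch in Python, assuming a hypothetical helper named migrate_with_log and a stand-in converter supplied by the caller, each conversion might record checksums of the file before and after, the tool used, and the time:

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def migrate_with_log(name, original: bytes, convert, tool_name, log):
    """Run a caller-supplied converter and append one JSON log entry.

    `convert` stands in for whatever real migration tool is used;
    it is an assumption of this sketch, not a specific product.
    """
    converted = convert(original)
    entry = {
        "file": name,
        "tool": tool_name,
        "migrated_at": datetime.now(timezone.utc).isoformat(),
        "original_sha256": sha256_bytes(original),
        "converted_sha256": sha256_bytes(converted),
        "original_size": len(original),
        "converted_size": len(converted),
    }
    log.append(json.dumps(entry, sort_keys=True))
    return converted

# Example: re-encode a Latin-1 text file as UTF-8 and log the change.
log = []
old_bytes = "café".encode("latin-1")
new_bytes = migrate_with_log(
    "memo.txt",
    old_bytes,
    lambda b: b.decode("latin-1").encode("utf-8"),
    "latin1-to-utf8 (illustrative)",
    log,
)
```

Keeping the original file and its checksum alongside the log lets a later archivist verify what was changed, or redo the conversion if it turns out to have misrepresented the source.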

A preserved file can still be hard to use. Digital archives can contain copyrighted material alongside sensitive correspondence, including personal messages and medical bills, sitting in the same inboxes and folders as the files a researcher wants. That makes institutions cautious about opening collections broadly. And though a digital file could in theory be opened from anywhere with an internet connection, archives still routinely require an onsite visit, if they grant access at all, says Lise Jaillant, professor of digital cultural heritage at Loughborough University. Researchers must schedule and pay for travel, then comb through enormous collections on potentially unfamiliar systems in whatever time they have.

The “staggering volumes” of digital material produced by U.S. government agencies have likewise slowed the handling of Freedom of Information Act requests, says Jason R. Baron, a professor at the University of Maryland’s College of Information and former director of litigation at the National Archives and Records Administration. Agencies must first try to locate potentially relevant files, often by keyword search, then remove or redact anything classified, sensitive, or otherwise exempt from disclosure.

“It is not unusual for a requester to wait years or even in some cases over a decade to receive complete responses,” Baron says.

Automation may help, with substantial human oversight. In a 2025 paper, Baron explored using artificial intelligence and machine-learning techniques to flag paragraphs likely to be exempt under the FOIA provision that shields an agency’s “deliberative process.” Software can also help spot sensitive information like Social Security numbers and extract text from scanned documents or archived video through optical character recognition and automated transcription.
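The simplest version of that kind of screening is pattern matching. The Python sketch below flags strings formatted like Social Security numbers. It is a deliberately crude illustration with hypothetical function names, not the approach in Baron’s paper: it misses numbers written without hyphens and can flag look-alike strings, which is one reason human review remains necessary.

```python
import re

# Matches the common NNN-NN-NNNN layout only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def flag_ssn_candidates(text):
    """Return (start, end, match) for each SSN-like string, for human review."""
    return [(m.start(), m.end(), m.group()) for m in SSN_PATTERN.finditer(text)]

def redact_ssn_candidates(text, placeholder="[REDACTED]"):
    """Replace each SSN-like string with a placeholder."""
    return SSN_PATTERN.sub(placeholder, text)

sample = "Claimant SSN 123-45-6789, phone 555-123-4567."
candidates = flag_ssn_candidates(sample)
redacted = redact_ssn_candidates(sample)
```

In this sample the phone number is left alone because its digit grouping differs, but a pattern alone cannot tell a real Social Security number from a case number that happens to share the layout.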

AI can also surface files relevant to a particular question in a sprawling archive, including documents a simple keyword search would miss. As Baron points out, the same techniques are already used in litigation for electronic discovery, when vast sets of corporate files, emails, and other records often must be searched for material bearing on a lawsuit.

Still, challenges remain, says Jaillant, who is leading an international project on AI’s applications to government records. One is a shortage of publicly available email data to train AI to handle messages of various types and origins. Partly because of privacy concerns, researchers still often lean on a now-decades-old set of messages that government investigators obtained from Enron, Jaillant says.

And even as AI gets better at parsing archival material, it is unlikely to relieve human researchers of the need to read the relevant documents themselves. “It’s still important for a human user to go back to the documents and be able to read individual emails just to understand the context,” she says.

All of that assumes the records survive long enough to be read—which is precisely what the fight in Washington has put in doubt. Archivists, and the software they depend on, are working to make sure they do, before the records of today’s decisions become trapped in dead formats or erased from message threads without the public ever getting the chance to see them.
