One of the high points of last week’s Gov 2.0 Summit was transparency champion Carl Malamud’s speech on the history of public access to government information — ending with a clarion call for government documents, data, and deliberation to be made more freely available online. The argument is a clear slam-dunk on simple grounds of fairness and democratic accountability. If we’re going to be bound by the decisions made by regulatory agencies and courts, surely at a bare minimum we’re all entitled to know what those decisions are and how they were arrived at. But as many of the participants at the conference stressed, it’s not enough for the data to be available — it’s important that it be free, and in a machine readable form. Here’s one example of why, involving the PACER system for court records:
The fees for bulk legal data are a significant barrier to free enterprise, but an insurmountable barrier for the public interest. Scholars, nonprofit groups, journalists, students, and just plain citizens wishing to analyze the functioning of our courts are shut out. Organizations such as the ACLU and EFF and scholars at law schools have long complained that research across all court filings in the federal judiciary is impossible, because an eight cent per page charge applied to tens of millions of pages makes it prohibitive to identify systematic discrimination, privacy violations, or other structural deficiencies in our courts.
If you’re thinking in terms of individual cases — even those involving hundreds or thousands of pages of documents — eight cents per page might not sound like a very serious barrier. If you’re trying to do a meta-analysis that looks for patterns and trends across the body of cases as a whole, not only is the formal fee going to be prohibitive in the aggregate, but even free access won’t be much help unless the documents are in a format that can be easily read and processed by computers, given the much higher cost of human CPU cycles. That goes double if you want to be able to look for relationships across multiple different types of documents and data sets.