Thousands of PACER documents could have failed redactions.
Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary's PACER system offers the public online access to hundreds of millions of court records. The judiciary's rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don't always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.
But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a "what you see is what you get" quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple "layers." There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it's been redacted, but the text is still "under" the box. And often extracting that information is a simple matter of cutting and pasting.
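To make the layering problem concrete, here is a toy sketch in Python. It is not real PDF syntax: the `DrawText` and `FillRect` classes, the `page` list, and the placeholder text are all hypothetical stand-ins for the drawing commands a vector format stores. The point is only that painting a box over text adds a command and does not remove the earlier one.

```python
from dataclasses import dataclass


@dataclass
class DrawText:
    """Hypothetical 'draw this string at (x, y)' command."""
    x: float
    y: float
    text: str


@dataclass
class FillRect:
    """Hypothetical 'fill this rectangle with a color' command."""
    x: float
    y: float
    width: float
    height: float
    color: str


# A made-up page: the text is drawn first, then a black box is painted on top.
page = [
    DrawText(72, 700, "Account number: 000-00-0000"),
    FillRect(70, 695, 200, 14, "black"),
]


def extract_text(commands):
    """Return every text string in the command list, ignoring what covers it."""
    return [c.text for c in commands if isinstance(c, DrawText)]


def redact_properly(commands, secret):
    """Drop the text command itself instead of only covering it up."""
    return [
        c for c in commands
        if not (isinstance(c, DrawText) and secret in c.text)
    ]
```

In this model, `extract_text(page)` still returns the covered string, which is roughly what happens when a reader copies and pastes from a badly redacted document. After `redact_properly(page, "Account number")`, nothing is left to extract.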
So how many PACER documents have this problem? We're in a good position to study this question because we have a large collection of PACER documents: 1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles, which turn out to be relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text with strings of Xes. I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)
Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
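The two-step check described above can be sketched roughly as follows. This is a minimal illustration, not the software used in the study: the `Box` type, the color and size thresholds in `looks_like_redaction`, and the input format (rectangles and text runs already extracted from a page with their bounding boxes) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (x0 < x1, y0 < y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a, b):
    """True if the two boxes share any interior area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def looks_like_redaction(rect, color, min_width=10.0, min_height=5.0):
    """Assumed heuristic: a near-black fill that is big enough to hide text.

    `color` is an RGB triple with channels in the range 0..1. The thresholds
    are illustrative guesses, chosen so that thin rules and borders are skipped.
    """
    is_black = all(channel <= 0.1 for channel in color)
    wide_enough = (rect.x1 - rect.x0) >= min_width
    tall_enough = (rect.y1 - rect.y0) >= min_height
    return is_black and wide_enough and tall_enough


def find_covered_text(rects, text_runs):
    """Return the text runs whose boxes overlap a likely redaction rectangle.

    rects:     list of (Box, color) pairs for filled rectangles on the page
    text_runs: list of (Box, str) pairs for text drawn on the page
    """
    redactions = [box for box, color in rects if looks_like_redaction(box, color)]
    return [
        text for box, text in text_runs
        if any(overlaps(box, r) for r in redactions)
    ]
```

A document for which `find_covered_text` returns anything is only a candidate. As the hand review above suggests, an automated overlap test will produce false positives, so each hit still needs to be examined by a person.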
Links:
http://freedom-to-tinker.com/blog/tblee/studying-frequency-redaction-failures-pacer
Redacting with Confidence, by the National Security Agency:
http://www.fas.org/sgp/othergov/dod/nsa-redact.pdf