Where reporting, computer science meet

Today’s front page carries a fresh installment in our ongoing “Doctors & Sex Abuse” investigation. It includes an extremely detailed and important analysis of every state’s laws that protect (or don’t protect) patients from dangerous doctors.

The research by investigative reporter Carrie Teegardin and editor Lois Norder involved months of painstaking collection and analysis of information. I could brag all day on the careful approach they took to bring you this important material, which is detailed in an interactive presentation online. I could also brag on the sensitive interviewing and writing by reporter Danny Robbins that resulted in today’s other front-page story from the series.

For now, though, I want to take you behind the scenes and back in time to some unusual work that helped get the project off the ground more than a year ago.

The work is unusual enough that one of our data journalists, Jeff Ernsthausen, was invited to talk about it recently at a conference that highlights the intersection of journalism and computer science. Jeff couldn’t make it, so I got to brag on his work instead. This column shares comments I made to attendees of the C+J Symposium in October.

Jeff was assigned to our project because a data team member is always part of our big investigations. But we did not anticipate how much we would need him.

The story grew out of some reporting by investigative reporter Danny Robbins on a dangerous doctor who worked in our state prison system. Like any curious reporter, Danny went further than the story in front of him required. He was looking into the Georgia medical board’s record of disciplining doctors and began noticing many who had been accused of sexually abusing patients and were still practicing. He eventually estimated that two-thirds of doctors sanctioned after sexual misconduct continued to practice.

We expected that the problems were similar elsewhere - several news organizations, including The Seattle Times and Chicago Tribune, have done this analysis in their own states and found similar results - and Public Citizen has done some work based on reports to a national database of physician misconduct.

So we decided to broaden our inquiry by sending out open records requests to all 64 medical boards. It took some time to learn that no state provides medical disciplinary actions in any kind of quantitative or analyzable format.

Obtaining the records the old-fashioned way - asking government agencies for only the specific cases we wanted to review - was not going to work. We’d have to go for every kind of disciplinary record available and sort them ourselves. Many states said it would take months and a lot of expense to compile those; others said it could not be done at all and we should just use the records on their websites.

So ultimately, we began scraping. What does that mean? It’s a simple concept but a complicated process - the computer repeatedly downloads public records in text form from a public website. Setting up scrapes of more than 60 different websites, all with different forms of medical disciplinary information, is no easy task. Some have press releases. Some post legal documents. Some post lists of actions with links. Rarely do the documents have actual searchable text. Some states post not a single word. (An aside for those wondering if this process is proper: it’s a perfectly legal way to obtain large volumes of public records when government can’t or won’t help obtain the underlying records. No private material is obtained as scraping only returns what is already available for the public on the website.)
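For readers curious what a scrape looks like in practice, here is a minimal sketch in Python. It is not the code our team ran. The class and function names, the example address and the list of file endings are all made up for illustration, and a real medical board website would need its own handling. The idea is simply to read a page that lists disciplinary actions, pull out the links that look like documents, and download them one at a time with a pause in between.

```python
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found on a listing page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def document_links(listing_html, base_url, suffixes=(".pdf", ".html", ".txt")):
    """Return absolute URLs for links whose endings look like documents.

    The suffix list is an assumption; each site posts records differently.
    """
    parser = LinkCollector()
    parser.feed(listing_html)
    urls = [urljoin(base_url, href) for href in parser.links]
    return [url for url in urls if url.lower().endswith(suffixes)]


def download_all(urls, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests to go easy on the server."""
    pages = {}
    for url in urls:
        with urllib.request.urlopen(url, timeout=30) as response:
            pages[url] = response.read()
        time.sleep(delay_seconds)
    return pages


# Hypothetical listing page, used here so the sketch runs without a network.
SAMPLE_LISTING = """
<ul>
  <li><a href="/orders/2015-001.pdf">Consent order</a></li>
  <li><a href="/contact">Contact the board</a></li>
</ul>
"""

if __name__ == "__main__":
    for link in document_links(SAMPLE_LISTING, "https://board.example.gov/discipline/"):
        print(link)
```

Only `download_all` touches the network, and it is not called above. Multiply this by more than 60 websites, each laid out differently, and the size of the job becomes clear.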

We knew scraping was going to be difficult, but we thought things would get easier once that part was done.

I remember the weekend I realized otherwise. We had thousands of disciplinary orders on hand from the first states we scraped and we were going to read a bunch and sort the ones relating to doctor sex abuse.

Reading cases from Texas, I spent a lot of time, I mean a lot, crying that weekend. First, the cases are horrendous and heartbreaking. I couldn’t believe the passes doctors got after fondling, violating and even raping female patients. One story we ultimately told out of Texas involved a doctor who was accused by 17 women of violating them under the guise of medical exams - and he got to keep practicing. I wish I could tell you that was an exceptional story, but it is way too common.

The other reason I was crying was because there were so many cases to read in Texas and it was taking me forever. These cases can be dozens of pages each of obscure legal arguments and procedural motions. I was getting through maybe four cases an hour. And it was quickly becoming apparent to me that it would take years - several years - for our team to get through all the documents we would eventually collect from all over the nation.

After my nightmarish weekend, I consulted with Jeff and the team. And as he often does, Jeff studied the problem for a few days and came up with a proposed solution. He would get the computer to read the documents for us, using a machine learning tool he and the team would develop, and sort out the ones that related to doctor sex abuse. So we would only have to read a subset.

So that’s the long version of what we did. Jeff ultimately used optical character recognition to extract the text from the 100,000 or so records we scraped and collected from websites, public records requests, even news reports. He then applied logistic regression with dummy variables based on presence or absence of keywords to give each record a score for whether it was likely to reveal sexual misconduct. We ultimately read the ones that scored over 50. We know we missed some, but we are certain we got the vast majority.
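To make the scoring step a little more concrete, here is a minimal sketch in Python of logistic regression on keyword dummy variables. It is an illustration only, not Jeff’s model. The keyword list, the six labeled example sentences and the training settings are invented, and treating the score as a number from 0 to 100 with a cutoff of 50 is an assumption about how the scale worked. Each record becomes a row of ones and zeros (keyword present or absent), the model learns a weight for each keyword from examples a person has already labeled, and every new record gets a score.

```python
import math

# Hypothetical keywords; a real list would be far longer and tuned by hand.
KEYWORDS = ["sexual", "inappropriate", "boundary", "billing", "records"]

# Hypothetical hand-labeled examples: 1 = sexual misconduct, 0 = something else.
LABELED = [
    ("Respondent engaged in sexual contact with a patient.", 1),
    ("Inappropriate touching during an exam; boundary violation.", 1),
    ("Sexual misconduct and inappropriate comments to patients.", 1),
    ("Failure to maintain adequate medical records.", 0),
    ("Improper billing of insurers for services not rendered.", 0),
    ("Billing irregularities and incomplete records.", 0),
]


def features(text):
    """Dummy variables: 1.0 if the keyword appears in the text, else 0.0."""
    lowered = text.lower()
    return [1.0 if keyword in lowered else 0.0 for keyword in KEYWORDS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=2000, learning_rate=0.5):
    """Fit weights and a bias by plain batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            predicted = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = predicted - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def score(text, weights, bias):
    """Return a 0-100 score; the 0-100 scale is an assumption for illustration."""
    x = features(text)
    return 100.0 * sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


WEIGHTS, BIAS = train([features(text) for text, _ in LABELED],
                      [label for _, label in LABELED])

if __name__ == "__main__":
    for text in ["The physician made sexual advances toward a patient.",
                 "Physician overbilled Medicare and kept poor records."]:
        flag = "read" if score(text, WEIGHTS, BIAS) > 50 else "skip"
        print(flag, "-", text)
```

A model this simple will miss cases that use none of its keywords, which is one reason to expect that some records slip through.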

Jeff got it down to some 10,000 records we had to read instead of 100,000. That still took months - but it was doable. He then put those documents into a database tool our reporters could use to read and analyze specific cases.

You saw the result - we found 2,400 doctors sanctioned after being accused of sexual misconduct with patients. Half were still practicing when we began publishing the series in July.

Now this is not really a numbers story, and if you read our coverage you will understand why. That figure of 2,400 cases is not definitive; it’s a collection of cases we could study to tell the story of a broken culture. With 900,000 doctors in America, that’s not a big number. But it’s many more than the medical profession has ever acknowledged, and it is probably understated significantly because many states handle cases in secret.

The real story is the broken system that fails to take many cases seriously, protects doctors and endangers patients. You can read more about that on our special website for this project.

Here’s what investigative reporter Carrie Teegardin has to say about Jeff’s work.

The quick take: “This technology allowed us to do old-school reporting on a massive scale and shine a spotlight on a phenomenon that was buried in thousands of hard-to-find documents.”

Whether it’s old-school reporting or newfangled computer magic, all of this is designed to bring you the kind of accountability reporting you value. We appreciate the support from subscribers and readers that allows us to go deep and take risks to tell important stories that would not otherwise be told.
