PII, PHI, and in the words of George Takei: “Oh, My!”

Robby Delaware
7 min read · Feb 15, 2021

As I mentioned previously, I have enjoyed playing around with things like Google Dorks.

I have only occasionally come across things online which rose to the level where I felt that someone needed to be contacted about them. I mentioned previously that I had run into a couple thousand email addresses tied to PayPal users. You’ll oftentimes run into a whole host of odd things online that have been indexed into major search engines.

As a general rule, I am looking for anything that can be discovered via simple search engine queries and that might come close to violating a state-level data breach statute. I’m not digging around looking at unsecured AWS buckets or MongoDB databases. No, I am simply looking for what’s been indexed and appears in Bing or Google searches.

What counts as personal information under state-level legislation is pretty lenient. Take a look at the state of New Mexico’s definition of Personal Information:

This is the state of New Mexico’s definition of Personal Identification. In the United States, unlike in the E.U., the individual states have been tasked with coming up with a regulatory framework governing online privacy and data breaches. This has led, naturally, to extraordinarily corporate-friendly laws.

You’ll notice: Email addresses are not applicable.

Speaking of the Land of Enchantment: Here’s an example of something that, while not violating the spirit of the 2017 legislation governing personal identification, certainly came close in my opinion.

Now, in general I am against anonymity. But for a variety of reasons I’ve found that it is easier to simply create a burner ProtonMail account with a fake name when sending a message off.

Here’s a message I sent off on New Year’s Eve, about an issue I found that same night while drinking beer:

Thursday, December 31, 2020 12:48 AM

To: nmsos.security@state.nm.us

CC: Media@nmag.gov

New Mexico State Officials-

Hello! I wasn’t sure exactly where I should report this, so I took the liberty of sending this off to the Secretary of State’s Office and a Media contact for the New Mexico Attorney General’s office.

I do not believe that this problem would rise to the level of a data breach as defined by New Mexico’s Data Breach Notification act. But, it certainly is an issue in the general area of private information being easily exposed on the internet.

The website for █████ ███ located at: ██████████████████████████gov/

Is configured in an odd way. One of the domains on this website: ███████████████████████████████████████.txt is open, and currently being indexed by search engines like Google and Bing.

In fact, you can run a Google or Bing search for “████████████” and you can find what looks to be a repository of information about resumes and CVs that people have sent into █████ ███ ██████.

A number of full resumes look to have been incorrectly stored by the █████ ███ ██████ website. These resumes are also now indexed into search engine search results.

Take a look at something like this: ███████████████████████████████████████████████████████████

There are quite a few of these resumes on the internet. Both these resumes, and more easily, the page ending with ████████████, encourage issues such as identity theft.

Just a cursory glance of some of the resume shows:

Current address. First and Last Names. Mobile telephone numbers. Full email addresses. Full employment and education histories.

While this doesn’t reach the PII threshold, it isn’t the best way to be storing data. This is basically encouraging identity theft.

I was just curious if you’ve had anyone look into this — or had any previous communication about this issue.

Problematic domain: ████████████████████████████████████████txt

Data exposed: email addresses, home address, phone numbers

Data indexed into search engines: — each phone number/email address and name individual indexed into search engines

Search Terms that lead to domain and data: “████████████”

Resolution: You probably should work to have these resumes stored correctly, and search engines should be notified to wipe the data out of search.

Sent with ProtonMail Secure Email.

This one was easy and cut and dry. A .gov domain, linked to a municipality, leaking a whole host of resumes and other documents. A quick email to state authorities and the problematic domain was gone within 48 hours.

The problem, and there’s always some fucking problem, is that those email addresses were scraped by shady third party websites like “Houston-based” usseek.com. Meaning, those email addresses tied correctly to very specific first and last names, are going to be on the internet forever.

Plus, if I could find this information that easily, there is little doubt that bad actors have already stumbled across this too. Remember, I picture smoky rooms of people in the Philippines or Hyderabad, doing little more than manually collecting things like email addresses and other low-hanging PII for criminal networks. “Your job today, Slumdog? We need 5,000 unique email addresses of Americans!”

Hey, times are tough, and now that the Trump Train is out of office, no one is using Trump campaign contributions to pay for non-stop QAnon and MAGA troll postings anymore. Got to find a way to earn some cash during a pandemic, and manually scraping email addresses from Google search results is as good a means as any, amirite!?!

Take a small moment out of your day to remember that time that Donald Trump actually tweeted a sock puppet account that had left its geolocation switched on, pointing to a town in the Philippines that was home to numerous for-hire troll farms.

You might even discover, after spending some time investigating these things, that it is difficult to determine who exactly is responsible for a particular issue.

There’s a difference, of course, between Personal Information (PI) and Personal Health Information (PHI) when it comes to data breaches and information leaking online.

You might find it surprising that it can actually be much harder to report the disclosure of personal health information online than it is to report other types of problems.

Here’s an example: my spidey-sense was instantly activated this past September after I came across a large collection (10,000+ entries) of what looked to be the personal health information (PHI) of numerous people, stored in an open GitHub repository.

In addition to what appeared to be health care related information, I also spotted what I believed to be social security numbers, first and last names, height, weight and gender, cell phone numbers, doctors’ diagnoses and geolocation data.

The social security numbers (if valid) would have been a significant data breach. The patient data would have been even more serious, and would most likely have forced patient notification, if the data could be traced to a particular entity inside of the United States.

My response.

I came across these files in a GitHub gist hosted at gist.github.com.

Here’s an example of a JSON file with more than 10,000 entries:

The types of information located in a JSON file in a gist.github.com repository: city, first name, last name, middle name, last visit (to doctor), weight, height, location, gender, age, social security number, ICD-10 codes, and a written description of ailment.
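To give a feel for how quickly a file like this can be triaged, here is a minimal sketch that counts how often sensitive-looking field names show up across a list of JSON records. The key names (`first_name`, `ssn`, `icd10` and so on) and the hint list are my own assumptions for illustration; they are not the actual keys from the file I found.

```python
import json
from collections import Counter

# Hypothetical substrings that suggest a field holds personal or health data.
SENSITIVE_HINTS = ("ssn", "social", "name", "icd", "diagnosis",
                   "weight", "height", "gender", "visit")


def sensitive_field_counts(records):
    """Count how many records contain each sensitive-looking key."""
    counts = Counter()
    for record in records:
        for key in record:
            lowered = key.lower()
            if any(hint in lowered for hint in SENSITIVE_HINTS):
                counts[key] += 1
    return counts


# Made-up sample records, standing in for a parsed JSON file.
sample = json.loads(
    '[{"first_name": "A", "ssn": "x", "city": "Y"},'
    ' {"first_name": "B", "icd10": "J00"}]'
)
print(sensitive_field_counts(sample))
```

A count like this only says which kinds of fields are present. It says nothing about whether the values are real, which turned out to be the harder question.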

Now I made a few mistakes, but I didn’t err when I decided that I needed to reach out and get someone to pay attention to this as soon as I spotted it. I’d read enough stories to know that the owners of GitHub repositories are often less than responsive.

I’d also read that GitHub (like Amazon with buckets) takes exactly zero responsibility if something sensitive is stored in an unsecured repository. I did, however, make sure to carbon copy GitHub’s security bug-report email address with information about this repository.

How was I supposed to approach this? I decided to reach out to SchizoDuckie after I did some research and browsed his paper online:

Worth a read!

SchizoDuckie contacted me on Twitter, and he really put in a lot of effort trying to track down exactly what this data was, and why it was in an open GitHub repository. He also guided me towards @PogoWasRight from Databreaches.net. @PogoWasRight showed me how to validate SSNs, and she also provided me with some helpful hints, along with evidence that what I had stumbled across was most likely highly realistic test data, mixed with other valid data from dead folks.
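I won’t try to reproduce the exact method I was shown, but the first step of any SSN sanity check can be sketched in a few lines. This is a minimal structural check, assuming only the well-known Social Security Administration numbering rules: the area number is never 000, 666, or in the 900s, the group number is never 00, and the serial number is never 0000. The function name is my own.

```python
import re

# Three groups of digits, with or without dashes: AAA-GG-SSSS.
SSN_PATTERN = re.compile(r"(\d{3})-?(\d{2})-?(\d{4})")


def is_plausible_ssn(value: str) -> bool:
    """Return True if value is structurally possible as an SSN.

    This does NOT confirm the number was ever issued to anyone.
    """
    match = SSN_PATTERN.fullmatch(value.strip())
    if match is None:
        return False
    area, group, serial = match.groups()
    # Area numbers 000, 666 and 900-999 are never assigned.
    if area in ("000", "666") or area.startswith("9"):
        return False
    # Group 00 and serial 0000 are never assigned.
    if group == "00" or serial == "0000":
        return False
    return True
```

Passing this check only means a number is not obviously fake. Realistic test data will sail through it, which is why it took outside help to work out what I was actually looking at.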

I kind of came across this data in a really, really dumb way. I thought that if I played around with terms related to “SSN” and “Social Security Number” I would eventually come across a stash of valid SSNs online. I was thinking that searches for “ssn&#039” would allow for searches of headers for source code information that included SSNs.

It didn’t work, but a typo, “ste” instead of “site” and “mi” instead of something else, led me to a “Patients.JSON.”

Take a look: an example of a typing screw-up that happened to lead, by happenstance, to an interesting find online.

Here’s another example of some of the ambiguity of things online. How one should go about reporting things really differs. Me, I’m not keen on reaching out to any dev. Therefore, I am looking for a government regulatory agency, one that presumably WANTS to know this information, as opposed to going down a months-long path of “responsible disclosure” with a third party. This is especially true if there is no bounty system in place.

It should be noted that GitHub, for whatever reason, quickly assisted in the removal of this data. The repository was secured in a few days after notification, and is no longer visible in Google search results. Whatever this data was — legit data, test data, some mix of the two — it is no longer accessible.
