Hello again ladies and gentlemen! My apologies for the delay in bringing this next installment of the Ghostbin’s Guts series to you - these past few months have been hectic. From quitting my job to trying to start a company to finding contract work to keep myself afloat, I’ve been having a tough time properly managing my schedule.
That’s no excuse though! Should you enjoy what I’m writing here and want me to write more, please harass me to do so on Twitter to speed things along. I love hearing from folks that like my writing and am much more motivated to put these together when I know people are looking forward to reading them.
Anyhow, without further ado I give you Ghost Got Passwords – Ghostbin’s Guts Part 2!
Back in August I set about scraping data from the Ghostbin paste-sharing platform. I had always been curious about how honest the “anonymity is used as much for good as it is for bad” argument was, and I tested it by reviewing the contents of Ghostbin. As it turned out, most of what Ghostbin was being used for fell into the “malicious” category.
A breakdown of 20,868 Ghostbin pastes that I analyzed resulted in the following distribution of paste contents:
Of the 20,868 files that were classified, 3,975 contained password dumps. The size distribution of these 3,975 password dumps is shown below (note that the Y-axis has a logarithmic scale):
As can be seen above, the vast majority of pastes were under 5 kB in size, with 3,152 pastes being under 1 kB.
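For anyone who wants to build a similar size breakdown, here is a minimal sketch of the bucketing step. The bucket edges, the labels, the decimal definition of a kB (1,000 bytes), and the sample sizes are all assumptions made for illustration, not the values used to produce the graph above.

```python
from collections import Counter

# Hypothetical bucket edges in bytes (1 kB taken as 1,000 bytes here).
BUCKETS = [(1_000, "< 1 kB"), (5_000, "1-5 kB"), (50_000, "5-50 kB")]
OVERFLOW_LABEL = ">= 50 kB"


def bucket_for(size_bytes):
    """Return the label of the first bucket whose upper bound exceeds the size."""
    for upper, label in BUCKETS:
        if size_bytes < upper:
            return label
    return OVERFLOW_LABEL


def size_histogram(sizes):
    """Count how many paste sizes fall into each bucket."""
    return Counter(bucket_for(size) for size in sizes)


# Made-up sizes in bytes, standing in for something like os.path.getsize().
sample_sizes = [120, 640, 980, 2_300, 4_999, 12_000, 75_000]
histogram = size_histogram(sample_sizes)
```

Plotting counts like these on a logarithmic Y-axis keeps the sparse large-file buckets visible next to the dominant small-file ones.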
After I had singled out all of the password dump files, I then had to parse each file for username and password data. This got a bit tricky, as the formats of the files tended to differ quite a bit. For instance, the following patterns were commonly found in many of these files:
username:password
username,password
username password
username, password
username, password hash, password
password:username
password,username
username password
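To give a feel for what handling formats like these involves, here is a rough sketch of a line parser. It is not the parser used for this analysis: the function name `parse_line`, the regular expressions, and the "whichever side looks like an email is the username" heuristic are all assumptions, and the three-field "username, password hash, password" layout is deliberately left unhandled.

```python
import re

# Loose email check, good enough to tell a username field from a password.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")
# First delimiter: a colon or comma (with optional padding), else whitespace.
DELIM_RE = re.compile(r"\s*[:,]\s*|\s+")


def parse_line(line):
    """Return (username, password) for one dump line, or None if unparseable."""
    line = line.strip()
    if not line:
        return None
    # Split once only, so delimiters later in the password are preserved.
    parts = DELIM_RE.split(line, maxsplit=1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        return None
    first, second = parts
    # Handle the reversed "password:username" layout when only the
    # second field looks like an email address.
    if EMAIL_RE.match(second) and not EMAIL_RE.match(first):
        return second, first
    # Otherwise assume the username comes first (a guess, not a guarantee).
    return first, second
```

Because the orientation guess only works when one side is an email address, a heuristic like this will mislabel some plain username and password pairs, which is one way erroneous data can creep into results of this kind.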
Because of the widely varied representation of password data in these files, it is safe to say that the following information does not represent the entirety of the password data that I was parsing, and that some of the data is erroneous. However, I was able to successfully parse out the following number of data points (each row representing unique counts of the data point):
What follows is a statistical breakdown of what these passwords contained, and the resulting password lists that I fashioned from this data.
After I had parsed through all of the password dumps I pulled down, I was immediately intrigued by the number of email address and password combinations that this data set contained. I have been on many penetration tests where the target organization was not American, yet I only had English-based password dictionaries to throw at them. Here, then, was an opportunity to see if different nationalities employed different passwords.
I took all of the email address and password combinations that I had found and reviewed how many combinations I had by TLD. The results of this are shown below:
For those that are curious, I also took a look at what domains were most commonly represented in these password dumps:
While the lion’s share of the email addresses I analyzed belonged to the COM TLD, RU (Russia), NET (Generic), UK (United Kingdom), FR (France), and BR (Brazil) all had large enough presences to provide interesting insights.
I separated out all of the email and password combinations that I had by TLD and ran these lists through the Pipal Password Analyser software.
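The separation step can be sketched roughly as follows. This is an illustration under assumptions rather than the exact script used here: the helper names `tld_of` and `passwords_by_tld` are hypothetical, the sample pairs are made up, and the TLD is simply taken to be the last dot-separated label of the email domain (so `example.co.uk` counts as UK).

```python
from collections import Counter, defaultdict


def tld_of(email):
    """Return the upper-cased last label of the email's domain, or None."""
    _, sep, domain = email.rpartition("@")
    if not sep or "." not in domain:
        return None
    return domain.rsplit(".", 1)[1].upper() or None


def passwords_by_tld(pairs):
    """Group passwords by the TLD of the email address they came with."""
    groups = defaultdict(list)
    for email, password in pairs:
        tld = tld_of(email)
        if tld:
            # Keep duplicates: frequency analysis needs the repeat counts.
            groups[tld].append(password)
    return groups


# Made-up sample data for illustration only.
sample_pairs = [
    ("a@example.com", "123456"),
    ("b@mail.ru", "qwerty"),
    ("c@example.co.uk", "liverpool"),
    ("d@example.com", "password"),
    ("not-an-email", "x"),
]
groups = passwords_by_tld(sample_pairs)
tld_counts = Counter({tld: len(pws) for tld, pws in groups.items()})
```

Writing each group out to its own file, one password per line and with the email addresses dropped, gives a per-TLD list that a password analyser can consume.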
Firstly, let’s take a look at what the data trends are when we don’t break passwords down by TLD. Password lengths (agnostic of TLD) are shown below:
The percentages of last digits found in passwords (agnostic of TLD) are shown below:
One of the great features of Pipal is that it looks into the passwords it’s analyzing and identifies common patterns. There are two sets of data that Pipal spits out regarding password pattern categories and the presence of particular characters in passwords. The first of these two data sets (agnostic of TLD) is shown below:
The second of these two data sets (agnostic of TLD) is shown below:
And last but not least, the most common passwords (agnostic of TLD):
After looking at all of these graphs, I couldn’t say that I was surprised. The most common digit these passwords ended with was 1, passwords were commonly 6-8 characters long, lowercase+digit passwords were the most common followed by only lowercase alphabetic passwords, and 123456 was the most common password overall. So now let’s see if these trends held across the different TLDs.
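For readers who want to reproduce these kinds of numbers on their own lists without Pipal, the four statistics summarized above can be approximated in a few lines. This is a simplified sketch, not Pipal's implementation: the category labels are my own, only three character-set categories are distinguished, and the sample passwords are made up.

```python
import re
from collections import Counter

DIGITS = "0123456789"


def length_distribution(passwords):
    """Count passwords by length."""
    return Counter(len(p) for p in passwords)


def last_digit_distribution(passwords):
    """Count final characters, considering only passwords ending in 0-9."""
    return Counter(p[-1] for p in passwords if p and p[-1] in DIGITS)


def charset_category(password):
    """Assign a coarse character-set label (labels are illustrative)."""
    if re.fullmatch(r"[a-z]+", password):
        return "lowercase only"
    if re.fullmatch(r"[0-9]+", password):
        return "digits only"
    if re.fullmatch(r"[a-z0-9]+", password):
        return "lowercase+digit"
    return "other"


# Made-up sample list for illustration only.
sample = ["123456", "password1", "qwerty", "123456", "abc123", "Passw0rd!"]

lengths = length_distribution(sample)
last_digits = last_digit_distribution(sample)
categories = Counter(charset_category(p) for p in sample)
top_passwords = Counter(sample).most_common(1)
```

Running the same functions over each per-TLD list separately is enough to compare trends across TLDs.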
Password lengths by TLD are shown below:
Password last digits by TLD are shown below:
The first password categories data across all TLDs is shown below:
The second password categories data across all TLDs is shown below:
And finally, the most common passwords by TLD:
Awesome – sure enough it looked like password lists could benefit from being tailored to specific language sets.
So now that we’ve got all of these pretty graphs in front of us, what are some of the takeaways we might conclude from this data?
With all of this work done and all of these passwords analyzed, it wouldn’t be fair not to share these password lists with you all. Note that these lists DO NOT contain any usernames or email addresses corresponding to the affected accounts.
The password lists are broken up by TLD and can be found on my GitHub:
I hope you all found this analysis of passwords scraped from Ghostbin enlightening. I would love to see more effort put into creating language-specific password lists, as this analysis clearly indicates the necessity.
If you like this article, please share it! If you want me to write more like it, harass me at @_lavalamp.