Privacy and anonymity are critical tools for maintaining freedom in our increasingly digital world. Even so, privacy and anonymity are commonly used by individuals performing malicious activities. These two sides of the same coin are what debates on privacy tools in the 21st century often orbit around. I recently discovered an opportunity to test, in one small corner of the Internet, whether anonymity and privacy were being used more for benign or malicious purposes. I scraped a significant portion of Ghostbin's public pastes, and here is what I found.
I like services that allow you to post and share random text with random people. There are plenty of them out there (https://pastee.org/, http://pastebin.com/, https://cryptobin.org/, etc.), and they all come with their own pros and cons. One of my favorites is Ghostbin, partly for its ease of use and partly because the site is not indexed.
The other day I was posting to Ghostbin and I noticed something. Let me post once more and show you.
Posting to Ghostbin is shown below:
After the text was successfully posted, my browser was directed to the URL where the paste was now stored:
My post was saved at the following URL:
Looking at the URL, I thought to myself that the entropy of the paste identifier (o45h8) was quite low. Some quick Google dorking confirmed my suspicions that paste identifiers, while random, were not complex:
Further investigation determined that Ghostbin paste identifiers were five characters long and were composed only of lowercase letters and digits. I did some quick back-of-the-envelope calculations to figure out how many unique pastes Ghostbin could possibly have with this identifier structure:
[a-z0-9] == 36 possible characters
36^5 == 36*36*36*36*36 == 60,466,176 possible IDs
It turns out that Ghostbin could only have ~60.5 million unique pastes.
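The same count can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above; the names ALPHABET, ID_LENGTH, and id_space_size are mine and are not part of anything Ghostbin exposes.

```python
import string

# Character set observed in paste identifiers: a-z plus 0-9 (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits
# Observed identifier length.
ID_LENGTH = 5


def id_space_size(alphabet: str = ALPHABET, length: int = ID_LENGTH) -> int:
    """Number of distinct identifiers of the given length over the alphabet."""
    return len(alphabet) ** length


if __name__ == "__main__":
    print(f"{id_space_size():,} possible IDs")
```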
I then took it one step further. There are plenty of tools and libraries out there for firing off asynchronous HTTP requests en masse, and if I could achieve an average of 100 requests per second…
60,466,176 endpoints / 100 requests per second == 604,661.76 seconds
604,661.76 seconds / 60 seconds per minute == 10,077.696 minutes
10,077.696 minutes / 60 minutes per hour == 167.9616 hours
167.9616 hours / 24 hours per day == 6.9984 days
From the math above, it would take me about seven days (in theory) to find all of the valid Ghostbin paste IDs, assuming I could maintain an average of 100 requests sent per second.
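For anyone who wants to play with different request rates, here is a small sketch of that estimate. The function name scan_days is hypothetical, and the calculation assumes a perfectly steady rate with no retries, throttling, or downtime, which is exactly why it is only a theoretical figure.

```python
def scan_days(total_ids: int, requests_per_second: float) -> float:
    """Days needed to request every ID once at a constant request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    seconds = total_ids / requests_per_second
    # seconds -> minutes -> hours -> days
    return seconds / 60 / 60 / 24


if __name__ == "__main__":
    for rate in (50, 100, 200):
        print(f"{rate} req/s: {scan_days(60_466_176, rate):.4f} days")
```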
It was at this point that I grew excited. Individuals who use services that provide anonymity tend to claim that they only want the anonymity on ethical grounds, shying away from the fact that anonymity is a tool commonly employed by those who are up to no good. Here, then, was an opportunity to see how true this was. Was anonymity more commonly used for good or evil on Ghostbin?
I set upon the task of scraping Ghostbin, and in reality the scraping took closer to two weeks. Additionally, I did not scrape the entire ID space (successfully scraped from aaaaa to ~6aaaa, missed ~6aaaa through 99999). Even so, the information that I scraped from Ghostbin contained some very interesting stuff.
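The range above (aaaaa first, 99999 last) corresponds to walking the ID space with letters ordered before digits. A minimal sketch of that enumeration is below. It is not the scraper I actually ran: candidate_ids and find_valid_ids are illustrative names, and the exists callable is a stand-in for whatever HTTP check you would use, so no real Ghostbin endpoint or API is assumed here.

```python
import itertools
import string
from typing import Callable, Iterable, Iterator, List

# Letters before digits, so enumeration runs from "aaaaa" to "99999".
ALPHABET = string.ascii_lowercase + string.digits


def candidate_ids(length: int = 5, alphabet: str = ALPHABET) -> Iterator[str]:
    """Lazily yield every identifier of the given length, in alphabet order."""
    for chars in itertools.product(alphabet, repeat=length):
        yield "".join(chars)


def find_valid_ids(ids: Iterable[str], exists: Callable[[str], bool]) -> List[str]:
    """Keep only the identifiers for which the supplied check returns True.

    `exists` is a placeholder: in a real scraper it would issue a request
    and inspect the response; here it can be any function of the ID.
    """
    return [paste_id for paste_id in ids if exists(paste_id)]


if __name__ == "__main__":
    # Show the first few candidates without materializing all ~60M of them.
    print(list(itertools.islice(candidate_ids(), 3)))
```

Because candidate_ids is a generator, the full space is never held in memory, and a partial run can be resumed by skipping ahead to the last ID that was checked.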
The resulting data set was so massive that I cannot hope to cover it all in one blog post. As such, I’ll be breaking this series up into this first introduction followed by individual posts for each of the interesting data types. If you’ve got a preference for what type of data I should dig into next, let me know at @_lavalamp.
I identified a total of 42,538 unique pastes on Ghostbin. Note that pastes on Ghostbin can be set to expire, so this number does not represent the sum total of data Ghostbin has historically hosted. I made two passes across the identifier space: the first to find all of the valid identifiers and the second to pull down the raw paste content for the discovered identifiers. After pulling down the pastes, I took a look at the paste size distribution (note the logarithmic scale of the Y axis):
As shown above, paste sizes tended to be small, with the vast majority of pastes falling under 512 kB.
I was then faced with the enormous task of sorting through over 42k pastes. I manually looked through around 2k pastes and decided to separate pastes by topic. I started identifying common trends across similar pastes, and built out heuristics to organize as many pastes as I could. After a few rounds of going back and forth, I came up with the following categories:
Even after writing filters specifically for ferreting out the above categories, I successfully classified only 20,928 pastes (a little under half of the total number of retrieved pastes). Furthermore, the filters were far from 100% accurate, so many things were either missed or incorrectly classified. Because of this, the following numbers should be understood as generalizations instead of absolute truths.
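To give a feel for what such filters look like, here is a deliberately simplified sketch. These are not the heuristics I actually used: the rule list, the 80% threshold, and the names RULES, classify, and looks_like_email_list are all made up for illustration. The first matching rule wins, and anything that matches no rule stays unclassified.

```python
import re
from typing import Callable, List, Optional, Tuple

# Loose pattern for a line that consists of a single email address.
EMAIL_LINE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.IGNORECASE)


def looks_like_email_list(text: str, threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the non-empty lines are email addresses."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    hits = sum(1 for line in lines if EMAIL_LINE.match(line))
    return hits / len(lines) >= threshold


# Ordered (category, predicate) pairs; order matters because first match wins.
RULES: List[Tuple[str, Callable[[str], bool]]] = [
    ("php", lambda text: "<?php" in text),
    ("nmap_scan", lambda text: "Nmap scan report for" in text),
    ("email_list", looks_like_email_list),
]


def classify(text: str) -> Optional[str]:
    """Return the first matching category name, or None if nothing matches."""
    for category, matches in RULES:
        if matches(text):
            return category
    return None
```

Simple substring and ratio checks like these are cheap to run over tens of thousands of pastes, but they are also the reason for the misses and misclassifications mentioned above.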
After classification, the number of pastes found for each category was as follows:
The cumulative paste sizes for each category are shown below:
Looking at the two charts above, it can be seen that the file count for a category does not strongly correlate with the cumulative paste size. Normalizing both of the above graphs (divide X axis by max observed X value) to better demonstrate this observation, we get the following:
As was previously stated, I started looking at this data to determine whether Ghostbin was being used predominantly for good or evil. I summed up the paste sizes for the 20,928 pastes that I categorized, which came out to 460 MB. I then summed up the paste sizes by category, and divided the resulting values by 460 MB to get the percentage of the 20,928 pastes that each category represented (by disk size). The results are shown below:
As shown above, doxes, database dumps, password dumps, crash dumps, and email lists made up five of the top six categories, comprising 58% of the analyzed paste volume.
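The percentage-by-size calculation behind that chart is simple enough to sketch. The function name share_by_category and the (category, size) input shape are assumptions for illustration, and the sample values in the usage block are invented, not figures from my data set.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def share_by_category(pastes: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Map each category to its percentage of the total size in bytes.

    `pastes` is an iterable of (category, size_in_bytes) pairs covering
    only the pastes that were successfully classified.
    """
    totals: Dict[str, int] = defaultdict(int)
    for category, size in pastes:
        totals[category] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cat: 100.0 * size / grand_total for cat, size in totals.items()}


if __name__ == "__main__":
    # Invented sample sizes, purely to show the call shape.
    sample = [("php", 300), ("email_list", 100), ("php", 600)]
    for category, pct in sorted(share_by_category(sample).items()):
        print(f"{category}: {pct:.1f}%")
```

Note that the denominator is the total size of the categorized pastes only, so the percentages say nothing about the roughly half of the data set that remained unclassified.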
The only potentially innocuous category in the top six was PHP. Taking a closer look, most of the PHP files contained one-off scripts and random junk. There were, however, a significant number of PHP shells and exploits such as:
It always cracks me up to see images/CSS/JS files loaded in web shells. YOU get a shell! YOU get a shell! YOU get a shell! That guy you’ve never heard of that put the image into the web shell code gets a shell!
I digress. From the data shown above, it is evident that a significant portion of the data hosted on Ghostbin is being used for malicious purposes. Assuming that URL lists, Nmap scans, and credit card dumps are malicious brings the cumulative malicious paste size percentage up to 68%. In conclusion, the contents of Ghostbin appear to support the argument that anonymity is predominantly used with malicious intentions.
Services that offer anonymity are great, and they solve some very interesting problems. Even so, anonymity is a tool that is often sought after and used by people without good intentions. While the numbers I’ve crunched above may contain some amount of inaccuracy, the picture remains clear: a significant number of the pastes shared on Ghostbin contain illicit content.
As I continue to dig through the data I’ve scraped from Ghostbin, I will post analyses of and insights into what I find. I am still in the process of refining and adding heuristics to classify the remaining half of the data, so hopefully there will be even more goodies than I’m aware of now. Anything that I can share without risking harm, I will share freely.
Questions? Comments? As always, drop me a line at @_lavalamp.
Until next time, folks!