I’m a fan of services that allow you to post and share random text with random people. There is a plethora of them out there (https://pastee.org/, http://pastebin.com/, https://cryptobin.org/, etc.), and they all come with their own pros and cons. One of my favorites is Ghostbin, partly due to its ease of use and partly due to the fact that the site is not indexed.

The other day I was posting something to Ghostbin and I noticed something. Let me post something once more and show you.

Posting to Ghostbin is shown below:

Posting to Ghostbin

After the text was successfully posted, my browser was directed to the URL where the paste was now stored:

Ghostbin submission successfully posted

My post was saved at the following URL:


Looking at the URL, I thought to myself that the entropy of the paste identifier (o45h8) was quite low. Some quick Google dorking confirmed my suspicions that paste identifiers, while random, were not complex:

Google dorking for Ghostbin URLs

Further investigation determined that Ghostbin paste identifiers were five characters long and were composed only of lowercase letters and digits. I did some quick back-of-the-envelope calculations to figure out how many unique pastes Ghostbin could possibly have with this identifier structure:

Okay, so with 36 possible characters in each of five positions (36^5 = 60,466,176), Ghostbin could only have ~60.5 million unique pastes.
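The arithmetic behind that figure is short enough to check directly. Here is a minimal sketch, assuming the alphabet is exactly the 26 lowercase letters plus the 10 digits and that every identifier is exactly five characters long (the variable names are mine, for illustration only):

```python
import string

# Assumed alphabet: lowercase letters and digits, per the observation above.
ALPHABET = string.ascii_lowercase + string.digits
ID_LENGTH = 5

# Each of the five positions can independently take any of the 36 characters.
id_space = len(ALPHABET) ** ID_LENGTH
print(f"{id_space:,} possible identifiers")  # 60,466,176
```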

I then took it one step further. There are plenty of tools and libraries out there for firing off asynchronous HTTP requests en masse, and if I could achieve an average of 100 requests per second…

I found that in theory it would take me about seven days to find all of the valid Ghostbin paste IDs, assuming I could maintain an average of 100 requests sent per second.
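As a rough check of that estimate, the sketch below divides the size of the identifier space by an assumed sustained request rate. The 100 requests per second figure comes from the paragraph above; the function name is hypothetical, and the calculation ignores retries, rate limiting, and network hiccups.

```python
ID_SPACE = 36 ** 5            # five characters, 36 options each
REQUESTS_PER_SECOND = 100     # assumed sustained average
SECONDS_PER_DAY = 86_400


def days_to_enumerate(id_space, requests_per_second):
    """Days needed to request every identifier once at a constant rate."""
    return id_space / requests_per_second / SECONDS_PER_DAY


days = days_to_enumerate(ID_SPACE, REQUESTS_PER_SECOND)
print(f"{days:.1f} days")  # 7.0 days
```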

It was at this point that I grew excited. Individuals who use services that provide anonymity tend to claim that they only want the anonymity on ethical grounds, shying away from the fact that anonymity is a tool commonly employed by those who are up to no good. Here, then, was an opportunity to see how true this was. Was anonymity more commonly used for good or evil?

As a quick aside, if this sort of research interests you, I highly recommend checking out Nicolas Christin’s publications – he’s a prolific and expert researcher in the area of digital crime.

I set about the task of scraping Ghostbin, and in reality the scraping took closer to two weeks. Additionally, I did not scrape the entire ID space: I covered aaaaa through roughly 6aaaa and missed roughly 6aaaa through 99999. Even so, the information that I scraped from Ghostbin contained some very interesting stuff.
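To put the partial coverage in perspective, here is a small sketch that converts an identifier into its position in the ID space. It assumes the IDs were walked in order with letters before digits, which is my inference from the range running from aaaaa to 99999; the helper name is hypothetical, and since the stopping point was only approximately 6aaaa, the result is approximate too.

```python
import string

# Assumed enumeration order: a-z first, then 0-9.
ALPHABET = string.ascii_lowercase + string.digits


def id_to_index(paste_id):
    """Position of an identifier when IDs are enumerated in ALPHABET order."""
    index = 0
    for char in paste_id:
        index = index * len(ALPHABET) + ALPHABET.index(char)
    return index


total = len(ALPHABET) ** 5
covered = id_to_index("6aaaa")
print(f"covered {covered:,} of {total:,} IDs ({covered / total:.1%})")
```

Under those assumptions, stopping near 6aaaa means a little under nine tenths of the identifier space was visited.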

The resulting data set was so massive that I cannot hope to cover it all in one blog post. As such, I’ll be breaking this series up into this first introduction followed by individual posts for each of the interesting data types. If you’ve got a preference for what type of data I should dig into next, let me know at @_lavalamp.

The Bigger Picture

In total, I identified 42,538 pastes on Ghostbin. I made two passes across the identifier space: the first to find all of the valid identifiers and the second to pull down the raw paste content for the discovered identifiers. After pulling down the identified pastes, I took a look at the paste size distribution (note the logarithmic scale of the Y axis):

Ghostbin paste sizes

As shown above, pastes tended to be small, with the vast majority falling under 512 kB.

I was then faced with the enormous task of sorting through over 42k pastes. I manually looked through around 2k pastes and decided to separate pastes by topic. I started identifying common trends across similar pastes, and built out heuristics to organize as many pastes as I could. After a few rounds of going back and forth, I came up with the following categories:

  • Basic Password Dump – Pastes that consisted mostly of password dumps
  • C Code – Pastes containing C code
  • Credit Card Dumps – Pastes that consisted mostly of credit card information
  • Database Dumps – Pastes that contained non-specific database dumps
  • Decompilation – Pastes that contained decompiled binaries
  • Dox – Pastes containing doxes of varying targets
  • Email Dumps – Pastes containing large numbers of email addresses
  • Garbage – Small pastes that did not appear to contain anything of interest
  • HTML – Pastes containing HTML
  • IP Lists – Pastes containing large numbers of IP addresses and ports
  • JavaScript – Pastes containing JavaScript code
  • JSON – Pastes containing JSON
  • Nmap – Pastes containing Nmap output
  • Objective C – Pastes containing Objective C code
  • Crash Dumps – Pastes containing crash dump output
  • PHP – Pastes containing PHP code
  • Python – Pastes containing Python code
  • Shell Scripts – Pastes containing shell scripts
  • Tweaks – Pastes containing iOS tweaks
  • Twitter Account Lists – Pastes containing a large number of Twitter account links
  • URL Lists – Pastes that consisted mostly of URLs
  • XML – Pastes containing XML

Even after writing filters specifically for ferreting out the above categories, I successfully classified only 20,928 pastes (a little under half of the total number of retrieved pastes). Furthermore, the filters were far from 100% accurate, so many things were either missed or incorrectly classified. Because of this, the following numbers should be understood as generalizations instead of absolute truths.
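The filters themselves are not shown in this post. As a purely hypothetical illustration of what one line-based heuristic might look like, the sketch below flags a paste as an email dump when most of its non-empty lines contain something shaped like an email address. The regular expression, the 0.8 threshold, and the function names are all assumptions of mine, not the actual filters used here.

```python
import re

# Deliberately loose pattern; real addresses are messier than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def matching_line_fraction(text, pattern):
    """Fraction of non-empty lines that contain a match for pattern."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if pattern.search(line))
    return hits / len(lines)


def looks_like_email_dump(text, threshold=0.8):
    """Hypothetical filter: most lines look like they hold an email address."""
    return matching_line_fraction(text, EMAIL_RE) >= threshold


sample = "alice@example.com\nbob@example.org\ncarol@example.net\n"
print(looks_like_email_dump(sample))
```

A threshold on the matching fraction is one way to express "mostly"; it also shows why such filters misfire, since a short note with a single contact address and a long dump with a few junk lines can land on either side of the cutoff.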

After classification, the number of pastes found for each category is as follows:

Ghostbin pastes by category

The cumulative paste sizes for each category are shown below:

Ghostbin paste sizes by category

Looking at the two charts above, the paste count for a category does not strongly correlate with its cumulative paste size. To better demonstrate this, I normalized both charts by dividing each category’s value by the maximum value observed in that chart, which gives the following:
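The normalization described here is just a division by the largest observed value, so that counts and sizes land on the same 0 to 1 scale. A minimal sketch follows; the function name is mine and the input numbers are made-up placeholders, not figures from the scraped data.

```python
def normalize_by_max(values):
    """Scale each value to the range 0..1 by dividing by the largest one."""
    peak = max(values.values())
    if peak == 0:
        return {name: 0.0 for name in values}
    return {name: value / peak for name, value in values.items()}


# Placeholder numbers purely for illustration.
counts = {"category_a": 200, "category_b": 400, "category_c": 100}
print(normalize_by_max(counts))
```

Applying the same function to the per-category counts and to the per-category sizes makes the two series directly comparable on a single chart.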

Ghostbin paste sizes and counts by category

As previously stated, I started looking at this data to determine whether Ghostbin was being used predominantly for good or evil. I summed up the sizes of the 20,928 pastes that I categorized, which came out to 460 MB. I then summed up the paste sizes by category and divided the resulting values by 460 MB to get the percentage of the categorized paste volume (by disk size) that each category represented. The results are shown below:
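The percentage calculation is a straightforward share-of-total. As a sketch, with a hypothetical function name and placeholder sizes rather than the real per-category totals:

```python
def size_shares(sizes_by_category):
    """Percentage of the total size contributed by each category."""
    total = sum(sizes_by_category.values())
    return {
        name: 100 * size / total
        for name, size in sizes_by_category.items()
    }


# Placeholder sizes (in MB) purely for illustration.
sizes = {"category_a": 300, "category_b": 100, "category_c": 100}
shares = size_shares(sizes)
print(shares)
```

Note that the denominator is the total of the categorized pastes only, so the shares describe the classified half of the data rather than everything that was scraped.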

Ghostbin paste category size representations

As shown above, doxes, database dumps, password dumps, crash dumps, and email lists made up five of the top six categories, comprising 58% of the analyzed paste volume.

The only potentially innocuous category in the top six was PHP. Taking a closer look, most of the PHP files contained one-off scripts and random junk. There was, however, a significant number of PHP shells and exploits such as:

It always cracks me up to see images/CSS/JS files loaded in web shells. YOU get a shell! YOU get a shell! YOU get a shell! That guy you’ve never heard of that put the image into the web shell code gets a shell!

I digress. From the data shown above, it is evident that a significant portion of the data hosted on Ghostbin is being used for malicious purposes. Assuming that URL lists, Nmap scans, and credit card dumps are also malicious brings the malicious share of the classified paste volume up to 68%. In conclusion, the contents of Ghostbin appear to support the argument that anonymity is predominantly used with malicious intentions.

Conclusion and Next Steps

Services that offer anonymity are great, and they solve some very interesting problems. Even so, anonymity is a tool that is often sought after and used by people without good intentions. While the numbers I’ve crunched above may contain some amount of inaccuracy, the picture remains clear: a significant number of the pastes shared on Ghostbin have illicit content.

As I continue to dig through the data I’ve scraped from Ghostbin, I will post analyses of and insights into what I find. I am still in the process of refining and adding heuristics to classify the remaining half of the data so hopefully there will be even more goodies than I’m aware of now. Anything that I can share without risking harm, I will share freely.

Questions? Comments? As always, drop me a line at @_lavalamp.

Until next time folks!