I had the pleasure of speaking at HushCon East this past weekend. For anyone that hasn’t attended before, I can’t recommend it highly enough. Quality people, great venue, interesting talks – everything you could want from a hacker con. The talk that I gave was entitled Cloudstone – Sharpening Your Weapons Through Big Data. What follows here is a write-up of the talk’s contents, references to the tools that I wrote to process the relevant data, and links to the content discovery hit lists that I generated through the research.


As someone who has spent a good amount of time in academic research, it’s apparent to me that many of the research tactics commonly employed in the academic sphere have not been translated over to industry research. This is especially true with respect to big data analysis, and the potential for big data analysis to improve offensive security tooling is striking. Thus this talk was born, as I wanted to take some of my academic research techniques and apply them to offensive security research.

This talk focused on improving web application content discovery. For the uninitiated, web app content discovery refers to the process of discovering resources on a web server that are not linked to. For web app testers, content discovery quite often pays some serious dividends, as unlinked resources commonly:

  • Contain debugging functionality or functionality not intended for anonymous users
  • Contain old versions of files already on the web server (ie: index.php was renamed to index.php.old when a new index.php was deployed)

The process of running web app content discovery is fairly straightforward – brute force guess directory and file paths on the web server in the hopes that your tool guesses URL paths for unlinked resources on the server. As things stand presently, this process is quite expensive. It is effectively fuzzing file and directory names on a remote web server that the tester only has black box access to. As with all fuzzing, effectiveness depends on the quality of the list of fuzzing strings being used. For a frame of reference, here are the list sizes for some of the fuzzing lists used by the standard content discovery tools:

As you can see above, the sizes of these lists can vary greatly. Furthermore, these tools will commonly run permutations on the input lists, meaning that the 5,992 entries in the dirs3arch list more realistically translate to 20,000+ requests.

Assuming an average throughput of 100 HTTP requests per second, this means that running through these lists can take anywhere from a few minutes to multiple hours. This is not necessarily problematic if you’re running content discovery on a single web server, but what if you want to run web app content discovery at scale? With these sorts of time commitments, this problem does not scale very well. Furthermore, these lists don’t come with any amount of guarantee that you’ve guessed the most common URL routes found on the Internet.

So how can we improve the effectiveness of web app content discovery? What if we could analyze all of the URL paths found on the world wide web and create hit lists that gave you a statistical guarantee that you had guessed the most common paths found on all web sites? What if we could take things a step further and cater these hit lists to individual server types? For instance, it’s probably a waste of time to guess file paths that end in .aspx on a server that is clearly running on a Linux host.

It’s precisely this problem that was the motivation to conduct this research, and the end result of the research is content discovery hit lists for a number of popular web servers that provide statistical guarantees of content discovery coverage. If you’re interested in seeing how I ran this research, then read on! If you’re only interested in the content discovery hit lists, then check out this GitHub repository.

The Technologies

This research is built upon a number of technologies. In no specific order, they are:

  • MapReduce – A programmatic approach for processing large amounts of data in distributed fashion.
  • Hadoop – An open source implementation of the MapReduce algorithm.
  • Amazon Elastic MapReduce – Amazon’s service offering that enables you to spin up a Hadoop cluster to process data on top of Amazon EC2 instances.
  • Common Crawl – A non-profit organization that crawls the world wide web and stores all of the crawl data in Amazon S3, to be consumed by Amazon services free of charge (ie: accessing the data is free).

The Code

All of the software that I wrote for this research is available on my GitHub. The individual repositories are:

The Presentation

If you’d like to see the slides that I used for the talk, take a look at my SlideShare page here.

The Approach

The details provided here may be a bit light for anyone who is entirely unfamiliar with Hadoop and MapReduce. As such, I recommend reading up on what the two technologies do if anything here is insufficiently clear.

The goal of this research was to analyze all of the crawl data provided by Common Crawl, extract the URL paths found within the data set, and associate URL path segments with the server types found within the relevant HTTP responses. This sounds simple in theory – run through all of the HTTP responses, aggregate the URL path segments for every observed server type, and then order the results by the most common URL segments for each server type. The main complication is that we must use the MapReduce algorithm to handle this process for us. MapReduce is deceptively simple as well. At a high level, you get two steps to process a data set:

  1. Map – Analyze data and generate key-value pairs representing the data that you would like to aggregate upon.
  2. Reduce – Apply a mathematical aggregation on the key-value pairs that were generated in the map phase.

Given our goal of extracting the most common URL segments per server type, and the constraint that we must use the MapReduce algorithm to get there, how can we go about achieving this?


I first had to determine what sort of key-value pairs I wanted to generate from the Common Crawl data. To this end, I decided to create string keys containing (1) the server type and (2) the URL segment, for every URL segment found in every HTTP response. Each of these keys would then be paired with an integer value of 1. As an example, consider an HTTP response for the URL path /foo/bar/baz.html, served by Apache running on Unix. Through the decomposition process found in the referenced Hadoop code, this response results in one key-value pair per URL segment – foo, bar, and baz.html – each associated with the Apache (Unix) server type and the value 1.
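The mapping step can be sketched in plain Python as follows. Note that the `server|segment` key encoding and the `map_response` helper name are my own illustration – the actual Hadoop job may serialize keys differently:

```python
from urllib.parse import urlparse

def map_response(url, server_header):
    """Emit one (key, 1) pair per URL path segment, keyed on server type.

    The 'server|segment' string encoding here is illustrative only.
    """
    path = urlparse(url).path
    # Split the path into its individual segments, dropping empty pieces.
    segments = [s for s in path.split("/") if s]
    return [(f"{server_header}|{segment}", 1) for segment in segments]

# For /foo/bar/baz.html served by Apache on Unix, the mapper emits one
# pair per segment:
pairs = map_response("http://example.com/foo/bar/baz.html", "Apache (Unix)")
# [('Apache (Unix)|foo', 1), ('Apache (Unix)|bar', 1), ('Apache (Unix)|baz.html', 1)]
```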

By mapping the Common Crawl data set into these key-value pairs, I could then apply a sum aggregation on the values on a per-key basis which would then indicate how many times a specific URL segment had been observed in association with a given server type in the full data set.


With the Common Crawl data set mapped into these key-value pairs, the reduce process was quite simple – sum up the values for every key, yielding the total number of times each (server type, URL segment) combination was observed.
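With some made-up mapper output for illustration, the reduce step is just a per-key sum, which in plain Python could be sketched as:

```python
from collections import Counter

# Hypothetical mapper output: the same key appears once per observation.
mapped = [
    ("Apache (Unix)|foo", 1),
    ("Apache (Unix)|foo", 1),
    ("Apache (Unix)|bar", 1),
    ("nginx|foo", 1),
]

# Reduce: sum the values for every key.
totals = Counter()
for key, value in mapped:
    totals[key] += value

# totals == {'Apache (Unix)|foo': 2, 'Apache (Unix)|bar': 1, 'nginx|foo': 1}
```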

Sure enough, taking a look at the output data set, the results followed our expectations:

Common Crawl analysis output

Now the only thing left was to process the output into the content discovery hit lists that we require.


The resulting data set from the Hadoop job(s) was approximately 3.6GB in size and contained around 94 million lines. I now had to process this data into the content discovery hit lists described at the beginning of this post. The hit lists in their final form can be downloaded from this repository. The code for processing the Hadoop results into the content discovery hit lists can be downloaded from this other repository.

I bucketized the Hadoop results into 19 different server types. The number of URL segments gathered per server ranged from a low of 839,122 entries for the Gunicorn server to a high of 597,085,398 entries for generic Apache (ie: the Server header did not specify whether the server was running on *nix or Windows). The URL segment counts by server type are shown below:

URL segments by server type

Now that we have the most common URL segments on a per-server basis, we can make a statistical guarantee that a certain number of content discovery guesses against a given web server will cover 99.9% of all URL paths observed for that server type. The big question, then, is how many requests we need to send to achieve this coverage guarantee. I’m happy to say that the results indicate a significant improvement upon existing approaches.
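The request count for a given coverage level can be derived from the aggregated counts by walking the segments in descending order of frequency until their cumulative share of observations crosses the threshold. A minimal sketch, using made-up counts for illustration:

```python
def requests_for_coverage(segment_counts, threshold=0.999):
    """Return how many of the most common segments are needed so that
    their cumulative observation count covers `threshold` of the total.
    """
    counts = sorted(segment_counts.values(), reverse=True)
    total = sum(counts)
    covered = 0
    for i, count in enumerate(counts, start=1):
        covered += count
        if covered / total >= threshold:
            return i
    return len(counts)

# Toy example: 'index.html' dominates, so very few guesses are needed
# to reach 95% coverage of all observed paths.
toy = {"index.html": 900, "admin": 50, "login": 30, "old": 15, "tmp": 5}
requests_for_coverage(toy, threshold=0.95)  # -> 2
```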

The following chart shows how many requests need to be submitted for every server type to achieve a given percentage of coverage of all data found within the Common Crawl data set:

Common Crawl content discovery coverage by server type

As shown above, in order to cover 99.9% of all URL paths observed for generic Apache, we only need to send 784 HTTP requests to the server. Taking the worst-case scenario (1,360 requests for the Nginx server type), this represents the following improvements in content discovery effectiveness over existing word lists and tools:

Common Crawl content discovery improvements


A few caveats should be taken into consideration when evaluating the improvements listed above. Namely, they are:

  • The Common Crawl data set only represents files that are linked to (ie: crawlable). As such, unlinked files are not contained within the output data set. This means that using the lists generated by this research alone is insufficient to achieve maximum coverage. Instead, they should be used in conjunction with unlinked-specific hit lists like the one provided by dirs3arch.
  • How Common Crawl determines what comprises the world wide web is not especially well documented. The data sets continue to improve as Common Crawl matures, but there will always be significant portions of the world wide web that are omitted from the Common Crawl data set.
  • The URLs crawled by Common Crawl likely miss many of the programmatically-generated (ie: JavaScript-based) URLs that are so prevalent in today’s single-page applications.

Regardless of these caveats, however, the data sets generated by this research should still represent a significant improvement upon the current state of web application content discovery.


This research represents only a single instance of how we can use big data to improve the effectiveness of our offensive tools. I hope that the data in this post, as well as the contents of the slide deck provided alongside it, inspires folks to get their hands dirty with Hadoop and the Common Crawl data set. There are also a number of other big data sources (listed in the slides) that can provide intriguing starting points for using big data to sharpen our weapons.

Thanks for playing o/