Monsters Under Your Bed

Mapping the Dark Web with Python

Introduction

This write-up is based on a talk given at GrrCon 2024. The article starts with a brief overview of what the dark web is, then shows how you can use some simple Python code to connect to the dark web, scan various onion services, and identify potential threat intelligence. It also addresses the "Build vs Buy" debate and walks through a real-life example of how something like this could provide value to an organization without the need to pay a vendor tens of thousands of dollars for the privilege. 

A Disclaimer

Before I dive into the what and the how of this presentation, I want to make sure you understand the risks associated with the dark web. The dark web is just that: dark. You can find people selling drugs, websites advertising human trafficking services, even murder for hire. If you are not very careful, you could find yourself on the wrong side of someone else's scam, or you could end up having your personal information compromised. 

With all of that being said, it is best to follow some basic precautions. The list provided in the slide above is by no means exhaustive, but it is a start. Before accessing the dark web, make sure that you are using the most up-to-date version of the Tor Browser. It is also a good idea to route your web traffic through a VPN, ideally one with malware-blocking capabilities. 

Once connected to the dark web, be extremely cautious when downloading anything. If you do need to download something, make sure it comes from a trusted provider. Also disable JavaScript on all sites unless you absolutely need it for what you are doing. 

Be careful and try not to do anything stupid. Also, I do not accept any responsibility for what you do with anything that you learn from this article (or any of my other work for that matter). If you experience any negative consequences in your research, that is ultimately your responsibility. 

What is the dark web? 

For those of you who have not heard of or accessed the dark web before: at its most basic, it is a separate segment of the internet that can only be accessed through specific private channels. Some of the more common networks that make up the dark web are Tor and I2P, but there are other, less common protocols. With Tor specifically, you can access things called onion services, which are a lot like websites on the open internet with an additional layer of privacy. If configured properly, an onion service can effectively hide its IP address as well as its location. 

You may ask yourself why someone would want to create or access an onion service (I'll just assume you did). People access the dark web for both good and awful reasons. On the positive side, you often see organizations working to promote free speech and protect whistleblowers. For those living under oppressive governments, onion services offer a much wider reach where they can share their opinions and make sure their voices are heard. 

You can't have the good without the bad though. On the dark web, you will often come across things like drug marketplaces, hackers for hire, human trafficking; the list goes on. It can be a very dark place, but that side of the coin is to be expected. For every good thing, there will always be groups that find ways to use it for the wrong reasons. The dark web is no different. 

But what does this have to do with cybersecurity? Unless you’re looking to buy drugs (I don’t know you), you will most likely be interested in seeing what kind of intelligence you can pull from the dark web. Spoiler alert, there is a lot out there. 

Dark Web Intelligence

When you’re looking for intel on the dark web, you can go one of several different ways. For our uses, I specifically dove into breach intel and threat intel. We’ll touch a little bit on brand security monitoring, but for the most part we’ll just keep it to these two core areas. So first I’ll explain the difference between breach intel and threat intel. 

Breach intelligence, put simply, is any piece of information that could be used to indicate an organization or individual’s digital security has been compromised. You could be looking for corporate email addresses that are included in data leaks or you could be looking for references to your internal systems (think IP addresses, host names, etc.). On a personal level, you will often see dark web monitoring solutions that keep an eye out for your personal information (health information, social security number, personal email, etc.) in recent data leaks. 

Threat intelligence, on the other hand, doesn't indicate that a breach has occurred. This is something that organizations often gather as a way of seeing what known threat actors are doing. This information allows them to keep up with the most recent tactics and strategies used by the enemy, as well as potentially hear about attacks before they happen (this is rare, but it does happen). You can also keep an eye on new malware and see what vulnerabilities threat actors are starting to exploit, so that you can ensure your systems are patched and protected. 

Build vs Buy

All of this intel is great, but what exactly do you have to do to get your hands on it? As with most technology problems, you have two options. You can either buy a solution or you can build your own. Based on the overall topic of this article, you can probably guess which way I tend to lean. 

The option that many organizations typically end up going with is to buy a cookie-cutter solution. You can find this intel through companies like CrowdStrike, Flashpoint, or Recorded Future, just to name a few. This is not cheap, but if you are looking to build an equivalent solution you will likely end up spending far more on development internally. Cost aside, the biggest negative associated with buying instead of building is flexibility. Since you are not the sole user of the solution, you have far less control over the features and future roadmap of the product you're paying for. From first-hand experience, I can tell you this is frustrating. You may have a great idea for a feature that is pretty simple to implement, but it will not be built unless you have an in with the product manager. 

Building, on the other hand, gives you far more control over the functionality of the solution. You own it and nobody gets to use it unless you say so, which means you get to build it however you want. This does require that you have internal development experience (or pay someone external to build it for you), but that is being made easier with the recent advancements in generative AI. A simple application that could've taken a software engineer up to a week to build can now be generated with a few well-crafted prompts. If you can't afford to pay six figures a year for a dark web monitoring solution, you can start by building out something simple and improve it as you go. While this wouldn't give you the same immediate spike in incoming intelligence, you would be able to gradually increase (or decrease) your organization's spend as you realize (or don't realize) business value. 

As I hinted at, I tend to lean toward building because recent AI advancements make it more approachable for small businesses, and it gives larger enterprises more flexibility. If you belong to a larger organization, you may want to use a custom solution to augment the intel coming in from a commercial product and get the best of both worlds. 

If you are interested in building your own solution, I have put together a very basic one to show you what's possible. The rest of this article will show you how it works, provide a case study to demonstrate the value it can deliver, and explain where you can take something like this. 

How it works...

Before I dive into the logic of the solution, it may be a good idea to check out the source code at this GitHub repository. This code is entirely AI-generated and does require some additional tweaks before it could be considered production-ready, but it's something to start with. All I did to get this code was submit two fairly detailed prompts. From there, you can see what I got. This code was working in minutes, whereas writing the script by hand would have taken me over an hour. If you have more specific requirements for your solution, I recommend trying to generate something yourself. 

Dependencies

Before I could actually run the source code I generated, I needed to make sure I had all of the dependencies in place. First off, there is the Tor Browser. While there are a couple of well-known Tor libraries in Python, using the browser is often the simplest solution. I will also be using it to find relevant onion services and investigate findings, so it is important to get it installed. If you're looking to access the dark web, you can download the most recent version of the Tor Browser from TorProject.org.

You may also find the DB Browser for SQLite useful when viewing your scan results. Later on, you will see how I used it to look at bitcoin wallets discovered on various onion services. If you’re going to follow along, I highly recommend going out to SQLiteBrowser.org and downloading it for yourself. 

The script itself requires a handful of different libraries. Some of these dependencies are included in the Python Standard Library, but there are a few that need to be installed separately. The libraries can be separated into three categories: web scraping, data storage and retrieval, and data visualization. See below for more details on each library, along with a sketch of the imports after the list. 

Web Scraping: 

  • requests – For handling web requests  

  • bs4 – BeautifulSoup, for scraping/parsing HTML  

  • re – For running regex searches against web pages 

Data Storage and Retrieval: 

  • sqlite3 – For creating and manipulating the database 

  • json – For loading the config file into memory 

Data Visualization: 

  • networkx – For building out a network diagram  

  • plotly – For visualizing the dark web network 
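
For reference, the top of the scanner script ends up with an import block along these lines (a minimal sketch; the generated code in the repository may order or alias things differently):

    import json       # loading the config file into memory
    import re         # regex searches against web pages
    import sqlite3    # creating and manipulating the database

    import requests                    # handling web requests over the Tor proxy
    from bs4 import BeautifulSoup      # scraping/parsing HTML

    import networkx as nx              # building out the network diagram
    import plotly.graph_objects as go  # visualizing the dark web network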

Environment Setup

Once you've gone out and installed Tor, there is only one configuration change you have to make. To do so, you have to open a file named torrc (you can find it by going to "Tor browser > Tor > data > torrc"). For the most part you will leave this file alone, but you will have to add one line to the end. That line should read "SocksPort 9051", though you can use any port number you want as long as it is available. This change opens a SOCKS proxy port that the Python script can use to route its traffic through the Tor network. 
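
For reference, the addition is a single line like the one below (the exact location of torrc can vary by operating system and Tor Browser version, and 9051 is just the port used throughout this article):

    # Appended to the end of the Tor Browser's torrc file
    SocksPort 9051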

Aside from that change, you will have to install the four non-standard libraries listed in the image above. You can use the following syntax to do that: "pip install [library name]" (note that the bs4 module is published on PyPI under the package name beautifulsoup4). 
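
If you want to grab everything in one go, a single command along these lines should cover the non-standard dependencies (the requests[socks] extra pulls in SOCKS proxy support, which the Tor session relies on):

    pip install "requests[socks]" beautifulsoup4 networkx plotly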

An Overview

At a high level, the scan workflow has three phases: initiation, scanning, and visualization (with visualization being part of a different script). The initiation phase is simple. It starts by pulling in the configuration details from the config file and then uses that information to create a Tor session. It also checks whether the database exists. If it does not, the script creates the database along with each of the tables inside of it.  
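
As a rough illustration, the initiation phase looks something like the sketch below. The file name, table layout, and column names here are assumptions made for illustration; the actual AI-generated code in the repository may structure things differently:

    import json
    import sqlite3

    def initiate(config_path: str = "config.json"):
        """Load the config file and create the database tables if they don't exist."""
        with open(config_path) as f:
            config = json.load(f)

        db = sqlite3.connect("darkweb_scan.db")
        db.executescript("""
            CREATE TABLE IF NOT EXISTS services (
                url TEXT PRIMARY KEY,
                findings INTEGER DEFAULT 0
            );
            CREATE TABLE IF NOT EXISTS connections (
                source_url TEXT,
                target_url TEXT
            );
            CREATE TABLE IF NOT EXISTS btc_addresses (
                address TEXT,
                url TEXT
            );
        """)
        db.commit()
        return config, db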

The scanning phase starts after the database has been created and/or loaded by the script. It simply goes through each of the seed onion URLs and pulls down the related web page. Using the BeautifulSoup and re libraries, it scans the web page for other onion URLs, any bitcoin addresses, and whatever custom patterns are included in the config file. For each onion URL discovered in a web page, a follow-up scan of that URL is kicked off with a depth of one less than the page it was discovered on. This allows whoever is running the script to control the depth of a scan and limit its scope as much or as little as they want. 

Once each seed URL has been scanned to the configured depth, the scanner script finishes adding the findings to the database and exits. From there you can use either the visualizer script or the DB browser to view your results. If you use the visualizer, you will see a network diagram displaying the results and where each onion service was initially referenced. 

The Config File

This is an example of the config file that is used in the initiation phase of the scanner script. You can see a few different data points in the JSON file, such as: the port that was defined in the torrc file, the scan depth that is used to control the scope of the scan, and the patterns used to identify relevant findings on scanned web pages. These are simple regex patterns, so if you have used regex before they should look familiar. The initial AI-generated solution built this config file, but I had to adjust some of the variables in the scanner script so that it would actually read from the config file. 
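
A minimal config along these lines illustrates the idea. The key names and the onion URL are placeholders for illustration, not the exact keys used in the repository:

    {
      "socks_port": 9051,
      "scan_depth": 5,
      "seed_urls": [
        "http://exampleonionaddressxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion"
      ],
      "search_patterns": [
        "[Ww]estern\\s+[Uu]nion",
        "money\\s+transfer"
      ]
    }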

One other variable that is included as part of the scanner is the noise filter. This variable is used to filter out any results that don’t have enough relevant findings. Setting it to 1 ensures that every service shown by the visualizer has at least one relevant finding. You can, of course, set this to whatever level you wish. 

Creating a Session

The code used to create a session is rather simple. If you've ever used requests, you may have created a session before. The only difference between creating a normal requests session and creating one through Tor is 3 lines of code. You can see the lines where ses.proxies is set, pointing the session at the proxy port that was defined in the torrc file. By doing this, you direct all traffic related to that session through Tor and can access the dark web. 
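
A minimal sketch of that idea, assuming the SocksPort from the torrc change above, looks roughly like this. Note the socks5h scheme, which tells Tor to resolve .onion hostnames itself (plain socks5 will not work for onion services) and relies on the requests[socks] extra mentioned earlier:

    import requests

    def create_tor_session(port: int = 9051) -> requests.Session:
        """Build a requests session that routes all traffic through the local Tor proxy."""
        ses = requests.Session()
        ses.proxies = {
            "http": f"socks5h://127.0.0.1:{port}",
            "https": f"socks5h://127.0.0.1:{port}",
        }
        return ses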

Running the Scan

The code used to scan a web page is a bit more complex and includes some SQL. First off, it grabs the web page using requests and parses the response using Beautiful Soup. Once it does that, the script searches through the contents of the web page for other onion URLs as well as whatever custom patterns you have provided. Based on the results of these searches, it adds any new onion services to the database and saves any new outbound or inbound connections. It also kicks off additional scans for those discovered services if there is still positive scan depth remaining. 

After adding the basic findings and other onion services to the database, it performs one more regex search for bitcoin addresses specifically. If it finds any, the script adds those findings into a separate table and references the URL the address is associated with. 
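
Putting those pieces together, the core of the scanning phase can be sketched roughly as follows. The regex patterns, table names, and overall structure are simplified assumptions for illustration and do not mirror the generated code line for line:

    import re
    import requests
    from bs4 import BeautifulSoup

    # Simplified patterns; the custom search patterns come from the config file.
    ONION_PATTERN = re.compile(r"[a-z2-7]{56}\.onion")
    BTC_PATTERN = re.compile(r"\b(?:bc1[a-z0-9]{25,39}|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b")

    def scan_page(ses, db, url, patterns, depth):
        """Fetch one onion page, record findings, and recurse into discovered services."""
        try:
            resp = ses.get(url if url.startswith("http") else f"http://{url}", timeout=60)
        except requests.RequestException:
            return  # unreachable services are simply skipped

        text = BeautifulSoup(resp.text, "html.parser").get_text()

        # Count hits for the custom search patterns and record the service.
        findings = sum(len(re.findall(p, text)) for p in patterns)
        db.execute("INSERT OR REPLACE INTO services (url, findings) VALUES (?, ?)", (url, findings))

        # Record any bitcoin addresses found on the page, tied back to this URL.
        for addr in set(BTC_PATTERN.findall(text)):
            db.execute("INSERT INTO btc_addresses (address, url) VALUES (?, ?)", (addr, url))

        # Record outbound links and keep scanning while there is depth left.
        for onion in set(ONION_PATTERN.findall(resp.text)):
            db.execute("INSERT INTO connections (source_url, target_url) VALUES (?, ?)", (url, onion))
            if depth > 1:
                scan_page(ses, db, onion, patterns, depth - 1)

        db.commit()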

A Case Study

To provide an example of what a scanner like this can do, I ran the scan against a very small list of seed URLs and used only 2 custom search patterns. Specifically, I wanted to find any fraudulent activity related to Western Union money transfers. These types of transfers are regularly found for sale at a deep discount on the dark web. While many of these offers are likely scams, hoping a naive user will freely hand over their money, some of them are legitimate. I wanted to see what we could do with this simple scanner and what intelligence we could pull together. 

Sowing the Seed

The first step in setting up the scanner is figuring out what data you want to feed it. For this type of thing, a good saying to keep top-of-mind is "garbage in, garbage out". If you give the scanner bad seed data, you will likely not get good results from future scans. Good seed data means seed URLs that are relevant to your organization and custom search patterns that are hyper-focused on what you are looking for. 

A good way to get relevant URLs is using dark web search engines. The screenshot provided in the slide above shows a search engine, with results that match up with the search term “Western Union”. This is great and provides a strong foundation to build our scanner on top of. Another good place to start is a well-maintained dark web directory site, such as Deep Link Guide or The Hidden Wiki. 

One thing that's worth pointing out is that you should be careful about which links you click on these seed sites. As you can see near the top of the screenshot, there are several links to services that sponsor the search engine I used. Many of these sponsors are offering harmful and illegal services related to things like human trafficking (you can see why I censored the links). 

In a typical situation where your organization was outsourcing this process to a CTI vendor, you would not see this or be exposed to the more malicious side of the dark web. When building your own solution, you are inevitably going to be exposed to this type of content and should do your best to prepare for it. Seeing stuff like this can have negative consequences when it comes to your mental health, so be sure that you have someone you can talk to if you’re struggling. 

Waiting...

Now that the config file is created, the next step is to kick off the scan. Depending on the level of depth defined in the config file, the length of this scan can vary. I kept it pretty simple with a depth of 5, so this scan took a little over 5 minutes. A scan with more depth could take much longer. 

Eliminating the Noise

When using the "visualizer.py" script, you will see a query that pulls the different services from the database. If you neglect to apply the noise filter when pulling out findings, you will often end up with a much dirtier graph (such as the one displayed on the left). By applying the noise filter, you can get rid of all results that have nothing to do with your search and avoid seeing content from sponsors or advertisers. The chart on the right is much easier to decipher and generally more informative. 
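
As an example of how that filter might be applied, the query inside the visualizer can simply exclude services whose finding count falls below the threshold before the graph is built. The sketch below reuses the illustrative table layout from earlier and a hypothetical noise_filter value; the plotly rendering step is omitted for brevity:

    import sqlite3
    import networkx as nx

    noise_filter = 1  # keep only services with at least one relevant finding

    db = sqlite3.connect("darkweb_scan.db")
    rows = db.execute(
        "SELECT url, findings FROM services WHERE findings >= ?", (noise_filter,)
    ).fetchall()

    graph = nx.DiGraph()
    for url, findings in rows:
        graph.add_node(url, findings=findings)

    # Only keep edges between services that survived the filter.
    kept = {url for url, _ in rows}
    for src, dst in db.execute("SELECT source_url, target_url FROM connections"):
        if src in kept and dst in kept:
            graph.add_edge(src, dst)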

In the chart on the right, you will see a few things. You can see several nodes that are very light colored, almost white. Those services have at least one keyword match, but not enough to be of substantial interest. The ones you'd care about will likely be darker. You can see the black node in the middle of the chart. That is the search engine that was used as part of the seed, so it makes sense that there are over 50 findings (you're looking at search results for the exact keyword used in the scan). The ones we'd be most interested in are those that are dark green or even blue. Those few services are the ones covered in the next few slides. 

Wolf Hacker Group

The first finding that I looked into was an onion service managed by a group that calls themselves Wolf Hacker Group. They sell services surrounding things like ATM hacking, fraudulent money transfers, or social media hacking. In other words, they’re what you’d expect from a dark web hacker service. 

I found no evidence to disprove or validate the authenticity of their services, but there is one key thing about this service that is interesting. They claim to have compromised a group of key systems that handle Western Union money transfers. If what they say is true and they have compromised Western Union systems, this becomes very valuable information. If their claims are false, it is still good to know that they exist in case they become more dangerous to the organization in the future. 

Western Union Hack

Another service that showed up in the scan results is similar to Wolf Hacker Group, but more specifically focused on Western Union. They call themselves Western Union Hack. Once again, I was unable to validate their claims (I don't have internal access to Western Union systems...), but there is something unique about this service. 

If you look at the site, you can see that it looks incredibly professional, so much so that you might wonder if it’s even a dark web site. The reason it looks so professional is partially due to the fact that they are using Western Union’s specific branding to design their web pages. While fraud is a bigger risk to the company in my opinion, this may be of interest to the marketing and/or brand security teams. Something like this could tarnish the company brand if it gained traction, so it's best if the organization does what it can to get on top of it. 

Western Union Shop

Western Union Shop is one of the more interesting vendors that I came across. Compared to the other two vendors shown in this article, the styling is pretty minimal. It uses a solid yellow background and relies on a few containers to separate the different products it offers. It offers similar products to the other vendors, mostly revolving around credit cards, PayPal accounts, and Western Union transfers. 

To receive payments, it calls out a specific escrow service that is commonly used across many onion services. That is one signal that it is more reputable, but it doesn't give us much to go on. However, they also call out a specific bitcoin wallet that was picked up by the scanner. This BTC wallet is used for direct payments, which should be easier to track and is directly associated with the adversary. 

Using a website called BitRef.com, I pulled up the transaction history associated with the discovered BTC wallet. We can see that there has been a good amount of traffic, with one transaction less than 2 months old and many others occurring over the previous year. This information could prove to be valuable intelligence. If I had access to internal Western Union systems, I would look into any potentially compromised money transfers and see if I could correlate that data with the transaction data available on BitRef. If we started to see an overlap, we could use that information to identify additional fraudulent transfers. 

What next?

Now that you’ve found a potential intelligence source, what next? Well, there’s a lot you can do. First off, you should put the scanner on a schedule so that you can keep your findings up to date. This can be done with a cron job or any other job scheduling tool. You can also monitor the BTC network for activity related to any wallets of interest that came up in your scan results. 
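
For example, a nightly run could be scheduled with a crontab entry along these lines (the paths are placeholders for wherever the script and its logs actually live):

    # Hypothetical crontab entry: run the scanner every night at 2 AM
    0 2 * * * /usr/bin/python3 /opt/darkweb-scanner/scanner.py >> /var/log/darkweb-scan.log 2>&1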

It may also be a good idea to look at expanding the scope of your scans. This can be done by either adding additional seed URLs (there are several dark web search engines and directories out there) or adding additional search patterns. By doing this you will increase the number of services you scan and improve the quality of your scan results. 

Another way to expand the scope of your scans is to expand the functionality of the scanner. This can go a million different ways, but two specific improvements come to mind. In its current state, the scanner is only discovering links on websites and taking them as is. You could look into automating the identification of other web directories in the onion service, using dictionaries and other brute force methods. 
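
A crude sketch of that idea, assuming the Tor-backed session from earlier and a small, hypothetical wordlist of common paths, might look like this:

    import requests

    # Hypothetical wordlist; a real run would load a much larger dictionary file.
    COMMON_PATHS = ["admin", "login", "db", "backup", "orders", "wallet"]

    def discover_directories(ses: requests.Session, onion_url: str) -> list[str]:
        """Probe an onion service for common directories that are not linked anywhere."""
        found = []
        for path in COMMON_PATHS:
            try:
                resp = ses.get(f"{onion_url.rstrip('/')}/{path}", timeout=60)
            except requests.RequestException:
                continue
            if resp.status_code == 200:
                found.append(path)
        return found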

Another potential expansion for the scanner would be to find a way to handle authentication. Many of the best intelligence sources will be hidden behind authentication controls, but those controls are typically pretty weak. You can build out some basic handling for password fields and, if you’re getting really deep, attempt to find ways around any bot protection you run into. 

These are just a few ideas, but the possibilities are endless. If you can think of a cool way to extract intel from the dark web, you can probably build it. And if I had to guess, you’ll get it working faster than any vendor would ever build it for you. Just try it out and if you do something awesome, tell me about it! 

Summary

So to sum it all up, the dark web is dangerous, but it is also a goldmine. You will likely see the darker side of humanity and come across things you wish you hadn’t, but if you look hard enough you may be able to stay one step ahead of your adversaries. As long as you don’t mind getting your hands dirty, you don’t need to spend ungodly amounts of money to get this intel. It’s out there, you just have to build the tools to go out and get it. It’s not that hard and generative AI will continue to make it much easier, so give it a look. 

And remember. Be safe, not stupid. Don’t do anything I wouldn’t do and if you do... well I told you not to. Hopefully that’s enough for the lawyers. 

Conclusion 

This is just an example of what you can do, and a very simple one at that. If you put in a bit of effort and approach this with a specific set of requirements, you can easily come up with a much better solution. For those of you that think starting from scratch is too intimidating, you can always look into open-source tools and customize them to better fit your needs. 

If you are interested in a more complete open-source solution like this, keep an eye on my projects page or my GitHub profile. I will be taking what I’ve learned from this project and putting together a more complete solution. Once the first version of that solution is ready, I will update this article and include a link. Until then, get to building and let me know what cool things you find!