Recursive Link Crawler
The Recursive Link Crawler was created to make better use of the Trie class I had previously learned about.
Functionality
The Recursive Link Crawler traverses a given link, gathering all subsequent links recursively. It builds a Trie structure representing the website's link tree.
Components
- TrieCrawler.py contains the main functionality of the program.
- Node.py handles background calculations and branching.
- Crawler.py creates and calls the TrieCrawler class with a specified website and title (see the sketch below).
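The exact interface lives in the repo, but a minimal sketch of how Crawler.py might drive TrieCrawler could look like the following. The constructor arguments and the crawl/size method names are assumptions for illustration, not the repo's actual signatures.

```python
# Hypothetical driver in the spirit of Crawler.py.
# TrieCrawler's constructor and methods are assumed names, not the repo's real API.
from TrieCrawler import TrieCrawler

if __name__ == "__main__":
    # Build the link Trie for a start URL and a human-readable title.
    crawler = TrieCrawler("https://www.example.com/index.html", "example_com")
    crawler.crawl()        # recursively gather every href reachable from the root
    print(crawler.size())  # number of unique nodes in the Trie
```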
Understanding Links and Nodes
In this context, a link is defined as any element with an href attribute, whether or not it is accessible. This means that not all links lead to valid websites. For example, ./styles.css is considered a node but not a reachable website.
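To make that definition concrete, here is a small sketch of collecting every href-bearing element on a page, reachable or not. It uses requests and BeautifulSoup, which may differ from what the repo actually uses.

```python
import requests
from bs4 import BeautifulSoup

def extract_hrefs(url: str) -> list[str]:
    """Return every href value found on the page, whether or not it is reachable."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Any element carrying an href attribute counts as a node,
    # including entries like ./styles.css that are not real pages.
    return [tag["href"] for tag in soup.find_all(href=True)]

print(extract_hrefs("https://www.example.com"))
```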
Size Calculation and Scalability
The size of a Recursive Link Crawler (RLC) Trie is determined by the number of unique nodes it contains, not the number of links. Here’s an example:
- Website 1 (Links to 2, 3, and 4)
  - Website 2 (Links to 1)
  - Website 3 (Links to 1 and 2)
  - Website 4 (Links to 5 and 6)
    - Website 5 (Links to 4 and 2)
    - Website 6 (Links to 1 and 5)
In this case, the Trie would have a size of 6, as there are 6 unique websites, even though there are 12 total links.
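In other words, the size comes from deduplicating nodes while crawling. Below is a small sketch of that idea using a visited set over a pre-built link map; it is not the repo's actual Node/TrieCrawler code.

```python
def trie_size(start: str, links_of: dict[str, list[str]]) -> int:
    """Count the unique nodes reachable from start, ignoring repeated links."""
    visited: set[str] = set()

    def visit(node: str) -> None:
        if node in visited:  # already counted; a repeated link adds nothing
            return
        visited.add(node)
        for child in links_of.get(node, []):
            visit(child)

    visit(start)
    return len(visited)

# The example above: 12 links in total, but only 6 unique websites.
links = {
    "Website 1": ["Website 2", "Website 3", "Website 4"],
    "Website 2": ["Website 1"],
    "Website 3": ["Website 1", "Website 2"],
    "Website 4": ["Website 5", "Website 6"],
    "Website 5": ["Website 4", "Website 2"],
    "Website 6": ["Website 1", "Website 5"],
}
print(trie_size("Website 1", links))  # -> 6
```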
It's easy to see how quickly this process can grow in complexity with any modern website. The following table provides examples of RLC sizes for different websites:
Website | Link | RLC Size |
---|---|---|
Panoramic56 | Panoramic56/index.html | 11 |
IBM | www.ibm.com/us-en | |
Apple | www.apple.com | |
The New York Times | www.nytimes.com | |
The University of Utah | www.utah.edu | |
Snowbird | www.snowbird.com | |
Dot Graphs
I decided to add a dot graph representation for the websites I was scraping.
These dot graphs are named after the website title, which is the last section of the website's URL with its punctuation replaced by underscores (done in order to be compatible with GraphvizOnline).
Since most of the websites have gigantic RLC Tries, the dot graph is almost unreadable, but for some smaller websites it is a good way of visualizing what the algorithm is doing.
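As a rough illustration of that naming step and of the dot output itself (the function names below are hypothetical, not the repo's), the title can be derived from the last URL segment and the edges written out in a form GraphvizOnline accepts:

```python
import re

def graph_title(url: str) -> str:
    """Take the last section of the URL and replace its punctuation with underscores."""
    last = url.rstrip("/").split("/")[-1] or "root"
    return re.sub(r"[^A-Za-z0-9]", "_", last)

def to_dot(edges: list[tuple[str, str]], title: str) -> str:
    """Emit a dot graph that can be pasted into GraphvizOnline."""
    lines = [f"digraph {title} {{"]
    lines += [f'    "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)

print(to_dot([("index.html", "./styles.css")], graph_title("www.example.com/index.html")))
```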
The following is a dot graph of my website (the grey box with the gigantic URL is the draw.io website).

Conclusion
The time to scrape all these websites is significant, but optimizing web scraping is not the focus of this project. The goal is to understand the vast connectivity of the internet.
This was mostly a research experiment and not an actual, usable script.
If you want to see the GitHub repo where the code is stored, click the button below. The repo also has dot graphs for the example websites and their text representations.
Recursive-Link-Crawler-