As SEO professionals, much of our work involves analysing and manipulating URL data. While URL data is great, the level of granularity that URLs offer can sometimes be a downside when it comes to digesting that data or presenting findings to stakeholders or internal teams. A common way around this is to group URLs by root domain, turning lots of smaller pieces of analysis into one bigger picture.
If you’re less familiar with Python, check out our other blog post, how to extract domains from URLs in Excel.
We’ll be using the Python 3 tld project to make our scripts much easier to manage. Find out more information and how to install tld here. If you’re using IDLE with macOS, check out my other post which gives a brief overview on how to install modules in IDLE.
Extracting a single domain using print
from tld import get_tld

# URL to strip. Change this URL to whatever you want.
url = 'https://www.honchosearch.com/blog/seo/14-elements-every-successful-outreach-campaign-need/'

res = get_tld(url, as_object=True)  # Get the result as an object
print(res.fld)  # res.fld holds the root domain
Extracting multiple root domains within a list using print
from tld import get_tld

# List of URLs
urls = ['https://www.example.com/hello_world', 'https://www.example.co.uk/hello_uk']

for url in urls:  # Loop over each URL in the list
    res = get_tld(url, as_object=True)
    print(res.fld)
Extracting multiple root domains from a CSV using print
from tld import get_tld

# URLs should be in column A without a heading, in a CSV file named "urls_file.csv"
urls_file = "urls_file.csv"

# Read one URL per line, skipping blank lines, and close the file afterwards
with open(urls_file) as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    res = get_tld(url, as_object=True)
    print(res.fld)
This script also works with .txt files.
Extracting multiple root domains from a CSV to a CSV
import csv
from tld import get_tld

# URLs should be in column A without a heading, in a CSV file named "urls_file.csv"
urls_file = "urls_file.csv"
with open(urls_file) as f:
    urls = [line.strip() for line in f if line.strip()]

# Create a CSV file named "domains.csv" in the same directory
with open("domains.csv", "w", newline="") as the_file:
    writer = csv.writer(the_file)
    writer.writerow(["root domain", "url"])
    for url in urls:
        try:
            root_domain = get_tld(url, as_object=True).fld
        except Exception:  # tld raises an error when it cannot parse the URL
            root_domain = "NO ROOT"
        writer.writerow([root_domain, url])
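Once every URL has a root domain, the grouping mentioned at the start of this post is only a small step further. The following is a minimal sketch rather than a finished tool: the function names tld_root and count_by_root_domain are made up for illustration, and it assumes that tld's get_fld with fail_silently=True returns None for URLs it cannot parse.

```python
from collections import Counter


def tld_root(url):
    """Return the root domain of url using tld, or None if it cannot be parsed."""
    # Third-party package (pip install tld), imported here so the counting
    # function below can also be used with a different extractor.
    from tld import get_fld
    return get_fld(url, fail_silently=True)


def count_by_root_domain(urls, extract_root=tld_root):
    """Count how many URLs fall under each root domain."""
    counts = Counter()
    for url in urls:
        # Anything the extractor cannot handle is grouped under "NO ROOT"
        root = extract_root(url) or "NO ROOT"
        counts[root] += 1
    return counts
```

Calling count_by_root_domain(urls).most_common() on a list of URLs, such as the one read from urls_file.csv, should give the root domains ordered by how many URLs each one has.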
Outside of the context of DNS, the root domain usually refers to the overarching structure of a domain. For example, https://www.honchosearch.com is the root domain that contains all folders (such as /services/seo), subdomains and so on. The scripts above return it as honchosearch.com, without the protocol or the www prefix.
The tld Python library created by Artur Barseghyan allows you to easily extract the top level domain (TLD) from a given URL. What makes this library so handy is that it also includes other useful functions, such as extracting subdomains, extracting root domains and even checking the validity of a TLD.
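To give a feel for those extra functions, here is a short sketch. The wrapper names describe_url and looks_like_tld are hypothetical and exist only for this example; the attributes subdomain, domain, tld and fld, and the is_tld function, are the parts that come from the library, as far as we understand its documentation.

```python
def describe_url(url):
    """Split url into the parts exposed by tld's result object."""
    # Third-party package: pip install tld
    from tld import get_tld
    res = get_tld(url, as_object=True)
    return {
        "subdomain": res.subdomain,  # e.g. "www"
        "domain": res.domain,        # e.g. "example"
        "tld": res.tld,              # e.g. "co.uk"
        "root_domain": res.fld,      # e.g. "example.co.uk"
    }


def looks_like_tld(value):
    """Return True if value is itself a known TLD, such as "co.uk"."""
    from tld import is_tld
    return is_tld(value)
```

For https://www.example.co.uk/hello_uk, describe_url should report www as the subdomain, example as the domain, co.uk as the TLD and example.co.uk as the root domain.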
Want to find out more? National Coding Week is coming up on 16th September. Keep an eye on our blog as we’ll be sharing useful tips daily.