Quick way to check for URL redirects
Recently I had to go through a bunch of URLs and identify whether or not they redirect to another URL. Since the list was rather long (several hundred entries), I thought I should rather write a script to do this for me.
I used the Requests library to make the initial request for a given URL. If the response code was
3XX, then I assumed it is being redirected. There are a couple of other edge cases (timeouts and generic connection errors, either due to non-existent domain, or network error) that the script attempts to cover for.
The script accepts a single parameter, which should be a filename containing URL strings. If the filename is not provided, the script will look for
domains.txt in the current directory.
#!/usr/bin/env python import sys import requests def check_for_redirects(url): try: r = requests.get(url, allow_redirects=False, timeout=0.5) if 300 <= r.status_code < 400: return r.headers['location'] else: return '[no redirect]' except requests.exceptions.Timeout: return '[timeout]' except requests.exceptions.ConnectionError: return '[connection error]' def check_domains(urls): for url in urls: url_to_check = url if url.startswith('http') else "http://%s" % url redirect_url = check_for_redirects(url_to_check) print("%s => %s" % (url_to_check, redirect_url)) if __name__ == '__main__': fname = 'domains.txt' try: fname = sys.argv except IndexError: pass urls = (l.strip() for l in open(fname).readlines()) check_domains(urls)
Below is an example of URLs. If protocol is not defined,
http will be assumed.
google.com http://gmail.com www.ibm.com redhat.com example.com i-believe-this-domain-does-not-exist-123abc.com
And the results:
http://google.com => http://www.google.com/ http://gmail.com => http://mail.google.com/mail/ http://www.ibm.com => http://www.ibm.com/us/en/ http://redhat.com => http://www.redhat.com/ http://example.com => http://www.iana.org/domains/example/ http://i-believe-this-domain-does-not-exist-123abc.com => [connection error]
comments powered by Disqus