In a previous article we gave an introduction to web scraping with the Html Agility Pack. Now we are going to put that into use and build a simple web crawler.
Sometimes I need to crawl my own websites. For example, when migrating a website I might need to create redirects, and a crawler like this becomes a useful tool.
I will explain how you can build your own web crawler, but you should only use this article as a basis and adapt it as needed.
Warning: This crawler doesn't include any kind of delay between crawling pages, and it will probably fail when presented with an unexpected structure, so you should NOT use it to crawl any random website. Customize it to your own needs, or if you just want to crawl an arbitrary website, download one of the many existing crawling utilities instead.
The basic structure:
public class Crawler
{
protected static readonly Regex REGEX_FILENAME = new Regex(@"[^a-zA-Z0-9_-]+");
public Dictionary<string, string> Files { get; protected set; }
public string Url { get; set; }
public string BaseUrl
{
get { return this.Url.EndsWith("/") ? this.Url : this.Url + "/"; }
}
}
The properties:
- Url is the start point of the crawl
- BaseUrl is self-explanatory
- Files contains the original paths and the local paths of their respective crawled versions
- REGEX_FILENAME is just a regular expression we will use when determining a filename for local files
You use the crawler as follows:
var url = "http://localhost/test";
var crawler = new Crawler(url);
crawler.Crawl();
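The class skeleton above doesn't show a constructor, but the usage implies one that takes the start URL. A minimal version (my assumption, not necessarily the exact code from the download) could look like this:

public Crawler(string url)
{
    this.Url = url;
    this.Files = new Dictionary<string, string>();
}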
The main and only public method:
public void Crawl()
{
Files.Clear();
CrawlUrl(new HtmlWeb(), this.Url);
}
This method calls an internal recursive method. Let’s look at that:
protected void CrawlUrl(HtmlWeb web, string url)
{
// only crawl each url once
if (!IsUrlCrawled(url))
{
if (SavePage(url))
{
CrawlLinks(web, url);
}
}
}
Not much to it yet. First we make sure the URL hasn’t been crawled before, then we try to save it and if we succeed we proceed to crawl all links from that page.
You can find the complete code for the IsUrlCrawled method in the download. In short, it checks whether the URL has already been added to the Files collection. One complication is differentiating between home paths, which usually have several aliases. I simply assume that /, index.php, index.html and index.aspx are the possible options and that they are interchangeable. Obviously this doesn't fit all use cases.
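To illustrate the idea, here is a sketch of what IsUrlCrawled could look like with that alias handling. This is my own reconstruction, not the code from the download, and it assumes URLs are normalized the same way before being added to Files:

protected bool IsUrlCrawled(string url)
{
    return Files.ContainsKey(Normalize(url));
}

// treat "/", "index.php", "index.html" and "index.aspx" as the same page
protected static string Normalize(string url)
{
    var aliases = new[] { "index.php", "index.html", "index.aspx" };
    foreach (var alias in aliases)
    {
        if (url.EndsWith("/" + alias))
        {
            // strip the alias, keeping the trailing slash
            return url.Substring(0, url.Length - alias.Length);
        }
    }
    return url;
}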
Let’s look at how we parse all links from a page and crawl those links.
protected void CrawlLinks(HtmlWeb web, string url)
{
// get all links
var doc = web.Load(url);
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
foreach (var link in links)
{
var href = link.GetAttributeValue("href", string.Empty);
// avoid anchors
if (!string.IsNullOrWhiteSpace(href)
&& !href.StartsWith("#"))
{
// build absolute uri
var absoluteUri = GetAbsoluteUrl(href);
// only crawl links under the base URL
if (absoluteUri.StartsWith(this.Url))
{
CrawlUrl(web, absoluteUri);
}
}
}
}
}
We are using the Html Agility Pack to parse the document. We get all the links, but we make sure to avoid anchors and external URLs. Then we crawl each URL and the process starts over.
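The GetAbsoluteUrl helper isn't shown in this article. A plausible implementation (an assumption on my part) can lean on System.Uri, which resolves relative links against a base URL and passes absolute links through unchanged:

protected string GetAbsoluteUrl(string href)
{
    // e.g. "about.html" against "http://localhost/test/"
    // becomes "http://localhost/test/about.html"
    var uri = new Uri(new Uri(this.BaseUrl), href);
    return uri.AbsoluteUri;
}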
Now let’s see how pages are saved.
protected bool SavePage(string url)
{
using (var client = new WebClient())
{
try
{
var html = client.DownloadString(url);
var contentType = client.ResponseHeaders["content-type"];
// only crawl html files
// the header may include a charset suffix, e.g. "text/html; charset=utf-8"
if (!string.IsNullOrWhiteSpace(contentType)
&& contentType.StartsWith("text/html"))
{
var filename = BuildFilename(url);
CreateFile(filename, html);
Files.Add(url, filename);
}
else
{
return false;
}
}
catch (WebException)
{
return false;
}
}
return true;
}
We load the entire content of the URL into a string, but we only save it after checking its content type. We exclude everything that isn't an HTML document: no images, scripts, etc.
The CreateFile
method handles saving the content to disk. It creates the directory structure if needed.
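CreateFile itself isn't listed in the article. A minimal sketch, assuming a hard-coded "output" destination folder (my placeholder, not from the original), might be:

protected void CreateFile(string filename, string html)
{
    // "output" is an assumed local destination folder
    var fullPath = Path.Combine("output",
        filename.Replace('/', Path.DirectorySeparatorChar));
    var directory = Path.GetDirectoryName(fullPath);
    if (!string.IsNullOrEmpty(directory))
    {
        // create the directory structure if needed
        Directory.CreateDirectory(directory);
    }
    File.WriteAllText(fullPath, html);
}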
Let’s now look at how the local filename is determined. This is surprisingly the largest method.
protected string BuildFilename(string url)
{
var uri = new Uri(url);
var filename = Path.GetFileNameWithoutExtension(uri.LocalPath);
// process query string
var query = HttpUtility.ParseQueryString(uri.Query);
foreach (string param in query)
{
var value = query[param];
if (!string.IsNullOrWhiteSpace(value))
{
filename += string.Format("_{0}", value.ToLower());
}
}
if (string.IsNullOrWhiteSpace(filename))
{
// if we don't have a filename, assume index
filename = "index";
}
// clean the filename
filename = REGEX_FILENAME.Replace(filename, string.Empty);
// make sure filename is unique
var file = string.Format("{0}.html", filename);
var i = 2;
while (Files.ContainsValue(file))
{
file = string.Format("{0}_{1}.html", filename, i);
i++;
}
// add path
var path = GetPath(url);
if (!string.IsNullOrEmpty(path))
{
file = string.Format("{0}/{1}", path, file);
}
return file;
}
I think most of it is self-explanatory, but here is the rundown:
- Starts with the bare filename without extension
- Adds query string parameters
- If the filename is empty at this point, renames it to “index”
- Removes unwanted characters (only keeps letters, numbers, underscores and hyphens)
- Ensures uniqueness (adds an index if necessary)
- Finally it adds the directory structure if there is any
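The GetPath method used in the last step isn't shown either. Based on how it's called (turning the URL's directory portion into a relative path), a sketch could be:

protected string GetPath(string url)
{
    var uri = new Uri(url);
    // keep only the directory portion of the local path,
    // e.g. "/test/sub/page.html" becomes "test/sub"
    var path = uri.LocalPath;
    var lastSlash = path.LastIndexOf('/');
    path = lastSlash > 0 ? path.Substring(0, lastSlash) : string.Empty;
    return path.Trim('/');
}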
This explains the basis of the crawler. You can use this as a starting point for your own. Good luck!