How to make a simple web crawler

In a previous article we introduced web scraping with the Html Agility Pack. Now we are going to put that to use and build a simple web crawler.

Sometimes I need to crawl my own websites. For example, when migrating a website I might need to create redirects, and a crawler becomes a useful tool for mapping the old URLs.

I will explain how you can build your own web crawler, but you should only use this article as a basis and adapt it as needed.

The basic structure:

public class Crawler
{
    protected static readonly Regex REGEX_FILENAME = new Regex(@"[^a-zA-Z0-9_-]+");
 
    public Dictionary<string, string> Files { get; protected set; }
    public string Url { get; set; }
    public string BaseUrl
    {
        get { return this.Url.EndsWith("/") ? this.Url : this.Url + "/"; }
    }

    public Crawler(string url)
    {
        this.Url = url;
        this.Files = new Dictionary<string, string>();
    }
}

The properties:

  • Url is the starting point of the crawl
  • BaseUrl is Url with a trailing slash guaranteed; we use it to resolve relative links
  • Files maps each original URL to the local path of its crawled copy
  • REGEX_FILENAME is a regular expression we will use when determining a filename for local files

You use the crawler as follows:

var url = "http://localhost/test";
var crawler = new Crawler(url);
crawler.Crawl();
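Once Crawl returns, the Files dictionary holds the results. As a hypothetical follow-up (not part of the article's code), you could dump it as a starting point for redirect rules, which ties back to the migration use case:

```csharp
// Hypothetical usage: list each crawled URL and its local copy.
// Handy as raw material for a redirect map when migrating a site.
foreach (var pair in crawler.Files)
{
    Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
}
```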

The main and only public method:

public void Crawl()
{
    Files.Clear();
    CrawlUrl(new HtmlWeb(), this.Url);
}

This method calls an internal recursive method. Let’s look at that:

protected void CrawlUrl(HtmlWeb web, string url)
{
    // only crawl each url once
    if (!IsUrlCrawled(url))
    {
        if (SavePage(url))
        {
            CrawlLinks(web, url);
        }
    }
}

Not much to it yet. First we make sure the URL hasn’t been crawled before, then we try to save it and if we succeed we proceed to crawl all links from that page.

You can find the complete code for the IsUrlCrawled method in the download. In short, it checks whether the URL has already been added to the Files collection. You will run into problems differentiating home paths, which usually have several aliases. I simply assume /, index.php, index.html and index.aspx are the possible, interchangeable options. Obviously this doesn't fit all use cases.
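The download contains the full IsUrlCrawled code; here is a minimal sketch under the assumptions above (the alias list and the Normalize helper are mine, not necessarily what the download ships):

```csharp
// Sketch only: normalize the known home-page aliases to the base URL,
// then check the Files dictionary. Normalize is a hypothetical helper.
protected bool IsUrlCrawled(string url)
{
    return Files.ContainsKey(Normalize(url));
}

protected string Normalize(string url)
{
    // assumption from the article: these aliases all point to the home page
    var aliases = new[] { "index.php", "index.html", "index.aspx" };

    foreach (var alias in aliases)
    {
        if (url.Equals(this.BaseUrl + alias, StringComparison.OrdinalIgnoreCase))
        {
            return this.BaseUrl;
        }
    }

    return url;
}
```

For this to work end to end, SavePage would also have to store normalized keys, i.e. Files.Add(Normalize(url), filename).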

Let’s look at how we parse all links from a page and crawl those links.

protected void CrawlLinks(HtmlWeb web, string url)
{
    // get all links
    var doc = web.Load(url);
    var links = doc.DocumentNode.SelectNodes("//a[@href]");
 
    if (links != null)
    {
        foreach (var link in links)
        {
            var href = link.GetAttributeValue("href", string.Empty);
   
            // avoid anchors
            if (!string.IsNullOrWhiteSpace(href)
                && !href.StartsWith("#"))
            {
                // build absolute uri
                var absoluteUri = GetAbsoluteUrl(href);
    
                // only crawl links under the start URL; StartsWith avoids
                // matching the start URL in the middle of an external link
                if (absoluteUri.StartsWith(this.Url))
                {
                    CrawlUrl(web, absoluteUri);
                }
            }
        }
    }
}

We are using the Html Agility Pack to parse the document. We get all the links, but we make sure to avoid anchors and external URLs. Then we crawl each URL and the process starts over.
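The GetAbsoluteUrl helper isn't shown above; a minimal sketch, assuming we let System.Uri do the resolution (the trailing slash that BaseUrl guarantees matters here, since "http://localhost/test" and "http://localhost/test/" resolve relative links differently):

```csharp
// Sketch of GetAbsoluteUrl: relative hrefs are resolved against the
// base URL, absolute hrefs pass through unchanged.
protected string GetAbsoluteUrl(string href)
{
    var baseUri = new Uri(this.BaseUrl);
    var absolute = new Uri(baseUri, href);

    return absolute.AbsoluteUri;
}
```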

Now let’s see how pages are saved.

protected bool SavePage(string url)
{
    using (var client = new WebClient())
    {
        try
        {
            var html = client.DownloadString(url);
            var contentType = client.ResponseHeaders["content-type"];
   
            // only save HTML pages; the header may include a charset,
            // e.g. "text/html; charset=utf-8", so compare the prefix
            if (!string.IsNullOrWhiteSpace(contentType)
                && contentType.StartsWith("text/html"))
            {
                var filename = BuildFilename(url);
                CreateFile(filename, html);
                Files.Add(url, filename);
            }
            else
            {
                return false;
            }
        }
        catch (WebException)
        {
            return false;
        }
    }
 
    return true;
}

We load the entire content of the URL into a string, but we only save it after checking its content type. Everything that isn't an HTML document is excluded, so no images, stylesheets or scripts.

The CreateFile method handles saving the content to disk. It creates the directory structure if needed.
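CreateFile's code isn't reproduced in the article; here is a minimal sketch, where the "output" root folder is my assumption, not something the article specifies:

```csharp
// Sketch of CreateFile: ensure the directory exists, then write the HTML.
// The output root folder is an assumption for this example.
protected void CreateFile(string filename, string html)
{
    var outputRoot = "output";
    var fullPath = Path.Combine(outputRoot, filename.Replace('/', Path.DirectorySeparatorChar));
    var directory = Path.GetDirectoryName(fullPath);

    if (!string.IsNullOrEmpty(directory))
    {
        Directory.CreateDirectory(directory);
    }

    File.WriteAllText(fullPath, html);
}
```

Directory.CreateDirectory is a no-op when the directory already exists, so no existence check is needed.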

Let’s now look at how the local filename is determined. This is surprisingly the largest method.

protected string BuildFilename(string url)
{
    var uri = new Uri(url);
    var filename = Path.GetFileNameWithoutExtension(uri.LocalPath);
 
    // process query string
    var query = HttpUtility.ParseQueryString(uri.Query);
 
    foreach (string param in query)
    {
        var value = query[param];
  
        if (!string.IsNullOrWhiteSpace(value))
        {
            filename += string.Format("_{0}", value.ToLower());
        }
    }
 
    if (string.IsNullOrWhiteSpace(filename))
    {
        // if we don't have a filename, assume index
        filename = "index";
    }
 
    // clean the filename
    filename = REGEX_FILENAME.Replace(filename, string.Empty);
 
    // make sure filename is unique
    var file = string.Format("{0}.html", filename);
    var i = 2;
 
    while (Files.ContainsValue(file))
    {
        file = string.Format("{0}_{1}.html", filename, i);
        i++;
    }
 
    // add path
    var path = GetPath(url);
 
    if (!string.IsNullOrEmpty(path))
    {
        file = string.Format("{0}/{1}", path, file);
    }
 
    return file;
}

I think most of it is self-explanatory, but here is a quick rundown:

  • Starts with the bare filename without extension
  • Adds query string parameters
  • Defaults the filename to “index” if it is empty at this point
  • Removes unwanted characters (only keeps letters, numbers, underscores and hyphens)
  • Ensures uniqueness (adds an index if necessary)
  • Finally it adds the directory structure if there is any
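The GetPath helper used in the last step isn't shown either. A minimal sketch, assuming it simply mirrors the directory part of the URL's local path (the article's version may additionally strip the start URL's own path):

```csharp
// Sketch of GetPath: take the directory portion of the URL's local path,
// normalized to forward slashes with no leading or trailing slash.
protected string GetPath(string url)
{
    var uri = new Uri(url);
    var directory = Path.GetDirectoryName(uri.LocalPath);

    return string.IsNullOrEmpty(directory)
        ? string.Empty
        : directory.Replace('\\', '/').Trim('/');
}
```

For example, "http://localhost/test/docs/page.php" would yield "test/docs", so the local copy ends up under a matching folder structure.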

This explains the basis of the crawler. You can use this as a starting point for your own. Good luck!
