Ruby based image crawler

I don’t write much code these days and felt it was time to sharpen the saw.

I have a need to download a ton of images from a site (I got permission first…) but it is going to take forever to do by hand. Even though there are tons of tools out there for image crawling I figured this would be a great exercise to brush up on some skills and delve further into a language I am still fairly new to, Ruby. This allows me to use basic language constructs, network IO, and file IO, all while getting all the images I need in a fast manner.

As I have mentioned a few times on this blog, I am still new to Ruby so any advice for how to make this code cleaner is appreciated.

You can download the file here: http://mcdonaldland.info/files/crawler/crawl.rb

Here is the source:

require 'net/http'
require 'uri'

class Crawler

  # This is the domain or domain and path we are going
  # to crawl. This will be the starting point for our
  # efforts but will also be used in conjunction with
  # the allow_leave_site flag to determine whether the
  # page can be crawled or not.
  attr_accessor :domain

  # This flag determines whether the crawler will be
  # allowed to leave the root domain or not.
  attr_accessor :allow_leave_site

  # This is the path where all images will be saved.
  attr_accessor :save_path

  # This is a list of extensions to skip over while
  # crawling through links on the site.
  attr_accessor : omit_extensions # Remove space between : and o - WordPress tries to make this a smiley if I leave them together.

  # This keeps track of all the pages we have visited
  # so we don't visit them more than once.
  attr_accessor :visited_pages

  # This keeps track of all the images we have downloaded
  # so we don't download them more than once.
  attr_accessor :downloaded_images

  def begin_crawl
    # Check to see if the save path ends with a slash. If so, remove it.
    remove_save_path_end_slash

    if domain.nil? || domain.length < 4 || domain[0, 4] != "http"
      @domain = "http://#{domain}"
    end

    crawl(domain)
  end

  private

  def remove_save_path_end_slash        
    sp = save_path[save_path.length - 1, 1]

    if sp == "/" || sp == "\\"
      save_path.chop!
    end
  end

  def initialize
    @domain = ""
    @allow_leave_site = false
    @save_path = ""
    @omit_extensions = []
    @visited_pages = []
    @downloaded_images = []
  end

  def crawl(url = nil)

    # If the URL is empty or nil we can move on.
    return if url.nil? || url.empty?

    # If the allow_leave_site flag is set to false we
    # want to make sure that the URL we are about to
    # crawl is within the domain.
    return if !allow_leave_site && (url.length < domain.length || url[0, domain.length] != domain)

    # Check to see if we have crawled this page already.
    # If so, move on.
    return if visited_pages.include? url

    puts "Fetching page: #{url}"

    # Go get the page and note it so we don't visit it again.
    res = fetch_page(url)
    visited_pages << url

    # If the response is nil then we cannot continue. Move on.
    return if res.nil?

    # Some links will be relative so we need to grab the
    # document root.
    root = parse_page_root(url)

    # Parse the image and anchor tags out of the result.
    images, links = parse_page(res.body)

    # Process the images and links accordingly.
    handle_images(root, images)
    handle_links(root, links)
  end

  def parse_page_root(url)
    end_slash = url.rindex("/")
    if end_slash > 8
      url[0, url.rindex("/")] + "/"
    else
      url + "/"
    end
  end

  def discern_absolute_url(root, url)
    # If we don't have an absolute path already, let's make one.            
    if !root.nil? && url[0,4] != "http"

      # If the URL begins with a slash then it is domain
      # relative so we want to append it to the domain.
      # Otherwise it is document relative so we want to
      # append it to the current directory.
      if url[0, 1] == "/"
        url = domain + url
      else
        url = root + url
      end
    end    

    while !url.index("//").nil?
      url.gsub!("//", "/")
    end

    # Our little exercise will have replaced the two slashes
    # after http: so we want to add them back.
    url.gsub!("http:/", "http://")

    url
  end

  def handle_images(root, images)
    if !images.nil?
      images.each {|i|

        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        i.gsub!("'", "\"")        

        # Grab everything between src=" and ".
        src = i.scan(/src=[\"\']([^\"\']+)/i)
        if !src.nil?
          src = src[0]
          if !src.nil?
            src = src[0]
          end
        end

        # If the src is empty move on.
        next if src.nil? || src.empty?

        # We want all URLs we follow to be absolute.
        src = discern_absolute_url(root, src)

        save_image(src)
      }
   end
  end

  def save_image(url)
    # Check to see if we have saved this image already.
    # If so, move on.
    return if downloaded_images.include? url        

    # Save this file name down so that we don't download
    # it again in the future.
    downloaded_images << url

    # Parse the image name out of the url. We'll use that
    # name to save it down.
    file_name = parse_file_name(url)

    while File.exist?(save_path + "/" + file_name)
      file_name = "_" + file_name
    end

    # Get the response and data from the web for this image.
    response = fetch_page(url)

    # If the response is not nil, save the contents down to
    # an image.
    if !response.nil?
      puts "Saving image: #{url}"    

      File.open(save_path + "/" + file_name, "wb+") do |f|
        f << response.body
      end
    end
  end

  def parse_file_name(url)
    # Find the position of the last slash. Everything after
    # it is our file name.
    spos = url.rindex("/")    
    url[spos + 1, url.length - 1]
  end

  def handle_links(root, links)
    if !links.nil?
      links.each {|l|    

        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        l.gsub!("'", "\"")

        # Grab everything between href=" and ".
        href = l.scan(/(\href+)="([^"\\]*(\\.[^"\\]*)*)"/i)
        if !href.nil?
          href = href[0]
          if !href.nil?
            href = href[1]
          end
        end

        # We don't want to follow mailto or empty links
        next if href.nil? || href.empty? || (href.length > 6 && href[0,6] == "mailto")

        # We want all URLs we follow to be absolute.
        href = discern_absolute_url(root, href)

        # Down the rabbit hole we go...
        crawl(href)
      }
    end
  end

  def parse_page(html)    
    images = html.scan(/<img [^>]*>/i)
    links = html.scan(/<a [^>]*>/i)

    return [ images, links ]
  end

  def fetch_page(url, limit = 10)
    # Make sure we are supposed to fetch this type of resource.
    return if should_omit_extension(url)

    # You should choose better exception.
    raise ArgumentError, 'HTTP redirect too deep' if limit == 0

    begin
      response = Net::HTTP.get_response(URI.parse(url))
    rescue
      # The URL was not valid - just log it can keep moving
      puts "INVALID URL: #{url}"
    end

    case response
      when Net::HTTPSuccess     then response
      when Net::HTTPRedirection then fetch_page(response['location'], limit - 1)
      else
        # We don't want to throw errors if we get a response
        # we are not expecting so we will just keep going.
        nil
    end
  end

  def should_omit_extension(url)
    # Get the index of the last slash.
    spos = url.rindex("/")

    # Get the index of the last dot.
    dpos = url.rindex(".")

    # If there is no dot in the string this will be nil, so we
    # need to set this to 0 so that the next line will realize
    # that there is no extension and can continue.
    if dpos.nil?
      dpos = 0
    end

    # If the last dot is before the last slash, we don't have
    # an extension and can return.
    return false if spos > dpos

    # Grab the extension.
    ext = url[dpos + 1, url.length - 1]

    # The return value is whether or not the extension we
    # have for this URL is in the omit list or not.
    omit_extensions.include? ext

  end

end

# TODO: Update each comparison to be a hash comparison (possibly in a hash?) in order
# to speed up comparisons. Research to see if this will even make a difference in Ruby.

crawler = Crawler.new
crawler.save_path = "C:\SavePath"
crawler.omit_extensions = [ "doc", "pdf", "xls", "rtf", "docx", "xlsx", "ppt",
                          "pptx", "avi", "wmv", "wma", "mp3", "mp4", "pps", "swf" ]
crawler.domain = "http://www.yoursite.com/"
crawler.allow_leave_site = false
crawler.begin_crawl

# Bugs fixed:
# 1. Added error handling around call to HTTP.get_response in order to handle timeouts and other errors
#
# 2. Added check upon initialization to remove the trailing slash on the save path, if it exists.

About the author

Jason McDonald

View all posts

1 Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.