Ruby-based image crawler

I don’t write much code these days and felt it was time to sharpen the saw.

I need to download a ton of images from a site (I got permission first…), but doing it by hand would take forever. Even though there are plenty of image-crawling tools out there, I figured this would be a great exercise to brush up on some skills and dig further into a language I am still fairly new to: Ruby. It lets me exercise basic language constructs, network IO, and file IO, all while getting the images I need quickly.

As I have mentioned a few times on this blog, I am still new to Ruby, so any advice on how to make this code cleaner is appreciated.

You can download the file here: http://www.mcdonaldland.info/files/crawler/crawl.rb

Here is the source:

require 'net/http'
require 'uri'

class Crawler

  # This is the domain or domain and path we are going
  # to crawl. This will be the starting point for our
  # efforts but will also be used in conjunction with
  # the allow_leave_site flag to determine whether the
  # page can be crawled or not.
  attr_accessor :domain

  # This flag determines whether the crawler will be
  # allowed to leave the root domain or not.
  attr_accessor :allow_leave_site

  # This is the path where all images will be saved.
  attr_accessor :save_path

  # This is a list of extensions to skip over while
  # crawling through links on the site.
  attr_accessor :omit_extensions

  # This keeps track of all the pages we have visited
  # so we don't visit them more than once.
  attr_accessor :visited_pages

  # This keeps track of all the images we have downloaded
  # so we don't download them more than once.
  attr_accessor :downloaded_images

  def begin_crawl
    # Check to see if the save path ends with a slash. If so, remove it.
    remove_save_path_end_slash

    if domain.nil? || domain.length < 4 || domain[0, 4] != "http"
      @domain = "http://#{domain}"
    end

    crawl(domain)
  end

  private

  def remove_save_path_end_slash        
    sp = save_path[save_path.length - 1, 1]

    if sp == "/" || sp == "\\"
      save_path.chop!
    end
  end

  def initialize
    @domain = ""
    @allow_leave_site = false
    @save_path = ""
    @omit_extensions = []
    @visited_pages = []
    @downloaded_images = []
  end

  def crawl(url = nil)

    # If the URL is empty or nil we can move on.
    return if url.nil? || url.empty?

    # If the allow_leave_site flag is set to false we
    # want to make sure that the URL we are about to
    # crawl is within the domain.
    return if !allow_leave_site && (url.length < domain.length || url[0, domain.length] != domain)

    # Check to see if we have crawled this page already.
    # If so, move on.
    return if visited_pages.include? url

    puts "Fetching page: #{url}"

    # Go get the page and note it so we don't visit it again.
    res = fetch_page(url)
    visited_pages << url

    # If the response is nil then we cannot continue. Move on.
    return if res.nil?

    # Some links will be relative so we need to grab the
    # document root.
    root = parse_page_root(url)

    # Parse the image and anchor tags out of the result.
    images, links = parse_page(res.body)

    # Process the images and links accordingly.
    handle_images(root, images)
    handle_links(root, links)
  end

  def parse_page_root(url)
    end_slash = url.rindex("/")
    if end_slash > 8
      url[0, end_slash] + "/"
    else
      url + "/"
    end
  end

  def discern_absolute_url(root, url)
    # If we don't have an absolute path already, let's make one.            
    if !root.nil? && url[0,4] != "http"

      # If the URL begins with a slash then it is domain
      # relative so we want to append it to the domain.
      # Otherwise it is document relative so we want to
      # append it to the current directory.
      if url[0, 1] == "/"
        url = domain + url
      else
        url = root + url
      end
    end    

    while !url.index("//").nil?
      url.gsub!("//", "/")
    end

    # Our little exercise will have replaced the two slashes
    # after http: so we want to add them back.
    url.gsub!("http:/", "http://")

    url
  end

  def handle_images(root, images)
    if !images.nil?
      images.each {|i|

        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        i.gsub!("'", "\"")        

        # Grab everything between src=" and ".
        src = i.scan(/src=[\"\']([^\"\']+)/i)
        if !src.nil?
          src = src[0]
          if !src.nil?
            src = src[0]
          end
        end

        # If the src is empty move on.
        next if src.nil? || src.empty?

        # We want all URLs we follow to be absolute.
        src = discern_absolute_url(root, src)

        save_image(src)
      }
    end
  end

  def save_image(url)
    # Check to see if we have saved this image already.
    # If so, move on.
    return if downloaded_images.include? url        

    # Save this file name down so that we don't download
    # it again in the future.
    downloaded_images << url

    # Parse the image name out of the url. We'll use that
    # name to save it down.
    file_name = parse_file_name(url)

    while File.exist?(save_path + "/" + file_name)
      file_name = "_" + file_name
    end

    # Get the response and data from the web for this image.
    response = fetch_page(url)

    # If the response is not nil, save the contents down to
    # an image.
    if !response.nil?
      puts "Saving image: #{url}"    

      File.open(save_path + "/" + file_name, "wb+") do |f|
        f << response.body
      end
    end
  end

  def parse_file_name(url)
    # Find the position of the last slash. Everything after
    # it is our file name.
    spos = url.rindex("/")    
    url[spos + 1, url.length - 1]
  end

  def handle_links(root, links)
    if !links.nil?
      links.each {|l|    

        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        l.gsub!("'", "\"")

        # Grab everything between href=" and ".
        href = l.scan(/href=[\"\']([^\"\']+)/i)
        if !href.nil?
          href = href[0]
          if !href.nil?
            href = href[0]
          end
        end

        # We don't want to follow mailto or empty links
        next if href.nil? || href.empty? || (href.length > 6 && href[0,6] == "mailto")

        # We want all URLs we follow to be absolute.
        href = discern_absolute_url(root, href)

        # Down the rabbit hole we go...
        crawl(href)
      }
    end
  end

  def parse_page(html)    
    images = html.scan(/<img [^>]*>/i)
    links = html.scan(/<a [^>]*>/i)

    return [ images, links ]
  end
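
  # NOTE: the regexes above are deliberately simple. A rough alternative sketch
  # (not used in this crawler) would be an HTML parser such as the Nokogiri gem,
  # which copes with odd quoting and malformed markup better than a regex, and
  # hands back the src/href values directly instead of raw tags:
  #
  #   doc = Nokogiri::HTML(html)
  #   images = doc.css("img").map { |img| img["src"] }.compact
  #   links = doc.css("a").map { |a| a["href"] }.compact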

  def fetch_page(url, limit = 10)
    # Make sure we are supposed to fetch this type of resource.
    return if should_omit_extension(url)

    # A more specific exception class would be better here.
    raise ArgumentError, 'HTTP redirect too deep' if limit == 0

    begin
      response = Net::HTTP.get_response(URI.parse(url))
    rescue
      # The URL was not valid or the request failed - just log it and keep moving.
      puts "INVALID URL: #{url}"
    end

    case response
      when Net::HTTPSuccess     then response
      when Net::HTTPRedirection then fetch_page(response['location'], limit - 1)
      else
        # We don't want to throw errors if we get a response
        # we are not expecting so we will just keep going.
        nil
    end
  end

  def should_omit_extension(url)
    # Get the index of the last slash.
    spos = url.rindex("/")

    # Get the index of the last dot.
    dpos = url.rindex(".")

    # If there is no dot in the string this will be nil, so we
    # need to set this to 0 so that the next line will realize
    # that there is no extension and can continue.
    if dpos.nil?
      dpos = 0
    end

    # If the last dot is before the last slash, we don't have
    # an extension and can return.
    return false if spos > dpos

    # Grab the extension.
    ext = url[dpos + 1, url.length - 1]

    # The return value is whether or not the extension we
    # have for this URL is in the omit list or not.
    omit_extensions.include? ext

  end

end

# TODO: Update each comparison to be a hash comparison (possibly in a hash?) in order
# to speed up comparisons. Research to see if this will even make a difference in Ruby.
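#
# A rough sketch of that idea (using Ruby's standard Set class): a Set gives
# constant-time include? checks instead of scanning an Array, and << works the
# same way, so crawl and save_image would not need to change:
#
#   require 'set'
#
#   @visited_pages = Set.new
#   @downloaded_images = Set.new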

crawler = Crawler.new
crawler.save_path = "C:\\SavePath" # the backslash must be escaped inside double quotes
crawler.omit_extensions = [ "doc", "pdf", "xls", "rtf", "docx", "xlsx", "ppt",
                          "pptx", "avi", "wmv", "wma", "mp3", "mp4", "pps", "swf" ]
crawler.domain = "http://www.yoursite.com/"
crawler.allow_leave_site = false
crawler.begin_crawl

# Bugs fixed:
# 1. Added error handling around the call to Net::HTTP.get_response in order to handle timeouts and other errors.
#
# 2. Added a check at the start of the crawl to remove the trailing slash on the save path, if it exists.
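
On bug fix #1: the rescue keeps one bad or slow URL from killing the whole crawl, but you may also want to give up on unresponsive hosts sooner than Net::HTTP's defaults. Here is a rough sketch of one way to do that inside fetch_page's begin block, using Net::HTTP's open_timeout and read_timeout accessors (the 5 and 10 second values are arbitrary):

uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.open_timeout = 5   # seconds to wait for the connection to open
http.read_timeout = 10  # seconds to wait for data once connected
response = http.request_get(uri.request_uri)

Note that on older Rubies Timeout::Error does not inherit from StandardError, so a bare rescue may not catch it; rescuing Timeout::Error explicitly alongside StandardError is the safer bet.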

Mistaken identity

I’ve gotten a few questions lately about a wreck in my area involving two motorcycles and several deaths. It turns out one of the people involved was also named Jason McDonald.

Just to clear things up, this wasn’t me. I wasn’t in Mt. Pleasant that day and have not been in a wreck.


Pop-Warner Players Signed to NFL – Disaster Ensues

For the first time ever, pee-wee football players, an entire team to be exact, have been signed to the NFL. In a questionable move late Saturday evening, Jacksonville Jaguars head coach Jack Del Rio opted to cut his entire roster and instead sign “The Marauders,” a Pop-Warner team from the Jacksonville metro area. Due to the late timing of the decision, the team did not have time to print new jerseys with the correct names on them. As a result, the tiny tykes donned the jerseys of their professional counterparts, filling them out surprisingly well.

[Image: “The Marauders” after their Saturday Pop-Warner game.]

“I just felt like it was time for a change. I have been thinking about doing this for a number of weeks now, and late last night my gut told me the time was right,” says Del Rio.

When asked about the wisdom of picking little-league players over free agents, practice-squad players, college players, or even high school players, Del Rio stared off into the distance, slowly shook his head, and said, “I don’t know. It just felt right.”

But the decision wasn’t popular across the board. Star running back Maurice Jones-Drew told the press in an interview after Sunday’s game, “I feel it was a horrible decision. If we were playing poorly I could understand it. But we are coming off two big wins and have good momentum. Now you have kids out there, wearing our names, and performing poorly. Not only does that reflect poorly on the team, but since they are using our names it reflects poorly on us.”

“Jones-Drew,” whose jersey was filled by 5′ 6″ Jamal Jackson, age 13, posted a paltry 34 rushing yards on 12 carries. The real Jones-Drew declined to give in-depth commentary on the youngster’s performance, saying only that he was proud the “little guy” was able to get that much against a professional team.

The decision ended up costing the Jaguars the game and has led many to wonder, more than ever, whether this will be the last year we see Del Rio in the greater Jacksonville area.

Longtime fan Jason McDonald sent an email to the local press saying, “I love this team and I think they were headed in the right direction. I’ve always thought that Del Rio did a good job and that the recent rumors of him being released were unfounded,” going on to say, “but I’m not so sure about this latest decision. While part of the crushing loss can be attributed to the team and their lack of performance, you can’t entirely discount the coaching staff and their responsibility.”

McDonald goes on to say, “I will continue to be a fan, as will many others. It just hurts to proudly wear my team’s colors only to see them have their ass handed to them by a team with no wins. Had they at least scored a point, it wouldn’t have been as bad. I will stick with them but really hope they can turn things around for the rest of the season.”