JavaScript Event to an Object Instance

I searched the web for quite a while and couldn’t find a good answer to this problem, so I figured I’d post it to save the next person a little time.

I have a situation where I want an event to be handled by a specific instance of a JavaScript object. I could get the event to call the object’s handler function, but I was having trouble getting it to run against one particular instance of the object.

The answer was surprisingly simple: just wrap the call to your event handler in an anonymous function. The anonymous function is a closure over the variables currently in scope, which means it still has access to my object instance when the event fires.

The code is below. The commented-out listener registration lines were the original attempt, which resulted in “undefined” being printed to the screen instead of my instance variable’s value. The uncommented version correctly calls into my instance, which prints “test” to the debug area of the screen each time the link is clicked.

// The object to handle the event...
function TestListenerObj(txt) {
    this.eventText = txt;
    this.handleEvent = function() {
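        // "this" is whatever object the function is invoked on. Passing the
        // function straight to addEventListener/attachEvent makes "this" the
        // element (or window in IE), so this.eventText comes back undefined.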
        var debug = document.getElementById("debug");
        debug.innerHTML = debug.innerHTML + this.eventText + "<br/>";
        return false;
    }
}

// Sets up the event handler
function init() {
    var link = document.getElementById("test_link");
    var obj = new TestListenerObj("test");

    if (link.addEventListener) {
        // link.addEventListener("click", obj.handleEvent, false); // Calls "class" level handleEvent()
        link.addEventListener("click", function() { obj.handleEvent(); }, false); // Calls instance level handleEvent()
    } else {
        // link.attachEvent("onclick", obj.handleEvent ); // Calls "class" level handleEvent()
        link.attachEvent("onclick", function() { obj.handleEvent(); }); // Calls instance level handleEvent()
    }
}

For a working example go here. For the complete source, visit this link and view source. Just click the “Test” link on the example to see the event handled by the object, which in this case just prints to the debug area.


Ruby-based image crawler

I don’t write much code these days and felt it was time to sharpen the saw.

I need to download a ton of images from a site (I got permission first…), but it would take forever to do by hand. Even though there are plenty of image-crawling tools out there, I figured this would be a great exercise to brush up on some skills and dig further into a language I am still fairly new to: Ruby. It lets me exercise basic language constructs, network I/O, and file I/O, all while grabbing the images I need quickly.

As I have mentioned a few times on this blog, I am still new to Ruby, so any advice on how to make this code cleaner is appreciated.

You can download the file here: http://www.mcdonaldland.info/files/crawler/crawl.rb

Here is the source:

require 'net/http'
require 'uri'

class Crawler

  # This is the domain or domain and path we are going
  # to crawl. This will be the starting point for our
  # efforts but will also be used in conjunction with
  # the allow_leave_site flag to determine whether the
  # page can be crawled or not.
  attr_accessor :domain

  # This flag determines whether the crawler will be
  # allowed to leave the root domain or not.
  attr_accessor :allow_leave_site

  # This is the path where all images will be saved.
  attr_accessor :save_path

  # This is a list of extensions to skip over while
  # crawling through links on the site.
  attr_accessor :omit_extensions

  # This keeps track of all the pages we have visited
  # so we don't visit them more than once.
  attr_accessor :visited_pages

  # This keeps track of all the images we have downloaded
  # so we don't download them more than once.
  attr_accessor :downloaded_images

  def begin_crawl
    # Check to see if the save path ends with a slash. If so, remove it.
    remove_save_path_end_slash

    if domain.nil? || domain.length < 4 || domain[0, 4] != "http"
      @domain = "http://#{domain}"
    end

    crawl(domain)
  end

  private

  def remove_save_path_end_slash        
    sp = save_path[save_path.length - 1, 1]

    if sp == "/" || sp == "\\"
      save_path.chop!
    end
  end

  def initialize
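    # initialize is private by default in Ruby, so it does not matter that it
    # sits below the `private` keyword; Crawler.new still calls it as usual.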
    @domain = ""
    @allow_leave_site = false
    @save_path = ""
    @omit_extensions = []
    @visited_pages = []
    @downloaded_images = []
  end

  def crawl(url = nil)

    # If the URL is empty or nil we can move on.
    return if url.nil? || url.empty?

    # If the allow_leave_site flag is set to false we
    # want to make sure that the URL we are about to
    # crawl is within the domain.
    return if !allow_leave_site && (url.length < domain.length || url[0, domain.length] != domain)

    # Check to see if we have crawled this page already.
    # If so, move on.
    return if visited_pages.include? url

    puts "Fetching page: #{url}"

    # Go get the page and note it so we don't visit it again.
    res = fetch_page(url)
    visited_pages << url

    # If the response is nil then we cannot continue. Move on.
    return if res.nil?

    # Some links will be relative so we need to grab the
    # document root.
    root = parse_page_root(url)

    # Parse the image and anchor tags out of the result.
    images, links = parse_page(res.body)

    # Process the images and links accordingly.
    handle_images(root, images)
    handle_links(root, links)
  end

  def parse_page_root(url)
    end_slash = url.rindex("/")
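    # A slash past index 8 is beyond the scheme's "//" for both http and https,
    # which means the URL has a path we can strip back to a directory.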
    if end_slash > 8
      url[0, url.rindex("/")] + "/"
    else
      url + "/"
    end
  end

  def discern_absolute_url(root, url)
    # If we don't have an absolute path already, let's make one.            
    if !root.nil? && url[0,4] != "http"

      # If the URL begins with a slash then it is domain
      # relative so we want to append it to the domain.
      # Otherwise it is document relative so we want to
      # append it to the current directory.
      if url[0, 1] == "/"
        url = domain + url
      else
        url = root + url
      end
    end    

    while !url.index("//").nil?
      url.gsub!("//", "/")
    end

    # Our little exercise will have replaced the two slashes
    # after the scheme (http: or https:) so we want to add them back.
    url.sub!(/\A(https?:)\//, '\1//')

    url
  end

  def handle_images(root, images)
    if !images.nil?
      images.each {|i|

        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        i.gsub!("'", "\"")        

        # Grab everything between src=" and ".
        src = i.scan(/src=[\"\']([^\"\']+)/i)
        if !src.nil?
          src = src[0]
          if !src.nil?
            src = src[0]
          end
        end

        # If the src is empty move on.
        next if src.nil? || src.empty?

        # We want all URLs we follow to be absolute.
        src = discern_absolute_url(root, src)

        save_image(src)
      }
    end
  end

  def save_image(url)
    # Check to see if we have saved this image already.
    # If so, move on.
    return if downloaded_images.include? url        

    # Save this file name down so that we don't download
    # it again in the future.
    downloaded_images << url

    # Parse the image name out of the url. We'll use that
    # name to save it down.
    file_name = parse_file_name(url)

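    # If a file with this name was already saved, prepend underscores until we
    # hit an unused name so we never overwrite an earlier download.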
    while File.exist?(save_path + "/" + file_name)
      file_name = "_" + file_name
    end

    # Get the response and data from the web for this image.
    response = fetch_page(url)

    # If the response is not nil, save the contents down to
    # an image.
    if !response.nil?
      puts "Saving image: #{url}"    

      File.open(save_path + "/" + file_name, "wb+") do |f|
        f << response.body
      end
    end
  end

  def parse_file_name(url)
    # Find the position of the last slash. Everything after
    # it is our file name.
    spos = url.rindex("/")    
    url[spos + 1, url.length - 1]
  end

  def handle_links(root, links)
    if !links.nil?
      links.each {|l|    

        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        l.gsub!("'", "\"")

        # Grab everything between href=" and ".
        href = l.scan(/href=["']([^"']+)/i)
        if !href.nil?
          href = href[0]
          if !href.nil?
            href = href[0]
          end
        end

        # We don't want to follow mailto or empty links
        next if href.nil? || href.empty? || (href.length > 6 && href[0,6] == "mailto")

        # We want all URLs we follow to be absolute.
        href = discern_absolute_url(root, href)

        # Down the rabbit hole we go...
        crawl(href)
      }
    end
  end

  def parse_page(html)    
    images = html.scan(/<img [^>]*>/i)
    links = html.scan(/<a [^>]*>/i)

    return [ images, links ]
  end

  def fetch_page(url, limit = 10)
    # Make sure we are supposed to fetch this type of resource.
    return if should_omit_extension(url)

    # You should choose a better exception class here.
    raise ArgumentError, 'HTTP redirect too deep' if limit == 0

    begin
      response = Net::HTTP.get_response(URI.parse(url))
    rescue
      # The URL was not valid - just log it and keep moving.
      puts "INVALID URL: #{url}"
    end

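    # If the rescue above fired, response is nil; nil matches neither of the
    # cases below and drops through to the else, so we return nil and move on.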
    case response
      when Net::HTTPSuccess     then response
      when Net::HTTPRedirection then fetch_page(response['location'], limit - 1)
      else
        # We don't want to throw errors if we get a response
        # we are not expecting so we will just keep going.
        nil
    end
  end

  def should_omit_extension(url)
    # Get the index of the last slash.
    spos = url.rindex("/")

    # Get the index of the last dot.
    dpos = url.rindex(".")

    # If there is no dot in the string this will be nil, so we
    # need to set this to 0 so that the next line will realize
    # that there is no extension and can continue.
    if dpos.nil?
      dpos = 0
    end

    # If the last dot is before the last slash, we don't have
    # an extension and can return.
    return false if spos > dpos

    # Grab the extension.
    ext = url[dpos + 1, url.length - 1]

    # The return value is whether or not the extension we
    # have for this URL is in the omit list or not.
    omit_extensions.include? ext

  end

end

# TODO: Update each comparison to be a hash comparison (possibly in a hash?) in order
# to speed up comparisons. Research to see if this will even make a difference in Ruby.

crawler = Crawler.new
crawler.save_path = "C:/SavePath" # Forward slashes avoid backslash-escape issues in a double-quoted Ruby string.
crawler.omit_extensions = [ "doc", "pdf", "xls", "rtf", "docx", "xlsx", "ppt",
                          "pptx", "avi", "wmv", "wma", "mp3", "mp4", "pps", "swf" ]
crawler.domain = "http://www.yoursite.com/"
crawler.allow_leave_site = false
crawler.begin_crawl

# Bugs fixed:
# 1. Added error handling around call to HTTP.get_response in order to handle timeouts and other errors
#
# 2. Added check upon initialization to remove the trailing slash on the save path, if it exists.
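
About the TODO near the bottom of the source: Array#include? scans the whole list on every call, while a Hash (or a Set, which is backed by a Hash) looks a URL up in roughly constant time. Here is a minimal, self-contained sketch to measure the difference; the URL pattern and counts are made up purely for illustration:

require 'set'
require 'benchmark'

# Hypothetical workload: 20,000 lookups where half the URLs are repeats.
urls = (1..10_000).map { |i| "http://www.yoursite.com/page#{i}.html" } * 2

array_seen = []
set_seen   = Set.new

Benchmark.bm(7) do |bm|
  # Array#include? re-scans everything stored so far on each call.
  bm.report("Array") { urls.each { |u| array_seen << u unless array_seen.include?(u) } }

  # Set#include? is a single hash lookup per URL.
  bm.report("Set")   { urls.each { |u| set_seen << u unless set_seen.include?(u) } }
end

Since Set supports << and include? with the same syntax the crawler already uses, switching visited_pages and downloaded_images over would only require changing the two assignments in initialize (plus the require).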

Mistaken identity

I’ve gotten a few questions lately about a car wreck in my area involving two motorcycles and some deaths. Turns out one of the people involved was named Jason McDonald as well.

Just to clear things up, this wasn’t me. I wasn’t in Mt. Pleasant that day and have not been in a wreck.