I don’t write much code these days and felt it was time to sharpen the saw.
I need to download a ton of images from a site (I got permission first…), but it would take forever to do by hand. Even though there are tons of tools out there for image crawling, I figured this would be a great exercise to brush up on some skills and delve further into a language I am still fairly new to: Ruby. It lets me use basic language constructs, network IO, and file IO, all while getting the images I need quickly.
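At its heart this is just two pieces of the standard library: Net::HTTP for the network IO and File for the file IO. A stripped-down sketch of that round trip might look something like this (the URL and file name are placeholders, not from the real site):

require 'net/http'
require 'uri'

# Fetch one image and write the bytes to disk. The crawler below is
# essentially this, repeated for every image tag it can find.
response = Net::HTTP.get_response(URI.parse("http://www.example.com/logo.png"))

if response.is_a?(Net::HTTPSuccess)
  File.open("logo.png", "wb") do |f|
    f << response.body
  end
end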
As I have mentioned a few times on this blog, I am still new to Ruby, so any advice on how to make this code cleaner is appreciated.
You can download the file here: http://www.mcdonaldland.info/files/crawler/crawl.rb
Here is the source:
require 'net/http'
require 'uri'

class Crawler
  # This is the domain or domain and path we are going
  # to crawl. This will be the starting point for our
  # efforts but will also be used in conjunction with
  # the allow_leave_site flag to determine whether the
  # page can be crawled or not.
  attr_accessor :domain

  # This flag determines whether the crawler will be
  # allowed to leave the root domain or not.
  attr_accessor :allow_leave_site

  # This is the path where all images will be saved.
  attr_accessor :save_path

  # This is a list of extensions to skip over while
  # crawling through links on the site.
  attr_accessor :omit_extensions

  # This keeps track of all the pages we have visited
  # so we don't visit them more than once.
  attr_accessor :visited_pages

  # This keeps track of all the images we have downloaded
  # so we don't download them more than once.
  attr_accessor :downloaded_images

  def begin_crawl
    # Check to see if the save path ends with a slash. If so, remove it.
    remove_save_path_end_slash

    if domain.nil? || domain.length < 4 || domain[0, 4] != "http"
      @domain = "http://#{domain}"
    end

    crawl(domain)
  end

  private

  def remove_save_path_end_slash
    sp = save_path[save_path.length - 1, 1]
    if sp == "/" || sp == "\\"
      save_path.chop!
    end
  end

  def initialize
    @domain = ""
    @allow_leave_site = false
    @save_path = ""
    @omit_extensions = []
    @visited_pages = []
    @downloaded_images = []
  end

  def crawl(url = nil)
    # If the URL is empty or nil we can move on.
    return if url.nil? || url.empty?

    # If the allow_leave_site flag is set to false we
    # want to make sure that the URL we are about to
    # crawl is within the domain.
    return if !allow_leave_site && (url.length < domain.length || url[0, domain.length] != domain)

    # Check to see if we have crawled this page already.
    # If so, move on.
    return if visited_pages.include? url

    puts "Fetching page: #{url}"

    # Go get the page and note it so we don't visit it again.
    res = fetch_page(url)
    visited_pages << url

    # If the response is nil then we cannot continue. Move on.
    return if res.nil?

    # Some links will be relative so we need to grab the
    # document root.
    root = parse_page_root(url)

    # Parse the image and anchor tags out of the result.
    images, links = parse_page(res.body)

    # Process the images and links accordingly.
    handle_images(root, images)
    handle_links(root, links)
  end

  def parse_page_root(url)
    end_slash = url.rindex("/")
    if end_slash > 8
      url[0, url.rindex("/")] + "/"
    else
      url + "/"
    end
  end

  def discern_absolute_url(root, url)
    # If we don't have an absolute path already, let's make one.
    if !root.nil? && url[0, 4] != "http"
      # If the URL begins with a slash then it is domain
      # relative so we want to append it to the domain.
      # Otherwise it is document relative so we want to
      # append it to the current directory.
      if url[0, 1] == "/"
        url = domain + url
      else
        url = root + url
      end
    end

    while !url.index("//").nil?
      url.gsub!("//", "/")
    end

    # Our little exercise will have replaced the two slashes
    # after http: so we want to add them back.
    url.gsub!("http:/", "http://")

    url
  end

  def handle_images(root, images)
    if !images.nil?
      images.each {|i|
        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        i.gsub!("'", "\"")

        # Grab everything between src=" and ".
        src = i.scan(/src=[\"\']([^\"\']+)/i)
        if !src.nil?
          src = src[0]
          if !src.nil?
            src = src[0]
          end
        end

        # If the src is empty move on.
        next if src.nil? || src.empty?

        # We want all URLs we follow to be absolute.
        src = discern_absolute_url(root, src)

        save_image(src)
      }
    end
  end

  def save_image(url)
    # Check to see if we have saved this image already.
    # If so, move on.
    return if downloaded_images.include? url

    # Save this file name down so that we don't download
    # it again in the future.
    downloaded_images << url

    # Parse the image name out of the url. We'll use that
    # name to save it down.
    file_name = parse_file_name(url)

    while File.exist?(save_path + "/" + file_name)
      file_name = "_" + file_name
    end

    # Get the response and data from the web for this image.
    response = fetch_page(url)

    # If the response is not nil, save the contents down to
    # an image.
    if !response.nil?
      puts "Saving image: #{url}"
      File.open(save_path + "/" + file_name, "wb+") do |f|
        f << response.body
      end
    end
  end

  def parse_file_name(url)
    # Find the position of the last slash. Everything after
    # it is our file name.
    spos = url.rindex("/")
    url[spos + 1, url.length - 1]
  end

  def handle_links(root, links)
    if !links.nil?
      links.each {|l|
        # Make sure all single quotes are replaced with double quotes.
        # Since we aren't rendering javascript we don't really care
        # if this breaks something.
        l.gsub!("'", "\"")

        # Grab everything between href=" and ".
        href = l.scan(/(href)="([^"\\]*(\\.[^"\\]*)*)"/i)
        if !href.nil?
          href = href[0]
          if !href.nil?
            href = href[1]
          end
        end

        # We don't want to follow mailto or empty links.
        next if href.nil? || href.empty? || (href.length > 6 && href[0, 6] == "mailto")

        # We want all URLs we follow to be absolute.
        href = discern_absolute_url(root, href)

        # Down the rabbit hole we go...
        crawl(href)
      }
    end
  end

  def parse_page(html)
    images = html.scan(/<img [^>]*>/i)
    links = html.scan(/<a [^>]*>/i)
    return [ images, links ]
  end

  def fetch_page(url, limit = 10)
    # Make sure we are supposed to fetch this type of resource.
    return if should_omit_extension(url)

    # You may want to raise a more descriptive exception here.
    raise ArgumentError, 'HTTP redirect too deep' if limit == 0

    begin
      response = Net::HTTP.get_response(URI.parse(url))
    rescue
      # The URL was not valid - just log it and keep moving.
      puts "INVALID URL: #{url}"
    end

    case response
    when Net::HTTPSuccess then response
    when Net::HTTPRedirection then fetch_page(response['location'], limit - 1)
    else
      # We don't want to throw errors if we get a response
      # we are not expecting so we will just keep going.
      nil
    end
  end

  def should_omit_extension(url)
    # Get the index of the last slash.
    spos = url.rindex("/")

    # Get the index of the last dot.
    dpos = url.rindex(".")

    # If there is no dot in the string this will be nil, so we
    # need to set this to 0 so that the next line will realize
    # that there is no extension and can continue.
    if dpos.nil?
      dpos = 0
    end

    # If the last dot is before the last slash, we don't have
    # an extension and can return.
    return false if spos > dpos

    # Grab the extension.
    ext = url[dpos + 1, url.length - 1]

    # The return value is whether or not the extension we
    # have for this URL is in the omit list or not.
    omit_extensions.include? ext
  end
end

# TODO: Update each comparison to be a hash comparison (possibly in a hash?) in order
# to speed up comparisons. Research to see if this will even make a difference in Ruby.
crawler = Crawler.new
crawler.save_path = "C:\\SavePath"
crawler.omit_extensions = [ "doc", "pdf", "xls", "rtf", "docx", "xlsx", "ppt", "pptx",
                            "avi", "wmv", "wma", "mp3", "mp4", "pps", "swf" ]
crawler.domain = "http://www.yoursite.com/"
crawler.allow_leave_site = false
crawler.begin_crawl

Bugs fixed:

1. Added error handling around the call to Net::HTTP.get_response in order to handle timeouts and other errors.
2. Added a check upon initialization to remove the trailing slash on the save path, if it exists.
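Regarding that first fix: the blanket begin/rescue around Net::HTTP.get_response keeps the crawler moving, but if you want tighter control over how long a slow server can stall things, Net::HTTP also lets you set the connect and read timeouts explicitly. Something along these lines would work (the helper name and the timeout values are just for illustration, not part of the crawler above):

require 'net/http'
require 'uri'

# Fetch a URL with explicit connect/read timeouts instead of relying on
# the library defaults. Returns the response, or nil if anything goes wrong.
# Plain HTTP only, like the crawler above.
def fetch_with_timeouts(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = 5   # seconds to wait for the connection to open
  http.read_timeout = 10  # seconds to wait for each read

  begin
    http.request_get(uri.request_uri)
  rescue Timeout::Error, StandardError => e
    puts "FAILED: #{url} (#{e.class})"
    nil
  end
end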
Updated today with the latest bug fixes.