I don’t write much code these days and felt it was time to sharpen the saw.
I need to download a ton of images from a site (I got permission first…), but it would take forever to do by hand. There are plenty of image-crawling tools out there, but I figured this would be a great exercise to brush up on some skills and dig further into a language I am still fairly new to, Ruby. It touches basic language constructs, network IO, and file IO, all while grabbing the images I need quickly.
As I have mentioned a few times on this blog, I am still new to Ruby, so any advice on how to make this code cleaner is appreciated.
You can download the file here: http://www.mcdonaldland.info/files/crawler/crawl.rb
Here is the source:
require 'net/http'
require 'uri'
class Crawler
# This is the domain or domain and path we are going
# to crawl. This will be the starting point for our
# efforts but will also be used in conjunction with
# the allow_leave_site flag to determine whether the
# page can be crawled or not.
attr_accessor :domain
# This flag determines whether the crawler will be
# allowed to leave the root domain or not.
attr_accessor :allow_leave_site
# This is the path where all images will be saved.
attr_accessor :save_path
# This is a list of extensions to skip over while
# crawling through links on the site.
attr_accessor :omit_extensions
# This keeps track of all the pages we have visited
# so we don't visit them more than once.
attr_accessor :visited_pages
# This keeps track of all the images we have downloaded
# so we don't download them more than once.
attr_accessor :downloaded_images
def begin_crawl
# Check to see if the save path ends with a slash. If so, remove it.
remove_save_path_end_slash
if domain.nil? || domain.length < 4 || domain[0, 4] != "http"
@domain = "http://#{domain}"
end
crawl(domain)
end
private
def remove_save_path_end_slash
sp = save_path[save_path.length - 1, 1]
if sp == "/" || sp == "\\"
save_path.chop!
end
end
def initialize
@domain = ""
@allow_leave_site = false
@save_path = ""
@omit_extensions = []
@visited_pages = []
@downloaded_images = []
end
def crawl(url = nil)
# If the URL is empty or nil we can move on.
return if url.nil? || url.empty?
# If the allow_leave_site flag is set to false we
# want to make sure that the URL we are about to
# crawl is within the domain.
return if !allow_leave_site && (url.length < domain.length || url[0, domain.length] != domain)
# Check to see if we have crawled this page already.
# If so, move on.
return if visited_pages.include? url
puts "Fetching page: #{url}"
# Go get the page and note it so we don't visit it again.
res = fetch_page(url)
visited_pages << url
# If the response is nil then we cannot continue. Move on.
return if res.nil?
# Some links will be relative so we need to grab the
# document root.
root = parse_page_root(url)
# Parse the image and anchor tags out of the result.
images, links = parse_page(res.body)
# Process the images and links accordingly.
handle_images(root, images)
handle_links(root, links)
end
def parse_page_root(url)
# Find the last slash. If it sits beyond the slashes in the
# http:// prefix, the URL has a path, so everything up to and
# including that slash is our document root. Otherwise the URL
# is just the domain, so we append a slash.
end_slash = url.rindex("/")
if end_slash > 8
url[0, url.rindex("/")] + "/"
else
url + "/"
end
end
def discern_absolute_url(root, url)
# If we don't have an absolute path already, let's make one.
if !root.nil? && url[0,4] != "http"
# If the URL begins with a slash then it is domain
# relative so we want to append it to the domain.
# Otherwise it is document relative so we want to
# append it to the current directory.
if url[0, 1] == "/"
url = domain + url
else
url = root + url
end
end
while !url.index("//").nil?
url.gsub!("//", "/")
end
# Our little exercise will have replaced the two slashes
# after http: so we want to add them back.
url.gsub!("http:/", "http://")
url
end
def handle_images(root, images)
if !images.nil?
images.each {|i|
# Make sure all single quotes are replaced with double quotes.
# Since we aren't rendering javascript we don't really care
# if this breaks something.
i.gsub!("'", "\"")
# Grab everything between src=" and ".
src = i.scan(/src=[\"\']([^\"\']+)/i)
if !src.nil?
src = src[0]
if !src.nil?
src = src[0]
end
end
# If the src is empty move on.
next if src.nil? || src.empty?
# We want all URLs we follow to be absolute.
src = discern_absolute_url(root, src)
save_image(src)
}
end
end
def save_image(url)
# Check to see if we have saved this image already.
# If so, move on.
return if downloaded_images.include? url
# Save this file name down so that we don't download
# it again in the future.
downloaded_images << url
# Parse the image name out of the url. We'll use that
# name to save it down.
file_name = parse_file_name(url)
while File.exist?(save_path + "/" + file_name)
file_name = "_" + file_name
end
# Get the response and data from the web for this image.
response = fetch_page(url)
# If the response is not nil, save the contents down to
# an image.
if !response.nil?
puts "Saving image: #{url}"
File.open(save_path + "/" + file_name, "wb+") do |f|
f << response.body
end
end
end
def parse_file_name(url)
# Find the position of the last slash. Everything after
# it is our file name.
spos = url.rindex("/")
url[spos + 1, url.length - 1]
end
def handle_links(root, links)
if !links.nil?
links.each {|l|
# Make sure all single quotes are replaced with double quotes.
# Since we aren't rendering javascript we don't really care
# if this breaks something.
l.gsub!("'", "\"")
# Grab everything between href=" and ".
href = l.scan(/href=[\"\']([^\"\']+)/i)
if !href.nil?
href = href[0]
if !href.nil?
href = href[0]
end
end
# We don't want to follow mailto or empty links
next if href.nil? || href.empty? || (href.length > 6 && href[0,6] == "mailto")
# We want all URLs we follow to be absolute.
href = discern_absolute_url(root, href)
# Down the rabbit hole we go...
crawl(href)
}
end
end
def parse_page(html)
images = html.scan(/<img [^>]*>/i)
links = html.scan(/<a [^>]*>/i)
return [ images, links ]
end
def fetch_page(url, limit = 10)
# Make sure we are supposed to fetch this type of resource.
return if should_omit_extension(url)
# You should choose a better exception class here.
raise ArgumentError, 'HTTP redirect too deep' if limit == 0
begin
response = Net::HTTP.get_response(URI.parse(url))
rescue
# The URL was not valid or the request failed - just log it and keep moving.
puts "INVALID URL: #{url}"
end
case response
when Net::HTTPSuccess then response
when Net::HTTPRedirection then fetch_page(response['location'], limit - 1)
else
# We don't want to throw errors if we get a response
# we are not expecting so we will just keep going.
nil
end
end
def should_omit_extension(url)
# Get the index of the last slash.
spos = url.rindex("/")
# Get the index of the last dot.
dpos = url.rindex(".")
# If there is no dot in the string this will be nil, so we
# need to set this to 0 so that the next line will realize
# that there is no extension and can continue.
if dpos.nil?
dpos = 0
end
# If the last dot is before the last slash, we don't have
# an extension and can return.
return false if spos > dpos
# Grab the extension.
ext = url[dpos + 1, url.length - 1]
# The return value is whether or not the extension we
# have for this URL is in the omit list or not.
omit_extensions.include? ext
end
end
# TODO: Update the visited/downloaded lookups to use a hash-based structure (a Hash or Set)
# in order to speed up the comparisons. Research whether this even makes a difference in
# Ruby - see the Set sketch after the source listing.
crawler = Crawler.new
crawler.save_path = "C:/SavePath"
crawler.omit_extensions = [ "doc", "pdf", "xls", "rtf", "docx", "xlsx", "ppt",
"pptx", "avi", "wmv", "wma", "mp3", "mp4", "pps", "swf" ]
crawler.domain = "http://www.yoursite.com/"
crawler.allow_leave_site = false
crawler.begin_crawl
# Bugs fixed:
# 1. Added error handling around call to HTTP.get_response in order to handle timeouts and other errors
#
# 2. Added check upon initialization to remove the trailing slash on the save path, if it exists.
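On the TODO above: one option I have not benchmarked yet is the standard library's Set class, which backs include? with a hash lookup instead of walking an array. A rough sketch of how the bookkeeping could look (not wired into the class; the variable names just mirror the accessors above):

require 'set'

# Hypothetical replacement for the array-based bookkeeping. Set#include?
# is a hash lookup, so membership checks stay fast as the lists grow.
visited_pages = Set.new
downloaded_images = Set.new

visited_pages << "http://www.yoursite.com/"
puts visited_pages.include?("http://www.yoursite.com/")       # true
puts visited_pages.include?("http://www.yoursite.com/other")  # false

Swapping the accessors over would just mean initializing them with Set.new instead of [] - the << and include? calls elsewhere in the class keep the same syntax.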
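As an aside, the hand-rolled joining in discern_absolute_url could probably be replaced with URI.join from the uri library that is already required at the top. I have not swapped it in here, but roughly:

# Rough alternative to discern_absolute_url using the standard library.
# URI.join resolves a relative reference against a base URL, handling
# both document-relative and domain-relative paths.
base = "http://www.yoursite.com/photos/index.html"

puts URI.join(base, "images/pic.jpg")   # http://www.yoursite.com/photos/images/pic.jpg
puts URI.join(base, "/images/pic.jpg")  # http://www.yoursite.com/images/pic.jpg

The catch is that URI.join raises on malformed URLs, so it would need the same kind of rescue that fetch_page already uses.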