This is one of the first tools I’ve learned to use where I instantly felt how powerful it is. Nokogiri is a Ruby gem for parsing and searching HTML, XML and other documents, which makes it a staple of web scraping. It’s simple to use and actually kinda fun. For my first project with it, I built a web scraper that grabs the information for the top books of every decade from Goodreads.
An Outline of this Project
In planning out this project, I envisioned three stages of interaction.
The user chooses the decade for which they want to see the top books.
The program returns a list of the top books for that specific decade.
The user can then get more information about a specific book, like its description and rating.
This means building a few pieces: a method that scrapes information on the decades, a method that gathers information on each book, and a CLI (Command Line Interface) to interact with these data. (Yes, the word data is plural.)
Scraping the Decade Information from Goodreads
We can open, read and parse HTML pages by requiring both Nokogiri and Open-URI at the top of our document. Then I set the opened Nokogiri document equal to a variable called decades. All of this is contained within the Decade class, in a class method called scraper. So Decade.scraper does exactly what it sounds like: it scrapes information on the decades.
You’ll notice the decades.css line above. The .css method is how Nokogiri lets us narrow in on any part of the HTML document by its CSS classes and IDs. Here, the links for each decade are listed, so I iterate through each one.
This gives us some nice and tidy data:
Gathering Book Information for each Decade
So each decade has its top ten books. As scraped above, the data don’t include the information on the top ten books of that decade: @Top10 is nil. I need to write code that opens each decade’s URL, pulls the data for each book and returns an array of those books, stored in the @Top10 attribute of the decade.
To do this, I built a Book class which has a scrape class method. The code below is commented to explain its purpose.
Basically, the information for each book is stored in a table row, so for each table row, I grab the needed information.
Now, to integrate these data with my Decade objects, I created the Decade.books class method.
This method iterates through @@all the decades, grabs the URL of that decade and passes it to the Book.scrape method.
When the book scraper returns the array of the top ten books, it sets that array equal to the @Top10 attribute of the Decade object.
So the data for each decade now look like this:
Building a Command Line Interface
To be able to interact with this information, I’ve built a command line interface. The code is available on my GitHub, and the video below shows a walkthrough of the program.
Time to Refactor
So this works, but it could do so much more elegantly. I plan on refactoring to:
Work more quickly by storing the information instead of scraping it each time.
Be more graceful in its construction.
Stay tuned for the refactored update and the publication of the final gem.