
Approaches to Digital Bibliography and Book History: Scraping, Analysis, and Visualization

Published on June 16, 2015

What follows is a loosely organized collection of links and information about web-scraping, data analysis, and visualization. The information is presented as part of the instructional material for my Rare Book School course on Approaches to Digital Bibliography and Book History, co-taught with Benjamin Pauley.

Before moving on, it’s helpful to have some basic Unix/Linux command line skills. The math department at the University of Utah provides a nice, bare-bones set of documentation. You don’t have to be a Unix expert; you just need to be able to navigate around, know where you are, and launch scripts and programs.

Also, when we get to using the R tools, you’ll need to have Git installed so that you can easily get updates to the code as tools are added and changed. Instructions for getting Git and using it with RStudio can be found here.

Web-Based Tools for Visualization and Analysis
Many web-based tools are now available for performing a variety of scraping, analysis, and visualization tasks. For a comprehensive list, organized by tool type, see Alan Liu’s DH Toychest. The list is actively curated and contains a relatively up-to-date collection of tools. The following links are to a sampling of web-based resources that we will cover in our course unit.

  • Voyant: A web-based text analysis tool that lets you upload texts, submit a list of URLs, or load a pre-packaged corpus of text. It provides various visualizations and analyses of the text, such as word frequency and word trends, and lets you navigate within a text to the identified trends. It’s reasonably simple and self-explanatory, but a good tutorial can be found here.
  • Bookworm: An n-gram viewer and analysis tool that can be run against a collection of text repositories, or on collections of your own making. It is also an intuitive tool, but for a walkthrough visit the tutorial here.
  • Palladio: A web-based tool that visualizes structured data sets. One really good function is that it will export crunched data as JSON. This JSON data can then be used with D3.js to create your own visualizations.
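
To make "word frequency" and "n-gram" concrete, here is a minimal Python sketch of the kind of counting these tools perform behind the scenes. It is not how Voyant or Bookworm are actually implemented; the function names, the simple tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))


def ngram_frequencies(text, n=2):
    """Count every run of n consecutive words (an n-gram) in the text."""
    tokens = tokenize(text)
    # Zipping the token list against shifted copies of itself yields each n-word window.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text, invented for this example.
sample = "The rare book is a book. The rare book school teaches book history."

print(word_frequencies(sample).most_common(3))
print(ngram_frequencies(sample, n=2).most_common(2))
```

A real tool adds much more (stop-word lists, handling of punctuation and historical spelling, counts over time), but the core operation is this kind of tallying.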

Downloadable Tools for Visualization and Analysis

  • Laurence Anthony’s Ant Tools: This guy is a text analysis coding savant! His page contains a whole collection of downloadable tools for text analysis and comparison in a variety of formats.
  • Topic-Modeling-Tool: Just like it sounds, Topic-Modeling-Tool is a simple Java app (runs under your local JRE) that opens a text file and runs a configurable topic model on it. You can model a single file or all files in a directory. Results are output to the console, a .csv file, and as HTML.
  • Stanford Named Entity Recognizer: A tool for automatic extraction of named entities such as people and places. It comes out of the box with support for modern English and can be expanded to other data sets. It is a downloadable tool, but there is an online demo.
  • OpenRefine: A tool for cleaning up and managing datasets.
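
As an illustration of what "cleaning up" a dataset can involve, the sketch below groups variant spellings of the same name by reducing each value to a normalized key. This is similar in spirit to the key-collision clustering that OpenRefine offers, but it is a simplified stand-in and not OpenRefine's own code; the function names and the list of names are assumptions made for the example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key so that trivial variants collide."""
    # Fold accented characters to plain ASCII where possible.
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase, and drop punctuation.
    cleaned = re.sub(r"[^\w\s]", "", folded.strip().lower())
    # Sort and deduplicate the words so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Return groups of two or more values that share the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical column of names as they might appear in a hand-built dataset.
names = [
    "Tonson, Jacob",
    "Jacob Tonson",
    "jacob tonson.",
    "Edmund Curll",
    "Curll, Edmund",
    "Bernard Lintot",
]

for group in cluster(names):
    print(group)
```

Each printed group is a set of candidates for merging into one canonical form. A human still has to confirm the merge, since two different people can share a fingerprint.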

In addition to the above tools, we’ll also go over some very basic information about visualization with D3, a JavaScript library for doing very sophisticated, data-driven visualization. D3 is very complicated, and this will only be a brief introduction designed to show what is possible, how to begin, and how to hack a visualization from the example set without having to be a D3 or even JavaScript expert. A good tutorial can be found here. D3VIS.zip contains a collection of visualizations of varying complexity that we will use to build skills.

Another useful tool is the DownThemAll Firefox plugin. It’s great for getting, say, all the images from a Flickr collection.
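
For a sense of what such a bulk downloader does before it fetches anything, the following Python sketch collects every image address from a page's HTML using only the standard library. It is an illustration of the general technique, not the plugin's actual behavior; the class name, the sample HTML, and the example.org addresses are all made up, and the sketch stops short of downloading any files.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the address of every <img> tag found in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the address of the page itself.
                self.image_urls.append(urljoin(self.base_url, src))


def collect_image_urls(html, base_url):
    """Return the absolute URLs of all images referenced in the HTML."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical page content; a real script would first download the page.
page = """
<html><body>
  <img src="plates/one.jpg" alt="Plate 1">
  <img src="/img/two.png" alt="Plate 2" />
  <img src="https://example.org/static/three.gif">
  <img alt="no source here">
</body></html>
"""

for url in collect_image_urls(page, "https://example.org/gallery/"):
    print(url)
```

The resulting list of addresses is what a downloader would then fetch one by one. Check a site's terms of use before scraping it in bulk.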

I’ve also created a collection of R scripts for web scraping and analysis. They can be found in the “r-text-tools” Bitbucket Git repository. We’ll be downloading and running the scripts in RStudio.

 
© 2015 by Carl G Stahmer
All Rights Reserved