What follows is a loosely organized collection of links and information about web-scraping, data analysis, and visualization. The information is presented as part of the instructional material for my Rare Book School course on Approaches to Digital Bibliography and Book History, co-taught with Benjamin Pauley.
Before moving on, it’s helpful to have some basic Unix/Linux command line skills. The math department at the University of Utah provides a nice, bare-bones set of documentation. You don’t have to be a Unix expert; you just need to be able to navigate around, know where you are, and launch scripts and programs.
Also, when we get to using the R tools, you’ll need to have Git installed so that you can easily get updates to the code as tools are added and changed. Instructions for getting Git and using it with RStudio can be found here.
Web-Based Tools for Visualization and Analysis
Many web-based tools are now available for scraping, analysis, and visualization. For a comprehensive list, organized by tool type, see Alan Liu’s DH Toychest. The list is actively curated and relatively up to date. The following links are a sampling of the web-based resources that we will cover in our course unit.
- Voyant: A web-based text analysis tool that lets you upload texts, submit a list of URLs, or load a pre-packaged corpus of text. It provides various visualizations and analyses of the text, such as word frequencies and word trends, and lets you navigate within a text to the places where an identified trend occurs. It’s reasonably simple and self-explanatory, but a good tutorial can be found here.
- Bookworm: An n-gram viewer and analysis tool that can be run against a collection of text repositories or on collections of your own making. It is also an intuitive tool, but for a walkthrough, visit the tutorial here.
- Palladio: A web-based tool that visualizes structured data sets. One really good function is that it will export the crunched data as JSON. This JSON data can then be used with D3.js to create your own visualizations.
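To give a rough sense of what word-frequency and n-gram tools like Voyant and Bookworm are counting, here is a minimal Python sketch. It is not how either tool is actually implemented. The function names and the sample sentence are my own, made up for illustration, and the tokenizer is deliberately crude (it keeps only unaccented lowercase letters and apostrophes).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    # Crude on purpose: accented characters, digits, and hyphens are dropped.
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))


def ngram_frequencies(text, n=2):
    """Count each run of n consecutive words (an n-gram)."""
    tokens = tokenize(text)
    # Zipping n staggered copies of the token list yields every n-word window.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)


# Hypothetical sample text, invented for this example.
sample = "The book of the printer and the book of the binder."

print(word_frequencies(sample).most_common(3))
print(ngram_frequencies(sample, n=2).most_common(3))
```

Swapping `sample` for the contents of a text file gives a bare-bones version of the word lists these tools build before they draw anything.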
Downloadable Tools for Visualization and Analysis
- Laurence Anthony’s Ant Tools: This guy is a text analysis coding savant! His page contains a whole collection of downloadable tools for text analysis and comparison in a variety of formats.
- Topic-Modeling-Tool: Just like it sounds, Topic-Modeling-Tool is a simple Java app (it runs under your local JRE) that opens a text file and runs a configurable topic model on it. You can model a single file or all files in a directory. Results are output to the console, to a .csv file, and as HTML.
- Stanford Named Entity Recognizer: A tool for automatically extracting named entities such as people and places. Out of the box it works with modern English, but it can be expanded to other data sets. It is a downloadable tool, but there is also an online demo.
- OpenRefine: A tool for cleaning up and managing datasets.
Another useful tool is the DownThemAll Firefox plugin. It’s great for getting, say, all the images from a Flickr collection.
I’ve also created a collection of R scripts for web scraping and analysis. They can be found in the “r-text-tools” Bitbucket Git repository. We’ll be downloading and running the scripts in RStudio.