Scraping letters of the Founding Fathers
Overview
Objective
The goal is to scrape and save text data from the Founding Fathers website, extracting metadata and content from XML and JSON sources. The data is then cleaned and organized into a structured format suitable for further analysis. The original code was written by Jay Timm and can be found here; it has been adapted for the current project.
Workflow
- Load Required Libraries: Import the necessary R packages for data manipulation and scraping.
- Scrape Data: Extract metadata and text content from XML and JSON files.
- Save Data: Organize and save the cleaned data into files categorized by historical periods.
Step-by-Step Guide
1. Load Required Libraries
First, load the tidyverse library, which provides the data manipulation tools used throughout this guide. The xml2 and jsonlite functions below are called with their package prefix, so those packages need to be installed but do not have to be attached.
library(tidyverse)
Set the working directory to where your XML metadata file is located and specify the file name.
2. Scrape Data
Extract Metadata
To obtain metadata for each letter, read the XML file with the xml2::read_xml() function. This function parses the file, and the parsed document can then be converted to a nested R list with xml2::as_list(). From it, extract the key metadata fields (title, permalink, project, authors, recipients, and dates) and compile them into a data frame.
- Variable: xml2
- Function: xml2::as_list()
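The extraction step can be sketched as follows. This is a minimal illustration, not the original code: the sample XML, its element names (document, title, date-from, and so on) and the helpers first_text() and all_text() are assumptions made for the example, and the real metadata file may use a different schema. It also queries the parsed document with XPath rather than converting it with xml2::as_list(), simply because that keeps the example short.

```r
library(xml2)

# Hypothetical sample: the element names below are assumptions,
# not the real schema of the metadata file.
sample_xml <- '<documents>
  <document>
    <title>Letter A</title>
    <permalink>https://example.org/letter-a</permalink>
    <project>Example Papers</project>
    <authors><author>Author One</author><author>Author Two</author></authors>
    <recipients><recipient>Recipient One</recipient></recipients>
    <date-from>1776-07-04</date-from>
  </document>
  <document>
    <title>Letter B</title>
    <permalink>https://example.org/letter-b</permalink>
    <project>Example Papers</project>
    <authors><author>Author Three</author></authors>
    <recipients></recipients>
  </document>
</documents>'

doc <- read_xml(sample_xml)
records <- xml_find_all(doc, "/documents/document")

# Text of the first match per record; NA when the element is absent.
first_text <- function(nodes, xpath) {
  xml_text(xml_find_first(nodes, xpath))
}

# All matches per record, collapsed into one string ("" when none).
all_text <- function(nodes, xpath) {
  vapply(nodes, function(node) {
    paste(xml_text(xml_find_all(node, xpath)), collapse = " | ")
  }, character(1))
}

# One row per letter.
meta <- data.frame(
  title      = first_text(records, "./title"),
  permalink  = first_text(records, "./permalink"),
  project    = first_text(records, "./project"),
  authors    = all_text(records, "./authors/author"),
  recipients = all_text(records, "./recipients/recipient"),
  date_from  = as.Date(first_text(records, "./date-from")),
  stringsAsFactors = FALSE
)

print(meta)
```

For a file on disk, read_xml() accepts the file path in place of the inline string.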
Extract Text Content
The full-text content of each letter is retrieved through the provided API. Use the jsonlite::fromJSON() function to parse the JSON responses from the API. A custom function, clean_text(), is then applied to the raw text to remove unnecessary whitespace and empty lines.
- Variable: x2$api
- Function: jsonlite::fromJSON()
- Function: clean_text()
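A possible shape for this step is sketched below. The body of clean_text() is not given in this guide, so the version here is only one plausible implementation of "remove unnecessary whitespace and empty lines". The JSON field name content and the helper fetch_letter() are likewise assumptions for illustration; the example parses an inline JSON string so that it runs without network access.

```r
library(jsonlite)

# One plausible clean_text(): collapse runs of spaces and tabs, trim each
# line, and drop empty lines. The original implementation may differ.
clean_text <- function(x) {
  if (length(x) != 1 || is.na(x)) return(NA_character_)
  lines <- unlist(strsplit(x, "\r?\n"))
  lines <- trimws(gsub("[ \t]+", " ", lines))
  lines <- lines[nzchar(lines)]
  paste(lines, collapse = "\n")
}

# Hypothetical API response; the field name "content" is an assumption.
sample_json <- '{"title": "Letter A", "content": "  Dear Sir,  \\n\\n\\n   I write   to you. \\n"}'

parsed  <- fromJSON(sample_json)
cleaned <- clean_text(parsed$content)
cat(cleaned, "\n")

# Hypothetical wrapper for real requests: fromJSON() also accepts a URL,
# so each entry of x2$api could be passed to it. Failed requests become NA.
fetch_letter <- function(url) {
  tryCatch(
    clean_text(fromJSON(url)$content),
    error = function(e) NA_character_
  )
}
```

Returning NA on failure keeps one slow or missing letter from aborting a long scraping run; the failed rows can be retried afterwards.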
Categorize Texts by Historical Period
Assign each letter to a historical period based on its date, using the mutate() function along with the case_when() helper from the dplyr package. Historical periods include Colonial, Revolutionary War, Confederation Period, and various presidencies.
- Function: mutate()
- Function: case_when()
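The categorization might look like the sketch below. The column names (title, date_from, period) and the sample rows are made up for the example, and the cut-off dates are assumptions based on commonly cited dates (the start of the war on 1775-04-19, the Treaty of Paris on 1783-09-03, Washington's inauguration on 1789-04-30, Adams's on 1797-03-04); the original code may draw the boundaries differently and distinguishes more presidencies than shown here.

```r
library(dplyr)

# Made-up rows for illustration only.
letters_df <- data.frame(
  title     = c("A", "B", "C", "D", "E"),
  date_from = as.Date(c("1770-03-05", "1776-07-04", "1785-01-10",
                        "1790-06-01", NA)),
  stringsAsFactors = FALSE
)

letters_df <- letters_df %>%
  mutate(period = case_when(
    # Handle missing dates first; otherwise they fall through to the
    # catch-all branch at the end.
    is.na(date_from)                  ~ NA_character_,
    date_from < as.Date("1775-04-19") ~ "Colonial",
    date_from < as.Date("1783-09-04") ~ "Revolutionary War",
    date_from < as.Date("1789-04-30") ~ "Confederation Period",
    date_from < as.Date("1797-03-04") ~ "Washington Presidency",
    TRUE                              ~ "Later Presidencies"
  ))

print(letters_df)
```

case_when() evaluates its conditions in order and uses the first one that is true, so each branch only needs an upper bound.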
3. Save Data
The cleaned and categorized data is saved into separate RDS files, one for each historical period. Use the saveRDS() function to serialize each data frame to an RDS file, which preserves column types and can be read back with readRDS() for further analysis.
- Function: saveRDS()
Set the working directory to the desired location for saving files. Use lapply() to iterate through each period's data and save it as an individual RDS file.
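As a rough sketch of the saving step, the example below splits a small made-up data frame by period and writes one RDS file per group. The column names, the file-naming scheme, and the use of a temporary directory (instead of changing the working directory) are all choices made for this illustration, not details taken from the original code.

```r
# Made-up data for illustration only.
letters_df <- data.frame(
  period = c("Colonial", "Revolutionary War", "Colonial"),
  text   = c("First letter", "Second letter", "Third letter"),
  stringsAsFactors = FALSE
)

# One data frame per historical period.
by_period <- split(letters_df, letters_df$period)

# Write to a temporary directory so the example leaves no files behind;
# replace out_dir with the real target folder.
out_dir <- file.path(tempdir(), "founders_rds")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

paths <- lapply(names(by_period), function(p) {
  # Turn "Revolutionary War" into "Revolutionary_War.rds".
  file_name <- paste0(gsub("[^A-Za-z0-9]+", "_", p), ".rds")
  path <- file.path(out_dir, file_name)
  saveRDS(by_period[[p]], path)
  path
})

# Read one file back to confirm the round trip.
restored <- readRDS(paths[[1]])
print(restored)
```

Building full paths with file.path() avoids relying on setwd(), which makes the script easier to rerun from another location.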
Conclusion
By following these steps, you can efficiently scrape, clean, and save text data from the "Founding Fathers" website. This process ensures that the data is well-organized and ready for further historical analysis.