Text Processing and Word2Vec Model Training
Overview
This script performs a set of text processing tasks on a dataset of letters of the Delegates of Congress, including:
- Extracting letter content from raw files.
- Tokenizing and cleaning text data.
- Training Word2Vec models on the processed text.
1. Libraries
The script requires several libraries for its operations:
- argparse: For command-line argument parsing.
- pandas: For data manipulation and reading CSV files.
- tqdm: For progress bars.
- numpy: For numerical operations.
- nltk: For natural language processing (tokenization and stopword removal).
- string: For string operations.
- gensim: For topic modeling and Word2Vec model training.
- os: For file and directory operations.
- csv: For CSV file operations.
2. Check System Architecture
The script adjusts the CSV field size limit based on whether the system is 32-bit or 64-bit.
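A minimal sketch of this check, assuming the common sys.maxsize test (the exact threshold used in the script is not shown here):

```python
import csv
import sys

# On 64-bit interpreters sys.maxsize is large enough to lift the CSV field
# size limit outright; on 32-bit systems (or 64-bit Windows, where C long is
# 32 bits) a smaller cap avoids OverflowError.
if sys.maxsize > 2**32:
    try:
        csv.field_size_limit(sys.maxsize)
    except OverflowError:
        csv.field_size_limit(2**31 - 1)
else:
    csv.field_size_limit(2**31 - 1)
```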
3. Functions
get_content(args)
Purpose: Extracts the content of letters from text files and saves them into CSV files by year.
- Parameters: args (command-line arguments, though not used in the function).
- Process:
  - Reads metadata from Letters.csv.
  - Retrieves the content from text files based on the 'TCP' field.
  - Saves each year's content to a separate CSV file.
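A sketch of what this step might look like; the directory layout, the 'Year' metadata column, and the output file names are assumptions rather than details confirmed by the script:

```python
import os
import pandas as pd
from tqdm import tqdm

def get_content(args):
    # Read the letter metadata; 'TCP' identifies the raw text file and
    # 'Year' (assumed column name) drives the per-year grouping.
    meta = pd.read_csv("Letters.csv")
    for year, group in meta.groupby("Year"):
        rows = []
        for _, row in tqdm(group.iterrows(), total=len(group), desc=str(year)):
            path = os.path.join("raw", f"{row['TCP']}.txt")  # assumed location
            if os.path.exists(path):
                with open(path, encoding="utf-8") as f:
                    rows.append({"TCP": row["TCP"], "content": f.read()})
        pd.DataFrame(rows).to_csv(f"content/{year}.csv", index=False)
```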
split_into_sentences(text)
Purpose: Splits text into sentences.
- Parameters: text (string).
- Returns: List of sentences.
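A likely one-line implementation using NLTK's Punkt tokenizer (an assumption; it requires nltk.download('punkt')):

```python
from nltk.tokenize import sent_tokenize

def split_into_sentences(text):
    # Split a raw letter into a list of sentence strings.
    return sent_tokenize(text)
```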
split_into_words(text)
Purpose: Tokenizes sentences into words.
- Parameters: text (list of sentences).
- Returns: List of lists of words.
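Again assuming NLTK's word tokenizer, a sketch could be:

```python
from nltk.tokenize import word_tokenize

def split_into_words(text):
    # 'text' is a list of sentence strings; each becomes a list of tokens.
    return [word_tokenize(sentence) for sentence in text]
```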
remove_stopwords_and_punctuation(text, stopwords)
Purpose: Removes stopwords and punctuation from tokenized text.
- Parameters:
  - text (list of lists of words).
  - stopwords (list of stopwords).
- Returns: Cleaned list of lists of words.
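A sketch of the cleaning step; lowercasing before the stopword check is an assumption:

```python
import string

def remove_stopwords_and_punctuation(text, stopwords):
    cleaned = []
    for sentence in text:
        cleaned.append([
            word.lower() for word in sentence
            if word.lower() not in stopwords and word not in string.punctuation
        ])
    return cleaned
```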
join_words(text)
Purpose: Joins lists of words into a single string document.
- Parameters: text (list of lists of words).
- Returns: String document.
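This helper most likely flattens the nested lists back into one space-separated string, for example:

```python
def join_words(text):
    # Flatten the list of tokenized sentences into a single document string.
    return " ".join(word for sentence in text for word in sentence)
```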
tokenize(args)
Purpose: Processes raw text files into tokenized and cleaned text, and saves them in CSV files.
- Parameters: args (command-line arguments, though not used in the function).
- Process:
  - Reads text content from CSV files.
  - Tokenizes, removes stopwords, and joins words.
  - Saves the processed text to new CSV files.
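Putting the helpers above together, the pass might look roughly like this; the year range, column names, and file layout are illustrative assumptions:

```python
import pandas as pd
from nltk.corpus import stopwords

def tokenize(args):
    stops = set(stopwords.words("english"))  # requires nltk.download('stopwords')
    for year in range(1774, 1790):           # illustrative range of years
        df = pd.read_csv(f"content/{year}.csv")
        processed = []
        for raw in df["content"].fillna(""):
            sentences = split_into_sentences(raw)  # helpers sketched above
            words = split_into_words(sentences)
            words = remove_stopwords_and_punctuation(words, stops)
            processed.append(join_words(words))
        df["tokenized"] = processed
        df.to_csv(f"tokenized/{year}.csv", index=False)
```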
get_min_max_year()
Purpose: Retrieves the minimum and maximum year from tokenized data.
- Returns: Tuple of (min_year, max_year).
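Assuming one tokenized CSV per year named '<year>.csv', this could be as simple as:

```python
import os

def get_min_max_year():
    # Derive the year span from the tokenized file names (assumed layout).
    years = [int(os.path.splitext(name)[0])
             for name in os.listdir("tokenized") if name.endswith(".csv")]
    return min(years), max(years)
```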
get_sentences_for_year(year)
Purpose: Retrieves tokenized sentences from a CSV file for a specific year.
- Parameters: year (int).
- Returns: List of lists of sentences.
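A sketch that reads one year's file and splits each stored document back into word lists, as gensim expects; the 'tokenized' column name is an assumption:

```python
import pandas as pd

def get_sentences_for_year(year):
    # Each row's tokenized text becomes a list of words for Word2Vec.
    df = pd.read_csv(f"tokenized/{year}.csv")
    return [doc.split() for doc in df["tokenized"].dropna()]
```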
get_sentences_in_range(start_y, end_y)
Purpose: Retrieves tokenized sentences for a range of years.
- Parameters:
  - start_y (int): Start year.
  - end_y (int): End year.
- Returns: List of lists of sentences.
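Building on get_sentences_for_year above, the range variant would simply concatenate the per-year lists:

```python
def get_sentences_in_range(start_y, end_y):
    sentences = []
    for year in range(start_y, end_y + 1):  # inclusive of both endpoints
        sentences.extend(get_sentences_for_year(year))
    return sentences
```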
train(args)
Purpose: Trains Word2Vec models on text data from specified year ranges.
- Parameters: args (command-line arguments, including an optional window size for model training).
- Process:
  - Constructs and trains a Word2Vec model for each year range.
  - Saves the trained model to disk.
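A sketch of the training loop, assuming gensim 4.x and illustrative five-year slices; the actual year ranges, hyperparameters, and output paths are not specified in this summary:

```python
from gensim.models import Word2Vec

def train(args):
    window = getattr(args, "window", None) or 5   # optional CLI window size
    min_year, max_year = get_min_max_year()
    for start in range(min_year, max_year + 1, 5):
        end = min(start + 4, max_year)
        sentences = get_sentences_in_range(start, end)
        model = Word2Vec(sentences, vector_size=100, window=window,
                         min_count=5, workers=4)
        model.save(f"models/w2v_{start}_{end}.model")
```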
4. Main Function
Purpose: Parses command-line arguments and executes the corresponding function (get_content, tokenize, or train).
- Command-line Arguments:
  - content: Extracts content from raw data.
  - tokenize: Tokenizes and cleans text data.
  - train: Trains Word2Vec models, with an optional window size argument.
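A minimal argparse entry point consistent with this description; the exact argument names and the --window flag are assumptions:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Letters corpus pipeline")
    parser.add_argument("command", choices=["content", "tokenize", "train"])
    parser.add_argument("--window", type=int, default=5,
                        help="Word2Vec window size (used by 'train')")
    args = parser.parse_args()
    if args.command == "content":
        get_content(args)
    elif args.command == "tokenize":
        tokenize(args)
    else:
        train(args)

if __name__ == "__main__":
    main()
```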
Conclusion
This script provides a comprehensive solution for processing and analyzing historical letter data. It extracts and tokenizes text content, prepares it for modeling, and trains Word2Vec models to capture semantic relationships between words.