Text Processing and Word2Vec Model Training
Overview
This script performs a set of text processing tasks on a dataset of letters of the Delegates of Congress, including:
- Extracting letter content from raw files.
- Tokenizing and cleaning text data.
- Training Word2Vec models on the processed text.
1. Libraries
The script requires several libraries for its operations:
- argparse: For command-line argument parsing.
- pandas: For data manipulation and reading CSV files.
- tqdm: For progress bars.
- numpy: For numerical operations.
- nltk: For natural language processing (tokenization and stopword removal).
- string: For string operations.
- gensim: For topic modeling and Word2Vec model training.
- os: For file and directory operations.
- csv: For CSV file operations.
2. Check System Architecture
The script adjusts the CSV field size limit based on whether the system is 32-bit or 64-bit.
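A minimal sketch of this check, assuming the common sys.maxsize test (the exact threshold used in the script is not shown here):

```python
import csv
import sys

# On 64-bit interpreters sys.maxsize is large enough to lift the CSV field
# size limit outright; on 32-bit systems (or 64-bit Windows, where C long is
# 32 bits) a smaller cap avoids OverflowError.
if sys.maxsize > 2**32:
    try:
        csv.field_size_limit(sys.maxsize)
    except OverflowError:
        csv.field_size_limit(2**31 - 1)
else:
    csv.field_size_limit(2**31 - 1)
```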
3. Functions
get_content(args)
Purpose: Extracts the content of letters from text files and saves them into CSV files by year.
- Parameters: args (command-line arguments, though not used in the function).
- Process:
  - Reads metadata from Letters.csv.
  - Retrieves the content from text files based on the 'TCP' field.
  - Saves each year's content to a separate CSV file.
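A sketch of what this step might look like; the directory layout, the 'Year' metadata column, and the output file names are assumptions rather than details confirmed by the script:

```python
import os
import pandas as pd
from tqdm import tqdm

def get_content(args):
    # Read the letter metadata; 'TCP' identifies the raw text file and
    # 'Year' (assumed column name) drives the per-year grouping.
    meta = pd.read_csv("Letters.csv")
    for year, group in meta.groupby("Year"):
        rows = []
        for _, row in tqdm(group.iterrows(), total=len(group), desc=str(year)):
            path = os.path.join("raw", f"{row['TCP']}.txt")  # assumed location
            if os.path.exists(path):
                with open(path, encoding="utf-8") as f:
                    rows.append({"TCP": row["TCP"], "content": f.read()})
        pd.DataFrame(rows).to_csv(f"content/{year}.csv", index=False)
```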
split_into_sentences(text)
Purpose: Splits text into sentences.
- Parameters: text (string).
- Returns: List of sentences.
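A likely one-line implementation using NLTK's Punkt tokenizer (an assumption; it requires nltk.download('punkt')):

```python
from nltk.tokenize import sent_tokenize

def split_into_sentences(text):
    # Split a raw letter into a list of sentence strings.
    return sent_tokenize(text)
```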
split_into_words(text)
Purpose: Tokenizes sentences into words.
- Parameters: text (list of sentences).
- Returns: List of lists of words.
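Again assuming NLTK's word tokenizer, a sketch could be:

```python
from nltk.tokenize import word_tokenize

def split_into_words(text):
    # 'text' is a list of sentence strings; each becomes a list of tokens.
    return [word_tokenize(sentence) for sentence in text]
```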
remove_stopwords_and_punctuation(text, stopwords)
Purpose: Removes stopwords and punctuation from tokenized text.
- Parameters:
  - text (list of lists of words).
  - stopwords (list of stopwords).
- Returns: Cleaned list of lists of words.
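A sketch of the cleaning step; lowercasing before the stopword check is an assumption:

```python
import string

def remove_stopwords_and_punctuation(text, stopwords):
    cleaned = []
    for sentence in text:
        cleaned.append([
            word.lower() for word in sentence
            if word.lower() not in stopwords and word not in string.punctuation
        ])
    return cleaned
```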
join_words(text)
Purpose: Joins lists of words into a single string document.
- Parameters: text (list of lists of words).
- Returns: String document.
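This helper most likely flattens the nested lists back into one space-separated string, for example:

```python
def join_words(text):
    # Flatten the list of tokenized sentences into a single document string.
    return " ".join(word for sentence in text for word in sentence)
```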
tokenize(args)
Purpose: Processes raw text files into tokenized and cleaned text, and saves them in CSV files.
- Parameters: args (command-line arguments, though not used in the function).
- Process:
  - Reads text content from CSV files.
  - Tokenizes, removes stopwords, and joins words.
  - Saves the processed text to new CSV files.
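Putting the helpers above together, the pass might look roughly like this; the year range, column names, and file layout are illustrative assumptions:

```python
import pandas as pd
from nltk.corpus import stopwords

def tokenize(args):
    stops = set(stopwords.words("english"))  # requires nltk.download('stopwords')
    for year in range(1774, 1790):           # illustrative range of years
        df = pd.read_csv(f"content/{year}.csv")
        processed = []
        for raw in df["content"].fillna(""):
            sentences = split_into_sentences(raw)  # helpers sketched above
            words = split_into_words(sentences)
            words = remove_stopwords_and_punctuation(words, stops)
            processed.append(join_words(words))
        df["tokenized"] = processed
        df.to_csv(f"tokenized/{year}.csv", index=False)
```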
get_min_max_year()
Purpose: Retrieves the minimum and maximum year from tokenized data.
- Returns: Tuple of (min_year, max_year).
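Assuming one tokenized CSV per year named '<year>.csv', this could be as simple as:

```python
import os

def get_min_max_year():
    # Derive the year span from the tokenized file names (assumed layout).
    years = [int(os.path.splitext(name)[0])
             for name in os.listdir("tokenized") if name.endswith(".csv")]
    return min(years), max(years)
```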
get_sentences_for_year(year)
Purpose: Retrieves tokenized sentences from a CSV file for a specific year.
- Parameters: year (int).
- Returns: List of lists of sentences.
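A sketch that reads one year's file and splits each stored document back into word lists, as gensim expects; the 'tokenized' column name is an assumption:

```python
import pandas as pd

def get_sentences_for_year(year):
    # Each row's tokenized text becomes a list of words for Word2Vec.
    df = pd.read_csv(f"tokenized/{year}.csv")
    return [doc.split() for doc in df["tokenized"].dropna()]
```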
get_sentences_in_range(start_y, end_y)
Purpose: Retrieves tokenized sentences for a range of years.
- Parameters:
  - start_y (int): Start year.
  - end_y (int): End year.
- Returns: List of lists of sentences.
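Building on get_sentences_for_year above, the range variant would simply concatenate the per-year lists:

```python
def get_sentences_in_range(start_y, end_y):
    sentences = []
    for year in range(start_y, end_y + 1):  # inclusive of both endpoints
        sentences.extend(get_sentences_for_year(year))
    return sentences
```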
train(args)
Purpose: Trains Word2Vec models on text data from specified year ranges.
- Parameters: args (command-line arguments, including an optional window size for model training).
- Process:
  - Constructs and trains a Word2Vec model for each year range.
  - Saves the trained model to disk.
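A sketch of the training loop, assuming gensim 4.x and illustrative five-year slices; the actual year ranges, hyperparameters, and output paths are not specified in this summary:

```python
from gensim.models import Word2Vec

def train(args):
    window = getattr(args, "window", None) or 5   # optional CLI window size
    min_year, max_year = get_min_max_year()
    for start in range(min_year, max_year + 1, 5):
        end = min(start + 4, max_year)
        sentences = get_sentences_in_range(start, end)
        model = Word2Vec(sentences, vector_size=100, window=window,
                         min_count=5, workers=4)
        model.save(f"models/w2v_{start}_{end}.model")
```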
4. Main Function
Purpose: Parses command-line arguments and executes the corresponding function (get_content, tokenize, or train).
- Command-line Arguments:
  - content: Extracts content from raw data.
  - tokenize: Tokenizes and cleans text data.
  - train: Trains Word2Vec models, with an optional window size argument.
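A minimal argparse entry point consistent with this description; the exact argument names and the --window flag are assumptions:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Letters corpus pipeline")
    parser.add_argument("command", choices=["content", "tokenize", "train"])
    parser.add_argument("--window", type=int, default=5,
                        help="Word2Vec window size (used by 'train')")
    args = parser.parse_args()
    if args.command == "content":
        get_content(args)
    elif args.command == "tokenize":
        tokenize(args)
    else:
        train(args)

if __name__ == "__main__":
    main()
```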
Conclusion
This script provides a comprehensive solution for processing and analyzing historical letter data. It extracts and tokenizes text content, prepares it for modeling, and trains Word2Vec models to capture semantic relationships between words.