Textual Analysis on S&P500 Companies
This set of code demonstrates how to conduct textual analysis using the business descriptions of S&P500 companies. It covers some basic textual analysis steps, including:
Clean and Prepare Corpus
Form Bag of Words
Similarity Based on BOW
Similarity Based on Doc2Vec
Topic Classification using LDA
part 0: import packages
This code uses many NLP-related packages, such as NLTK for stop words, Gensim and SpaCy for various preparation and vectorization tasks, and scikit-learn for vectorization and similarity calculation. Of course, there are many other ways of handling this task, and this code just serves as a toy model to demonstrate the concept.
part 1: build S&P500 index constituents
Let's first establish a connection to the WRDS server and extract the constituent information of the S&P500 Index.
Prior to 2020, Compustat provided index constituents data through WRDS. This data component was discontinued by Compustat as of October 2020, and researchers can now access S&P500 Index constituents through CRSP. However, constituent information for many other indices is no longer available due to the coverage difference between CRSP and Compustat. If you as a researcher find this change inconvenient for your research projects, please reach out to S&P support directly at support.datafeed.mi@spglobal.com.
Next, let's add various other company identifiers (e.g. ticker) by linking the sp500 dataframe with crsp.msenames.
As the corpus of this code comes from CIQ's company business descriptions, we need to add GVKEY, the company identifier used in the CIQ ecosystem.
Optional step of linking with CIK in case you want to connect with SEC filings for your analysis:
part 2: read in business description
S&P offers various databases under its umbrella, including the most commonly used ones, Compustat (COMP) and Capital IQ (CIQ). This block of code reads in the long-format business description from CIQ and the short version from COMP. We will be using the long version as the corpus for this project.
Occasionally, a GVKEY can be mapped to multiple CompanyIDs (the identifiers used in the CIQ system). Here is one example:
To avoid double counting, if more than one companyid is mapped to a given GVKEY, I pick the larger companyid, as it is probably the more recent link. This yields a one-to-one mapping between GVKEY and CompanyID.
Let's inspect the short and long versions of the business description of the same company, using AAI Corporation as an example.
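The deduplication rule can be sketched in plain Python. This is only an illustration: the `links` records and the helper name `keep_largest_companyid` are hypothetical, and the notebook itself most likely applies the same rule to a dataframe.

```python
# Hypothetical link records: (gvkey, companyid). Values are made up for illustration.
links = [
    ("001000", 100),
    ("001000", 250),  # same GVKEY mapped to a second CompanyID
    ("002000", 300),
]


def keep_largest_companyid(pairs):
    """Return {gvkey: companyid}, keeping the largest companyid per gvkey."""
    best = {}
    for gvkey, companyid in pairs:
        if gvkey not in best or companyid > best[gvkey]:
            best[gvkey] = companyid
    return best


one_to_one = keep_largest_companyid(links)
print(one_to_one)
```

Treating the larger companyid as the more recent link is a heuristic, so a few spot checks against the link table are worthwhile.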
part 3: clean and prepare corpus
This is a standard cleaning process that can be applied to other text bodies. Users can modify the procedure to suit specific text cleaning needs. It performs the following common cleaning tasks:
convert text to lower case
keep only alphabetic characters (remove numbers and special characters)
remove stop words
form bigrams
lemmatize words
Let's use companies in the S&P500 Index as of 2020 as our universe:
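One way to pick an "as of" universe is to keep the membership spells that cover a chosen date. The sketch below assumes each constituent record carries a start and end date; the field layout, the sample values, and the mid-2020 cutoff are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical membership spells: (permno, start, ending). Values are made up.
spells = [
    (10001, date(1990, 1, 2), date(2015, 6, 30)),
    (10002, date(2005, 3, 1), date(2020, 12, 31)),
    (10003, date(2019, 9, 3), date(2021, 4, 15)),
]


def members_as_of(rows, as_of):
    """Return permnos whose membership spell covers the as_of date."""
    return [permno for permno, start, ending in rows if start <= as_of <= ending]


# Assumed cutoff date inside 2020; the source only says "as of 2020".
universe = members_as_of(spells, date(2020, 6, 30))
print(universe)
```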
Start cleaning. First, convert the busdesc (short) and businessdescription (long) columns into lists.
Below are several functions for textual cleaning tasks:
text_clean() - convert all texts to lower case and remove non-letter components
sent_to_words() - convert sentences to lists of words
remove_stopwords() - take out the stop words defined by NLTK
make_bigrams() - form bigrams if needed
lemmatization_str() - lemmatize the words and convert the output to list of strings
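As a rough illustration of what the first four helpers might look like, here is a standard-library sketch. The function bodies are assumptions rather than the notebook's implementation: the stop-word set is a tiny hand-written stand-in for NLTK's list, and `make_bigrams()` uses a simple pair-count threshold instead of Gensim's phrase model. Lemmatization is left out because it needs a language model such as SpaCy's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; an assumption standing in for NLTK's English list.
STOP_WORDS = {"the", "and", "of", "in", "a", "its", "to", "is"}


def text_clean(text):
    """Lower-case the text and replace every run of non-letters with one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def sent_to_words(texts):
    """Split each cleaned document into a list of tokens."""
    return [t.split() for t in texts]


def remove_stopwords(docs, stop_words=STOP_WORDS):
    """Drop stop words from each tokenized document."""
    return [[w for w in doc if w not in stop_words] for doc in docs]


def make_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times across the corpus."""
    pair_counts = Counter(p for doc in docs for p in zip(doc, doc[1:]))
    out = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        out.append(merged)
    return out


# Made-up sentences, not real business descriptions.
raw = [
    "The Company designs apparel in 50 countries.",
    "Its credit card and credit card services grew 12%.",
]
cleaned = [text_clean(t) for t in raw]
tokens = make_bigrams(remove_stopwords(sent_to_words(cleaned)))
print(tokens)
```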
Now let's apply these functions to the input list:
The cleaning process above should convert the input text to lower case, remove non-letter components, remove stop words, form bigrams, and lemmatize the words. Below is a comparison of the input text versus the cleaned text, using Ralph Lauren's business description as an example:
Raw input text:
Cleaned output text:
part 4: form bag of words
This block of code uses the vectorization tools offered in scikit-learn to form a bag-of-words (BOW) representation for each document in the corpus.
The BOW dataframe contains the following information:
word: focal word
freq: number of times a word appears in that document
company identifiers
Below is a subset of BOW for the company Apple. For instance, the word "apple" appears 11 times, and the word "market" appears once in Apple's business description obtained from CIQ.
Out of curiosity, I summarize below the 10 most frequent words across all S&P500 companies' business descriptions. Researchers might want to extend the generic stop words defined by NLTK with context-specific stop words, such as "company" or "inc" in this case.
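To make the BOW layout concrete, the sketch below builds the long-format records (identifier, word, freq) with `collections.Counter` and then ranks corpus-wide word totals after dropping extra stop words. `Counter` is a plain-Python stand-in for the scikit-learn vectorizer used in the notebook, and the tickers and tokens are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned token lists keyed by ticker; not real CIQ text.
cleaned = {
    "AAA": ["company", "design", "phone", "phone", "service"],
    "BBB": ["company", "bank", "loan", "service", "bank"],
}

# Long-format BOW: one record per (ticker, word) with its in-document frequency.
bow = [
    {"ticker": ticker, "word": word, "freq": freq}
    for ticker, tokens in cleaned.items()
    for word, freq in sorted(Counter(tokens).items())
]

# Corpus-wide totals across all documents.
totals = Counter()
for row in bow:
    totals[row["word"]] += row["freq"]

# Context-specific stop words to exclude before ranking (illustrative choice).
extra_stop_words = {"company", "inc"}
top_words = [(w, n) for w, n in totals.most_common() if w not in extra_stop_words][:3]
print(top_words)
```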
part 5: similarity based on bow
This section relies on the vectors created from the BOW and calculates the pairwise similarity scores among companies in this universe. Before diving into the full sample, let's inspect this approach using individual companies: Disney, Microsoft, Wells Fargo and Citi.
The pairwise similarity scores among these companies are reported below. As anticipated, the two financial institutions, Wells Fargo and Citi, display a relatively high similarity score based on the BOW formed from their respective business descriptions.
Now let's apply the similarity score calculation to the entire sample:
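The text does not name the similarity measure. A common choice for BOW vectors is cosine similarity, so the sketch below assumes that. It is a standard-library illustration with made-up documents and placeholder labels, not the scikit-learn call used in the notebook.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical token lists; the labels are placeholders, not real company text.
docs = {
    "bank_1": ["bank", "loan", "deposit", "credit", "bank"],
    "bank_2": ["bank", "credit", "card", "loan"],
    "media": ["film", "park", "stream", "network"],
}
vectors = {name: Counter(tokens) for name, tokens in docs.items()}

# Full pairwise matrix stored as {(row, column): score}.
names = sorted(vectors)
scores = {
    (x, y): cosine_similarity(vectors[x], vectors[y])
    for x in names
    for y in names
}
print(round(scores[("bank_1", "bank_2")], 3))
```

Documents that share no words score exactly zero, which is one reason BOW similarity can miss peers that describe the same business with different vocabulary.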
Tidy up the output and sort the results in descending order to find the most similar company for each focal company.
Then for each of the S&P 500 companies, find its most similar peer based on BOW similarity score:
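A minimal sketch of this step, assuming the pairwise scores have already been reshaped into long-format (focal, peer, score) rows. The tickers, scores, and the helper name `most_similar_peer` are hypothetical.

```python
# Hypothetical pairwise scores in long format: (focal, peer, score). Values are made up.
pairs = [
    ("AAA", "AAA", 1.00), ("AAA", "BBB", 0.62), ("AAA", "CCC", 0.15),
    ("BBB", "AAA", 0.62), ("BBB", "BBB", 1.00), ("BBB", "CCC", 0.40),
    ("CCC", "AAA", 0.15), ("CCC", "BBB", 0.40), ("CCC", "CCC", 1.00),
]


def most_similar_peer(rows):
    """Return {focal: (peer, score)} with the highest-scoring peer, ignoring self-pairs."""
    # Drop self-pairs, then sort by focal company and by score in descending order.
    ranked = sorted((r for r in rows if r[0] != r[1]), key=lambda r: (r[0], -r[2]))
    best = {}
    for focal, peer, score in ranked:
        best.setdefault(focal, (peer, score))  # first row per focal is its top peer
    return best


top_peer = most_similar_peer(pairs)
print(top_peer)
```

Self-pairs have to be removed first; otherwise every company would be its own closest match with a score of 1.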
I report below the most similar company for some familiar tickers. Based on the BOW pairwise similarity score, some companies have clear peers (e.g. GM vs F, or C vs JPM), while others are matched to a peer outside their dominant industry (e.g. FB vs RMD, or AMZN vs ATVI). The latter case often happens among companies that operate in multiple fields.
part 6: similarity based on doc2vec
This is an alternative approach to calculating the similarity score, and it relies on Doc2Vec from the Gensim package. Please note that the sample size of this toy project is probably too small for Doc2Vec, but I nevertheless include it here for demonstration purposes.
Again, let's first inspect the pairwise similarity scores among the four individual companies we examined before.
Let's now expand the analysis to the entire sample of all 500 companies:
And then pick out the most similar company:
Again, let's inspect the same group of companies as before:
And finally, let's put the BOW and Doc2Vec results side by side for comparison. In some instances, both approaches yield the same result for the focal company (e.g. ABT vs BDX, NCLH vs CCL, UAL vs AAL, WMT vs BBY); in others, they return different candidates that generally belong to the same industry.
part 7: topic classification using lda
to be continued...