iclink - Link IBES & CRSP
This Python code builds a linkage between IBES data (containing information on company earnings and analysts forecasts) and CRSP data (containing price and return information). It builds the linkage in two layers:
linking through CUSIPs
linking through Tickers
As CUSIPs are more reliable company identifiers, we first try to match as much as possible through it. For the remaining ones that are not matched through CUSIP, we turn to TICKER as last resort. To impose additional layer of quality check, I add a company name matching layer on top of matching through CUSIPs and TICKERs. Name matching is done through FuzzyWuzzy package, but there can be many other fuzzy name matching methods.
First, import the Python packages needed for this exercise and establish connection with WRDS server.
We start the linking exercise through CUSIP matching.
The block of code below extracts company identifying information from IBES, including CUSIP, TICKER, company names, and date ranges.
We repeat the same exercise on the CRSP front, to extract PERMNO, CUSIP, company names and date ranges.
Next step is to link both through matching CUSIPs.
To guarantee that the matched pairs are indeed the same company, additional layer sanity check of company name matching is added below.
With the name_ratio created, one needs to make a subjective call as to what degree of matched company names is considered "good" match. In the code below, I used 10% as cut off point, and apply that in the function "score1" when assigning quality scores to the matched links based on the consideration of quality of the company name matching, consistency of date ranges from both IBES and CRSP sources.
Let's inspect the distribution of the score created by fuzzy name matching:
The second part of the linking exercises uses TICKER as linking key among the stocks that do not match through the CUSIP exercise above. We start by identifying the stocks that are not matched.
We then extract identifying information from CRSP.
Using the exchange TICKERs as keys, try merging the unmatched stocks. Similar to the approach in the CUSIP matching stage, on top of TICKER matching, I include additional layer of check by comparing linked pair's company names information, and assign a cutoff score at 10% of the distribution.
Last step is to combine the linking results from CUSIP and TICKER. As this linking table can potentially be called upon often from other research programs, it is a good idea to store it as a static table for easy future retrieving.
Below are two figures displaying the distribution of the linking score and "name_ratio" from this code, which give more intuition on how well these two databases are linked.