Loughran McDonald-SA-2020 Sentiment Word List

dataset

posted on 2021-04-12, 08:42 authored by Michelle Terblanche, Vukosi Marivate

The Loughran and McDonald Sentiment Word Lists were developed using corporate 10-K reports from 1994 to 2008 [14]. These reports apply to companies in the United States of America and are required by the U.S. Securities and Exchange Commission (SEC) [14].

The motivation for building the LM-SA-2020 word list came from an experiment that used the original lists to detect sentiment-carrying words in South African financial article headlines. A corpus of 808 financial articles (relating to Sasol) was used, and only 37% of headlines contained words whose sentiment in the Loughran and McDonald Sentiment Word Lists matched the ground-truth labels. This identified a gap: a method was needed for predicting the sentiment of financial articles in a South African context.

Given the size of the data set, it was possible to manually examine the headlines and identify sentiment-carrying words to add to the original word lists. Furthermore, synonyms of the existing words in the Loughran and McDonald Sentiment Word Lists were added using NLTK’s WordNet [16] interface. The sentiment detection/prediction accuracy improved by 29% using the new word list.

This sentiment word list can be further expanded and improved in future by increasing the size of the data set and/or including data from other companies. It highlights the need not only for domain-specific sentiment prediction tools but also for region-specific corpora.
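The synonym expansion and word-list matching described above can be sketched roughly as follows. This is a minimal illustration and not the authors' actual code: the function names (`wordnet_synonyms`, `expand_word_list`, `headline_sentiment`), the simple count-based scoring rule, and the whitespace tokenisation are all assumptions made for the example. The only library calls used are `wordnet.synsets`, `Synset.lemmas` and `Lemma.name` from NLTK, which need the WordNet corpus to be downloaded first.

```python
from typing import Callable, Iterable, Set


def wordnet_synonyms(word: str) -> Set[str]:
    """Return single-word WordNet synonyms of `word` (all senses, unfiltered).

    Requires `nltk` plus the WordNet corpus: nltk.download("wordnet").
    """
    # Imported lazily so the other helpers can be used without NLTK installed.
    from nltk.corpus import wordnet

    names: Set[str] = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().lower()
            if "_" not in name:  # skip multi-word lemmas such as "write_off"
                names.add(name)
    return names


def expand_word_list(
    words: Iterable[str],
    lookup: Callable[[str], Set[str]] = wordnet_synonyms,
) -> Set[str]:
    """Hypothetical helper: add the synonyms of every seed word to the list."""
    seeds = {w.lower() for w in words}
    expanded = set(seeds)
    for word in seeds:
        expanded |= lookup(word)
    return expanded


def headline_sentiment(headline: str, positive: Set[str], negative: Set[str]) -> str:
    """Assumed scoring rule: positive hits minus negative hits decides the label."""
    tokens = [t.strip(".,:;!?\"'()").lower() for t in headline.split()]
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

In use, each sentiment category of the original lists would be passed through `expand_word_list`, and the expanded sets handed to `headline_sentiment` for each headline. Because the lookup takes synonyms from every WordNet sense without disambiguation, a word can end up in both the positive and the negative set, so some manual review of the expanded lists is likely still needed.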