How to Measure Lexical Diversity with R and Python, and Why You Would Want To
Lexical diversity (LD) has been shown to correlate strongly with the scores learners receive on their L2 written compositions (e.g., Crossley & McNamara, 2011; Henriksen & Danelund, 2016; Monteiro, Crossley, & Kyle, 2020) as well as with their spoken proficiency (Clenton et al., 2020; De Jong, Groenhout, Schoonen, & Hulstijn, 2013). For this reason, measures of lexical diversity are often used as general-purpose measures of learners' spoken and written language (Malvern et al., 2004) or as tools for measuring the complexity of learner-produced texts at the lexical level (Housen et al., 2012).
However, despite its usefulness, several issues surround the use of lexical diversity in language research, including which measures to use (Jarvis, 2013), the relationship between lexical diversity and text length (Treffers-Daller et al., 2018), and the differences that may arise when measuring the lexical diversity of texts produced by learners of different language backgrounds or levels of linguistic proficiency (Zenker & Kyle, 2021). Furthermore, the online tools available for measuring lexical diversity (for example, Text Inspector), while useful, can be difficult to work with when comparing different measures of lexical diversity across multiple texts.
This workshop aims to help researchers address these issues by introducing participants to the theoretical foundations behind the different measures of lexical diversity and having them begin to measure lexical diversity themselves using both R and Python. Participants will first learn how to import, clean, and measure texts for lexical diversity using R and RStudio. This includes learning how to import and prepare the texts, analyze them using different measures of lexical diversity, and export the results for use in subsequent statistical analysis. The workshop will finish with a brief introduction to how a similar analysis can be done in Python and to the Python packages available for preparing and processing texts for lexical diversity. It is hoped that by the end of the workshop, participants will have a basic understanding of how these tools can be used in their own research contexts.
Participants will learn how to: install and load a library in RStudio; open and read a text file in R; lemmatize the text; write a simple script to count the types and tokens in the lemmatized text and calculate its lexical diversity using TTR and Guiraud's Index; write a simple script to calculate the lexical diversity of the file using MTLD; use a for-loop to repeat the process over multiple files; and export the results to a CSV file.
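As a rough illustration of the type and token counting steps listed above, the following Python sketch computes TTR (types divided by tokens), Guiraud's Index (types divided by the square root of tokens), and a simplified MTLD. It is not taken from the workshop materials: the function names are hypothetical, the tokenizer is a plain regular expression with no lemmatization, and the MTLD version assumes the commonly used 0.72 threshold and omits refinements such as a minimum segment length that published implementations may apply.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (no lemmatization)."""
    return re.findall(r"[a-z']+", text.lower())


def ttr(tokens):
    """Type-token ratio: number of unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def guiraud(tokens):
    """Guiraud's Index: number of types divided by the square root of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else 0.0


def _mtld_one_direction(tokens, threshold=0.72):
    """Count 'factors': segments whose running TTR drops below the threshold."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:
            # The segment is "used up": count one full factor and start again.
            factors += 1
            types = set()
            count = 0
    if count > 0:
        # Credit the leftover segment as a partial factor.
        factors += (1 - len(types) / count) / (1 - threshold)
    # Assumption: return 0.0 when no factor was completed (e.g. no repeated words).
    return len(tokens) / factors if factors > 0 else 0.0


def mtld(tokens, threshold=0.72):
    """Simplified MTLD: mean of the forward and backward passes over the text."""
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2


if __name__ == "__main__":
    sample = tokenize("The cat sat on the mat and the dog sat on the log.")
    print("Tokens:", len(sample), "Types:", len(set(sample)))
    print("TTR:", round(ttr(sample), 3))
    print("Guiraud:", round(guiraud(sample), 3))
    print("MTLD:", round(mtld(sample), 3))
```

Because TTR falls as texts get longer, the same functions can be applied to texts of different lengths to see why length-adjusted measures such as Guiraud's Index and MTLD are often preferred.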
Gavin Brooks is an Associate Lecturer at Nagoya University of Commerce and Business in Japan. His research interests include second language vocabulary, especially the vocabulary needs of EAL students, and measures of lexical diversity and frequency in learner-produced texts. He has also researched the development and use of learner corpora in both EFL and ESL contexts.