My Love/Hate Relationship with Text Wordles and My Newfound Improvement
By Richard B. Lanza
I find that textual analytics always produces something interesting, yet the tools in use can take too long or miss the subtleties in the data outright. Wordles, for example, provide a useful image of the words, yet they are not effective at identifying deviations over time or against expected language benchmarks. To address this need, this blog and a new research brief outline a new approach based on letters. By analyzing letters rather than words, the analyst can make more effective use of their time while performing a more complete deviation analysis of the text at hand. Quality and efficiency both increase. But first, let’s understand more about the common Wordle and why I love and hate it all at once.
Wordles, otherwise called “word clouds,” are images that give greater prominence to words that appear more frequently in the source text. For example, the Wordle below (Image 1) covers every tragedy play written by William Shakespeare, with the most frequent words drawn in the largest type:
Image 1 – Shakespeare’s Tragedy Plays
What I love about Wordles is their simplicity: they quickly show the most frequently occurring words and mesmerize everyone in the room as they peer into the image. But that same minimalism lets a data analyst miss the many subtleties within Shakespeare’s text, or any other. More specifically, while the above image displays roughly 200 words, it misses the other 14,328 unique words used in the tragedy plays.
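The counting behind a word cloud can be sketched in a few lines of Python. This is an illustration only, not the code behind the image above: the function names and the tokenizing rule (lowercase letters with an optional apostrophe part) are assumptions, and the sample sentence stands in for the text of a full play.

```python
import re
from collections import Counter

def word_counts(text):
    """Count each word after lowercasing; a word is letters plus an optional apostrophe part."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

def shown_and_hidden(counts, top_n):
    """Return how many unique words a top_n word cloud shows and how many it leaves out."""
    shown = min(top_n, len(counts))
    return shown, len(counts) - shown

# Stand-in text; a real run would pass the full text of the plays.
sample = "to be or not to be that is the question"
counts = word_counts(sample)
top_words = counts.most_common(2)   # the words a two-word cloud would draw largest
shown, hidden = shown_and_hidden(counts, 2)
print(top_words, shown, hidden)
```

With the tragedy plays and a cloud of roughly 200 words, the hidden count is the figure quoted above: the thousands of unique words that never reach the image.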
Comparing Tragedy to Comedy Plays With Wordles
To understand how comparisons can be made between Wordles, let’s now look at the collection of Shakespeare’s comedy plays (17,882 unique words), summarized in another 150 to 200 words below (Image 2). As you scroll back up to Image 1 and then return to this image, can you see the differences?
Image 2 – Shakespeare’s Comedy Plays
In your analysis you will most likely realize quickly that:
· The words displayed tend to be the most common words in the English language, otherwise known as function words (e.g., the, to, and, that)
· The next most used words are pronouns (e.g., you, your, I, me)
· Few content words, such as names of people, actions taken or specific nouns, rise out of the Wordle
So, you may ask yourself, how can a Wordle chart be so appealing and yet so flawed in its design? To understand this, we must first realize that a Wordle is meant to focus on the top occurring words, as our screens are not large enough to handle thousands of words. Also, in both the comedy and tragedy plays, roughly half of the words appear only once in each play, and a single occurrence is no match for the word “the,” which appeared 7,604 and 10,920 times in the tragedy and comedy plays, respectively.
Now, assume we remove the top 25 words (which make up 30% of word occurrences), as seen below for both the tragedy (Image 3) and comedy (Image 4) plays:
Image 3 – Shakespeare’s Tragedy Plays Less Top 25
Image 4 – Shakespeare’s Comedy Plays Less Top 25
While this is a noticeable improvement, the images still focus mostly on top function words and pronouns. They do begin to show some content words that were previously unseen, but we are still missing thousands of words in the analysis. This presents a scope limitation to the analyst and the user, which is why we need to find a better way.
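The effect of stripping the top words can be measured rather than eyeballed. The Python sketch below, offered as an illustration under stated assumptions, computes the share of occurrences held by the top N words, removes those words, and reports the fraction of unique words that occur only once. The function names and the toy counts are hypothetical and are not taken from the plays.

```python
from collections import Counter

def top_share(counts, top_n):
    """Share of all word occurrences that belong to the top_n most frequent words."""
    total = sum(counts.values())
    return sum(n for _, n in counts.most_common(top_n)) / total

def without_top(counts, top_n):
    """Copy of the counts with the top_n most frequent words removed."""
    drop = {word for word, _ in counts.most_common(top_n)}
    return Counter({w: n for w, n in counts.items() if w not in drop})

def single_occurrence_share(counts):
    """Fraction of unique words that occur exactly once."""
    return sum(1 for n in counts.values() if n == 1) / len(counts)

# Toy counts for illustration only.
counts = Counter({"the": 6, "and": 3, "king": 2, "crown": 1, "sword": 1})
print(round(top_share(counts, 2), 3))    # share held by "the" and "and"
print(sorted(without_top(counts, 2)))    # what is left for the next word cloud
print(single_occurrence_share(counts))   # unique words seen only once
```

On real play text, `top_share(counts, 25)` is the kind of calculation behind the 30% figure quoted above, and the remaining words are what Images 3 and 4 draw from.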
A Newfound Comparative Approach Using Letter Analytics
First, we should back up to how we obtained the data for analysis. The texts came from the MIT website hosting Shakespeare’s 37 plays and sonnets (http://shakespeare.mit.edu/). With help from an experienced data scientist (James Patounas from Source One Management Services, LLC) and his Python skills, we were able to quickly scrape the text from the MIT pages and organize all 37 plays for analysis.
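The scraping code itself is not reproduced in this post, so the following is only a rough sketch of how such a step might look using Python's standard library. The class and function names are hypothetical, the page URL is left as a parameter because the site's page layout is not described here, and the sample HTML is made up for the demonstration.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_words(html):
    """Strip tags from an HTML string and return its lowercase words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)

def fetch_words(url):
    """Download one page and tokenize it (needs network access; not called below)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return html_to_words(html)

# Made-up HTML fragment standing in for a downloaded page.
page = "<html><body><h3>ACT I</h3><blockquote><a name='1'>Who's there?</a></blockquote></body></html>"
print(html_to_words(page))
```

A sketch this simple keeps everything on the page, including speaker names and stage directions, so a real analysis would need to decide whether those belong in the word counts.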
The dashboard image below (Image 5), a vast improvement over a Wordle, presents all 37 plays across the categories of tragedy, comedy and now, history. Instead of making word pictures, the dashboard focuses on letter bars and, more specifically, the first two letters of each word. A separate chart of two-letter combinations was developed for each of the 26 letters (Image 6 below is an example of the letter K in isolation). That amounts to 702 two-letter combinations (AA to ZZ) that fit within the frame of 26 single letters (A to Z). Unlike the Wordle, which could not represent the word changes between the play types, the letter analytic dashboard can do so and present the results on one screen.
Isolating each letter before analyzing its two-letter combinations made it possible to detect deviations in lower occurring letters (e.g., the letter X), rather than having high occurrence letters (e.g., the letter T, driven by the word “the”) dilute the analysis in the dashboard. In essence, whether there are 14,528 unique words in the tragedy plays, 17,882 unique words in the comedy plays or 100,000 unique words in a data set of your choice, all are reduced to the 702 combinations within the frame of 26 letters for improved review.
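As a quick check on the size of that frame, the short Python sketch below enumerates the combinations. One reading that reproduces 702, and it is only an assumption since the post does not spell out the count, is that each letter's chart holds the letter on its own (for one-letter words such as “a” and “I”) plus its 26 two-letter pairs.

```python
from itertools import product
from string import ascii_uppercase

# All two-letter pairs AA..ZZ.
pairs = ["".join(p) for p in product(ascii_uppercase, repeat=2)]

# Assumed layout: one frame per letter, holding the letter alone plus its 26 pairs.
frames = {
    letter: [letter] + [letter + second for second in ascii_uppercase]
    for letter in ascii_uppercase
}

total_slots = sum(len(slots) for slots in frames.values())
print(len(pairs), len(frames), total_slots)   # 676 pairs, 26 frames, 702 slots
```

Under this reading, the pairs AA to ZZ alone number 676, and the single letters A to Z account for the remaining 26.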
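To make the idea of isolating each letter concrete, here is a minimal Python sketch. It is an illustration of the reasoning above rather than the LALA implementation: the function names are hypothetical, the counts are toy numbers, and weighting by word occurrences (rather than by unique words) is an assumption.

```python
from collections import Counter, defaultdict

def prefix_of(word):
    """First two letters in uppercase, or just the first if the second is not a letter."""
    w = word.upper()
    return w[:2] if w[:2].isalpha() else w[:1]

def prefix_profile(word_counts):
    """Map each first letter to {prefix: share of that letter's word occurrences}."""
    by_letter = defaultdict(Counter)
    for word, n in word_counts.items():
        p = prefix_of(word)
        if p.isalpha():                      # skip empty or non-letter tokens
            by_letter[p[0]][p] += n
    return {
        letter: {p: n / sum(prefixes.values()) for p, n in prefixes.items()}
        for letter, prefixes in by_letter.items()
    }

# Toy counts: "the" dominates the text, "king" and "know" are rare.
toy = {"the": 90, "to": 10, "king": 3, "know": 1}
profile = prefix_profile(toy)
print(profile["T"])   # shares within the letter T only
print(profile["K"])   # shares within the letter K only
```

Because each share is computed within its own letter, the 100 occurrences under T have no effect on the K chart, where KI still stands out at three quarters of the letter's occurrences.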
Image 5 – Shakespeare Plays – Tragedy, Comedy and History
Using this approach, entitled the “Lanza Approach to Letter Analytics” (“LALA”), there are only 23 visual differences, meaning only 3% of the 702 two-letter combinations (see the red arrows in Image 5) show a noticeable visual change. Some noticeable variances were due to names of people or places in the plays, such as “JU” for Juliet, “RO” for Romeo, “BR” for Brutus, “RI” for Richard or “GL” for Gloucester.
For something a little more interesting, the word “king” in its many forms (king, kings, kingdom, etc.) appeared 2,186 times in the 10 history plays, which center on various kings. The word “king” represented 0.3% of the roughly 825,000 word occurrences in all of Shakespeare’s plays yet, as can be seen below (Image 6), it produces a noticeable 25% deviation for the letter K in the first-two-letter combination “KI”. Also seen in Image 6, the deviation of “KN” for history plays is due mainly to a drop in all forms of the word “know,” which appeared more frequently in the tragedy and comedy plays, where there is more discussion of what people know and, therefore, more introspection.
Image 6 – Shakespeare Plays and the Letter “K”
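A comparison like the one in Image 6 can also be expressed as a simple rule. The Python sketch below flags a two-letter combination when its share within a letter, for one play category, sits far from the average share across categories. Everything here is an assumption made for illustration: the function name, the choice of the cross-category average as the benchmark, the threshold, and the share values, which are invented and are not the figures behind the dashboard.

```python
def flag_deviations(profiles, threshold):
    """Return (prefix, category, gap) where a category's share differs from the
    average share across categories by more than threshold."""
    prefixes = sorted({p for shares in profiles.values() for p in shares})
    flags = []
    for p in prefixes:
        values = {cat: shares.get(p, 0.0) for cat, shares in profiles.items()}
        mean = sum(values.values()) / len(values)
        for cat in sorted(values):
            gap = values[cat] - mean
            if abs(gap) > threshold:
                flags.append((p, cat, round(gap, 3)))
    return flags

# Invented within-letter shares for the letter K, by play category.
profiles = {
    "tragedy": {"KI": 0.40, "KN": 0.50, "KE": 0.10},
    "comedy":  {"KI": 0.35, "KN": 0.55, "KE": 0.10},
    "history": {"KI": 0.65, "KN": 0.25, "KE": 0.10},
}
flags = flag_deviations(profiles, threshold=0.15)
for prefix, category, gap in flags:
    print(prefix, category, gap)
```

With these invented numbers, only the history plays are flagged: KI runs above the average and KN runs below it, which mirrors the pattern described for “king” and “know” above.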
To view the entire research brief on this new letter analytic approach, LALA, visit the International Institute for Analytics website.
Rich Lanza, CPA, CFE, CGMA (www.richlanza.com), has over 25 years of audit and fraud detection experience with specialization in data analytics and cost recovery efforts. Rich wrote the first book on practical applications of data analytics in an audit environment, titled 101 ACL Applications: A Toolkit for Today’s Auditor, in addition to writing over 19 publications and numerous articles. Rich is proficient and consults in the practical use of analytic software including ACL, ActiveData for Excel, Arbutus Analyzer, IDEA, TeamMate Analytics and auditing with Microsoft Excel techniques. Rich has been awarded by the Association of Certified Fraud Examiners for his research on proactive fraud reporting. He is also a regular presenter for CFO.com, the Institute of Internal Auditors, the Association of Certified Fraud Examiners, AuditNet®, Lorman, and Fraud Resource Net LLC. Rich consults with companies ranging in size from $30 million to $100 billion and, in all, has helped them find money through the use of technology and recovery auditing. He is also a current faculty member with the International Institute for Analytics.