Who is Wylie H. Dallas?

Summary

Wylie H. Dallas is an alter ego: someone posing as a Dallas-area political gadfly. I’ve always suspected Wylie is someone’s outlet for things that cannot be said under his or her public persona.

I analyzed the tweets of 52 Dallas-area media personalities, plus Wylie and Philip Kingston. I included Philip because he was a prominent councilmember back when I started working on this, and Wylie appeared well aligned with Philip’s causes.

This analysis produced a scoring system, which shows that Jim Schutze’s word use, in his tweets, is the closest match for Wylie’s. I wonder if Jim Schutze knows more about Wylie than he lets on. Wylie says Schutze is his favorite journalist.

Technical details

I did this analysis with R, a free software environment that is popular with the data science crowd, especially for analysis related to the humanities, social sciences, and statistics.

The rest of this document is my explanation of the analysis and the results. I was inspired by (and in a few cases stole code from) David Robinson’s similar analysis of who wrote the “I Am Part of the Resistance” op-ed about the Trump administration.

First, I load some libraries. These have code, built by others, that I use throughout this analysis.

# based on http://varianceexplained.org/r/op-ed-text-analysis/
library(rtweet)
library(tidyverse)
library(tidytext)
library(knitr)
library(kableExtra)
library(widyr)

I set up a Twitter access token. This allows me to pull data out of Twitter using its API.

token <- create_token(
  app = "app name goes here",
  consumer_key = "API key goes here",
  consumer_secret = "API secret key goes here",
  access_token = "access token goes here",
  access_secret = "access token secret goes here")

Sorry, my Twitter keys are not included, but you can get your own! (That article may still show an earlier version of Twitter’s API sign-up process. You can probably figure it out if you bang your head against the wall enough!)

Next I get a Twitter list of media personalities from Advocamentum:

source("tokens.R")
advocamentum_news_media <- lists_members(owner_user = "Advocamentum", slug = "news-media")

Why Advocamentum’s list? Because Wylie subscribes to it, and Wylie’s other subscribed lists don’t seem like plausible sources.

Advocamentum appears to be an inactive account. Since it was last updated, media personalities have come and gone from the Dallas market. It is unlikely any of them are or were Wylie, unless we can spot a shift in his writing style that a handover would explain.

I am also analyzing Wylie’s tweets, and I included Philip Kingston because he seems to politically align with Wylie:

advocamentum_news_media <- advocamentum_news_media %>%
  add_row(name = "Wylie H. Dallas",
          screen_name = "Wylie_H_Dallas") %>%
  add_row(name = "Philip Kingston",
          screen_name = "PhilipTKingston")

Now let’s get all of their tweets. The Twitter API limits you to 3200 tweets per account you are pulling tweets from. Let’s roll:

# this takes a long time
tweets <- map_df(advocamentum_news_media$screen_name,
                 get_timeline,
                 n = 3200)

# Save the tweets to a file. That way, next time you analyze, you just load the file with load("tweets.Rda").
save(tweets, file="tweets.Rda")

tweets is a data frame, R’s tabular data structure, conceptually similar to an Excel spreadsheet.
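If you want to peek at its shape, here are a couple of quick checks (the exact column names vary across rtweet versions):

dim(tweets)     # rows = tweets fetched, columns = metadata fields per tweet
glimpse(tweets) # from dplyr: lists each column with its type and first few values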

I ran the above code on three occasions: December 25, 2018; May 7, 2019; and May 11, 2020. Each time, I saved my downloaded tweets to an Rda file with the date embedded in the file name.
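The save step for each run would have looked something like this (a reconstruction; I’m inferring the file name format from the load calls below):

# Embed the run date in the file name, e.g. tweets_20181225.Rda,
# so each snapshot is preserved separately.
save(tweets, file = paste0("tweets_", format(Sys.Date(), "%Y%m%d"), ".Rda"))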

I will load all three tweet datasets and merge them. The datasets are from different times: the December 25, 2018 one is from a holiday season in what is otherwise business as usual; the May 7, 2019 one is from a major city political cycle; and the May 11, 2020 one is from the middle of a pandemic. My theory is that if we can detect similarities between Wylie and any other personalities, they will endure through all three datasets.

I previously collected all this data and saved the data to files. Here, I load that data back into R:

load("tweets_20181225.Rda")
tweets_20181225 <- tweets # first of two-step object rename
rm(tweets) # second rename step

load("tweets_20190507.Rda")
tweets_20190507 <- tweets
rm(tweets)

load("tweets_20200511.Rda")
tweets_20200511 <- tweets
rm(tweets)

Here’s an example of the data:

tweets_20200511 %>%
  sample_n(10) %>%
  select(screen_name, created_at, text) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
screen_name created_at text
JeffSmithi24 2019-12-27 08:01:17 All votes counted: Netanyahu: 41,792 (72.5%) Saar: 15,885 (27.5%) https://t.co/OWf11rjRrq
MonicaTVNews 2018-09-11 04:30:00 #WeWillNeverForget 9/11 https://t.co/FjNFxuOs5d
PhilipTKingston 2020-04-07 16:55:22 Cancer surgery canceled because the hospital doesn’t have enough tests or PPE. This virus is killing people who don’t have the virus. https://t.co/qWaMjgW1gm
JohnnyNBC6 2015-08-27 19:20:54 We are en route to a possible @DallasPD involved shooting near Fair Park. ETA: 15 minutes. More info to come @NBCDFW
JohnnyNBC6 2015-12-27 08:32:39 @THETXEMBASSY @GarlandTX_ @NBCDFW @GarlandPD yeah. This was just an awful situation. So sad.
RayLeszcynski 2016-12-16 01:05:25 @RayLeszcynski Here’s GuideLive’s list of the new eateries coming to The Star in Frisco: https://t.co/WQvxZJlZIm
CourtneyNBC5 2018-03-29 11:15:59 What was your favorite cereal growing up as a kid? The options on the cereal aisle are a lot healthier… What’s for breakfast?? @NBCDFW https://t.co/GpVzVyGWyT
ahuguelet 2019-06-01 03:30:20 UPDATE: Police arrested five people during #sgf abortion protests that “turned violent” Friday. https://t.co/PPQP4lnZPo
jmchiquillo 2015-10-27 01:03:16 Drain: I’m 50 yrs old, remember how different things were even 30 yrs ago. Things are better now than they’ve ever been.
PappalardoJoe 2019-05-05 00:23:24 NASA had a “vested interest” in getting the first stage back from this morning’s launch. It’s slated to fly again on the next SpaceX resupply flight to ISS in July, and potentially on the one after that in December. https://t.co/lodm5wkfxr

Now we get to the fun part. Let’s tease out the words that are distinct to each person.

Before I do that, I need to filter the data. Wylie H. Dallas is a prolific tweeter. Because Twitter’s API limits me to pulling a user’s most recent ~3200 tweets, the time range of Wylie’s ~3200 tweets will be considerably narrower than that of many others on the list. Here’s a plot that demonstrates this:

# This helps me color Wylie's column red. ggplot2 sorts
# the 55 names alphabetically, and he is number 54.
x_colors = rep("#000000", 55)
x_colors[54] = "red"

bind_rows(tweets_20190507, tweets_20181225, tweets_20200511) %>%
  ggplot(aes(x=screen_name, y=created_at, color=screen_name)) +
  geom_point(alpha = 0.05) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_color_manual(values=x_colors) +
  theme(legend.position="none")

[Plot: time periods for which I have tweets for each person]

You can discern three clusters of Wylie’s tweets, one from each of the three datasets. He is so prolific that each tweet set hits the ~3200 limit without reaching very far back in time. Less-prolific tweeters go much further back before hitting the limit. For example, Robert Wilonsky’s tweets go back to roughly 2011, so the volume Wilonsky produced over a nine-year span approximates what Wylie produces in just three weeks.
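If you want the numbers behind that plot, here’s a quick per-account summary (a sketch over the combined, unfiltered datasets):

# For each account, how far back the ~3200-tweet limit reaches, and the total pulled.
bind_rows(tweets_20181225, tweets_20190507, tweets_20200511) %>%
  group_by(screen_name) %>%
  summarise(earliest = min(created_at),
            latest   = max(created_at),
            n_tweets = n()) %>%
  arrange(earliest)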

I prefer to limit the analysis to tweets that fall within the time frames of Wylie’s tweets. Why? Because Wylie’s tweets cover political topics and the minutiae of current events, his word choice is likely to vary with time. Anyone whose tweets correspond to Wylie’s should show similar word-use variations. We’ll check this theory a little later.

Let’s get the exact time stamps of Wylie’s first and last tweets in each dataset; then we’ll filter all tweets by those dates:

wylie_first_tweet_20181225 <- min(tweets_20181225 %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  pull(created_at))
wylie_last_tweet_20181225 <- max(tweets_20181225 %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  pull(created_at))

wylie_first_tweet_20190507 <- min(tweets_20190507 %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  pull(created_at))
wylie_last_tweet_20190507 <- max(tweets_20190507 %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  pull(created_at))

wylie_first_tweet_20200511 <- min(tweets_20200511 %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  pull(created_at))
wylie_last_tweet_20200511 <- max(tweets_20200511 %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  pull(created_at))

For all datasets, we are looking at around three months of Wylie’s data. Now let’s filter each dataset:

tweets_20181225_filtered <- tweets_20181225 %>%
  filter(created_at >= wylie_first_tweet_20181225 &
           created_at <= wylie_last_tweet_20181225)

tweets_20190507_filtered <- tweets_20190507 %>%
  filter(created_at >= wylie_first_tweet_20190507 &
           created_at <= wylie_last_tweet_20190507)

tweets_20200511_filtered <- tweets_20200511 %>%
  filter(created_at >= wylie_first_tweet_20200511 &
           created_at <= wylie_last_tweet_20200511)
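As an aside, the six min/max assignments plus the three filters above could be collapsed into one helper. A sketch of an equivalent, behavior-preserving version:

# Trim a dataset to the window spanned by Wylie's first and last tweets in it.
trim_to_wylie_window <- function(tweetset) {
  window <- tweetset %>%
    filter(screen_name == "Wylie_H_Dallas") %>%
    pull(created_at) %>%
    range()
  tweetset %>%
    filter(created_at >= window[1], created_at <= window[2])
}

tweets_20181225_filtered <- trim_to_wylie_window(tweets_20181225)
tweets_20190507_filtered <- trim_to_wylie_window(tweets_20190507)
tweets_20200511_filtered <- trim_to_wylie_window(tweets_20200511)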

Wow, that eliminated the vast majority of our tweets, cutting the data from over 480,000 tweets to about 55,000!

Before we go further, let’s remove Michael Lindenberger’s and Monica Hernandez’s tweets since they left the Dallas market:

# Remove Michael Lindenberger's and Monica Hernandez's tweets since they are no longer in Dallas
removeLostSouls <- function(tweetset) {
  return(tweetset %>%
           filter(!(screen_name == "Lindenberger")) %>%
           filter(!(screen_name == "MonicaTVNews")))
}

tweets_20181225_filtered <- removeLostSouls(tweets_20181225_filtered)
tweets_20190507_filtered <- removeLostSouls(tweets_20190507_filtered)
tweets_20200511_filtered <- removeLostSouls(tweets_20200511_filtered)

Next, I create a new data frame that has each word in its own row:

tweet_words <- bind_rows(tweets_20181225_filtered, tweets_20190507_filtered, tweets_20200511_filtered) %>%
  # Remove retweets. Those don't reflect the author's own words.
  filter(!is_retweet) %>%
  # This sorts everything by the date the tweet was posted.
  arrange(created_at) %>%
  # I only care about these two fields.
  select(screen_name, text) %>%
  # Eliminate duplicate tweets.
  distinct(text, .keep_all = TRUE) %>%
  # Get rid of links back to Twitter. They show up if you reference another tweet. These are junk text as far as our analysis is concerned. Same for &amp; entity references.
  mutate(text = str_replace_all(text, "https?://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  # This splits tweets into individual words. What we are analyzing are the words, not the tweets.
  unnest_tokens(word, text, token = "tweets") %>%
  # We are only retaining words that contain at least one letter.  unnest_tokens made everything lowercase, so that is why you don't also see A-Z.
  filter(str_detect(word, "[a-z]")) %>%
  # Remove words that are stop words. Stop words do not contribute anything meaningful to the analysis, so they get removed.
  filter(!word %in% stop_words$word)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

Now we have a data frame with a row for each word that each author wrote in every tweet we kept. Note the last line of the code: all stop words are removed. Stop words are words that carry little analytical value: the, a, at, and so on.
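The stop word list ships with tidytext and merges several standard lexicons; you can inspect it directly:

# stop_words is a data frame with columns word and lexicon.
stop_words %>% count(lexicon)          # how many words each source lexicon contributes
stop_words %>% filter(word == "the")   # very common words appear in multiple lexicons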

Just for the fun of it, let’s see the most commonly used words across all authors:

tweet_words %>%
  count(word, sort = TRUE) %>%
  head(16) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(y = "# of uses among all accounts") +
  ggtitle("Most commonly used words", subtitle="from tweets from Dallas-area media personalities and Wylie H. Dallas")

[Plot: most commonly used words across all authors]

An NBC-specific word appears in this top-words list, a hint that NBC employees may be coordinating their accounts.

Now we work up to the exciting analysis. Right now, the tweet_words data frame has a row for each use of a word. We will collapse this into one row per word per author, with a count of how many times each author wrote that word.

For example, here’s an excerpt of tweet_words, filtered to a few of Wylie’s uses of the word Dallas:

tweet_words %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  filter(str_detect(word, "dallas")) %>%
  filter(!str_detect(word, "[@#]")) %>%
  arrange(word) %>%
  head(10) %>%
  kable() %>%
  kable_styling(full_width = F)
screen_name word
Wylie_H_Dallas 10dallas
Wylie_H_Dallas cityofdallas
Wylie_H_Dallas dallas
Wylie_H_Dallas dallas
Wylie_H_Dallas dallas
Wylie_H_Dallas dallas
Wylie_H_Dallas dallas
Wylie_H_Dallas dallas
Wylie_H_Dallas dallas
Wylie_H_Dallas dallas

The data has Wylie using the word Dallas several hundred times. Instead of hundreds of rows, each showing that Wylie wrote “Dallas”, we will condense them into one row per word per author, with the count of uses added as another column:

word_counts <- tweet_words %>%
  count(screen_name, word, sort = TRUE)

Here’s what it looks like:

word_counts %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  filter(str_detect(word, "dallas")) %>%
  filter(!str_detect(word, "[@#]")) %>%
  arrange(word) %>%
  head(10) %>%
  kable() %>%
  kable_styling(full_width = F)
screen_name word n
Wylie_H_Dallas 10dallas 1
Wylie_H_Dallas cityofdallas 1
Wylie_H_Dallas dallas 489
Wylie_H_Dallas dallas8217 8
Wylie_H_Dallas dallasarea 5
Wylie_H_Dallas dallasbased 3
Wylie_H_Dallas dallasfort 2
Wylie_H_Dallas dallaspd 1
Wylie_H_Dallas dallasthemed 1
Wylie_H_Dallas dallastotaos 1

Now to the final step, but first an explanation. I will compute a term frequency–inverse document frequency (TF-IDF) statistic for each word. This statistic helps you see which words are distinct to a given author: if a word is relatively distinct for a given author, it gets a higher score for that author and a lower score for the other authors. Suppose Wylie frequently used the word butthead, and the other authors rarely did. In that case, butthead would have a high score for Wylie.

What we’re really getting at is a word-use fingerprint of each of these authors.

Here’s the code:

# Compute TF-IDF using "word" as term and "screen_name" as document.
word_tf_idf <- word_counts %>%
  bind_tf_idf(word, screen_name, n) %>%
  arrange(desc(tf_idf))
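If bind_tf_idf feels opaque, here is a toy example, separate from the pipeline, with two made-up authors standing in as the “documents”:

library(tidyverse)
library(tidytext)

toy <- tribble(
  ~author, ~word,      ~n,
  "wylie", "butthead", 10,
  "wylie", "dallas",    5,
  "other", "dallas",    5
)

toy %>% bind_tf_idf(word, author, n)
# "butthead" appears in only one of the two documents, so its idf is
# log(2/1) > 0 and it gets a positive tf_idf score.
# "dallas" appears in both documents, so idf = log(2/2) = 0 and its
# tf_idf is 0 for everyone, no matter how often it is used.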

Here’s Wylie’s top 10 most distinct words:

word_tf_idf %>%
  filter(screen_name == "Wylie_H_Dallas") %>%
  arrange(desc(tf_idf)) %>%
  select(screen_name, word, tf_idf) %>%
  head(10) %>%
  kable(caption = "Wylie H. Dallas's most distinct words") %>%
  kable_styling(full_width = F)
Wylie H. Dallas’s most distinct words
screen_name word tf_idf
Wylie_H_Dallas @johncornyn 0.0133630
Wylie_H_Dallas @dallasobserver 0.0124769
Wylie_H_Dallas @cityofdallas 0.0086668
Wylie_H_Dallas dallas 0.0070789
Wylie_H_Dallas @visitdallas 0.0067742
Wylie_H_Dallas @culturemapdal 0.0052427
Wylie_H_Dallas @cmjsgates 0.0041942
Wylie_H_Dallas @americanair 0.0040148
Wylie_H_Dallas @johnson4dallas 0.0039866
Wylie_H_Dallas @ncoxbarrett 0.0039711

These are the words that are both most distinct to and most frequently used by Wylie.

Hey, let’s see Jim Schutze’s relatively distinct words:

word_tf_idf %>%
  filter(screen_name == "JimSchutze") %>%
  arrange(desc(tf_idf)) %>%
  select(screen_name, word, tf_idf) %>%
  head(10) %>%
  kable(caption = "Jim Schutze's most distinct words") %>%
  kable_styling(full_width = F)
Jim Schutze’s most distinct words
screen_name word tf_idf
JimSchutze @dallasobserver 0.1915410
JimSchutze dingbat 0.0110509
JimSchutze hypocrite 0.0110509
JimSchutze kayak 0.0110509
JimSchutze griggs 0.0090929
JimSchutze creuzot 0.0076106
JimSchutze trinity 0.0072151
JimSchutze mlk 0.0071348
JimSchutze wash 0.0064158
JimSchutze scientist 0.0059894

Hmm, they both like the Dallas Observer! Other than that, I’m not seeing much. Looks like Schutze’s most distinct words relate to what he’s writing about in his day job, whereas Wylie’s most distinct words are about broader topics.

However, these are only the top ten words. Wylie, for example, has 4,997 distinct words in total, so we need something more sophisticated: a pairwise similarity calculation between each pair of authors’ word vectors, weighted by their TF-IDF statistics:

similarity <- word_tf_idf %>%
  pairwise_similarity(screen_name, word, tf_idf, upper = FALSE, sort = TRUE)
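Under the hood, pairwise_similarity() from widyr computes cosine similarity: the dot product of two authors’ tf_idf vectors divided by the product of their magnitudes. A bare-bones version for intuition only (widyr handles the sparse author-by-word matrix for us):

cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_similarity(c(1, 0, 2), c(1, 1, 2))  # ~0.91: largely overlapping word use
cosine_similarity(c(1, 0, 0), c(0, 1, 1))  # 0: no words in common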

Let’s look at the top 10 matches:

similarity %>%
  arrange(desc(similarity)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
item1 item2 similarity
ttsiaperas rlopezwfaa 0.6127508
CourtneyNBC5 DeborahNBC5 0.5220304
KenKalthoffNBC5 BenRussellNBC5 0.4737143
ToddWFAA8 wfaalauren 0.4156295
ScottNBC5 CourtneyNBC5 0.4008045
NBC5photog CourtneyNBC5 0.3550653
ScottNBC5 DeborahNBC5 0.3383298
DMNOpinion medenix 0.3358262
JimSchutze Wylie_H_Dallas 0.3358210
BenRussellNBC5 CourtneyNBC5 0.3227761

This makes sense. What you are seeing is a high degree of similarity in distinct-word use between people who work for the same company. Remember above, when I observed that an NBC-specific keyword ranks high in the total counts? NBC5 may be closely managing its Twitter accounts, which would give its people similar use of words that are relatively distinct across all authors.

Hold on a sec: does that table suggest Jim Schutze and Wylie H. Dallas work for the same company? Let’s explore this a bit further.

Let’s filter the list just to where Wylie is being compared to the journalists:

# Limit the list to just comparisons with Wylie
similarity_to_wylie <- similarity %>%
  filter(item1 == "Wylie_H_Dallas" |
           item2 == "Wylie_H_Dallas") %>%
  unite(account, item1, item2, sep="")

similarity_to_wylie$account <- str_replace(similarity_to_wylie$account, "Wylie_H_Dallas", "")

similarity_to_wylie %>%
  head(10) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
account similarity
JimSchutze 0.3358210
TristanHallman 0.2381990
DMagazine 0.1934352
Dallas_Observer 0.1845921
DMNOpinion 0.1772568
PhilipTKingston 0.1735325
RobertWilonsky 0.1567684
medenix 0.1501068
CultureMapDAL 0.1499126
johnmccaa 0.1311062

And there you go: Wylie’s similarity score is much higher for Jim Schutze than for anyone else, roughly 40% higher than second place. Let’s make a plot:

# Convert the account column into a factor ordered by similarity,
# so ggplot sorts the bars by score instead of alphabetically.
similarity_to_wylie <- similarity_to_wylie %>%
  mutate(account = reorder(account, similarity))

similarity_to_wylie %>%
  ggplot(aes(x=account, y=similarity)) +
  geom_col() +
  coord_flip() +
  labs(y = "Twitter user") +
  ggtitle("Similarity between Wylie H. Dallas and others ", subtitle="from tweets from Dallas-area media personalities and Wylie H. Dallas")

[Plot: similarity between Wylie H. Dallas and others, all accounts]

This plot is a mess! Here’s the same plot with just the top 20 similarity scores:

similarity_to_wylie %>%
  top_n(20) %>%
  ggplot(aes(x=account, y=similarity)) +
  geom_col() +
  coord_flip() +
  labs(y = "Twitter user") +
  ggtitle("Similarity between Wylie H. Dallas and others ", subtitle="from tweets from Dallas-area media personalities and Wylie H. Dallas")
## Selecting by similarity

[Plot: similarity between Wylie H. Dallas and others, top 20]

We’re seeing the strongest word-use fingerprint match between Schutze and Wylie.