Preparing for 2020!

December 2, 2019 Josephine Lukito

It’s December, which is when we tend to think about what we’ve done this year and what we hope to do for next year. For me, that reflection includes updating my personal organizing and scheduling system (e.g., planners, calendars, bullet journals, organizers).

Organizing has been essential to maintaining a consistent workflow throughout my academic career. It’s a living system: I continually revamp it to make sure I’m getting the most out of it. Right now, I’m using a “paper-dominant hybrid system”: my scheduler, to-do list, reading notes, and zettelkasten are in print, but I maintain a digital calendar, a citation system, and a mind-mapper.

Organizing systems are as varied as academic scholars. This makes sense: your system should serve your needs. But regardless of whether it’s digital or physical, multi-platform or all in one place, it behooves scholars to have a system that isn’t a pile of scraps or things you write on your hand. Trust me when I say: there is too much to remember in grad school for you to “have it all in your head.” If you don’t write things down or record them, they will inevitably slip from your mind.

For that reason, I'm hoping to spend my next few blog posts talking about how I organize my academic life (from day-to-day scheduling to keeping notes that will last a decade). I’ll also talk about how I’m updating my 2019 system for the new year.

But before I proceed, here are a couple of disclaimers/considerations:

No organization system is perfect forever. In the planner community, the term “planner peace” refers to having a system you are completely satisfied with. While this sounds awesome, realistically, you won’t find a system that completely fits you for your whole life. Your planner system will change as you and your career change, and this is how it should be, because what you need from your organizing system will change too.
Maintenance is key. A good organizing system relies on regular maintenance. That might involve setting aside time weekly to update your citations, review your planner/calendar, or clean your to-do list. As diverse as organization systems are, they all still require maintenance.
The best-laid plans of mice and men often go awry. Few of the days I schedule and organize go exactly as I anticipated. Even if I write a daily to-do list, I rarely complete it. Don’t be hard on yourself when your plans go out the window for a day (or longer). Don’t feel bad if you have to forgo your organizing system for a bit when things get hectic.
Don’t mistake planning for doing. Planning out your day is not the same as actually doing what you planned. Don’t make planning busy-work to avoid the real work you have to do.

Disney Plus Data and Chill ;)

November 22, 2019 Josephine Lukito

On November 12, 2019, Disney unveiled its streaming service, Disney+, to the world. It received significant attention, both good and bad, from the press, which makes sense, because over 10 million people signed up in the first day.

Twitter was also abuzz with conversations about Disney+ (see this string-of-tweet “news story” about Twitter activity on the first day). Several pointed out that shows, including new ones like The Mandalorian and oldies like Darkwing Duck, were trending soon after Disney+ was launched.

But what would activity look like after the first day?

To answer this question, I used Mike Kearney’s rtweet package to look at tweets posted from 11/14/19 to 11/18/19 that had one of the following keywords: disneyplus, disney plus, disney+, and disney +.

Timeline

As with any long-term (> 1 day) popular topic (like elections), tweets about Disney+ had a natural seasonality. People tweet less after midnight and pick back up at 6 or 7 a.m. the next day. While activity was still pretty high on the 14th, people tweeted less and less about it over time (as would be expected). There were a little over a million tweets in the corpus (n = 1,107,413).
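To see this kind of daily rhythm, one option is to bin tweets by the hour and count them. Here is a minimal base R sketch of that idea. The timestamps are invented toy values (not from the Disney+ corpus), and the variable names are placeholders; it assumes you have a POSIXct timestamp for each tweet.

```r
# Toy timestamps, invented for illustration (not from the Disney+ corpus)
created_at <- as.POSIXct(
  c("2019-11-14 01:15:00", "2019-11-14 07:05:00", "2019-11-14 07:40:00",
    "2019-11-14 19:30:00", "2019-11-15 07:20:00"),
  tz = "UTC"
)

# Label each tweet with the hour it falls in, then count tweets per hour
hour_bin <- format(created_at, "%Y-%m-%d %H:00", tz = "UTC")
tweets_per_hour <- table(hour_bin)

print(tweets_per_hour)
```

Plotting these hourly counts over several days is what makes the overnight dips and morning pick-ups visible.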

Topic Modeling

I also ran an LDA topic model, which highlights the variety of conversations on Twitter about Disney+.

Notably, The Mandalorian, Hannah Montana, The Simpsons (which is on Disney+ in its original 4:3 format), and Bad Girls Club were talked about frequently enough to be (mostly) stand-alone topics. The Mandalorian hashtag (#themandalorian) was also a popular keyword in the corpus.

But we also see a variety of other topics, including one about the Nickelodeon and Netflix deal (which many people viewed as a response to Disney+’s explosive popularity) and another comparing Disney+ to other streaming services (like Netflix, Hulu, and HBO). In fact, Netflix was the third most frequent term in the dataset (behind Disney and Disneyplus).

(Some of the topics were obviously noisier than others. Topics with the little red “n” are “noisier” than the others, meaning that a large number of tweets with a high beta in that topic were not related to the topic labels. Many tweets in the “Bad Girls Club” topic, for example, don’t actually have to do with that show.)

Sentiment-Laden Words

I did a quick sentiment analysis as well, using the tidytext package (specifically, the bing sentiment lexicon). This allowed me to look at frequently used, sentiment-laden terms.

As with any sentiment analysis that is based on a lexicon, there are obvious limitations. The bing dictionary, for example, includes “trump” as a positive word, so it would also count any mention of Donald “Trump” as positive.

We can see a similar phenomenon here with the word “chill”, which bing treats as a negative word. If you recall from the topic modeling results, “Disney+ and Chill” was a topic in and of itself. In addition to using the specific phrase “Disney+ & Chill” (which is a snowclone from “Netflix & Chill”), we see people trying to come up with their own variants, including “Disney+ and Thrust” and “Disney+ and Bust”.
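The “chill” problem is easy to reproduce with a toy example. Below is a minimal base R sketch of lexicon matching. It is not the tidytext workflow itself, and the three-word lexicon and the tweets are made up for illustration (only the “chill = negative” entry mirrors what I described above).

```r
# Hypothetical mini-lexicon: "chill" = negative mirrors the bing behavior
# described in the post; the other entries are invented for illustration
toy_lexicon <- c(chill = "negative", fun = "positive", bad = "negative")

# Made-up tweets, not from the corpus
toy_tweets <- c("disney+ and chill tonight",
                "the mandalorian is fun",
                "chill out, it is not that bad")

# Lowercase, split on anything that is not a letter, and drop empty strings
tokens <- unlist(strsplit(tolower(toy_tweets), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Keep only tokens that appear in the lexicon and tally them by sentiment
matched <- tokens[tokens %in% names(toy_lexicon)]
sentiment_counts <- table(word = matched, sentiment = toy_lexicon[matched])

print(sentiment_counts)
```

Both uses of “chill” get counted as negative, even though “disney+ and chill” is not a negative phrase. A lexicon only sees the word, not the context.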

For a quick and dirty analysis, this was a pretty fun corpus of tweets to go through! You can check out my code at my GitHub.

Armchair Linguistic-ing: ~sparkle~ or sPoNgEbOb sarcasm?

September 26, 2019 Josephine Lukito

Today, I’m going to play the role of “armchair linguist” (which is fun and something everyone can do. Everyone can be an armchair linguist.) As much as I would love to pull some data and analyze some fun text, I’m deep in analyzing my dissertation data and should really focus my computing energy towards that.

However, I was thinking about sarcasm recently when writing about the phrase, “the internet is serious business.” This is a sarcastic remark (and an early meme) that pokes fun at people taking online discourse too seriously. In my little memo, I went back and forth between two constructions of sarcasm:

(1) Even if the internet is not ~serious business~,…
(2) Even if the internet is not sErIoUs bUsInEsS,…

Both are typographic markers of sarcasm that people frequently use in online communication. The first is an example of sparkle sarcasm, sometimes described as the “sarcasm tilde.” In Because Internet, McCulloch describes the tilde as having an exaggerated rise and fall that mimics the tonal features of sarcastic language. Even single-syllable words like “thaaaaaaaanks” and “soooooo” can be elongated for sarcastic effect. Many moments in South Park’s Sarcastaball episode show off this elongation.

Source: Top definition for “~” on Urban Dictionary

The second is (now) an outgrowth of the popular “mocking spongebob” meme, which produced the now well-known sPoNgEbOb cAsE (fun fact: R has a sPoNgEbOb cAsE package, which you can check out here). I’ll call this spongebob sarcasm. The primary purpose of this case variation is to mock the tone of an idea or opinion, which draws from the mocking intent of the original meme. “Spongebob case” is obviously not the first use of alternating caps: like sparkle sarcasm, it was grouped with the use of tildes and asterisks constituting sparkly unicorn punctuation (~*~*iSn’T tHiS gReAt?!*~*~). But under the Mocking Spongebob meme, it’s taken on a life of its own, in the way that sparkle sarcasm is now distinct from ~*~*more ornate*~*~ uses of tildes and asterisks.
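For the curious, alternating caps takes only a few lines of base R. This is a hypothetical helper for illustration, not the API of the package mentioned above: it lowercases everything and then uppercases every second character, counting spaces as positions.

```r
# Hypothetical helper (not the API of any existing package):
# lowercase the text, then uppercase every second character
spongebob_case <- function(text) {
  chars <- strsplit(tolower(text), "")[[1]]
  even <- seq_along(chars) %% 2 == 0
  # toupper() leaves spaces and punctuation unchanged
  chars[even] <- toupper(chars[even])
  paste(chars, collapse = "")
}

spongebob_case("serious business")
```

Counting by character position (spaces included) happens to reproduce the casing in example (2). Counting only letters would give a slightly different pattern.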

An early case of spongebob sarcasm. Source: (Know Your Meme)

So how is sparkle sarcasm different from spongebob sarcasm? In Because Internet, McCulloch notes a Buzzfeed reporter’s description of sparkle sarcasm: “somewhere between sarcasm and a sort of mild self-deprecatory embarrassment.” The use of sparkles suggests a type of “anti-serious” sarcasm that is “sing-songy.”

In contrast, spongebob sarcasm is direct and biting—a type of “insincere” sarcasm. If sparkle sarcasm is self-deprecatory, spongebob sarcasm is mockery. A core aspect of its early use included mockingly repeating what someone else has said (that norm carries to its current usage, even if mocking oneself):

I've realized there are two types of professors in the world:

1) WHERE aRE tHE WHITeBOaRD MARkSeRs WHO WoULD StEAL TheM?!?!?!?....every single class.

2) "Oh I always carry spare markers in my purse."

I am the former, obviously.
— Jessie Male (@ProfJMale) September 24, 2019

(Above: my favorite example of spongebob sarcasm this morning)

Having both types of sarcasm gives online communicators a greater variety of “sarcasm” to choose from. And, because they are denoted with obvious markers (tildes and alternating lower and upper case), both sparkle and spongebob sarcasm are less likely to be taken at face value, whereas tonally-conveyed sarcasm could produce a misunderstanding.

If we think of sarcasm as a language microcosm of satire, we could also think of sparkle sarcasm as Horatian (playful and light-heartedly humorous) and spongebob sarcasm as Juvenalian (i.e., ridicule). I bring this up to highlight that these variations of sarcasm and language are not inherently new. But, we have found new ways to communicate those ideas in daily computer-mediated language, which I think is super cool.

(PS: I went with spongebob sarcasm: tHe iNtErNeT iS nOt sErIoUs bUsInEsS!1!1!!one!!1!1!!!)

New Semester, New Writing Approach

August 31, 2019 Josephine Lukito

Hello!

Long time no chat, readers!

The summer has been a whirlwind for me (writing, programming, reading, and moving have fully consumed my last few months, not including traveling for AEJMC and appearing on CNN).

For whatever reason, this summer has been “The Summer of Unfinished Drafts.” I’ve had more unfinished ideas and drafts than I’ve ever had before. The ideas keep popping up and landing on top of one another (an experience that is simultaneously exciting and anxiety-inducing). I’ve had a couple drafts on the docket, but for various reasons, I haven’t been able to post them to my blog. Sometimes I start an outline, but never complete it. In other instances, I tell myself I need to proof it again (and again and again and again), resulting in me never publishing it.

At the same time, I think my publication aspirations (I recently had my first single-authored piece accepted to Political Communication) and academic writing trajectory have paralyzed some aspects of my writing. In an attempt to be so polished all. the. time., I have lost a bit of my “natural voice”: my signature informality.

This became all the more evident while reading Gretchen McCulloch’s* Because Internet (a book I highly recommend). In it, she talks about how spellcheck and grammarcheck operate as a “linguistic authority” (p. 45), reinforcing archaic rules. She (admirably so) is upfront about her stylistic choices: when to adopt accepted 21st century norms (e.g., “lol” vs “LOL”) and when to bend to the norms of standard American English writing.

This matters a lot for me and this blog, because I realize the desire to write “really clean blog posts” is hindering my willingness to share new ideas and thoughts on this blog. I want this to be a place for me to be more free-flowing, and not be hindered by whether my ggplot2 title should be 10 pixels to the right (or whether I have one typo in my post).

For this reason, my subsequent posts this semester will have fairly minimal editing (if any at all). This choice reflects the kind of work I am posting here—fresh, fairly raw, but also liberated from the rigidity of many other writing genres/registers that I use (e.g., AP style news writing, academic writing). Should you want to read my more formal writing, I encourage you to check out my CJR piece, and upcoming publications.

I hope you’re excited to take this writing journey with me! For those starting semesters on campuses across the world: Happy Fall 2019! (And happy continuing quarters for those on the quarter-system.)

———

* In her book, McCulloch points out that her name is often marked as erroneous for the more common “McCullough.” Funnily enough, this happened while I was writing her name for this blog post (see screenshot below).

The Hidden Conference Cost of doing Interdisciplinary Work

June 10, 2019 Josephine Lukito

Hello blog!

Long time no chat. May was entirely lost in the black hole that is the end of the semester and the start of “academic conferencing.” In the past month, I attended the International Communication Association’s conference (ICA 2019; what I would consider the “main” conference of my primary field, Mass Communication) and a workshop at the North American Chapter of the Association for Computational Linguistics conference (NAACL NLP+CSS 2019). I have a nice break through the remainder of June and July, and then in August I have one more conference (Association for Education in Journalism and Mass Communication, AEJMC 2019).

Which brings me to my topic of the day: the cost of attending conferences to stay up to date on interdisciplinary scholarship.

Realistically, I work in three intersecting fields (four, if you include my computational stuff separately): Mass Communication, Political Science, and Linguistics. Removing a component of the trifecta is not possible; it would mean fundamentally misunderstanding my research agenda.

There are a lot of benefits and problems to doing interdisciplinary research, which many other scholars have spoken on. I love interdisciplinary work, personally, because that’s where all the enjoyable little questions are. And, as valuable as specialization can be, most research questions can be studied in many ways, depending on the department/discipline you end up in. A question about political language may produce different results if studied in Sociology, Psychology, and Political Science. So, to me, the rigorous thing would be to do interdisciplinary research—to be specific in your question, broad in where you look for theory, and concrete in your study’s operationalization and methodology.

But there are substantial professional costs to doing interdisciplinary work. A Google Scholar search of “interdisciplinary research difficulties” will yield more than enough articles to give you a sense of how much the academy has struggled to deal with interdisciplinary scholars (I choose the word “deal” carefully… rarely do I feel as if the academy “supports” interdisciplinary work).

One of those weirdly silent struggles is the cost of attending oh-so-many conferences. In an ideal world, I’d like to submit to conferences for all the fields I participate in (ICA/AEJMC for Mass Comm, LSA for Linguistics, APSA/MPSA for Political Science, NAACL/CoLing for Computational Linguistics). These conferences are important for many reasons. They help you connect with others to find jobs (a super important thing for any graduate student), they expose you to the latest studies and results in the field, and they help you connect with other people who are doing similar work to you.

But each conference can cost a substantial amount of money to attend. Below are the registration costs for most of the conferences I noted above, and a few others:

Conference	2019 Location	Regular Reg	Student Reg
AEJMC	Toronto	$ 215	$ 125
APSA	Washington D.C.	$ 160	$ 125
CoLing	Santa Fe	$ 715	$ 500
ICA	Washington D.C.	$ 300*	$ 165
IC2S2	Amsterdam	345 €	195 €
ICCSS	Amsterdam	450 €	350 €
LSA	NYC	$ 86	$ 90
NAACL	Minneapolis	$ 595	$ 295

(* ICA has tiered prices depending on where your institution is located. These are U.S. prices, Tier A.)

For each conference, you also need to account for hotel and airfare, at minimum. The best conferences are the ones that are geographically close (the location of NAACL, in Minneapolis, was a huge reason why I submitted a paper to begin with), but you are typically looking at between 300 and 500 dollars for a round-trip flight to somewhere-in-the-U.S. (aka: Chicago or DC). Conference hotels usually charge between 175 and 250 dollars per night (graduate students bring down the cost substantially by staying with other graduate students). If you are a lucky young scholar like I am, you will have tenure-track professors who will assist with food and drink for a good portion of the trip, but this is obviously not always the case.

All in all, you can be spending somewhere between 500 and 1000 dollars for each conference you attend. This cost increases considerably for non-(U.S. and European) scholars, who have to not only fly in from another country ($$$ international flights anyone?!) but also apply for visas, an increasingly daunting task (most of my conferences are in the U.S., which makes me double-privileged as a scholar in the States).
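To make that range concrete, here is a rough back-of-the-envelope sketch in R. The registration, flight, and hotel figures come from the table and ranges above; the three-night stay and three people per room are purely assumptions for illustration, and the function name is made up.

```r
# Rough per-conference cost: registration + flight + your share of the hotel
# (nights and people_in_room defaults are illustrative assumptions)
conference_cost <- function(registration, flight, hotel_per_night,
                            nights = 3, people_in_room = 3) {
  registration + flight + (hotel_per_night * nights) / people_in_room
}

# ICA student registration, with the low and high ends of the
# flight and hotel ranges quoted above
low  <- conference_cost(registration = 165, flight = 300, hotel_per_night = 175)
high <- conference_cost(registration = 165, flight = 500, hotel_per_night = 250)

c(low = low, high = high)
```

Under these assumptions, both ends land inside the 500-to-1000 dollar range, and that is before food, ground transportation, or a room you don’t share.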

If you’re a scholar working in two disciplines, that’s twice the conferences you may need to pay for. Or, you’ll have to sacrifice attending certain conferences in one year to attend another. For a young scholar, particularly one doing interdisciplinary research, not attending a conference means missed opportunities to meet people, connect about research, and find future avenues of collaboration.

Given this, we need to start thinking about the conference model, and how that limits young scholars who cannot normally afford to attend so many conferences. Alternative ways to participate, cheaper locations (and cheaper hotels), and having more included in a registration can go a long way.

Using R to analyze the redacted Mueller Report

April 20, 2019 Josephine Lukito

Updated 4/22/2019 @ 11:08 CST

Since the beginning of the year, I have become increasingly active in the #rstats community to learn more about R programming and to share my excitement for computational media linguistics. One of my favorite aspects of this community is that there are tons of people who are equally excited about doing comp ling/NLP/text-as-data work.

When the redacted Mueller Report came out, I was not surprised to see many people get to work analyzing the text however they could. Below, I’ve curated a couple of tweets analyzing the Mueller Report using R. These tweets were selected by the very informal process of digging through my feed, and searching for #rstats tweets about the Mueller Report (top and latest). If you have an analysis of the Mueller Report using R, please let me know and I’ll add it to this collection!

I am somewhat hesitant to share all the tweets in this way… I have often critiqued “string of tweet” posts as Twitter curation pretending to be journalism (also, it is a major reason why IRA tweets ended up in news stories—a point that our UW-Madison Disinformation Research Group makes in our report). But I think this is the best way to show all the awesome analyses done on the Mueller Report using R (at least, those shared on Twitter). So let’s get to it!

Collecting the Data

Okay here's maybe the easiest/fastest/best way to read the text of the #MuellerReport into #rstats pic.twitter.com/E7oNxVlXz1
— Mike Kearney📊 (@kearneymw) April 18, 2019

If you want your own searchable copy of the #MuellerReport in #rstats

download.file("https://t.co/XWYgLgnHrt", "mueller-report.pdf")

report <- pdftools::pdf_text("mueller-report.pdf")

fileConn <- file("~/mueller-report.txt")
writeLines(report, fileConn)
close(fileConn)
— Lee Drake (@Lee__Drake) April 18, 2019

Well, the Mueller Report is out, and let me tell you, it's ████████!! I ran it through pdftools::pdf_text() and put the OCRed data here https://t.co/TxKlc2q5oT
— Garrick Aden-Buie (@grrrck) April 18, 2019

Garrick Aden-Buie, in fact, has done a lot of great work on the Mueller Report. In a blog post, for example, he talks about using pdftools and Emil Hvitfeldt’s ggpage to “highlight the most-often referenced people in the report” (you can check out his full blog post here).

Visualizing the Mueller Report and highlighting references to all of Stupid Watergate's marquee characters using #rstats, {pdftools}, and @Emil_Hvitfeldt's {ggpage} https://t.co/EWXy8LMh5H
— Garrick Aden-Buie (@grrrck) April 19, 2019

#MuellerReport as a dataframe for #rstats & analysis https://t.co/MAO4ac8JPK via @figshare
— christopherlortie (@cjlortie) April 18, 2019

Because these individuals provided code and scraped material of the Mueller Report, even more R programmers and data scientists were able to do text and linguistic analysis on the data (including me)! In addition to my list, Stas Kolenikov has a great ongoing Twitter thread of text analyses of the Mueller Report.

Text analysis of the Mueller report sounds like a good late-breaking session for #JSM2019 - everybody is on it today. https://t.co/8emQ73koIF
— Stas Kolenikov (@StatStas) April 19, 2019

Based on the tweets, {tidytext} seemed to be the most popular package used (although there are others, including {cleanNLP} and {textnets}). Below are some of these analyses.

Analyzing the Data

Thanks to @grrrck's processing of the Mueller report and {tidytext}, was able to do a quick plot of frequencies in Mueller vs. Watergate reports!
code: https://t.co/48SzyROiQt #rstats #dataviz #MuellerReport pic.twitter.com/CnDnUbeLRd
— Mara Averick (@dataandme) April 19, 2019

Here's a real quick count of the 25 most-often-used words in the report. Mostly what you'd expect to see in a report about president trump and russia. pic.twitter.com/nRPmMKkWj7
— Garrick Aden-Buie (@grrrck) April 18, 2019

Finally had time to jump on the Mueller #rstats #nlp bandwagon. Here are the top (lemmatized) verbs for which Trump was the subject (Trump + <verb>), using spaCy in {cleanNLP}. Thanks to @grrrck for the data! pic.twitter.com/nDAAOHLRsO
— Josephine Lukito (@JosephineLukito) April 19, 2019

My code for this analysis can be found here.

The Mueller Report, frequency of key phrases and names, by volume with highlights. #rstats #tidytext pic.twitter.com/6weiAcgU75
— Ryan Timpe 🦖📊 (@ryantimpe) April 18, 2019

The #MuellerReport is a hefty 448 pages, and no one (not even #Barr), has the time to read that! To get a feel for the entire document, I scraped the #RedactedMuellerReport in #rstats this afternoon, and wrote up a brief text/sentiment analysis here: https://t.co/7j84H3oo0e pic.twitter.com/syCFuT1NnN
— Rich Pauloo (@RichPauloo) April 20, 2019

#MuellerReport a total of 10,779 unique words (excluding the, and, that) #rstats all words mentioned at least 200 times pic.twitter.com/0y1XRUJ7If
— christopherlortie (@cjlortie) April 19, 2019

speed reading the #muellerreport with the #rstats #tidytext package... (1/2) pic.twitter.com/kMnswuIsRf
— Christopher Yee (@Eeysirhc) April 18, 2019

Christopher Yee and Christopher Lortie both have a couple of other great tweets analyzing the Mueller Report, so I encourage you to check their feeds! Yee also has his code available on his blog.

Here’s an analysis by Chris Bail using his textnets package.

I just visualized the #MuellerReport (Ties between words means they co-occur often; colors of the words represent how they can be grouped into broader themes). What does it mean? I have no idea ;) Viz produced via my R package (textnets) #socsciresearch HT @kearneymw for data pic.twitter.com/Wq2PGXRXJD
— Chris Bail (@chris_bail) April 18, 2019

Pages with the most negative and positive sentiment pic.twitter.com/DfMgohNjhX
— Mike Kearney📊 (@kearneymw) April 18, 2019

Quick visualizations of #MuellerReport using KH Coder https://t.co/e0DfawkjOv (which utilizes #rstats inside) @JosephineLukito @StatStas pic.twitter.com/3bC47CmZk1
— KH Coder (@khcoder) April 21, 2019

Hot of the "I should be writing!" press: Correspondence analyses (CA) of the #MuellerReport in #rstats.

I performed CA on, and compared the component structures of two versions of the data: @dataandme (https://t.co/H9wcMz9hea) and @chris_bail (https://t.co/9wZndlrnPa) 1/3 pic.twitter.com/bLLAJioGOC
— Derek Beaton (@derek__beaton) April 22, 2019

So for anyone who’s thinking about doing data analysis on the data analysts who are studying the Mueller Report—I hope this data is useful! ;)

Yesterday, I was a footnote in history!

April 20, 2019 Josephine Lukito

Yesterday, I received exciting news! A piece that I had written with Chris Wells for Columbia Journalism Review was cited in the Mueller Report, which was released a day ago.

I was 3 hours into my 8-hour prelim exam when I received this notification from @nausjcaa that my CJR piece with @cfwells was referenced in the Mueller Report (p. 27). 😱 In the article, we discuss how and why Russian trolls were quoted as American citizens in news stories. 1/4 https://t.co/lYFi4raEwb
— Josephine Lukito (@JosephineLukito) April 18, 2019

The piece that we wrote for CJR focused on news organizations that embedded tweets by Internet Research Agency (IRA) handles into their news stories. We’ve increased the number of outlets analyzed since the CJR piece (it was about 40 when we started, but over 100 now), and our finding still holds: a majority of news organizations cited an IRA account in at least one story.

Contrary to popular opinion, these IRA accounts were not sharing “fake news” (as in: false information). Instead, IRA tweets were often quoted for their salient, often hyper-partisan opinions. For example, one tweet advocated for a Heterosexual Pride Day as a way of inciting LGBTQ activists. Another called refugees, “rapefugees”. These accounts would often portray themselves as American people (e.g., @JennAbrams portrayed herself as a “typical” American girl, as shown by research done by my colleague Yiping Xia), or as groups (like @ten_gop, an IRA account pretending to be Tennessee GOP members, and @blacktivist, an IRA account pretending to be BlackLivesMatter organizers).

This has important implications, and speaks to Mueller’s earlier indictment of the IRA, which noted that Russia’s campaign goal was “spread[ing] distrust towards the candidates and the political system in general” (p. 6). Ironically, the discovery of the IRA campaign in the summer/fall of 2017 probably fed into this distrust (especially since news organizations were as likely to be “duped” as American citizens).

The (unredacted) part where we are referenced focuses on this specific issue: journalists embedded these tweets thinking they reflected the opinions of U.S. citizens. This is incredibly problematic, and something that both academics and journalists want to find solutions for. Following our publication in early 2018, several news organizations reached out to us regarding the specific articles in which they had unintentionally quoted IRA tweets. The research team was particularly excited by these exchanges because it shows that journalists care, and want to avoid doing this in the future.

The Grammar of "A Thing": Using R to Study Digital Corpora

April 7, 2019 Josephine Lukito

One of the things I love most about my field (Communication) is its unique passion for building corpora. While there is an obvious value to studying a large, well-studied, pre-structured corpus like LOB or COCA (e.g., multiple scholars working on one dataset increases knowledge about that dataset, reproducibility, etc.), some research questions require more specialized text data.

This is often the situation that I find myself in. If I want to study a linguistic phenomenon in a specific register—like the use of “a thing” in English tweets—I usually have to build my own corpus. So how does one do that?

I’ll break my process down into four broad steps: (1) armchair linguistic-ing, (2) creating the corpus, (3) finding your linguistic phenomenon, (4) corpus analysis.

01. Armchair Linguistic-ing

I became primarily interested in this construction because of its frequency in language use. In spoken English, sentences like, “oh yeah, that’s a thing” are commonplace, even in formal-ish settings, like classrooms (I’m in a J-School, so it’s not unusual to hear someone say, “Yeah, AP style is a thing”).

I’ve always liked this construction, because “a” and “thing” are particularly vague English words. The determiner “a” (as in “I gave her a book”) is indefinite, meaning it refers to something non-specific (contrast this to “I gave her the book”). And “thing” is so broad, it could refer to any tangible, inanimate object. A watch, a book, a stroller, a ticket to Disney—all of these things are things. But when we put “a thing” together, it can suddenly take on a whole new meaning. When someone says, “AP style is a thing”, they mean “people know about AP style” or “AP style is popular”. In this context, the “a thing” is more than an indefinite determiner and a vague noun. Rather, it signifies some degree of importance.

But is this always the case? I wasn’t sure. So, I turned to my corpus building skills to find out.

I figured there could be four general places that “a thing” could be situated in. The first is the subject, like in the sentences below:

A thing needs to be done.
A thing just arrived.

The second possibility is in the object position, such as in the examples below:

I know a thing or two about school.
I made a thing.

The third possibility is that “a thing” is used as a predicate noun/nominative. This is also called a subject complement, because it completes a linking verb (in English, “to be”). This is the structure I was most interested in.

This is a thing.
That has been a thing for a long time.

And finally, we’ll look into object complements, such as in the examples below:

He considered the party a thing.
He cooked his friends a thing.

Now that I knew what I was looking for, it was time to build and parse my corpus.

02. Building Your Corpus

The first thing you’ll need to think about is where you want to get the data from. Do you want to look at journal articles? Fiction novels? Text messages between friends?

I settled on Twitter for a few reasons, the most important of which was “it is an informal register that is easy to get.” I figured the feature I was looking for would likely not be in a formal register, like news stories or presidential speeches. However, Twitter (and social media language as a whole) is simultaneously beautiful and frustrating in its kind-of-formal, kind-of-informal language norms (beautiful in that language evolves so quickly, frustrating in that there are way too many people who use prescriptivism to put down other people’s tweets).

If you are trying to access the Twitter REST API through R, I strongly advocate using rtweet, by Mike Kearney. It’s a really cool package, and a great way to build interesting Twitter corpora at your leisure.

library(rtweet)
#?search_tweets
rstats_tweets <- search_tweets(q = '"a thing"',
                               n = 1000000, 
                               retryonratelimit = TRUE) #max 18,000 every 15 minutes

head(rstats_tweets, n = 5) #looks at the first 5 tweets

This search yielded about 500,000 tweets (510,574, to be exact). To identify whether the bigram “a thing” would be used as a subject, object, predicate, or object complement, I would need to annotate this bad boy.

Right now, I’m using the R cleanNLP package, with back ends to spaCy and CoreNLP. I tend to use the latter more (CoreNLP) because I’ve gotten better results. But spaCy is much faster and has additional support. I strongly encourage it for those who are proficient in both R and Python (it can also support word vectors and has a great displaCy visualizer).

library(rJava)
library(tokenizers)
library(cleanNLP)
library(dplyr) #provides %>% and mutate(), both used below

In order to use cleanNLP, you’ll need to interface with the back end (either coreNLP or spaCy).

#cnlp_init_tokenizers() #initializes tokenizer backend
cnlp_download_corenlp()
cnlp_init_corenlp("en", anno_level = 2)
# cnlp_init_spacy

Once you have done this, you are ready to parse your corpus! For the purposes of this exercise, I’m going to use some toy data (parsing the full corpus took about 2 days—I had about 20 million dependencies total).

Toy Data

Notice that 8 of the 9 sentences are the ones from my previous examples.

toy_data <- data.frame(id = c("s1", "s2", "o1", "o2", "sp1", "sp2", "sp3", "dc1", "dc2"),
                       sentence = c("A thing needs to be done.", "A thing just arrived.", 
                                    "I know a thing or two about school.",
                                    "I made a thing.", 
                                    "This is a thing.", 
                                    "Is summer camp a thing?",
                                    "That has been a thing for so long.", 
                                    "He considered the party a thing.", 
                                    "He made his friends a thing."))

starttime <- Sys.time()
full_corpus_dep <- toy_data$sentence %>% as.character() %>%
  cnlp_annotate(as_strings = TRUE, doc_ids = toy_data$id) %>%
  cnlp_get_dependency(get_token = TRUE)
endtime <- Sys.time()

You want to make sure that you indicate the doc_ids of the data, as that is what you will use to re-align the dependency information to the original tweet or sentence.

Once you do this, you should get a data frame that looks something like this:

Let’s break down what cnlp_get_dependency produces. Each row represents one dependency relationship. Each column represents some information about that dependency (e.g., what document or sentence the dependency is in, what words the dependency relationship is linking, etc.).

A brief interlude to help us understand dependency grammar… Dependency grammar interprets two words as having a dependency (relationship) between them. This differs from constituency grammar, which breaks down word relationships into phrases, not dependencies. An important skillset in this work is being able to read the results of one and interpret it as the other (e.g., see dependency relations and conceptualize them as phrases, or see phrases and construct the dependencies).

Because dependencies focus on the relationship between two words, we can conceive of a dependency relationship as having a “word”, a “word target”, and a “relation”. Consider the very simple example of “I run.” In this sentence, we have a subject and a verb. In dependency grammar, the verb is the “root” or the center of the sentence. Therefore, each of your sentences will usually have a root. Arrows lead out from the root to other words (these are the “word targets”). Thus, if “run” is the root verb, then the word target “I” is the subject to that verb.

Let’s now look at each column in more detail. The <id> is pretty obvious: it’s the document id, or <doc_ids>, you indicated previously. The <sid> is the sentence number. For most tweets and sentences, the <sid> number will be a 1. However, blog posts, news articles, product reviews, and other longer documents are all likely to have multiple sentences. The <tid> refers to the token number of the word. There is also a <tid_target>, which is the token number of the word target.

The six other columns are: <relation>, <relation_full>, <word>, <lemma>, <word_target>, and <lemma_target>. The <lemma> and <lemma_target> are the lemmatized forms of the word and word_target (for example, the words “thinking”, “thought”, and “thinks” can be represented by the lemma /think/). Using the lemmatized form meant I largely did not have to worry about tense issues.

The <relation>, <word>, and <word_target> are the meat of the dependency analysis. The first “dependency” of a sentence is usually the ROOT verb. Let’s return to our “I run.” example below.

As you can see, the “run” verb is identified as the root. This is not really a dependency, but more an identification of what the root verb is (hence why there is no actual <word>, and why the <tid> is 0). The second row identifies a <nsubj> dependency “relation”, with the root verb “run” as the <word>, and the noun subject “I” as the <word_target>.
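
To make the two rows described above concrete, here is a minimal hand-built sketch of what they might look like as a data frame. I typed this by hand to mirror the description, so it is not actual parser output: the document id "ex1" is hypothetical, the root row's missing <word> is shown as NA (an assumption about how an absent word is stored), and <relation_full> is left out.

```r
# Hand-built illustration of the two dependency rows for "I run."
# (mirrors the prose description; not real cleanNLP output)
i_run_dep <- data.frame(
  id           = c("ex1", "ex1"),     # hypothetical document id
  sid          = c(1, 1),             # one sentence
  tid          = c(0, 2),             # 0 = no source token for the root row
  tid_target   = c(2, 1),             # "run" is token 2, "I" is token 1
  relation     = c("root", "nsubj"),
  word         = c(NA, "run"),        # assumption: the root row has no word
  lemma        = c(NA, "run"),
  word_target  = c("run", "I"),
  lemma_target = c("run", "I"),
  stringsAsFactors = FALSE
)

# Find the root verb, then the noun subject attached to it
root_verb <- i_run_dep$word_target[i_run_dep$relation == "root"]
subject   <- i_run_dep$word_target[i_run_dep$relation == "nsubj" &
                                   i_run_dep$word == root_verb]
subject
```

Reading the rows this way (root first, then whatever hangs off the root) is the same pattern used on the toy data.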

There are many (many) possible dependency relations. You can find a list of them here.

There is some older documentation that can also be potentially useful here (this version of the dependencies is no longer maintained).

Let’s now apply this knowledge to our toy data.

03. Finding your Linguistic Phenomenon

Recall that our goal is to identify whether the bigram “a thing” appears as a subject, object, predicate, or object complement.

Let’s do so by identifying all the dependencies for which “thing” is a <word> or <word_target> (the “a” in “a thing” will be identified as a determiner <word_target> to the “thing”).

thing_word <- subset(full_corpus_dep, word == "thing")
thing_target <- subset(full_corpus_dep, word_target == "thing")

Notice that the “a thing” dependency shows up in the <thing_word> subsetted data. But the more useful dataset for us is the <thing_target> data.

Notice that, if in the subject position, the “thing” <word_target> has a <nsubj> (noun subject) <relation> to a verb. In the object position, the “thing” <word_target> has a <dobj> (direct object) <relation>. In the predicate position, the “thing” <word_target> is the ROOT (if you check the <word> data, you will also note a <cop>, or copula, <relation> from the verb “to be” to the “thing” <word>). In the object complement position, the “thing” <word_target> has an <xcomp>, or an “open clausal complement” <relation>.
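
The mapping in the paragraph above can be sketched as a simple lookup. The relation labels come from the text; the lookup vector and the hand-typed demo rows are my own stand-ins, and they assume the parser labels these sentences the way the paragraph describes.

```r
# Map each dependency relation to the grammatical position it signals
# (the lookup is a hypothetical helper, not part of cleanNLP)
position_lookup <- c(nsubj = "subject",
                     dobj  = "object",
                     root  = "predicate",
                     xcomp = "object complement")

# Hand-typed stand-in for a few rows of thing_target
thing_target_demo <- data.frame(
  id       = c("s1", "o2", "sp1", "dc1"),
  relation = c("nsubj", "dobj", "root", "xcomp"),
  stringsAsFactors = FALSE
)

# Look up the position for each row's relation
thing_target_demo$position <- unname(position_lookup[thing_target_demo$relation])
thing_target_demo
```

Any relation outside the four of interest comes back as NA, which makes unexpected parses easy to spot.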

Below is an image of all the dependency relationships I was interested in, as related to the “a thing” bigram.

Side note: While the toy data plays nicely, real data isn’t always perfectly parsed. For example, I had about 2,000 tweets where a copula-predicate relationship was identified as a subject-verb(“to be”)-object relationship (these had a “nsubj” + “det” + “dobj” relationship, but the root lemma was “be”—this meant they were initially coded as “objects” but, upon further examination, I subset them to the predicate list).
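
A recode like the one in the side note might look something like the sketch below. The rows are hypothetical, and I am assuming that <lemma> holds the lemma of the governing verb for each dobj row, as in the column description earlier.

```r
# Hypothetical dobj rows for "thing"; `lemma` is the governing verb's lemma
objects_demo <- data.frame(
  id       = c("t1", "t2", "t3"),
  relation = c("dobj", "dobj", "dobj"),
  lemma    = c("make", "be", "do"),
  stringsAsFactors = FALSE
)

# A "direct object" of "be" is really a predicate noun, so recode it
is_misparsed <- objects_demo$relation == "dobj" & objects_demo$lemma == "be"
objects_demo$position <- ifelse(is_misparsed, "predicate", "object")
objects_demo
```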

04. Corpus Analysis

Now that we know what the relationships are, we can re-aggregate to the tweet level. My corpus had a few instances (<10) where “a thing” was used twice. In all these instances, however, the “a thing” bigrams were in the same position.

subject <- subset(thing_target, relation == "nsubj", select=id) %>% mutate(subject = 1)
object <- subset(thing_target, relation == "dobj", select=id) %>% mutate(object = 1)
predicate <- subset(thing_target, relation == "root", select=id) %>% mutate(predicate = 1)
complement <- subset(thing_target, relation == "xcomp", select=id)  %>% mutate(complement = 1)

toy_data2 <- merge(toy_data, subject, by = "id", all.x = TRUE) %>% 
  merge(object, by = "id", all.x = TRUE) %>%
  merge(predicate, by = "id", all.x = TRUE) %>%
  merge(complement, by = "id", all.x = TRUE)
toy_data2[is.na(toy_data2)] <- 0
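
One thing to watch for with this kind of merge is the case mentioned earlier, where a tweet uses “a thing” twice in the same position. The sketch below uses made-up data to show how the repeated id duplicates the tweet after merging, and one possible guard (calling unique() on the flag table first). This is an illustration of the pitfall, not necessarily how the full corpus was handled.

```r
# Hypothetical data: tweet t1 uses "a thing" twice, both times as an object
tweets_demo <- data.frame(id   = c("t1", "t2"),
                          text = c("I did a thing and made a thing.",
                                   "This is a thing."),
                          stringsAsFactors = FALSE)
object_demo <- data.frame(id = c("t1", "t1"), object = 1)

# Merging as-is repeats t1 once per matching row
merged_dup <- merge(tweets_demo, object_demo, by = "id", all.x = TRUE)

# Dropping duplicate flag rows first keeps one row per tweet
merged_ok <- merge(tweets_demo, unique(object_demo), by = "id", all.x = TRUE)
merged_ok[is.na(merged_ok)] <- 0

nrow(merged_dup)  # 3
nrow(merged_ok)   # 2
```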

Let us now turn to the results of the full data.

Results

As we can see, the bigram “a thing” is most likely to appear in the object position (“I made a thing”) or predicate position (“This is a thing”). Rarely is “a thing” used in the subject position (e.g., “Love the phrase ‘a meteoric rise’, a thing a meteor has never done”). Fewer than 250 tweets had “a thing” in the object complement position.

“A thing” in the Predicate position

As I expected, tweets that used “a thing” in the predicate noun position discussed a subject as popular, socially important, or at least well-known. These tweets usually followed a similar structure: the word “thing” is the root. The word_target “a” is a determiner to “thing”, and the lemma “to be” (representing “is”, “are”, “was”, and “were”) is a copula to “thing”. Finally, the “nsubj” relationship would link the noun word_target to the “thing” word (this is why we need the thing_word subsetted data).

So what nouns are described as “a thing”?

The figure above shows that, when “a thing” is a predicate, it often links to demonstrative determiners (this, that) or the pronoun “it”. We also see some more specific nouns, such as “church”, “abortion”, and “harassment”.

Many of these tweets were exclamations (e.g., “I didn’t know this was a thing!” or “OMG this is a thing?!” or “Had no idea this was a thing!!”). Some were questions, about whether something was “still a thing”: “I grew up being told about thick and thin. Is that still a thing[?]”

“Church” appeared often in tweets like, “Is church still a thing?” and “How is church and religion still a thing?” In at least one instance, a tweet was incorrectly parsed: “separation of church and state is a thing, you know that right little @mike_pense?” (the parser coded this as ([NP] separation ([PP] of ([N] church)) ([CONJ] and) ([N] state)), rather than treating “church and state” as a conjunction within the prepositional phrase).

Many people also described abortions as a thing (or not a thing). A few tweets noted “Abortions are a thing” to note the frequency with which they occur. One tweet said “Late term abortions should not be a thing”, focusing on a specific type of abortions. Another said, “Post-term abortion is not a thing”, referring specifically to President Trump’s coinage.

“A thing” in the object position

Let us now turn to the use of “a thing” in the object position. In these cases, “a thing” is still relatively vague. If you “make a thing”, you’re not necessarily saying the thing you made is popular or well-known; you may simply be happy that you did it (e.g., “I did a thing!”). We can explore this structure more by looking at the verbs associated with the “a thing” bigram in the object position. The dependency relationship would be a dobj from the word_target “a thing” to a verb.

The figure below displays the verbs that appeared in at least 5,000 tweets in which “a thing” was a direct object.

Keep in mind that these verbs are lemmatized (therefore the lemma “know” represents “know” and “knew”). By far, the most common verbs used were “to do” and “to make” (as in “I did a thing” and “I made a thing”). This is followed by “to have”, “to know”, “to miss”, and “to learn”.
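
Counting verbs like this can be sketched in a few lines of base R. The lemmas below are made up for illustration, and the threshold is set to 2 rather than the 5,000 used for the full corpus.

```r
# Hypothetical verb lemmas from dobj rows where "thing" is the word_target
verb_lemmas <- c("do", "make", "do", "have", "make", "do", "know")

# Tally each lemma and sort from most to least frequent
verb_counts <- sort(table(verb_lemmas), decreasing = TRUE)

# Keep only verbs at or above a minimum count
min_count <- 2
frequent_verbs <- verb_counts[verb_counts >= min_count]
frequent_verbs
```

Because the counts are on lemmas, “did” and “do” land in the same bucket, which is what makes the tally readable.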

For example, one tweet said “I made a thing for Pokemon fans and Kingdom Hearts fans [Image]”. Note that the author provided additional info (a prepositional phrase and an image) to describe the “thing”. Another user tweeted, “It looks bad but i did a thing [image].” Many of these tweets expressed some pride over doing, making, creating, having, buying, or owning “a thing”. This was almost always accompanied by pictures of what the “thing” was.

Some tweets also referenced love, as in “love don’t mean a thing” or “love don’t cost a thing”. About 6000 tweets used the verb “change”, and often commented on other people’s worth (e.g., “don’t change a thing!”).

Overall, I really enjoyed doing this analysis. It’s been more difficult to do this analysis with my Ph.D work picking up, but I’m glad I can still find the time every now and again.

Attending the R Forwards Women's Package Workshop (Hosted by R-Ladies Chicago)

February 24, 2019 Josephine Lukito

This weekend, I had the pleasure of attending an R Forwards Women's Package Workshop. It was hosted and run by members of R-Ladies Chicago: Angela Li and Stephanie Kirmer.

Though I have attended and run many one or two hour workshops, this was my first long-day, single-topic workshop (9:30 to 4:00 pm) and I thoroughly enjoyed the experience! It’s definitely a format that would be useful to teach more complex topics, like software development and package building. There was a lot of info-in-brain-cramming, but I also felt like I learned a ton in a very short time span.

Attendees of the 2019 R Forwards Package Workshop, taught by Angela and Stephanie of R-Ladies Chicago!

The session was broken down into a few broad topics: package development, git+r, unit testing, documentation, and package sharing (e.g., licences, indicating dependencies, CRAN). This made the material useful for both specific-use packages (e.g., building a data wrangling package for my specific research group) and for more public-facing packages (e.g., a package that one would want to upload to CRAN).

Top 5 Things I Learned:

The usethis package is so convenient and important for package development. For example, the function
```
usethis::create_package("~/Desktop/mypackage")
```
will create the skeleton of the package files for you, including folders for R-code, the “man” folder (“man” stands for manual), and description/namespace files. This makes package building so much easier! You can learn more about it in Wickham’s R package book.
Using the “::” operator allows you to access the exported variables or functions in a package namespace. But if you really want to see under the hood, “:::” allows you to access everything, including the functions that are not publicly exported (there’s some more about it on StackOverflow).
Semantic Versioning - How have I only learned about this now, despite attending several data carpentry workshops and classes?! I am such a stickler for version recording, even in my non-computational work (I have been subconsciously semantic versioning my human content analysis codebooks), and it’s so nice to finally have a specific phrase associated with this process.
A quote from my favorite slide of the day: “If the first line of your #rstats script is
```
setwd("C:\Users\jenny\path\that\only\I\have")
```
I will come into your lab and SET YOUR COMPUTER ON FIRE 🔥.”
Confession time: I do this a lot! 🙈 ::embarrassed:: In terms of workflow, I am generally quite sloppy about separating projects and keeping relative paths. I first read about projects in R for Data Science, but I never took the lesson to heart until this weekend. I know… it’s bad given how much I code. So I guess I’ll be “konmari-ing” my R code this semester (i.e., create an Rproject space for each of my projects)!
A great tip from Stephanie’s top tweet (about Git): “when you screw up a git merge, you can use git reset --hard master@{"300 minutes ago"} with any time quantity you want in there to get back to where things were a period of time ago.”

.@data_stephanie shares her most popular tweet of all time and mind = blown 🤯 #RLadies #RForwardsChi pic.twitter.com/nCh97AE28j
— R-Ladies Chicago (@RLadiesChicago) February 23, 2019
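
To make the “::” versus “:::” point from the list above a bit more concrete, here is a small sketch that compares what a namespace exports with everything it contains. I am using the stats package only because it ships with R; the same idea should apply to any installed package.

```r
# Compare what a namespace exports with everything it contains
exported   <- getNamespaceExports("stats")
everything <- ls(asNamespace("stats"), all.names = TRUE)

# Objects reachable with ":::" but not with "::"
internal_only <- setdiff(everything, exported)

"sd" %in% exported         # TRUE: stats::sd works
length(internal_only) > 0  # TRUE: these are only reachable with stats:::
head(internal_only)
```

Anything in internal_only is something the package authors chose not to export, so it can change without notice between versions.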

Extra Bonus: Differencing vs. Logging Time Series Data

I happened to also sit next to economist Dweepobotee Brahma, which was a great coincidence since I’ve been binging time series models and papers for the past year. I happened to randomly gripe about how economic data is often processed (i.e., logged). She was kind enough to explain to me why economists did this, and why growth rates are so important to the research in economics and econometrics. Having been taught by a political scientist, whose questions are not as focused on exponential relationships, I didn’t know much about this alternative treatment/perspective on time series data, and it was really interesting!

I’m starting to wonder if this is especially important to modeling follower growth. Nearly all follower count time series I’ve analyzed have been integrated of order one, I(1), if not I(2). Often I will first- or second-difference the series to make it stationary. However, I’m realizing a logarithmic transformation is probably more appropriate for what we try to measure (implicitly, it’s a growth rate question).
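
Here is a minimal sketch of the difference, using a made-up follower series that grows 5% per period (the numbers are hypothetical, not from my data). Plain differencing leaves a series that keeps growing, while differencing the logged series gives a constant value, which is approximately the growth rate.

```r
# Hypothetical follower counts growing 5% per period
followers <- 100 * 1.05^(0:9)

# First differences: the absolute change keeps getting larger
abs_change <- diff(followers)

# Log differences: approximately the per-period growth rate, constant here
log_change <- diff(log(followers))

round(abs_change, 2)
round(log_change, 4)  # every value is log(1.05), about 0.0488
```

Real follower counts will not be this tidy, but the same logic is why a logged series can be closer to stationary after a single difference.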

Conclusion

Overall, I’m really, really glad that I was able to attend this workshop. It’s definitely up there on my “favorite R workshops ever” list. Organizationally, there were a lot of little things I wanted to take back to my workshop strategy (for example, this was the first workshop I attended where we used Post-its to indicate whether we needed help with specific tasks), and obviously it was great for advancing my R skills.

One of the most important “big picture” lessons I learned was that if I want to actually do software development in R (or any programming language), I have to be more organized about my code. I am organized when it comes to data management, but am definitely less so with my scripts and functions. Workflow-wise, I want to get on top of this by the end of the semester.

I’m also one step closer to completing a major R-new-year’s-resolution: Build an R package! I have a couple of functions that I rely on for wrangling operationalized text data into time series data, so I’m eager to wrap them all up in a neat little package for future use.

And finally, attending the workshop was a great reminder about how amazing the R community is, both offline and online. That’s one of my favorite parts about being an R programmer—the community makes it easy to be excited about learning R.

I am so grateful to Angela and Stephanie for hosting this amazing workshop, and to R Forwards for sponsoring my attendance. If you are interested in checking out the materials from this weekend, they have made the workshop material available here.

Marrying "to get" and "to have"

January 10, 2019 Josephine Lukito

Typically, “have got” is considered the informal structure of “have” (see here for an example). This presumes that the meaning is the same, but the construction has a formal/informal tone (in other words, the deep structure of the sentence is sustained).

This certainly works when the object is a noun. Let’s look at four sentences where this works:

1A           “I’ve got a present”
1B           “I have a present”
2A           “I’ve got a boyfriend”
2B           “I have a boyfriend”

But this dynamic changes when the object of the sentence is a pronoun. In fact, the two (“have got” and “have”) are no longer semantically similar when we apply them to pronouns. Consider the following three sentences:

3A           “I have you.”
3B           “I got you.”
3C           “I’ve got you.”

In the first of these three sentences, I have you is an indication of possession, and can be somewhat creepy in the wrong context (after all, will you have me for lunch?). However, it can also be used in a relational context. For example, see the sentences below, from COCA:

4              “You have me at a disadvantage” (Fiction)
5              “Will you have me back on the show and apologize in person?” (TV News)
6              “Once you have him chitchatting, he might inadvertently let something slip.” (Magazine)
7              “I can do anything if I have you with me.” (News)

Here, there is still the essence of possession, as typically the subject is in a position of authority. However, in the last example, [7], we can see the use of “have” in the relational context, which is where [3C] comes in.

The sentence “I got you” shows the complexity of the verb ‘to get’, since it holds multiple meanings. With nouns, often the “to get” is also an indication of possession (“I got coffee [for you]”). However, with pronouns, the verb “to get” is a signal of understanding (“she gets me” or “he gets her”).

The syntax of the last sentence, “I’ve got you” is a mix of both—possession and understanding. This combination is so much more meaningful than the individual terms, making the syntactic construction “have got” more complex than the simple “formal/informal” narrative we traditionally learn.

3C “I’ve got you.”
9 “You’ve got me.”

In the four “words” above (technically three words, one of which is a contraction), we are expressing both possession and understanding: in other words, trust. When I say “he’s got her”, I am indicating that the object (“she”) can trust the subject (“he”) because the subject is capable of taking on (possessing) some burden, and because the subject understands the weight of that burden.

The expression of trust and wanting to be trusted is especially clear in first- and second-person pronouns, such as in [3C] and [9]. It’s the kind of language (perhaps my literal love language) that exists in relationships, when we rely on one another immensely as day-to-day safety nets. But it’s also something I find myself saying to friends and close compadres. When I use [3C], I am expressing appreciation for you by indicating how much I trust you. When I use [9], I am expressing a desire to be trusted.

So much can be said in so few words.