The devil in the data: are women as aggressive online as men?

So, sometimes things grind along veeeeery slowly in the background, and eventually, finally, at long last, I get to them. This has been one of those slow-burn projects. To understand what’s going on, it’s worth briefly casting our eyes back to this post. Alternatively, if you want a very short overview, then here it is…

On the 26th of May, 2016, Demos thinktank’s Centre for the Analysis of Social Media (CASM) presented to the House of Commons the results of an investigation into the use of misogynistic terms on Twitter. The work was a collaboration between Demos and the TAG laboratory at the University of Sussex, and the release of the report was timed to coincide with a political, cross-party campaign, Recl@im the Internet. Launched by Labour MP Yvette Cooper, Recl@im the Internet describes its aim thus:

Our goal is to prevent inadequate communication on the Internet as well as educate the public how to handle cases of inappropriate communication. Our goal is to point out the problems that all the populations face on the Internet and to show by the example the ways of better practice. (reclaimtheinternet.com/about)

In support of this campaign, the report included the results of,

a small scale study examining the use of two popularly used misogynistic terms (‘slut’ and ‘whore’) on the social media platform Twitter. (…) The objective was to provide a general overview of the volume and nature of how these two terms are being used. (Demos 2016: 1)

The report tackled a range of questions, such as how many times per day these terms are used on Twitter, how many of the uses could be considered aggressive, other kinds of use, the number of users this represented, the subset of those in the UK, whether those tweets mentioned others, and the gender of both the senders and recipients.

Anyway, what bothered me, amongst many things (see that blog post if you want to know the rest), was that the report predictably triggered headlines such as the following:

Twitter abuse – ‘50% of misogynistic tweets from women’ – BBC News

Half of misogynistic tweets sent by women, study finds – The Guardian

Women are responsible for half of online abuse, study finds – The Telegraph

Half of the sexist harassment women face on Twitter comes from other women – Business Insider

And so on. Note this exclusive focus on women abusing women? We could spend months wrangling over the subtext, whether this is that women should behave better, or that women should somehow be impervious to misogyny and therefore not reproduce it, or that if women are doing half of it then it’s okay/not that bad/whatever if men are also doing it, but overall, this entire enterprise annoyed me because it felt like cheap, rock-‘em-shock-em headlines from a man-bites-dog – or woman-trolls-woman – story.

The main point of this blog post is to put those headlines, and the claims that they are based on, to the test. In a small, manually coded dataset, do we find similar results? In short: do women use misogynistic attacks as often as men?

At this point I should stress that the report is clear that it is limited in scope. It notes that, “it does not claim to be a comprehensive analysis of all misogynistic words being used on Twitter” (Demos 2016: 1) and in an article that I am slowly writing up, I’ll spend endless thousands of words crossing all the Ts and dotting all the lowercase Js on a bunch of other issues besides. For a blog post, however, I thought it would be nice to share some early results on the intersection between supposed gender and supposed behaviour.

Data

Back in 2016, then, I used DataSift to collect a corpus of tweets containing either slut or whore. This was named the Mega Slut/Whore Analysis of Twitter (MEGASWAT) corpus. The corpus collection ran from 00:00, 01 March 2016 to 00:00, 23 March 2016, just over three weeks, and produced a 2.72GB dataset of 832,184 tweets. Such a dataset is stupidly large for manual analysis, but then that was never its point. MEGASWAT will have its uses in days and weeks and, er, years to come, worry not. What I then needed was a MINISWAT: something much smaller that I had a fighting chance of manually coding within a sensible timeframe. Thus, I switched on the mighty little Ant of Fire (FireAnt) and at 12:26 on Tuesday 31 May 2016, MINISWAT_1 came into being. Two hours and fifty minutes later, I stopped the MINISWAT_1 collection at a magnificent 8,007 tweets (~1% of the size of MEGASWAT).
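If you wanted to carve a comparable ~1% subsample out of an existing collection rather than running a fresh live capture (which is how MINISWAT_1 was actually made), the step is trivial to script. Below is a minimal Python sketch, assuming the corpus has been exported as a JSON-lines file with one tweet object per line and a text field; the filenames, field names, and sample size are purely illustrative, and this is emphatically not how DataSift or FireAnt work internally.

```python
import json
import random

SEARCH_TERMS = ("slut", "whore")
SAMPLE_SIZE = 8000  # roughly 1% of the full MEGASWAT collection

def contains_term(text):
    """Return True if the tweet text contains either search term."""
    lowered = text.lower()
    return any(term in lowered for term in SEARCH_TERMS)

matching = []
with open("megaswat.jsonl", encoding="utf-8") as handle:  # hypothetical export
    for line in handle:
        tweet = json.loads(line)
        if contains_term(tweet.get("text", "")):
            matching.append(tweet)

random.seed(42)  # make the draw repeatable
miniswat = random.sample(matching, min(SAMPLE_SIZE, len(matching)))

with open("miniswat_sample.jsonl", "w", encoding="utf-8") as out:
    for tweet in miniswat:
        out.write(json.dumps(tweet) + "\n")
```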

Coding

Over the 2018 summer, I finally had the spare time and research funds to have an RA sit and laboriously code the data for me, and after a solid two weeks of work, just over a quarter of it was done. Our grand total of tweets, for this toy analysis and a pre-match warm-up before the article, is a corpus of 2,152 manually tagged tweets. Every tweet was tagged for four major categories, and I discuss each in turn below, along with some preliminary results, and much more importantly, all the issues that went along with them.

Art thou pr0n, bot, or human?

Literally the least amazing thing about the internet is how awash it is with porn. Twitter is no exception. We therefore coded tweets as porn, bot, or human. Identifying porn is simple, and a lot of porn accounts identify themselves as containing sensitive material, so it was easy to mark up the majority of these. Bots usually also self-identify as such, and even where they don’t, they tend to spew linguistic garbage, so they’re another easy spot. Finally, some tweets crept into the collection only because one of the search terms occurred elsewhere in the Twitter data, and these were coded as non-applicable (NA). How did our data look?

Account type Raw frequency Percentage Demos
Porn 1,674 77.8% 56%
Bot 18 0.8% ?
NA 8 0.4% ?
Human 452 21.0% 44%?
Total 2,152 100% 100%

So our dataset is nearly 80% garbage, which is frustrating, and yet, inevitable. That leaves us with a tiny dataset of 452 manually tagged, useful tweets, so everything that’s claimed from here on out should be taken in light of the smallness of this dataset.
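As an aside, a chunk of this garbage could in principle be pre-screened automatically before a human coder ever looks at it. The sketch below is purely illustrative and not what we actually did (our coding was entirely manual): it assumes each tweet has been exported as a dict carrying Twitter’s possibly_sensitive flag and the author’s bio, and the keyword list and threshold are invented for the example.

```python
def likely_porn(tweet):
    """Crude pre-screen: flag tweets that self-identify as sensitive
    or whose author bio uses obvious adult-content keywords."""
    if tweet.get("possibly_sensitive"):
        return True
    bio = (tweet.get("user", {}).get("description") or "").lower()
    adult_keywords = ("18+", "nsfw", "adult content")  # illustrative only
    return any(keyword in bio for keyword in adult_keywords)

def likely_bot(tweet):
    """Crude pre-screen for the 'linguistic garbage' pattern:
    a very low proportion of alphabetic characters in the text."""
    text = tweet.get("text", "")
    if not text:
        return True
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) < 0.5  # threshold is a guess

def pre_screen(tweet):
    """Return a provisional label for a human coder to confirm."""
    if likely_porn(tweet):
        return "porn?"
    if likely_bot(tweet):
        return "bot?"
    return "human?"
```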

Author gender

The next issue was determining the gender of the author of each tweet. Demos used their software, Method52, to do this. Since this software is proprietary, we’re never likely to get a clear idea of how it came to its conclusions, but a typical method is to use a dictionary of names that have been precoded for gender, and then apply this against the name on the account. Whether this is applied to name or username…? ¯\_(ツ)_/¯ (Again, you can read more about that whole discussion and some of the issues I had with it here.)
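We can, however, sketch the typical name-dictionary approach just described. To be clear, this is a toy illustration rather than anything Method52 actually does: the dictionary here is tiny and hand-rolled, whereas a real system would rely on many thousands of precoded names.

```python
# Minimal sketch of the name-dictionary approach (not Method52's actual method).
GENDERED_NAMES = {
    "yvette": "female",
    "sarah": "female",
    "harry": "male",
    "david": "male",
}  # a real dictionary would contain many thousands of precoded names

def guess_gender(display_name):
    """Guess account gender from the first token of the display name."""
    tokens = display_name.split() if display_name else []
    if not tokens:
        return "unknown"
    return GENDERED_NAMES.get(tokens[0].lower().strip("@"), "unknown")

print(guess_gender("Yvette Cooper"))   # female
print(guess_gender("xx_gamer_2016"))   # unknown
```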

Anyway, manually coding for gender is no mean feat, and everything I say here goes for the Target gender section too. Of course, like software, we can look at usernames, but they are not always clear. Often the profile must also be checked for further informative clues and hints, but then the picture might be of Maru, or Harry Styles, or Rainbow Dash. The bio might contain useful clues (“mother of two”), or contradictory information (“mother of two, father of all”), or nothing. The tweets themselves may hint one way or another based on interests and how others address that user, but ultimately, determining gender was both laborious and full of uncertainty. As a result, author and target gender were also coded for low, medium, or high levels of confidence. A new gender value was created every time a new type of account-holder occurred, but within the tweet author categories, the only three that arose were female, male, and unknown. This last value, as its name suggests, was for those cases where the gender simply could not be divined. The breakdown looks as follows:

Author gender Raw frequency Percentage Demos*
Female 177 39.2% 53%
Male 156 34.5% 47%
Unknown 119 26.3% 0%
Total 452 100% 100%

*Note that Demos only categorised the gender of their aggressive tweets.

Anyway, in our data, then, women are sending slightly more tweets than men. The difference isn’t much, but for later analysis, acknowledging it is crucial, since if left unaccounted for, it will skew the results. Annoyingly, a quarter of our dataset is gender-unknown, but this is not entirely because the accounts provided no clues. Unfortunately, this two-year-old dataset is rotting. Plenty of the accounts have been suspended, so there was nothing to go back to for further information. And this is precisely why this has become a toy dataset for a blog post, rather than a serious one for a journal article. (The new dataset, MINISWAT_2, is being coded as we speak.) This said, remember that an automated gender-detecting algorithm would almost certainly not be so comprehensive as to concern itself with profile pictures or bios or other people’s terms of endearment, and so forth, so whilst I consider this missing information a loss, alternative approaches would not have bothered with its presence or absence at all. And this is also why we have such a tiny dataset: checking the gender of every author and target properly is extremely time consuming and laborious.

What of our confidence levels though?

Author gender Low Medium High Total
Female 5 20 152 177
Male 1 22 133 156
Unknown 0 0 119 119
Total 6 (1.33%) 42 (9.29%) 404 (89.38%) 452 (100%)

For ~90% of the accounts that could be coded then, the confidence was high, and for ~99% of them it was medium or high. It’s worth noting, though, that humans can be prone to over-confidence in their assessments, especially retrospectively, so this is indicative only, but it’s useful for showing that one in every hundred or so tweets gave pause for thought when it came to judging this characteristic.

Target gender

Our third category was the gender of the target, and here things got super messy. For a start, people tweet about people without mentioning them, they tweet about people by name without using their Twitter handle, they reply to many users at once, they respond to organisations and people and bots all at once, and some even tweet at themselves. As a result there were many more values in this category, which are most easily digested simply by looking at a table:

Target gender Raw frequency Percentage
Female(s) 91 20.1%
Female(s) & unclear other(s) 2 0.4%
Male(s) 48 10.6%
Male(s) & female(s) 3 0.7%
None 157 34.7%
Organisation(s) & unknown(s) 1 0.2%
Organisation(s) & male(s) 1 0.2%
Unknown 149 33.0%
Total 452 100%

Note that there doesn’t seem to be a clear discussion in the Demos paper about how they categorised target gender. I assume it was done in the same way as author gender but again, either I’m not seeing it in my eternally sleep deprived state or it wasn’t included.

Again, what of our confidence?

Target gender Low Medium High Total
Female(s) 3 2 86 91
Female(s) and unclear other(s) 0 0 2 2
Male(s) 1 7 40 48
Male(s) and female(s) 0 0 3 3
None 0 0 157 157
Organisation(s) and unknown(s) 0 0 1 1
Organisation(s) and male(s) 0 0 1 1
Unknown 0 0 149 149
Total 4 9 439 452

O thou tweet of nature

And now we come to the category that caused enough headaches to keep a pharmaceutical company in fine Christmas bonuses for a long time. The “nature” category essentially tried to capture what the tweet was doing. Was it an attack? A joke? A quote? A discussion? The taxonomy was effectively built through an iterative process – every time a new species was found in the wild, if it was sufficiently different to anything that had gone before, it got its own label. If multiple things occurred that were clearly interrelated (e.g. slut for and whore for) they were merged. This meant exhaustively returning to the start to update the codes already applied up to that point, so that whilst only 452 tweets were coded in the end, some of them have been coded as many as twenty times. Such is the nature of elaborating a meaningful taxonomy. Anyway, the table below shows and describes the categories with an example for each and the results besides:

Value | Description | Example | Raw frequency | Percentage | Demos
aggression direct | An attack with a clear and direct target. May be supported by second person pronoun use. | you dress like a slut | 66 | 14.60% | 33% (joint with aggression indirect)
aggression indirect | An attack with a clear but indirect target. May be supported by third person pronoun use. | SHES A FUCKING SLUT | 97 | 21.46% | 33% (joint with aggression direct)
discussion | Discussion about the term or concept slut or whore. | i bet it’s easier to be a slut if you’re in america | 109 | 24.11% | 58% (joint with quote)
in-joke | Personal, friendly, sociable use – a joking pseudo-insult. | Happy birthday, you fucking slut. | 55 | 12.17% | 9% (joint with slut for)
literal | Use of whore in its literal sense to mean prostitute. | Tyrion didn’t fuck a whore in Winterfell | 5 | 1.11% |
quote | Quotes from news articles, lyrics, fiction, etc. | what a shame what a shame the poor grooms bride is a whore | 8 | 1.77% | 58% (joint with discussion)
not a slut/whore | Denial/negation of insult. | I’m not a whore I just think like one | 11 | 2.43% |
slut drop | Use of the phrase “slut drop”. | all the gyals are dying from slut dropping too much | 2 | 0.44% |
slut walk | Use of the phrase “slut walk”. | and you think slut walk is a productive name? | 14 | 3.10% |
slut/whore for; x-slut/whore | Use of the phrase “slut/whore for” or “[entity] slut/whore”. | Spotify whore | 45 | 9.96% | 9% (joint with in-joke)
unclear | Unable to determine a clear function or purpose. | can find nthg to whore at a. | 40 | 8.85% |

By contrast to this taxonomy, Demos had three categories:

  • aggressive – based on their examples, this would capture aggression direct
  • self-identity – based on their examples, this would capture slut for and in-joke, but beyond that…?
  • other – based on their examples, this would capture discussion and quote, but again, beyond that…?

There are plenty of issues with Demos’ taxonomy, not the least of which is that it’s painfully simplistic. Let’s make no bones about it though – there are plenty of issues with the taxonomy I came up with too, the most obvious being that mine isn’t simple enough. And there are other issues besides, some of which can also be said about Demos’.

Firstly, several examples could be multicoded. For instance, consider the tweet given for slut walk: “and you think slut walk is a productive name?” and the one given for not a slut/whore. Are these better coded as discussion? As a category, discussion is effectively too big, but it’s also useful for capturing plenty of instances that would otherwise be left uncoded or would all have to be so finely graded apart that the number of categories would proliferate to the point of absurdity. Equally, the slut drop/slut walk phrases themselves have a great deal going on, but those issues are too complex to be unpicked in (what is meant to be) a tiny blog post.

Secondly, the literal category is contentious, since it only covers whore, and counts a use as literal when the word is being employed in its (unpleasant) sense of prostitute. A “literal” definition of slut is extremely problematic, since by its nature the word is an insult and a form of attack. Again, though, this is a messy issue that’s too complex for this post.

Anyway, my point here is about aggressive and insulting uses, and when direct and indirect aggression are combined, aggression overall accounts for over a third (36%) of the dataset – a surprisingly similar number to the one Demos arrived at. Discussion, that big baggy category that I’m not especially thrilled with, accounts for the next biggest proportion, at 24%, and crucially, in third place, we have in-joke usages.

What the above doesn’t tell us, though, is who is using these kinds of tweets, and to whom they are directed, and that’s why we’re here, right? So let’s finish this thing off and get to the big reveal.

Who is sending what to whom?

To simplify things, I have only considered tweets where the author’s gender is supposedly known, and the target’s gender is also supposedly known (whether this is indicated via the account, their name, pronoun use, and so forth), and where there is a simple male-to-male, male-to-female, female-to-male, or female-to-female dimension. This cut out all the unknowns, organisations, and mixed audiences. So, let’s get to the results:

Gender is given as author>target:

Nature of tweet F>M F>F M>M M>F Total
aggression direct 0 11 2 12 25
aggression indirect 4 8 3 15 30
discussion 3 3 1 4 11
in-joke 3 13 9 0 25
quote 3 0 0 0 3
literal 0 0 0 0 0
not slut/whore 0 1 0 0 1
slut drop 0 0 0 0 0
slut walk 4 0 0 0 4
slut/whore for; x-slut/whore 0 1 0 0 1
unclear 1 0 3 1 5
Total 18 37 18 32 105
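For what it’s worth, once the manual coding exists, a cross-tabulation like the one above takes only a couple of lines to produce. Here is a minimal sketch with pandas, assuming (hypothetically) that the coded tweets sit in a CSV with author_gender, target_gender, and nature columns; the file and column names are mine, not part of any released dataset.

```python
import pandas as pd

coded = pd.read_csv("miniswat_1_coded.csv")  # hypothetical export of the manual coding

# Keep only the simple single-gender dyads described above.
dyads = coded[
    coded["author_gender"].isin(["female", "male"])
    & coded["target_gender"].isin(["female", "male"])
].copy()

# Label each tweet with its direction, e.g. "F>M" = female author, male target.
dyads["direction"] = (
    dyads["author_gender"].str[0].str.upper()
    + ">"
    + dyads["target_gender"].str[0].str.upper()
)

table = pd.crosstab(dyads["nature"], dyads["direction"], margins=True, margins_name="Total")
print(table)
```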

At this point, we’re right down to a mere 105 tweets, but what do we see from this tiny, artisanal, carefully winnowed dataset? Let’s start with same-tweeting-same.

When females tweet females, the two dominant categories are aggression direct (e.g. “Badge bunny,  home wrecking want-to-be slut !!!! Hope u learned a lesson little girl!”), and in-joke (e.g. “love ya u slut X”). When males tweet males, the dominant category is in-joke (e.g. “thought you were off last week you slut?”).

But what about cross-gender communication? Well, when females tweet males, there isn’t any direct aggression, and the tweets are fairly well distributed between the various categories. And now we come to the crunch: what happens when males tweet females? The two dominant categories are aggression direct (e.g. “yeah ill be there to put fucking kidney stones bitch u asshole u satanic whore”) and aggression indirect (e.g. “Katy Perry is a third rate talent and a liberal slut”).

From here, if we focus purely on aggression, it’s easiest to address this data as a series of questions:

  1. Of all the aggressive tweets in this dataset (55 altogether), do males or females SEND more?
    58.2% are sent by males (32 altogether)
    41.8% are sent by females (23 altogether)
  2. Of all the aggressive tweets in this dataset (55 altogether), do males or females RECEIVE more?
    83.6% are targeted at females (46 altogether)
    16.4% are targeted at males (9 altogether)
  3. When males send aggressive tweets (32 altogether), who are they targeting?
    84.4% are targeting females (27 altogether)
    15.6% are targeting males (5 altogether)
  4. When females send aggressive tweets (23 altogether), who are they targeting?
    82.6% are targeting females (19 altogether)
    17.4% are targeting males (4 altogether)
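And in case my mental arithmetic is suspect, here are the same answers computed directly from the counts in the table above (hard-coded so that the snippet stands alone):

```python
# Aggressive tweets (direct + indirect) by author>target direction, from the table above.
aggressive = {"F>M": 0 + 4, "F>F": 11 + 8, "M>M": 2 + 3, "M>F": 12 + 15}
total = sum(aggressive.values())  # 55

sent_by_males = aggressive["M>M"] + aggressive["M>F"]      # 32
sent_by_females = aggressive["F>M"] + aggressive["F>F"]    # 23
aimed_at_females = aggressive["F>F"] + aggressive["M>F"]   # 46
aimed_at_males = aggressive["F>M"] + aggressive["M>M"]     # 9

print(f"Sent by males:             {sent_by_males / total:.1%}")
print(f"Sent by females:           {sent_by_females / total:.1%}")
print(f"Targeted at females:       {aimed_at_females / total:.1%}")
print(f"Targeted at males:         {aimed_at_males / total:.1%}")
print(f"Males targeting females:   {aggressive['M>F'] / sent_by_males:.1%}")
print(f"Females targeting females: {aggressive['F>F'] / sent_by_females:.1%}")
```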

I think that pretty much speaks for itself, but for the sake of joining the dots: in this dataset, males send (slightly) more tweets that are aggressive than females, females are targeted with substantially more tweets that are aggressive than males, and both males and females mostly target females with aggressive tweets.

So are women as aggressively misogynistic online as men? Well, in its extremely tentative and tiny way, this dataset suggests that the picture is WAY more complicated than that. Females are aggressive online, yes, and they target women, yes, but they do so less often than men do, and women are overwhelmingly the targets of aggression. I’m not sure how surprising this result is to anyone, but the dataset is so tiny that it could easily be a fluke. Therefore, we need to see how these results stand up to replication in MINISWAT_2, and any future datasets.

I got 99 problems but a black box method ain’t one

As we know, I took the Demos work apart and inspected all the innards I could get my hands on and then wrote it all up, and I have no intention of holding my own work to a lower standard, so what are some of the issues here? Well, firstly, the taxonomy for the nature of the tweets is an inherently subjective categorisation system, and if given the same example to code, ten different coders could classify it in eleven different categories. Remember, though, that Demos’ own categories suffer from the same problem. Just because a computer coded the data, this does not make the coding categories themselves objective. Rigorous and methodical and ruthlessly logically applied? Yes. Objective? No. It’s essentially a permanent and inescapable problem for almost every kind of data analysis – we have to formulate some form of theoretical framework or interpretation and then apply it. The crucial thing is that in the course of manual annotation, we can recognise that new themes are emerging or finer nuances are required and update our framework accordingly, whereas the creation of a taxonomy that is then rigidly applied by software and only lightly checked afterwards can lead to the forcing of apples and oranges into boxes better suited for blueberries, pears, cherries, quince, and so on.

I’ve also said more than enough about the mess that is gender classification, and how gender isn’t binary, and how even if we “know” gender from looking at the account, people lie. If you were to generate a clone army of Twitter accounts that were to rove around hurling misogynistic abuse at females, there is a perceived tactical advantage to making those accounts look like they’re run by women, right? So there’s a possibility that the results are skewed by this issue. Whatever the case, algorithms make a hash of gender and even if they didn’t, they still wouldn’t know if people were telling the truth anyway.

The finally final point of this problems section, which should have become painfully obvious by now, is that whatever dataset we start off with, it gradually dwindles down to tiny numbers by the end, and to obtain any seriously reliable sort of dataset from which we could reasonably extrapolate, we’d need more like 10,000 manually coded tweets, each rated by 10,000 coders. More still would be more… better…er. But that’s never going to happen, so the next option is small-scale replication, and that’s what we’re aiming for with the MINISWAT_2 dataset in the coming weeks. Or months.

Conclusions

I started out by noting that the Demos research was released in conjunction with a campaign to Recl@im the Internet and that the results were presented to politicians and policymakers at the House of Commons. Yet, I continue to feel that the method that was adopted – or perhaps more accurately, the execution of it – did not hit anything like the standard it should have. As Demos themselves wrote:

It ought to be noted that even the most apparently aggressive tweet may not actually be aggressive when taken out of context (and vice versa). Judging context for each tweet is not possible when dealing with data of this magnitude. (Demos 2016: 6)

In my view, if you’re aiming to inform politicians and shape policy, and if you’re very openly publishing work to be picked up and reproduced by the media which will then inform the public’s knowledge of a matter, then the methods need to be transparent (not black box), and as rigorous as reasonably possible, even if that costs a lot of time and money. The problem is that the sort of coding I’ve had done above is expensive, slow, and cognitively demanding. I am not kidding. Try manually coding a hundred tweets for just four categories and see how soon you’re losing the plot. This invariably creates a pressure to produce faster, cheaper, easier answers and in turn, to not pry too closely into the details.

If the purpose of the analysis is to inform where to focus advertising for a new range of cheap earrings, by all means, take a simplistic black box approach with a few categories that might not be ideal, and don’t worry if the precision or recall isn’t all that high. But when your intention is to try to keep people safe online, to focus political attention, to target limited legislative and policing resources, and to add to the public knowledge about online abuse by having such work published widely in the media, then this demands that the research is handled very carefully, including thoroughly picking apart the data to find exactly where the devil is lurking in all those details.