Let’s find that out using statistics

Image by _Alicja_ at Pixabay

I have read a few stories on Medium about writing advice, and some of them suggested, among other tips, that putting numbers in your story’s title will increase its number of views: people tend to be more attracted to such headlines, so more of them will click on your story.

It seems interesting that people are attracted by such headlines. But I don’t like to take things for granted. I want to convince myself that this claim is actually true.

So, what have I been thinking? Let’s use statistics to check whether this claim holds. But statistics is useless without data, so I first need some data about Medium articles to run a hypothesis test on. I used Python and Beautiful Soup to scrape data about a random set of 6K+ Medium articles from 7 different publications. This dataset can be found on Kaggle. If you want to see how I scraped this data, I have an article about that here:

What we are going to do now is split this dataset into 2 groups (or samples): one with numbers in the headlines and one without. Then, we will do a hypothesis test on the expected number of claps in these 2 groups. We use the number of claps as a measure of “how successful” a story is, although a more logical variable for our scenario would be the number of views, as it is the one most directly affected by our choice of title. People typically click on a story because of the preview they see (including headline and image), and only after reading the story do they decide whether to clap. But because the number of views is not publicly shown on Medium, we use the number of claps, which should be highly correlated with views (the more views a story gets, the more likely it is that someone claps).

If you are not familiar with hypothesis testing, here is an article you can read:

That being said, we will consider the following model:

Sample 1: Articles with numbers in headlines

We will model the number of claps inside this group as n i.i.d. (independent and identically distributed) random variables: X₁, X₂, …, Xₙ with expected value µ₁ and variance σ₁², both of which are finite.

Sample 2: Articles without numbers in headlines

We will model the number of claps inside this group as m i.i.d. random variables: Y₁, Y₂, …, Yₘ with expected value µ₂ and variance σ₂², both of which are finite.

We formulate the null hypothesis as “articles with numbers in headlines bring no improvement over articles that have no numbers in headlines”, and the alternative hypothesis as “articles with numbers in headlines are more successful compared to articles without numbers in headlines”.

Mathematically, this means:

H₀: µ₁ – µ₂ ≤ 0

H₁: µ₁ – µ₂ > 0

We will consider the following test statistic:

Z = ((X̄ₙ – Ȳₘ) – (µ₁ – µ₂)) / √(σ₁²/n + σ₂²/m)

where X̄ₙ and Ȳₘ are the averages of sample 1 and sample 2, respectively.

Because both sample sizes are large, the Central Limit Theorem lets us approximate the probability distribution of our test statistic Z very well by a standard normal distribution, and the variances estimated from our data should be very close to the true variances σ₁², σ₂². So, when we compute the test statistic, we can just substitute the estimated variances for σ₁², σ₂².

But what about µ₁ – µ₂? By assuming H₀ to be true, it follows that µ₁ – µ₂ ≤ 0. We choose µ₁ – µ₂ = 0 because, among all the values allowed by H₀, this one gives the largest probability of a type I error (we don’t want to underestimate the error).

Now, let’s run some Python code. We start by importing the required packages and defining a utility function: like(x, pattern). This function is used to match regular expressions in pandas data frames; x is the column, and pattern is a regular expression. I named this function after SQL’s LIKE operator as it is meant to do something similar, but for pandas data frames.
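The original code for this step is not reproduced here, so the following is only a minimal sketch of what such a helper might look like. It assumes that like() returns a boolean mask that is True wherever the pattern is found anywhere in the string; the sample titles are made up for illustration.

```python
import pandas as pd


def like(x: pd.Series, pattern: str) -> pd.Series:
    """Return a boolean mask: True where `pattern` is found in the string.

    Assumption: missing values are treated as "no match".
    """
    return x.str.contains(pattern, regex=True, na=False)


# Made-up titles, just to show the behaviour.
titles = pd.Series(["5 Tips for Writing", "Why I Write", "Top 10 Python Tricks"])
mask = like(titles, r"\d")
print(mask.tolist())
```

The mask can then be used to index a data frame, much like a WHERE ... LIKE clause filters rows in SQL.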

After that, we read the CSV file into a pandas data frame:
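As a rough illustration of this step, the sketch below reads CSV data with pandas. The in-memory text is a hypothetical stand-in for the Kaggle file, and only the “title” and “claps” columns mentioned in this article are assumed to exist.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV file. With the real dataset you
# would pass the path of the downloaded file to pd.read_csv instead.
csv_text = (
    "title,claps\n"
    "5 Tips for Writing,120\n"
    "Why I Write,45\n"
    "Top 10 Python Tricks,300\n"
)

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.dtypes)
```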

We make sure we don’t have missing values in the “title” or “claps” columns:
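One possible way to do this check, sketched on made-up rows: count the missing values in the two columns and, as an assumed fallback, drop any row where either one is missing.

```python
import pandas as pd

# Made-up rows; the second has no title and the third has no clap count.
df = pd.DataFrame(
    {
        "title": ["5 Tips for Writing", None, "Why I Write"],
        "claps": [120, 40, None],
    }
)

# Count missing values per column.
missing = df[["title", "claps"]].isna().sum()
print(missing)

# Assumed fallback: keep only rows where both columns are present.
df = df.dropna(subset=["title", "claps"])
print(len(df))
```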

Then, we create 2 new data frames (numbers/no-numbers) using the like() function defined earlier:
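A sketch of the split, under the assumption that “has a number” means “contains at least one digit” (the exact regular expression used originally is not shown here). Note that a digit-only pattern would miss spelled-out numbers such as “Five”. The like() helper is redefined so that the snippet runs on its own, and the rows are made up.

```python
import pandas as pd


def like(x: pd.Series, pattern: str) -> pd.Series:
    # Boolean mask: True where `pattern` is found in the string.
    return x.str.contains(pattern, regex=True, na=False)


# Made-up rows for illustration.
df = pd.DataFrame(
    {
        "title": ["5 Tips for Writing", "Why I Write", "Top 10 Python Tricks", "On Reading"],
        "claps": [120, 45, 300, 80],
    }
)

has_number = like(df["title"], r"\d")   # assumed pattern: any digit
numbers = df[has_number]                # sample 1: numbers in headline
no_numbers = df[~has_number]            # sample 2: no numbers in headline

print(len(numbers), len(no_numbers))
```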

These 2 new data frames are shown below:

After that, we compute the quantities that we need for the test statistic:
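The quantities in question are the two sample sizes, the two sample means, and the two estimated variances. A minimal sketch on made-up clap counts follows; the variable names are my own, and the unbiased variance estimate (ddof=1) is an assumption, although with thousands of articles the choice makes little difference.

```python
import pandas as pd

# Made-up clap counts standing in for the two samples.
numbers = pd.DataFrame({"claps": [10, 20, 30]})
no_numbers = pd.DataFrame({"claps": [5, 15]})

n = len(numbers)       # size of sample 1
m = len(no_numbers)    # size of sample 2

x_bar = numbers["claps"].mean()      # average claps, sample 1
y_bar = no_numbers["claps"].mean()   # average claps, sample 2

# Unbiased sample variances (ddof=1 is also the pandas default).
var_x = numbers["claps"].var(ddof=1)
var_y = no_numbers["claps"].var(ddof=1)

print(n, m, x_bar, y_bar, var_x, var_y)
```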

Now, we compute the test statistic and the p-value. In our case, because we’re doing a one-sided test, the p-value is the area to the right of our test statistic under a standard Gaussian:

And we got a p-value much smaller than the usual threshold of 0.05. That’s good news: we can reject the null hypothesis very confidently.
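The computation can be sketched with the standard library alone. The function name one_sided_z_test and the input values are hypothetical, not the figures from the dataset. The right-tail area of a standard normal is computed as 0.5 · erfc(z / √2); scipy.stats.norm.sf(z) would give the same value.

```python
import math


def one_sided_z_test(x_bar, y_bar, var_x, var_y, n, m):
    """Two-sample z-test of H0: mu1 - mu2 <= 0 against H1: mu1 - mu2 > 0.

    Uses mu1 - mu2 = 0 (the boundary of H0) and plugs the estimated
    variances in for the true ones.
    """
    z = (x_bar - y_bar) / math.sqrt(var_x / n + var_y / m)
    # Area to the right of z under the standard normal density.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up inputs for illustration only.
z, p = one_sided_z_test(x_bar=20.0, y_bar=10.0, var_x=100.0, var_y=100.0, n=100, m=100)
print(z, p)

# Equal sample means give z = 0 and a p-value of 0.5.
z0, p0 = one_sided_z_test(10.0, 10.0, 100.0, 100.0, 100, 100)
print(z0, p0)
```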

For a significance level of α = 0.001, we have p ≈ 0.0009 < α, and therefore we reject the null hypothesis in favor of the alternative. In plain English, this means: if numbers in headlines made no difference, a gap in average claps at least this large would show up in fewer than 1 in 1,000 samples, so the data give strong evidence that stories with numbers in their headlines are expected to have more claps than stories without numbers in headlines.

You can find the Jupyter notebook on Kaggle.

I hope you found this information interesting and thanks for reading!

This article is also posted on Medium here. Feel free to have a look!


