Let’s find that out using Statistics

Image by Arek Socha at Pixabay

I have read a few stories on Medium about writing advice, and some of them suggested, among other tips, that phrasing your story’s title as a question will increase the number of views: people tend to be more attracted to such headlines, so more of them will click on your story.

But as I come from a mathematical background, I don’t like to take things for granted. I need to see proof; I need to convince myself with some numbers that a given claim is actually true.

So, what have I been thinking? Let’s use Statistics to check whether this claim is actually true. But Statistics is useless without data, so I first need to obtain some data about Medium articles and use it for hypothesis testing. To that end, I used Python and Beautiful Soup to scrape data about a random set of 6K+ Medium articles from 7 different publications. This dataset can be found on Kaggle. If you want to see how I scraped this data, I have an article about that here:

What we are going to do now is split this dataset into 2 groups (or samples): one with questions as headlines and one with non-question headlines. Then, we will do a hypothesis test on the expected value of the number of claps in these 2 groups. We use the number of claps as a measure of “how successful” a story is, although a more logical variable for our scenario would be the number of views, as it is the one most directly affected by our choice of title. People typically click on a story because of the preview they see (including headline and image), and only after they read the story do they decide whether to clap or not. But because the number of views is not publicly shown on Medium, we use the number of claps instead, as it should be highly correlated with views (the more views a story gets, the more likely it is that someone will clap).

If you are not familiar with hypothesis testing, here is an article you can read:

That being said, we will consider the following model:

Sample 1: Articles with questions as headlines

We will model the number of claps inside this group as n i.i.d. (independent and identically distributed) random variables: X₁, X₂, …, Xₙ with expected value µ₁ and variance σ₁², both of which are finite.

Sample 2: Articles with non-questions as headlines

We will model the number of claps inside this group as m i.i.d. random variables: Y₁, Y₂, …, Yₘ with expected value µ₂ and variance σ₂², both of which are finite.

We formulate the null hypothesis as “question-titled articles bring no improvement over non-question-titled articles”, and the alternative hypothesis as “question-titled articles are more successful compared to non-question-titled articles”.

Mathematically, this means:

H₀: µ₁ – µ₂ ≤ 0 versus H₁: µ₁ – µ₂ > 0

We will consider the following test statistic:

Z = (X̄ₙ – Ȳₘ – (µ₁ – µ₂)) / √(σ₁²/n + σ₂²/m)

where X̄ₙ and Ȳₘ are the sample means of sample 1 and sample 2, respectively.

Because both samples are fairly large, the Central Limit Theorem tells us that the probability distribution of our test statistic Z is well approximated by a standard normal distribution. For the same reason, the variances estimated from our data should be very close to the true variances σ₁², σ₂². So, when we compute the test statistic, we can simply substitute the estimated variances for σ₁², σ₂².

But, what about µ₁ – µ₂? By assuming H₀ to be true, it follows that µ₁ – µ₂ ≤ 0. And we choose µ₁ – µ₂ = 0 as this value is the worst-case scenario for the probability of type I error (we don’t want to underestimate the error).

Now, let’s run some Python code. We start by importing the required packages and defining a utility function: like(x, pattern). This function is used to match regular expressions in pandas data frames; x is the column, and pattern is a regular expression.
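The original code cell is not reproduced on this page, so here is a minimal sketch of what the import and the helper might look like. The signature like(x, pattern) comes from the description above; the body, built on pandas' Series.str.contains, and the sample titles are assumptions, not necessarily the author's implementation.

```python
import pandas as pd


def like(x, pattern):
    """Return a boolean mask that is True where the regex `pattern` matches in column `x`."""
    # Assumption: x is a pandas Series of strings; na=False treats missing values as non-matches.
    return x.str.contains(pattern, regex=True, na=False)


# Usage on a tiny made-up column: which titles end with a question mark?
titles = pd.Series(["Is Python Slow?", "A Guide to Pandas", None])
mask = like(titles, r"\?\s*$")
print(mask.tolist())
```

Returning a boolean mask keeps the helper easy to combine with pandas indexing, for example df[like(df["title"], pattern)].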

After that, we read the CSV file into a pandas data frame:

We make sure we don’t have missing values in the “title” or “claps” columns:
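The two steps above (loading the data and checking for missing values) could be sketched as follows. The column names "title" and "claps" come from the text; the inline CSV rows and the "publication" column are made-up stand-ins for the Kaggle file, whose real name and full schema are not given here.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV (made-up rows); in practice, pass the file path to pd.read_csv.
csv_text = io.StringIO(
    "title,claps,publication\n"
    "Is Python Slow?,120,Pub A\n"
    "A Guide to Pandas,,Pub B\n"
    ",15,Pub A\n"
    "Why Do We Sleep?,300,Pub C\n"
)
df = pd.read_csv(csv_text)

# Count missing values in the two columns the test relies on.
print(df[["title", "claps"]].isna().sum())

# Drop any row where either column is missing, so both samples are clean.
df = df.dropna(subset=["title", "claps"]).reset_index(drop=True)
print(len(df))
```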

Then, we create 2 new data frames (questions/no-questions) using the like() function defined earlier:
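A possible version of this split is sketched below. The rule used here, "a headline is a question if it ends with a question mark", is my assumption; the original regular expression is not shown on this page. The data frame is a made-up toy example, and like() is the same assumed helper as before.

```python
import pandas as pd


def like(x, pattern):
    # Assumed helper: boolean mask of rows whose text matches the regex.
    return x.str.contains(pattern, regex=True, na=False)


# Toy data standing in for the scraped articles.
df = pd.DataFrame(
    {
        "title": ["Is Python Slow?", "A Guide to Pandas", "Why Do We Sleep?", "Notes on SQL"],
        "claps": [120, 80, 300, 45],
    }
)

# Assumption: a "question" headline is one that ends with a question mark.
is_question = like(df["title"], r"\?\s*$")
questions = df[is_question]
no_questions = df[~is_question]

print(questions)
print(no_questions)
```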

These 2 new data frames are shown below:

After that, we compute the quantities that we need for the test statistic:
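As a rough sketch, the quantities needed are the two sample sizes, the two sample means, and the two sample variances. The variable names and the toy numbers below are illustrative assumptions, not the values from the real dataset.

```python
import pandas as pd

# Toy samples standing in for the two groups of articles.
questions = pd.DataFrame({"claps": [120, 300, 210]})
no_questions = pd.DataFrame({"claps": [80, 45, 100, 55]})

# Sample sizes n and m.
n, m = len(questions), len(no_questions)

# Sample means, used as the averages of sample 1 and sample 2.
mean_x = questions["claps"].mean()
mean_y = no_questions["claps"].mean()

# Sample variances (pandas uses ddof=1 by default), used in place of the true variances.
var_x = questions["claps"].var()
var_y = no_questions["claps"].var()

print(n, m, mean_x, mean_y, var_x, var_y)
```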

Now, we compute the test statistic and the p-value. In our case, because we’re doing a one-sided test, the p-value is the area to the right of our test statistic under a standard gaussian:
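The computation can be sketched with the standard library alone. The function name one_sided_z_test and the input numbers are hypothetical; the right-tail area uses the identity P(Z > z) = 0.5 · erfc(z / √2) for a standard normal, and µ₁ – µ₂ is set to 0 as discussed above.

```python
import math


def one_sided_z_test(mean_x, mean_y, var_x, var_y, n, m):
    """Z statistic and right-tail p-value for H0: mu1 - mu2 <= 0, evaluated at mu1 - mu2 = 0."""
    z = (mean_x - mean_y) / math.sqrt(var_x / n + var_y / m)
    # Right-tail area under a standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up inputs, chosen only so the arithmetic is easy to follow.
z, p_value = one_sided_z_test(mean_x=1.0, mean_y=0.0, var_x=1.0, var_y=1.0, n=2, m=2)
print(z, p_value)
```

If SciPy is available, scipy.stats.norm.sf(z) gives the same right-tail area.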

For a significance level of α = 0.05, it follows that p ≈ 0.018 < α, and therefore we reject the null hypothesis in favor of the alternative. In plain English, this means: if question headlines truly brought no improvement, a difference in average claps at least this large would show up less than 5% of the time, so the data support the claim that stories with questions as headlines are expected to have more claps than stories with non-question headlines.

You can find the Jupyter notebook on Kaggle.

I hope you found this information interesting and thanks for reading!

This article is also posted on Medium here. Feel free to have a look!


Passionate about Data Science, AI, Programming & Math
