Extract data about 6K+ articles from 7 different publications
Whether you want to analyze blogging and find out which factors make some articles more successful than others, or you simply want to practice your web scraping skills, this project of scraping data about Medium articles is not trivial, and I think it’s worth sharing here.
This article is purely informative; I do not encourage all of you readers to start running the code and exhausting Medium’s servers.
There is a link to the scraped data at the end of the article for those who are interested in downloading/using it.
For those who are new to web scraping, it may be helpful to first read my introductory article about web scraping:
… and then continue with this one.
That being said, now let’s get started.
In order to be useful for drawing statistical conclusions or for machine learning, the articles should ideally be chosen at random. E.g., we don’t want to scrape only the most popular or only the least popular articles. And we want to scrape articles over a longer period of time, not just all the posts we find on one particular day; maybe that day was an exception, an outlier that does not represent the general trend.
At first glance, this may seem a little difficult. If you just go to Medium’s main page without being logged in (which is usually the case when we make a GET request from Python), you will see something like this:
… a page without any articles or a link/button saying something like: “Hey, here is a list of all our posts in year n”.
So, we need to be a little creative. Let’s try some publications. If you visit a publication’s home page you will find some posts there, but the problem is that they only show you the posts featured on that publication at that moment in time. If we extract just those, the sample is not as random as I want.
After googling for a while, I found something more appropriate for what I wanted: it seems that if you append something of the form “/archive/year/month/day” to the URL of a publication, you get a web page that contains summary information about all the posts published on that day.
So now, my strategy for scraping that data is the following:
- have a list with medium publications that publish often
- pick a year over which to scrape the data (I used 2019)
- randomly select n days from a year (I used n = 50)
- for each such selected day, scrape all articles from your list of publications
Such an archive page looks like this:
Now, based on what we see in this page I want for each article to extract the following data:
- URL of the article
- number of claps
- number of responses
- reading time
- date of publishing
Now, let’s dive into code.
We start by importing the required packages and defining a dictionary of publications to scrape data from:
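The exact list of seven publications used for the dataset is not reproduced here; as a sketch, the setup could look like this (the publication slugs and URL templates below are assumptions):

```python
import random
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical mapping of publication names to archive URL templates;
# the seven publications actually scraped may differ.
# The placeholders {0}, {1:02d}, {2:02d} stand for year, month, day.
publications = {
    'Towards Data Science': 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}',
    'UX Collective': 'https://uxdesign.cc/archive/{0}/{1:02d}/{2:02d}',
    'The Startup': 'https://medium.com/swlh/archive/{0}/{1:02d}/{2:02d}',
}
```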
The convert_day(day) function takes as parameter a day of the year (a number from 1 to 365) and returns a tuple of the form (month, day) telling us the month, and the day of that month, on which it falls. For simplicity, we ignore February 29.
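A straightforward implementation consistent with that description could look like this:

```python
def convert_day(day):
    """Convert a day of the year (1-365) to a (month, day) tuple.

    February 29 is ignored, so a non-leap-year calendar is assumed."""
    month_days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    month = 0
    # Subtract whole months until the remaining count fits in the current month
    while day > month_days[month]:
        day -= month_days[month]
        month += 1
    return (month + 1, day)

print(convert_day(60))  # (3, 1): day 60 falls on March 1 in a non-leap year
```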
The get_claps(claps_str) function converts a string, in the form in which we find it on the web page, to an integer representing the number of claps. More exactly, if the string is something like “381” it returns the integer 381, if it is “2.1K” it returns 2100, and if it is an empty string it returns 0.
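A sketch of this conversion, covering the three cases described above:

```python
def get_claps(claps_str):
    """Parse Medium's clap-count string into an integer.

    '381' -> 381, '2.1K' -> 2100, '' (or None) -> 0."""
    if not claps_str:
        return 0
    if claps_str.endswith('K'):
        # 'K' means thousands; round to avoid float truncation (e.g. 1.9 * 1000)
        return int(round(float(claps_str[:-1]) * 1000))
    return int(claps_str)

print(get_claps('2.1K'))  # 2100
```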
The get_img(img_url, dest_folder, dest_filename) function downloads the image at img_url into dest_folder under the name dest_filename.extension, and then returns the filename including its extension (which is identified from the URL).
We will save these images into a folder named images, and put the filenames into our table of data.
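One way to implement it is sketched below; the fallback to “jpg” when the URL has no clean extension is my assumption, not necessarily how the original code handles it:

```python
import os
import requests

def get_img(img_url, dest_folder, dest_filename):
    """Download the image at img_url into dest_folder and return its filename.

    The extension is taken from the last dot-separated part of the URL;
    'jpg' is used as a fallback when that part does not look like an
    extension (an assumption made for this sketch)."""
    ext = img_url.split('.')[-1]
    if len(ext) > 4 or '/' in ext:
        ext = 'jpg'
    filename = f'{dest_filename}.{ext}'
    os.makedirs(dest_folder, exist_ok=True)
    with open(os.path.join(dest_folder, filename), 'wb') as f:
        f.write(requests.get(img_url, allow_redirects=True).content)
    return filename
```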
Next, we randomly select 50 out of 365 days:
selected_days = random.sample([i for i in range(1, 366)], 50)
When we access a URL of the form “https://mediumpublication/archive/year/month/day”, there is a chance that no article was published on that day in that publication. In this case, we are redirected to the URL “https://mediumpublication/archive/year/month”, which contains the top 10 most popular articles of that month, and that’s not what we want. So, whenever this happens, we will just skip this page, using this piece of code:
if not response.url.startswith(url.format(year, month, day)): continue
And the code below puts everything together and collects all the data into a list named data:
Now, we construct a data frame from the data list:
medium_df = pd.DataFrame(data, columns=[ 'id', 'url', 'title', 'subtitle', 'image', 'claps', 'responses', 'reading_time', 'publication', 'date'])
And then save them into a CSV file:
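Saving is a single call to to_csv. In the self-contained sketch below, the single row is placeholder data purely for illustration; only the column layout matches the data frame built above:

```python
import pandas as pd

# Placeholder row for illustration; real values come from the scraper.
medium_df = pd.DataFrame(
    [[0, 'https://example.com/some-post', 'A Title', 'A Subtitle', 'img0.jpeg',
      381, 5, 4, 'SomePublication', '2019-03-14']],
    columns=['id', 'url', 'title', 'subtitle', 'image', 'claps',
             'responses', 'reading_time', 'publication', 'date'])

# index=False keeps the pandas row index out of the CSV file
medium_df.to_csv('medium_data.csv', index=False)
```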
And that’s it: by now we have the data stored in the “medium_data.csv” file and all the heading images inside the “images” folder.
- The above scraped dataset can be found here: https://www.kaggle.com/dorianlazar/medium-articles-dataset
- The jupyter notebook used for scraping can be found here: https://github.com/lazuxd/medium-scraping
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!