Getting started as a Kaggler
When we want to work on a data science or machine learning problem, it is nice to find that a dataset suitable for it is already available and ready to use on a platform like Kaggle. It makes our life much easier, because collecting data can be a difficult and slow process. Data is the new gold. By making our datasets public and promoting open-source thinking among data science and machine learning practitioners, we can accelerate progress in this field. A good place to do so is Kaggle, which is for data scientists what GitHub is for software developers. If we happen to have collected an interesting dataset, it is good practice to publish it on Kaggle so that others can use it too. Doing so can also build our reputation on Kaggle, which may help in getting a job in the field.
Let us get started.
Now, assuming you already have a dataset that you can publish, the first thing you need to do is create the dataset entry. From your Kaggle homepage, go to the “Data” tab in the left panel:
Next, click on “New Dataset” to create your dataset entry:
Now, a dialog like this opens where you can give your dataset a name, edit its URL, and upload the files:
If your dataset is large, you can upload an archive and Kaggle will automatically decompress it, so that anyone who visits its page can see the individual files in it.
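If you want to build that archive with a script, here is a minimal sketch using only Python's standard library. The function name `zip_dataset` and the folder layout are my own placeholders, not something Kaggle requires. I assume the archive is written outside the folder being packed, so that it does not end up inside itself.

```python
import zipfile
from pathlib import Path


def zip_dataset(source_dir, archive_path):
    """Pack every file under source_dir into a ZIP archive.

    Returns the relative paths stored in the archive, in sorted order.
    Assumption: archive_path lies outside source_dir.
    """
    source = Path(source_dir)
    stored = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to the dataset root so the folder
                # layout is preserved when the archive is unpacked.
                relative = path.relative_to(source).as_posix()
                archive.write(path, arcname=relative)
                stored.append(relative)
    return stored
```

For example, `zip_dataset("my_dataset", "my_dataset.zip")` would pack a hypothetical `my_dataset` folder, and the resulting file is what you upload in the dialog.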
Note the “private” icon in the bottom-right corner of the dialog. When you create a dataset, it is private by default, so only you and the people you specify can access it. This is the preferred way to create it: add the extra information, make sure everything is OK, and only then make it public. You can also create it directly as public by toggling the private/public button in the dialog.
As an example, I will upload a dataset of Medium articles scraped using Python and Beautiful Soup. If you are interested in how I collected this data, you can read the following article:
After the files are uploaded and processed, you should see something like this:
Click on “Go to Dataset” and you should see your new dataset’s homepage, like the one below. But we are not done yet: there are still a few things to do before making it public.
As you can see on the page above, Kaggle computes something called a usability score for your dataset and gives you a list of things you can improve. This score tells you how easy to use and how well documented the dataset is for people who may not be familiar with it. We want to improve it before we make the dataset public. Right now, the usability score is low: 1.2.
Go to the “Settings” tab, add a subtitle, and set a license for your dataset:
Then, go back to the “Data” tab and click “Add tags…” to add a few tags for your dataset:
Next, click on “Add a description…”, enter a description of your dataset, and indicate how you collected the data. In my case, I put a link to the GitHub code used for scraping.
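All the fields filled in so far (title, subtitle, license, tags, description) can also be kept in a file next to the data. As far as I know, Kaggle's command-line API reads them from a file named `dataset-metadata.json` when a dataset is created from a folder. The sketch below writes such a file. The field names reflect my understanding of that format and should be checked against Kaggle's current documentation, and every value you pass in (user name, slug, license name) is up to you.

```python
import json
from pathlib import Path


def build_dataset_metadata(owner, slug, title, subtitle, description, keywords, license_name):
    """Collect the dataset page fields into one dictionary.

    Assumption: the keys mirror Kaggle's dataset-metadata.json format
    as I understand it; verify them before relying on this.
    """
    return {
        "title": title,
        "id": f"{owner}/{slug}",  # owner/slug, as in the dataset URL
        "subtitle": subtitle,
        "description": description,
        "keywords": list(keywords),  # the tags shown on the dataset page
        "licenses": [{"name": license_name}],
    }


def write_dataset_metadata(folder, metadata):
    """Write the metadata as dataset-metadata.json inside the dataset folder."""
    path = Path(folder) / "dataset-metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path
```

Even if you only ever use the web interface, keeping this file under version control gives you a record of what you typed into the dataset page.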
Then, you should add a file description and column descriptions for each (CSV) file. Go to the CSV file that you want and click on the pencil icon at the top right.
Next, add a description for the file and describe each column of the file.
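For a file with many columns, it helps to draft the descriptions offline first. Here is a small sketch, assuming the CSV has a header row. It reads the column names and produces an empty template to fill in, plus a helper that lists the columns still missing a description. Both function names are my own.

```python
import csv


def column_template(csv_path):
    """Return {column_name: ""} for every column in the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return {name: "" for name in header}


def undocumented_columns(descriptions):
    """List the columns whose description is still empty or whitespace."""
    return [name for name, text in descriptions.items() if not text.strip()]
```

Once `undocumented_columns` returns an empty list, you can copy the descriptions into the column editor on Kaggle.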
Then, go to the “Metadata” tab and edit the information there, especially “Sources”, “Collection methodology” and “Expected update frequency”.
Now, it is time to choose an image for our dataset. Go to the “Settings” tab, then “Image”, and upload a 1900 x 400 image.
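Most pictures do not come in a 1900 x 400 shape, so you will usually need to crop one first. The sketch below computes the largest centered crop with that aspect ratio, using integer arithmetic only. The function name is hypothetical. The result is a (left, top, right, bottom) box, which is the form that image libraries such as Pillow accept for cropping. After cropping, you would still resize the result to 1900 x 400.

```python
def banner_crop_box(width, height, target_width=1900, target_height=400):
    """Return (left, top, right, bottom) of the largest centered crop
    that has the target aspect ratio."""
    # Compare aspect ratios by cross-multiplication to avoid float rounding.
    if width * target_height > height * target_width:
        # Image is too wide: keep the full height and trim the sides.
        new_width = height * target_width // target_height
        left = (width - new_width) // 2
        return (left, 0, left + new_width, height)
    # Image is too tall (or already the right shape): keep the full width
    # and trim the top and bottom.
    new_height = width * target_height // target_width
    top = (height - new_height) // 2
    return (0, top, width, top + new_height)
```

For instance, a 3800 x 1000 photo is too tall for this shape, so the function keeps the full width and returns the box `(0, 100, 3800, 900)`.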
Now, our dataset already has a usability score of 8.8. The final things that we need to do are adding tasks and kernels (in case you do not know, kernels are just Jupyter notebooks; this is what they are called on Kaggle). But to add tasks, we need to make the dataset public first; otherwise we get an error:
Now, it is time to go public. You can make the dataset public by clicking the “Make Public” button right next to the usability score, or from the “Settings” tab -> “Sharing”.
Then, continue with the “Tasks” tab -> “Create New Task”, write a title and a description for the task, and we are done.
Now, after you have made the dataset public and created a task, a Kaggle bot automatically creates a starter kernel for your dataset:
This is it; we are done now. We did all we could to improve the usability of our dataset. The usability score is 9.4 now. It is not higher because we did not add a description for each file in our dataset: we did so only for the CSV file, not for the remaining 6k+ images. I consider this a bug in Kaggle’s algorithm for calculating the usability score. Who would add a description for each file in such a big dataset? Kaggle should require file descriptions only for the most important files (like CSV files).
If you are interested in seeing the dataset used as an example, you can access it here.
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!