A few useful things to know when you have the data in Excel format
From what I have seen so far, CSV seems to be the most popular format to store data among data scientists. And that’s understandable, it gets the job done and it’s a quite simple format; in Python, even without any library, one can build a simple CSV parser in under 10 lines of code.
But you may not always find the data that you need in CSV format. Sometimes the only available format may be an Excel file. Like, for example, this dataset on ons.gov.uk about crime in England and Wales, which is only in xlsx format; dataset that I will use in the examples below.
Reading Excel files
The simplest way to read Excel files into pandas data frames is by using the following function (assuming you did
import pandas as pd):
df = pd.read_excel(‘path_to_excel_file’, sheet_name=’…’)
sheet_name can be the name of the sheet we want to read, it’s index or a list with all the sheets we want to read; the elements of the list can be mixed: sheet names or indices. If we want all the sheets, we can use
sheet_name=None. In the case in which we want more sheets to be read, they will be returned as a dictionary of data frames. The keys of such a dictionary will be either the index or name of a sheet, depending on how we specified in
sheet_name; in the case of
sheet_name=None, the keys will be sheet names.
Now, if we use it to read our Excel file we get:
That’s right, an error! It turns out that pandas cannot read Excel files on its own, so we need to install another python package to do that.
There are 2 options that we have:
openpyxl. The package
xlrd can open both Excel 2003 (.xls) and Excel 2007+ (.xlsx) files, whereas
openpyxl can open only Excel 2007+ (.xlsx) files. So, we will install
xlrd as it can open both formats:
pip install xlrd
Now, if we try to read the same data again:
But Excel files can be a little bit messier. Besides data, they may have other comments/explanations in the first and/or last couple of rows.
To tell pandas to start reading an Excel sheet from a specific row, use the argument header = 0-indexed row where to start reading. By default, header=0, and the first such row is used to give the names of the data frame columns.
To skip rows at the end of a sheet, use skipfooter = number of rows to skip.
This is a little better. There are still some issues that are specific to this data. Depending on what we want to achieve we may also need to rearrange the data values into another way. But in this article, we will focus only on reading and writing to and from data frames.
Another way to read Excel files besides the one above is by using a
pd.ExcelFile object. Such an object can be constructed by using the
pd.ExcelFile(‘excel_file_path’) constructor. An
ExcelFile object can be used in a couple of ways. Firstly, it has a
.sheet_names attribute which is a list of all the sheet names inside the opened Excel file.
ExcelFile object also has a
.parse() method that can be used to parse a sheet from the file and return a data frame. The first parameter of this method can be the index of the sheet we want to parse or its name. The rest of the parameters are the same as in the
An example of parsing the second sheet (index 1):
… and here we parse the same sheet using its name instead of an index:
ExcelFiles can also be used inside
with … as … statements, and if you want to do something a little more elaborate, like parsing only sheets with 2 words in their name, you can do something like:
The same thing you can do by using
pd.read_excel() instead of
.parse() method, like this:
… or, if you simply want all the sheets, you can do:
Writing Excel Files
Now that we know how to read excel files, the next step for us is to be able to also write a data frame to an excel file. We can do that by using the data frame method
Let’s first create a simple data frame for writing to an excel file:
Now we want to write it to an excel file:
… and we got an error.
Again, pandas can’t write to excel files on its own; we need another package for that. The main options that we have are:
xlwt— works only with Excel 2003 (.xls) files; append mode not supported
xlsxwriter— works only with Excel 2007+ (.xlsx) files; append mode not supported
openpyxl— works only with Excel 2007+ (.xlsx) files; supports append mode
If we want to be able to write to the old .xls format we should install
xlwt as it is the only one that handles those files. For .xlsx files, we will choose
openpyxl as it also supports the append mode.
pip install xlwt openpyxl
Now if we run again the above code, it works; an excel file was created:
By default, pandas also writes the index column along with our columns. To get rid of it, use
index=False like in the code below:
The index column isn’t there now:
What if we want to write more sheets? If we want to add a second sheet to the previous file, do you think that the below code will work?
The answer is no. It will just overwrite the file with only one sheet: sheet2.
To write more sheets to an Excel file we need to use a
pd.ExcelWriter object as shown below. First, we create another data frame for sheet2, then we open an Excel file as an
ExcelWriter object in which we write the 2 data frames:
Now our Excel file should have 2 sheets. If we then want to add another sheet to it, we need to open the file in append mode and run code similar to the previous one. For example:
Our Excel file, now, has 3 sheets and looks like this:
Working with Excel Formulas
Probably you are wondering, at this point, about Excel formulas. What about them? How to read from files that have formulas? How to write them to Excel files?
Well… good news. It is quite easy. Writing formulas to Excel files is as simple as just writing the string of the formula, and these strings will be automatically interpreted by Excel as formulas.
As an example:
The Excel file produced by the code above is:
Now, if we want to read an Excel file with formulas in it, pandas will read into data frames the result of those formulas.
For example, let’s read our previously created file:
Sometimes you need to save the Excel file manually for this to work and not get zeros instead of the result of formulas (hit CTRL+S before executing the above code).
Below is the code as a Jupyter notebook:
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. Feel free to have a look!