Do You Read Excel Files with Python? There Is a 1000x Faster Way
Reading Excel files is a common task for data scientists, analysts, and developers working with data. Excel, with its ubiquitous presence in business and academic settings, often serves as the starting point for data processing pipelines. Traditionally, libraries like pandas have been the go-to solution for reading and manipulating Excel files in Python. However, as data sizes grow, the performance of these libraries can become a bottleneck.
In this article, we will explore approaches that can dramatically speed up the process of reading Excel files in Python, potentially making it 1000 times faster. They involve leveraging optimized libraries and parallel processing to handle large Excel files efficiently.
Traditional Method: Using pandas
The pandas library is a powerful and flexible data analysis tool that includes functions for reading Excel files. The pandas.read_excel function is straightforward and easy to use, making it a popular choice.
```python
import pandas as pd

# Reading an Excel file using pandas
df = pd.read_excel('large_excel_file.xlsx')
print(df.head())
```
While pandas is convenient, it can be slow when dealing with very large Excel files. This is because pandas reads the entire file into memory, which can be time-consuming and memory-intensive.
Performance Bottlenecks
Several factors contribute to the slow performance of pandas when reading large Excel files:
- Single-threaded Processing: By default, pandas operates in a single-threaded manner, meaning it does not take full advantage of multi-core processors.
- Memory Usage: Large files require significant memory, which can lead to swapping and slow down the process.
- Parsing Overhead: The process of parsing Excel files and converting them into DataFrame structures involves overhead that increases with file size.
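To see how much these bottlenecks cost in practice, it helps to time the baseline read before trying the alternatives below. A minimal sketch, assuming a sample file named large_excel_file.xlsx (a placeholder name) exists:

```python
import time

import pandas as pd

# Time the baseline single-threaded read (the file name is a placeholder)
start = time.perf_counter()
df = pd.read_excel('large_excel_file.xlsx')
elapsed = time.perf_counter() - start

print(f"Loaded {len(df)} rows in {elapsed:.1f} seconds")
```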
Faster Alternatives
To overcome these limitations, we need to look at more efficient ways to handle Excel files. Here are some techniques and libraries that can significantly speed up the process:
1. openpyxl and xlrd for Direct Excel Manipulation
Libraries like openpyxl and xlrd provide lower-level access to Excel files (note that recent versions of xlrd only read the legacy .xls format, so openpyxl is the usual choice for .xlsx). While they are not as high-level as pandas, they allow for more controlled and efficient reading of Excel data.
```python
import openpyxl

# Reading an Excel file using openpyxl in read-only mode
wb = openpyxl.load_workbook('large_excel_file.xlsx', read_only=True)
ws = wb.active

data = []
for row in ws.iter_rows(values_only=True):
    data.append(row)

print(data[:5])
```
Using openpyxl with the read_only=True option can reduce memory usage by not loading the entire file into memory at once.
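If you still want a DataFrame at the end, the streamed rows can be fed straight into pandas. A minimal sketch, assuming the first row holds the column names and the file name is a placeholder:

```python
import openpyxl
import pandas as pd

# Stream rows lazily; read-only mode keeps only a small window of the file in memory
wb = openpyxl.load_workbook('large_excel_file.xlsx', read_only=True)
ws = wb.active

rows = ws.iter_rows(values_only=True)
header = next(rows)                      # assume the first row contains the column names
df = pd.DataFrame(rows, columns=header)  # pandas consumes the remaining rows from the iterator
wb.close()                               # read-only workbooks should be closed explicitly

print(df.head())
```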
2. pyxlsb for Binary Excel Files
The pyxlsb library is specifically designed to read binary Excel files (.xlsb). Binary files are generally smaller and faster to read compared to standard .xlsx files.
```python
import pyxlsb

# Reading a binary Excel file using pyxlsb
with pyxlsb.open_workbook('large_excel_file.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            print([item.v for item in row])
```
3. modin.pandas for Parallel Processing
modin.pandas is a drop-in replacement for pandas that uses parallel processing to speed up data operations. It allows you to utilize all available CPU cores for reading and manipulating data.
```python
import modin.pandas as pd

# Reading an Excel file using modin.pandas
df = pd.read_excel('large_excel_file.xlsx')
print(df.head())
```
By simply replacing pandas with modin.pandas, you can achieve significant performance improvements without changing your existing code.
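Behind the scenes, Modin needs an execution engine such as Ray or Dask. One way to pick it is the MODIN_ENGINE environment variable, set before modin.pandas is first imported. A minimal sketch, assuming Ray is installed and the file name is a placeholder:

```python
import os

# Select Modin's execution backend before the first import of modin.pandas;
# "ray" and "dask" are the commonly used engines (Ray is assumed to be installed here)
os.environ["MODIN_ENGINE"] = "ray"

import modin.pandas as pd

df = pd.read_excel('large_excel_file.xlsx')  # placeholder file name
print(df.head())
```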
4. Dask for Out-of-Core Computation
Dask is a parallel computing library that enables out-of-core computation, meaning it can handle datasets that do not fit into memory by breaking them into smaller chunks.
```python
import dask.dataframe as dd

# Dask has no native Excel reader, so the data is read from a CSV export of the workbook
df = dd.read_csv('large_excel_file.csv')
print(df.head())
```
Although Dask does not natively support reading Excel files directly, you can convert your Excel files to CSV format first. This approach is useful for extremely large datasets; an alternative that skips the CSV step is sketched below.
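Another option is to wrap pandas.read_excel in dask.delayed so that, for a workbook with several sheets, each sheet is parsed as a separate parallel task. A rough sketch, assuming hypothetical sheet names and identical columns on every sheet:

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Hypothetical sheet names; every sheet is assumed to share the same columns
sheet_names = ['Sheet1', 'Sheet2', 'Sheet3']

# One delayed pandas.read_excel call per sheet, scheduled in parallel by Dask
parts = [
    dask.delayed(pd.read_excel)('large_excel_file.xlsx', sheet_name=name)
    for name in sheet_names
]
df = dd.from_delayed(parts)

print(df.head())  # triggers computation of the first partition
```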
5. Vaex for Lazy DataFrames
Vaex is another library designed for efficient data manipulation and visualization of large datasets. It uses memory mapping and lazy evaluation to handle data efficiently.
```python
import vaex

# Reading a CSV file using Vaex (Excel files need to be converted to CSV first);
# convert=True caches the data as HDF5 so subsequent reads are memory-mapped
df = vaex.from_csv('large_excel_file.csv', convert=True)
print(df.head())
```
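If the data will be read more than once, it can pay off to convert the Excel file to Vaex's preferred HDF5 format a single time and then open it memory-mapped on every later run. A minimal sketch with placeholder file names:

```python
import pandas as pd
import vaex

# One-time conversion: read the Excel file with pandas, then export it to HDF5
pdf = pd.read_excel('large_excel_file.xlsx')
vdf = vaex.from_pandas(pdf)
vdf.export_hdf5('large_excel_file.hdf5')

# Later runs open the HDF5 file memory-mapped, which is close to instantaneous
df = vaex.open('large_excel_file.hdf5')
print(df.head())
```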
Combining Techniques for Maximum Efficiency
To achieve the best performance, you can combine these techniques. For example, you can convert Excel files to binary format, read them with pyxlsb, and then process the data with modin or Dask for optimal speed and efficiency.
Step-by-Step Example
- Convert Excel File to Binary Format: Use Excel or another tool to save your .xlsx file as a .xlsb file.
- Read Binary Excel File with pyxlsb: Efficiently load the data.
- Process Data with modin.pandas or Dask: Utilize parallel processing to handle the data.
```python
import pyxlsb
import modin.pandas as pd

# Step 1: Read the binary Excel file
with pyxlsb.open_workbook('large_excel_file.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
        data = [tuple(item.v for item in row) for row in sheet.rows()]

# Step 2: Convert to a DataFrame with modin.pandas (first row is used as the header)
df = pd.DataFrame(data[1:], columns=data[0])
print(df.head())
```
Conclusion
Reading and processing large Excel files in Python can be a challenging task, especially with traditional methods that can be slow and memory-intensive. However, by leveraging optimized libraries such as pyxlsb, modin.pandas, and Dask, you can achieve significant performance improvements.
By adopting these techniques, you can make your data processing pipelines more efficient and scalable, enabling you to handle larger datasets with ease. The key to achieving these performance gains lies in understanding the limitations of traditional methods and exploring new tools and approaches that better align with the demands of modern data processing.
As data sizes continue to grow, staying informed about the latest advancements in data processing technologies is crucial. By continually refining your toolkit and techniques, you can ensure that your data workflows remain fast, efficient, and capable of handling the challenges of tomorrow's data-driven world.