Querying Large Parquet Files with Pandas
27th August 2021
By Michael A
The Scalability Challenges of Pandas
Many would agree that Pandas is the go-to tool for analysing small to medium-sized data in Python on a single machine. It excels at handling data that can fit in memory, but this is also one of its biggest limitations. Attempts to use Pandas to directly query data files with hundreds of millions of rows are typically met with slow performance followed by out-of-memory errors. Even with techniques like chunking, the time taken to load and work with subsets of large data is often too slow to be considered interactive.
When working with large data files, data analysts and data scientists may consider more scalable alternative libraries such as Dask, Koalas, and Vaex to do the heavy lifting. In exchange for solving the scalability issues of Pandas, these alternative libraries often deviate from the Pandas API in ways that can introduce friction. Modin is one of the few exceptions to this but that is a topic for another day. In this blog post, we will look at how Pandas can be used to query data volumes that would typically be associated with one of the more heavyweight alternatives.
Quite often, only a subset of the data contained in a large data file is needed for an analytical workload. When the data is available as Parquet files that present a star schema (also referred to as a dimensional model), a level of interactive querying performance can be achieved that would otherwise not be possible with Pandas alone. This blog post will present some steps that can be used to interactively query and analyse two and a half years of NHS England prescriptions, totalling more than 500 million rows.
The Importance of Optimising Datasets for Analytical Workloads
The NHS Business Services Authority (NHSBSA) maintains a dataset for prescriptions issued by NHS England and dispensed in England, Wales, Scotland, Guernsey, Alderney, Jersey, and the Isle of Man. This is referred to as the 'English prescribing data' or the 'English prescribing dataset', both with the acronym 'EPD'. New data files are published by the NHSBSA each month, and each file contains the prescribing activity for a single month. The data is published with a lag of around two months. More details about the dataset can be found on the NHSBSA Open Data Portal.
To analyse the prescribing activity between January 2019 and June 2021, Pandas would need to read 30 CSV files containing over 500 million rows in total. This equates to around 180 GB of data. There is an option to download compressed versions of the files, which would significantly reduce the size on disk. However, this would have no effect on the size of the data in memory, and Pandas would still need to read the files row by row even for a small subset. Here is a great blog post that shows what lengths one must go to in order to analyse a fraction of this data using Pandas.
The Same Data, Modelled and Optimised for Analytical Queries
An enhanced version of the English prescribing data is available as an Open Data Blend dataset. The data is modelled as a star schema which means it has been transformed into fact and dimension tables. The fact tables contain columns that would typically be aggregated, such as the number of prescription items, the prescription quantity, and the actual cost. They also contain foreign key references to dimension tables. Each row in the fact table corresponds to a line on a prescription and the data has been partitioned into a Parquet file per year.
The dimension tables contain the descriptive columns that can add context to each fact table row, such as the item that was prescribed, the name of the prescribing practice, and the name of the primary care organisation which is the Clinical Commissioning Group (CCG). The dimension tables contain all the contextual values across all the years for which there is a fact table. They include the values that have changed over time (i.e. slowly changing dimensions). Although a detailed explanation of dimensional modelling concepts is beyond the scope of this blog post, you can learn about the practices and techniques here.
The star schema for the analysis-optimised English Prescribing dataset. Available as a PDF from the dataset page under 'Useful Resources'
The key benefits of the optimised dataset are:
- The data is available as Parquet files
- The Parquet file metadata enables efficient data reads thanks to its support for column pruning and predicate push-down (a short metadata inspection sketch follows this list)
- A year's worth of data is about 4 GB in size. This is approximately 6% of the size of the equivalent data from the raw dataset, which would be around 72 GB.
- The dimension tables are very small (less than 10 MB each) because they only contain unique combinations of column values. This means they can easily fit in memory.
- The fact tables only contain numeric data type columns which means more rows can be loaded into memory due to their space efficiency.
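To see where these efficiency gains come from, the Parquet metadata itself can be inspected with PyArrow. The sketch below uses a placeholder file name; any of the locally cached files downloaded later in this post could be used instead:
# Inspect the metadata that enables column pruning and predicate push-down.
# 'date.parquet' is a placeholder for any locally cached file from the dataset.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('date.parquet')
# High-level file metadata: number of rows, row groups, and columns
print(parquet_file.metadata)
# The schema shows the physical column types that keep the files compact
print(parquet_file.schema)
# Row group statistics (e.g. min/max values) are what allow whole row groups
# to be skipped when a filter predicate falls outside their range
print(parquet_file.metadata.row_group(0).column(0).statistics)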
Starting with Focused Analytical Questions
To frame our analysis, and the queries that are required to support it, the focus will be the prescribing activity between January 2019 and June 2021 relating to blood glucose test strips.
We will try to answer the following questions within this context:
- What are the most popular glucose test strip items?
- Which CCGs prescribed the most glucose test strip items?
- Which practices prescribed the most glucose test strip items?
The Modest Test Environment
All the code presented from this point forward was run in JupyterLab on an Azure Data Science Virtual Machine. The Standard DS3 v2 virtual machine size was used, and it has the following specifications:
- 4 vCPUs
- 14 GiB RAM
- 128 GiB Premium SSD LRS
One of the biggest benefits of running the code from a virtual machine was that it had a very fast internet connection (i.e. in excess of 1,500 Mbps). This made it possible to download the required data files in under 10 minutes whereas this could have taken several hours over a slow or average speed connection.
Install the required packages:
pip install opendatablend
pip install pyarrow
pip install pandas
Import the required packages:
import opendatablend as odb
import pandas as pd
Pin the analysis to a specific version of the Prescribing dataset to enable reproducibility:
dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-prescribing/20210818T044451Z/datapackage.json'
Note: If we wanted to always get the latest version of the data, we could use the endpoint without the specific version in the path (i.e. https://packages.opendatablend.io/v1/open-data-blend-prescribing/datapackage.json). You can learn more about the dataset versioning feature here.
Set the access key:
access_key=''
Note: When the access key is set to an empty string (i.e. ''), unauthenticated calls to the Open Data Blend Dataset API will be made. Unauthenticated calls have a monthly limit, which you can learn more about here.
The Jupyter Notebook that was used is available on GitHub.
Acquiring the Prescription Data
A prerequisite for efficiently querying the Parquet files is that they need to be cached locally. Depending on where the code will be run from, 'locally' could mean on your workstation, laptop, virtual machine, or in a cloud data lake. In practice, it is any location, relative to where your code runs, that behaves like, and performs comparably to, a local filesystem.
There are three methods that can be used to acquire the Parquet files from the Prescribing dataset:
- Manually download the data files from the dataset page.
- Write some Python code to query the Open Data Blend Dataset API and use the metadata to retrieve the files.
- Use Open Data Blend for Python, a lightweight extract and load Python tool for quickly downloading and caching data files from the Open Data Blend Dataset API.
The option with the least friction is Open Data Blend for Python. This is what will be used in the examples.
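For context, the second option involves fetching the datapackage.json document and using its metadata to locate and download the files. A rough sketch is shown below; it assumes the standard Frictionless Data Package layout (a 'resources' list with 'name' and 'path' fields), so the exact structure returned by the API may differ:
# A rough sketch of option 2: query the Dataset API and download a file using
# the returned metadata. The 'resources', 'name', and 'path' fields are assumed
# to follow the Frictionless Data Package conventions.
import requests
datapackage = requests.get('https://packages.opendatablend.io/v1/open-data-blend-prescribing/datapackage.json').json()
# Find the resource of interest, e.g. the Date dimension in Parquet format
resource = next(r for r in datapackage['resources'] if r['name'] == 'date-parquet')
# Stream the file to disk
with requests.get(resource['path'], stream=True) as response:
    response.raise_for_status()
    with open('date.parquet', 'wb') as file:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            file.write(chunk)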
To get and locally cache the data files, the following simple code can be run:
# Get the Date data file. This file is less than 10 MB.
output_date = odb.get_data(dataset_path, 'date-parquet', access_key=access_key)
# Get the Prescribing Chemical data file. The file is less than 10 MB.
output_prescribing_chemical = odb.get_data(dataset_path, 'prescribing-chemical-parquet', access_key=access_key)
# Get the Prescribing Practice data file. The file is less than 10 MB in size.
output_prescribing_practice = odb.get_data(dataset_path, 'prescribing-practice-parquet', access_key=access_key)
# Get the Primary Care Organisation data file. The file is less than 10 MB in size.
output_primary_care_organisation = odb.get_data(dataset_path, 'primary-care-organisation-parquet', access_key=access_key)
# Get the English Prescribing 2019 data file. The file is larger than 3 GB.
output_english_prescriptions_2019 = odb.get_data(dataset_path, 'english-prescribing-2019-parquet', access_key=access_key)
# Get the English Prescribing 2020 data file. The file is larger than 3 GB.
output_english_prescriptions_2020 = odb.get_data(dataset_path, 'english-prescribing-2020-parquet', access_key=access_key)
# Get the English Prescribing 2021 data file. The file is larger than 1.5 GB.
output_english_prescriptions_2021 = odb.get_data(dataset_path, 'english-prescribing-2021-parquet', access_key=access_key)
Now that the data files have been cached locally, we can start querying the data.
Efficiently Querying the Prescription Data
The general approach to achieve interactive speeds when querying large Parquet files is to:
- Only read the columns required for your analysis
- Only read the rows required for your analysis
In our scenario, we can translate the general approach into the following logical stages that we will implement using Pandas:
- Selectively load the dimension data
- Selectively load the fact data
- Combine the fact data
- Denormalise the dimensions into the fact
- Analyse the data
Something worth noting is that Pandas uses PyArrow behind the scenes when reading Parquet files. This is what makes it possible to skip the columns and row groups that are not needed, rather than reading each file in full.
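For illustration, the pd.read_parquet calls used in the stages below are roughly equivalent to reading the file with PyArrow directly and then converting the result to a DataFrame, as in this sketch, which mirrors the Date dimension read performed in stage 1:
# Roughly equivalent to pd.read_parquet(output_date.data_file_name, columns=..., filters=...)
import pyarrow.parquet as pq
# Read only the required columns, skipping row groups excluded by the filters
date_table = pq.read_table(
    output_date.data_file_name,
    columns=['drv_date_key', 'drv_year'],
    filters=[('drv_date_key', '>=', 20190101), ('drv_date_key', '<=', 20210630)]
)
# Convert the Arrow table into a Pandas DataFrame
date_via_pyarrow = date_table.to_pandas()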
Stage 1: Selectively load the dimension data
First, we need to create a set of DataFrames that store the subset of the dimension data that is needed. These can be referred to as Dimension DataFrames.
Create a Date DataFrame:
# Specify the required columns
date_columns = [
'drv_date_key',
'drv_year'
]
# Specify the filter predicates to be applied
date_filters = [
('drv_date_key', '>=', 20190101),
('drv_date_key', '<=', 20210630)
]
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Load the data after the column selection and filter predicates have been applied
date = pd.read_parquet(output_date.data_file_name, columns=date_columns, filters=date_filters)
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Preview a sample of the data
date.sample(n=5, random_state=1)
# Check the size in memory
date.info(memory_usage='deep', null_counts=True, verbose=True)
Execution time and memory usage:
Timedelta('0 days 00:00:01.327084')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 912 entries, 0 to 911
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_date_key 912 non-null int32
1 drv_year 912 non-null int32
dtypes: int32(2)
memory usage: 7.2 KB
Create a Prescribing Chemical DataFrame:
# Specify the required columns
prescribing_chemical_columns = [
'drv_prescribing_chemical_key',
'srk_british_national_formulary_code',
'src_british_national_formulary_description'
]
# Specify the filter predicates to be applied. We are only interested in Glucose blood testing reagents (0601060D0) and Glucose and ketone blood testing reagents (0601060X0)
prescribing_chemical_filters = [
('drk_source', '=', 'English Prescribing'),
('src_british_national_formulary_chemical_substance', 'in', ['0601060D0', '0601060X0'])
]
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Load the data after the column selection and filter predicates have been applied
prescribing_chemical = pd.read_parquet(output_prescribing_chemical.data_file_name, columns=prescribing_chemical_columns, filters=prescribing_chemical_filters)
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Preview a sample of the data
prescribing_chemical.sample(n=5, random_state=1)
# Check the size in memory
prescribing_chemical.info(memory_usage='deep', null_counts=True, verbose=True)
Execution time and memory usage:
Timedelta('0 days 00:00:00.107007')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_prescribing_chemical_key 94 non-null int32
1 srk_british_national_formulary_code 94 non-null object
2 src_british_national_formulary_description 94 non-null object
dtypes: int32(1), object(2)
memory usage: 14.8 KB
Create the Prescribing Practice DataFrame:
# Specify the required columns
prescribing_practice_columns = [
'drv_prescribing_practice_key',
'srk_prescribing_practice_code',
'src_prescribing_practice'
]
# Specify the filter predicates to be applied
prescribing_practice_filters = [
('drk_source', '=', 'English Prescribing'),
]
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Load the data after the column selection and filter predicates have been applied
prescribing_practice = pd.read_parquet(output_prescribing_practice.data_file_name, columns=prescribing_practice_columns, filters=prescribing_practice_filters)
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Preview a sample of the data
prescribing_practice.sample(n=5, random_state=1)
# Check the size in memory
prescribing_practice.info(memory_usage='deep', null_counts=True, verbose=True)
Execution time and memory usage:
Timedelta('0 days 00:00:00.063000')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17821 entries, 0 to 17820
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_prescribing_practice_key 17821 non-null int32
1 srk_prescribing_practice_code 17821 non-null object
2 src_prescribing_practice 17821 non-null object
dtypes: int32(1), object(2)
memory usage: 2.5 MB
Create the Primary Care Organisation DataFrame:
# Specify the required columns
primary_care_organisation_columns = [
'drv_primary_care_organisation_key',
'srk_primary_care_organisation_code',
'src_primary_care_organisation'
]
# Specify the filter predicates to be applied
primary_care_organisation_filters = [
('drk_source', '=', 'English Prescribing'),
]
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Load the data after the column selection and filter predicates have been applied
primary_care_organisation = pd.read_parquet(output_primary_care_organisation.data_file_name, columns=primary_care_organisation_columns, filters=primary_care_organisation_filters)
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Preview a sample of the data
primary_care_organisation.sample(n=5, random_state=1)
# Check the size in memory
primary_care_organisation.info(memory_usage='deep', null_counts=True, verbose=True)
Execution time and memory usage:
Timedelta('0 days 00:00:00.031000')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1058 entries, 0 to 1057
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_primary_care_organisation_key 1058 non-null int32
1 srk_primary_care_organisation_code 1058 non-null object
2 src_primary_care_organisation 1058 non-null object
dtypes: int32(1), object(2)
memory usage: 152.1 KB
Notice how each of the DataFrames uses a negligible amount of memory. Each of the files could have been loaded into memory in full without an issue at this stage, but this would cause the denormalised DataFrame created in stage 4 to be significantly larger than necessary.
Stage 2: Selectively load the fact data
In addition to the dimension data, we need to create a set of DataFrames that will hold the subset of the fact data that is required for the data analysis. These will be referred to as Fact DataFrames.
Specify the required columns and filter predicates to be used for all the English Prescribing fact Parquet files:
# Specify the required columns
english_prescribing_columns = [
'drv_processing_period_start_date_key',
'drv_primary_care_organisation_key',
'drv_prescribing_practice_key',
'drv_prescribing_chemical_key',
'src_english_prescribing_items'
]
# Specify the filter predicates to be applied - We are only interested in the fact rows that correspond to the subset of the Prescribing Chemical Keys in our Prescribing Chemical DataFrame
english_prescribing_key_filter = [
('drv_prescribing_chemical_key', 'in', prescribing_chemical['drv_prescribing_chemical_key'])
]
Create the English Prescribing 2019 DataFrame:
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Load the data after the column selection and filter predicates have been applied
english_prescribing_2019 = pd.read_parquet(output_english_prescriptions_2019.data_file_name, columns=english_prescribing_columns, filters=english_prescribing_key_filter)
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Preview a sample of the data
english_prescribing_2019.sample(n=5, random_state=1)
# Check the size in memory
english_prescribing_2019.info(memory_usage='deep', null_counts=True, verbose=True)
Execution time and memory usage:
Timedelta('0 days 00:00:07.883614')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1905926 entries, 0 to 1905925
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_processing_period_start_date_key 1905926 non-null int32
1 drv_primary_care_organisation_key 1905926 non-null int32
2 drv_prescribing_practice_key 1905926 non-null int32
3 drv_prescribing_chemical_key 1905926 non-null int32
4 src_english_prescribing_items 1905926 non-null int32
dtypes: int32(5)
memory usage: 36.4 MB
Create the English Prescribing 2020 DataFrame:
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Load the data after the column selection and filter predicates have been applied
english_prescribing_2020 = pd.read_parquet(output_english_prescriptions_2020.data_file_name, columns=english_prescribing_columns, filters=english_prescribing_key_filter)
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Preview a sample of the data
english_prescribing_2020.sample(n=5, random_state=1)
# Check the size in memory
english_prescribing_2020.info(memory_usage='deep', null_counts=True, verbose=True)
Execution time and memory usage:
Timedelta('0 days 00:00:06.639458')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1823291 entries, 0 to 1823290
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_processing_period_start_date_key 1823291 non-null int32
1 drv_primary_care_organisation_key 1823291 non-null int32
2 drv_prescribing_practice_key 1823291 non-null int32
3 drv_prescribing_chemical_key 1823291 non-null int32
4 src_english_prescribing_items 1823291 non-null int32
dtypes: int32(5)
memory usage: 34.8 MB
Create the English Prescribing 2021 DataFrame:
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Load the data after the column selection and filter predicates have been applied
english_prescribing_2021 = pd.read_parquet(output_english_prescriptions_2021.data_file_name, columns=english_prescribing_columns, filters=english_prescribing_key_filter)
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Preview a sample of the data
english_prescribing_2021.sample(n=5, random_state=1)
# Check the size in memory
english_prescribing_2021.info(memory_usage='deep', null_counts=True, verbose=True)
Execution time and memory usage:
Timedelta('0 days 00:00:03.789186')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 882170 entries, 0 to 882169
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_processing_period_start_date_key 882170 non-null int32
1 drv_primary_care_organisation_key 882170 non-null int32
2 drv_prescribing_practice_key 882170 non-null int32
3 drv_prescribing_chemical_key 882170 non-null int32
4 src_english_prescribing_items 882170 non-null int32
dtypes: int32(5)
memory usage: 16.8 MB
When you consider how much data is stored in each of the fact DataFrames, the memory usage is relatively small. In addition to column selectivity, the small memory footprint can also be attributed to the columns using a memory-efficient data type (i.e. int32). The DataFrame for 2021 is roughly half the size of the 2019 and 2020 DataFrames because it only includes six months of data (i.e. January 2021 to June 2021).
Stage 3: Combine the fact data
The DataFrames are combined before joining them to the dimension DataFrames to minimise the memory consumption and processing overhead of the operation.
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Combine the three fact DataFrames
english_prescribing = pd.concat([english_prescribing_2019, english_prescribing_2020, english_prescribing_2021])
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Check the size in memory
english_prescribing.info(memory_usage='deep', null_counts=True, verbose=True)
# Free up some memory by dropping the redundant DataFrames
del english_prescribing_2019, english_prescribing_2020, english_prescribing_2021
Execution time and memory usage:
Timedelta('0 days 00:00:00.087002')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4611387 entries, 0 to 882169
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drv_processing_period_start_date_key 4611387 non-null int32
1 drv_primary_care_organisation_key 4611387 non-null int32
2 drv_prescribing_practice_key 4611387 non-null int32
3 drv_prescribing_chemical_key 4611387 non-null int32
4 src_english_prescribing_items 4611387 non-null int32
dtypes: int32(5)
memory usage: 123.1 MB
Notice that the operation took a fraction of a second. Also, as the intermediate fact DataFrames are no longer required, they are dropped to conserve memory.
Stage 4: Denormalise the dimension data into the fact data
At this stage, we need to create a DataFrame that has all the columns required for our analysis. We will refer to this as the Analytic DataFrame. Any columns that are solely used for joining to the dimension DataFrames can be safely dropped after the Analytic DataFrame has been created.
# Capture the execution start time
execution_start_time = pd.Timestamp.now()
# Merge the dimension DataFrames into the fact DataFrame and drop columns that are no longer needed
english_prescribing_wide = (
english_prescribing
.merge(date, left_on='drv_processing_period_start_date_key', right_on='drv_date_key')
.merge(prescribing_chemical, left_on='drv_prescribing_chemical_key', right_on='drv_prescribing_chemical_key')
.merge(prescribing_practice, left_on='drv_prescribing_practice_key', right_on='drv_prescribing_practice_key')
.merge(primary_care_organisation, left_on='drv_primary_care_organisation_key', right_on='drv_primary_care_organisation_key')
).drop(columns=[
'drv_processing_period_start_date_key',
'drv_date_key',
'drv_prescribing_chemical_key',
'drv_prescribing_practice_key',
'drv_primary_care_organisation_key'
])
# Return the execution duration
pd.Timestamp.now() - execution_start_time
# Check the size in memory
english_prescribing_wide.info(memory_usage='deep', null_counts=True, verbose=True)
# Free up some memory by dropping the redundant DataFrame
del english_prescribing
Execution time and memory usage:
Timedelta('0 days 00:00:04.633174')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4611387 entries, 0 to 4611386
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 src_english_prescribing_items 4611387 non-null int32
1 drv_year 4611387 non-null int32
2 srk_british_national_formulary_code 4611387 non-null object
3 src_british_national_formulary_description 4611387 non-null object
4 srk_prescribing_practice_code 4611387 non-null object
5 src_prescribing_practice 4611387 non-null object
6 srk_primary_care_organisation_code 4611387 non-null object
7 src_primary_care_organisation 4611387 non-null object
dtypes: int32(2), object(6)
memory usage: 1.9 GB
As expected, denormalising the data results in a sharp jump in memory usage due to the inclusion of columns with the Object data type, but the total size in memory remains well within the limits of an average modern-day laptop. A further optimisation could be to convert suitable object columns to the Category data type, but this is not necessary in this case.
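For reference, a minimal sketch of that optional conversion is shown below. It targets the lower-cardinality code and description columns, where repeated string values make the Category data type most effective:
# A minimal sketch of the optional Category conversion (not applied in this post)
category_columns = [
    'srk_british_national_formulary_code',
    'src_british_national_formulary_description',
    'srk_primary_care_organisation_code',
    'src_primary_care_organisation'
]
# Each distinct string is then stored once and referenced by an integer code
english_prescribing_wide[category_columns] = english_prescribing_wide[category_columns].astype('category')
# Confirm the reduction in memory usage
english_prescribing_wide.info(memory_usage='deep', verbose=True)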
Stage 5: Analyse the Data
Now that we have our Analytic DataFrame, we can quickly answer the analytical questions. All the queries execute in less than five seconds because the DataFrame is of a size that Pandas can handle effortlessly.
What are the top 10 blood glucose test strip items?
2019
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2019]
    [['srk_british_national_formulary_code', 'src_british_national_formulary_description', 'src_english_prescribing_items']]
    .groupby(by=['srk_british_national_formulary_code', 'src_british_national_formulary_description'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
2020
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2020]
    [['srk_british_national_formulary_code', 'src_british_national_formulary_description', 'src_english_prescribing_items']]
    .groupby(by=['srk_british_national_formulary_code', 'src_british_national_formulary_description'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
2021 (up to June 2021)
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2021]
    [['srk_british_national_formulary_code', 'src_british_national_formulary_description', 'src_english_prescribing_items']]
    .groupby(by=['srk_british_national_formulary_code', 'src_british_national_formulary_description'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
Which CCGs have prescribed the most blood glucose test strip items?
2019
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2019]
    [['srk_primary_care_organisation_code', 'src_primary_care_organisation', 'src_english_prescribing_items']]
    .groupby(by=['srk_primary_care_organisation_code', 'src_primary_care_organisation'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
2020
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2020]
    [['srk_primary_care_organisation_code', 'src_primary_care_organisation', 'src_english_prescribing_items']]
    .groupby(by=['srk_primary_care_organisation_code', 'src_primary_care_organisation'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
2021 (up to June 2021)
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2021]
    [['srk_primary_care_organisation_code', 'src_primary_care_organisation', 'src_english_prescribing_items']]
    .groupby(by=['srk_primary_care_organisation_code', 'src_primary_care_organisation'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
Which practices have prescribed the most blood glucose test strip items?
2019
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2019]
    [['srk_prescribing_practice_code', 'src_prescribing_practice', 'src_english_prescribing_items']]
    .groupby(by=['srk_prescribing_practice_code', 'src_prescribing_practice'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
2020
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2020]
    [['srk_prescribing_practice_code', 'src_prescribing_practice', 'src_english_prescribing_items']]
    .groupby(by=['srk_prescribing_practice_code', 'src_prescribing_practice'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
2021 (up to June 2021)
(
    english_prescribing_wide[english_prescribing_wide['drv_year'] == 2021]
    [['srk_prescribing_practice_code', 'src_prescribing_practice', 'src_english_prescribing_items']]
    .groupby(by=['srk_prescribing_practice_code', 'src_prescribing_practice'])
    .sum()
    .sort_values(by='src_english_prescribing_items', ascending=False)
    .head(10)
)
Conclusion
When data files are available in Parquet format and the data has been optimally structured for analytical workloads (i.e. modelled as a star schema), Pandas can be used to query very large data volumes in a matter of seconds. The Parquet file format and the PyArrow library enable Pandas to achieve this by skipping reads of the data that is not relevant to the analytical use case, which significantly reduces the in-memory footprint. After the subset of data has been combined and denormalised into a single Pandas DataFrame, the data analyst or data scientist can work as they normally would, benefiting from Pandas' interactive querying speeds.
Although this blog post provides examples that query and produce some analytic outputs from just over 500 million rows of data in under 30 seconds, the same approach could be used to efficiently query Parquet files containing a few billion rows. It is also worth noting that there are further stages that could be added to reduce the size of the Analytic DataFrame, such as assigning the Category data type to columns with a small number of distinct values, and pre-aggregating the fact DataFrame before denormalising the dimensions into it.
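As a rough illustration of the pre-aggregation idea, the combined fact DataFrame from stage 3 could be summed by its foreign keys before the dimensions are merged in, as in the sketch below:
# A minimal sketch of pre-aggregating the combined fact DataFrame before the
# merge in stage 4. Grouping by the foreign keys collapses the prescription
# line level rows into one row per key combination, reducing the size of the
# resulting Analytic DataFrame.
english_prescribing_aggregated = (
    english_prescribing
    .groupby(by=[
        'drv_processing_period_start_date_key',
        'drv_primary_care_organisation_key',
        'drv_prescribing_practice_key',
        'drv_prescribing_chemical_key'
    ], as_index=False)['src_english_prescribing_items']
    .sum()
)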
Follow Us and Stay Up to Date
Follow us on X and LinkedIn to keep up to date with Open Data Blend, open data, and open-source data analytics technology news. Be among the first to know when there's something new.
Blog hero image by Phil Botha on Unsplash.