Working with Dataset — Part 1: Create Dataset Using Apache Parquet

Sung Kim
10 min read · Jul 29, 2021

I worked in data science long before the term “Data Science” became popular. Then as now, the most widely used commercial data analytics software was SAS, from SAS Institute. Like most people, I have since moved to open-source tools. The one thing I really miss from SAS is the convenience of the SAS dataset: every intermediate and final dataset can be saved to disk and reopened later as needed.

This tutorial explains why I recommend saving your datasets in Apache Parquet format.

Why Do You Need to Save a DataFrame?

pandas is nice and all, but a pandas DataFrame lives in memory. Any interruption, whether user-initiated (e.g., stopping a runaway process that is taking too long to complete) or caused by a system error, results in the loss of that DataFrame. You can recreate it, but that takes time and effort.

You can save your intermediate and final pandas DataFrames in CSV format, but CSV has problems too numerous to list. You can also save a pandas DataFrame as a “pickle” file, but then you are stuck using Python to…
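
To make one of those CSV problems concrete, here is a minimal sketch that saves the same small DataFrame as CSV and as a pickle file, then reads both back. The column names, values, and file names are made up for illustration, and the exact dtype names printed may vary with your pandas version.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical intermediate DataFrame; columns and values are invented.
df = pd.DataFrame(
    {
        "customer_id": ["00123", "00456", "00789"],  # text IDs with leading zeros
        "signup_date": pd.to_datetime(["2021-01-15", "2021-03-02", "2021-07-29"]),
        "segment": pd.Categorical(["gold", "silver", "gold"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "intermediate.csv"
    pkl_path = Path(tmp) / "intermediate.pkl"

    # Save the same DataFrame in both formats.
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    # Read both back with default settings.
    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

print("original:", df.dtypes.to_dict())
print("from CSV:", from_csv.dtypes.to_dict())
print("from pickle:", from_pickle.dtypes.to_dict())
```

With default settings, the CSV round trip re-infers every column type: the text IDs come back as integers without their leading zeros, and the datetime and categorical columns come back as plain text. The pickle round trip keeps the dtypes intact, but the file can only be read from Python.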
