Create Dataset Using Apache Parquet
Working with Dataset — Part 1: Create Dataset Using Apache Parquet
I worked in the data science profession for a while before the term “Data Science” became popular. Then as now, the most widely used commercial data analytics software was SAS by SAS Institute. Like most people, I have transitioned to open-source software. One thing I really miss about SAS is the convenience of the SAS dataset, where all your intermediate and final datasets can be saved and accessed later as needed.
This tutorial explains why I recommend saving your datasets in Apache Parquet format.
Why Do You Need To Save a DataFrame?
Pandas is nice and all, but a pandas DataFrame lives in memory. Any interruption to the system, whether user-initiated (e.g., stopping a runaway process that is taking too long to complete) or caused by a system error, results in the loss of your DataFrame. You can recreate the DataFrame, but that takes time and effort.
You can save your intermediate and final pandas DataFrames in CSV format, but CSV has problems that are too numerous to list. You can also save a pandas DataFrame as a “pickle” file, but then you are stuck using Python to…