Evolving Data File Formats

November 02, 2021

A data file format defines a standard way of storing information in a file or database. Different use cases call for different file formats.

For example, if we know that only Python programs are going to read our file, we can choose the Pickle format, which serializes Python objects directly with no conversion step.
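As a minimal sketch of that use case, the snippet below round-trips a small Python object through the standard-library `pickle` module. The `record` contents are made up for illustration. Note that unpickling can execute arbitrary code, so this only suits files from a trusted source.

```python
import pickle

# A hypothetical record mixing several built-in Python types.
record = {"id": 1, "tags": ("a", "b"), "scores": [0.5, 0.75]}

# Serialize to bytes using the newest protocol this interpreter supports.
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize; Python types such as tuple survive the round trip.
restored = pickle.loads(payload)
print(restored)
```

Unlike a text format, the restored object keeps its original types, so the tuple comes back as a tuple rather than as a string or a list.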

CSV has long been the most widely used format for data storage, and most data software can read and write it. However, a CSV file carries no schema, there is no single standard for handling delimiters, quotes, and other special characters inside values, and it is a poor fit for complex data.
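The missing schema is easy to see in a small round trip. The sketch below uses Python's standard `csv` module with made-up column names (`id`, `price`, `note`): the numbers are written as text and come back as strings, so the reader has to know or guess the original types.

```python
import csv
import io

# Hypothetical rows: an int, a float, and a string containing the delimiter.
rows = [{"id": 1, "price": 9.5, "note": "a, b"}]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "note"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

# Every value is now a str; the type information was never stored.
print(loaded[0])
```

The `note` value survives only because this writer quotes fields that contain a comma. Another tool with different quoting rules could split it into two fields.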

In this article, we will discuss how data file formats are evolving:

Parquet: Parquet is one of the most common storage formats for Big Data because it is fast to read, especially when a query needs only some of the columns. It handles the data types commonly used by Pandas, including multi-index data frames. It is optimized for working with complex data in bulk and supports several compression codecs and encoding schemes. It is mostly used as a storage format for data warehouses and data lakes.
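One of the encoding schemes Parquet supports is dictionary encoding, where repeated values are replaced by small integer indices into a list of distinct values. The sketch below illustrates the idea in plain Python. It is not Parquet's actual implementation, and the function names and sample data are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


countries = ["India", "India", "Japan", "India", "Japan", "Brazil"]
dictionary, indices = dictionary_encode(countries)
print(dictionary)  # ['India', 'Japan', 'Brazil']
print(indices)     # [0, 0, 1, 0, 1, 2]
```

When a column has few distinct values, storing the short index list plus one copy of each value takes far less space than repeating the full strings.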

Parquet is column-oriented, unlike formats such as CSV, which are row-oriented. The data is divided into column chunks, and each chunk is written as a series of pages. Each page contains values for one column only, so pages compress well because they hold similar values.
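To make the row versus column distinction concrete, here is a rough sketch that serializes the same made-up table both ways and compresses each with `zlib`. It is a toy text layout, not the Parquet file format, and the helper names are hypothetical. On repetitive data like this, the column layout would be expected to compress to fewer bytes, though the exact sizes depend on the data and the compressor.

```python
import zlib

# Hypothetical table: (id, city, status) with highly repetitive columns.
cities = ("Paris", "London", "Tokyo")
rows = [(i, cities[i % 3], "active") for i in range(1000)]


def serialize_rows(rows):
    """Row-oriented: one line per row, values of different columns interleaved."""
    return "\n".join(",".join(str(v) for v in row) for row in rows).encode("utf-8")


def to_columns(rows):
    """Transpose a list of rows into a list of columns."""
    return [list(column) for column in zip(*rows)]


def serialize_columns(columns):
    """Column-oriented: one line per column, similar values stored together."""
    return "\n".join(",".join(str(v) for v in col) for col in columns).encode("utf-8")


row_bytes = serialize_rows(rows)
columns = to_columns(rows)
column_bytes = serialize_columns(columns)

# Both layouts hold the same values; only the ordering on disk differs.
print("raw sizes:", len(row_bytes), len(column_bytes))
print("compressed sizes:", len(zlib.compress(row_bytes)), len(zlib.compress(column_bytes)))
```

A column layout also lets a reader fetch one column, such as `columns[1]` here, without scanning the values of the others.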

For example, if we have a Data Table in the form: