Evolving ways of Data File Format

 

A data file format defines a standard way of storing information in a file or database. Different use cases call for different file formats.

For example, if we know that only Python systems are going to read our file, we can choose the Pickle format, which is compact and fast to read and write from Python.

CSV has long been the most widely used option for data storage, and most data software can read and write it. However, a CSV file has no schema attached, there is no single standard for handling control and special characters (delimiters, quotes or line breaks inside values), and it is not well suited to complex data.
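To make the "no schema" point concrete, here is a minimal sketch using only Python's standard library. The column names and values are made up for illustration. After a round trip through CSV, every value comes back as a string, so the reader has to guess or re-declare the types.

```python
import csv
import io

# Hypothetical rows with an int, a float and a bool column.
rows = [
    {"id": 1, "price": 9.99, "in_stock": True},
    {"id": 2, "price": 14.5, "in_stock": False},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "in_stock"])
writer.writeheader()
writer.writerows(rows)

# Read them back: CSV stores no type information, so every value is a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

for key, value in restored[0].items():
    print(key, repr(value), type(value).__name__)
```

Formats that store a schema alongside the data, such as Parquet and Feather, avoid this guessing step.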

In this article, we will discuss the following data file formats:

Parquet: Parquet is one of the most common data storage formats for Big Data because it is fast to read. It also understands all the data types used by Pandas, including multi-index data frames. It is optimized for working with complex data in bulk and offers several efficient compression and encoding schemes. It is mostly used as a storage format for data warehouses and data lakes.

Parquet is column-oriented, unlike formats such as CSV, which are row-oriented. The data is divided into column chunks, which are written as pages. Each page contains values for one column only, so pages compress well because they hold similar values.

For example, consider a data table and how it is laid out in each storage format:

[Figures: Data Table; Row-oriented storage format; Column-oriented storage format]
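The difference between the two layouts can be sketched in a few lines of plain Python. This is a toy illustration, not Parquet's actual on-disk format: the table is hypothetical, and run-length encoding stands in for the kinds of encodings a columnar format can apply once similar values sit next to each other.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical table with two columns: country and year.
table = [("US", 2020), ("US", 2020), ("US", 2021), ("IN", 2021), ("IN", 2021)]

# Row-oriented: the fields of each record sit next to each other.
row_layout = [value for row in table for value in row]

# Column-oriented: all values of one column sit next to each other.
countries, years = (list(col) for col in zip(*table))

print(len(run_length_encode(row_layout)))  # 10: no adjacent repeats to collapse
print(run_length_encode(countries))        # [('US', 3), ('IN', 2)]
print(run_length_encode(years))            # [(2020, 2), (2021, 3)]
```

In the row layout, neighbouring values belong to different columns, so nothing collapses. In the column layout, the ten values shrink to four runs.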

The main drawback of Parquet is that it is a binary format and therefore not human-readable.

Pickle: The pickle format uses a relatively compact binary representation. It is a highly optimized way to store data, with built-in read and write support in Python. Its main disadvantage is that it can only be read by Python.

Thus, we should use Pickle only when a Python system will read our files.
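As a minimal sketch of a Pickle round trip with the standard library (the `record` dictionary is made up for illustration): the object is serialized to bytes and restored with its Python types intact, including a `set`, which has no counterpart in CSV.

```python
import pickle

# Hypothetical Python object mixing several built-in types.
record = {"name": "sample", "scores": [0.91, 0.87], "tags": {"a", "b"}}

# Serialize to a compact binary payload, then restore it.
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(type(payload).__name__)           # bytes
print(restored == record)               # True
print(type(restored["tags"]).__name__)  # set
```

Because unpickling can execute arbitrary code, the Python documentation advises loading only pickle files from sources you trust.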

Feather: Feather is a fast and lightweight data storage format. It natively understands almost all data types used by Pandas.

However, it can only be read and written from Python and a handful of other programming languages. The original Feather format (V1) supports only non-nested data types and categorical (dictionary-encoded) data types; the newer V2 format, which is the Arrow IPC file format on disk, lifts this restriction.

A simple hands-on exercise to try out the different file formats:

[Figure: Sample data with variables and their data types]

Refer to GitHub for the code (Data_File_Formats.ipynb) and data (train.csv) files.

We converted the sample CSV data into three formats: pickle, feather and parquet. Then we compared the file sizes and the read times.
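A comparison of this kind can be sketched with pandas as follows. This is not the code from the notebook: `FORMATS` and `compare_formats` are hypothetical names chosen for illustration. In pandas, the Parquet and Feather writers also need an optional engine such as `pyarrow` to be installed.

```python
import os
import tempfile
import time

import pandas as pd

# name -> (writer, reader). Parquet and Feather need an optional engine
# such as pyarrow; CSV and pickle work with pandas alone.
FORMATS = {
    "csv": (lambda df, path: df.to_csv(path, index=False), pd.read_csv),
    "pickle": (lambda df, path: df.to_pickle(path), pd.read_pickle),
    "feather": (lambda df, path: df.to_feather(path), pd.read_feather),
    "parquet": (lambda df, path: df.to_parquet(path), pd.read_parquet),
}


def compare_formats(df, names=("csv", "pickle", "feather", "parquet")):
    """Write df in each format; return {name: (size_bytes, read_seconds)}."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            write, read = FORMATS[name]
            path = os.path.join(tmp, f"sample.{name}")
            write(df, path)
            start = time.perf_counter()
            read(path)  # time a single read of the file just written
            elapsed = time.perf_counter() - start
            results[name] = (os.path.getsize(path), elapsed)
    return results


# Example usage, assuming train.csv is in the working directory:
# df = pd.read_csv("train.csv")
# for name, (size, seconds) in compare_formats(df).items():
#     print(f"{name}: {size / 1e6:.2f} MB, {seconds * 1000:.1f} ms")
```

A single timed read is noisy, so the numbers will vary between runs and machines. Averaging several reads gives a steadier comparison.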

Comparison of file size and read time taken

- The data file in CSV format is 5.8 MB in size. With Pickle and Feather, we see a reduction in size up to 1.5 MB. With Parquet, however, we see a significant reduction in file size: 0.13 MB.

- As for read time, the CSV file takes 111 ms, while Pickle and Feather take only 25 ms and 17 ms respectively.

Factors to consider when choosing the data file format are:

Structure: A data file structure combines a representation of the data in the file with the operations for accessing it. Parquet can store nested data, whereas CSV is flat (nested values have to be serialized into strings) and Feather, as noted above, is only suitable for non-nested data in its original version.

Compatibility: CSV is compatible with most tools, whereas Pickle is only supported by Python. Thus, we cannot use Pickle if our files will be read by non-Python systems.

Schema: CSV does not have any schema attached, while Feather defines its own simplified schemas and metadata for its on-disk representation.

Readability: A CSV file is easy to read, but Pickle is not human-readable.


