A relatively faster approach for reading json lines file into pandas dataframe

2 min readApr 22, 2020

JSON-lines is a widely used file format other than CSV. The difference between JSON-lines and JSON file is only that the former contains multiple JSON objects separated by newline. The file has extension “.jl”

When i was trying to ingest a JSON-l file into pandas dataframe for one of my project i had a hard time figuring it out; because the solutions i found online suggested to loop through each json object one by one. Though it works for small files they won't succeed when you have huge data. In my case I had 1 million data to load. That is when i figured out a relatively easy approach to load the json file into pandas and convert the keys in json to columns(flatten) in pandas without using for loop.

A sample of json lines file is given below.

{"name": "John", "wins": [["straight", "7♣"]]}
{"name": "Jacob", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "Martin", "wins": []}
{"name": "Larry", "wins": [["three of a kind", "5♣"]]}

Let’s see how we can read a json lines file however huge it may be into a pandas dataframe with a short amount time avoiding looping.

Step 1: Read the JSON line file into an object like any normal file

with open(input_file) as f:
    lines = f.read().splitlines()

Step 2 : Load the ‘lines’ object into a pandas Data Frame.

import pandas as pd
df_inter = pd.DataFrame(lines)
df_inter.columns = ['json_element']

This intermediate data frame will have only one column with each json object in a row. A sample output is given below

                                          json_element                       
0       {"id":"f7ca322d-c3e8-40d2-841f-9d7250ac72ca","...                 1       {"id":"609772bc-0672-4db5-8516-4c025cfd54ca","...                 2       {"id":"1aa9d1b0-e6ba-4a48-ad0c-66552d896aac","...                 3       {"id":"719699f9-47be-4bc7-969b-b53a881c95ae","...               4       {"id":"a080f99a-07d9-47d1-8244-26a540017b7a","...

Step 3: Now we will apply json loads function on each row of the ‘json_element’ column. ‘json.loads’ is a decoder function in python which is used to decode a json object into a dictionary. ‘apply’ is a popular function in pandas that takes any function and applies to each row of the pandas dataframe or series.

import json
df_inter['json_element'].apply(json.loads)

Step 4: Once decoding is done we will apply the json normalize function to the above result. json normalize will convert any semi-structured json data into a flat table. Here it converts the JSON ‘keys’ to columns and its corresponding values to row elements.

df_final = pd.json_normalize(df_inter['json_element'].apply(json.loads))

The first few rows of the df_final dataframe output is given below

Using this method you can flatten a JSON lines file into a pandas dataframe without actually looping through each records. For 1M records it took only a few seconds for me to load and flatten the data into pandas.

A relatively faster approach for reading json lines file into pandas dataframe

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Sundararaman Parameswaran

Responses (4)