A relatively faster approach for reading json lines file into pandas dataframe
JSON-lines is a widely used file format other than CSV. The difference between JSON-lines and JSON file is only that the former contains multiple JSON objects separated by newline. The file has extension “.jl”
When i was trying to ingest a JSON-l file into pandas dataframe for one of my project i had a hard time figuring it out; because the solutions i found online suggested to loop through each json object one by one. Though it works for small files they won't succeed when you have huge data. In my case I had 1 million data to load. That is when i figured out a relatively easy approach to load the json file into pandas and convert the keys in json to columns(flatten) in pandas without using for loop.
A sample of json lines file is given below.
{"name": "John", "wins": [["straight", "7♣"]]}
{"name": "Jacob", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "Martin", "wins": []}
{"name": "Larry", "wins": [["three of a kind", "5♣"]]}
Let’s see how we can read a json lines file however huge it may be into a pandas dataframe with a short amount time avoiding looping.
Step 1: Read the JSON line file into an object like any normal file
with open(input_file) as f:
lines = f.read().splitlines()
Step 2 : Load the ‘lines’ object into a pandas Data Frame.
import pandas as pd
df_inter = pd.DataFrame(lines)
df_inter.columns = ['json_element']
This intermediate data frame will have only one column with each json object in a row. A sample output is given below
json_element
0 {"id":"f7ca322d-c3e8-40d2-841f-9d7250ac72ca","... 1 {"id":"609772bc-0672-4db5-8516-4c025cfd54ca","... 2 {"id":"1aa9d1b0-e6ba-4a48-ad0c-66552d896aac","... 3 {"id":"719699f9-47be-4bc7-969b-b53a881c95ae","... 4 {"id":"a080f99a-07d9-47d1-8244-26a540017b7a","...
Step 3: Now we will apply json loads function on each row of the ‘json_element’ column. ‘json.loads’ is a decoder function in python which is used to decode a json object into a dictionary. ‘apply’ is a popular function in pandas that takes any function and applies to each row of the pandas dataframe or series.
import json
df_inter['json_element'].apply(json.loads)
Step 4: Once decoding is done we will apply the json normalize function to the above result. json normalize will convert any semi-structured json data into a flat table. Here it converts the JSON ‘keys’ to columns and its corresponding values to row elements.
df_final = pd.json_normalize(df_inter['json_element'].apply(json.loads))
The first few rows of the df_final dataframe output is given below
Using this method you can flatten a JSON lines file into a pandas dataframe without actually looping through each records. For 1M records it took only a few seconds for me to load and flatten the data into pandas.