
Code:

In [31]: df = pd.DataFrame({"a": [[{"b": 1}], [{"b": np.nan}]]})

In [32]: df
Out[32]:
              a
0    [{'b': 1}]
1  [{'b': nan}]

In [33]: df.dtypes
Out[33]:
a    object
dtype: object

In [34]: df.to_parquet("a.parquet")

In [35]: pd.read_parquet("a.parquet")
Out[35]:
               a
0   [{'b': 1.0}]
1  [{'b': None}]

As you can see, [{'b': 1}] becomes [{'b': 1.0}] after the round trip through parquet.

How can I preserve the dtypes when reading the parquet file back?

  • Are you sure the dtypes have changed or is it merely a display issue? Commented Aug 15, 2022 at 23:58
  • I think the dtypes have changed, because pd.read_parquet("a.parquet")["a"].values.tolist() returns [array([{'b': 1.0}], dtype=object), array([{'b': None}], dtype=object)]. The values are now arrays, which they originally were not. Commented Aug 16, 2022 at 0:03

1 Answer

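You can use pyarrow.parquet.read_table to read the file and then call pyarrow.Table.to_pandas with integer_object_nulls=True (see the pyarrow documentation for Table.to_pandas). With that option, integer values that sit alongside nulls are returned as Python objects instead of being cast to float.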

import pyarrow.parquet as pq

pq.read_table("a.parquet").to_pandas(integer_object_nulls=True)
               a
0     [{'b': 1}]
1  [{'b': None}]

On the other hand, pandas.read_parquet with use_nullable_dtypes=True does not seem to help here: the nested integer still comes back as a float.

import pandas as pd

df = pd.DataFrame({"a": [[{"b": 1}], [{"b": None}]]})

df.to_parquet("a.parquet")
pd.read_parquet("a.parquet", use_nullable_dtypes=True)
               a
0   [{'b': 1.0}]
1  [{'b': None}]
  • It is still {'b': 1.0} instead of {'b': 1}... Commented Aug 17, 2022 at 14:11
  • Oops, my bad. I just edited the answer. – 0x26res Commented Aug 17, 2022 at 14:33
