
Memory leakage when concatenating a list of DataFrames #20849

@ghost

Description

Code sample, a copy-pastable example

import pandas as pd
import numpy as np
import gc

COLUMNS = list('abcde')

df_list = []
for i in range(100):
    # ~38 MB of float64 data per iteration
    df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=COLUMNS)

    df = df[COLUMNS]     # <-- LINE A, see below
    df = df[df.a > 0.5]  # <-- LINE B, see below

    df_list.append(df)

df_all = pd.concat(df_list, axis=0)

# Drop all references and force a collection; process memory should
# return to roughly its starting level.
del df
del df_list
del df_all
gc.collect()
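
The memory figures quoted below were read off the process RSS. A minimal sketch for instrumenting the run, assuming the third-party psutil package is installed (the helper name rss_mb is mine, not part of the original report):

import psutil

proc = psutil.Process()  # the current Python process

def rss_mb():
    # Resident set size of this process, in megabytes
    return proc.memory_info().rss / 1024 ** 2

print(f"start: {rss_mb():.0f} MB")            # ~60 MB
# ... run the reproduction above ...
print(f"after del + gc: {rss_mb():.0f} MB")   # ~300 MB with LINE A & B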

Problem description

Running the code above leaks memory on pandas 0.22.0:

  • With LINE A and LINE B in place, the Python process's memory goes from
    60 MB --> 8.2 GB --> 300 MB, so roughly 240 MB is never released.

  • With LINE A and LINE B commented out, it goes from
    60 MB --> 8.2 GB --> 60 MB, i.e. everything is returned.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


Labels

    Performance (Memory or execution speed performance)
    Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
