Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transformation tool: column headers are repeated in column values #153

Open
niamhomalleyjba opened this issue Oct 29, 2024 · 2 comments
Open
Assignees

Comments

@niamhomalleyjba
Copy link

Issue Description

Every column contains its header as a column value when there are more than batch_size rows.

Steps to Reproduce (Bugs only)

  1. Config file:
transformations:
  loc: # Transformation name
    input_format:
      name: Cede_Location
      version: 10.0.0
    output_format:
      name: OED_Location
      version: 3.0.2
    runner:
      batch_size: 750 # Number of rows to process in a single batch
    extractor:
      type: mssql # Path to the DB connection file. Other options are 'postgres' and 'sqlite'. Defaults to 'csv', when not present.
      options:
        host: myserver, 1450
        port:
        database: MY_DB
        user: foo
        password: bar
        sql_statement: ./cede_location.sql # SQL statement to extract data
    loader:
      options:
        path: ./oed_location.csv # Path to the output file
        quoting: minimal
  1. Command used:
ods_tools transform --config-file /path/to/config.yaml --nocheck
  1. Output is created but the column header appears in the column several times.

Version / Environment information

Name: ods_tools
Version: 3.2.7

Example data / logs

@niamhomalleyjba niamhomalleyjba changed the title Transformation tool: column headers are repeated in column valyes Transformation tool: column headers are repeated in column values Oct 29, 2024
@carlfischerjba
Copy link

With 1000 rows in the SQL Server database and a batch size of 750, I see that the header is repeated in the OED location file: line 1 (expected), line 1002, line 2003. Not sure where those numbers come from. Total number of rows in the csv is 3,003 (3,000 of data plus 3 headers, 2 of them redundant).

Apparently each database row yields more than one (three?) OED rows. I assume that's expected.

@carlfischerjba
Copy link

carlfischerjba commented Oct 29, 2024

Here's the error that is displayed due to this issue if we omit --nocheck and allow OED validation to go ahead. Presumably BaseFloodElevation triggers the error because it's the first numeric column encountered, but subsequent numeric columns would cause similar errors were the code to continue executing.

Interesting that the tool crashes rather than displaying a clean validation error such as "BaseFloodElevation must be a float. 'BaseFloodElevation' is not a valid value." Ideally it would report all validation errors and not just the first to avoid suggesting that there's something special about one column when many are affected.

$ ods_tools transform --config-file /workspace/LM3932_cede_conversion/CF_config.yaml
/home/carl/repos/ODS_Tools/ods_tools/odtf/connector/db/mssql.py:55: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  for batch in pd.read_sql(sql_query, conn, chunksize=batch_size):
Traceback (most recent call last):
  File "parsers.pyx", line 1160, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/carl/venvs/myvenv/bin/ods_tools", line 33, in <module>
    sys.exit(load_entry_point('ods-tools', 'console_scripts', 'ods_tools')())
  File "/home/carl/repos/ODS_Tools/ods_tools/main.py", line 206, in main
    command_action[kwargs.pop('command')](** kwargs)
  File "/home/carl/repos/ODS_Tools/ods_tools/main.py", line 112, in transform
    check(location=output_file[0])
  File "/home/carl/repos/ODS_Tools/ods_tools/main.py", line 54, in check
    oed_exposure.check()
  File "/home/carl/repos/ODS_Tools/ods_tools/oed/exposure.py", line 332, in check
    validator = Validator(self)
  File "/home/carl/repos/ODS_Tools/ods_tools/oed/validator.py", line 29, in __init__
    self.column_to_field_maps[oed_source] = oed_source.get_column_to_field()
  File "/home/carl/repos/ODS_Tools/ods_tools/oed/source.py", line 356, in get_column_to_field
    self.dataframe.columns,
  File "/usr/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/home/carl/repos/ODS_Tools/ods_tools/oed/source.py", line 326, in dataframe
    df = self.load_dataframe()
  File "/home/carl/repos/ODS_Tools/ods_tools/oed/source.py", line 392, in load_dataframe
    oed_df = self.read_csv(filepath, self.exposure.get_input_fields(self.oed_type), filter=self.filters, **read_params)
  File "/home/carl/repos/ODS_Tools/ods_tools/oed/source.py", line 559, in read_csv
    ).filter(filter).as_pandas()
  File "/home/carl/venvs/myvenv/lib/python3.10/site-packages/oasis_data_manager/df_reader/backends/base.py", line 87, in filter
    self._read()
  File "/home/carl/venvs/myvenv/lib/python3.10/site-packages/oasis_data_manager/df_reader/backends/base.py", line 77, in _read
    self.read_csv(*self.reader_args, **self.reader_kwargs)
  File "/home/carl/venvs/myvenv/lib/python3.10/site-packages/oasis_data_manager/df_reader/backends/pandas.py", line 34, in read_csv
    self.df = pd.read_csv(self.filename_or_buffer, *args, **kwargs)
  File "/home/carl/venvs/myvenv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/carl/venvs/myvenv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 617, in _read
    return parser.read(nrows)
  File "/home/carl/venvs/myvenv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/carl/venvs/myvenv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 1065, in pandas._libs.parsers.TextReader._convert_column_data
  File "parsers.pyx", line 1166, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: 'BaseFloodElevation'

@benhayes21 benhayes21 self-assigned this Nov 20, 2024
@benhayes21 benhayes21 moved this to In Progress in Oasis Dev Team Tasks Nov 20, 2024
@benhayes21 benhayes21 moved this from In Progress to Todo in Oasis Dev Team Tasks Jan 14, 2025
@benhayes21 benhayes21 moved this from Todo to On Hold in Oasis Dev Team Tasks Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: On Hold
Development

No branches or pull requests

3 participants