@eschultink
Member

@eschultink eschultink commented Jun 3, 2025

Fixes

  • The syntactic-sugar loop for (CSVRecord record : records) can throw UncheckedIOException if an individual line is malformed. This PR avoids that, to some extent, by parsing line-by-line.
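
A minimal sketch of the per-line isolation idea, using only the JDK. `LineByLineSketch`, `parseLine`, and `parseSkippingMalformed` are hypothetical names, and `parseLine` is a deliberately naive stand-in for a real CSV parser (it only rejects lines with an unbalanced quote and does not handle quoted commas). It is not the code in this PR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.List;

final class LineByLineSketch {

    /** Hypothetical stand-in for a real CSV parser: rejects lines with an unbalanced quote. */
    static String[] parseLine(String line) {
        long quotes = line.chars().filter(ch -> ch == '"').count();
        if (quotes % 2 != 0) {
            throw new IllegalArgumentException("unbalanced quote: " + line);
        }
        return line.split(",", -1); // naive split; ignores quoted commas
    }

    /** Parses each physical line on its own; returns the number of lines skipped. */
    static int parseSkippingMalformed(BufferedReader reader, List<String[]> out) throws IOException {
        int skipped = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            try {
                out.add(parseLine(line));
            } catch (IllegalArgumentException e) {
                skipped++; // a single bad line no longer aborts the whole file
            }
        }
        return skipped;
    }
}
```

One trade-off of splitting on physical lines, under these assumptions: a quoted field that legitimately contains a newline would be seen as two malformed lines rather than one valid record.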

Change implications

  • dependencies added/changed? no

While it is arguably good to avoid failing a huge file because of a couple of malformed lines, this may make encoding issues too subtle to be discovered ... so perhaps we don't want this.


@eschultink eschultink self-assigned this Jun 3, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR refactors bulk CSV sanitization to parse the header and body line-by-line (rather than using a single-pass parser) to better handle malformed rows and preserve original end-of-line markers.

  • Add parseFirstLine in ColumnarBulkDataSanitizerImpl to capture header contents and EOL sequence
  • Introduce processRecords to iterate rows manually, skipping malformed lines via exceptions
  • Update tests to use a shared forFile helper and cover new first-line parsing and malformed-row behavior
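
As a rough illustration of the first bullet, the sketch below reads a header line while remembering the exact EOL sequence that ended it. The class `FirstLineSketch` and the shape of `ParsedFirstLine` are assumptions made for this example; the PR's real `parseFirstLine` may differ, and this version does not handle a quoted header field containing a newline.

```java
import java.io.BufferedReader;
import java.io.IOException;

final class FirstLineSketch {

    /** Header text plus the EOL sequence that terminated it ("" if the input had none). */
    static final class ParsedFirstLine {
        final String content;
        final String eol;

        ParsedFirstLine(String content, String eol) {
            this.content = content;
            this.eol = eol;
        }
    }

    static ParsedFirstLine parseFirstLine(BufferedReader reader) throws IOException {
        StringBuilder content = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                return new ParsedFirstLine(content.toString(), "\n");
            }
            if (c == '\r') {
                // Peek one character to distinguish "\r\n" from a bare "\r".
                reader.mark(1);
                int next = reader.read();
                if (next == '\n') {
                    return new ParsedFirstLine(content.toString(), "\r\n");
                }
                if (next != -1) {
                    reader.reset(); // not part of the EOL; leave it for the body
                }
                return new ParsedFirstLine(content.toString(), "\r");
            }
            content.append((char) c);
        }
        return new ParsedFirstLine(content.toString(), ""); // single line, no terminator
    }
}
```

After the call, the reader is positioned at the first body row, so the captured EOL can be reused when writing output rows.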

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ColumnarBulkDataSanitizerImpl.java | Added parseFirstLine, refactored sanitize to use processRecords and handle per-line parsing |
| RecordBulkDataSanitizerImpl.java | Changed sanitize signature to require BufferedReader (align with new parsing logic) |
| BulkDataSanitizer.java | Updated interface to accept BufferedReader instead of Reader |
| StorageHandler.java | Unified reader instantiation to BufferedReader |
| ColumnarBulkDataSanitizerImplTest.java | Added parameterized tests for header EOL detection and malformed-row skipping |
| BulkDataSanitizerImplTest.java | Replaced repeated file-reading code with forFile helper and wildcard imports |

Comments suppressed due to low confidence (6)

java/core/src/main/java/co/worklytics/psoxy/storage/impl/RecordBulkDataSanitizerImpl.java:61

  • Changing the API from Reader to BufferedReader is a breaking change and reduces flexibility; consider accepting a Reader and wrapping it internally.
public void sanitize(BufferedReader reader,

java/core/src/main/java/co/worklytics/psoxy/storage/impl/ColumnarBulkDataSanitizerImpl.java:78

  • This override now requires BufferedReader instead of the broader Reader interface, which may break existing clients; consider reverting to Reader.
public void sanitize(@NonNull BufferedReader reader,

java/core/src/main/java/co/worklytics/psoxy/storage/BulkDataSanitizer.java:25

  • The interface signature change from Reader to BufferedReader is a breaking API change; it’s better to keep the more general Reader.
void sanitize(BufferedReader reader,
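
The three comments above suggest keeping the general Reader parameter and wrapping it internally. A minimal sketch of that pattern follows; `ReaderWrappingSketch`, `buffered`, and `countLines` are hypothetical names for illustration and do not appear in the PR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class ReaderWrappingSketch {

    /** Returns the reader itself if it is already buffered, otherwise wraps it. */
    static BufferedReader buffered(Reader reader) {
        return reader instanceof BufferedReader
                ? (BufferedReader) reader
                : new BufferedReader(reader);
    }

    /** Hypothetical method that keeps the general Reader parameter; here it only counts lines. */
    static int countLines(Reader reader) throws IOException {
        BufferedReader br = buffered(reader);
        int n = 0;
        while (br.readLine() != null) {
            n++;
        }
        return n;
    }
}
```

A caveat with this approach: the internal BufferedReader may read ahead, so a caller that keeps using the original Reader afterwards could find more input consumed than expected. Requiring BufferedReader in the signature makes that ownership explicit, which may be why the PR chose it.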

java/core/src/main/java/co/worklytics/psoxy/storage/impl/ColumnarBulkDataSanitizerImpl.java:322

  • transformsWithoutMappings is declared but never used; remove it to clean up dead code.
Set<String> transformsWithoutMappings = new HashSet<>();

java/core/src/test/java/co/worklytics/psoxy/storage/impl/BulkDataSanitizerImplTest.java:42

  • [nitpick] Avoid wildcard imports; prefer explicit imports for clarity and to prevent unintended dependencies.
import java.io.*;

java/core/src/main/java/co/worklytics/psoxy/storage/StorageHandler.java:382

  • [nitpick] The variable is declared as BufferedReader; consider using the Reader interface on the left-hand side to keep the code decoupled from specific implementations.
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8), bufferSize);

eschultink and others added 2 commits June 3, 2025 12:10
…narBulkDataSanitizerImplTest.java

Co-authored-by: Copilot <[email protected]>
…narBulkDataSanitizerImplTest.java

Co-authored-by: Copilot <[email protected]>
@eschultink eschultink marked this pull request as ready for review June 3, 2025 19:10
@eschultink eschultink changed the title from "S200 : improve bulk case" to "more robust bulk file processing" Jun 3, 2025
Contributor

@aperez-worklytics aperez-worklytics left a comment

I'd include the number of rows processed too as part of metadata.

And 🔴: including that in GcsFileEventHandler for GCP is still pending (AWS and terminal are covered here; is GCP missing?)

* @param outputBuffer to write the processed records to
* @return number of records which could not be processed due to errors
*/
int processRecords(
Contributor

I'd return an object with the count of errors and processed lines too
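
One possible shape for such a return value is sketched below. `ProcessingResult`, its fields, and the toy `sanitize` step are assumptions made for this example, not types from the PR.

```java
import java.util.List;

final class ProcessingResultSketch {

    /** Immutable tally of records written and records dropped because of errors. */
    static final class ProcessingResult {
        final int processed;
        final int errors;

        ProcessingResult(int processed, int errors) {
            this.processed = processed;
            this.errors = errors;
        }

        int total() {
            return processed + errors;
        }
    }

    /** Hypothetical per-record step: trims the value and fails on blank input. */
    static String sanitize(String raw) {
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("blank record");
        }
        return trimmed;
    }

    /** Writes each sanitized record to outputBuffer and reports both counts. */
    static ProcessingResult processRecords(List<String> rawRecords, List<String> outputBuffer) {
        int processed = 0;
        int errors = 0;
        for (String raw : rawRecords) {
            try {
                outputBuffer.add(sanitize(raw));
                processed++;
            } catch (IllegalArgumentException e) {
                errors++;
            }
        }
        return new ProcessingResult(processed, errors);
    }
}
```

Returning both counts from one call would let a caller attach the totals to output metadata without relying on the buffer's log lines.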

if (buffer.flush()) {
log.info(String.format("Processed records: %d", buffer.getProcessed()));
if (outputBuffer.addAndAttemptFlush(ProcessedRecord.of(Lists.newArrayList(newRecord.values())))) {
log.info(String.format("Processed records: %d", outputBuffer.getProcessed()));
Contributor

Yeah, true; that appears in the logs, but only for the buffer. I think exposing the total as part of the metadata could be useful.

.setTrim(true)
.build();

ParsedFirstLine parsedFirstLine = parseFirstLine(reader);
Contributor

I'd include a comment here explaining why this is used instead of records.getFirstEndOfLine().

Base automatically changed from rc-v0.5.3 to main June 23, 2025 09:41
@eschultink eschultink changed the base branch from main to rc-v0.5.4 June 23, 2025 17:28
Base automatically changed from rc-v0.5.4 to main July 18, 2025 23:44
@eschultink eschultink changed the base branch from main to rc-v0.5.5 July 20, 2025 19:15
Base automatically changed from rc-v0.5.5 to main August 4, 2025 22:44
@eschultink eschultink changed the base branch from main to rc-v0.5.6 August 5, 2025 14:46
Base automatically changed from rc-v0.5.6 to main August 22, 2025 17:05
@eschultink eschultink changed the base branch from main to rc-v0.5.7 August 22, 2025 19:43
Base automatically changed from rc-v0.5.7 to main September 5, 2025 21:04
@eschultink eschultink changed the base branch from main to rc-v0.5.8 September 9, 2025 17:15
Base automatically changed from rc-v0.5.8 to main September 9, 2025 19:39
@eschultink eschultink changed the base branch from main to rc-v0.5.9 September 9, 2025 21:10
Base automatically changed from rc-v0.5.9 to main September 12, 2025 19:46
@eschultink eschultink changed the base branch from main to rc-v0.5.10 September 15, 2025 17:28
@eschultink eschultink deleted the branch rc-v0.5.15 October 2, 2025 16:04
@eschultink eschultink closed this Oct 2, 2025
@eschultink eschultink reopened this Oct 2, 2025
Base automatically changed from rc-v0.5.10 to main October 8, 2025 08:41
@eschultink eschultink changed the base branch from main to rc-v0.5.11 October 8, 2025 18:31
Base automatically changed from rc-v0.5.11 to main October 28, 2025 18:08
@eschultink eschultink changed the base branch from main to rc-v0.5.12 October 28, 2025 21:34
Base automatically changed from rc-v0.5.12 to main November 12, 2025 16:19
@eschultink eschultink changed the base branch from main to rc-v0.5.13 November 12, 2025 19:21
Base automatically changed from rc-v0.5.13 to main November 17, 2025 18:49
@eschultink eschultink changed the base branch from main to rc-v0.5.14 November 17, 2025 20:35
Base automatically changed from rc-v0.5.14 to main December 2, 2025 16:39
@eschultink eschultink changed the base branch from main to rc-v0.5.15 December 2, 2025 16:48
3 participants