more robust bulk file processing #903
base: rc-v0.5.15
Conversation
Pull Request Overview
This PR refactors bulk CSV sanitization to parse the header and body line-by-line (rather than using a single-pass parser) to better handle malformed rows and preserve original end-of-line markers.
- Add `parseFirstLine` in `ColumnarBulkDataSanitizerImpl` to capture header contents and EOL sequence
- Introduce `processRecords` to iterate rows manually, skipping malformed lines via exceptions
- Update tests to use a shared `forFile` helper and cover new first-line parsing and malformed-row behavior
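The per-line approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: `process` and the quote-balance check are hypothetical stand-ins for the real per-line CSV parsing and sanitization in `processRecords`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LineByLineSketch {

    // Sketch: parse each line independently so one malformed row
    // doesn't abort the whole file; returns the count of skipped lines.
    static int process(BufferedReader reader, StringBuilder out) throws IOException {
        int errors = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            try {
                // stand-in for real per-line CSV parsing + sanitization;
                // here we just reject lines with unbalanced quotes
                if (line.chars().filter(c -> c == '"').count() % 2 != 0) {
                    throw new UncheckedIOException(new IOException("unbalanced quotes"));
                }
                out.append(line).append('\n');
            } catch (UncheckedIOException e) {
                errors++; // skip the malformed line, keep going
            }
        }
        return errors;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(new StringReader("a,b\n\"bad\nc,d"));
        StringBuilder out = new StringBuilder();
        System.out.println(process(r, out) + " skipped"); // the "bad" line is dropped
    }
}
```

Contrast this with a single-pass parser, where a malformed row mid-iteration surfaces as an exception that aborts the whole file.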
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ColumnarBulkDataSanitizerImpl.java | Added parseFirstLine, refactored sanitize to use processRecords and handle per-line parsing |
| RecordBulkDataSanitizerImpl.java | Changed sanitize signature to require BufferedReader (align with new parsing logic) |
| BulkDataSanitizer.java | Updated interface to accept BufferedReader instead of Reader |
| StorageHandler.java | Unified reader instantiation to BufferedReader |
| ColumnarBulkDataSanitizerImplTest.java | Added parameterized tests for header EOL detection and malformed-row skipping |
| BulkDataSanitizerImplTest.java | Replaced repeated file-reading code with forFile helper and wildcard imports |
Comments suppressed due to low confidence (6)
java/core/src/main/java/co/worklytics/psoxy/storage/impl/RecordBulkDataSanitizerImpl.java:61
- Changing the API from `Reader` to `BufferedReader` is a breaking change and reduces flexibility; consider accepting a `Reader` and wrapping it internally.
public void sanitize(BufferedReader reader,
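The reviewer's suggestion (keep `Reader` in the public signature, buffer internally) might look like this sketch; `buffered` is a hypothetical helper, not code from the PR:

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;

public class ReaderWrapSketch {

    // Accept the general Reader interface; buffer internally only when
    // needed, avoiding double-wrapping an already-buffered reader.
    public static BufferedReader buffered(Reader reader) {
        return (reader instanceof BufferedReader)
            ? (BufferedReader) reader
            : new BufferedReader(reader);
    }

    public static void main(String[] args) {
        Reader plain = new StringReader("x");
        BufferedReader b = buffered(plain);        // wrapped
        System.out.println(b == buffered(b));      // already buffered: returned as-is
    }
}
```

This keeps the interface compatible with any `Reader` caller while the implementation still gets the `readLine`/`mark` capabilities it needs.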
java/core/src/main/java/co/worklytics/psoxy/storage/impl/ColumnarBulkDataSanitizerImpl.java:78
- This override now requires `BufferedReader` instead of the broader `Reader` interface, which may break existing clients; consider reverting to `Reader`.
public void sanitize(@NonNull BufferedReader reader,
java/core/src/main/java/co/worklytics/psoxy/storage/BulkDataSanitizer.java:25
- The interface signature change from `Reader` to `BufferedReader` is a breaking API change; it's better to keep the more general `Reader`.
void sanitize(BufferedReader reader,
java/core/src/main/java/co/worklytics/psoxy/storage/impl/ColumnarBulkDataSanitizerImpl.java:322
- `transformsWithoutMappings` is declared but never used; remove it to clean up dead code.
Set<String> transformsWithoutMappings = new HashSet<>();
java/core/src/test/java/co/worklytics/psoxy/storage/impl/BulkDataSanitizerImplTest.java:42
- [nitpick] Avoid wildcard imports; prefer explicit imports for clarity and to prevent unintended dependencies.
import java.io.*;
java/core/src/main/java/co/worklytics/psoxy/storage/StorageHandler.java:382
- [nitpick] The variable is declared as `BufferedReader`; consider using the `Reader` interface on the left-hand side to keep the code decoupled from specific implementations.
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8), bufferSize);
…narBulkDataSanitizerImplTest.java Co-authored-by: Copilot <[email protected]>
…narBulkDataSanitizerImplTest.java Co-authored-by: Copilot <[email protected]>
aperez-worklytics left a comment
I'd include the number of rows processed too as part of metadata.
And 🔴: pending to include that in GcsFileEventHandler for GCP (AWS and terminal are covered here; GCP is missing?)
* @param outputBuffer to write the processed records to
* @return number of records which could not be processed due to errors
*/
int processRecords(
I'd return an object with the count of errors and processed lines too
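A result object along the lines the reviewer suggests might look like this sketch; `ProcessingResult` and its field names are hypothetical, not part of the PR:

```java
public class ProcessingResultSketch {

    // Hypothetical value object returned by processRecords instead of a
    // bare error count, so callers can surface both figures as metadata.
    static final class ProcessingResult {
        final int processed; // lines successfully sanitized and written
        final int errors;    // malformed lines skipped

        ProcessingResult(int processed, int errors) {
            this.processed = processed;
            this.errors = errors;
        }
    }

    public static void main(String[] args) {
        ProcessingResult r = new ProcessingResult(98, 2);
        System.out.println(r.processed + " processed, " + r.errors + " errors");
    }
}
```

Returning a small value object instead of an `int` also leaves room to add further counters later without another signature change.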
if (buffer.flush()) {
log.info(String.format("Processed records: %d", buffer.getProcessed()));
if (outputBuffer.addAndAttemptFlush(ProcessedRecord.of(Lists.newArrayList(newRecord.values())))) {
log.info(String.format("Processed records: %d", outputBuffer.getProcessed()));
Yeah true, that appears in logs but just the buffer; I think if we expose the total as part of metadata could be useful
.setTrim(true)
.build();

ParsedFirstLine parsedFirstLine = parseFirstLine(reader);
I'd include a comment here about why use this and not `records.getFirstEndOfLine()`
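For context, capturing the header together with its original EOL sequence requires reading character by character, since `BufferedReader.readLine()` discards the line terminator. A minimal sketch, with `firstLineAndEol` as a hypothetical stand-in for the PR's `parseFirstLine`:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class EolDetectSketch {

    // Returns { firstLineContent, eolSequence }, where eolSequence is
    // "\n", "\r\n", "\r", or "" if the input has no line terminator.
    static String[] firstLineAndEol(BufferedReader reader) throws IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                return new String[] { line.toString(), "\n" };
            }
            if (c == '\r') {
                reader.mark(1);                 // peek one char ahead
                int next = reader.read();
                if (next != '\n') reader.reset(); // lone \r: push the char back
                return new String[] { line.toString(), next == '\n' ? "\r\n" : "\r" };
            }
            line.append((char) c);
        }
        return new String[] { line.toString(), "" };
    }

    public static void main(String[] args) throws IOException {
        String[] r = firstLineAndEol(new BufferedReader(new StringReader("a,b\r\nc,d")));
        System.out.println(r[0] + " / " + (r[1].equals("\r\n") ? "CRLF" : "LF"));
    }
}
```

Preserving the detected sequence lets the sanitized output reproduce the input file's original line endings.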
Fixes
`for (CSVRecord record : records)` can throw `UncheckedIOException` if an individual line is malformed. This avoids that to some extent, by parsing line-by-line.
Change implications
While it's arguably good to avoid failing a huge file due to a couple of malformed lines, it may make encoding issues too subtle to be discovered ... so perhaps we don't want this.