Handle line breaks encapsulated in XML tags #46

Open
@andreasnoack

This is admittedly a fairly exotic feature request, but I have some CSV files where the last column contains mixed text including XML, and the text inside the XML tags can contain newline characters that shouldn't be interpreted as record separators when parsing the file. Two such examples:

<PAGE_AUTHORS>&#xD;\n&#xD;\n&#xD;\n&#xD;\n&#xD;\nHACKETT;Ark. &#xC3;&#xA2;&#xC2;&#x80;&#xC2;&#x94; A sheriff;admin;About the Author</PAGE_AUTHORS>

and

<PAGE_AUTHORS>K G Rana;\nMax Planck Institute of Microstructure Physics;Weinberg 2;D-06120 Halle;Germany;\nMax Planck Institute for Chemical Physics of Solids;N&#xC3;&#xB6;thnitzer Str. 40;D-01187 Dresden;O Meshcheriakova;J K&#xC3;&#xBC;bler;\nInstitut f&#xC3;&#xBC;r Festk&#xC3;&#xB6;rperphysik;Technische Universit&#xC3;&#xA4;t Darmstadt;D-64289 Darmstadt;B Ernst;J Karel;R Hillebrand;E Pippel;P Werner;A K Nayak;C Felser;S S P Parkin</PAGE_AUTHORS>

The first example is taken from the file 20160810171500.gkg.csv in the GDELT2 dataset.
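
To make the request concrete, here is a minimal Python sketch of one possible pre-processing step, not how any particular parser handles this. It merges physical lines into one logical record for as long as an XML tag is still open. The names `TAG` and `logical_records` are hypothetical, and the sketch assumes tags are simple names without attributes, are never self-closing, and that a literal `<NAME>` never appears outside real markup.

```python
import re

# Matches opening tags like <PAGE_AUTHORS> and closing tags like </PAGE_AUTHORS>.
# Assumption: plain tag names only, no attributes, no self-closing tags.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")


def logical_records(physical_lines):
    """Yield records, merging physical lines while an XML tag is still open."""
    buffer = []
    depth = 0  # number of currently open tags
    for line in physical_lines:
        buffer.append(line.rstrip("\r\n"))
        for match in TAG.finditer(line):
            if match.group(1):            # closing tag
                depth = max(depth - 1, 0)
            else:                         # opening tag
                depth += 1
        if depth == 0:
            # All tags are balanced, so this newline really ends the record.
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Input ended with an unclosed tag; emit what was collected.
        yield "\n".join(buffer)


if __name__ == "__main__":
    # Made-up sample; the field delimiter is irrelevant to this sketch.
    sample = [
        "a\tb\t<PAGE_AUTHORS>K G Rana;\n",
        "Max Planck;\n",
        "C Felser</PAGE_AUTHORS>\n",
        "c\td\tplain\n",
    ]
    for record in logical_records(sample):
        print(repr(record))
```

The merged records keep the embedded newlines, so they would still need a parser that accepts a multi-line field (or the newlines could be replaced before parsing). Counting tags this way is deliberately naive and would miscount on malformed or attribute-bearing markup.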
