Parsing and conversion to DataFrame performance enhancement #14

Open: wants to merge 60 commits into base: main
2b9eb1e
Memoization on `typemap` function: 30x speedup seen
mathieu17g May 7, 2025
dbd2152
Add a fast‑path for simple leaf tags in `add_element!`. 33% speed gai…
mathieu17g May 7, 2025
5a6db75
Enhanced `object` function -> 40% speed gain on reading
mathieu17g May 7, 2025
59963bc
Cleaned code
mathieu17g May 7, 2025
4a29aa3
Handle omitted comma after an altitude in XML file
mathieu17g May 8, 2025
3b46034
Several fixes:
mathieu17g May 9, 2025
7bf7998
Add a Tables.jl interface
mathieu17g May 9, 2025
4c63f1e
Refactored the packages in KML.jl, types.jl, parsing.jl, geointerface…
mathieu17g May 11, 2025
136806e
Enhance KML parsing logic for objects and improve performance enhance…
mathieu17g May 11, 2025
c786272
Refactor KML package dependencies and improve parsing functionality
mathieu17g May 15, 2025
2743ca8
Add missing REPL and TerminalMenus imports necessary for interactive …
mathieu17g May 16, 2025
66109e0
Refactor coordinate parsing logic to strip spaces on the left and han…
mathieu17g May 16, 2025
8cd960d
Refactor object and add_element! functions to improve readability by …
mathieu17g May 16, 2025
b0f651d
Refactor coordinate handling in add_element! function and update gx_c…
mathieu17g May 17, 2025
6f2d97c
Add KMLZipArchivesExt module for KMZ file support and add a `list_lay…
mathieu17g May 18, 2025
f588f54
Add KMLDataFramesExt module for DataFrame support and implement layer…
mathieu17g May 18, 2025
805d356
Enhance kml_enum macro and altitudeMode struct for improved validatio…
mathieu17g May 19, 2025
12a4852
Refactor coordinate parsing logic in add_element! function to allow P…
mathieu17g May 19, 2025
0321e67
Add support for simplifying single-part MultiGeometries and enhance D…
mathieu17g May 19, 2025
e2946b1
Add support for simplifying single-part geometries in Tables.schema a…
mathieu17g May 19, 2025
9debaff
Enhance KML parsing by adding support for Atom elements and improving…
mathieu17g May 20, 2025
b47b6a4
Add support for ISO 8601 time parsing and enhance KML time elements
mathieu17g May 25, 2025
892bb1f
Refactor time parsing in KML processing: integrate ISO 8601 parsing a…
mathieu17g May 25, 2025
a53985f
Enhance KML type handling: improve typemap function and add fieldtype…
mathieu17g May 25, 2025
b8a814a
Refactor XML node type usage: replace Node with AbstractXMLNode in KM…
mathieu17g May 26, 2025
57ba060
Refactor KML parsing: streamline child node handling and improve boun…
mathieu17g May 27, 2025
78663a2
Refactor KML parsing: update type handling for children nodes and enh…
mathieu17g May 27, 2025
4c50e6b
WIP: Enhance KML processing: add LazyKMLFile support for efficient da…
mathieu17g May 27, 2025
da9481c
WIP: Refactor KML and LazyKML file handling: enhance read functions t…
mathieu17g May 27, 2025
4a55fe8
WIP: Enhance KMLDataFramesExt and KML module: support LazyKMLFile in …
mathieu17g May 27, 2025
06b8cad
WIP: Refactor KMLZipArchivesExt and parsing: enable precompilation, r…
mathieu17g May 27, 2025
e8336cd
Aligned coordinates parsing and geometry creation in tables.jl to cor…
mathieu17g May 27, 2025
86b8f23
Refactor text extraction functions: streamline extract_text_content_f…
mathieu17g May 28, 2025
0f0e9ea
Add Coordinates module and refactor coordinate parsing: move parse_co…
mathieu17g May 29, 2025
2f035b0
Refactored layers info code to dedicated Layers module
mathieu17g May 29, 2025
1788946
Refactor Enums module to a separate file
mathieu17g May 29, 2025
4c5fa2f
WIP: In the middle of `types.jl` split into several files and modules…
mathieu17g May 30, 2025
edd0b53
`parsing.jl` split with 2 additional modules `FieldAssignment` an…
mathieu17g May 30, 2025
e872d24
Move GeoInterface to an extension and refactored the rest to simplify…
mathieu17g Jun 2, 2025
e4b66d2
Add recursive placemark counting functions for KML layers to fix wron…
mathieu17g Jun 2, 2025
0b3c490
Add Makie extension for KML geometry types and implement plotting rec…
mathieu17g Jun 2, 2025
9a9d0d8
WIP debugging tests:
mathieu17g Jun 3, 2025
4a257ea
WIP to enhance thread safety:
mathieu17g Jun 4, 2025
1c10152
Refactor XML parsing functions for improved performance and memory ef…
mathieu17g Jun 6, 2025
aa526a5
Refactor XML parsing to use immediate child functions for improved pe…
mathieu17g Jun 6, 2025
6a32cd6
WIP: Refactor XML parsing to utilize new macros for immediate child i…
mathieu17g Jun 6, 2025
a6306fe
Iterating macros: use continue instead of return not working for macro
mathieu17g Jun 6, 2025
455d8f0
Fix iteration macros to avoid infinite loops
mathieu17g Jun 7, 2025
21b9d2c
Not working : Enhance iteration macros for performance
mathieu17g Jun 7, 2025
9a6992f
Optimize iteration macros for performance
mathieu17g Jun 7, 2025
1a33ada
Update JSON3 compat version
mathieu17g Jun 10, 2025
0fd83e5
Merge branch 'main' into parsing_perf_enhancement
joshday Jun 16, 2025
3786718
merge
joshday Jun 17, 2025
9b7e555
loosen compat
joshday Jun 17, 2025
0ab6195
Remove GeoInterface functions introduced in PR #15 for test fix
mathieu17g Jun 17, 2025
6605657
Fix typo in `KML.jl` on `include("enums.jl")`
mathieu17g Jun 17, 2025
0be683a
Typo on `include("Coordinates.jl")` for case-sensitive OSes
mathieu17g Jun 17, 2025
9b08d27
Typo on `include("Layers.jl")` for case-sensitive OSes
mathieu17g Jun 17, 2025
86ce39e
Loosen compat for CI: 1st try
mathieu17g Jun 17, 2025
7c44ddd
Fix Julia 1.10 precompilation by conditionally using `generating_output`
mathieu17g Jun 17, 2025
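
The first commit in the list credits memoization of `typemap` with a 30x speedup. The PR's actual implementation is not shown on this page, so the following is only a minimal sketch of the general technique, assuming the memoized function is a pure per-type lookup. `DemoPlacemark`, `compute_field_types`, `cached_field_types`, and the two constants are hypothetical names introduced for illustration and are not part of KML.jl.

```julia
# Hypothetical demo type; it stands in for any struct whose field layout is queried often.
struct DemoPlacemark
    name::String
    visibility::Bool
    altitude::Float64
end

# Uncached version: builds a fresh Dict through reflection on every call.
compute_field_types(::Type{T}) where {T} =
    Dict{Symbol,Type}(n => fieldtype(T, n) for n in fieldnames(T))

# Cache keyed by type, plus a lock so concurrent callers cannot corrupt it.
const FIELD_TYPES_CACHE = IdDict{Type,Dict{Symbol,Type}}()
const FIELD_TYPES_LOCK = ReentrantLock()

# Memoized version: computes the Dict once per type, then returns the stored object.
function cached_field_types(::Type{T}) where {T}
    lock(FIELD_TYPES_LOCK) do
        get!(FIELD_TYPES_CACHE, T) do
            compute_field_types(T)
        end
    end
end
```

Calling `cached_field_types(DemoPlacemark)` twice returns the identical `Dict` object, so the reflection work is paid once per type. Whether a lock is needed depends on how the parser is threaded; it is included here as a conservative default.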
41 changes: 32 additions & 9 deletions Project.toml
@@ -4,19 +4,42 @@ authors = ["Josh Day <[email protected]> and contributors"]
version = "0.2.5"

[deps]
GeoInterface = "cf35fbd7-0cd7-5166-be24-54bfbe79505f"
Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
InteractiveUtils = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
Parsers = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
REPL = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"
Scratch = "6c6a2e73-6563-6170-7368-637461726353"
Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
TimeZones = "f269a46b-ccf7-5d73-abea-4c690281aa53"
XML = "72c71f33-b9b6-44de-8c94-c961784809e2"

[weakdeps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
GeoInterface = "cf35fbd7-0cd7-5166-be24-54bfbe79505f"
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
ZipArchives = "49080126-0e18-4c2a-b176-c102e4b3760c"

[extensions]
KMLDataFramesExt = "DataFrames"
KMLGeoInterfaceExt = "GeoInterface"
KMLMakieExt = "Makie"
KMLZipArchivesExt = "ZipArchives"

[compat]
Automa = "1.1"
GeoInterface = "1.3"
JSON3 = "1.14"
OrderedCollections = "1"
XML = "0.3.0"
julia = "1"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
Parsers = "2.8"
Scratch = "1.2"
StaticArrays = "1.9"
Tables = "1.12"
TimeZones = "1.21"
XML = "0.3"
julia = "1.10"
85 changes: 85 additions & 0 deletions ext/KMLDataFramesExt.jl
@@ -0,0 +1,85 @@
# ext/KMLDataFramesExt.jl

module KMLDataFramesExt

using DataFrames
import KML
import KML: KMLFile, LazyKMLFile, PlacemarkTable, read


"""
    DataFrame(kml_file::Union{KML.KMLFile,KML.LazyKMLFile}; layer::Union{Nothing,String,Integer}=nothing, simplify_single_parts::Bool=false)

Constructs a DataFrame from the Placemarks in a `KMLFile` or `LazyKMLFile` object.

# Arguments

- `kml_file::Union{KML.KMLFile,KML.LazyKMLFile}`: The KML file object already read into memory.
  `LazyKMLFile` is more efficient for this use case as it doesn't materialize the entire KML structure.

- `layer::Union{Nothing,String,Integer}=nothing`: Specifies the layer to extract Placemarks from.

    + If `nothing` (default): The behavior is defined by `KML.PlacemarkTable` (e.g., attempts to find a default layer or prompts if multiple are available and in interactive mode).
    + If `String`: The name of the Document or Folder to use as the layer.
    + If `Integer`: The index of the layer to use.
- `simplify_single_parts::Bool=false`: If `true`, when a MultiGeometry contains only a single geometry part, that part is extracted directly, simplifying the structure. For example, a MultiGeometry containing a single LineString will be treated as a LineString. Defaults to `false`.
"""
function DataFrames.DataFrame(
    kml_file::Union{KML.KMLFile,KML.LazyKMLFile};
    layer::Union{Nothing,String,Integer} = nothing,
    simplify_single_parts::Bool = false,
)
    placemark_table = KML.PlacemarkTable(kml_file; layer = layer, simplify_single_parts = simplify_single_parts)
    return DataFrames.DataFrame(placemark_table)
end

"""
    DataFrame(kml_path::AbstractString; layer::Union{Nothing,String,Integer}=nothing, simplify_single_parts::Bool=false, lazy::Bool=true)

Constructs a DataFrame from the Placemarks in a KML file specified by its path.

# Arguments

- `kml_path::AbstractString`: Path to the .kml or .kmz file.

- `layer::Union{Nothing,String,Integer}=nothing`: Specifies the layer to extract Placemarks from.

    + If `nothing` (default): The behavior is defined by `KML.PlacemarkTable` (e.g., attempts to find a default layer or prompts if multiple are available and in interactive mode).
    + If `String`: The name of the Document or Folder to use as the layer.
    + If `Integer`: The index of the layer to use.
- `simplify_single_parts::Bool=false`: If `true`, when a MultiGeometry contains only a single geometry part, it will be simplified to that single geometry. For example, a MultiGeometry containing a single Point will become just a Point. Defaults to `false`.
- `lazy::Bool=true`: If `true` (default), uses `LazyKMLFile` for better performance when only extracting placemarks.
  If `false`, uses regular `KMLFile` which materializes the entire KML structure.
  For DataFrame extraction, `lazy=true` is recommended as it's significantly faster for large files.

# Examples

```julia
# Default lazy loading (recommended for DataFrames)
df = DataFrame("large_file.kml")

# Force eager loading if you need the full KML structure later
df = DataFrame("file.kml"; lazy = false)

# Select a specific layer by name
df = DataFrame("file.kml"; layer = "Points of Interest")

# Select layer by index
df = DataFrame("file.kml"; layer = 2)
```
"""
function DataFrames.DataFrame(
    kml_path::AbstractString;
    layer::Union{Nothing,String,Integer} = nothing,
    simplify_single_parts::Bool = false,
    lazy::Bool = true,
)
    kml_file_obj = if lazy
        KML.read(kml_path, KML.LazyKMLFile)
    else
        KML.read(kml_path, KML.KMLFile)
    end
    return DataFrames.DataFrame(kml_file_obj; layer = layer, simplify_single_parts = simplify_single_parts)
end

end # module KMLDataFramesExt
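
The docstrings in this file describe `simplify_single_parts` as replacing a MultiGeometry that holds exactly one part with that part. The code that does this lives outside this file and is not shown in the diff, so the sketch below only illustrates the documented behavior with toy types. `ToyGeometry`, `ToyPoint`, `ToyLineString`, `ToyMultiGeometry`, and `unwrap_single_part` are hypothetical names and are not KML.jl's API.

```julia
# Toy geometry hierarchy for illustration only; KML.jl defines its own geometry types.
abstract type ToyGeometry end

struct ToyPoint <: ToyGeometry
    coords::NTuple{2,Float64}
end

struct ToyLineString <: ToyGeometry
    coords::Vector{NTuple{2,Float64}}
end

struct ToyMultiGeometry <: ToyGeometry
    parts::Vector{ToyGeometry}
end

# Plain geometries pass through unchanged.
unwrap_single_part(g::ToyGeometry) = g

# A multi-geometry is unwrapped only when it holds exactly one part;
# empty or multi-part collections are returned as they are.
unwrap_single_part(g::ToyMultiGeometry) = length(g.parts) == 1 ? only(g.parts) : g

single = ToyMultiGeometry(ToyGeometry[ToyLineString([(0.0, 0.0), (1.0, 1.0)])])
double = ToyMultiGeometry(ToyGeometry[ToyPoint((0.0, 0.0)), ToyPoint((1.0, 1.0))])
```

Here `unwrap_single_part(single)` yields the inner `ToyLineString`, while `unwrap_single_part(double)` returns the two-part collection untouched. This sketch unwraps one level only; whether the PR unwraps nested single-part MultiGeometries recursively is not stated in the docstrings.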