
Parsing and conversion to DataFrame performance enhancement #14


Open
mathieu17g wants to merge 60 commits into base: main


mathieu17g
Contributor

mathieu17g commented Jun 10, 2025

This PR introduces substantial performance improvements and architectural changes to KML.jl. Due to the extensive nature of these modifications, I understand if you prefer that I create a separate public fork instead. I'm submitting this PR to give you the option to incorporate these changes, if it's not too much.

Performance improvements

Benchmarks comparing main branch vs this PR on Windows 11, Julia 1.11.5:

| Operation | File size | Main branch | This PR | Improvement |
|---|---|---|---|---|
| KMLFile reading | 100 placemarks | 73.35 ms | 5.09 ms | 14.4x faster |
| KMLFile reading | 20,000 placemarks | 23.03 s | 1.05 s | 21.9x faster |
| DataFrame extraction | 100 placemarks | 114.9 ms | 0.52 ms | 221x faster |
| DataFrame extraction | 20,000 placemarks | 19.51 s | 81.4 ms | 240x faster |
| Memory usage | All sizes | - | - | 82% reduction |
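As a quick sanity check, the "Improvement" column can be recomputed from the two timing columns. The snippet below is only a sketch: the timings are copied from the table and converted to milliseconds, and `speedup` is a helper name introduced here for illustration.

```julia
# Timings copied from the benchmark table, all converted to milliseconds.
timings_ms = [
    ("KMLFile reading, 100 placemarks",         73.35,    5.09),
    ("KMLFile reading, 20,000 placemarks",      23_030.0, 1_050.0),
    ("DataFrame extraction, 100 placemarks",    114.9,    0.52),
    ("DataFrame extraction, 20,000 placemarks", 19_510.0, 81.4),
]

# Hypothetical helper: ratio of main-branch time to PR time.
speedup(main_ms, pr_ms) = main_ms / pr_ms

for (label, main_ms, pr_ms) in timings_ms
    println(rpad(label, 42), round(speedup(main_ms, pr_ms); digits = 1), "x")
end
```

The ratios round to the figures reported in the table.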

I used the KML test file produced with the function below:

function create_test_kml(n_placemarks::Int; filename="test.kml")
    open(filename, "w") do io
        println(io, """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <name>Test Document</name>
    <Folder>
      <name>Test Folder</name>""")
        
        for i in 1:n_placemarks
            lat = -90 + 180 * rand()
            lon = -180 + 360 * rand()
            println(io, """      <Placemark>
        <name>Place $i</name>
        <description>Description for place $i</description>
        <Point>
          <coordinates>$lon,$lat,0</coordinates>
        </Point>
      </Placemark>""")
        end
        
        println(io, """    </Folder>
  </Document>
</kml>""")
    end
end

I also benchmarked against ArchGDAL.jl for converting a layer of a KML file to a DataFrame; this PR is between 1.5 and 4 times faster. Here are some KML files found in the wild that I used for the benchmark:

Architectural changes

1. Module restructuring

On the way to enhancing performance, I reorganized the codebase into the following modules:

  • types.jl - Type definitions with thread-safe tag mapping
  • xml_parsing.jl - XML to KML object parsing
  • coordinates.jl - Automa-based coordinate parsing
  • field_conversion.jl - Type-stable field conversions
  • tables.jl - Tables.jl interface implementation
  • time_parsing.jl - ISO 8601 time parsing with Automa
  • html_entities.jl - HTML entity decoding
  • layers.jl - Layer navigation and selection
  • Additional utility modules

2. New LazyKMLFile type

# Loads file without materializing KML objects
lazy_kml = read("file.kml", LazyKMLFile)
  • Caches layer information on first access
  • Optimized for DataFrame extraction workflows

3. PlacemarkTable for efficient data extraction

# Direct path to DataFrame
df = DataFrame(PlacemarkTable("file.kml"))
# Or with the extension loaded
df = DataFrame("file.kml")
  • Streaming placemark extraction
  • Minimal object materialization
  • Tables.jl compliant interface
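For readers unfamiliar with what "Tables.jl compliant" involves, here is a minimal sketch of a row-oriented Tables.jl source. `PlacemarkRowsSketch` and its fields are hypothetical stand-ins; the actual `PlacemarkTable` in this PR may be structured quite differently.

```julia
using Tables

# Hypothetical stand-in for a placemark table (not the PR's PlacemarkTable).
struct PlacemarkRowsSketch
    names::Vector{String}
    lons::Vector{Float64}
    lats::Vector{Float64}
end

# Declare the type as a row-accessible table.
Tables.istable(::Type{PlacemarkRowsSketch}) = true
Tables.rowaccess(::Type{PlacemarkRowsSketch}) = true
Tables.rows(t::PlacemarkRowsSketch) = t
Tables.schema(::PlacemarkRowsSketch) =
    Tables.Schema((:name, :lon, :lat), (String, Float64, Float64))

# The table is its own row iterator: one NamedTuple per placemark,
# built on demand rather than stored up front.
Base.length(t::PlacemarkRowsSketch) = length(t.names)
Base.eltype(::Type{PlacemarkRowsSketch}) =
    @NamedTuple{name::String, lon::Float64, lat::Float64}
function Base.iterate(t::PlacemarkRowsSketch, i::Int = 1)
    i > length(t) && return nothing
    return ((name = t.names[i], lon = t.lons[i], lat = t.lats[i]), i + 1)
end
```

Any Tables.jl sink should then be able to consume such an object, for example `Tables.columntable(t)`, or `DataFrame(t)` when DataFrames is loaded.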

4. Type-stable parsing implementation

  • Pre-compiled tag-to-type mappings
  • Thread-safe symbol caches
  • Zero-allocation coordinate parsing
  • Optimized field assignment with type inference

5. Extensions for optional dependencies

  • KMLDataFramesExt - DataFrame integration
  • KMLGeoInterfaceExt - GeoInterface.jl support
  • KMLMakieExt - Makie.jl plotting recipes
  • KMLZipArchivesExt - KMZ file support

Key implementation details

Coordinate parsing

Replaced regex-based parsing with Automa.jl FSM:

# Before: Multiple regex passes
# After: Single-pass FSM with pre-allocated output
parse_coordinates_automa("0,0 1,1")  # 10x faster
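The single-pass idea can be illustrated without Automa.jl. The sketch below is a hand-written scanner, not the FSM generated in this PR; `parse_coordinates_sketch` is a hypothetical name, and it assumes well-formed ASCII input in which tuples are separated by whitespace and components by commas.

```julia
# Hypothetical helper, not part of KML.jl: parse a KML <coordinates> string
# in one left-to-right scan over its bytes.
function parse_coordinates_sketch(s::String)
    coords = Vector{Vector{Float64}}()
    current = Float64[]
    start = 0                          # byte index where the current number began; 0 = none
    bytes = codeunits(s)
    n = length(bytes)
    for i in 1:(n + 1)
        # A sentinel space after the last byte flushes the final tuple.
        b = i <= n ? bytes[i] : UInt8(' ')
        is_comma = b == UInt8(',')
        is_space = b == UInt8(' ') || b == UInt8('\n') ||
                   b == UInt8('\t') || b == UInt8('\r')
        if is_comma || is_space
            if start != 0
                # End of a number: parse the slice without copying the string.
                push!(current, parse(Float64, SubString(s, start, i - 1)))
                start = 0
            end
            if is_space && !isempty(current)
                # Whitespace ends the current tuple.
                push!(coords, current)
                current = Float64[]
            end
        elseif start == 0
            start = i                  # first byte of a new number
        end
    end
    return coords
end

parse_coordinates_sketch("0,0 1,1")
```

Unlike the PR's version, this sketch allocates one small vector per tuple; storing tuples as fixed-size StaticArrays values is one way to avoid that.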

Memory optimizations

  • Removed intermediate string allocations
  • Pre-sized collections based on file structure
  • Lazy evaluation of nested elements
  • Efficient text extraction without concatenation

Thread safety

  • Immutable tag/type caches created at module initialization
  • ReentrantLock for LazyKMLFile cache access
  • No global mutable state
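The lock-guarded cache pattern mentioned above can be sketched in plain Julia. `LockedCache` and `cached!` are hypothetical names used for illustration; the actual `LazyKMLFile` fields and accessors may differ.

```julia
# Hypothetical sketch of a cache protected by a ReentrantLock.
struct LockedCache{K,V}
    lock::ReentrantLock
    data::Dict{K,V}
end

LockedCache{K,V}() where {K,V} = LockedCache{K,V}(ReentrantLock(), Dict{K,V}())

# Return the cached value for `key`, computing it at most once.
# `compute` runs while the lock is held, so concurrent callers wait
# instead of repeating the work.
function cached!(compute, cache::LockedCache{K,V}, key::K) where {K,V}
    lock(cache.lock) do
        get!(compute, cache.data, key)
    end
end

cache = LockedCache{String,Int}()
n_layers = cached!(cache, "layer_count") do
    3   # stand-in for an expensive scan of the file
end
```

Holding the lock during the computation keeps the code simple at the cost of serializing cache misses, which seems a reasonable trade-off when the cached value is computed once per file.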

Breaking changes

None. All existing APIs maintained with identical behavior.

New dependencies

  • Parsing & performance: Automa (coordinate/time parsing), Parsers (number parsing), StaticArrays (coordinate storage)
  • HTML entity handling: Scratch (caching), JSON3 (parsing entity definitions), Downloads (fetching definitions), Serialization (cache storage)
  • Data handling: Tables (Tables.jl interface), TimeZones & Dates (temporal data)
  • User interface: REPL (interactive layer selection)

The package also moves GeoInterface from a direct dependency to a weak dependency (extension), along with new weak dependencies for DataFrames, Makie, and ZipArchives support.

Testing

All existing tests pass after adjusting the test code. However, I have added only a minimal set of tests for the new code, pending your feedback on this PR.

Alternative approach

If these changes are too extensive for the main package, I'm happy to maintain them as a separate public fork (e.g., FastKML.jl or similar) to avoid fragmenting the ecosystem while providing an alternative for performance-critical applications.

mathieu17g added 30 commits May 7, 2025 11:52
- Enhance coordinate parsing
- Skip xmlns declaration when adding attributes
- Ignore `typemap` unknown attributes when adding attributes
- Enhance GeoInterface part (WIP: should probably be moved to a dedicated file and further developed)
Add a show method for Geometry checking on output color capability to be shown correctly in a DataFrame
Fix some issues on GeoInterface
Use StaticArrays to lower the allocations while parsing coordinates
….jl and tables.jl

Enhanced _object_slow function (WIP)
…ment opportunities messaging n object construction
- Updated Project.toml to include new dependencies: AbstractTrees, Automa, Downloads, Gumbo, JSON3, Scratch, Serialization, and their respective compatibility versions.
- Modified KML.jl to utilize Automa for parsing alongside Parsers.
- Enhanced geointerface.jl to support dynamic trait dispatch for MultiGeometry and improved geometry handling.
- Replaced the coordinate parsing logic in parsing.jl with an Automata-based approach for better performance and clarity.
- Updated tables.jl to include a new option for stripping HTML from descriptions in PlacemarkTable.
- Introduced HtmlEntitiesAutoma.jl for handling HTML entity decoding with caching for efficiency.
- Added a manual mapping for <Pair> tags to KML.StyleMapPair in types.jl to ensure correct parsing.
…oord type to support Vector{Coord2} and Vector{Coord3}
…er` function and ability to specify layer by their index beyond by their name
…oint with empty coordinate to align with struct definition and handle Point (Coord2 or Coord3) before other geometries that have a vector of Coord2 or Coord3
- Introduced new `TimeStamp` and `TimeSpan` structs to represent time elements in KML.
- Updated existing KML structures to accommodate new time-related fields.
- Created a comprehensive ISO 8601 parser in `KMLTimeElementParsing.jl` to handle various date and time formats, including those with time zones.
- Implemented regex patterns for validating ISO 8601 formats and generating corresponding parser functions.
- Added test cases to validate the functionality of the ISO 8601 parser.
…LFile and related functions for improved type flexibility
…dary processing in _handle_polygon_boundary!
…ance read functions with lazy loading support
…ta extraction and refactor reading methods for KML and KMZ files
…o support LazyKMLFile, add error handling for KMZ files, and unify layer info retrieval for both file types.
…DataFrame constructors for improved performance and re-export LazyKMLFile in KML module.
mathieu17g added 14 commits May 30, 2025 09:48
Enhance KML.jl with improved geometry handling, navigation, and XML serialization

- Updated `KMLMakieExt.jl` to support polygons with holes using GeometryBasics.
- Enhanced `Coordinates.jl` to handle SVector types for coordinate strings.
- Added iterable and indexable functionality to `KMLFile` for easier navigation.
- Introduced `children` function in `navigation.jl` to retrieve logical children of KML elements.
- Modified `field_conversion.jl` to improve coordinate conversion and handling of gx:LatLonQuad.
- Improved XML serialization in `xml_serialization.jl` to handle various KML elements and attributes.
- Added tests for new features, including lazy loading and coordinate parsing.
- Updated dependencies in `Project.toml` to include GeoInterface and StaticArrays.
streamline tag symbol conversion and improve coordinate parsing in tests
…teration, enhancing readability and performance
Leads to insertion of lines with empty or missing values after every normal line + issues on layer identification
Fix non-fully-qualified call to nodetype in xml_parsing.jl
@joshday
Member

joshday commented Jun 12, 2025

Whoa!

Lots to digest here! My gut reaction is that this is too much to take on in this package. My goal was to make KML.jl as lightweight as possible so that it required little or no maintenance. It was built to satisfy a specific need for a specific customer.

That being said, I've only glanced at the changes. If we merge this, would you be able to unofficially commit to supporting any issues that pop up? I'd be happy to add you as a maintainer.

@mathieu17g
Contributor Author

Of course, after such a significant change, I would handle the issues that will inevitably arise.

@joshday
Member

joshday commented Jun 13, 2025

Cool, can you get the CI to go green?

@mathieu17g
Contributor Author

OK, I'll make a PR to update the CI, which is failing for other reasons:

Current runner version: '2.325.0'
Runner Image Provisioner
Operating System
Runner Image
GITHUB_TOKEN Permissions
Secret source: None
Prepare workflow directory
Prepare all required actions
Getting action download info
Error: This request has been automatically failed because it uses a deprecated version of `actions/cache: v1`. Please update your workflow to use v3/v4 of actions/cache to avoid interruptions. Learn more: https://github.blog/changelog/2024-12-05-notice-of-upcoming-releases-and-breaking-changes-for-github-actions/#actions-cache-v1-v2-and-actions-toolkit-cache-package-closing-down

@mathieu17g
Contributor Author

I still have to loosen the compat bounds to match the CI script.

@mathieu17g
Contributor Author

mathieu17g commented Jun 17, 2025

Good. CI is green.
@joshday: ready for review


2 participants