Parsing and conversion to DataFrame performance enhancement #14
- Enhance coordinate parsing
- Skip xmlns declaration when adding attributes
- Ignore `typemap` unknown attributes when adding attributes
- Enhance GeoInterface part (WIP: should probably be moved to a dedicated file and further developed)
Add a show method for Geometry that checks the output's color capability so it is shown correctly in a DataFrame
Fix some issues on GeoInterface
Use StaticArrays to lower the allocations while parsing coordinates
….jl and tables.jl Enhanced _object_slow function (WIP)
…ment opportunities messaging n object construction
- Updated Project.toml to include new dependencies: AbstractTrees, Automa, Downloads, Gumbo, JSON3, Scratch, Serialization, and their respective compatibility versions.
- Modified KML.jl to utilize Automa for parsing alongside Parsers.
- Enhanced geointerface.jl to support dynamic trait dispatch for MultiGeometry and improved geometry handling.
- Replaced the coordinate parsing logic in parsing.jl with an Automa-based approach for better performance and clarity.
- Updated tables.jl to include a new option for stripping HTML from descriptions in PlacemarkTable.
- Introduced HtmlEntitiesAutoma.jl for handling HTML entity decoding with caching for efficiency.
- Added a manual mapping for <Pair> tags to KML.StyleMapPair in types.jl to ensure correct parsing.
…dle empty coordinate cases
…removing unnecessary type conversions
…oord type to support Vector{Coord2} and Vector{Coord3}
…er` function and ability to specify layers by index as well as by name
… extraction functions
…n and error handling
…oint with empty coordinate to align with struct definition and handle Point (Coord2 or Coord3) before other geometries that have a vector of Coord2 or Coord3
…ataFrame construction with new options
…nd Tables.rows functions for KMLFile
… handling of KML tags in object() function
- Introduced new `TimeStamp` and `TimeSpan` structs to represent time elements in KML.
- Updated existing KML structures to accommodate new time-related fields.
- Created a comprehensive ISO 8601 parser in `KMLTimeElementParsing.jl` to handle various date and time formats, including those with time zones.
- Implemented regex patterns for validating ISO 8601 formats and generating corresponding parser functions.
- Added test cases to validate the functionality of the ISO 8601 parser.
…nd optimize error handling
…_info for better type management
…LFile and related functions for improved type flexibility
…dary processing in _handle_polygon_boundary!
…ance read functions with lazy loading support
…ta extraction and refactor reading methods for KML and KMZ files
…o support LazyKMLFile, add error handling for KMZ files, and unify layer info retrieval for both file types.
…DataFrame constructors for improved performance and re-export LazyKMLFile in KML module.
…d `FieldConversion`
…g placemark number counts
Enhance KML.jl with improved geometry handling, navigation, and XML serialization
- Updated `KMLMakieExt.jl` to support polygons with holes using GeometryBasics.
- Enhanced `Coordinates.jl` to handle SVector types for coordinate strings.
- Added iterable and indexable functionality to `KMLFile` for easier navigation.
- Introduced `children` function in `navigation.jl` to retrieve logical children of KML elements.
- Modified `field_conversion.jl` to improve coordinate conversion and handling of gx:LatLonQuad.
- Improved XML serialization in `xml_serialization.jl` to handle various KML elements and attributes.
- Added tests for new features, including lazy loading and coordinate parsing.
- Updated dependencies in `Project.toml` to include GeoInterface and StaticArrays.
streamline tag symbol conversion and improve coordinate parsing in tests
…teration, enhancing readability and performance
Leads to insertion of lines with empty or missing values after every normal line, plus issues with layer identification
Fix non-fully-qualified call to nodetype in xml_parsing.jl
Whoa! Lots to digest here! My gut reaction is that this is too much to take on in this package. My goal was to make KML.jl as lightweight as possible so that it required little or no maintenance. It was built to satisfy a specific need for a specific customer. That being said, I've only glanced at the changes. If we merge this, would you be able to unofficially commit to supporting any issues that pop up? I'd be happy to add you as a maintainer.
Of course, after such a significant change, I would handle the issues that will inevitably arise.
Cool, can you get the CI to go green?
OK, I'll make a PR to update CI, which is failing for other reasons.
Still have to loosen compat to match the CI script.
Good. CI is green.
This PR introduces substantial performance improvements and architectural changes to KML.jl. Due to the extensive nature of these modifications, I understand if you prefer that I create a separate public fork instead. I'm submitting this PR to give you the option to incorporate these changes, if it's not too much.
Performance improvements
Benchmarks comparing main branch vs this PR on Windows 11, Julia 1.11.5:
I used the KML test file produced with the function below:
I also benchmarked against ArchGDAL.jl for the conversion of a layer from a KML file to a DataFrame, and this PR is between 1.5 and 4 times faster. Here are some KML files found in the wild that I used for the benchmark:
Architectural changes
1. Module restructuring
On the way to enhancing performance, I reorganized the codebase into the following modules:
- types.jl - Type definitions with thread-safe tag mapping
- xml_parsing.jl - XML to KML object parsing
- coordinates.jl - Automa-based coordinate parsing
- field_conversion.jl - Type-stable field conversions
- tables.jl - Tables.jl interface implementation
- time_parsing.jl - ISO 8601 time parsing with Automa
- html_entities.jl - HTML entity decoding
- layers.jl - Layer navigation and selection
2. New LazyKMLFile type
3. PlacemarkTable for efficient data extraction
4. Type-stable parsing implementation
5. Extensions for optional dependencies
- KMLDataFramesExt - DataFrame integration
- KMLGeoInterfaceExt - GeoInterface.jl support
- KMLMakieExt - Makie.jl plotting recipes
- KMLZipArchivesExt - KMZ file support
Key implementation details
Coordinate parsing
Replaced regex-based parsing with Automa.jl FSM:
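As a rough illustration of the single-pass idea only (this is not the Automa-generated code from the PR), a hand-written scanner over the bytes of a `<coordinates>` string could look like the sketch below. The function name `parse_coordinates` is hypothetical. The sketch also assumes plain `NTuple{3,Float64}` storage with a missing altitude filled in as `0.0`, where the PR uses StaticArrays and separate `Coord2`/`Coord3` types.

```julia
# Hypothetical helper for illustration; not part of KML.jl's API.
# Parses "lon,lat[,alt]" tuples separated by whitespace in one pass, no regex.
# As in the KML format, no whitespace is expected inside a tuple.
function parse_coordinates(s::String)
    coords = NTuple{3,Float64}[]
    vals = Float64[]      # components of the tuple currently being read
    start = 0             # byte index where the current number starts (0 = none)
    bytes = codeunits(s)
    n = length(bytes)
    for i in 1:n+1
        # A sentinel space after the last byte flushes the final tuple.
        b = i <= n ? bytes[i] : UInt8(' ')
        is_ws = b == UInt8(' ') || b == UInt8('\n') || b == UInt8('\t') || b == UInt8('\r')
        is_comma = b == UInt8(',')
        if is_ws || is_comma
            if start != 0
                # End of a number: parse the slice without copying the string.
                push!(vals, parse(Float64, SubString(s, start, prevind(s, i))))
                start = 0
            end
            if is_ws && !isempty(vals)
                # Whitespace closes a tuple; it must have 2 or 3 components.
                length(vals) in (2, 3) ||
                    throw(ArgumentError("expected 2 or 3 components, got $(length(vals))"))
                push!(coords, (vals[1], vals[2], length(vals) == 3 ? vals[3] : 0.0))
                empty!(vals)
            end
        elseif start == 0
            start = i     # first byte of a new number
        end
    end
    return coords
end

pts = parse_coordinates("-122.08,37.42,0 -122.09,37.43")
```

A generated FSM follows the same state transitions (inside a number, between components, between tuples), but compiles them from a declarative pattern instead of hand-written branches.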
Memory optimizations
Thread safety
Breaking changes
None. All existing APIs maintained with identical behavior.
New dependencies
- Automa (coordinate/time parsing), Parsers (number parsing), StaticArrays (coordinate storage)
- Scratch (caching), JSON3 (parsing entity definitions), Downloads (fetching definitions), Serialization (cache storage)
- Tables (Tables.jl interface), TimeZones & Dates (temporal data)
- REPL (interactive layer selection)
The package also moves GeoInterface from a direct dependency to a weak dependency (extension), along with new weak dependencies for DataFrames, Makie, and ZipArchives support.
Testing
All existing tests pass after test code adjustment. However, I have added only a minimum of additional tests for the extra code before getting your feedback on this PR.
Alternative approach
If these changes are too extensive for the main package, I'm happy to maintain them as a separate public fork (e.g., FastKML.jl or similar) to avoid fragmenting the ecosystem while providing an alternative for performance-critical applications.