Specify requirements for numeric stop chars and other token boundaries #401

austnwil · 2025-10-09T21:02:43Z

Description of changes:

There are numerous open issues involving ambiguity/incorrect behavior around how parsers should split certain tokens in Ion text (see #176, #175, amazon-ion/ion-rust#414, amazon-ion/ion-js#724). These have caused confusing behavior and inconsistencies between and even within implementations on how certain tokens that don't have any whitespace between them are split. A common theme across some of these issues is that the spec is not very clear on requirements around numeric stop characters and behavior of +/- in sexps.

This PR adds some explicit documentation on when tokens can appear next to each other and what the behavior of +/- in sexp is (subject to change). Once this is cleared up it should be easier to fix some of the linked issues because it will be more clear whether the current behavior is correct or not (particularly with #175 and #176).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

toddjonker · 2025-10-10T00:08:14Z

Are these issues not clarified by the grammar? That's the proper spec for text parsing questions.

austnwil · 2025-10-10T16:27:13Z

Are these issues not clarified by the grammar? That's the proper spec for text parsing questions.

The grammar is not clear on what the behavior of +/- is in sexp's. Here the grammar states:

// * There are some ambiguities involving symbol operators within s-expressions
// - Do '-' and '+' (possibly others) right or left associate?
// - Do symbol operators terminate special float symbols (nan, -inf, +inf)?

This has led to implementations handling ambiguous cases in ways that appear inconsistent.

I'm not suggesting we change the spec, just that we more clearly document what the correct behavior is, or determine it if it is not known. This should make it easier to fix the inconsistencies (or close the above issues if the behavior is actually correct).

Even if the grammar is unambiguous it would still be helpful to explicitly state the correct behavior in the docs as well.

austnwil · 2025-10-10T17:09:28Z

Most implementations currently seem to require numeric stops after +inf and -inf but not nan. Assuming this is correct, this makes the behavior in #176 correct, although unintuitive, so that can be closed. I updated this PR to document this behavior.

When #175 was reported, rust parsed certain inputs differently from java and javascript; this no longer seems to be the case. Rust and java implementations still reject +inf+inf and accept nan+inf, but if the above holds, then that is also correct behavior, so this can be closed as well.

popematt · 2025-11-05T23:56:52Z

docs/boundaries.md

+Anything that is not a numeric stop-character appearing immediately after a numeric or timestamp value is a syntax error. This notably includes comments and S-expression operators:
+
+```ion
+123// a comment              // ERROR: single-line comment is not a valid numeric stop


This is surprising. Comments are explicitly called out in the spec as being equivalent to whitespace, and I believe that all of our implementations treat comments as a valid whitespace for the purposes of terminating a numeric value.

I also found this surprising, but some implementations do reject Ion that uses a comment immediately after a numeric.

The spec does say specifically:

In the text notation, real values [and int, ts] must be followed by one of the fifteen numeric stop-characters: {},"'\ \t\n\r\v\f.

Which calls out specific whitespace characters \t\n\r\v\f and excludes comments.

Here are comparisons of our Ion JS, Python, Java, and Rust impls. I did not test Go, C yet.

Handling of 123//comment:

Language Success Result

JavaScript ❌: Error Error: invalid character after number

Python ❌: IonException IERR_INVALID_SYNTAX

Java ✅
$ion_1_0
123

Rust ❌: Decoding invalid Ion syntax encountered
offset=3
buffer head=<//comment>
buffer tail=<//comment>
buffer len=9

Handling of 123/*comment*/:

Language Success Result

JavaScript ❌: Error Error: invalid character after number

Python ❌: IonException IERR_INVALID_SYNTAX

Java ✅
$ion_1_0
123

Rust ❌: Decoding invalid Ion syntax encountered
offset=3
buffer head=</comment/>
buffer tail=</comment/>
buffer len=11

Package versions used:

JS Python Java Rust

Ion version 5.2.1 0.13.0 1.11.11 1.0.0-rc.7

Here's the code I used to test:

Java impl:

import com.amazon.ion.*; import com.amazon.ion.system.IonSystemBuilder; import com.amazon.ion.system.IonTextWriterBuilder; import com.fasterxml.jackson.databind.ObjectMapper; import java.util.HashMap; import java.util.Map; public class Main { public static void main(String[] args) { ObjectMapper mapper = new ObjectMapper(); Map<String, Object> response = new HashMap<>(); try { String input = args[0]; IonSystem ION = IonSystemBuilder.standard().build(); IonValue datagram = ION.getLoader().load(input); StringBuilder sb = new StringBuilder(); IonWriter writer = IonTextWriterBuilder.pretty().build(sb); datagram.writeTo(writer); writer.close(); response.put("success", true); response.put("result", sb.toString()); System.out.println(mapper.writeValueAsString(response)); } catch (Exception e) { response.put("success", false); response.put("errorType", e.getClass().getSimpleName()); response.put("error", e.toString()); try { System.out.println(mapper.writeValueAsString(response)); } catch (Exception jsonE) { System.out.println("{\"success\":false,\"errorType\":\"JsonProcessingException\",\"error\":\"" + jsonE.toString() + "\"}"); } } } }

JS impl:

import * as ion from 'ion-js' try { const input = process.argv[2] const reader = ion.makeReader(input) const writer = ion.makePrettyWriter() writer.writeValues(reader) writer.close() const decoder = new TextDecoder("utf-8") console.log(JSON.stringify({ success: true, result: decoder.decode(writer.getBytes()) })) } catch (e) { console.log(JSON.stringify({ success: false, errorType: e.constructor.name, error: e.toString() })) }

Python impl:

#!/usr/bin/env python3 import sys import json import amazon.ion.simpleion as ion try: input_str = sys.argv[1] data = ion.loads(input_str, single_value=False) result = ion.dumps(data, sequence_as_stream=True, binary=False, indent=' ') print(json.dumps({ "success": True, "result": result })) except Exception as e: print(json.dumps({ "success": False, "errorType": type(e).__name__, "error": str(e) }))

Rust impl:

use ion_rs::*; use serde_json::json; use std::env; fn main() { let args: Vec<String> = env::args().collect(); match run(&args[1]) { Ok(result) => { println!("{}", json!({ "success": true, "result": result })); } Err(e) => { println!("{}", json!({ "success": false, "errorType": format!("{:?}", e).split('(').next().unwrap_or("Error"), "error": format!("{}", e) })); } } } fn run(input: &str) -> IonResult<String> { let elements = Element::read_all(input.as_bytes())?; let out: String = elements.encode_as(v1_0::Text.with_format(TextFormat::Pretty))?; Ok(out) }

austnwil added 2 commits October 9, 2025 13:44

Specify requirements for numeric stop chars and other token boundaries

5a3f5e4

Clarify parsing of sexps

fb112e6

austnwil marked this pull request as ready for review October 9, 2025 23:26

Update to reflect current treatment of nan, +/-inf in implementations

4daa545

popematt reviewed Nov 6, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Specify requirements for numeric stop chars and other token boundaries #401

Specify requirements for numeric stop chars and other token boundaries #401

Uh oh!

austnwil commented Oct 9, 2025

Uh oh!

toddjonker commented Oct 10, 2025

Uh oh!

austnwil commented Oct 10, 2025

Uh oh!

austnwil commented Oct 10, 2025

Uh oh!

popematt Nov 5, 2025

Uh oh!

austnwil Nov 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Language	Success	Result
JavaScript	❌: Error	Error: invalid character after number
Python	❌: IonException	IERR_INVALID_SYNTAX
Java	✅	$ion_1_0 123
Rust	❌: Decoding	invalid Ion syntax encountered offset=3 buffer head=<//comment> buffer tail=<//comment> buffer len=9

Specify requirements for numeric stop chars and other token boundaries #401

Are you sure you want to change the base?

Specify requirements for numeric stop chars and other token boundaries #401

Uh oh!

Conversation

austnwil commented Oct 9, 2025

Description of changes:

Uh oh!

toddjonker commented Oct 10, 2025

Uh oh!

austnwil commented Oct 10, 2025

Uh oh!

austnwil commented Oct 10, 2025

Uh oh!

popematt Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

austnwil Nov 6, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants