-
Notifications
You must be signed in to change notification settings - Fork 24
Specify requirements for numeric stop chars and other token boundaries #401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gh-pages
Are you sure you want to change the base?
Specify requirements for numeric stop chars and other token boundaries #401
Conversation
|
Are these issues not clarified by the grammar? That's the proper spec for text parsing questions. |
The grammar is not clear on what the behavior of
This has led to implementations handling ambiguous cases in ways that appear inconsistent. I'm not suggesting we change the spec, just that we more clearly document what the correct behavior is, or determine it if it is not known. This should make it easier to fix the inconsistencies (or close the above issues if the behavior is actually correct). Even if the grammar is unambiguous it would still be helpful to explicitly state the correct behavior in the docs as well. |
|
Most implementations currently seem to require numeric stops after When #175 was reported, rust parsed certain inputs differently from java and javascript; this no longer seems to be the case. Rust and java implementations still reject |
| Anything that is not a numeric stop-character appearing immediately after a numeric or timestamp value is a syntax error. This notably includes comments and S-expression operators: | ||
|
|
||
| ```ion | ||
| 123// a comment // ERROR: single-line comment is not a valid numeric stop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is surprising. Comments are explicitly called out in the spec as being equivalent to whitespace, and I believe that all of our implementations treat comments as a valid whitespace for the purposes of terminating a numeric value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also found this surprising, but some implementations do reject Ion that uses a comment immediately after a numeric.
The spec does say specifically:
In the text notation, real values [and int, ts] must be followed by one of the fifteen numeric stop-characters: {},"'\ \t\n\r\v\f.
Which calls out specific whitespace characters \t\n\r\v\f and excludes comments.
Here are comparisons of our Ion JS, Python, Java, and Rust impls. I did not test Go, C yet.
Handling of 123//comment:
| Language | Success | Result |
|---|---|---|
| JavaScript | ❌: Error | Error: invalid character after number |
| Python | ❌: IonException | IERR_INVALID_SYNTAX |
| Java | ✅ | $ion_1_0 123 |
| Rust | ❌: Decoding | invalid Ion syntax encountered offset=3 buffer head=<//comment> buffer tail=<//comment> buffer len=9 |
Handling of 123/*comment*/:
| Language | Success | Result |
|---|---|---|
| JavaScript | ❌: Error | Error: invalid character after number |
| Python | ❌: IonException | IERR_INVALID_SYNTAX |
| Java | ✅ | $ion_1_0 123 |
| Rust | ❌: Decoding | invalid Ion syntax encountered offset=3 buffer head=</comment/> buffer tail=</comment/> buffer len=11 |
Package versions used:
| JS | Python | Java | Rust | |
|---|---|---|---|---|
| Ion version | 5.2.1 | 0.13.0 | 1.11.11 | 1.0.0-rc.7 |
Here's the code I used to test:
Java impl:
import com.amazon.ion.*;
import com.amazon.ion.system.IonSystemBuilder;
import com.amazon.ion.system.IonTextWriterBuilder;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.HashMap;
import java.util.Map;
public class Main {
public static void main(String[] args) {
ObjectMapper mapper = new ObjectMapper();
Map<String, Object> response = new HashMap<>();
try {
String input = args[0];
IonSystem ION = IonSystemBuilder.standard().build();
IonValue datagram = ION.getLoader().load(input);
StringBuilder sb = new StringBuilder();
IonWriter writer = IonTextWriterBuilder.pretty().build(sb);
datagram.writeTo(writer);
writer.close();
response.put("success", true);
response.put("result", sb.toString());
System.out.println(mapper.writeValueAsString(response));
} catch (Exception e) {
response.put("success", false);
response.put("errorType", e.getClass().getSimpleName());
response.put("error", e.toString());
try {
System.out.println(mapper.writeValueAsString(response));
} catch (Exception jsonE) {
System.out.println("{\"success\":false,\"errorType\":\"JsonProcessingException\",\"error\":\"" + jsonE.toString() + "\"}");
}
}
}
}JS impl:
import * as ion from 'ion-js'
try {
const input = process.argv[2]
const reader = ion.makeReader(input)
const writer = ion.makePrettyWriter()
writer.writeValues(reader)
writer.close()
const decoder = new TextDecoder("utf-8")
console.log(JSON.stringify({
success: true,
result: decoder.decode(writer.getBytes())
}))
} catch (e) {
console.log(JSON.stringify({
success: false,
errorType: e.constructor.name,
error: e.toString()
}))
}Python impl:
#!/usr/bin/env python3
import sys
import json
import amazon.ion.simpleion as ion
try:
input_str = sys.argv[1]
data = ion.loads(input_str, single_value=False)
result = ion.dumps(data, sequence_as_stream=True, binary=False, indent=' ')
print(json.dumps({
"success": True,
"result": result
}))
except Exception as e:
print(json.dumps({
"success": False,
"errorType": type(e).__name__,
"error": str(e)
}))Rust impl:
use ion_rs::*;
use serde_json::json;
use std::env;
fn main() {
let args: Vec<String> = env::args().collect();
match run(&args[1]) {
Ok(result) => {
println!("{}", json!({
"success": true,
"result": result
}));
}
Err(e) => {
println!("{}", json!({
"success": false,
"errorType": format!("{:?}", e).split('(').next().unwrap_or("Error"),
"error": format!("{}", e)
}));
}
}
}
fn run(input: &str) -> IonResult<String> {
let elements = Element::read_all(input.as_bytes())?;
let out: String = elements.encode_as(v1_0::Text.with_format(TextFormat::Pretty))?;
Ok(out)
}
Description of changes:
There are numerous open issues involving ambiguity/incorrect behavior around how parsers should split certain tokens in Ion text (see #176, #175, amazon-ion/ion-rust#414, amazon-ion/ion-js#724). These have caused confusing behavior and inconsistencies between and even within implementations on how certain tokens that don't have any whitespace between them are split. A common theme across some of these issues is that the spec is not very clear on requirements around numeric stop characters and behavior of
+/-insexps.This PR adds some explicit documentation on when tokens can appear next to each other and what the behavior of
+/-in sexp is (subject to change). Once this is cleared up it should be easier to fix some of the linked issues because it will be more clear whether the current behavior is correct or not (particularly with #175 and #176).By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.