Skip to content

Optimizes reading binary symbol and integer values #396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 6, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -226,7 +226,8 @@ massif.out.*
bb.out
bb.out.*


# macOS folder metadata
**/.DS_Store

### VisualStudioCode ###
.vscode/*
Expand Down
13 changes: 4 additions & 9 deletions src/binary/non_blocking/binary_buffer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -93,7 +93,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> {
/// If the buffer is not empty, returns `Some(_)` containing the next byte in the buffer.
/// Otherwise, returns `None`.
pub fn peek_next_byte(&self) -> Option<u8> {
self.bytes().get(0).copied()
self.data.as_ref().get(self.start).copied()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The original code created a byte slice (&[u8]) even though we only needed the first byte in the data. Now it just gets the first byte without creating an intermediate slice.

}

/// If there are at least `n` bytes left in the buffer, returns `Some(_)` containing a slice
Expand Down Expand Up @@ -262,7 +262,7 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> {
///
/// See: https://amzn.github.io/ion-docs/docs/binary.html#uint-and-int-fields
pub fn read_uint(&mut self, length: usize) -> IonResult<DecodedUInt> {
if length <= mem::size_of::<usize>() {
if length <= mem::size_of::<u64>() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Here and below: the code was originally checking for usize, which had the potential to be overly conservative. On 32-bit systems, this would cause medium-sized values to be pushed into a BigInt.

In some cases (like reading a UInt that represents a SymbolId), limiting it to usize makes sense. However, that's not this method's responsibility.

return self.read_small_uint(length);
}

Expand All @@ -271,18 +271,13 @@ impl<A: AsRef<[u8]>> BinaryBuffer<A> {
}

/// Reads the first `length` bytes from the buffer as a `UInt`. The caller must confirm that
/// `length` is small enough to fit in a `usize`.
/// `length` is small enough to fit in a `u64`.
#[inline]
fn read_small_uint(&mut self, length: usize) -> IonResult<DecodedUInt> {
let uint_bytes = self
.peek_n_bytes(length)
.ok_or_else(|| incomplete_data_error_raw("a UInt", self.total_consumed()))?;
let mut magnitude: u64 = 0;
for &byte in uint_bytes {
let byte = u64::from(byte);
magnitude <<= 8;
magnitude |= byte;
}
let magnitude = DecodedUInt::small_uint_from_slice(uint_bytes);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This bitshifting loop now lives in DecodedUInt::small_uint_from_slice so it can be called from multiple locations. I'll point out the other caller location below.

self.consume(length);
Ok(DecodedUInt::new(UInteger::U64(magnitude), length))
}
Expand Down
41 changes: 23 additions & 18 deletions src/binary/non_blocking/raw_binary_reader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ use crate::result::{
decoding_error, decoding_error_raw, illegal_operation, illegal_operation_raw,
incomplete_data_error,
};
use crate::types::integer::{IntAccess, UInteger};
use crate::types::integer::IntAccess;
use crate::types::SymbolId;
use crate::{
Decimal, Integer, IonResult, IonType, RawStreamItem, RawSymbolToken, StreamReader, Timestamp,
Expand All @@ -18,6 +18,7 @@ use bytes::{BigEndian, Buf, ByteOrder};
use num_bigint::BigUint;
use num_traits::Zero;
use std::io::Read;
use std::mem;
use std::ops::Range;

/// Type, offset, and length information about the serialized value over which the
Expand Down Expand Up @@ -475,18 +476,13 @@ impl<A: AsRef<[u8]>> RawBinaryBufferReader<A> {

/// If the reader is currently positioned on a symbol value, parses that value into a `SymbolId`.
pub fn read_symbol_id(&mut self) -> IonResult<SymbolId> {
let (encoded_value, mut buffer) = self.value_and_buffer(IonType::Symbol)?;
match buffer.read_uint(encoded_value.value_length())?.value() {
Comment on lines -478 to -479
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Previously, we constructed a BinaryBuffer that wrapped the value bytes. This allowed us to call buffer.read_uint(...), which was very ergonomic. However, doing this involved a few redundant steps:

  • It constructed the same slice (&[u8]) containing the value's body more than once.
  • It checked the number of bytes remaining in the buffer more than once.
  • It had multiple error handling paths even though at most one of them could be called.

UInteger::U64(symbol_id) => {
// This will always succeed on 64-bit platforms where u64 and usize are the same.
if let Ok(sid) = usize::try_from(*symbol_id) {
Ok(sid)
} else {
decoding_error("found a u64 symbol ID that was too large to fit in a usize")
}
}
UInteger::BigUInt(symbol_id) => Self::try_symbol_id_from_big_uint(symbol_id),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This code path (try_symbol_id_from_big_uint) would only succeed if the BigUInt was small enough that it could have been encoded as a usize to begin with. This code path existed solely to satisfy one of the ion-tests:

ion-tests/iontestdata/good/typecodes/T7-large.10n

Text view of its contents:

---------------------------------------------------------------------------
 Offset   |  Length   |        Binary Ion        |         Text Ion
---------------------------------------------------------------------------
          |         4 | e0 01 00 ea              |  // Ion 1.0 Version Marker
        4 |         6 | 75 00 00 00 00 00        |   '$0' // $0
       10 |         7 | 76 00 00 00 00 00 00     |   '$0' // $0
       17 |         8 | 77 00 00 00 00 00 00 00  |   '$0' // $0
       25 |         9 | 78 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00                       |
       34 |        10 | 79 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00                    |
       44 |        11 | 7a 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00                 |
       55 |        12 | 7b 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00              |
       67 |        13 | 7c 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00 00           |
       80 |        14 | 7d 00 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00 00 00        |
       94 |        16 | 7e 8e 00 00 00 00 00 00  |   '$0' // $0
          |           | 00 00 00 00 00 00 00 00  |

I have removed that code path altogether and added the test to the skip list. I don't believe this represents a substantive loss of functionality.

let (_encoded_value, bytes) = self.value_and_bytes(IonType::Symbol)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ We no longer construct a BinaryBuffer; bytes contains the value byte slice that we're interested in.

if bytes.len() > mem::size_of::<usize>() {
return decoding_error("found a symbol Id that was too large to fit in a usize");
}
let magnitude = DecodedUInt::small_uint_from_slice(bytes);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ We can read from bytes directly with no additional boundary or error checking using the helper method we previously created: DecodedUInt::small_uint_from_slice.

// This cast is safe because we've confirmed the value was small enough to fit in a usize.
Ok(magnitude as usize)
}

/// Tries to downgrade the provided BigUint to a SymbolId (usize).
Expand Down Expand Up @@ -628,6 +624,12 @@ impl<A: AsRef<[u8]>> StreamReader for RawBinaryBufferReader<A> {
Box::new(self.annotations_iter())
}

fn has_annotations(&self) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This method is incidental to this changeset, but is also an important optimization. Readers checking to see if the current value has an annotation can simply see if the encoding had an annotations sequence; this is much faster than the default implementation that constructs a boxed iterator and reports whether it is empty.

self.encoded_value()
.map(|v| v.annotations_sequence_length > 0)
.unwrap_or(false)
}

fn field_name(&self) -> IonResult<Self::Symbol> {
// If the reader is parked on a value...
self.encoded_value()
Expand Down Expand Up @@ -685,9 +687,12 @@ impl<A: AsRef<[u8]>> StreamReader for RawBinaryBufferReader<A> {
}

fn read_integer(&mut self) -> IonResult<Integer> {
let (encoded_value, mut buffer) = self.value_and_buffer(IonType::Integer)?;
let uint: DecodedUInt = buffer.read_uint(encoded_value.value_length())?;
let value: Integer = uint.into();
let (encoded_value, bytes) = self.value_and_bytes(IonType::Integer)?;
let value: Integer = if bytes.len() <= mem::size_of::<u64>() {
DecodedUInt::small_uint_from_slice(bytes).into()
} else {
DecodedUInt::big_uint_from_slice(bytes).into()
};
Comment on lines +690 to +695
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This optimization is similar to the read_symbol_id optimization above. We:

  • Use bytes directly instead of creating an intermediate buffer.
  • Use our new helper method to read all of bytes without extra bounds checking or error reporting.

We also get a minor boost from the fact that small_uint_from_slice and big_uint_from_slice return their corresponding types. Instead of getting back a DecodedUInt (which needs to be destructured in order to turn it into an Integer), we get back a u64 or a BigUInt depending on the branch. We can then call the appropriate into() method to get an Integer without the extra destructuring step.


use self::IonTypeCode::*;
let value = match (encoded_value.header.ion_type_code, value) {
Expand Down Expand Up @@ -1195,7 +1200,7 @@ impl<'a, A: AsRef<[u8]>> TxReader<'a, A> {
// Read the length of the annotations sequence
let annotations_length = self.tx_buffer.read_var_uint()?;

// Validate that neither the annotations sequence is not empty.
// Validate that the annotations sequence is not empty.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Fixing a typo.

if annotations_length.value() == 0 {
return decoding_error("found an annotations wrapper with no annotations");
}
Expand All @@ -1204,7 +1209,7 @@ impl<'a, A: AsRef<[u8]>> TxReader<'a, A> {
let expected_value_length = annotations_and_value_length
- annotations_length.size_in_bytes()
- annotations_length.value();
self.tx_buffer.total_consumed();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Removing a stray accessor function that had no impact on the method logic (and was optimized out of the assembly anyway).


if expected_value_length == 0 {
return decoding_error("found an annotation wrapper with no value");
}
Expand Down Expand Up @@ -1749,7 +1754,7 @@ mod tests {
fn debug() -> IonResult<()> {
let data = &[
0xE0, 0x01, 0x00, 0xEA, // IVM
0xc3, 0xd2, 0x84, 0x11, // {'name': true}
0xc3, 0xd2, 0x84, 0x11, // ({'name': true})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Correcting the text Ion in the comment for this unit test.

]; // Empty string
let mut reader = RawBinaryBufferReader::new(data);
let item = reader.next()?;
Expand Down
18 changes: 18 additions & 0 deletions src/binary/uint.rs
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,24 @@ impl DecodedUInt {
}
}

/// Interprets all of the bytes in the provided slice as big-endian unsigned integer bytes.
/// The caller must confirm that `uint_bytes` is no longer than 8 bytes long; otherwise,
/// overflow may quietly occur.
pub(crate) fn small_uint_from_slice(uint_bytes: &[u8]) -> u64 {
let mut magnitude: u64 = 0;
for &byte in uint_bytes {
let byte = u64::from(byte);
magnitude <<= 8;
magnitude |= byte;
}
magnitude
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Our new helper methods that operate directly on the provided byte slice.


/// Interprets all of the bytes in the provided slice as big-endian unsigned integer bytes.
pub(crate) fn big_uint_from_slice(uint_bytes: &[u8]) -> BigUint {
BigUint::from_bytes_be(uint_bytes)
}

/// Reads a UInt with `length` bytes from the provided data source.
pub fn read<R: IonDataSource>(data_source: &mut R, length: usize) -> IonResult<DecodedUInt> {
if length > MAX_UINT_SIZE_IN_BYTES {
Expand Down
42 changes: 42 additions & 0 deletions src/types/integer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -316,6 +316,48 @@ impl Zero for Integer {
}
}

// Trivial conversion to Integer::I64 from integers that can safely be converted to an i64
macro_rules! impl_integer_i64_from {
($($t:ty),*) => ($(
impl From<$t> for Integer {
fn from(value: $t) -> Integer {
let i64_value = i64::from(value);
Integer::I64(i64_value)
}
}
)*)
}
impl_integer_i64_from!(u8, u16, u32, i8, i16, i32, i64);

// Conversion to Integer from integer types that may or may not fit in an i64
macro_rules! impl_integer_from {
($($t:ty),*) => ($(
impl From<$t> for Integer {
fn from(value: $t) -> Integer {
match i64::try_from(value) {
Ok(i64_value) => Integer::I64(i64_value),
Err(_) => Integer::BigInt(BigInt::from(value))
}
}
}
)*)
}

impl_integer_from!(isize, usize, u64);

impl From<BigUint> for Integer {
fn from(value: BigUint) -> Self {
let big_int = BigInt::from(value);
Integer::BigInt(big_int)
}
}

impl From<BigInt> for Integer {
fn from(value: BigInt) -> Self {
Integer::BigInt(value)
}
}

impl<T> IntAccess for T
where
T: Element,
Expand Down
4 changes: 4 additions & 0 deletions tests/element_test_vectors.rs
Original file line number Diff line number Diff line change
Expand Up @@ -788,6 +788,10 @@ mod non_blocking_native_element_tests {
"ion-tests/iontestdata/good/subfieldVarUInt15bit.ion",
"ion-tests/iontestdata/good/subfieldVarUInt16bit.ion",
"ion-tests/iontestdata/good/subfieldVarUInt32bit.ion",
// This test requires the reader to be able to read symbols whose ID is encoded
// with more than 8 bytes. Having a symbol table with more than 18 quintillion
// symbols is not very practical.
"ion-tests/iontestdata/good/typecodes/T7-large.10n",
// ---
// Requires importing shared symbol tables
"ion-tests/iontestdata/good/item1.10n",
Expand Down