Fixtures #35
Unanswered · vpetrovykh asked this question in Ideas
Replies: 1 comment

Thanks @vpetrovykh, great overview. We'll get back to this soon.
Synthetic data for use in fixtures
There are a number of different open-source projects that offer
synthetic data. Some of them target very specific niche usages and
concentrate on effectively anonymizing real data while preserving its
overall form. This is sometimes useful, but it's not quite the same
scope as generating fixtures for tests based on the schema alone.
Another consideration is the language used by the tool: ideally it
would be a language we already use in our tools, so that we can
minimize the support effort. I mention this because there seem to be a
few Java data generators that look good on paper. Keeping all of this
in mind, the top choices (in no particular order) appear to be:
Synth
Written in Rust and integrated with the Faker Python library.
It has good performance and nice features out-of-the-box. It uses a
declarative schema to generate the data and the schema is pretty
flexible. The schema understands that there are objects and can
reference them from other objects (say, by ID, but really by any
property). The input (for the schema) and output (for the data)
formats are both JSON, so they are simple to learn, generate and
ingest.
A minor but notable issue is that we need to be able to generate some
binary strings. For this we would either need to fork Synth and update
it with a custom generator (really, just add a converter from Faker's
binary generator to something acceptable in JSON), or we could even
use a regexp generator to construct the JSON string that represents
the needed values.
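To illustrate the end result of that second approach, binary values can be carried through JSON as text. The base64 encoding below is an assumption for illustration, not something Synth prescribes:

```python
import base64
import json
import os

def encode_binary_field(raw: bytes) -> str:
    # JSON has no binary type, so encode the bytes as base64 text;
    # the ingesting side decodes it back into a binary string.
    return base64.b64encode(raw).decode("ascii")

def decode_binary_field(encoded: str) -> bytes:
    return base64.b64decode(encoded)

# A generated fixture object with a binary property, serialized to JSON.
raw = os.urandom(16)
record = {"id": 1, "blob": encode_binary_field(raw)}
payload = json.dumps(record)

# Round-trips cleanly through plain JSON.
assert decode_binary_field(json.loads(payload)["blob"]) == raw
```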
Another note is that it has been hard to google, because for most
general questions the results get polluted by the musical meaning of
"synth".
Chances are we would need to fork it (either to support latest Python
or to add some more custom generators), which means we'd also need to
maintain that fork.
Faker
It has a large selection of generators and, since it's in Python, it
would be easy for us to customize if need be. The library itself is
quite popular (GitHub stars and use by other projects), so it's likely
to stay relevant. There are also localized versions of the generators.
The performance might become a bottleneck for very large datasets
(1,000,000+ objects).
There's no "regexp" text generator out of the box, so some custom
generator would have to be implemented here. It appears that there are
other smaller projects that do this (e.g.
https://github.com/asciimoo/exrex), so we could make a custom Faker
generator.
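To make the idea concrete, here is a toy sketch of a regexp-driven string generator for a tiny pattern subset (character classes and {m,n} repetition). A real implementation would more likely wrap a library like exrex as a custom Faker provider; all names below are hypothetical:

```python
import random

def expand_class(body: str) -> list[str]:
    # Expand "a-z0-9" style range bodies into individual characters.
    chars, i = [], 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars

def generate_from_pattern(pattern: str, rng: random.Random) -> str:
    # Toy generator for a tiny regex subset: literal characters,
    # [a-z]-style classes, and an optional {m,n} repetition suffix.
    out = []
    i = 0
    while i < len(pattern):
        # Parse one atom: a character class or a literal character.
        if pattern[i] == "[":
            end = pattern.index("]", i)
            choices = expand_class(pattern[i + 1:end])
            i = end + 1
        else:
            choices = [pattern[i]]
            i += 1
        # Parse the optional {m,n} repetition suffix.
        reps = 1
        if i < len(pattern) and pattern[i] == "{":
            end = pattern.index("}", i)
            lo, _, hi = pattern[i + 1:end].partition(",")
            reps = rng.randint(int(lo), int(hi or lo))
            i = end + 1
        out.extend(rng.choice(choices) for _ in range(reps))
    return "".join(out)

rng = random.Random(42)
sample = generate_from_pattern("[A-F]{2,4}-[0-9]{3}", rng)
```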
It's agnostic of objects and links, so it would fall to our own
framework to keep those in mind and correlate generated objects
correctly.
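A minimal sketch of such a correlation layer, with hypothetical names (the f-string below stands in for an actual Faker call):

```python
import random

def generate_fixture(num_users: int, num_posts: int, seed: int = 0):
    """Sketch of a correlation layer: value generators like Faker
    produce scalars, so object identity and links between objects are
    tracked by our own code, not by the generator library."""
    rng = random.Random(seed)
    # Placeholder values; a real version would call Faker here.
    users = [{"id": i, "name": f"user-{i}"} for i in range(num_users)]
    # Each post links to an existing user by ID; the choice of target
    # is made by our layer, not by the value generator.
    posts = [
        {"id": i, "author_id": rng.choice(users)["id"]}
        for i in range(num_posts)
    ]
    return users, posts

users, posts = generate_fixture(3, 5)
```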
Mimesis
It is similar to Faker, with mostly the same features, and claims to
be faster (about 10x compared to Faker).
Most of the strengths and weaknesses are shared with Faker. It also
has no regexp generator, so that's a shared problem, but at least
performance would be less of a concern.
Recommendation
I think that Mimesis would be a good option for us to use as the
basis for the fixture generator. It has good performance, and it
should be fairly easy for us to integrate and customize without
forking the entire project; this would afford us good flexibility with
the features that we actually offer and allow us to concentrate on our
end of things: designing the fixture tools and workflows. Although it
lacks any knowledge of objects, we would probably want our own layer
for handling that anyway, because we need to account not only for
links, but also for inheritance (potentially generating objects of
different types for a link, etc.). A regexp generator should be fairly
straightforward to add, with a few projects to look at for reference
on how others approached this if we run into performance issues.
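A hypothetical sketch of the inheritance point: when a link targets an abstract type, the fixture layer first resolves it to a concrete subtype before generating the object. The schema structure and names here are illustrative only:

```python
import random

# Illustrative schema: an abstract type with two concrete subtypes.
SCHEMA = {
    "Content": {"abstract": True, "subtypes": ["Article", "Video"]},
    "Article": {"fields": ["title", "body"]},
    "Video": {"fields": ["title", "duration"]},
}

def generate_link_target(type_name: str, rng: random.Random) -> dict:
    node = SCHEMA[type_name]
    if node.get("abstract"):
        # Resolve the abstract type to one of its concrete subtypes,
        # so links can point at any member of the hierarchy.
        type_name = rng.choice(node["subtypes"])
        node = SCHEMA[type_name]
    # Field values left as None; a real version would generate them.
    return {"__type__": type_name, **{f: None for f in node["fields"]}}

rng = random.Random(7)
targets = [generate_link_target("Content", rng) for _ in range(10)]
```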
Fixture Workflow
There are different reasons for wanting fixtures, and they result in
slightly different workflows.
Testing: this use case often focuses on a large number of diverse
minimal test cases. This means potentially hundreds of tests, each
with its own slightly different (hopefully lightweight) fixture. This
workflow may benefit from integration with tools such as Hypothesis
(https://hypothesis.works/).
Demo data: fixtures used while developing UI. These tend to be fairly
large, potentially with a number of nuanced corner cases all baked
into one monolithic dataset.
Performance testing: these fixtures need to be very large, and control
over the statistical distribution of various fixture parameters (from
the length of str fields to the number of links) is very important for
this use case.
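For the performance-testing case, distribution control might look roughly like this; the distribution choices and names are illustrative, not a proposal:

```python
import random

def sample_sizes(count: int, seed: int = 0):
    """Sketch: the length of a str field and the number of links per
    object are drawn from configurable distributions, not fixed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        # Description lengths roughly normal around 80 characters.
        desc_len = max(0, int(rng.gauss(mu=80, sigma=25)))
        # Link counts skewed toward small numbers (long tail), capped.
        num_links = min(int(rng.expovariate(1 / 3)), 50)
        rows.append((desc_len, num_links))
    return rows

rows = sample_sizes(10_000)
avg_desc = sum(d for d, _ in rows) / len(rows)
```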
Fixture-Schema integration
Some information for the fixtures can be inferred from the schema
as-is, based on types and constraints: enum values, numbers with or
without min/max constraints, etc.
Other information is more contextual: birth dates being in the past,
first/last name fields, description length. These kinds of features
are more or less global, in the sense that, no matter the particular
use case, if they are present in a fixture they'd have the same
configuration. In that sense they are similar to the schema and might
benefit from some centralized way of capturing them, so they are kept
consistent across all uses.
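One possible shape for such a centralized layer is a shared registry of contextual generators keyed by field name, which every fixture variant resolves the same way. All names here are hypothetical:

```python
import datetime
import random

rng = random.Random(1)

# Central registry: settings that hold across every fixture (birth
# dates in the past, name fields, description length) live in one place.
CONTEXTUAL_GENERATORS = {
    "birth_date": lambda: datetime.date.today()
    - datetime.timedelta(days=rng.randint(18 * 365, 90 * 365)),
    "first_name": lambda: rng.choice(["Ada", "Grace", "Alan"]),
    "description": lambda: "x" * rng.randint(20, 200),
}

def generate_field(name: str):
    # Every fixture variant resolves contextual fields identically.
    return CONTEXTUAL_GENERATORS[name]()

bd = generate_field("birth_date")
assert bd < datetime.date.today()  # always in the past
```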
There's information that is so highly contextual that it would vary
wildly from fixture variant to fixture variant, and possibly even
within the same fixture: the number of links; an "active" boolean flag
being always true for finalized objects, but occasionally false in
some exceptional situation; booking dates being in the future for new
objects (which may have some other common features aside from the
date), but always in the past for logged historical booking records
(which might all have ratings, something that new bookings don't yet
have); etc. There's not much re-use of these, and there may be several
fixtures all configured slightly differently for one common underlying
schema.
For more monolithic fixtures it may be tempting to incorporate the
fixture settings into the schema itself (as annotations, for example)
so that there's a single source of truth and the risk of the schema
and fixture drifting apart is minimized. However, in a situation where
fixtures are used in tests the danger is that adding a new test may
require a schema change to add/alter some fixture information, which
seems like overhead.
Perhaps the fixtures should be a separate structure that mirrors the
schema. The core of this can be automatically inferred directly from
the schema, and the rest would work like aliases and schema views on
top of the base fixture.
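A rough sketch of that idea: a base fixture spec inferred from the schema, with per-fixture overrides layered on top, similar to aliases/views. The data structures here are hypothetical:

```python
# Illustrative schema with types and constraints.
SCHEMA = {
    "User": {
        "age": {"type": "int", "min": 0, "max": 120},
        "status": {"type": "enum", "values": ["active", "banned"]},
    }
}

def infer_base_spec(schema: dict) -> dict:
    """Infer the core fixture spec directly from types/constraints."""
    spec = {}
    for obj, fields in schema.items():
        spec[obj] = {}
        for name, f in fields.items():
            if f["type"] == "int":
                spec[obj][name] = {
                    "kind": "int_range",
                    "min": f.get("min", 0),
                    "max": f.get("max", 100),
                }
            elif f["type"] == "enum":
                spec[obj][name] = {"kind": "choice", "values": f["values"]}
    return spec

def with_overrides(base: dict, overrides: dict) -> dict:
    # A fixture variant replaces only the fields it cares about,
    # like a view layered on top of the base spec.
    merged = {obj: dict(fields) for obj, fields in base.items()}
    for obj, fields in overrides.items():
        merged.setdefault(obj, {}).update(fields)
    return merged

base = infer_base_spec(SCHEMA)
teen_fixture = with_overrides(
    base, {"User": {"age": {"kind": "int_range", "min": 13, "max": 19}}}
)
```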