Commit f7fd93e: S3 loader 3.0.0

Parent: 619ce07
6 files changed (+103, -72 lines)

docs/api-reference/loaders-storage-targets/s3-loader/configuration-reference/index.md

Lines changed: 35 additions & 25 deletions
@@ -21,28 +21,38 @@ This is a complete list of the options that can be configured in the S3 loader H

 | parameter | description |
 |-----------|-------------|
-| `purpose` | Required. Use RAW to sink data exactly as-is. Use `ENRICHED_EVENTS` to also enable event latency metrics. Use `SELF_DESCRIBING` to enable partitioning self-describing data by its schema |
-| `input.appName` | Required. Kinesis Client Lib app name (corresponds to DynamoDB table name) |
-| `input.streamName` | Required. Name of the kinesis stream from which to read |
-| `input.position` | Required. Use `TRIM_HORIZON` to start streaming at the last untrimmed record in the shard, which is the oldest data record in the shard. Or use `LATEST` to start streaming just after the most recent record in the shard |
-| `input.customEndpoint` | Optional. Override the default endpoint for kinesis client api calls |
-| `input.maxRecords` | Required. How many records the client should pull from kinesis each time |
-| `output.s3.path` | Required. Full path to output data, e.g. s3://acme-snowplow-output/raw/ |
-| `output.s3.partitionFormat` | Optional. Added in version 2.1.0. Configures how files are partitioned into S3 directories.When loading raw files, you might choose to partition by `date={yy}-{mm}-{dd}`. When loading self describing jsons, you might choose to partition by `{vendor}.{name}/model={model}/date={yy}-{mm}-{dd}`. Valid substitutions are `{vendor}`, `{name}`, `{format}`, `{model}` for self-describing jsons; and `{yy}`, `{mm}`, `{dd}`, `{hh}` for year, month, day and hour. Defaults to `{vendor}.{schema}` when loading self-describing JSONs, or blank (no partitioning) when loading raw or enriched events |
-| `output.s3.filenamePrefix` | Optional. Adds a prefix to output |
-| `output.s3.compression` | Required. Either LZO or GZIP |
-| `output.s3.maxTimeout` | Required. Maximum Timeout that the application is allowed to fail for, e.g. in case of S3 outage |
-| `output.s3.customEndpoint` | Optional. Override the default endpoint for s3 client api calls |
-| `region` | Optional. When used with the `output.s3.customEndpoint` option, this sets the region of the bucket. Also sets the region of the dynamoDB table. Defaults to the current region |
-| `output.bad.streamName` | Required. Name of a kinesis stream to output failures |
-| `buffer.byteLimit` | Required. Maximum bytes to read from kinesis before flushing a file to S3 |
-| `buffer.recordLimit` | Required. Maximum records to read from kinesis before flushing a file to S3 |
-| `buffer.timeLimit` | Required. Maximum time to wait in milliseconds between writing files to S3 |
-| `monitoring.snowplow.collector` | Optional. E.g. `https://snplow.acme.com`. URI of a snowplow collector. Used for monitoring application lifecycle and failure events |
-| `monitoring.snowplow.appId` | Required only if the collector uri is also configured. Sets the appId field of the snowplow events |
-| `monitoring.sentry.dsn` | Optional, for tracking uncaught run time exceptions |
-| `monitoring.metrics.cloudwatch` | Optional boolean, with default true. This is used to disable sending metrics to cloudwatch |
-| `monitoring.metrics.hostname` | Optional, for sending loading metrics (latency and event counts) to a `statsd` server |
-| `monitoring.metrics.port` | Optional, port of the statsd server |
-| `monitoring.metrics.tags` | E.g.`{ "key1": "value1", "key2": "value2" }`. Tags are used to annotate the statsd metric with any contextual information |
-| `monitoring.metrics.prefix` | Optional, default `snoplow.s3loader`. Configures the prefix of statsd metric names |
+| `input.streamName` | Required. Name of the kinesis stream from which to read |
+| `input.appName` | Optional. Default: `snowplow-blob-loader-aws`. Kinesis Client Lib app name (corresponds to DynamoDB table name) |
+| `input.initialPosition.type` (since 3.0.0) | Optional. Default: `TRIM_HORIZON`. Set the initial position to consume the Kinesis stream. Possible values: `LATEST` (most recent data), `TRIM_HORIZON` (oldest available data), `AT_TIMESTAMP` (start from the record at or after the specified timestamp) |
+| `input.initialPosition.timestamp` (since 3.0.0) | Required for `AT_TIMESTAMP`. E.g. `2020-07-17T10:00:00Z` |
+| `input.retrievalMode.type` (since 3.0.0) | Optional. Default: `Polling`. Set the mode for retrieving records. Possible values: `Polling` or `FanOut` |
+| `input.retrievalMode.maxRecords` (since 3.0.0) | Required for `Polling`. Default: `1000`. Maximum size of a batch returned by a call to `getRecords`. Records are checkpointed after a batch has been fully processed, so the smaller `maxRecords`, the more often records can be checkpointed into DynamoDB, at the possible cost of reduced throughput |
+| `input.workerIdentifier` (since 3.0.0) | Optional. Default: host name. Name of this KCL worker used in the DynamoDB lease table |
+| `input.leaseDuration` (since 3.0.0) | Optional. Default: `10 seconds`. Duration of shard leases. KCL workers must periodically refresh leases in the DynamoDB table before this duration expires |
+| `input.maxLeasesToStealAtOneTimeFactor` (since 3.0.0) | Optional. Default: `2.0`. Controls how to pick the max number of leases to steal at one time. E.g. if there are 4 available processors and `maxLeasesToStealAtOneTimeFactor = 2.0`, then the KCL may steal up to 8 leases. Allows bigger instances to more quickly acquire the shard-leases they need to combat latency |
+| `input.checkpointThrottledBackoffPolicy.minBackoff` (since 3.0.0) | Optional. Default: `100 millis`. Minimum backoff before retrying when DynamoDB provisioned throughput is exceeded |
+| `input.checkpointThrottledBackoffPolicy.maxBackoff` (since 3.0.0) | Optional. Default: `1 second`. Maximum backoff before retrying when DynamoDB provisioned throughput is exceeded |
+| `input.debounceCheckpoints` (since 3.0.0) | Optional. Default: `10 seconds`. How frequently to checkpoint our progress to the DynamoDB table. By increasing this value, we can decrease the write-throughput requirements of the DynamoDB table |
+| `input.customEndpoint` | Optional. Override the default endpoint for kinesis client api calls |
+| `output.good.path` | Required. Full path to output data, e.g. `s3://acme-snowplow-output/` |
+| `output.good.partitionFormat` (since 2.1.0) | Optional. Configures how files are partitioned into S3 directories. When loading self-describing JSONs, you might choose to partition by `{vendor}.{name}/model={model}/date={yy}-{mm}-{dd}`. Valid substitutions are `{vendor}`, `{name}`, `{format}`, `{model}` for self-describing JSONs; and `{yy}`, `{mm}`, `{dd}`, `{hh}` for year, month, day and hour. Defaults to `{vendor}.{schema}` when loading self-describing JSONs, or blank when loading enriched events |
+| `output.good.filenamePrefix` | Optional. Adds a prefix to files |
+| `output.good.compression` | Optional. Has to be `GZIP` (the default) |
+| `output.bad.streamName` | Required. Name of a kinesis stream to output failures |
+| `output.bad.throttledBackoffPolicy.minBackoff` (since 3.0.0) | Optional. Default: `100 milliseconds`. Minimum backoff before retrying when writing fails with exceeded kinesis write throughput |
+| `output.bad.throttledBackoffPolicy.maxBackoff` (since 3.0.0) | Optional. Default: `1 second`. Maximum backoff before retrying when writing fails with exceeded kinesis write throughput |
+| `output.bad.recordLimit` (since 3.0.0) | Optional. Default: `500`. Maximum number of records we are allowed to send to Kinesis in one PutRecords request |
+| `output.bad.byteLimit` (since 3.0.0) | Optional. Default: `5242880`. Maximum number of bytes we are allowed to send to Kinesis in one PutRecords request |
+| `purpose` | Required. `ENRICHED_EVENTS` for enriched events or `SELF_DESCRIBING` for self-describing data |
+| `batching.maxBytes` (since 3.0.0) | Optional. Default: `67108864`. After this amount of compressed bytes has been added to the buffer, it gets written to a file (unless `maxDelay` is reached before) |
+| `batching.maxDelay` (since 3.0.0) | Optional. Default: `2 minutes`. After this delay has elapsed, the buffer gets written to a file (unless `maxBytes` is reached before) |
+| `cpuParallelismFactor` (since 3.0.0) | Optional. Default: `1`. Controls how the app splits the workload into concurrent batches which can be run in parallel, e.g. if there are 4 available processors and `cpuParallelismFactor = 0.75` then we process 3 batches concurrently. Adjusting this value can cause the app to use more or less of the available CPU |
+| `uploadParallelismFactor` (since 3.0.0) | Optional. Default: `2`. Controls the number of upload jobs that can be run in parallel, e.g. if there are 4 available processors and `uploadParallelismFactor = 2` then we run 8 upload jobs concurrently. Adjusting this value can cause the app to use more or less of the available CPU |
+| `initialBufferSize` (since 3.0.0) | Optional. Default: none. Overrides the initial size of the byte buffer that holds the compressed events in-memory before they get written to a file. If not set, the initial size is picked dynamically based on other configuration options. The default is known to work well. Increasing this value is a way to reduce in-memory copying, but comes at the cost of increased memory usage |
+| `monitoring.sentry.dsn` | Optional. For tracking uncaught runtime exceptions |
+| `monitoring.metrics.statsd.hostname` | Optional. For sending loading metrics (latency and event counts) to a `statsd` server |
+| `monitoring.metrics.statsd.port` | Optional. Port of the statsd server |
+| `monitoring.metrics.statsd.tags` | E.g. `{ "key1": "value1", "key2": "value2" }`. Tags are used to annotate the statsd metric with any contextual information |
+| `monitoring.metrics.statsd.prefix` | Optional. Default: `snowplow.s3loader`. Configures the prefix of statsd metric names |
+| `monitoring.healthProbe.port` (since 3.0.0) | Optional. Default: `8080`. Port of the HTTP server that returns OK only if the app is healthy |
+| `monitoring.healthProbe.unhealthyLatency` (since 3.0.0) | Optional. Default: `2 minutes`. Health probe becomes unhealthy if any received event has still not been fully processed within this cutoff time |
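
To see how the 3.0.0 options fit together, here is a minimal, illustrative HOCON sketch of a configuration using a handful of the keys from the table above. It is not taken from the release: the stream names, app name, bucket and tuning values are placeholders, and only a subset of options is shown.

```hocon
{
  "input": {
    # Placeholder stream and KCL application names
    "streamName": "enriched-good"
    "appName": "acme-s3-loader"

    # New in 3.0.0: nested initial position and retrieval mode
    "initialPosition": {
      "type": "TRIM_HORIZON"
    }
    "retrievalMode": {
      "type": "Polling"
      "maxRecords": 1000
    }
  }

  "output": {
    "good": {
      # Placeholder bucket; partition output files by date
      "path": "s3://acme-snowplow-output/enriched/"
      "partitionFormat": "date={yy}-{mm}-{dd}"
      "compression": "GZIP"
    }
    "bad": {
      "streamName": "s3-loader-bad"
    }
  }

  "purpose": "ENRICHED_EVENTS"

  # Replaces the old buffer.* settings: a file is flushed when either limit is hit
  "batching": {
    "maxBytes": 67108864
    "maxDelay": "2 minutes"
  }
}
```

The sample `config.hocon` shipped in the loader's GitHub repository remains the authoritative reference for the full key layout and defaults.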

docs/api-reference/loaders-storage-targets/s3-loader/index.md

Lines changed: 6 additions & 31 deletions
@@ -10,23 +10,12 @@ import CodeBlock from '@theme/CodeBlock';

Snowplow S3 Loader consumes records from an [Amazon Kinesis](http://aws.amazon.com/kinesis/) stream and writes them to [S3](http://aws.amazon.com/s3/). A typical Snowplow pipeline would use the S3 loader in several places:

-- Load collector payloads from the "raw" stream, to maintain an archive of the original data, before enrichment.
- Load enriched events from the "enriched" stream. These serve as input for [the RDB loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) when loading to a warehouse.
- Load failed events from the "bad" stream.

Records that can't be successfully written to S3 are written to a [second Kinesis stream](https://github.com/snowplow/snowplow-s3-loader/blob/master/examples/config.hocon.sample#L75) with the error message.

-## Output Formats
-
-### LZO
-
-Records are treated as raw byte arrays. [Elephant Bird's](https://github.com/twitter/elephant-bird/) `BinaryBlockWriter` class is used to serialize them as a [Protocol Buffers](https://github.com/google/protobuf/) array (so it is clear where one record ends and the next begins) before compressing them.
-
-The compression process generates both compressed .lzo files and small .lzo.index files ([splittable LZO](https://github.com/twitter/hadoop-lzo)). Each index file contain the byte offsets of the LZO blocks in the corresponding compressed file, meaning that the blocks can be processed in parallel.
-
-LZO encoding is generally used for raw data produced by Snowplow Collector.
-
-### Gzip
+## Output format: GZIP

The records are treated as byte arrays containing UTF-8 encoded strings (whether CSV, JSON or TSV). New lines are used to separate records written to a file. This format can be used with the Snowplow Kinesis Enriched stream, among other streams.

@@ -42,17 +31,10 @@ A Terraform module which deploys the Snowplow S3 Loader on AWS EC2 for use with

### Docker image

-We publish three different flavours of the docker image.
-
-- <p> Pull the <code>{`:${versions.s3Loader}`}</code> tag if you only need GZip output format </p>
-- <p> Pull the <code>{`:${versions.s3Loader}-lzo`}</code> tag if you also need LZO output format </p>
-- <p> Pull the <code>{`:${versions.s3Loader}-distroless`}</code> tag for an lightweight alternative to <code>{`:${versions.s3Loader}`}</code> </p>
+We publish two different flavours of the docker image:

-<CodeBlock language="bash">{
-`docker pull snowplow/snowplow-s3-loader:${versions.s3Loader}
-docker pull snowplow/snowplow-s3-loader:${versions.s3Loader}-lzo
-docker pull snowplow/snowplow-s3-loader:${versions.s3Loader}-distroless
-`}</CodeBlock>
+- <p><code>{`snowplow/snowplow-s3-loader:${versions.s3Loader}`}</code></p>
+- <p><code>{`snowplow/snowplow-s3-loader:${versions.s3Loader}-distroless`}</code> (lightweight alternative)</p>

Here is a standard command to run the loader on a EC2 instance in AWS:

@@ -73,15 +55,8 @@ Here is a standard command to run the loader on a EC2 instance in AWS:

### Jar

-JARs can be found attached to the [Github release](https://github.com/snowplow/snowplow-s3-loader/releases). Only pick the `-lzo` version of the JAR file if you need to output in LZO format
+JARs can be found attached to the [Github release](https://github.com/snowplow/snowplow-s3-loader/releases).

<CodeBlock language="bash">{
`java -jar snowplow-s3-loader-${versions.s3Loader}.jar --config config.hocon
-java -jar snowplow-s3-loader-lzo-${versions.s3Loader}.jar --config config.hocon
-`}</CodeBlock>
-
-Running the jar requires to have the native LZO binaries installed. For example for Debian this can be done with:
-
-```bash
-sudo apt-get install lzop liblzo2-dev
-```
+`}</CodeBlock>

docs/api-reference/loaders-storage-targets/s3-loader/monitoring/index.md

Lines changed: 9 additions & 15 deletions
@@ -14,11 +14,15 @@ When processing enriched events, the S3 loader can emit metrics to a statsd daem

```text
snowplow.s3loader.count:42|c|#tag1:value1
-snowplow.s3loader.latency_collector_to_load:123.4|g|#tag1:value1
+snowplow.s3loader.latency_collector_to_load:123|g|#tag1:value1
+snowplow.s3loader.latency_millis:56|g|#tag1:value1
+snowplow.s3loader.e2e_latency_millis:123|g|#tag1:value1
```

- `count_good`: the total number of events in the batch that was loaded.
-- `latency_collector_to_load`: this is the time difference between reaching the collector and getting loaded to S3.
+- `latency_collector_to_load`: the time difference between reaching the collector and getting loaded to S3 (only for enriched events).
+- `latency_millis`: the delay between the input record getting written to the stream and the S3 loader starting to process it.
+- `e2e_latency_millis`: the same as `latency_collector_to_load` (which will eventually be deprecated).

Statsd monitoring is configured by setting the `monitoring.metrics.statsd` section in [the hocon file](/docs/api-reference/loaders-storage-targets/s3-loader/configuration-reference/index.md):

@@ -35,6 +39,9 @@ Statsd monitoring is configured by setting the `monitoring.metrics.statsd` sec
}
}
```
+## Health probe
+
+Starting with version `3.0.0`, the S3 loader exposes a health probe, configured via the `monitoring.healthProbe` section (see the configuration reference).
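
As a rough illustration only (the exact key layout should be checked against the sample config for the release), the two health probe settings listed in the configuration reference could look like this in the hocon file:

```hocon
"monitoring": {
  "healthProbe": {
    # Port of the HTTP server that returns OK only while the app is healthy
    "port": 8080
    # Report unhealthy if any received event is still not fully processed after this long
    "unhealthyLatency": "2 minutes"
  }
}
```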

## Sentry

@@ -49,16 +56,3 @@ Sentry monitoring is configured by setting the `monitoring.sentry.dsn` key in
"dsn": "http://sentry.acme.com"
}
```
-
-## Snowplow Tracking
-
-The loader can emit a Snowplow event to a collector when the application experiences runtime problems. It sends [app_initialized](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.monitoring.kinesis/app_initialized/jsonschema/1-0-0) and [app_heartbeat](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.monitoring.kinesis/app_heartbeat/jsonschema/1-0-0) events to show the application is alive. A [storage_write_failed event](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.monitoring.kinesis/storage_write_failed/jsonschema/1-0-0) is sent when a file cannot be written to S3, and a [app_shutdown event](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.monitoring.kinesis/app_shutdown/jsonschema/1-0-0) is sent when the application exits due to too many S3 errors.
-
-Snowplow monitoring is configured by setting the `monitoring.snowplow` section in [the hocon file](/docs/api-reference/loaders-storage-targets/s3-loader/configuration-reference/index.md):
-
-```json
-"monitoring": {
-"appId": "redshift-loader"
-"collector": "collector.acme.com"
-}
-```
