Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kafka tutorial readme #13

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

ArpanRoy-Ahana
Copy link
Contributor

@tdcmeehan tdcmeehan requested a review from steveburnett August 9, 2023 23:01
Copy link
Collaborator

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! I made some comments that I'd like you to review and consider.

The Kafka Connector for Presto allows access to live topic data from Apache Kafka using Presto. This tutorial shows how to set up topics and how to create the topic description files that back Presto tables.

# Installation
This tutorial assumes familiarity with Presto and a working local Presto installation (see Deploying Presto). It will focus on setting up Apache Kafka and integrating it with Presto.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"(see Deploying Presto)" Perhaps a link could be added here?

This tutorial assumes familiarity with Presto and a working local Presto installation (see Deploying Presto). It will focus on setting up Apache Kafka and integrating it with Presto.

## Step 1: Install Apache Kafka
Download and extract Apache Kafka.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Download from where?
  2. Extract how? Do you assume the reader knows how? Is there a page in the Apache Kafka documentation that describes the preferred extraction method that you can either link to, or maybe include here if it's very short and it's too much of an interruption to the reader to go to the other page?


## Start ZooKeeper and the Kafka server:
```
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"bin/zookeeper-server-start.sh (...)" You specify bin, implying that a bin subdirectory containing zookeeper-server-start.sh exists local to the user's location in the filesystem.
Is there a specific directory the user should be in to run this command?


```
$ curl -o kafka-tpch https://repo1.maven.org/maven2/de/softwareforge/kafka_tpch_0811/1.0/kafka_tpch_0811-1.0.sh
$ chmod 755 kafka-tpch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You explained the curl command with "Download the tpch-kafka loader from Maven central:". Maybe describe why the user needs to run the command "chmod 755 kafka-tpch".

9
(10 rows)
```
The topic definition file maps the internal Kafka key (which is a raw long in eight bytes) onto a Presto BIGINT column.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"which is a raw long" what?

Download the twistr tool
```
$ curl -o twistr https://repo1.maven.org/maven2/de/softwareforge/twistr_kafka_0811/1.2/twistr_kafka_0811-1.2.sh
$ chmod 755 twistr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion as "Download the tpch-kafka loader from Maven central:", tell the user why they must run "chmod 755 twistr".
Suggest use the same phrasing here and there for consistency, so the user sees the same explanation in both places and is less likely to be confused.

$ curl -o twistr https://repo1.maven.org/maven2/de/softwareforge/twistr_kafka_0811/1.2/twistr_kafka_0811-1.2.sh
$ chmod 755 twistr
```
Create a developer account at https://dev.twitter.com/ and set up an access and consumer token.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider if there is documentation at dev.twitter.com that you can link to for "set up an access and consumer token" to help the user find how to set up an access and consumer token.

The topic definition file maps the internal Kafka key (which is a raw long in eight bytes) onto a Presto BIGINT column.

## Step 6: Map all the values from the topic message onto columns
Update the etc/kafka/tpch.customer.json file to add fields for the message and restart Presto. As the fields in the message are JSON, it uses the json data format. This is an example where different data formats are used for the key and the message.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding a line after "Update the etc/kafka/tpch.customer.json file to add fields for the message and restart Presto." to separate the instructions from the discussion.

You could make this comparison more clear with something like:
"Note that the dataFormat for the key is set to raw, and the dataFormat of the message is set to json. "

The sentence "As the fields in the message are JSON, it uses the json data format. " might not be needed depending on how you phrase the explanation.

```
Create a developer account at https://dev.twitter.com/ and set up an access and consumer token.

Create a twistr.properties file and put the access and consumer key and secrets into it:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general comment. You describe the action "create a file with the following content" three different ways in this document that I can find:

  1. add a catalog properties file etc/catalog/kafka.properties for the Kafka connector. This file lists the Kafka nodes and topics:

  2. Add the following file as etc/kafka/tpch.customer.json and restart Presto:

  3. Create a twistr.properties file and put the access and consumer key and secrets into it:

Users benefit from consistent phrasing for the same action or object. Think about revising these three instructions to use the same words.

Copy link
Member

@wanglinsong wanglinsong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please specify the difference (goal, etc) between this and https://prestodb.io/docs/current/connector/kafka-tutorial.html.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants