
Adds e2e tests #65

Merged: 36 commits merged into GoogleCloudDataproc:main on Dec 28, 2023
Conversation

prashastia (Collaborator)

A simple e2e test to check reading a small BigQuery table.

This module is similar to the BigQueryExample, with a few changes to count the number of records read and log them.
This test reads a simpleTable.
Shell script and python script to check the number of records read.
@prashastia (Collaborator Author)

/gcbrun

Comments out CODECOV_TOKEN usage.
@prashastia (Collaborator Author)

/gcbrun

Comment on lines 12 to 13
# Install Python and Basic Python Tools (Assuming VM does not have them)
RUN apt-get -y install python3 && apt clean
Collaborator

Let's remove the "Assuming" postscript. If the VM doesn't have them, then it's fine to install like this. But if the VM does, then we must remove this superfluous installation.

# We won't run this async as we can wait for a bounded job to succeed or fail.
gcloud dataproc jobs submit flink --id "$JOB_ID" --jar="$GCS_JAR_LOCATION" --cluster="$CLUSTER_NAME" --region="$REGION" -- --gcp-project "$PROJECT_NAME" --bq-dataset "$DATASET_NAME" --bq-table "$TABLE_NAME" --agg-prop "$AGG_PROP_NAME" --query "$QUERY"
# Wait for the logs to be saved.
sleep 20
Collaborator

Aren't logs available as soon as the job ends?
If not, then how was 20 seconds decided?
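
If the wait exists because driver output lands in Cloud Storage asynchronously, polling for the log blob is more robust than a fixed sleep. A minimal sketch, assuming the logs are written under a known GCS bucket and prefix (both names below are placeholders):

```python
# Minimal polling sketch, assuming driver output lands under a known GCS
# prefix. Bucket and prefix names are placeholders, not the real values.
import time

from google.cloud import storage


def wait_for_driver_output(bucket_name, prefix, timeout_s=120, poll_s=5):
    client = storage.Client()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        blobs = client.list_blobs(bucket_name, prefix=prefix, max_results=1)
        if next(iter(blobs), None) is not None:  # a log blob exists
            return True
        time.sleep(poll_s)
    return False
```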

Comment on lines 36 to 48
# Now check the success of the job

#Check if query has been set or not.
if [ -z "$QUERY" ];
then
echo "Run without Query"
python3 cloudbuild/python-scripts/parse_logs.py -- --job_id="$JOB_ID" --project_id="$PROJECT_ID" --cluster_name="$CLUSTER_NAME" --region="$REGION" --project_name="$PROJECT_NAME" --dataset_name="$DATASET_NAME" --table_name="$TABLE_NAME"
ret=$?
else
echo "Run Query First"
python3 cloudbuild/python-scripts/parse_logs.py -- --job_id="$JOB_ID" --project_id="$PROJECT_ID" --cluster_name="$CLUSTER_NAME" --region="$REGION" --project_name="$PROJECT_NAME" --dataset_name="$DATASET_NAME" --table_name="$TABLE_NAME" --query="$QUERY"
ret=$?
fi
@jayehwhyehentee (Collaborator) on Dec 20, 2023

No need to explicitly check whether the job runs with or without a query. Since the same parse_logs module is checking the logs, you can send the query arg every time and check inside the module whether the query has a legitimate value, and act accordingly.

In this script, simply put:

# Now check the success of the job
python3 cloudbuild/python-scripts/parse_logs.py -- --job_id="$JOB_ID" --project_id="$PROJECT_ID" --cluster_name="$CLUSTER_NAME" --region="$REGION" --project_name="$PROJECT_NAME" --dataset_name="$DATASET_NAME" --table_name="$TABLE_NAME" --query="$QUERY"
ret=$?
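
On the Python side, the in-module check can treat an empty --query as "no query". A sketch, assuming parse_logs uses absl flags (its UsageError handling suggests absl; the flag wiring below is illustrative):

```python
# Sketch: accept --query unconditionally and branch inside the module.
# An empty string means the job ran without a query. Illustrative only;
# the real parse_logs flag definitions may differ.
from absl import app, flags

FLAGS = flags.FLAGS
flags.DEFINE_string('query', '', 'Optional query the Flink job executed.')


def main(argv):
    if FLAGS.query:
        print('Query provided: validating the query run first.')
    print('Validating the number of records read.')


if __name__ == '__main__':
    app.run(main)
```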

case $STEP in
# Download maven and all the dependencies
init)
$MVN spotless:apply
Collaborator

Not needed. We'll use these scripts on the main branch, which can only have reviewed and merged code.


# Run the small e2e tests
e2e_test_small)
# 1. Run the simple table test.
Collaborator

Please mention bounded as well.

Collaborator

Please revert all changes made in this file. This does not affect our nightly pipeline.

Collaborator Author

Fixed.


def get_bq_query_rows(client_project_name):
client = bigquery.Client(project=client_project_name)
query = 'SELECT count(*) as count FROM `testproject-398714.testing_dataset.largeTable` where EXTRACT(HOUR from ts) = 17 and EXTRACT(DAY from ts) = 17;'
Collaborator

  1. Please don't hard-code the table details.
  2. This line is too long. The maximum line length for readability is 80 characters.
  3. I'd suggest copying this file into cider-v and applying the editor's recommendations for best practices.

Collaborator

Look up f-strings to break a long string with variables into multiple lines.
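
For instance, the hard-coded query above can be assembled from variables and split across adjacent f-string literals, which Python concatenates (a sketch; the variables stand in for values parse_logs already receives as arguments):

```python
# Sketch: split the long query across adjacent f-string literals so each
# line stays under 80 characters. Values stand in for parse_logs arguments.
project_name = 'testproject-398714'
dataset_name = 'testing_dataset'
table_name = 'largeTable'

query = (
    f'SELECT count(*) as count '
    f'FROM `{project_name}.{dataset_name}.{table_name}` '
    f'WHERE EXTRACT(HOUR from ts) = 17 AND EXTRACT(DAY from ts) = 17;'
)
```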

import re


def get_bq_query_rows(client_project_name):
Collaborator

Should be renamed to get_bq_query_result_row_count

Collaborator Author

Done.

query_result = query_job.result()
records_read = 0
for result in query_result:
records_read = result[0]
Collaborator

You're only returning the last result in query_results.
What does each result in query_results look like?
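
For context: query_job.result() returns a RowIterator of bigquery.Row objects, and the COUNT(*) query above yields exactly one row whose first field is the count. A more direct extraction (sketch):

```python
# Sketch: the COUNT(*) query returns exactly one row, so read it directly
# instead of looping and keeping only the last iteration's value.
query_result = query_job.result()  # RowIterator of bigquery.Row objects
row = next(iter(query_result))     # the single result row
records_read = row[0]              # first field: the count
```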

Collaborator Author

Removed this usage.

records_read = result[0]
return records_read

# Remember these are the ones from args.
Collaborator

Remove this comment. Not needed.

Collaborator Author

Done.

return records_read

# Remember these are the ones from args.
def get_bq_table_rows(client_project_name, project_name, dataset_name, table_name, query):
@jayehwhyehentee (Collaborator) on Dec 20, 2023

Rename to get_bq_table_row_count

total_metric_sum_in_blob = 0
# Keep on finding the metric value as there can be
# 1 or more outputs in a log file.
metric_pattern = r'{}\s*(.*?)\s*{}'.format(re.escape(metric_string), re.escape(delimiter))
Collaborator

What does the metric string look like?
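
For illustration, the pattern captures whatever sits between metric_string and the delimiter. The log line below is hypothetical, not the connector's confirmed output format:

```python
# Hypothetical illustration of the pattern above. The exact log line is
# assumed for the example, not taken from real connector output.
import re

metric_string = 'Number of records read: '
delimiter = ';'
metric_pattern = r'{}\s*(.*?)\s*{}'.format(
    re.escape(metric_string), re.escape(delimiter))

log_blob = 'INFO Number of records read: 40000 ; end of task'
matches = re.findall(metric_pattern, log_blob)  # ['40000']
total_metric_sum_in_blob = sum(int(m) for m in matches)
```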

return total_metric_sum_in_blob

def check_query_correctness(logs_as_string):
query_records_pattern = r'\[\s(.*),\s(.*)\s\]'
Collaborator

Please share an example of the actual string we expect to match here
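
As an illustration only (the real log format should be confirmed here), the pattern matches bracketed pairs such as:

```python
# Hypothetical illustration; the '[ hour, day ]' shape is inferred from the
# regex itself, not confirmed against the connector's actual logs.
import re

query_records_pattern = r'\[\s(.*),\s(.*)\s\]'
logs_as_string = 'record read: [ 17, 17 ]'
matches = re.findall(query_records_pattern, logs_as_string)
# matches == [('17', '17')]  -> (hour, day) captured as strings
```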

for match in matches:
hour = match[0].strip()
day = match[1].strip()
if hour !='17' or day !='17':
Collaborator

Please mention that these checks are hard-coded to match the filter query above.
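
A commented version might read as follows (a sketch; the literals mirror the EXTRACT(HOUR from ts) = 17 and EXTRACT(DAY from ts) = 17 filter in the test query, and the failure handling is illustrative):

```python
# Sketch of the suggested documentation. The '17' literals are hard-coded
# to match the test query's filter on EXTRACT(HOUR ...) and EXTRACT(DAY ...).
matches = [(' 17 ', ' 17 ')]  # example output of query_records_pattern
for match in matches:
    hour = match[0].strip()
    day = match[1].strip()
    # Any record outside the hard-coded filter means the query ran wrong.
    if hour != '17' or day != '17':
        raise AssertionError('Record outside the hard-coded test filter')
```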

pom.xml Outdated
@@ -69,6 +69,7 @@ under the License.
<module>flink-connector-bigquery</module>
<module>flink-sql-connector-bigquery</module>
<module>flink-connector-bigquery-examples</module>
<module>flink-connector-bigquery-integration-test</module>
Collaborator

Indentation is off

pom.xml Outdated
@@ -440,6 +441,7 @@ under the License.
<exclude>**/com/google/cloud/flink/bigquery/source/config/*</exclude>
<exclude>**/com/google/cloud/flink/bigquery/table/config/BigQueryConnectorOptions.*</exclude>
<exclude>**/com/google/cloud/flink/bigquery/examples/**</exclude>
<exclude>**/com/google/cloud/flink/bigquery/integration/**</exclude>
Collaborator

Indentation is off

Comment on lines 67 to 80
* The following cases are tested:
* <ol>
* <li>Reading a Simple Table: This test reads a simple table of 40,000 rows having size 900
* KBs.
* <li>Reading a Table with Complex Schema: This test reads a table with 15 levels (maximum
* number of levels allowed by BigQuery). The table contains 100,000 rows and has a size
* of 2.96 MB.
* <li>Reading a Large Table: This test reads a large table. The table contains __ rows and
* has a size of about 200 GBs.
* <li>Reading a Table with Large Row: This test reads a table with a large row. The table
* contains 100 rows each of size 45 MB and has a size of about 450 GB.
* <li>Testing a BigQuery Query Run: This tests a BigQuery Query run. The query filters
* certain rows based on a condition, groups the records and finds the AVG of value of a
* column.
Collaborator

This file is not aware of the tests being executed using its main application. Please remove this part of the description. The nature of the tests should be inferred from nightly.yaml and nightly.sh.

Collaborator Author

Done.

Comment on lines 454 to 461
if len(argv) > len(acceptable_arguments) + 1:
raise app.UsageError(
'[Log: parse_logs ERROR] Too many command-line arguments.'
)
elif len(argv) < len(required_arguments) + 1:
raise app.UsageError(
'[Log: parse_logs ERROR] Too few command-line arguments.'
)
Collaborator

Given that you have a separate validate arguments method, this is not needed. Please remove.

@jayehwhyehentee jayehwhyehentee self-requested a review December 27, 2023 12:45
@jayehwhyehentee (Collaborator)

/gcbrun

- Replaces print with log
- Adds return in case a log file is not found.
- Removed redundant descriptions in log and error messages.
@vishalkarve15 (Collaborator)

/gcbrun

@jayehwhyehentee jayehwhyehentee merged commit 57d6e7a into GoogleCloudDataproc:main Dec 28, 2023
5 checks passed