Adds e2e tests #65
Conversation
This module is similar to the BigQueryExample, with a few changes to count the number of records read and log them.
This test reads a simple table. A shell script and a Python script check the number of records read.
/gcbrun
Comments out CODECOV_TOKEN usage.
/gcbrun
cloudbuild/Dockerfile
# Install Python and Basic Python Tools (Assuming VM does not have them)
RUN apt-get -y install python3 && apt clean
Let's remove the "Assuming" postscript. If the VM doesn't have them, then it's fine to install like this. But if the VM does, then we must remove this superfluous installation.
# We won't run this async as we can wait for a bounded job to succeed or fail.
gcloud dataproc jobs submit flink --id "$JOB_ID" --jar="$GCS_JAR_LOCATION" --cluster="$CLUSTER_NAME" --region="$REGION" -- --gcp-project "$PROJECT_NAME" --bq-dataset "$DATASET_NAME" --bq-table "$TABLE_NAME" --agg-prop "$AGG_PROP_NAME" --query "$QUERY"
# Wait for the logs to be saved.
sleep 20
Aren't logs available as soon as the job ends? If not, how was 20 seconds decided?
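If logs can lag behind job completion, a bounded polling loop would be more robust than a fixed sleep. A minimal sketch in Python, assuming the driver output lands under a known GCS prefix (bucket and prefix here are hypothetical; the real location comes from the job's driver output URI):

import time

from google.cloud import storage


def wait_for_driver_logs(bucket_name, prefix, timeout_secs=120, poll_secs=5):
    # Poll until at least one blob appears under the prefix, i.e. the
    # driver output has started landing, or the deadline passes.
    client = storage.Client()
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        if any(client.list_blobs(bucket_name, prefix=prefix, max_results=1)):
            return True
        time.sleep(poll_secs)
    return False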
# Now check the success of the job

# Check if query has been set or not.
if [ -z "$QUERY" ];
then
  echo "Run without Query"
  python3 cloudbuild/python-scripts/parse_logs.py -- --job_id="$JOB_ID" --project_id="$PROJECT_ID" --cluster_name="$CLUSTER_NAME" --region="$REGION" --project_name="$PROJECT_NAME" --dataset_name="$DATASET_NAME" --table_name="$TABLE_NAME"
  ret=$?
else
  echo "Run Query First"
  python3 cloudbuild/python-scripts/parse_logs.py -- --job_id="$JOB_ID" --project_id="$PROJECT_ID" --cluster_name="$CLUSTER_NAME" --region="$REGION" --project_name="$PROJECT_NAME" --dataset_name="$DATASET_NAME" --table_name="$TABLE_NAME" --query="$QUERY"
  ret=$?
fi
No need to explicitly check whether the job runs with or without a query. Since the same parse_logs module is checking the logs, you can send the query arg every time and check inside the module whether query has a legitimate value, then act accordingly.
In this script, simply put:
# Now check the success of the job
python3 cloudbuild/python-scripts/parse_logs.py -- --job_id="$JOB_ID" --project_id="$PROJECT_ID" --cluster_name="$CLUSTER_NAME" --region="$REGION" --project_name="$PROJECT_NAME" --dataset_name="$DATASET_NAME" --table_name="$TABLE_NAME" --query="$QUERY"
ret=$?
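Inside parse_logs.py, the branch could then live next to the log checks themselves. A minimal sketch, assuming absl flags as used elsewhere in this module; the two check_* helpers are hypothetical stand-ins for the existing log checks:

from absl import app
from absl import flags

FLAGS = flags.FLAGS
flags.DEFINE_string('query', '', 'Optional query used by the test run.')


def check_query_run(query):
    print('checking query run:', query)  # placeholder for the real check


def check_table_read():
    print('checking table read')  # placeholder for the real check


def main(argv):
    del argv  # The real module validates its arguments separately.
    if FLAGS.query:
        # --query was passed with a legitimate value: verify the query run.
        check_query_run(FLAGS.query)
    # The plain table-read check runs either way.
    check_table_read()


if __name__ == '__main__':
    app.run(main)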
cloudbuild/nightly.sh
case $STEP in
  # Download maven and all the dependencies
  init)
    $MVN spotless:apply
Not needed. We'll use these scripts on the main branch, which can only contain reviewed and merged code.
cloudbuild/nightly.sh
# Run the small e2e tests
e2e_test_small)
  # 1. Run the simple table test.
Please mention bounded as well.
Please revert all changes made in this file. This does not affect our nightly pipeline.
Fixed.
def get_bq_query_rows(client_project_name):
    client = bigquery.Client(project=client_project_name)
    query = 'SELECT count(*) as count FROM `testproject-398714.testing_dataset.largeTable` where EXTRACT(HOUR from ts) = 17 and EXTRACT(DAY from ts) = 17;'
- Please don't hard-code the table details.
- This line is too long. The maximum allowed line length for readability is 80 characters.
- I'd suggest copying this file into cider-v and applying the editor's recommendations for best practices.
Look up f-strings to break a long string with variables into multiple lines.
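For example, something along these lines (the variable values are assumed; in practice they come from the parsed arguments):

project_name = 'example-project'  # assumed for illustration
dataset_name = 'example_dataset'
table_name = 'example_table'
# Adjacent f-string literals keep each source line under 80 characters
# while the table details stay parameterized.
query = (
    f'SELECT count(*) AS count '
    f'FROM `{project_name}.{dataset_name}.{table_name}` '
    f'WHERE EXTRACT(HOUR FROM ts) = 17 AND EXTRACT(DAY FROM ts) = 17;'
)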
import re


def get_bq_query_rows(client_project_name):
Should be renamed to get_bq_query_result_row_count
Done.
    query_result = query_job.result()
    records_read = 0
    for result in query_result:
        records_read = result[0]
You're only returning the last result in query_results. What does each result in query_results look like?
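Each result is a Row, and a count(*) query yields exactly one of them, so the loop can go away entirely. A minimal sketch, using the rename suggested above:

from google.cloud import bigquery


def get_bq_query_result_row_count(client_project_name, query):
    client = bigquery.Client(project=client_project_name)
    # count(*) yields exactly one row; read it directly instead of looping
    # and keeping only the last value.
    row = next(iter(client.query(query).result()))
    return row['count']  # 'count' is the alias used in the query above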
Removed this usage.
        records_read = result[0]
    return records_read


# Remember these are the ones from args.
Remove this comment. Not needed.
Done.
    return records_read


# Remember these are the ones from args.
def get_bq_table_rows(client_project_name, project_name, dataset_name, table_name, query):
Rename to get_bq_table_row_count.
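For the plain table-read case, the count could even come from table metadata instead of a query scan. A minimal sketch under the suggested name (argument names are assumed to match the module's args):

from google.cloud import bigquery


def get_bq_table_row_count(client_project_name, project_name, dataset_name,
                           table_name):
    client = bigquery.Client(project=client_project_name)
    # num_rows is served from table metadata, so no query scan is needed.
    table = client.get_table(f'{project_name}.{dataset_name}.{table_name}')
    return table.num_rows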
    total_metric_sum_in_blob = 0
    # Keep on finding the metric value as there can be
    # 1 or more outputs in a log file.
    metric_pattern = r'{}\s*(.*?)\s*{}'.format(re.escape(metric_string), re.escape(delimiter))
What does the metric string look like?
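For illustration only, with a hypothetical log format where the metric is printed as 'Number of records read: <n> ;' (the real metric_string and delimiter come from the caller), the pattern behaves like this:

import re

metric_string = 'Number of records read:'  # hypothetical metric label
delimiter = ';'
logs = 'Number of records read: 25000 ;\nNumber of records read: 15000 ;'
metric_pattern = r'{}\s*(.*?)\s*{}'.format(
    re.escape(metric_string), re.escape(delimiter))
# findall returns every captured value; sum them across the log file.
total = sum(int(value) for value in re.findall(metric_pattern, logs))
assert total == 40000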
    return total_metric_sum_in_blob


def check_query_correctness(logs_as_string):
    query_records_pattern = r'\[\s(.*),\s(.*)\s\]'
Please share an example of the actual string we expect to match here.
    for match in matches:
        hour = match[0].strip()
        day = match[1].strip()
        if hour != '17' or day != '17':
Please mention that these checks are hardcoded to check the filter query from above.
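For reference, the pattern matches bracketed pairs such as '[ 17, 17 ]'; the surrounding log text below is hypothetical, and the expected values mirror the HOUR = 17 / DAY = 17 filter hardcoded above:

import re

query_records_pattern = r'\[\s(.*),\s(.*)\s\]'
sample_log = 'query output record: [ 17, 17 ]'  # hypothetical log line
matches = re.findall(query_records_pattern, sample_log)
assert matches == [('17', '17')]  # (hour, day) pairs from the filter query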
pom.xml
@@ -69,6 +69,7 @@ under the License.
<module>flink-connector-bigquery</module>
<module>flink-sql-connector-bigquery</module>
<module>flink-connector-bigquery-examples</module>
<module>flink-connector-bigquery-integration-test</module>
Indentation is off
pom.xml
@@ -440,6 +441,7 @@ under the License.
<exclude>**/com/google/cloud/flink/bigquery/source/config/*</exclude>
<exclude>**/com/google/cloud/flink/bigquery/table/config/BigQueryConnectorOptions.*</exclude>
<exclude>**/com/google/cloud/flink/bigquery/examples/**</exclude>
<exclude>**/com/google/cloud/flink/bigquery/integration/**</exclude>
Indentation is off
* The following cases are tested:
* <ol>
*   <li>Reading a Simple Table: This test reads a simple table of 40,000 rows having a size
*       of 900 KB.
*   <li>Reading a Table with Complex Schema: This test reads a table with 15 levels (the maximum
*       number of levels allowed by BigQuery). The table contains 100,000 rows and has a size
*       of 2.96 MB.
*   <li>Reading a Large Table: This test reads a large table. The table contains __ rows and
*       has a size of about 200 GB.
*   <li>Reading a Table with Large Rows: This test reads a table with large rows. The table
*       contains 100 rows, each of size 45 MB, and has a size of about 450 GB.
*   <li>Testing a BigQuery Query Run: This tests a BigQuery query run. The query filters
*       certain rows based on a condition, groups the records, and finds the AVG of the values
*       of a column.
This file is not aware of the tests being executed using its main application. Please remove this part of the description. The nature of the tests should be inferred from nightly.yaml and nightly.sh.
Done.
if len(argv) > len(acceptable_arguments) + 1:
    raise app.UsageError(
        '[Log: parse_logs ERROR] Too many command-line arguments.'
    )
elif len(argv) < len(required_arguments) + 1:
    raise app.UsageError(
        '[Log: parse_logs ERROR] Too few command-line arguments.'
    )
Given that you have a separate validate arguments method, this is not needed. Please remove.
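A minimal sketch of what that separate validation might cover (the args dict and helper name are hypothetical; only the shape of the check is the point):

from absl import app


def validate_arguments(args, required_arguments):
    # Collect every required argument that is missing or empty, then fail
    # once with a single UsageError instead of separate length checks.
    missing = [name for name in required_arguments if not args.get(name)]
    if missing:
        raise app.UsageError(
            '[Log: parse_logs ERROR] Missing required arguments: '
            + ', '.join(missing))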
...-test/src/main/java/com/google/cloud/flink/bigquery/integration/BigQueryIntegrationTest.java
/gcbrun
- Replaces print with log.
- Adds a return in case a log file is not found.
- Removes redundant descriptions in log and error messages.
/gcbrun
A simple e2e test to check the read of a small BigQuery table.