issues on new install on kubernetes v1.16.2 #1387

Closed · jecho opened this issue Aug 28, 2020 · 11 comments

jecho commented Aug 28, 2020

I'm not sure if this is directly related to #1122, but I am getting quite a few of these errors post-install in the reporting-operator:

error="failed to store Prometheus metrics into table hive.metering.datasource_metering_persistentvolumeclaim_request_bytes for the range 2020-08-28 10:32:00 +0000 UTC to 2020-08-28 10:37:00 +0000 UTC: failed to store metrics into presto: presto: query failed (200 OK): \"io.prestosql.spi.PrestoException: Failed checking path: {redacted}

So basically ReportDataSource will always be empty. I've used both the s3 and sharedPVC storage types and both yield the same results. Curious if maybe it's my configuration, unless it's something with the current version of Presto.

I'm currently using 4.7. With previous releases I was encountering the Python issue in Presto on startup.

This is my configuration:

apiVersion: metering.openshift.io/v1  # assumed; the apiVersion line was not in the original paste
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  disableOCPFeatures: false
  tls:
    enabled: false
  hive:
    spec:
      securityContext:
        fsGroup: 0
  presto:
    spec:
      securityContext:
        fsGroup: 0
      config:
        connectors:
          prometheus:
            enabled: false
            config:
              uri: "http://prometheus-prometheus-oper-prometheus.monitoring.svc:9090"
        auth:
          enabled: false
        tls:
          enabled: false
      worker:
        replicas: 0
  storage:
    type: "hive"
    hive:
      type: "s3"
      s3:
        bucket: "aBucket/store"
        region: "us-west-2"
        secretName: "your-aws-secret"
        createBucket: false
  reporting-operator:
    spec:
      securityContext:
        fsGroup: 0
      route:
        enabled: false
        name: metering
      authProxy:
        enabled: false
        cookie:
          createSecret: false
          secretName: "aSecret"
        subjectAccessReview:
          enabled: false
        delegateURLs: 
          enabled: false
      config:
        tls:
          api:
            enabled: false
        prometheus:
          certificateAuthority:
            useServiceAccountCA: false
          url: "http://prometheus-prometheus-oper-prometheus.monitoring.svc:9090/"
          metricsImporter:
            auth:
              useServiceAccountToken: true
              tokenSecret:
                enabled: false
            config:
              chunkSize: "5m"
              pollInterval: "5m"
              stepSize: "60s"
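
For completeness, your-aws-secret is laid out like this (key names as I recall them from the metering storage docs, so double-check them against your version; the values are base64-encoded placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: your-aws-secret
  namespace: metering
data:
  # Both values must be base64-encoded.
  aws-access-key-id: "<base64-encoded access key ID>"
  aws-secret-access-key: "<base64-encoded secret access key>"
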
@timflannagan (Contributor) commented:

That MeteringConfig custom resource is a bit more involved than what I would expect, but it seems fine to me at a glance. We override a lot of those fields you've configured (like fsGroup: 0) and disable TLS entirely for our upstream installations under the hood. Unfortunately, all of our dev and test metering installations these days are done on an OpenShift cluster, and we haven't seen this error pop up in Presto yet. What kind of k8s cluster are you trying to run metering on? I might have some time this week to try to reproduce this.
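
For comparison, a pared-down MeteringConfig closer to what we'd expect for an upstream install would look roughly like this (only your storage section kept; the TLS, auth, and fsGroup overrides dropped since we set those under the hood). Treat it as a sketch rather than the exact config we test with:

apiVersion: metering.openshift.io/v1
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  storage:
    type: "hive"
    hive:
      type: "s3"
      s3:
        bucket: "aBucket/store"
        region: "us-west-2"
        secretName: "your-aws-secret"
        createBucket: false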

jecho commented Sep 1, 2020

Running on bare metal:

  • kubeadm=1.16.2
  • distro=ubuntu
  • distro-version=18.04/bionic
  • cni=calico

jecho commented Sep 1, 2020

For peace of mind, I tried deploying on GKE 1.16.3 using their default CNI, and everything is working fine... I don't know too much about Hive, Presto, or Apache Thrift, but I'm wondering if any of the components utilize multicasting/broadcasting.

If that's the case, then I potentially need to swap out the CNI on our end.

jecho commented Sep 1, 2020

Would NFS storage for hive-metastore-db-data be a problem?

@timflannagan (Contributor) commented:

> Would NFS storage for hive-metastore-db-data be a problem?

In theory it should be fine, but OpenShift as a whole has avoided recommending NFS as a storage backend for any of the components, and we've followed suit on that. I really haven't done any load testing with NFS as a storage backend, but no problem has arisen in our e2e suite or local tests that would indicate it can't be used for metering.
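
If you do go that route, a minimal sketch of an NFS-backed PersistentVolume that the hive-metastore-db-data PVC could bind to would be something like the following (the server, export path, and storage class name are placeholders; match the capacity and class to whatever your MeteringConfig requests):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hive-metastore-db-data-nfs
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs               # must match the class the metastore PVC asks for
  nfs:
    server: nfs.example.internal      # placeholder NFS server
    path: /exports/hive-metastore     # placeholder export path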

jecho commented Sep 2, 2020

> Would NFS storage for hive-metastore-db-data be a problem?

> In theory it should be fine, but OpenShift as a whole has avoided recommending NFS as a storage backend for any of the components, and we've followed suit on that. I really haven't done any load testing with NFS as a storage backend, but no problem has arisen in our e2e suite or local tests that would indicate it can't be used for metering.

Cool. I tried swapping out Derby for Postgres but I'm unsure what version to use; I tried the Bitnami chart using tags 9.6.19-debian-10-r14 and 11.9.0-debian-10-r1, but it seems to hang after creating the first two entries.

Taken from hive-metastore:

20/09/02 03:15:17 [pool-10-thread-122]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.122	cmd=source:172.16.0.122 create_table: Table(tableName:datasource_metering_pod_request_memory_bytes, dbName:metering, owner:hadoop, createTime:1599016517, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:amount, type:double, comment:null), FieldSchema(name:timestamp, type:timestamp, comment:null), FieldSchema(name:timeprecision, type:double, comment:null), FieldSchema(name:labels, type:map<string,string>, comment:null)], location:null, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dt, type:string, comment:null)], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{hadoop=[PrivilegeGrantInfo(privilege:INSERT, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true), PrivilegeGrantInfo(privilege:SELECT, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true), PrivilegeGrantInfo(privilege:UPDATE, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true), PrivilegeGrantInfo(privilege:DELETE, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true)]}, groupPrivileges:null, rolePrivileges:null), temporary:false)	
20/09/02 03:15:18 [pool-10-thread-122]: INFO common.FileUtils: Creating directory if it doesn't exist: s3a://prod-rndc/reporting/metering.db/datasource_metering_pod_request_memory_bytes
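
For reference, the Derby-to-Postgres swap was done with a MeteringConfig override along these lines. The field names under hive.spec.config.db are written from memory of the metering docs, and the connection details are placeholders, so treat this as a sketch rather than a verified config:

apiVersion: metering.openshift.io/v1
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  hive:
    spec:
      config:
        db:
          # Point the Hive metastore at an external Postgres instead of embedded Derby.
          url: "jdbc:postgresql://postgresql.metering.svc:5432/metastore"
          driver: "org.postgresql.Driver"
          username: "hive"
          password: "REPLACEME"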

@timflannagan (Contributor) commented:

@jecho Yeah, unfortunately that's a known error on our side. It seems like the JDBC driver we use in the Hive image is out-of-date, and I don't know off-hand what version of PostgreSQL will work out of the box. It looks like it's hanging on the create table call we're making internally. Can you try deleting the hive-server-0 Pod and see if that changes anything?

jecho commented Sep 2, 2020

@timflannagan1 After kicking the hive-server, hive-metastore seems to try to create the next two tables and then settles?

hive-metastore

20/09/02 18:57:37 [pool-10-thread-19]: INFO common.FileUtils: Creating directory if it doesn't exist: s3a://prod-rndc/reporting/metering.db/datasource_metering_pod_persistentvolumeclaim_request_info
20/09/02 18:58:57 [pool-10-thread-7]: INFO metastore.HiveMetaStore: 1: Cleaning up thread local RawStore...
20/09/02 18:58:57 [pool-10-thread-7]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.87	cmd=Cleaning up thread local RawStore...	
20/09/02 18:58:57 [pool-10-thread-7]: INFO metastore.HiveMetaStore: 1: Done cleaning up thread local RawStore
20/09/02 18:58:57 [pool-10-thread-7]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.87	cmd=Done cleaning up thread local RawStore	
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: source:172.16.0.110 get_all_functions
20/09/02 18:59:26 [pool-10-thread-28]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.110	cmd=source:172.16.0.110 get_all_functions	
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.ObjectStore: ObjectStore, initialize called
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is POSTGRES
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.ObjectStore: Initialized ObjectStore
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: source:172.16.0.110 get_all_databases
20/09/02 18:59:26 [pool-10-thread-28]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.110	cmd=source:172.16.0.110 get_all_databases	
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: source:172.16.0.110 get_all_tables: db=default
20/09/02 18:59:26 [pool-10-thread-28]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.110	cmd=source:172.16.0.110 get_all_tables: db=default

reporting-operator pod

time="2020-09-02T19:01:55Z" level=error msg="error syncing ReportDataSource \"metering/persistentvolumeclaim-capacity-bytes\", adding back to queue" ReportDataSource=metering/persistentvolumeclaim-capacity-bytes app=metering component=reportDataSourceWorker error="error creating table for ReportDataSource persistentvolumeclaim-capacity-bytes: timed out waiting for Hive table to be created" logID=lWjB8gaK2x
time="2020-09-02T19:01:55Z" level=error msg="error syncing ReportDataSource \"metering/node-capacity-cpu-cores\", adding back to queue" ReportDataSource=metering/node-capacity-cpu-cores app=metering component=reportDataSourceWorker error="error creating table for ReportDataSource node-capacity-cpu-cores: timed out waiting for Hive table to be created" logID=Axc8mahShu
time="2020-09-02T19:02:19Z" level=info msg="\"GET http://172.16.0.103:8080/ready HTTP/1.1\" from 105.1.45.12:37868 - 200 30B in 192.606794ms" app=metering component=api
time="2020-09-02T19:02:47Z" level=info msg="\"GET http://172.16.0.103:8080/healthy HTTP/1.1\" from 105.1.45.12:38162 - 200 30B in 139.561526ms" app=metering component=api
time="2020-09-02T19:03:18Z" level=error msg="error syncing HiveTable reportdatasource-metering-pod-limit-memory-bytes" app=metering component=hiveTableWorker error="couldn't create table datasource_metering_pod_limit_memory_bytes in Hive: dial tcp 10.98.121.60:10000: connect: connection timed out" hiveTable=reportdatasource-metering-pod-limit-memory-bytes logID=4KY2apLEwW namespace=metering
time="2020-09-02T19:03:18Z" level=error msg="error syncing HiveTable \"metering/reportdatasource-metering-pod-limit-memory-bytes\", adding back to queue" HiveTable=metering/reportdatasource-metering-pod-limit-memory-bytes app=metering component=hiveTableWorker error="couldn't create table datasource_metering_pod_limit_memory_bytes in Hive: dial tcp 10.98.121.60:10000: connect: connection timed out" logID=4KY2apLEwW

@timflannagan (Contributor) commented:

Sorry for the slow response - do you still have the hive server logs? We have someone on the team looking into the Postgres bug now.

jecho commented Sep 10, 2020

I do not. I can redeploy and grab the log and stack trace for this Postgres bug.

I did end up trying to deploy with MySQL at one point, however. The current 4.7 container images for Hive don't seem to have the CLASSPATH set up properly for the driver, so I did something funky where I pinned hive and presto to the 4.5 tag with everything else set to 4.7 (I know, not really reasonable; I was trying to avoid rebuilding an image). Anyway, as expected, everything seemed to work on GKE, but within our environment I'm still receiving the error that started this thread.

time="2020-09-09T20:35:10Z" level=error msg="error syncing ReportDataSource \"metering/node-allocatable-cpu-cores\", adding back to queue" ReportDataSource=metering/node-allocatable-cpu-cores app=metering component=reportDataSourceWorker error="ImportFromLastTimestamp errored: failed to store Prometheus metrics into table hive.metering.datasource_metering_node_allocatable_cpu_cores for the range 2020-09-09 18:35:00 +0000 UTC to 2020-09-09 18:40:00 +0000 UTC: failed to store metrics into presto: presto: query failed (200 OK): \"io.prestosql.spi.PrestoException: Failed checking path: s3a://..

Let me know if I should provide anything else, such as a kubeadm dump or whatnot.

And thanks, guys!

@timflannagan (Contributor) commented:

@jecho I would stay away from the 4.6/4.7 tags (as we're stuck on OpenShift's release cycle, at least for now, and those tags are mirrored until the 4.6 release reaches GA), since we recently had to migrate everything to RHEL8 base images, which is causing some issues, like the Postgres JDBC driver not currently being configured in the classpath, as you pointed out. I've been pretty bogged down this week wrapping up other work, so I still need to spin up a non-OpenShift cluster installation and try to reproduce this error. I imagine it's permissions-related, but I haven't dug into that outstanding Presto issue either.

jecho closed this as completed Jan 7, 2021