issues on new install on kubernetes v1.16.2 #1387

Closed · jecho opened this issue Aug 28, 2020 · 11 comments

jecho commented Aug 28, 2020

I'm not sure if this is directly related to #1122, but I am getting quite a few of these errors post-install in the reporting-operator:

error="failed to store Prometheus metrics into table hive.metering.datasource_metering_persistentvolumeclaim_request_bytes for the range 2020-08-28 10:32:00 +0000 UTC to 2020-08-28 10:37:00 +0000 UTC: failed to store metrics into presto: presto: query failed (200 OK): \"io.prestosql.spi.PrestoException: Failed checking path: {redacted}

So basically ReportDataSource will always be empty. I've used both the s3 and sharedPVC storage types and both yield the same results. Curious if maybe it's my configuration, unless it's something with the current version of Presto.

I'm currently using 4.7. With previous releases I was encountering the Python issue in Presto on startup.

This is my configuration:

apiVersion: metering.openshift.io/v1  # assumed; the apiVersion line was not in the original paste
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  disableOCPFeatures: false
  tls:
    enabled: false
  hive:
    spec:
      securityContext:
        fsGroup: 0
  presto:
    spec:
      securityContext:
        fsGroup: 0
      config:
        connectors:
          prometheus:
            enabled: false
            config:
              uri: "http://prometheus-prometheus-oper-prometheus.monitoring.svc:9090"
        auth:
          enabled: false
        tls:
          enabled: false
      worker:
        replicas: 0
  storage:
    type: "hive"
    hive:
      type: "s3"
      s3:
        bucket: "aBucket/store"
        region: "us-west-2"
        secretName: "your-aws-secret"
        createBucket: false
  reporting-operator:
    spec:
      securityContext:
        fsGroup: 0
      route:
        enabled: false
        name: metering
      authProxy:
        enabled: false
        cookie:
          createSecret: false
          secretName: "aSecret"
        subjectAccessReview:
          enabled: false
        delegateURLs: 
          enabled: false
      config:
        tls:
          api:
            enabled: false
        prometheus:
          certificateAuthority:
            useServiceAccountCA: false
          url: "http://prometheus-prometheus-oper-prometheus.monitoring.svc:9090/"
          metricsImporter:
            auth:
              useServiceAccountToken: true
              tokenSecret:
                enabled: false
            config:
              chunkSize: "5m"
              pollInterval: "5m"
              stepSize: "60s"
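
For completeness, your-aws-secret is laid out like this (key names as I recall them from the metering storage docs, so double-check them against your version; the values are base64-encoded placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: your-aws-secret
  namespace: metering
data:
  # Both values must be base64-encoded.
  aws-access-key-id: "<base64-encoded access key ID>"
  aws-secret-access-key: "<base64-encoded secret access key>"
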
@timflannagan (Contributor) commented:

That MeteringConfig custom resource is a bit more involved than what I would expect, but it seems fine to me at a glance. We override a lot of those fields you've configured (like fsGroup: 0) and disable TLS entirely for our upstream installations under the hood. Unfortunately, all of our dev and test metering installations these days are done on an OpenShift cluster, and we haven't seen this error pop up in Presto yet. What kind of k8s cluster are you trying to run metering on? I might have some time this week to try to reproduce this.
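
For comparison, a pared-down MeteringConfig closer to what we'd expect for an upstream install would look roughly like this (only your storage section kept; the TLS, auth, and fsGroup overrides dropped since we set those under the hood). Treat it as a sketch rather than the exact config we test with:

apiVersion: metering.openshift.io/v1
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  storage:
    type: "hive"
    hive:
      type: "s3"
      s3:
        bucket: "aBucket/store"
        region: "us-west-2"
        secretName: "your-aws-secret"
        createBucket: false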

jecho commented Sep 1, 2020

Running on bare metal:

  • kubeadm=1.16.2
  • distro=ubuntu
  • distro-version=18.04/bionic
  • cni=calico

jecho commented Sep 1, 2020

For peace of mind, I tried deploying on GKE 1.16.3 using their default CNI, and everything is working fine... I don't know too much about Hive, Presto, or Apache Thrift, but I'm wondering if any of the components utilize multicasting/broadcasting.

If that's the case, then I potentially need to swap out the CNI on our end.

jecho commented Sep 1, 2020

Would NFS storage for hive-metastore-db-data be a problem?

@timflannagan (Contributor) commented:

> Would NFS storage for hive-metastore-db-data be a problem?

In theory it should be fine, but OpenShift as a whole has avoided recommending NFS as a storage backend for any of the components, and we've followed suit on that. I really haven't done any load testing with NFS as a storage backend, but no problem has arisen in our e2e suite or local tests that would indicate it can't be used for metering.
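
If you do go that route, a minimal sketch of an NFS-backed PersistentVolume that the hive-metastore-db-data PVC could bind to would be something like the following (the server, export path, and storage class name are placeholders; match the capacity and class to whatever your MeteringConfig requests):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hive-metastore-db-data-nfs
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs               # must match the class the metastore PVC asks for
  nfs:
    server: nfs.example.internal      # placeholder NFS server
    path: /exports/hive-metastore     # placeholder export path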

jecho commented Sep 2, 2020

> Would NFS storage for hive-metastore-db-data be a problem?

> In theory it should be fine, but OpenShift as a whole has avoided recommending NFS as a storage backend for any of the components, and we've followed suit on that. I really haven't done any load testing with NFS as a storage backend, but no problem has arisen in our e2e suite or local tests that would indicate it can't be used for metering.

Cool. I tried swapping out Derby for Postgres but I'm unsure what version to use; I tried the Bitnami chart using tags 9.6.19-debian-10-r14 and 11.9.0-debian-10-r1, but it seems to hang after creating the first two entries.

Taken from hive-metastore:

20/09/02 03:15:17 [pool-10-thread-122]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.122	cmd=source:172.16.0.122 create_table: Table(tableName:datasource_metering_pod_request_memory_bytes, dbName:metering, owner:hadoop, createTime:1599016517, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:amount, type:double, comment:null), FieldSchema(name:timestamp, type:timestamp, comment:null), FieldSchema(name:timeprecision, type:double, comment:null), FieldSchema(name:labels, type:map<string,string>, comment:null)], location:null, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dt, type:string, comment:null)], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{hadoop=[PrivilegeGrantInfo(privilege:INSERT, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true), PrivilegeGrantInfo(privilege:SELECT, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true), PrivilegeGrantInfo(privilege:UPDATE, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true), PrivilegeGrantInfo(privilege:DELETE, createTime:-1, grantor:hadoop, grantorType:USER, grantOption:true)]}, groupPrivileges:null, rolePrivileges:null), temporary:false)	
20/09/02 03:15:18 [pool-10-thread-122]: INFO common.FileUtils: Creating directory if it doesn't exist: s3a://prod-rndc/reporting/metering.db/datasource_metering_pod_request_memory_bytes
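
For reference, the Derby-to-Postgres swap was done with a MeteringConfig override along these lines. The field names under hive.spec.config.db are written from memory of the metering docs, and the connection details are placeholders, so treat this as a sketch rather than a verified config:

apiVersion: metering.openshift.io/v1
kind: MeteringConfig
metadata:
  name: "operator-metering"
spec:
  hive:
    spec:
      config:
        db:
          # Point the Hive metastore at an external Postgres instead of embedded Derby.
          url: "jdbc:postgresql://postgresql.metering.svc:5432/metastore"
          driver: "org.postgresql.Driver"
          username: "hive"
          password: "REPLACEME"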

@timflannagan (Contributor) commented:

@jecho Yeah, unfortunately that's a known error on our side. It seems like the JDBC driver we use in the Hive image is out-of-date, and I don't know off-hand what version of PostgreSQL will work out of the box. It looks like it's hanging on the create table call we're making internally. Can you try deleting the hive-server-0 Pod and see if that changes anything?

jecho commented Sep 2, 2020

@timflannagan1 After kicking the hive-server, hive-metastore seems to try to create the next two tables and then settles?

hive-metastore

20/09/02 18:57:37 [pool-10-thread-19]: INFO common.FileUtils: Creating directory if it doesn't exist: s3a://prod-rndc/reporting/metering.db/datasource_metering_pod_persistentvolumeclaim_request_info
20/09/02 18:58:57 [pool-10-thread-7]: INFO metastore.HiveMetaStore: 1: Cleaning up thread local RawStore...
20/09/02 18:58:57 [pool-10-thread-7]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.87	cmd=Cleaning up thread local RawStore...	
20/09/02 18:58:57 [pool-10-thread-7]: INFO metastore.HiveMetaStore: 1: Done cleaning up thread local RawStore
20/09/02 18:58:57 [pool-10-thread-7]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.87	cmd=Done cleaning up thread local RawStore	
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: source:172.16.0.110 get_all_functions
20/09/02 18:59:26 [pool-10-thread-28]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.110	cmd=source:172.16.0.110 get_all_functions	
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.ObjectStore: ObjectStore, initialize called
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is POSTGRES
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.ObjectStore: Initialized ObjectStore
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: source:172.16.0.110 get_all_databases
20/09/02 18:59:26 [pool-10-thread-28]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.110	cmd=source:172.16.0.110 get_all_databases	
20/09/02 18:59:26 [pool-10-thread-28]: INFO metastore.HiveMetaStore: 4: source:172.16.0.110 get_all_tables: db=default
20/09/02 18:59:26 [pool-10-thread-28]: INFO HiveMetaStore.audit: ugi=hadoop	ip=172.16.0.110	cmd=source:172.16.0.110 get_all_tables: db=default

reporting-operator pod

time="2020-09-02T19:01:55Z" level=error msg="error syncing ReportDataSource \"metering/persistentvolumeclaim-capacity-bytes\", adding back to queue" ReportDataSource=metering/persistentvolumeclaim-capacity-bytes app=metering component=reportDataSourceWorker error="error creating table for ReportDataSource persistentvolumeclaim-capacity-bytes: timed out waiting for Hive table to be created" logID=lWjB8gaK2x
time="2020-09-02T19:01:55Z" level=error msg="error syncing ReportDataSource \"metering/node-capacity-cpu-cores\", adding back to queue" ReportDataSource=metering/node-capacity-cpu-cores app=metering component=reportDataSourceWorker error="error creating table for ReportDataSource node-capacity-cpu-cores: timed out waiting for Hive table to be created" logID=Axc8mahShu
time="2020-09-02T19:02:19Z" level=info msg="\"GET http://172.16.0.103:8080/ready HTTP/1.1\" from 105.1.45.12:37868 - 200 30B in 192.606794ms" app=metering component=api
time="2020-09-02T19:02:47Z" level=info msg="\"GET http://172.16.0.103:8080/healthy HTTP/1.1\" from 105.1.45.12:38162 - 200 30B in 139.561526ms" app=metering component=api
time="2020-09-02T19:03:18Z" level=error msg="error syncing HiveTable reportdatasource-metering-pod-limit-memory-bytes" app=metering component=hiveTableWorker error="couldn't create table datasource_metering_pod_limit_memory_bytes in Hive: dial tcp 10.98.121.60:10000: connect: connection timed out" hiveTable=reportdatasource-metering-pod-limit-memory-bytes logID=4KY2apLEwW namespace=metering
time="2020-09-02T19:03:18Z" level=error msg="error syncing HiveTable \"metering/reportdatasource-metering-pod-limit-memory-bytes\", adding back to queue" HiveTable=metering/reportdatasource-metering-pod-limit-memory-bytes app=metering component=hiveTableWorker error="couldn't create table datasource_metering_pod_limit_memory_bytes in Hive: dial tcp 10.98.121.60:10000: connect: connection timed out" logID=4KY2apLEwW

@timflannagan (Contributor) commented:

Sorry for the slow response - do you still have the hive server logs? We have someone on the team looking into the Postgres bug now.

jecho commented Sep 10, 2020

I do not. I can redeploy and grab the log and stack trace for this Postgres bug.

I did end up trying to deploy with MySQL at one point, however. The current 4.7 container images for Hive don't seem to have the CLASSPATH set up properly for the driver, so I did something funky where I pinned hive and presto to the 4.5 tag with everything else set to 4.7 (I know, not really reasonable; I was trying to avoid rebuilding an image). Anyway, as expected, everything seemed to work on GKE, but within our environment I'm still receiving the error that started this thread.

time="2020-09-09T20:35:10Z" level=error msg="error syncing ReportDataSource \"metering/node-allocatable-cpu-cores\", adding back to queue" ReportDataSource=metering/node-allocatable-cpu-cores app=metering component=reportDataSourceWorker error="ImportFromLastTimestamp errored: failed to store Prometheus metrics into table hive.metering.datasource_metering_node_allocatable_cpu_cores for the range 2020-09-09 18:35:00 +0000 UTC to 2020-09-09 18:40:00 +0000 UTC: failed to store metrics into presto: presto: query failed (200 OK): \"io.prestosql.spi.PrestoException: Failed checking path: s3a://..

Let me know if I should provide anything else, such as a kubeadm dump or whatnot.

And thanks, guys!

@timflannagan (Contributor) commented:

@jecho I would stay away from the 4.6/4.7 tags (as we're stuck on OpenShift's release cycle, at least for now, and those tags are mirrored until the 4.6 release reaches GA), since we recently had to migrate everything to RHEL8 base images, which is causing some issues, like the Postgres JDBC driver not currently being configured in the classpath, as you pointed out. I've been pretty bogged down this week wrapping up other work, so I still need to spin up a non-OpenShift cluster installation and try to reproduce this error. I imagine it's permissions-related, but I haven't dug into that outstanding Presto issue either.

jecho closed this as completed Jan 7, 2021