Cloud Shell Commands for Google's Cloud Managed Hadoop Service (DataProc)

Dataproc is Google's Cloud Managed Hadoop/Spark Service. You can use cloud shell commands to quickly create a hadoop cluster.

I provide the cloud shell commands to create a Dataproc Hadoop Cluster and to also launch TeraGen, TeraSort, and TeraValidate jobs. These jobs are good for benchmarking your cluster.
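The TeraGen → TeraSort → TeraValidate sequence can be sketched as below. The cluster name benchmark-cluster, the region us-central1, and bucket-name are placeholders, not values from this repo; the Hadoop examples jar path is the one shipped on Dataproc images.

```shell
# Placeholders: benchmark-cluster, us-central1, bucket-name.
# The MapReduce examples jar is preinstalled on Dataproc cluster nodes.
EXAMPLES_JAR=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

# TeraGen: generate 10 billion 100-byte rows (~1 TB) into the bucket.
gcloud dataproc jobs submit hadoop \
  --cluster=benchmark-cluster --region=us-central1 \
  --jar=$EXAMPLES_JAR \
  -- teragen 10000000000 gs://bucket-name/teragen-output

# TeraSort: sort the generated data.
gcloud dataproc jobs submit hadoop \
  --cluster=benchmark-cluster --region=us-central1 \
  --jar=$EXAMPLES_JAR \
  -- terasort gs://bucket-name/teragen-output gs://bucket-name/terasort-output

# TeraValidate: verify the sorted output is globally ordered.
gcloud dataproc jobs submit hadoop \
  --cluster=benchmark-cluster --region=us-central1 \
  --jar=$EXAMPLES_JAR \
  -- teravalidate gs://bucket-name/terasort-output gs://bucket-name/teravalidate-output
```

Each job's elapsed time is reported in the Dataproc jobs console, which is what you compare across storage backends.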

Replace project-name with your Google Cloud project and bucket-name with your Google Storage bucket name. You need a storage bucket to store data; if you already have a Google Storage bucket, skip the gsutil command.
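A minimal setup sketch, using the placeholder names above plus an assumed region (us-central1) and illustrative machine types and worker count; size the cluster to your benchmark.

```shell
# Point gcloud at your project (project-name is a placeholder).
gcloud config set project project-name

# Create the storage bucket -- skip if you already have one.
gsutil mb -l us-central1 gs://bucket-name/

# Create a Dataproc cluster; machine types and worker count are
# illustrative only, not a recommended benchmark configuration.
gcloud dataproc clusters create benchmark-cluster \
  --region=us-central1 \
  --master-machine-type=n1-standard-8 \
  --num-workers=4 \
  --worker-machine-type=n1-standard-8
```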

Note: The default quota is 2400 CPUs, so if you need a cluster that is larger than 2400 CPUs, send a request to Google to increase your quota limit.

If this is for testing purposes, don't forget to delete your cluster when you're done with: gcloud dataproc clusters delete cluster-name.

I also created an HDP 3.1 cluster on the same number and type of instances used for Dataproc, but instead of using Google's object storage I used Dell Technologies Cloud OneFS HDFS storage, and stored the results in the attached spreadsheet. It was impressive to see Cloud OneFS outperform Google Cloud Storage, which operates at massive scale. Notably, the Cloud OneFS system tested had only 56 nodes, and only half of its network connections were enabled at the time of testing.

Regards,

boni bruno
