Overview of Big Data Services

Introduction to Cloud Pub/Sub 1

  • Message queue service
  • Asynchronous
  • Topic: Structure for storing messages
  • Subscription: Enables application to read messages from a topic

Introduction to Cloud Pub/Sub 2

  • Create topic
    • Topic id
  • Create subscription
    • Subscription id
    • Delivery type (pull, push)
    • Expiration period (an inactive subscription expires after this time)
    • Ack deadline
    • Message retention duration
  • Publish message
  • Create a snapshot
    • Saves the message acknowledgment state of a subscription, so it can be restored later
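
The interplay of topic, pull subscription, and ack deadline can be illustrated with a small in-memory model. This is a toy sketch in plain Python, not the Pub/Sub client library: the class names Topic and Subscription, the pull/ack methods, and the simulated clock argument `now` are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """Toy pull subscription (hypothetical, not the real client API)."""
    ack_deadline: float = 10.0                      # seconds a pulled message stays leased
    pending: deque = field(default_factory=deque)   # messages waiting to be delivered
    leased: dict = field(default_factory=dict)      # ack_id -> (message, lease expiry)
    next_ack_id: int = 0

    def pull(self, now):
        """Return (ack_id, message), or None if nothing is deliverable at time `now`."""
        # Messages whose lease expired without an ack go back in the queue.
        for ack_id, (message, expiry) in list(self.leased.items()):
            if expiry <= now:
                del self.leased[ack_id]
                self.pending.append(message)
        if not self.pending:
            return None
        message = self.pending.popleft()
        ack_id = self.next_ack_id
        self.next_ack_id += 1
        self.leased[ack_id] = (message, now + self.ack_deadline)
        return ack_id, message

    def ack(self, ack_id):
        """Acknowledge a message so it is never redelivered."""
        self.leased.pop(ack_id, None)


@dataclass
class Topic:
    """Toy topic: fans each published message out to every subscription."""
    subscriptions: dict = field(default_factory=dict)

    def subscribe(self, subscription_id, ack_deadline=10.0):
        sub = Subscription(ack_deadline=ack_deadline)
        self.subscriptions[subscription_id] = sub
        return sub

    def publish(self, message):
        for sub in self.subscriptions.values():
            sub.pending.append(message)
```

In this model, a message that is pulled but not acknowledged before the ack deadline becomes available again on a later pull, which is the behaviour the ack deadline setting controls. Message retention, push delivery, and snapshots are left out to keep the sketch short.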

Introduction to Cloud Dataproc

  • Managed service that runs Hadoop clusters for us
  • Create cluster
    • Cluster mode (number of master and worker nodes)
    • Master machine type
    • Master primary disk size/type
    • Worker machine type (vertical scaling)
    • Worker primary disk size/type
    • Number of worker nodes (horizontal scaling)
    • Preemptible worker nodes
  • Create jobs
    • Job type (Hadoop, Spark, PySpark, Hive, Pig, etc.)

Introduction to Cloud Dataflow

  • Stream and batch processing framework based on Apache Beam
  • Runs jobs written in Java or Python
  • ETL
  • Temporary files are written to Cloud Storage (GCS)
  • Serverless
  • A pipeline is represented as a directed acyclic graph (DAG)
  • Lots of templates
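
As an illustration of the kind of stream transform such a pipeline performs (for example, the per-minute summary statistics asked about in the quiz), here is a minimal sketch of fixed one-minute windowing. It is plain Python over a finite list, not the Apache Beam API; the function name summarize_per_minute and the (timestamp_seconds, value) input shape are assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize_per_minute(readings):
    """Group (timestamp_seconds, value) pairs into fixed one-minute windows.

    Returns {window_start_seconds: {"count", "min", "max", "mean"}},
    ordered by window start.
    """
    windows = defaultdict(list)
    for timestamp, value in readings:
        # Start of the minute that contains this timestamp.
        window_start = int(timestamp // 60) * 60
        windows[window_start].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for start, values in sorted(windows.items())
    }
```

A real streaming job would emit each window as it closes and would have to deal with late or out-of-order data; the sketch only shows the grouping and aggregation step.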

Introduction to Cloud Transfer

  • Used for transferring data from a source bucket to a GCP bucket
  • Transfer jobs can also be scheduled
  • The source can be AWS S3, for example; the destination is always GCP
  • Extra filters
  • Overwrite policy
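
How filters and an overwrite policy combine to decide what a transfer run copies can be sketched as follows. This is a hypothetical model in plain Python, not the transfer service's API: the function plan_transfer, the prefix-based filter, and the policy names "always", "never", and "if_different" are assumptions chosen for illustration.

```python
def plan_transfer(source, destination, include_prefixes=(), overwrite="if_different"):
    """Return the sorted object names that a transfer run would copy.

    source, destination: dicts mapping object name -> content checksum.
    include_prefixes: if non-empty, only names starting with one of these are considered.
    overwrite: "always", "never", or "if_different" (hypothetical policy names).
    """
    if overwrite not in {"always", "never", "if_different"}:
        raise ValueError(f"unknown overwrite policy: {overwrite}")
    to_copy = []
    for name, checksum in source.items():
        if include_prefixes and not name.startswith(tuple(include_prefixes)):
            continue  # excluded by the filter
        if name not in destination:
            to_copy.append(name)  # new objects are copied under every policy
        elif overwrite == "always":
            to_copy.append(name)
        elif overwrite == "if_different" and destination[name] != checksum:
            to_copy.append(name)
    return sorted(to_copy)
```

For example, with a source holding "logs/a", "logs/b", and "tmp/c", a filter of ["logs/"] and the "if_different" policy, only the "logs/" objects whose checksum differs from the destination copy are selected.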

Introduction to BigQuery

  • Analytics and data warehousing database
  • Uses SQL, but is not a relational database
  • Command-line tool: bq
  • Data transfer service

Quiz

  • What service would you use to transfer 30 TB of data to Cloud Storage from your on-premises data center?

    Transfer Appliance (not the transfer service)

  • What service would you use to process an IoT stream of time series data and create summary statistics for each minute of data?

    Cloud Dataflow

  • What kind of subscription would you create on a Cloud Pub/Sub topic if you want the program processing the messages to control when the message is read?

    Pull subscription

  • Your manager would like to stop managing a Hadoop cluster after migrating to GCP. What service would you recommend to replace a self-managed Hadoop cluster?

    Dataproc