Analyze your data

This document in the Google Cloud Architecture Framework explains some of the core principles and best practices for data analytics in Google Cloud. You learn about some of the key data-analytics services, and how they can help at the various stages of the data lifecycle. These best practices help you to meet your data analytics needs and create your system design.

Core principles

Businesses want to analyze data and generate actionable insights from that data. Google Cloud provides you with various services that help you through the entire data lifecycle, from data ingestion through reports and visualization. Most of these services are fully managed, and some are serverless. You can also build and manage a data-analytics environment on Compute Engine VMs, for example, to self-host Apache Hadoop or Apache Beam.

Your particular focus, team expertise, and strategic outlook help you to determine which Google Cloud services you adopt to support your data analytics needs. For example, Dataflow lets you write complex transformations in a serverless approach, but you rely on its opinionated configuration for your compute and processing needs. Alternatively, Dataproc lets you run the same transformations, but you manage the clusters and fine-tune the jobs yourself.

In your system design, think about which processing strategy your teams use, such as extract, transform, load (ETL) or extract, load, transform (ELT). Your system design should also consider whether you need to process batch analytics or streaming analytics. Google Cloud provides a unified data platform, and it lets you build a data lake or a data warehouse to meet your business needs.
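
For example, in an ELT approach you might load raw data into BigQuery first and then run the transformations as SQL inside the warehouse. The following sketch illustrates that pattern with the BigQuery Python client library; the project, dataset, and table names are placeholders, not part of any real system.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # ELT: the raw data is already loaded into a raw dataset; transform it with
    # SQL inside BigQuery instead of transforming it before loading.
    sql = """
    CREATE OR REPLACE TABLE analytics.orders_curated AS
    SELECT
      order_id,
      TIMESTAMP(order_ts) AS order_ts,
      total_amount
    FROM raw.orders
    WHERE total_amount IS NOT NULL
    """
    client.query(sql).result()  # wait for the transformation job to finish

In an ETL approach, the same transformation would instead run in a tool such as Dataflow or Dataproc before the data lands in the warehouse.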

Key services

The following list provides a high-level overview of Google Cloud analytics services:

  • Pub/Sub: Simple, reliable, and scalable foundation for stream analytics and event-driven computing systems.
  • Dataflow: A fully managed service to transform and enrich data in stream (real time) and batch (historical) modes.
  • Dataprep by Trifacta: Intelligent data service to visually explore, clean, and prepare structured and unstructured data for analysis.
  • Datalab: Interactive tool to explore, analyze, transform, and visualize data and build machine-learning models on Google Cloud.
  • Dataproc: Fast, easy-to-use, and fully managed cloud service to run Apache Spark and Apache Hadoop clusters.
  • Cloud Data Fusion: Fully managed data integration service that's built for the cloud and lets you build and manage ETL/ELT data pipelines. Cloud Data Fusion provides a graphical interface and a broad open source library of preconfigured connectors and transformations.
  • BigQuery: Fully managed, low-cost, serverless data warehouse that scales with your storage and compute power needs. BigQuery is a columnar, ANSI SQL database that can analyze terabytes to petabytes of data.
  • Cloud Composer: Fully managed workflow orchestration service that lets you author, schedule, and monitor pipelines that span clouds and on-premises data centers.
  • Data Catalog: Fully managed and scalable metadata management service that helps you discover, manage, and understand all your data.
  • Google Data Studio: Fully managed visual analytics service that can help you unlock insights from data through interactive dashboards.
  • Looker: Enterprise platform that connects, analyzes, and visualizes data across multi-cloud environments.
  • Dataform: Fully managed product to help you collaborate, create, and deploy data pipelines, and ensure data quality.
  • Dataplex (preview): Managed data lake service that centrally manages, monitors, and governs data across data lakes, data warehouses, and data marts using consistent controls.
  • Analytics Hub (preview): Platform that efficiently and securely exchanges data analytics assets across your organization to address challenges of data reliability and cost.

Data lifecycle

When you create your system design, you can group the Google Cloud data analytics services around the general data movement in any system, or around the data lifecycle.

The data lifecycle spans stages from data ingestion through reports and visualization, and each stage is supported by example services.

The following stages and services run across the entire data lifecycle:

  • Data integration includes services such as Data Fusion.
  • Metadata management and governance includes services such as Data Catalog.
  • Workflow management includes services such as Cloud Composer.

Data ingestion

Apply the following data ingestion best practices to your own environment.

Determine the data source for ingestion

Data typically comes from another cloud provider or service, or from an on-premises location.

Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.

Identify streaming or batch data sources

Consider how you need to use your data and identify where you have streaming or batch use cases. For example, if you run a global streaming service that has low latency requirements, you can use Pub/Sub. If you need your data for analytics and reporting uses, you can stream data into BigQuery.
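
As a minimal sketch of both streaming options, the following snippet publishes an event to Pub/Sub and streams the same record into a BigQuery table with the Python client libraries; the project, topic, dataset, and table names are placeholders.

    import json
    from google.cloud import bigquery, pubsub_v1

    event = {"user_id": "u123", "action": "page_view"}

    # Low-latency delivery to downstream consumers through Pub/Sub.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "clickstream-events")  # placeholders
    publisher.publish(topic_path, json.dumps(event).encode("utf-8")).result()

    # Streaming insert directly into BigQuery for analytics and reporting.
    bq = bigquery.Client(project="my-project")
    errors = bq.insert_rows_json("my-project.analytics.events", [event])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")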

If you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment, use the Kafka to BigQuery Dataflow template. For batch workloads, the first step is usually to ingest data into Cloud Storage. Use the gsutil tool or Storage Transfer Service to ingest data.
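
For a small scripted batch ingestion, the sketch below uploads a file to Cloud Storage with the Python client library (a programmatic alternative to gsutil) and then loads it into a BigQuery table; the bucket, file, dataset, and table names are placeholders.

    from google.cloud import bigquery, storage

    # Stage the batch file in Cloud Storage.
    bucket = storage.Client().bucket("my-ingest-bucket")  # placeholder bucket
    bucket.blob("sales/2024-01-01.csv").upload_from_filename("sales.csv")

    # Load the staged file into a BigQuery table.
    bq = bigquery.Client(project="my-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = bq.load_table_from_uri(
        "gs://my-ingest-bucket/sales/2024-01-01.csv",
        "my-project.analytics.sales",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete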

Ingest data with automated tools

Manually moving data from other systems into the cloud can be a challenge. If possible, use tools that let you automate the data ingestion process. For example, Cloud Data Fusion provides connectors and plugins to bring in data from external sources with a drag-and-drop GUI. If your teams want to write some code, Dataflow or BigQuery can help to automate data ingestion. Pub/Sub can help in either a low-code or a code-first approach. To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service. A code-first example follows.
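
For the code-first approach, a streaming pipeline like the following Apache Beam sketch can run on Dataflow to automate ingestion from Pub/Sub into BigQuery; the project, topic, table, and bucket names are placeholders.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",   # placeholder settings; use DirectRunner to test locally
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/clickstream-events")
            | "ParseJson" >> beam.Map(json.loads)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )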

Use migration tools to ingest from another data warehouse

If you need to migrate from another data warehouse system, such as Teradata, Netezza, or Redshift, you can use the BigQuery Data Transfer Service migration assistance. The BigQuery Data Transfer Service also provides third-party transfers that help you ingest data on a schedule from external sources. For more information, see the detailed migration approaches for each data warehouse.
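
As a rough sketch, transfers can also be created programmatically with the BigQuery Data Transfer Service Python client. The data source ID and parameter names below are assumptions for a Redshift-style migration; verify them against the BigQuery Data Transfer Service documentation for your source warehouse. All resource names are placeholders.

    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="warehouse_migration",  # placeholder dataset
        display_name="redshift-migration",
        data_source_id="redshift",  # assumption: check the exact data source ID
        params={
            # Assumed parameter names; check the documented parameters for your source.
            "jdbc_url": "jdbc:redshift://example.host:5439/db",
            "database_username": "migration_user",
        },
    )

    transfer_config = client.create_transfer_config(
        parent=client.common_project_path("my-project"),  # placeholder project
        transfer_config=transfer_config,
    )
    print(f"Created transfer config: {transfer_config.name}")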

Estimate your data ingestion needs

The volume of data that you need to ingest helps you to determine which service to use in your system design. For streaming ingestion of data, Pub/Sub scales to tens of gigabytes per second. Capacity, storage, and regional requirements for your data help you to determine whether Pub/Sub Lite is a better option for your system design. For more information, see Choosing Pub/Sub or Pub/Sub Lite.

For batch ingestion of data, estimate how much data you want to transfer in total, and how quickly you want to do it. Review the available migration options, including an estimate of the time needed and a comparison of online versus offline transfers.
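
As a back-of-the-envelope check before you pick a transfer method, you can estimate online transfer time from the data volume and your available bandwidth, as in this small sketch (the numbers are illustrative):

    # Rough estimate of online transfer time: data volume divided by usable bandwidth.
    data_tb = 10          # illustrative: total data to move, in terabytes
    bandwidth_gbps = 1    # illustrative: available network bandwidth
    utilization = 0.7     # assume you can sustain about 70% of the link

    seconds = (data_tb * 8e12) / (bandwidth_gbps * 1e9 * utilization)
    print(f"Estimated transfer time: {seconds / 3600:.1f} hours")  # ~31.7 hours

If the estimate stretches into days or weeks, an offline transfer option may be more practical.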

Use appropriate tools to regularly ingest data on a schedule

Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs. For fine-grained control over the timing of ingestion, or over the source and destination systems, use a workflow-management system like Cloud Composer. If you want a more manual approach, you can use Cloud Scheduler and Pub/Sub to trigger a Cloud Function.
If you want to manage the compute infrastructure yourself, you can use the gsutil command with cron for data transfers of up to 1 TB. If you use this manual approach instead of Cloud Composer, follow the best practices to script production transfers.
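
As a sketch of the Cloud Scheduler and Pub/Sub approach mentioned above, the following background Cloud Function (Python, 1st gen signature) runs whenever Cloud Scheduler publishes to its trigger topic and starts a BigQuery load job; the bucket, dataset, and table names are placeholders.

    from google.cloud import bigquery

    def ingest_on_schedule(event, context):
        """Triggered by a Pub/Sub message that Cloud Scheduler publishes."""
        bq = bigquery.Client()
        job = bq.load_table_from_uri(
            "gs://my-ingest-bucket/daily/*.csv",       # placeholder source files
            "my-project.analytics.daily_events",       # placeholder destination table
            job_config=bigquery.LoadJobConfig(
                source_format=bigquery.SourceFormat.CSV,
                skip_leading_rows=1,
                autodetect=True,
                write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            ),
        )
        job.result()  # wait so that failures surface in the function's logs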

Identify data ingestion needs from IoT devices

If you need to ingest data from IoT devices, use IoT Core to connect to devices and store their data in Google Cloud.