
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift Serverless lets you access and analyze data without all of the configurations of a provisioned data warehouse. Resources are automatically provisioned and data warehouse capacity is intelligently scaled to deliver fast performance for even the most demanding and unpredictable workloads. You don't incur charges when the data warehouse is idle, so you only pay for what you use.

You can import data from Amazon Redshift into Blueshift either by exporting the data to Amazon S3, or by using one of Blueshift's Reverse ETL partners: Census or RudderStack.

Integration via Amazon S3

You can use Redshift Query Editor v2 to export data from Amazon Redshift to Amazon Simple Storage Service (Amazon S3).

Prerequisites

Before you can start with the integration, you must complete the following steps:

  • Create an account in Amazon Redshift.
  • Create an Amazon Redshift workspace with tables or views that contain customer, event, or catalog data.
  • Obtain AWS S3 credentials that have read/write access to the S3 bucket to which you will transfer data.
    If you are using the bucket provided by Blueshift, you can find the details for the S3 bucket and the credentials in the Account Settings > API keys tab in Blueshift.
  • You must have access to Amazon Redshift Query Editor v2.

Set up full data export from Amazon Redshift to Amazon S3

To set up a full export from Amazon Redshift to Amazon S3, complete the following steps:

  1. Create a new Query Editor tab with the name BSFT S3 Export Trigger.
  2. Use the following UNLOAD command to export the table data.
    • This command exports all the data from the Amazon Redshift table to S3. To export specific columns, SELECT only the columns that you want to export (see the example after the UNLOAD command below).
    • The data is exported to S3 as GZIP-compressed files for efficient network and storage utilization.
    • The data is in CSV format.
    • PARALLEL export is set to OFF. This prevents the out-of-sequence export of event data to S3.
    • The maximum file size is set to 1 GB for efficient processing by the Blueshift data ingestion service.

Remember to replace the placeholders in the query for the table name, S3 bucket, S3 path, access key, and secret key with the values specific to your integration.

UNLOAD ('SELECT * FROM <customers_table_name>')
TO 's3://<bsft_bucket_name>/<path_to_location_inside_bsft_bucket>/<prefix>_' 
-- Ex: s3://bsft-customers/site_name.com/customers/customer_
CREDENTIALS 'aws_access_key_id=<AWS_ACCESS_KEY_ID>;aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>'
FORMAT CSV DELIMITER ',' HEADER
GZIP EXTENSION 'csv.gz'
PARALLEL OFF
MAXFILESIZE 1 GB;
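
If you only need a subset of fields, list them in the SELECT instead of using SELECT *. The following is a minimal sketch: the column names customer_id, email, and last_modified are placeholders for illustration and should be replaced with columns from your own table.

-- Hypothetical example: export only selected columns instead of the full table.
UNLOAD ('SELECT customer_id, email, last_modified FROM <customers_table_name>')
TO 's3://<bsft_bucket_name>/<path_to_location_inside_bsft_bucket>/<prefix>_'
CREDENTIALS 'aws_access_key_id=<AWS_ACCESS_KEY_ID>;aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>'
FORMAT CSV DELIMITER ',' HEADER
GZIP EXTENSION 'csv.gz'
PARALLEL OFF
MAXFILESIZE 1 GB;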

You can also set up a recurring job to export the data by using the Schedule Query option in the Query Editor.

Set up incremental data export from Amazon Redshift to Amazon S3

Step 1: Add a timestamp or unique identifier for the data

To enable incremental data export to S3, we recommend that you include a timestamp or a unique identifier in the data that is updated with each new row or with an update operation. This helps to identify data that is new or has changed since the last export.

  1. Create a new Query Editor tab with the name BSFT Export Setup.
  2. If the database table does not have a timestamp column, use the following query to add a new timestamp column. You can update the last_modified value as part of each row update to make that row eligible for the next export operation (a sketch of such an update follows the query below).

    Remember to replace the placeholder in the query for table name with the value specific to your database.

ALTER TABLE <customers_table_name> ADD COLUMN last_modified TIMESTAMP DEFAULT GETDATE();
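
As a rough sketch of the row-update pattern mentioned above, the following hypothetical query refreshes last_modified as part of an update so that the row is picked up by the next incremental export. The email and customer_id columns are assumptions for illustration only.

-- Hypothetical example: bump last_modified whenever a row changes.
UPDATE <customers_table_name>
SET email = 'new.address@example.com',
    last_modified = GETDATE()
WHERE customer_id = '<customer_id>';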
  3. It is advisable to create another metadata table in Amazon Redshift to keep track of the last exported timestamp. You can refer to this timestamp when you export the delta as part of incremental exports. Use the following query to create a new metadata table to track the incremental export.
CREATE TABLE bsft_export_tracking (
  table_name VARCHAR(255) PRIMARY KEY,
  last_exported TIMESTAMP
);
  4. Set the base timestamp value for the last_exported timestamp in the table. All data from this date and time onwards will be included in the first export. A query to verify the entry is shown after the INSERT statement below.

    For example, set last_exported to last month's timestamp if you want to export the data from the past month onwards.

INSERT INTO bsft_export_tracking (table_name, last_exported)
VALUES('<customers_table_name>', DATEADD(month, -1, GETDATE()));
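
To confirm that the base timestamp was recorded, you can query the tracking table before running the first incremental export:

-- Check the tracking entry created above.
SELECT table_name, last_exported FROM bsft_export_tracking;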

Step 2: Use the UNLOAD command to export incremental data

  1. Create a new Query Editor tab with the name BSFT S3 Export Trigger.
  2. Use the following UNLOAD command to export the incremental data.
    • This command exports incremental data from the Amazon Redshift table to S3. To export specific columns, SELECT only the columns that you want to export.
    • The data is exported to S3 as GZIP-compressed files for efficient network and storage utilization.
    • The data is in CSV format.
    • PARALLEL export is set to OFF. This prevents the out-of-sequence export of event data to S3.
    • The maximum file size is set to 1 GB for efficient processing by the Blueshift data ingestion service.
    • The tracking table entry is updated with the current date so that only new or changed data is picked up during the next export.

Remember to replace the placeholders in the query for the table name, S3 bucket, S3 path, access key, and secret key with the values specific to your integration.

UNLOAD ('SELECT * FROM <customers_table_name>
  WHERE last_modified >= (SELECT last_exported FROM bsft_export_tracking
  WHERE table_name = \'<customers_table_name>\')')
TO 's3://<bsft_bucket_name>/<path_to_location_inside_bsft_bucket>/<prefix>_' 
-- Ex: s3://bsft-customers/site_name.com/customers/customer_
CREDENTIALS 'aws_access_key_id=<AWS_ACCESS_KEY_ID>;aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>'
FORMAT CSV DELIMITER ',' HEADER
GZIP EXTENSION 'csv.gz'
PARALLEL OFF
MAXFILESIZE 1 GB;

-- Set last_exported to (current datetime - 1 min) to avoid missing records during the next export due to a race condition.
-- The number of minutes subtracted can be adjusted based on how long the UNLOAD query takes to run.
UPDATE bsft_export_tracking
SET last_exported = DATEADD(minute, -1, GETDATE())
WHERE table_name = '<customers_table_name>';

You can also set up a recurring job to export the data by using the Schedule Query option in the Query Editor.

Schedule a recurring export job

You can convert both the full export and incremental export queries into recurring jobs using the Schedule Query option in the Query Editor.

  • Go to the BSFT S3 Export Trigger tab in the Query Editor.
  • Set the query execution frequency in the scheduler based on your specific use case.
  • You will also have to complete some IAM configurations, such as attaching the Redshift and EventBridge-related policies to the Redshift cluster role. For more information, see the Amazon Redshift documentation for scheduling a query with query editor v2.

Import data from Amazon S3 into Blueshift

After the data is migrated to Amazon S3, you can set up a task in Blueshift to import data from Amazon S3 into Blueshift. Follow the instructions in the Blueshift documentation to import data into Blueshift.

Integration via the Census platform

Census is a Reverse ETL platform that makes it easy to connect your data warehouse to sales, marketing, and other customer-facing tools that drive your business. Using Census, you can easily copy data from Amazon Redshift into Blueshift.

Using Census, you can import customer, event, and catalog data into Blueshift.

To integrate with Blueshift via Census, complete the following steps:

  1. Set up a data warehouse and data model in Census for the data that you want to import into Blueshift. For more information, see the Census documentation for setting up an Amazon Redshift connection.
  2. Set up a connection to Blueshift as the destination in Census.
  3. Create a Sync in Census with Amazon Redshift as the source and Blueshift as the destination. 

Integration via the RudderStack platform

RudderStack is an open-source Customer Data Platform (CDP), providing data pipelines that allow you to easily collect data from every application, website, and SaaS platform to activate in your warehouse and business tools. Using RudderStack, you can easily copy data from Amazon Redshift into Blueshift.

Using RudderStack, you can import customer, event, and catalog data into Blueshift.

To integrate with Blueshift via RudderStack, complete the following steps:

  1. Set up Amazon Redshift as a Reverse ETL source in RudderStack.
  2. Set up Blueshift as a destination in RudderStack.
  3. Create a Connection in RudderStack with Amazon Redshift as the source and Blueshift as the destination.