
Snowflake Backups To Amazon S3 Using Terraform

Onel Harrison
17 minute read

Contents

  1. Introduction
  2. Prerequisite setup
  3. Create an Amazon S3 bucket
  4. Create a storage integration
  5. Create an external stage
  6. Create a stored procedure
  7. Schedule stored procedure execution using tasks
  8. Recovery
  9. Conclusion

Introduction

Snowflake offers multiple layers of native data recovery features, such as Time Travel and Fail-safe, which respectively allow querying historical data up to 90 days old and allow Snowflake’s team to recover data for up to 7 days beyond that. Time Travel also lets us undrop a table we mistakenly dropped, and database replication can be configured across regions.
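
As a quick illustration of these native features, Time Travel and UNDROP can be used directly in SQL; the table name below is just an example.

-- Query a table as it looked one hour ago using Time Travel
SELECT * FROM my_db.my_schema.users AT (OFFSET => -3600);

-- Restore a table that was dropped by mistake
UNDROP TABLE my_db.my_schema.users;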

Unloading data via Snowflake’s COPY INTO statement to an object store like Amazon S3 is yet another option to consider when implementing a Snowflake data recovery strategy. In this post, we demonstrate how to do just that using Terraform.
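
Concretely, the unload statement that the rest of this post automates has roughly the following shape; the stage and table names here are illustrative, and we will create the real stage with Terraform shortly.

COPY INTO @my_external_stage/backups/users/
FROM my_db.my_schema.users
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
HEADER = true
OVERWRITE = true;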

This post is inspired by Chris Herther’s How To Backup Snowflake Data to S3 or GCS, so if you are looking for a SQL-centric approach, I recommend taking a look at his post.

Prerequisite setup

Terraform provider configuration

The following code snippet shows the Terraform provider configuration we use to interact with Amazon Web Services and Snowflake.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }

    snowflake = {
      source = "chanzuckerberg/snowflake"
      version = "~> 0.25"
    }
  }
}

provider "aws" {
  region = "us-west-2"
}

provider "snowflake" {
  alias = "account_admin"
  role  = "ACCOUNTADMIN"
}

provider "snowflake" {
  alias = "sys_admin"
  role  = "SYSADMIN"
}

provider "snowflake" {
  alias = "security_admin"
  role  = "SECURITYADMIN"
}

The snowflake.sys_admin provider alias is mainly used to create Snowflake resources, the snowflake.security_admin alias to issue GRANTs on those resources, and the snowflake.account_admin alias to perform account-level operations and other operations most conveniently executed by the ACCOUNTADMIN Snowflake role.

Both the Snowflake and AWS Terraform providers are further configured using environment variables, as shown in the following script.

export AWS_PROFILE=<aws-profile>
export SNOWFLAKE_USER=<snowflake-user>
export SNOWFLAKE_PASSWORD=<snowflake-password>
export SNOWFLAKE_ACCOUNT=<snowflake-account>
export SNOWFLAKE_REGION=<snowflake-region>

Once the above environment variables are populated (saved here in env.sh), we can source the file and initialize Terraform.

source env.sh
terraform init

Basic Snowflake environment setup

Before we configure and perform any kind of backup procedure, we need some Snowflake resources that are typically already defined in an organization. Using the Terraform code that follows, we create the foundation on which we will configure backups to Amazon S3.

  • CREATE a database named SANDBOX
  • CREATE a schema named ACTIVITY
  • CREATE tables named USERS and EVENTS
  • CREATE a role named SANDBOX_RW for performing read-write operations in the SANDBOX database
  • CREATE a virtual warehouse
  • Apply appropriate GRANTs to each resource
locals {
  sys_admin_role = "SYSADMIN"
  snowflake_user = "SANDBOX_USER"
}

resource "snowflake_database" "sandbox" {
  provider = snowflake.sys_admin
  name = "SANDBOX"
}

resource "snowflake_role" "sandbox_rw" {
  provider = snowflake.security_admin
  name = "SANDBOX_RW"
}

resource "snowflake_role_grants" "sandbox_rw" {
  provider = snowflake.security_admin
  role_name = snowflake_role.sandbox_rw.name

  roles = [
    local.sys_admin_role
  ]

  users = [
    local.snowflake_user
  ]
}

resource "snowflake_schema" "sandbox_activity" {
  provider = snowflake.sys_admin
  database = snowflake_database.sandbox.name
  name     = "ACTIVITY"
}

resource "snowflake_table" "sandbox_activity_users" {
  provider = snowflake.sys_admin
  database = snowflake_database.sandbox.name
  schema   = snowflake_schema.sandbox_activity.name

  name     = "USERS"

  column {
    name = "ID"
    type = "STRING"
    nullable = false
  }

  column {
    name = "NAME"
    type = "STRING"
    nullable = false
  }
}

resource "snowflake_table" "sandbox_activity_events" {
  provider = snowflake.sys_admin
  database = snowflake_database.sandbox.name
  schema   = snowflake_schema.sandbox_activity.name

  name     = "EVENTS"

  column {
    name = "ID"
    type = "STRING"
    nullable = false
  }

  column {
    name = "USER_ID"
    type = "STRING"
    nullable = false
  }

  column {
    name = "EVENT_TYPE"
    type = "STRING"
    nullable = false
  }

  column {
    name = "EVENT_TS"
    type = "TIMESTAMP"
    nullable = false
  }
}

resource "snowflake_database_grant" "usage_sandbox_database" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  privilege     = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]
}

resource "snowflake_schema_grant" "usage_sandbox_activity" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  schema_name   = snowflake_schema.sandbox_activity.name
  privilege     = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]
}

resource "snowflake_table_grant" "select_sandbox_tables" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  privilege     = "SELECT"

  roles = [
    snowflake_role.sandbox_rw.name
  ]

  on_future = true
}

resource "snowflake_table_grant" "insert_sandbox_tables" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  privilege     = "INSERT"

  roles = [
    snowflake_role.sandbox_rw.name
  ]

  on_future = true
}

resource "snowflake_table_grant" "update_sandbox_tables" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  privilege     = "UPDATE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]

  on_future = true
}

resource "snowflake_table_grant" "truncate_sandbox_tables" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  privilege     = "TRUNCATE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]

  on_future = true
}

resource "snowflake_table_grant" "delete_sandbox_tables" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  privilege     = "DELETE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]

  on_future = true
}

resource "snowflake_warehouse" "warehouse" {
  provider = snowflake.sys_admin
  name = "COMPUTE_WH"
  warehouse_size = "XSMALL"
}

resource "snowflake_warehouse_grant" "usage_warehouse" {
  provider = snowflake.security_admin
  warehouse_name = snowflake_warehouse.warehouse.name

  privilege = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]
}

After applying the above Terraform code, the highlighted resources in the image below should be visible, usable, and queryable in the Snowflake web console.

Snowflake web console showing usable and queryable resources (Image credit: Author)

Run the following statement to add a bit of data to the USERS table so that we have something to back up. We only insert records into the USERS table to later demonstrate that the COPY INTO <location> FROM <table> statement does not unload empty tables like our EVENTS table.

INSERT INTO sandbox.activity.users VALUES ('1', 'Jenkins');

Create an Amazon S3 bucket

We will need somewhere to store the database backup files, and an object store like Amazon S3 is an inexpensive and reasonable place to keep them.

resource "random_id" "aws_resource_id" {
  byte_length  = 4
}

resource "aws_s3_bucket" "snowflake_backups_bucket" {
  bucket = "snowflake-backups-${lower(random_id.aws_resource_id.id)}"

  acl    = "private"

  versioning {
    enabled = true
  }

  lifecycle_rule {
    id      = "backups"
    enabled = true

    tags = {
      rule      = "backups"
      autoclean = "true"
    }

    transition {
      days          = 7
      storage_class = "GLACIER"
    }

    expiration {
      days = 30
    }
  }

  lifecycle {
    prevent_destroy = true
  }
}

In the code above, we create an Amazon S3 bucket. We also ensure that versioning is enabled, that week-old files are transitioned to cheaper Amazon S3 Glacier storage, and that 30-day-old files are expired. Consider your own versioning, transition, and expiration policies and make the necessary adjustments.

Next, we create a Snowflake storage integration and external stage to access Amazon S3.

Snowflake storage integration, external stage, and S3 bucket (Image credit: Snowflake)

Create a storage integration

The Terraform code that follows creates and configures an IAM role with sufficient permissions for Snowflake to interact with the S3 bucket we created.

It also creates a TOOLS schema in the SANDBOX database that will house Snowflake resources such as storage integrations, stored procedures, and tasks, as you will see later on.

data "snowflake_current_account" "this" {}

resource "snowflake_schema" "sandbox_tools" {
  provider = snowflake.sys_admin
  database = snowflake_database.sandbox.name
  name     = "TOOLS"
}

resource "snowflake_schema_grant" "usage_sandbox_tools" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  schema_name   = snowflake_schema.sandbox_tools.name
  privilege     = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name,
  ]
}

resource "snowflake_storage_integration" "snowflake_s3_backup" {
  provider = snowflake.account_admin
  name    = "SNOWFLAKE_S3_BACKUP"
  comment = "A storage integration for handling Snowflake S3 backups"
  type    = "EXTERNAL_STAGE"

  enabled = true

  storage_allowed_locations = ["s3://${aws_s3_bucket.snowflake_backups_bucket.bucket}/"]

  storage_provider      = "S3"
  storage_aws_role_arn  = aws_iam_role.snowflake.arn
}

resource "snowflake_integration_grant" "usage_snowflake_s3_backup_integration" {
  provider = snowflake.security_admin
  integration_name = snowflake_storage_integration.snowflake_s3_backup.name

  privilege = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]
}

// AWS

resource "aws_iam_role" "snowflake" {
  name = "snowflake"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = ""
        Effect = "Allow"
        Action = "sts:AssumeRole"
        Principal = { "AWS": "*" }
        Condition = {
          "StringLike": {
            "sts:ExternalId": "${data.snowflake_current_account.this.account}_SFCRole=*"
          }
        }
      }
    ]
  })
}

resource "aws_iam_policy_attachment" "snowflake_s3_backups" {
  name       = "snowflake_s3_backups_policy_attachment"
  roles      = [aws_iam_role.snowflake.name]
  policy_arn = aws_iam_policy.snowflake_s3_backups_policy.arn
}

resource "aws_iam_policy" "snowflake_s3_backups_policy" {
  name   = "snowflake_s3_backups_policy"
  policy = data.aws_iam_policy_document.snowflake_s3_backups_policy_doc.json
}

data "aws_iam_policy_document" "snowflake_s3_backups_policy_doc" {
  statement {
    effect  = "Allow"
    actions = [
      "s3:PutObject",
      "s3:DeleteObject",
      "s3:GetObject",
      "s3:GetObjectVersion",
    ]

    resources = [
      "arn:aws:s3:::${aws_s3_bucket.snowflake_backups_bucket.bucket}/*"
    ]
  }

  statement {
    effect  = "Allow"
    actions = [
      "s3:ListBucket",
      "s3:GetBucketLocation"
    ]

    resources = [
      "arn:aws:s3:::${aws_s3_bucket.snowflake_backups_bucket.bucket}"
    ]
  }
}

Inspect the created storage integration

DESCRIBE INTEGRATION SNOWFLAKE_S3_BACKUP;

Create an external stage

Here we create an external stage connected to our S3 bucket via the storage integration. The files in the stage will be GZIP-compressed CSV files.

resource "snowflake_stage" "snowflake_s3_backup" {
  provider    = snowflake.account_admin
  name        = "SNOWFLAKE_S3_BACKUP"
  url         = "s3://${aws_s3_bucket.snowflake_backups_bucket.bucket}/"
  database    = snowflake_database.sandbox.name
  schema      = snowflake_schema.sandbox_tools.name
  file_format = "TYPE=CSV COMPRESSION=GZIP FIELD_OPTIONALLY_ENCLOSED_BY= '\"' SKIP_HEADER=1"
  storage_integration = snowflake_storage_integration.snowflake_s3_backup.name
}

resource "snowflake_stage_grant" "snowflake_s3_backup" {
  provider      = snowflake.security_admin
  database_name = snowflake_stage.snowflake_s3_backup.database
  schema_name   = snowflake_stage.snowflake_s3_backup.schema
  roles         = [
    snowflake_role.sandbox_rw.name,
  ]
  privilege     = "USAGE"
  stage_name    = snowflake_stage.snowflake_s3_backup.name
}

Inspect the created stage

DESCRIBE STAGE SANDBOX.TOOLS.SNOWFLAKE_S3_BACKUP;

Create a stored procedure

resource "snowflake_procedure" "backup_database" {
  provider = snowflake.sys_admin
  name     = "SPROC_BACKUP_DATABASE"
  database = snowflake_database.sandbox.name
  schema   = snowflake_schema.sandbox_tools.name

  arguments {
    name = "DATABASE"
    type = "VARCHAR"
  }

  comment             = "Procedure for backing up a database to S3"
  return_type         = "VARCHAR"
  execute_as          = "CALLER"
  return_behavior     = "IMMUTABLE"
  null_input_behavior = "RETURNS NULL ON NULL INPUT"
  statement           = file("${path.module}/stored_procedures/sproc_backup_database.js")
}

resource "snowflake_procedure_grant" "usage_sproc_backup_database" {
  provider = snowflake.security_admin
  database_name   = snowflake_database.sandbox.name
  schema_name     = snowflake_schema.sandbox_tools.name
  procedure_name  = snowflake_procedure.backup_database.name

  arguments {
    name = "DATABASE"
    type = "VARCHAR"
  }

  return_type = "VARCHAR"
  privilege   = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name
  ]
}

The body of the stored procedure is written in JavaScript as shown below.

// The stored procedure body (JavaScript). The DATABASE argument is available as an
// all-caps binding, matching how the parameter is declared in the Terraform above.
let schemaTablePairResult = getSchemaTablePairResult(DATABASE);

// Unload every base table in the database to the external stage.
while (schemaTablePairResult.next()) {
  let schema = schemaTablePairResult.getColumnValue(1);
  let table = schemaTablePairResult.getColumnValue(2);

  copyToS3(DATABASE, schema, table);
}

return `Database '${DATABASE}' successfully backed up to S3.`;

function getCurrentDate() {
  let sqlResult = snowflake.execute({
    sqlText: `SELECT TO_VARCHAR(CURRENT_DATE());`,
  });
  sqlResult.next();
  return sqlResult.getColumnValue(1);
}

function getExternalStageName(database) {
  let sqlResult = snowflake.execute({
    sqlText: `SELECT s.stage_name
		FROM ${database}.information_schema.stages s
		WHERE s.stage_name = 'SNOWFLAKE_S3_BACKUP'
		AND s.stage_type = 'External Named'
		`,
  });
  sqlResult.next();
  return sqlResult.getColumnValue(1);
}

function getSchemaTablePairResult(database) {
  return snowflake.execute({
    sqlText: `SELECT t.table_schema, t.table_name
    FROM ${database}.information_schema.tables t
    WHERE t.table_type = 'BASE TABLE'
    `,
  });
}

// Unloads a single table to the external stage as gzipped CSV files of at most ~100 MB each.
function copyToS3(database, schema, table) {
  snowflake.execute({
    sqlText: `COPY INTO @SANDBOX.TOOLS.SNOWFLAKE_S3_BACKUP/${getCurrentDate()}/${database}/${schema}/${table}/data_
	FROM ${database}.${schema}.${table}
	HEADER = true
	OVERWRITE = true
	MAX_FILE_SIZE = 104857600
	`,
  });
}

There are a few things to note about this stored procedure.

  • The capitalization of the DATABASE procedure parameter is required.
  • Only base tables are unloaded.
  • Unloaded tables are split into one or more files matching the data_.*[.]csv.gz pattern, each of which is at most 100 MB in size per Snowflake’s File Sizing Recommendations.
  • Unloaded files contain a header row and are overwritten if the procedure is executed more than once in a day.
  • The COPY INTO statement does not unload any data for empty tables.

Inspect the created stored procedure

DESCRIBE PROCEDURE SANDBOX.TOOLS.SPROC_BACKUP_DATABASE(VARCHAR);

Test the created stored procedure

CALL SANDBOX.TOOLS.SPROC_BACKUP_DATABASE('SANDBOX');

You will know that the stored procedure works if you find the relevant files in the S3 bucket. In this case, there should be files located in S3 that match the following pattern.

s3://<bucket>/<YYYY-MM-DD>/<db>/<schema>/<table>/data_.*[.]csv.gz

Schedule stored procedure execution using tasks

Here we create a TASKADMIN role for managing and operating tasks, create a task that calls the stored procedure on a cron schedule with the SANDBOX database as a parameter, and update existing GRANTs to include the TASKADMIN role.

The code is effectively a terraform translation of the SQL found on Snowflake’s Introduction to Tasks — Creating a Task Administrator Role documentation page.

resource "snowflake_database_grant" "usage_sandbox_database" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  privilege     = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name,
    snowflake_role.task_admin.name # UPDATE
  ]
}

resource "snowflake_schema_grant" "usage_sandbox_tools" {
  provider = snowflake.security_admin
  database_name = snowflake_database.sandbox.name
  schema_name   = snowflake_schema.sandbox_tools.name
  privilege     = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name,
    snowflake_role.task_admin.name # UPDATE
  ]
}

resource "snowflake_warehouse_grant" "usage_warehouse" {
  provider = snowflake.security_admin
  warehouse_name = snowflake_warehouse.warehouse.name

  privilege = "USAGE"

  roles = [
    snowflake_role.sandbox_rw.name,
    snowflake_role.task_admin.name # UPDATE
  ]
}

resource "snowflake_role" "task_admin" {
  provider = snowflake.security_admin
  name     = "TASKADMIN"
}

resource "snowflake_role_grants" "task_admin_grants" {
  provider = snowflake.security_admin
  role_name = snowflake_role.task_admin.name

  roles = [
    local.sys_admin_role
  ]

  users = [
    local.snowflake_user
  ]
}

resource "snowflake_account_grant" "execute_task" {
  provider  = snowflake.account_admin
  roles     = [
    snowflake_role.task_admin.name
  ]
  privilege = "EXECUTE TASK"
}

resource "snowflake_account_grant" "execute_managed_task" {
  provider  = snowflake.account_admin
  roles     = [
    snowflake_role.task_admin.name
  ]
  privilege = "EXECUTE MANAGED TASK"
}

resource "snowflake_task" "backup_sandbox_database" {
  provider  = snowflake.account_admin # owner role needs to have execute task privilege
  database  = snowflake_database.sandbox.name
  schema    = snowflake_schema.sandbox_tools.name
  enabled   = true
  name      = upper("task_call_${snowflake_procedure.backup_database.name}")
  warehouse = snowflake_warehouse.warehouse.name

  schedule      = "USING CRON 0 2 */30 * * UTC" # At 2am UTC (9pm Eastern) on the 1st and 31st of each month, i.e. roughly every 30 days
  sql_statement = "CALL ${snowflake_procedure.backup_database.name}('${snowflake_database.sandbox.name}')"
}

resource "snowflake_task_grant" "operate_task_backup_database" {
  provider      = snowflake.account_admin
  database_name = snowflake_database.sandbox.name
  schema_name   = snowflake_schema.sandbox_tools.name
  task_name     = snowflake_task.backup_sandbox_database.name

  privilege = "OPERATE"

  roles = [
    snowflake_role.task_admin.name
  ]
}

Inspect the created task

USE ROLE TASKADMIN;
DESCRIBE TASK SANDBOX.TOOLS.TASK_CALL_SPROC_BACKUP_DATABASE;

View a history of task execution

The account usage TASK_HISTORY view allows us to view the history of task executions.

USE ROLE ACCOUNTADMIN;

SELECT *
FROM snowflake.account_usage.task_history
ORDER BY completed_time DESC
LIMIT 10;

It would have been nice to have the ability to manually trigger a task to test that it is configured correctly, but there’s no way to do that (at the time of writing). One workaround is to adjust the specified cron schedule to run the task every minute and review the TASK_HISTORY view and the resulting backup files in S3.
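
If you prefer not to touch the Terraform configuration, the schedule can also be adjusted ad hoc in SQL by the task owner (ACCOUNTADMIN in this setup); a minimal sketch follows. Note that a task must be suspended before its schedule can be changed, and that altering it directly will drift from the Terraform definition until the configuration is re-applied.

USE ROLE ACCOUNTADMIN;

ALTER TASK SANDBOX.TOOLS.TASK_CALL_SPROC_BACKUP_DATABASE SUSPEND;

-- Run every minute, for testing only
ALTER TASK SANDBOX.TOOLS.TASK_CALL_SPROC_BACKUP_DATABASE
  SET SCHEDULE = 'USING CRON * * * * * UTC';

ALTER TASK SANDBOX.TOOLS.TASK_CALL_SPROC_BACKUP_DATABASE RESUME;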

⚠️ If you use this workaround, remember to readjust the cron schedule to something more reasonable to avoid unwanted Snowflake credit usage.

Recovery

We can view the backup files available in our stage — i.e., in the S3 bucket — using the LIST statement. Notice that there aren’t any files for the empty EVENTS table as previously alluded to.

LIST @SANDBOX.TOOLS.SNOWFLAKE_S3_BACKUP;

Here we see that there are backups of the USERS table for 2022-01-14 and 2022-01-15. Let’s query the data in the latest backup.


SELECT
  $1 as ID
  , $2 as NAME
FROM @SANDBOX.TOOLS.SNOWFLAKE_S3_BACKUP/2022-01-15/SANDBOX/ACTIVITY/USERS;

We can then create a table to store and inspect the recovered data more closely using a CTAS (CREATE TABLE ... AS SELECT) statement.

CREATE TABLE sandbox.activity.users_recovered AS
SELECT
  $1 as ID
  , $2 as NAME
FROM @SANDBOX.TOOLS.SNOWFLAKE_S3_BACKUP/2022-01-15/SANDBOX/ACTIVITY/USERS;

Having to specify an alias for each column may be tedious for wider tables, so there is a two-step alternative that uses a CREATE TABLE ... LIKE (CTL) statement followed by an INSERT.

CREATE TABLE sandbox.activity.users_recovered LIKE sandbox.activity.users;

INSERT INTO sandbox.activity.users_recovered
SELECT $1, $2
FROM @SANDBOX.TOOLS.SNOWFLAKE_S3_BACKUP/2022-01-15/SANDBOX/ACTIVITY/USERS;

Once we have inspected the recovered table to ensure that things are as expected, we can replace the original table with the recovered table.

ALTER TABLE sandbox.activity.users RENAME TO sandbox.activity.users_old;
ALTER TABLE sandbox.activity.users_recovered RENAME TO sandbox.activity.users;
DROP TABLE sandbox.activity.users_old;

Conclusion

Some teams have opted to manage a portion of their Snowflake environment via infrastructure-as-code tools such as Terraform. Such a decision may lead to data recovery solutions similar to the one presented in this post.

However, this method has some limitations, such as the backup files not accurately reflecting the full database, since empty tables are not unloaded. (There is a workaround for this should you want an unloaded file containing only a header row for empty tables.)

It’s also reasonable to expect that you may want to exclude certain schemas or particular tables from being unloaded. While the solution presented in this post does not allow for this, you can further parametrize and modify the backup stored procedure to add this functionality, as sketched below.
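
For example, the information_schema query inside the stored procedure could be extended with exclusion filters along these lines; the schema and table names used in the filters are purely illustrative.

SELECT t.table_schema, t.table_name
FROM SANDBOX.information_schema.tables t
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('TOOLS')           -- skip utility schemas
  AND t.table_name NOT IN ('SOME_LARGE_TABLE'); -- skip specific tables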

Additionally, because the Snowflake API for stored procedures is synchronous, an executing query must complete before another can begin. This means the stored procedure used in this post may be sufficient for backing up a database with relatively small tables, while backing up databases with large tables may consume significantly more compute credits than is desirable. A more production-ready solution would involve writing client code that uses a Snowflake connector to run the COPY INTO queries asynchronously so that they execute in parallel.

Finally, restoring data from backups may not be as straightforward as presented here, as there may be schema changes between when a backup was created and the time of recovery. Schema information, such as what’s found in information_schema.columns in each database, may be helpful for understanding the differences between the backed-up and current tables. The stored procedure can also be modified to unload a database’s information_schema.columns view alongside the other tables, as roughly sketched below.
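
Assuming the same stage and path layout used by the procedure, that extra unload might look something like the following; the date prefix would be generated dynamically, as it is for the other tables.

COPY INTO @SANDBOX.TOOLS.SNOWFLAKE_S3_BACKUP/2022-01-15/SANDBOX/INFORMATION_SCHEMA/COLUMNS/data_
FROM (SELECT * FROM SANDBOX.information_schema.columns)
HEADER = true
OVERWRITE = true
MAX_FILE_SIZE = 104857600;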

You may view the complete source code used in this post on GitHub.

Akava would love to help your organization adapt, evolve and innovate your modernization initiatives. If you’re looking to discuss, strategize or implement any of these processes, reach out to [email protected] and reference this post.
