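AWS in Practice: Syncing Data from DynamoDB to Redshift
Introduction to AWS DynamoDB
- Amazon DynamoDB is a fully managed, serverless NoSQL key-value database designed to run high-performance applications at any scale.
- DynamoDB delivers consistent single-digit-millisecond response times at any scale, and its storage is virtually unlimited.
- DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data export tools.
Introduction to Redshift
- Amazon Redshift is a fast, powerful, fully managed, petabyte-scale data warehouse service. You can start with a few hundred GB of data and later scale up to petabytes.
- Redshift is an OLAP (Online Analytical Processing) system: it supports complex analytical queries, is geared toward decision support, and returns query results that are easy to interpret.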
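Resource preparation
VPC
- VPC
  - CIDR block: 10.10.0.0/16
- Internet gateway
- Elastic IP address
- NAT gateway: uses the Elastic IP address as its public IP
- Public subnets
  - one in each of three Availability Zones
- Private subnets
  - one in each of three Availability Zones
- Public route table: the route table associated with the public subnets
  - destination: 0.0.0.0/0, target: internet-gateway-id (allows traffic to and from the internet)
  - destination: 10.10.0.0/16, target: local (traffic inside the VPC)
- Private route table: the route table associated with the private subnets
  - destination: 10.10.0.0/16, target: local (traffic inside the VPC)
  - destination: 0.0.0.0/0, target: nat-gateway-id (lets resources in the private subnets reach the internet)
- Web server security group
  - allows access to port 443 from any IP
  - allows access to port 22 from your own IP, so that you can SSH to the server and insert data into the database
- Glue Redshift connection security group
  - contains a single self-referencing rule that allows access to all TCP ports from the same security group
  - this security group is required when creating the Glue connection:
    - Reference: the Glue connection security group must have a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID.
- Private Redshift security group
  - allows access to port 5439 from 10.10.0.0/24 (described as "inside the VPC", although this range covers only public subnet A of the 10.10.0.0/16 VPC)
  - allows access to port 5439 from the Glue connection security group
- Public Redshift security group
  - allows access to port 5439 from 10.10.0.0/24 (the same range as above)
  - allows access to port 5439 from the public IP range of Kinesis Data Firehose in the cluster's region: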
    - 13.58.135.96/27 for US East (Ohio)
    - 52.70.63.192/27 for US East (N. Virginia)
    - 13.57.135.192/27 for US West (N. California)
    - 52.89.255.224/27 for US West (Oregon)
    - 18.253.138.96/27 for AWS GovCloud (US-East)
    - 52.61.204.160/27 for AWS GovCloud (US-West)
    - 35.183.92.128/27 for Canada (Central)
    - 18.162.221.32/27 for Asia Pacific (Hong Kong)
    - 13.232.67.32/27 for Asia Pacific (Mumbai)
    - 13.209.1.64/27 for Asia Pacific (Seoul)
    - 13.228.64.192/27 for Asia Pacific (Singapore)
    - 13.210.67.224/27 for Asia Pacific (Sydney)
    - 13.113.196.224/27 for Asia Pacific (Tokyo)
    - 52.81.151.32/27 for China (Beijing)
    - 161.189.23.64/27 for China (Ningxia)
    - 35.158.127.160/27 for Europe (Frankfurt)
    - 52.19.239.192/27 for Europe (Ireland)
    - 18.130.1.96/27 for Europe (London)
    - 35.180.1.96/27 for Europe (Paris)
    - 13.53.63.224/27 for Europe (Stockholm)
    - 15.185.91.0/27 for Middle East (Bahrain)
    - 18.228.1.128/27 for South America (São Paulo)
    - 15.161.135.128/27 for Europe (Milan)
    - 13.244.121.224/27 for Africa (Cape Town)
    - 13.208.177.192/27 for Asia Pacific (Osaka)
    - 108.136.221.64/27 for Asia Pacific (Jakarta)
    - 3.28.159.32/27 for Middle East (UAE)
    - 18.100.71.96/27 for Europe (Spain)
    - 16.62.183.32/27 for Europe (Zurich)
    - 18.60.192.128/27 for Asia Pacific (Hyderabad)
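The CIDR choices above can be sanity-checked with Python's standard `ipaddress` module. The following is only a sketch: the subnet CIDRs are copied from the VPC template in this article, while the helper names `covered_subnets` and `in_firehose_range` are made up for illustration. It shows which subnets a security-group source range actually covers, and whether an address falls inside a Firehose block (Singapore is used as the default because the templates deploy to ap-southeast-1).

```python
import ipaddress
from typing import List

# Subnet CIDRs as declared in the VPC template of this article.
SUBNETS = {
    "SubnetAPublic": "10.10.0.0/24",
    "SubnetBPublic": "10.10.32.0/24",
    "SubnetCPublic": "10.10.64.0/24",
    "SubnetAPrivate": "10.10.16.0/24",
    "SubnetBPrivate": "10.10.48.0/24",
    "SubnetCPrivate": "10.10.80.0/24",
}


def covered_subnets(rule_cidr: str) -> List[str]:
    """Return the names of the subnets that lie fully inside a source CIDR."""
    rule = ipaddress.ip_network(rule_cidr)
    return sorted(
        name
        for name, cidr in SUBNETS.items()
        if ipaddress.ip_network(cidr).subnet_of(rule)
    )


def in_firehose_range(ip: str, firehose_cidr: str = "13.228.64.192/27") -> bool:
    """Check whether an address falls inside a Firehose CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(firehose_cidr)


if __name__ == "__main__":
    print(covered_subnets("10.10.0.0/24"))  # only public subnet A
    print(covered_subnets("10.10.0.0/16"))  # every subnet in the VPC
```

If clients in the other subnets need to reach Redshift on port 5439, the wider `10.10.0.0/16` range (or a security-group reference) would be the source to use.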
Serverless file containing all VPC resources:
- replace custom:bucketNamePrefix with the prefix of a bucket you created yourself
```yaml
service: dynamodb-to-redshift-vpc

custom:
  bucketNamePrefix: "jessica"

provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    VpcName:
      Type: String
      Default: "test-vpc"

  Resources:
    VPC:
      Type: "AWS::EC2::VPC"
      Properties:
        CidrBlock: "10.10.0.0/16"
        EnableDnsSupport: true
        EnableDnsHostnames: true
        InstanceTenancy: default
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}"

    # Internet Gateway
    InternetGateway:
      Type: "AWS::EC2::InternetGateway"
      Properties:
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_InternetGateway"
    VPCGatewayAttachment:
      Type: "AWS::EC2::VPCGatewayAttachment"
      Properties:
        VpcId: !Ref VPC
        InternetGatewayId: !Ref InternetGateway

    # web server security group
    WebServerSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from public
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 443
            ToPort: 443
            CidrIp: "0.0.0.0/0"
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_WebServerSecurityGroup"

    # public route table
    RouteTablePublic:
      Type: "AWS::EC2::RouteTable"
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_RouteTablePublic"
    RouteTablePublicInternetRoute:
      Type: "AWS::EC2::Route"
      DependsOn: VPCGatewayAttachment
      Properties:
        RouteTableId: !Ref RouteTablePublic
        DestinationCidrBlock: "0.0.0.0/0"
        GatewayId: !Ref InternetGateway

    # public subnet
    SubnetAPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [0, !GetAZs ""]
        CidrBlock: "10.10.0.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetAPublic"
    RouteTableAssociationAPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetAPublic
        RouteTableId: !Ref RouteTablePublic
    SubnetBPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [1, !GetAZs ""]
        CidrBlock: "10.10.32.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetBPublic"
    RouteTableAssociationBPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetBPublic
        RouteTableId: !Ref RouteTablePublic
    SubnetCPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [2, !GetAZs ""]
        CidrBlock: "10.10.64.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetCPublic"
    RouteTableAssociationCPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetCPublic
        RouteTableId: !Ref RouteTablePublic

    # redshift security group
    PrivateRedshiftSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from inside vpc
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 10.10.0.0/24
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            SourceSecurityGroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_PrivateRedshiftSecurityGroup"

    # redshift security group
    PublicRedshiftSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from inside vpc and Kinesis Data Firehose CIDR block
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 10.10.0.0/24
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 13.228.64.192/27
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_PublicRedshiftSecurityGroup"

    GlueRedshiftConnectionSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow self referring for all tcp ports
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_GlueRedshiftConnectionSecurityGroup"
    GlueRedshiftConnectionSecurityGroupSelfReferringInboundRule:
      Type: "AWS::EC2::SecurityGroupIngress"
      Properties:
        GroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        IpProtocol: tcp
        FromPort: 0
        ToPort: 65535
        SourceSecurityGroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        SourceSecurityGroupOwnerId: !Sub "${aws:accountId}"

    # nat gateway
    EIP:
      Type: "AWS::EC2::EIP"
      Properties:
        Domain: vpc
    NatGateway:
      Type: "AWS::EC2::NatGateway"
      Properties:
        AllocationId: !GetAtt "EIP.AllocationId"
        SubnetId: !Ref SubnetAPublic

    # private route table
    RouteTablePrivate:
      Type: "AWS::EC2::RouteTable"
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_RouteTablePrivate"
    RouteTablePrivateRoute:
      Type: "AWS::EC2::Route"
      Properties:
        RouteTableId: !Ref RouteTablePrivate
        DestinationCidrBlock: "0.0.0.0/0"
        NatGatewayId: !Ref NatGateway

    # private subnet
    SubnetAPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [0, !GetAZs ""]
        CidrBlock: "10.10.16.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetAPrivate"
    RouteTableAssociationAPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetAPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetBPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [1, !GetAZs ""]
        CidrBlock: "10.10.48.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetBPrivate"
    RouteTableAssociationBPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetBPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetCPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [2, !GetAZs ""]
        CidrBlock: "10.10.80.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetCPrivate"
    RouteTableAssociationCPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetCPrivate
        RouteTableId: !Ref RouteTablePrivate

  Outputs:
    VPC:
      Description: "VPC."
      Value: !Ref VPC
      Export:
        Name: !Sub "${self:provider.stackName}"
    SubnetsPublic:
      Description: "Subnets public."
      Value: !Join [",", [!Ref SubnetAPublic, !Ref SubnetBPublic, !Ref SubnetCPublic]]
      Export:
        Name: !Sub "${self:provider.stackName}-PublicSubnets"
    SubnetsPrivate:
      Description: "Subnets private."
      Value: !Join [",", [!Ref SubnetAPrivate, !Ref SubnetBPrivate, !Ref SubnetCPrivate]]
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateSubnets"
    DefaultSecurityGroup:
      Description: "VPC Default Security Group"
      Value: !GetAtt VPC.DefaultSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-DefaultSecurityGroup"
    WebServerSecurityGroup:
      Description: "VPC Web Server Security Group"
      Value: !Ref WebServerSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-WebServerSecurityGroup"
    PrivateRedshiftSecurityGroup:
      Description: "The id of the RedshiftSecurityGroup"
      Value: !Ref PrivateRedshiftSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateRedshiftSecurityGroup"
    PublicRedshiftSecurityGroup:
      Description: "The id of the RedshiftSecurityGroup"
      Value: !Ref PublicRedshiftSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-PublicRedshiftSecurityGroup"
    GlueRedshiftConnectionSecurityGroup:
      Description: "The id of the self referring security group"
      Value: !Ref GlueRedshiftConnectionSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-GlueSelfRefringSecurityGroup"
```
Redshift Cluster
- Private cluster subnet group
  - create a private subnet group that contains the private subnets
- Private cluster: used to test syncing data to Redshift with the Glue job. PubliclyAccessible must be set to false, otherwise the Glue job cannot connect
  - ClusterSubnetGroupName: use the private subnet group
  - VpcSecurityGroupIds: use the private Redshift security group
  - NodeType: dc2.large
  - ClusterType: single-node
  - PubliclyAccessible: false
- Public cluster subnet group
  - create a public subnet group that contains the public subnets
- Public cluster: used to test syncing data to Redshift with Kinesis Data Firehose. PubliclyAccessible must be set to true, and the security group must allow the Kinesis Data Firehose public IPs to access port 5439, otherwise Firehose cannot connect to Redshift
  - ClusterSubnetGroupName: use the public subnet group
  - VpcSecurityGroupIds: use the public Redshift security group
  - NodeType: dc2.large
  - ClusterType: single-node
  - PubliclyAccessible: true
Serverless file containing all Redshift resources:
- replace custom:bucketNamePrefix with the prefix of a bucket you created yourself

```yaml
service: dynamodb-to-redshift-redshift

custom:
  bucketNamePrefix: "jessica"

provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    ServiceName:
      Type: String
      Default: dynamodb-to-redshift

  Resources:
    PrivateClusterSubnetGroup:
      Type: "AWS::Redshift::ClusterSubnetGroup"
      Properties:
        Description: Private Cluster Subnet Group
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateSubnets
        Tags:
          - Key: Name
            Value: private-subnet
    PrivateCluster:
      Type: "AWS::Redshift::Cluster"
      Properties:
        ClusterIdentifier: test-data-sync-redshift
        ClusterSubnetGroupName: !Ref PrivateClusterSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateRedshiftSecurityGroup
        DBName: dev
        MasterUsername: admin
        MasterUserPassword: Redshift_admin_2022
        NodeType: dc2.large
        ClusterType: single-node
        PubliclyAccessible: false

    PublicClusterSubnetGroup:
      Type: "AWS::Redshift::ClusterSubnetGroup"
      Properties:
        Description: Public Cluster Subnet Group
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PublicSubnets
        Tags:
          - Key: Name
            Value: public-subnet
    PublicCluster:
      Type: "AWS::Redshift::Cluster"
      Properties:
        ClusterIdentifier: test-data-sync-redshift-public
        ClusterSubnetGroupName: !Ref PublicClusterSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PublicRedshiftSecurityGroup
        DBName: dev
        MasterUsername: admin
        MasterUserPassword: Redshift_admin_2022
        NodeType: dc2.large
        ClusterType: single-node
        PubliclyAccessible: true

  Outputs:
    PrivateRedshiftEndpoint:
      Description: "Redshift endpoint"
      Value: !GetAtt PrivateCluster.Endpoint.Address
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateRedshiftEndpoint"
    PrivateRedshiftPort:
      Description: "Redshift port"
      Value: !GetAtt PrivateCluster.Endpoint.Port
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateRedshiftPort"
    PublicRedshiftEndpoint:
      Description: "Public Redshift endpoint"
      Value: !GetAtt PublicCluster.Endpoint.Address
      Export:
        Name: !Sub "${self:provider.stackName}-PublicRedshiftEndpoint"
    PublicRedshiftPort:
      Description: "Public Redshift port"
      Value: !GetAtt PublicCluster.Endpoint.Port
      Export:
        Name: !Sub "${self:provider.stackName}-PublicRedshiftPort"
```
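Syncing with an AWS Glue ETL job
When to use it
- One-off sync of an entire table
- For typical time-series data (recent data is written and read frequently, and the older the data, the less often it is accessed), a common design is to create one table per time period (for example per day) so that WCU and RCU can be allocated sensibly.
- If all data of a period needs complex analysis once that period has ended, the whole DynamoDB table has to be synced to Redshift.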
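The table-per-period design can be sketched as follows. This is an illustration only: the naming scheme `<prefix>_YYYY_MM_DD` and both function names are assumptions made here, not something the article prescribes. The point is that once a day is over, its table is closed and becomes a candidate for a full-table sync.

```python
from datetime import date, timedelta
from typing import List


def daily_table_name(prefix: str, day: date) -> str:
    """Name of the (hypothetical) DynamoDB table holding one day of data."""
    return f"{prefix}_{day:%Y_%m_%d}"


def tables_ready_for_sync(prefix: str, today: date, lookback_days: int = 1) -> List[str]:
    """Tables whose day has ended and that can be copied to Redshift in full."""
    return [
        daily_table_name(prefix, today - timedelta(days=offset))
        for offset in range(1, lookback_days + 1)
    ]


if __name__ == "__main__":
    print(tables_ready_for_sync("events", date.today(), lookback_days=2))
```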
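Architecture
Advantages
- AWS Glue crawlers manage the schemas of the source and target tables automatically, so the Glue job script needs no mapping step and stays easy to maintain
Resources to deploy
- DynamoDB table: the source table
- IAM role for the Glue crawlers: the crawlers need permission to connect to DynamoDB and Redshift in order to read the table schemas
- DynamoDB Glue catalog database: stores the DynamoDB table schema generated by the crawler
- Redshift Glue catalog database: stores the Redshift table schema generated by the crawler
- DynamoDB Glue crawler: reads the DynamoDB table and generates the corresponding table schema
- Redshift Glue crawler: reads the Redshift table and generates the corresponding table schema
- Glue connection: the connection the Glue job uses to reach Redshift
- IAM role for the Glue job: the job needs access to DynamoDB, Redshift, S3 and CloudWatch Logs (the managed policies attached in glue-etl.yml)
- S3 bucket for the Glue job (script and temporary files)
- Glue job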
How to deploy:

```
sls deploy -c glue-etl.yml
# replace ${bucketNamePrefix} with the bucket name prefix you set in glue-etl.yml
aws s3 cp dynamodb-to-redshift.py s3://com.${bucketNamePrefix}.glue-etl-temp-bucket/script/
```
Deployment file: glue-etl.yml
```yaml
service: dynamodb-to-redshift-glue-etl

custom:
  bucketNamePrefix: "jessica"

provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    DynamodbTableName:
      Type: String
      Default: "TestSyncToRedshift"
    ServiceName:
      Type: String
      Default: dynamodb-to-redshift
    GlueBucketName:
      Type: String
      Default: com.${self:custom.bucketNamePrefix}.glue-etl-temp-bucket

  Resources:
    TestTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: !Sub ${DynamodbTableName}
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: pk
            AttributeType: S
          - AttributeName: sk
            AttributeType: S
        KeySchema:
          - AttributeName: pk
            KeyType: HASH
          - AttributeName: sk
            KeyType: RANGE

    CrawlerRole:
      Type: AWS::IAM::Role
      Properties:
        RoleName: CrawlerRole
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Principal:
                Service:
                  - "glue.amazonaws.com"
              Action:
                - "sts:AssumeRole"
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          - arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
          - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
          - arn:aws:iam::aws:policy/AmazonS3FullAccess

    DynamodbDatabase:
      Type: AWS::Glue::Database
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseInput:
          Name: "dynamodb-database"
    DynamodbCrawler:
      Type: AWS::Glue::Crawler
      Properties:
        Name: "dynamodb-crawler"
        Configuration:
        Role: !GetAtt CrawlerRole.Arn
        DatabaseName: !Ref DynamodbDatabase
        Targets:
          DynamoDBTargets:
            - Path: !Sub ${DynamodbTableName}
        SchemaChangePolicy:
          UpdateBehavior: "UPDATE_IN_DATABASE"
          DeleteBehavior: "LOG"
        Schedule:
          ScheduleExpression: cron(0/10 * * * ? *) # run every 10 minutes

    GlueRedshiftConnection:
      Type: AWS::Glue::Connection
      Properties:
        CatalogId: !Sub "${aws:accountId}"
        ConnectionInput:
          Name: ${self:service}-redshift-connection
          ConnectionType: JDBC
          MatchCriteria: []
          PhysicalConnectionRequirements:
            SecurityGroupIdList:
              - Fn::ImportValue: !Sub ${ServiceName}-vpc-GlueSelfRefringSecurityGroup
            SubnetId:
              Fn::Select:
                - 1
                - Fn::Split:
                    - ","
                    - Fn::ImportValue: !Sub "${ServiceName}-vpc-PrivateSubnets"
          ConnectionProperties:
            JDBC_CONNECTION_URL:
              Fn::Join:
                - ""
                - - "jdbc:redshift://"
                  - Fn::ImportValue: !Sub ${ServiceName}-redshift-PrivateRedshiftEndpoint
                  - ":"
                  - Fn::ImportValue: !Sub ${ServiceName}-redshift-PrivateRedshiftPort
                  - "/dev"
            JDBC_ENFORCE_SSL: false
            USERNAME: admin
            PASSWORD: Redshift_admin_2022

    RedshiftDatabase:
      Type: AWS::Glue::Database
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseInput:
          Name: "redshift-database"
    RedshiftCrawler:
      Type: AWS::Glue::Crawler
      Properties:
        Name: "redshift-crawler"
        Configuration:
        Role: !GetAtt CrawlerRole.Arn
        DatabaseName: !Ref RedshiftDatabase
        Targets:
          JdbcTargets:
            - ConnectionName: !Ref GlueRedshiftConnection
              Path: dev/public/test_sync_to_redshift
        SchemaChangePolicy:
          UpdateBehavior: "UPDATE_IN_DATABASE"
          DeleteBehavior: "LOG"

    RedshiftGlueJobRole:
      Type: AWS::IAM::Role
      Properties:
        RoleName: RedshiftGlueJobRole
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Principal:
                Service:
                  - glue.amazonaws.com
              Action: sts:AssumeRole
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          - arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
          - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
          - arn:aws:iam::aws:policy/AmazonS3FullAccess
          - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess

    GlueTempBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: !Sub ${GlueBucketName}

    GlueJob:
      Type: AWS::Glue::Job
      Properties:
        Name: dynamodb-to-redshift-glue-etl-job
        Role: !GetAtt RedshiftGlueJobRole.Arn
        Command:
          Name: glueetl
          ScriptLocation: !Sub "s3://${GlueBucketName}/script/dynamodb-to-redshift.py"
          PythonVersion: 3
        DefaultArguments:
          --TempDir: !Sub "s3://${GlueBucketName}/tmp/dynamodb-to-redshift/"
        WorkerType: G.1X
        NumberOfWorkers: 2
        GlueVersion: "3.0"
        Connections:
          Connections:
            - !Ref GlueRedshiftConnection
```
Glue job script: dynamodb-to-redshift.py
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Arguments passed to the job: its name and the S3 temp dir (--TempDir)
params = [
    'JOB_NAME',
    'TempDir',
]
args = getResolvedOptions(sys.argv, params)
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table through the schema the DynamoDB crawler stored in the catalog
DynamoDBtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="dynamodb-database",
    table_name="testsynctoredshift",
    transformation_ctx="DynamoDBtable_node1",
)

# Write to the Redshift table registered by the Redshift crawler
RedshiftCluster_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame=DynamoDBtable_node1,
    database="redshift-database",
    table_name="dev_public_test_sync_to_redshift",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="RedshiftCluster_node2",
)

job.commit()
```
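The two `table_name` values in the script are the names under which the crawlers registered the tables in the Glue catalog. Judging only from the values used in this article, the DynamoDB table name appears lower-cased, and the Redshift path `dev/public/test_sync_to_redshift` appears with its parts joined by underscores. The helpers below merely mirror that observed naming so the script's names can be derived from the template values. They are hypothetical and not part of any Glue API, so check the Glue console for the actual names.

```python
def dynamodb_catalog_name(table_name: str) -> str:
    """Catalog name observed for a DynamoDB table: the lower-cased table name."""
    return table_name.lower()


def jdbc_catalog_name(include_path: str) -> str:
    """Catalog name observed for a JDBC target: database_schema_table."""
    parts = include_path.strip("/").split("/")
    return "_".join(part.lower() for part in parts)


if __name__ == "__main__":
    print(dynamodb_catalog_name("TestSyncToRedshift"))
    print(jdbc_catalog_name("dev/public/test_sync_to_redshift"))
```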
Testing
1. Insert some data into the DynamoDB table with the AWS web console first; otherwise the crawler cannot detect the table schema.
2. Run dynamodb-crawler. After it succeeds, you can see the database and table in the [glue console](https://ap-southeast-1.console.aws.amazon.com/glue/home?region=ap-southeast-1#catalog:tab=tables).
3. Create the Redshift table with [Redshift query editor v2](https://ap-southeast-1.console.aws.amazon.com/sqlworkbench/home?region=ap-southeast-1#/client):
```
CREATE TABLE "public"."test_sync_to_redshift"(pk varchar(200) not null, sk varchar(200) NOT NULL, primary key(pk, sk));
```
4. Run redshift-crawler. If you hit a "no valid connection" error, update the password in the redshift-connection manually in the AWS console; it is unclear why the password is not correct after deploying with CloudFormation. After the crawler succeeds, you can see the database and table in the [glue console](https://ap-southeast-1.console.aws.amazon.com/glue/home?region=ap-southeast-1#catalog:tab=tables).
5. Run the Glue ETL job. After it succeeds, you can check the data in the Redshift table with [Redshift query editor v2](https://ap-southeast-1.console.aws.amazon.com/sqlworkbench/home?region=ap-southeast-1#/client).
The Glue ETL job inserts all rows of the DynamoDB table into the Redshift table directly. In Redshift, [primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html), so if you run the job several times you will see duplicate rows with the same primary key in the Redshift table.
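One common way to avoid such duplicates in Redshift is to load each run into a staging table and then replace matching rows in the target inside a single transaction. The sketch below only builds the SQL statements. The staging table name and the function `build_merge_statements` are assumptions made for illustration, the staging table is assumed to have the same columns as the target, and how the statements are executed (query editor, a scheduled job, or similar) is left open.

```python
from typing import List, Sequence


def build_merge_statements(target: str, staging: str, key_columns: Sequence[str]) -> List[str]:
    """Build SQL for a delete-then-insert merge from a staging table.

    Assumes `staging` has the same columns as `target` and already holds
    the rows produced by the current run.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    match = " AND ".join(f"{target}.{col} = {staging}.{col}" for col in key_columns)
    return [
        "BEGIN;",
        # Remove the rows that are about to be replaced by newer copies.
        f"DELETE FROM {target} USING {staging} WHERE {match};",
        # Insert the fresh copies from the staging table.
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "COMMIT;",
    ]


if __name__ == "__main__":
    for statement in build_merge_statements(
        "test_sync_to_redshift", "test_sync_stage", ["pk", "sk"]
    ):
        print(statement)
```

Because the delete and the insert run in one transaction, readers never see the target table with the old rows removed but the new ones missing.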
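Syncing with an AWS Glue streaming job
When to use it
- Continuous incremental sync
- Inserts, updates and deletes on the table are all supported
Architecture diagram
Resources to deploy
- DynamoDB table
- VPC
- Redshift cluster
- Glue crawler
- Glue job
Advantages
- Inserts, updates and deletes on the table can all be synced
Disadvantages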
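Syncing with AWS Kinesis Data Firehose
When to use it
- Continuous incremental sync
- Only inserts are fully supported; updates are partially supported and deletes are not supported. Typical examples are sensor readings collected every second, or user behaviour events on a website
  - Kinesis Data Firehose integrates with Redshift through the Redshift COPY command, and Redshift does not enforce primary key uniqueness: to Redshift a primary key is informational only. If the source data of a COPY contains several records with the same primary key (for example one record that was modified several times), the Redshift table ends up with several rows sharing that primary key.
  - "Partially supported" means: if several rows with the same primary key do not affect your business logic, Kinesis Data Firehose can still be used; if they do, it cannot.
  - Deletes are not synced, because Kinesis Data Firehose integrates with Redshift through the COPY command and COPY does not support deleting rows.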
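The effect of an append-only load on updates can be illustrated with a small in-memory simulation. Nothing here talks to AWS: the list stands in for the Redshift table, and the `updated_at` column as well as both function names are assumptions made for this sketch. It shows why a record that is modified twice shows up as two rows, and how a consumer could still pick the latest version per key if duplicates are acceptable in storage.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def append_only_load(table: List[Row], batch: List[Row]) -> None:
    """Mimic an append-only load: every incoming row is added, keys are not checked."""
    table.extend(batch)


def latest_per_key(table: List[Row]) -> List[Row]:
    """Keep only the most recent row for each (pk, sk), ordered by updated_at."""
    latest: Dict[Tuple[object, object], Row] = {}
    for row in table:
        key = (row["pk"], row["sk"])
        if key not in latest or row["updated_at"] >= latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    table: List[Row] = []
    append_only_load(table, [{"pk": "sensor-1", "sk": "t1", "value": 10, "updated_at": 1}])
    append_only_load(table, [{"pk": "sensor-1", "sk": "t1", "value": 12, "updated_at": 2}])
    print(len(table))             # both versions are stored
    print(latest_per_key(table))  # only the newest version per key
```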
Architecture diagram
Resources to deploy
Reference
Setting up networking for development for AWS Glue - AWS Glue