ETL ON AWS: PART 4 (INGESTING STREAMING DATA WITH AMAZON KINESIS, SIMILAR TO APACHE KAFKA)
In this era of Big Data, where data of many types arrives in vast volumes and variety from many sources, there is a growing need to analyze some of that data in real time, a practice popularly called streaming data analytics.
Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.
This data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling. Information derived from such analysis gives companies visibility into many aspects of their business and customer activity, such as service usage (for metering/billing), server activity, website clicks, and geo-location of devices, people, and physical goods, and enables them to respond promptly to emerging situations. For example, businesses can track changes in public sentiment about their brands and products by continuously analyzing social media streams and respond in a timely fashion when the need arises.
SOME EXAMPLES OF STREAMING DATA
- Sensors in transportation vehicles, industrial equipment, and farm machinery send data to a streaming application. The application monitors performance, detects potential defects in advance, and places a spare-part order automatically, preventing equipment downtime.
- A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.
- A real-estate website tracks a subset of data from consumers’ mobile devices and makes real-time recommendations of properties to visit based on their geo-location.
In the hands-on portion of this article, we use Amazon Kinesis to demonstrate how to handle streaming data.
BACKGROUND
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. Amazon Kinesis offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application. With Amazon Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications. Amazon Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin.
The Amazon Kinesis service has four major components:
- Kinesis Video Streams: for streaming video from connected devices for analytics, machine learning, and other processing
- Kinesis Data Streams: a scalable real-time data streaming service that can continuously capture gigabytes of data per second from many sources
- Kinesis Data Firehose: for loading and transforming data streams into AWS data stores for near real-time analytics
- Kinesis Data Analytics: for processing data streams in real time using the popular SQL language or Apache Flink
We will be using Kinesis Data Firehose in this session.
Benefits of Kinesis Data Firehose: it loads real-time streaming data into data lakes, data stores, and analytics tools for:
- Log and event analytics
- IoT data analytics
- Clickstream analytics
- Security monitoring
PROCEDURE:
- Amazon Kinesis Data Firehose to ingest streaming data and write the data out to Amazon S3
- Amazon Kinesis Data Generator (KDG) to generate the source streaming data
DELIVERY STREAM
- S3 bucket prefix: streaming/!{timestamp:yyyy/MM/}
- S3 bucket error prefix: !{firehose:error-output-type}/!{timestamp:yyyy/MM/}
- For this use case we set the buffer size to 1 MB and the buffer interval to 60 seconds (a sketch of creating the delivery stream with these settings follows)
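If you prefer to create the delivery stream from code rather than the console, a minimal boto3 sketch using the settings above might look like the following. The role ARN, bucket ARN, and account ID are placeholders rather than values from this walkthrough; substitute your own.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Create a DirectPut delivery stream that buffers for 1 MB / 60 seconds
# and writes to S3 under the streaming/ prefix. ARNs below are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="etl3-delivery",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-etl-bucket",                           # placeholder
        "Prefix": "streaming/!{timestamp:yyyy/MM/}",
        "ErrorOutputPrefix": "!{firehose:error-output-type}/!{timestamp:yyyy/MM/}",
        "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
        "CompressionFormat": "UNCOMPRESSED",
    },
)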
STREAMING
The Sakila database from Part 2 of this series belonged to a company that produced classic movies. Those movies are now available for rental through various streaming platforms, and the company receives information about its classic movies being streamed from its distribution partners in real time, in a standard format. Using the KDG, the streaming data includes the following (a sketch of sending one such record appears after the list):
- Streaming timestamp
- Whether the customer rented, purchased, or watched the trailer
- film_id that matches the Sakila film database
- The distribution partner name
- The streaming platform
- The state the movie was streamed in
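To make the record shape concrete, here is a minimal sketch of sending a single record of this form to the delivery stream with boto3. In the walkthrough the KDG does this for you; the field values below are purely illustrative.

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# One illustrative record with the fields listed above (values are made up).
record = {
    "timestamp": "2023-05-14 09:21:33",
    "eventType": "rent",           # rent, buy, or trailer
    "film_id": 42,                 # matches film_id in the Sakila film table
    "distributor": "amazon prime",
    "platform": "ios",
    "state": "Ohio",
}

# Firehose buffers the record and delivers it to S3 under the streaming/ prefix.
firehose.put_record(
    DeliveryStreamName="etl3-delivery",
    Record={"Data": json.dumps(record).encode("utf-8")},
)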
The Amazon Kinesis Data Generator (KDG) is an open-source tool that makes it easy to send test data to Kinesis Data Streams and Kinesis Data Firehose.
KDG CloudFormation template
The CloudFormation template will create the following resources in your AWS account:
- An IAM role that gives the Lambda function permission to create Cognito resources.
- An IAM role that is assigned to authenticated Cognito users. This role has only enough permission to use the KDG.
- An IAM role that is assigned to unauthenticated Cognito users. This role has only enough permission to create Cognito analytics events.
- A Lambda function.
The Lambda function will create the following resources in your AWS account:
- A Cognito User Pool.
- A Cognito Federated Identity Pool.
- A Cognito User, with the username and password specified by you when you created the CloudFormation stack.
- The necessary relationships between the roles, users, and pools.
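If you want to launch the KDG stack from code instead of the CloudFormation console, a minimal boto3 sketch might look like the following. The template URL is a placeholder (use the link published in the KDG documentation), and this assumes the template exposes Username and Password parameters for the Cognito user it creates.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Launch the KDG Cognito stack. CAPABILITY_IAM is required because the
# template creates IAM roles. The TemplateURL below is a placeholder.
cfn.create_stack(
    StackName="Kinesis-Data-Generator-Cognito-User",
    TemplateURL="https://example.com/kinesis-data-generator-cognito.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "Username", "ParameterValue": "kdg-admin"},
        {"ParameterKey": "Password", "ParameterValue": "choose-a-strong-password"},
    ],
    Capabilities=["CAPABILITY_IAM"],
)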
KDG URL: <my-url>
Region: us-east-1
Delivery Stream: etl3-delivery
Records per second: 10
Record template
{
  "timestamp": "{{date.now}}",
  "eventType": "{{random.weightedArrayElement(
    {
      "weights": [0.3, 0.1, 0.6],
      "data": ["rent", "buy", "trailer"]
    }
  )}}",
  "film_id": {{random.number(
    {
      "min": 1,
      "max": 1000
    }
  )}},
  "distributor": "{{random.arrayElement(
    ["amazon prime", "google play", "apple itunes", "twitter",
     "microsoft", "meta", "youtube"]
  )}}",
  "platform": "{{random.arrayElement(
    ["ios", "android", "xbox", "social-media", "tech", "other"]
  )}}",
  "state": "{{address.state}}"
}
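Once the KDG has been sending data for a minute or more, Firehose should begin delivering objects to S3 under the streaming/ prefix. The sketch below is one way to spot-check the delivered data with boto3; the bucket name and month prefix are placeholders, and it assumes records may be concatenated back-to-back without newlines (the template above does not add a delimiter).

import json
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "my-etl-bucket"  # placeholder bucket name

# List the objects Firehose delivered for an example month under the prefix.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="streaming/2023/05/")

decoder = json.JSONDecoder()
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode("utf-8")
    pos = 0
    while pos < len(body):
        # Skip any whitespace between records, then decode the next JSON object.
        while pos < len(body) and body[pos].isspace():
            pos += 1
        if pos >= len(body):
            break
        record, pos = decoder.raw_decode(body, pos)
        print(record["film_id"], record["eventType"], record["state"])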