Lesson 3 Storage

Pragmatic AI Labs

This notebook was produced by Pragmatic AI Labs.

3.1 Determine and optimize the operational characteristics of the storage solution

The Three “Vs” of Big Data: Variety, Velocity and Volume

Big Data Challenges

Variety

Velocity

Volume

3.2 Determine data access and retrieval patterns

Batch vs Streaming Data

Impact on ML Pipeline

  • Batch gives more control over model training (you decide when to retrain)
  • Continuously retraining a model can improve predictions or degrade them
  • Did the input stream suddenly gain or lose users?
  • Is there an A/B testing scenario?

Batch

  • Data is batched at intervals
  • Simplest approach to create predictions
  • Many AWS services are capable of batch processing:
      • AWS Glue
      • AWS Data Pipeline
      • AWS Batch
      • EMR
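
To make "batched at intervals" concrete, here is a minimal sketch in plain Python that does not depend on any AWS service. The function name `batch_by_interval`, the one-hour window, and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def batch_by_interval(events, interval=timedelta(hours=1)):
    """Group (timestamp, value) events into fixed-width time windows."""
    epoch = datetime(1970, 1, 1)
    batches = defaultdict(list)
    for timestamp, value in events:
        # Floor each timestamp to the start of the window it falls in.
        window_index = (timestamp - epoch) // interval
        window_start = epoch + window_index * interval
        batches[window_start].append(value)
    return dict(sorted(batches.items()))


# Hypothetical sample data: three readings spread over two hours.
events = [
    (datetime(2018, 3, 2, 12, 5), 3),
    (datetime(2018, 3, 2, 12, 40), 5),
    (datetime(2018, 3, 2, 13, 10), 7),
]
batches = batch_by_interval(events)
for window_start, values in batches.items():
    # Each batch is processed as a unit, for example to score or retrain.
    print(window_start, sum(values) / len(values))
```

Each window is handed to the model as a unit, which is why batch prediction is the simpler case: nothing has to be updated between windows.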

Streaming

  • Continuously polled or pushed
  • More complex method of prediction
  • Many AWS services are capable of streaming:
      • Kinesis
      • IoT
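
By contrast, a streaming consumer updates its state one record at a time. The sketch below uses a plain Python generator as a stand-in for a real stream such as Kinesis; `sensor_readings` and `running_mean` are hypothetical names, and the running mean stands in for whatever statistic or model is being kept up to date.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one record at a time."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier records do not need to stay in memory.
        mean += (value - mean) / count
        yield mean


def sensor_readings():
    """Hypothetical stand-in for a continuously pushed or polled stream."""
    yield from [3, 5, 7]


means = list(running_mean(sensor_readings()))
print(means)  # [3.0, 4.0, 5.0]
```

The extra complexity of streaming shows up here in miniature: state has to be carried between records, and every new record can change the result.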

3.3 Evaluate mechanisms for capture, update, and retrieval of catalog entries

[Omnigraffle or Whiteboard Demo]

3.4 Determine appropriate data structure and storage format

[Omnigraffle or Whiteboard Demo]

3.5 Understand Storage & Database Fundamentals

Data Storage Concepts

Database Overview

Database Styles

3.6 Learn S3 - storage

[Demo]

3.7 Understand Glacier - backup & archive

[Demo]

3.8 Create AWS Glue - data catalog

AWS Glue

AWS Glue is a fully managed ETL (extract, transform, load) service.

AWS Glue Screen

AWS Glue Workflow

  • Build Data Catalog
  • Generate and Edit Transformations
  • Schedule and Run Jobs
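
The three workflow steps can be sketched with the boto3 Glue client. This is a rough outline under stated assumptions, not the demo's actual code: the function name `run_glue_workflow` and all of its arguments are hypothetical, the ETL job named by `job_name` is assumed to exist already, and the client is passed in so that nothing touches AWS at import time.

```python
def run_glue_workflow(glue, crawler_name, role_arn, database, s3_path, job_name):
    """Catalog the data under s3_path, then start an existing ETL job.

    `glue` is expected to be a boto3 Glue client, for example:
        import boto3
        glue = boto3.client("glue")
    """
    # 1. Build the Data Catalog: a crawler infers table schemas from S3.
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=crawler_name)

    # 2. Generate and edit transformations: done in the Glue console or in a
    #    script; the job named job_name is assumed to be defined already.

    # 3. Run the job on demand. A real workflow would first wait for the
    #    crawler to finish, or attach a schedule or trigger instead.
    response = glue.start_job_run(JobName=job_name)
    return response["JobRunId"]
```

Passing the client in as an argument also makes the function easy to exercise with a stub object before pointing it at a real account.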

[DEMO] AWS Glue

3.9 Use DynamoDB

Using AWS DynamoDB

https://aws.amazon.com/dynamodb/

Query Example:

# dynamodb_resource, REGION, POLICE_DEPARTMENTS_TABLE and log are not
# defined in this excerpt; they are assumed to come from the surrounding project.
def query_police_department_record_by_guid(guid):
    """Gets one record in the PD table by guid
    
    In [5]: rec = query_police_department_record_by_guid(
        "7e607b82-9e18-49dc-a9d7-e9628a9147ad"
        )
    
    In [7]: rec
    Out[7]: 
    {'PoliceDepartmentName': 'Hollister',
     'UpdateTime': 'Fri Mar  2 12:43:43 2018',
     'guid': '7e607b82-9e18-49dc-a9d7-e9628a9147ad'}
    """
    
    db = dynamodb_resource()
    # Structured fields attached to the log record via `extra`
    extra_msg = {
        "region_name": REGION,
        "aws_service": "dynamodb",
        "police_department_table": POLICE_DEPARTMENTS_TABLE,
        "guid": guid,
    }
    log.info("Get PD record by GUID", extra=extra_msg)
    pd_table = db.Table(POLICE_DEPARTMENTS_TABLE)
    # Look up a single item by its primary key
    response = pd_table.get_item(Key={'guid': guid})
    # Raises KeyError if no item matches the guid
    return response['Item']
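
One detail worth knowing about `get_item`: when no item matches the key, the response simply has no `'Item'` entry. A minimal variant that returns `None` in that case might look like the sketch below. The name `get_record_or_none` is hypothetical, and the table object is passed in so the function can work with any boto3 `Table` resource, or with a stub when experimenting.

```python
def get_record_or_none(table, guid):
    """Return the item stored under guid, or None when no item matches.

    `table` is expected to behave like a boto3 DynamoDB Table resource, e.g.
        import boto3
        table = boto3.resource("dynamodb").Table("my-table")  # hypothetical name
    """
    response = table.get_item(Key={"guid": guid})
    # The "Item" key is absent, not None, when the lookup finds nothing.
    return response.get("Item")
```

Whether a missing record should return `None` or raise an error is a design choice; returning `None` pushes the check onto the caller but avoids a bare `KeyError`.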

Case Study DynamoDB

[Demo] DynamoDB