
Lesson 3 Storage

Pragmatic AI Labs


This notebook was produced by Pragmatic AI Labs.

3.1 Determine and optimize the operational characteristics of the storage solution

The Three “Vs” of Big Data: Variety, Velocity and Volume

Big Data Challenges




3.2 Determine data access and retrieval patterns

Batch vs Streaming Data

Impact on ML Pipeline

  • Batch gives more control over model training (you decide when to retrain)
  • Continuously retraining the model could produce better or worse predictions
  • Did the input stream suddenly get more or fewer users?
  • Is there an A/B testing scenario?
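
The question of whether the input stream suddenly changed can be made concrete by comparing user counts between two time windows before deciding to retrain. This is a minimal sketch only: the function name and the 50% threshold are illustrative assumptions, not part of the lesson.

```python
def user_count_shifted(previous_users, current_users, threshold=0.5):
    """Return True when the user count changed by more than `threshold`,
    measured as a fraction of the previous window's count.

    The default threshold of 0.5 (50%) is an arbitrary illustrative value.
    """
    if previous_users == 0:
        # Any traffic after an empty window counts as a shift.
        return current_users > 0
    relative_change = abs(current_users - previous_users) / previous_users
    return relative_change > threshold
```

A check like this could gate retraining in a streaming pipeline, so that a sudden spike or drop in users is reviewed before the model learns from it.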


Batch

  • Data is batched at intervals
  • Simplest approach to create predictions
  • Many services on AWS are capable of batch processing:
      • AWS Glue
      • AWS Data Pipeline
      • AWS Batch
      • EMR
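
Batching at intervals can be illustrated without any AWS service. The sketch below groups records into fixed-width time windows; the function name, the `(epoch_seconds, payload)` record shape, and the one-hour default are assumptions made for illustration.

```python
from collections import defaultdict


def batch_by_interval(records, interval_seconds=3600):
    """Group (epoch_seconds, payload) records into fixed-width time windows.

    Returns a dict that maps each window's start time (epoch seconds)
    to the list of payloads that fall inside that window.
    """
    batches = defaultdict(list)
    for epoch_seconds, payload in records:
        # Floor the timestamp to the start of its window.
        whole_seconds = int(epoch_seconds)
        window_start = whole_seconds - (whole_seconds % interval_seconds)
        batches[window_start].append(payload)
    return dict(batches)
```

Each resulting batch could then be handed to a scheduled job, which is the pattern the services listed above automate at scale.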


Streaming

  • Continuously polled or pushed
  • More complex method of prediction
  • Many services on AWS are capable of streaming:
      • Kinesis
      • IoT
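
As a sketch of the "pushed" case, the function below sends one event to a Kinesis data stream. It assumes a boto3 Kinesis client is passed in (for example, one created with `boto3.client("kinesis")`); the function name, stream name, and event shape are hypothetical.

```python
import json


def send_event(kinesis_client, stream_name, event, partition_key):
    """Push one JSON-encoded event onto a Kinesis data stream.

    `kinesis_client` is assumed to be a boto3 Kinesis client; `put_record`
    takes the stream name, the payload as bytes, and a partition key.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without AWS credentials.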

3.3 Evaluate mechanisms for capture, update, and retrieval of catalog entries

[Omnigraffle or Whiteboard Demo]

3.4 Determine appropriate data structure and storage format

[Omnigraffle or Whiteboard Demo]

3.5 Understand Storage & Database Fundamentals

Data Storage Concepts

Database Overview

Database Styles

3.6 Learn S3 - storage


3.7 Understand Glacier - backup & archive


3.8 Create AWS Glue - data catalog

AWS Glue

AWS Glue is a fully managed ETL service.

AWS Glue Screen

AWS Glue Workflow

  • Build Data Catalog
  • Generate and Edit Transformations
  • Schedule and Run Jobs
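
The first and last steps of this workflow can also be triggered from code. The sketch below assumes a boto3 Glue client is passed in (for example, `boto3.client("glue")`) and that a crawler and a job with the given names already exist; the function name and both resource names are hypothetical.

```python
def run_glue_workflow(glue_client, crawler_name, job_name):
    """Start a crawler to refresh the Data Catalog, then start an ETL job run.

    Note: start_crawler returns immediately, so this sketch does not wait
    for the crawl to finish before starting the job. A real pipeline would
    poll the crawler state or use Glue triggers to order the two steps.
    """
    glue_client.start_crawler(Name=crawler_name)
    response = glue_client.start_job_run(JobName=job_name)
    # The job run ID can be used later to check the run's status.
    return response["JobRunId"]
```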


3.9 Use DynamoDB

Using AWS DynamoDB


Query Example:

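import logging

log = logging.getLogger(__name__)

# REGION, POLICE_DEPARTMENTS_TABLE and dynamodb_resource() are assumed to be
# defined elsewhere in the module; they are not shown in this notebook.
def query_police_department_record_by_guid(guid):
    """Gets one record in the PD table by guid

    In [5]: rec = query_police_department_record_by_guid(
                "7e607b82-9e18-49dc-a9d7-e9628a9147ad")
    In [7]: rec
    {'PoliceDepartmentName': 'Hollister',
     'UpdateTime': 'Fri Mar  2 12:43:43 2018',
     'guid': '7e607b82-9e18-49dc-a9d7-e9628a9147ad'}
    """
    db = dynamodb_resource()
    extra_msg = {"region_name": REGION, "aws_service": "dynamodb",
                 "guid": guid}
    log.info("Get PD record by GUID", extra=extra_msg)
    pd_table = db.Table(POLICE_DEPARTMENTS_TABLE)
    # get_item looks up a single item by its primary key.
    response = pd_table.get_item(Key={'guid': guid})
    # Raises KeyError if no item with this guid exists.
    return response['Item']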
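
Because `get_item` leaves the `Item` key out of its response when no item matches, `response['Item']` raises `KeyError` for an unknown guid. A minimal sketch of a more forgiving lookup follows; it assumes a boto3 DynamoDB `Table` resource is passed in, and the function name is hypothetical.

```python
def get_record_or_none(table, guid):
    """Return the item with this guid, or None if the table has no such item.

    `table` is assumed to be a boto3 DynamoDB Table resource whose
    partition key is named "guid".
    """
    response = table.get_item(Key={"guid": guid})
    # DynamoDB omits the "Item" key from the response when nothing matches.
    return response.get("Item")
```

Whether a missing record should return `None` or raise is a design choice; returning `None` suits callers that treat "not found" as a normal outcome.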

Case Study DynamoDB


[Demo] DynamoDB