Lesson 3 Storage
Pragmatic AI Labs
This notebook was produced by Pragmatic AI Labs. You can continue learning about these topics by:
- Buying a copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
- Reading an online copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
- Watching the video Essential Machine Learning and AI with Python and Jupyter Notebook on Safari Books Online
- Watching the video AWS Certified Machine Learning-Specialty
- Purchasing the video Essential Machine Learning and AI with Python and Jupyter Notebook
- Viewing more content at noahgift.com
3.1 Determine and optimize the operational characteristics of the storage solution
The Three “Vs” of Big Data: Variety, Velocity and Volume
Variety
Velocity
Volume
3.2 Determine data access and retrieval patterns
Batch vs Streaming Data
Impact on ML Pipeline
- More control of model training in batch (can decide when to retrain)
- Continuously retraining the model could produce better or worse predictions
- Did the input stream suddenly gain or lose users?
- Is there an A/B testing scenario?
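One way to make the "did the input stream suddenly gain or lose users?" question concrete is to compare recent traffic with a baseline before trusting a retrained model. The following is a minimal sketch only: the function name `user_volume_shift`, the 50% tolerance, and the sample counts are all assumptions made for illustration, not part of the lesson.

```python
from statistics import mean


def user_volume_shift(baseline_counts, recent_counts, tolerance=0.5):
    """Return (relative_change, shifted) for users per interval.

    relative_change is (recent mean - baseline mean) / baseline mean.
    shifted is True when the absolute change exceeds the tolerance.
    """
    baseline = mean(baseline_counts)
    if baseline == 0:
        raise ValueError("baseline must contain non-zero traffic")
    recent = mean(recent_counts)
    change = (recent - baseline) / baseline
    return change, abs(change) > tolerance


# Illustrative numbers: traffic roughly doubles in the recent window.
change, shifted = user_volume_shift([100, 110, 90], [200, 210, 190])
print(f"relative change: {change:+.0%}, review before retraining: {shifted}")
```

A flagged shift does not say whether retraining helps or hurts. It only signals that the input data changed enough to deserve a look, for example because an A/B test started.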
Batch
- Data is batched at intervals
- Simplest approach to create predictions
- Many AWS services are capable of batch processing
- AWS Glue
- AWS Data Pipeline
- AWS Batch
- EMR
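The idea of batching data at intervals can be sketched in plain Python without any AWS service. This is a hedged illustration: `batch_by_hour`, `predict_batch`, the hourly interval, and the sample events are assumptions, and the batch mean merely stands in for a real model.

```python
from collections import defaultdict
from datetime import datetime


def batch_by_hour(events):
    """Group (timestamp, value) events into batches keyed by the hour start."""
    batches = defaultdict(list)
    for timestamp, value in events:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        batches[hour_start].append(value)
    return dict(batches)


def predict_batch(values):
    """Placeholder model: the batch mean stands in for a real prediction."""
    return sum(values) / len(values)


# Illustrative events: two land in the 12:00 batch, one in the 13:00 batch.
events = [
    (datetime(2018, 3, 2, 12, 5), 10.0),
    (datetime(2018, 3, 2, 12, 40), 20.0),
    (datetime(2018, 3, 2, 13, 15), 30.0),
]
hourly = batch_by_hour(events)
predictions = {hour: predict_batch(values) for hour, values in hourly.items()}
for hour, prediction in sorted(predictions.items()):
    print(hour.isoformat(), prediction)
```

Because each batch is complete before prediction runs, the job can be scheduled, rerun, and inspected at a chosen time, which is where the extra control over retraining comes from.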
Streaming
- Data is continuously polled or pushed
- More complex method of prediction
- Many AWS services are capable of streaming
- Kinesis
- IoT
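By contrast with batch processing, a streaming consumer updates its output as each record arrives. The sketch below simulates that with a generator. `simulated_stream` is a hypothetical stand-in for records pushed from a service such as Kinesis, and the running mean is only a placeholder for a real model update.

```python
def streaming_mean(stream):
    """Yield the running mean after each record arrives."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def simulated_stream():
    """Hypothetical stand-in for records pushed from a streaming source."""
    yield from [10.0, 20.0, 30.0]


# The estimate changes with every record instead of once per batch.
running = list(streaming_mean(simulated_stream()))
print(running)
```

The consumer has to keep state between records (here `count` and `total`), which is one reason streaming prediction is more complex than the batch approach.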
3.3 Evaluate mechanisms for capture, update, and retrieval of catalog entries
[Omnigraffle or Whiteboard Demo]
3.4 Determine appropriate data structure and storage format
[Omnigraffle or Whiteboard Demo]
3.5 Understand Storage & Database Fundamentals
Data Storage Concepts
Database Overview
3.6 Learn S3 - storage
[Demo]
3.7 Understand Glacier - backup & archive
[Demo]
3.8 Create AWS Glue - data catalog
AWS Glue
AWS Glue is a fully managed ETL service
AWS Glue Workflow
- Build Data Catalog
- Generate and Edit Transformations
- Schedule and Run Jobs
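The first and last workflow steps can be sketched against the boto3 Glue client, which provides `create_crawler`, `start_crawler`, and `start_job_run`. This is a rough outline under stated assumptions: the crawler name, role ARN, database, S3 path, and job name are hypothetical, and `RecordingGlueClient` is a stand-in so the sketch runs without an AWS account. Generating and editing transformations is not shown, and neither is scheduling.

```python
class RecordingGlueClient:
    """Hypothetical stand-in for boto3.client("glue"); records calls only."""

    def __init__(self):
        self.calls = []

    def create_crawler(self, **kwargs):
        self.calls.append(("create_crawler", kwargs))
        return {}

    def start_crawler(self, **kwargs):
        self.calls.append(("start_crawler", kwargs))
        return {}

    def start_job_run(self, **kwargs):
        self.calls.append(("start_job_run", kwargs))
        return {"JobRunId": "jr_example"}


def build_catalog(glue, crawler_name, role_arn, database, s3_path):
    """Step 1: create and start a crawler that populates the Data Catalog."""
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=crawler_name)


def run_job(glue, job_name):
    """Step 3: start an on-demand run of an existing job; return its run id."""
    response = glue.start_job_run(JobName=job_name)
    return response["JobRunId"]


# Swap in boto3.client("glue") to run against a real account.
glue = RecordingGlueClient()
build_catalog(
    glue,
    crawler_name="lesson3-crawler",  # hypothetical name
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # placeholder
    database="lesson3_db",  # hypothetical database
    s3_path="s3://example-bucket/raw/",  # hypothetical bucket
)
run_id = run_job(glue, "lesson3-etl-job")  # hypothetical job name
print(run_id)
```

Passing the client in as a parameter keeps the workflow functions testable offline. The same functions accept a real boto3 client unchanged.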
[DEMO] AWS Glue
3.9 Use DynamoDB
Using AWS DynamoDB
https://aws.amazon.com/dynamodb/
Query Example:
def query_police_department_record_by_guid(guid):
    """Gets one record in the PD table by guid

    In [5]: rec = query_police_department_record_by_guid(
        "7e607b82-9e18-49dc-a9d7-e9628a9147ad"
        )

    In [7]: rec
    Out[7]:
    {'PoliceDepartmentName': 'Hollister',
     'UpdateTime': 'Fri Mar 2 12:43:43 2018',
     'guid': '7e607b82-9e18-49dc-a9d7-e9628a9147ad'}
    """

    # dynamodb_resource, REGION, POLICE_DEPARTMENTS_TABLE, and log are
    # assumed to be defined elsewhere in the project.
    db = dynamodb_resource()
    extra_msg = {
        "region_name": REGION,
        "aws_service": "dynamodb",
        "police_department_table": POLICE_DEPARTMENTS_TABLE,
        "guid": guid,
    }
    log.info("Get PD record by GUID", extra=extra_msg)
    pd_table = db.Table(POLICE_DEPARTMENTS_TABLE)
    # Look up a single item by its primary key.
    response = pd_table.get_item(Key={"guid": guid})
    # Raises KeyError when no item matches: the response then has no 'Item' key.
    return response["Item"]
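A `get_item` response omits the `Item` key when no record matches, so a caller may prefer a lookup that returns `None` rather than raising `KeyError`. The sketch below shows one possible variant. `get_record_or_none` and `InMemoryTable` are hypothetical names introduced here, and the in-memory table only mimics the response shape of a boto3 DynamoDB `Table` so the example runs without AWS.

```python
def get_record_or_none(table, guid):
    """Return the item stored under guid, or None when it does not exist."""
    response = table.get_item(Key={"guid": guid})
    return response.get("Item")


class InMemoryTable:
    """Hypothetical stand-in that mimics Table.get_item's response shape."""

    def __init__(self, items):
        self._items = {item["guid"]: item for item in items}

    def get_item(self, Key):
        item = self._items.get(Key["guid"])
        # Like DynamoDB, leave out 'Item' entirely when nothing matches.
        return {"Item": item} if item is not None else {}


table = InMemoryTable(
    [
        {
            "PoliceDepartmentName": "Hollister",
            "UpdateTime": "Fri Mar 2 12:43:43 2018",
            "guid": "7e607b82-9e18-49dc-a9d7-e9628a9147ad",
        }
    ]
)
found = get_record_or_none(table, "7e607b82-9e18-49dc-a9d7-e9628a9147ad")
missing = get_record_or_none(table, "no-such-guid")
print(found["PoliceDepartmentName"], missing)
```

With a real table, the same function could be called as `get_record_or_none(db.Table(POLICE_DEPARTMENTS_TABLE), guid)`.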