Lesson 5: Python Data Structures
Pragmatic AI Labs
This notebook was produced by Pragmatic AI Labs. You can continue learning about these topics by:
- Buying a copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
- Reading an online copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
- Watching the video Essential Machine Learning and AI with Python and Jupyter Notebook on Safari Books Online
- Watching the video AWS Certified Machine Learning-Specialty
- Purchasing the video Essential Machine Learning and AI with Python and Jupyter Notebook
- Viewing more content at noahgift.com
5.1 Use lists and tuples
Sequences
Lists, tuples, and strings are all Python sequences, and share many of the same methods.
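As a small illustration of that shared behavior, the sketch below applies the same operations to a list, a tuple, and a string. The variable names are only illustrative and do not appear elsewhere in this lesson.

```python
# The same sequence operations work on lists, tuples, and strings.
as_list = ['a', 'b', 'c', 'b']
as_tuple = ('a', 'b', 'c', 'b')
as_string = 'abcb'

results = []
for seq in (as_list, as_tuple, as_string):
    # len, index, count, and slicing are common to all three sequence types
    results.append((len(seq), seq.index('c'), seq.count('b'), seq[1:3]))

for row in results:
    print(row)
```

Note that a slice keeps the type of the sequence it came from: a list slice is a list, a tuple slice is a tuple, and a string slice is a string.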
Creating an empty list
empty = []
empty
[]
Using square brackets with initial values
numbers = [1, 2, 3]
numbers
[1, 2, 3]
Casting an iterable
Any iterable can be cast to a list
numbers = list(range(10))
numbers
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Creating using multiplication
num_players = 10
scores = [0] * num_players
scores
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
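One caveat worth knowing when using multiplication, sketched here under the assumption that you might want a list of lists: multiplication repeats references to the same object, so nested mutable items end up shared. A list comprehension builds independent inner lists instead. The names `shared` and `independent` are hypothetical.

```python
# Multiplying a list repeats references; it does not copy the items.
shared = [[0]] * 3
shared[0].append(1)        # mutates the single inner list all three slots point to

# A comprehension builds a separate inner list for each slot.
independent = [[0] for _ in range(3)]
independent[0].append(1)   # only the first inner list changes

print(shared)       # [[0, 1], [0, 1], [0, 1]]
print(independent)  # [[0, 1], [0], [0]]
```

With immutable items such as the integer 0 in `scores`, the sharing is harmless, because assigning to one position replaces the reference rather than mutating the object.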
Mixing data types
Lists can contain multiple data types
mixed = ['a', 1, 2.0, [13], {}]
mixed
['a', 1, 2.0, [13], {}]
Indexing
Items in lists can be accessed using indices in a similar fashion to strings.
Access first item
numbers[0]
0
Access last item
numbers[-1]
9
Access any item
numbers[4]
4
Adding to a list
Append to the end of a list
letters = ['a']
letters.append('c')
letters
['a', 'c']
Insert at beginning of list
letters.insert(0, 'b')
letters
['b', 'a', 'c']
Insert at arbitrary position
letters.insert(2, 'c')
letters
['b', 'a', 'c', 'c']
Extending with another list
more_letters = ['e', 'f', 'g']
letters.extend(more_letters)
letters
['b', 'a', 'c', 'c', 'e', 'f', 'g']
Change item at some position
letters[3] = 'd'
letters
['b', 'a', 'c', 'd', 'e', 'f', 'g']
Swap two items
letters[0], letters[1] = letters[1], letters[0]
letters
['a', 'b', 'c', 'd', 'e', 'f', 'g']
Removing items from a list
Pop from the end
letters = ['a', 'b', 'c', 'd', 'e', 'f']
letters.pop()
letters
['a', 'b', 'c', 'd', 'e']
Pop by index
letters.pop(2)
letters
['a', 'b', 'd', 'e']
Remove specific item
letters.remove('d')
letters
['a', 'b', 'e']
Create tuple using parentheses
tup = (1, 2, 3)
tup
(1, 2, 3)
Create tuple with commas
tup = 1, 2, 3
tup
(1, 2, 3)
Create empty tuple
tup = ()
tup
()
Create tuple with single item
tup = 1,
tup
(1,)
Behaviors shared by lists and tuples
The following sequence behaviors are shared by lists and tuples
Check item in sequence
3 in (1, 2, 3, 4, 5)
True
Check item not in sequence
'a' not in [1, 2, 3, 4, 5]
True
Slicing
Setting start, slice to the end
letters = 'a', 'b', 'c', 'd', 'e', 'f'
letters[3:]
('d', 'e', 'f')
Set end, slice from beginning
letters[:4]
('a', 'b', 'c', 'd')
Index from end of sequence
letters[-4:]
('c', 'd', 'e', 'f')
Setting step (a negative step walks backward from the start index)
letters[1::-2]
('b',)
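Unpacking
The number of names on the left must match the number of items in the sequence; otherwise Python raises a ValueError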
first, middle = [1, 2, 3]
f"first = {first}, middle = {middle}, last = {last}"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-c24a37f354b9> in <module>()
----> 1 first, middle = [1, 2, 3]
2
3 f"first = {first}, middle = {middle}, last = {last}"
ValueError: too many values to unpack (expected 2)
Extended unpacking
first, *middle, last = (1, 2, 3, 4, 5)
f"first = {first}, middle = {middle}, last = {last}"
'first = 1, middle = [2, 3, 4], last = 5'
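For contrast with the failing cell earlier in this section, here is a minimal sketch of unpacking that succeeds, plus two more starred cases. The names `head`, `rest`, `only`, and `nothing_left` are illustrative.

```python
# Matching the number of names to the number of items unpacks cleanly.
first, middle, last = [1, 2, 3]

# A starred name collects any leftover items into a list.
head, *rest = (1, 2, 3, 4, 5)

# If nothing is left over, the starred name is an empty list.
only, *nothing_left = [42]

print(first, middle, last)   # 1 2 3
print(head, rest)            # 1 [2, 3, 4, 5]
print(only, nothing_left)    # 42 []
```

The starred name always receives a list, even when the source is a tuple.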
Using list as Stack
A stack is a LIFO (last in, first out) data structure which can be simulated using a list
Push onto the stack using append
stack = []
stack.append('first on')
stack.append('second on')
stack.append('third on')
stack
['first on', 'second on', 'third on']
Retrieve items, last one first using pop
f"Retrieved first: {stack.pop()!r}, retrieved second: {stack.pop()!r}, retrieved last: {stack.pop()!r}"
"Retrieved first: 'third on', retrieved second: 'second on', retrieved last: 'first on'"
5.2 Explore dictionaries
Dictionaries are mappings of key value pairs.
Create an empty dict using the constructor
dictionary = dict()
dictionary
{}
Create a dictionary based on key/value pairs
key_values = [['key-1','value-1'], ['key-2', 'value-2']]
dictionary = dict(key_values)
dictionary
{'key-1': 'value-1', 'key-2': 'value-2'}
Create an empty dict using curly braces
dictionary = {}
dictionary
{}
Use curly braces to create a dictionary with initial key/values
dictionary = {'key-1': 'value-1',
              'key-2': 'value-2'}
dictionary
{'key-1': 'value-1', 'key-2': 'value-2'}
Access value using key
dictionary['key-1']
'value-1'
Add a key/value pair to an existing dictionary
dictionary['key-3'] = 'value-3'
dictionary
{'key-1': 'value-1', 'key-2': 'value-2', 'key-3': 'value-3'}
Update value for existing key
dictionary['key-2'] = 'new-value-2'
dictionary['key-2']
'new-value-2'
Get keys
list(dictionary.keys())
['key-1', 'key-2', 'key-3']
Get values
dictionary.values()
dict_values(['value-1', 'new-value-2', 'value-3'])
Get an iterable of key/value pairs
dictionary.items()
dict_items([('key-1', 'value-1'), ('key-2', 'new-value-2'), ('key-3', 'value-3')])
Use items in for loop
for key, value in dictionary.items():
    print(f"{key}: {value}")
key-1: value-1
key-2: new-value-2
key-3: value-3
Check if dictionary has key
The ‘in’ syntax we used with sequences checks the dict’s keys for membership.
'key-5' in dictionary
False
Get method
dictionary.get("bad key", "default value")
'default value'
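To show why get is useful, the sketch below contrasts it with square-bracket access and then uses it for a simple tally. The `inventory` and `counts` dictionaries are made up for this example.

```python
inventory = {'apples': 3}

# Indexing a missing key with square brackets raises KeyError ...
try:
    inventory['pears']
    missing_raised = False
except KeyError:
    missing_raised = True

# ... while get() returns None, or the default you supply.
pears = inventory.get('pears', 0)

# A common use: counting items without checking for the key first.
counts = {}
for letter in 'mississippi':
    counts[letter] = counts.get(letter, 0) + 1

print(missing_raised, pears)  # True 0
print(counts)                 # {'m': 1, 'i': 4, 's': 4, 'p': 2}
```

Note that get does not add the missing key to the dictionary; it only supplies a value for that one lookup.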
Remove item
del dictionary['key-1']
dictionary
{'key-2': 'new-value-2', 'key-3': 'value-3'}
Keys must be immutable
List as key
Lists are mutable and not hashable
items = ['item-1', 'item-2', 'item-3']
map = {}
map[items] = "some-value"
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-66-25faa77a670a> in <module>()
3 map = {}
4
----> 5 map[items] = "some-value"
TypeError: unhashable type: 'list'
Tuple as a key
Tuples are immutable and hence hashable, provided every item they contain is hashable
items = 'item-1', 'item-2', 'item-3'
map = {}
map[items] = "some-value"
map
{('item-1', 'item-2', 'item-3'): 'some-value'}
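One place tuple keys come in handy, shown here as a sketch with a hypothetical `grid` dictionary, is storing values by coordinate. The last part shows that a tuple holding a list still cannot be used as a key.

```python
# Tuples of (row, column) used as dictionary keys for a sparse grid.
grid = {}
grid[(0, 0)] = 'start'
grid[(2, 3)] = 'goal'

# Cells that were never set fall back to a default.
value = grid.get((1, 1), 'empty')

# A tuple is only hashable if everything inside it is hashable.
try:
    grid[(1, [2])] = 'oops'
    unhashable = False
except TypeError:
    unhashable = True

print(value, unhashable)  # empty True
```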
5.3 Dive into sets
Create set from tuple or list
letters = 'a', 'a', 'a', 'b', 'c'
unique_letters = set(letters)
unique_letters
{'a', 'b', 'c'}
Create set from a string
unique_chars = set('mississippi')
unique_chars
{'i', 'm', 'p', 's'}
Create set using curly braces
unique_num = {1, 1, 2, 3, 4, 5, 5}
unique_num
{1, 2, 3, 4, 5}
Adding to a set
unique_num.add(6)
unique_num
{1, 2, 3, 4, 5, 6}
Popping from a set
The pop method removes and returns an arbitrary element of the set; do not rely on which one
unique_num.pop()
2
Indexing
Sets have no order, and hence cannot be accessed via indexing
unique_num[4]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-75-c928415e5703> in <module>()
----> 1 unique_num[4]
TypeError: 'set' object does not support indexing
Checking membership
3 in unique_num
True
Set operations
s1 = {1, 2, 3, 4, 5, 6, 7}
s2 = {0, 2, 4, 6, 8}
Items in first set, but not in the second
s1 - s2
{1, 3, 5, 7}
Items in either or both sets
s1 | s2
{0, 1, 2, 3, 4, 5, 6, 7, 8}
Items in both sets
s1 & s2
{2, 4, 6}
Items in either set, but not both
s1 ^ s2
{0, 1, 3, 5, 7, 8}
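Each operator above also has a named method on the built-in set type. The sketch below checks that they agree for the same two sets; the `same_*` and `evens_in_s1` names are only illustrative.

```python
s1 = {1, 2, 3, 4, 5, 6, 7}
s2 = {0, 2, 4, 6, 8}

# Each operator has an equivalent method.
same_difference = s1.difference(s2) == s1 - s2
same_union = s1.union(s2) == s1 | s2
same_intersection = s1.intersection(s2) == s1 & s2
same_symmetric = s1.symmetric_difference(s2) == s1 ^ s2

# Unlike the operators, the methods accept any iterable, such as a list.
evens_in_s1 = s1.intersection([0, 2, 4, 6, 8])

print(same_difference, same_union, same_intersection, same_symmetric)
print(evens_in_s1)
```

The operators require both operands to be sets, so `s1 & [0, 2, 4]` raises a TypeError while the method form works.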
5.4 Work with the numpy array
NumPy is an open source numerical computing library for Python. The NumPy array is a data structure representing multidimensional arrays, optimized for both memory use and performance.
Create a numpy array from a list of lists
import numpy as np
list_of_lists = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
np_array = np.array(list_of_lists)
np_array
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
Initialize an array of zeros
zeros_array = np.zeros( (4, 5) )
zeros_array
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
Initialize an array of ones
ones_array = np.ones( (6, 6) )
ones_array
array([[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.]])
Using arange
nine = np.arange( 9 )
nine
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
Using reshape
nine.reshape(3,3)
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
Introspection
Get the data type
np_array.dtype
dtype('int64')
Get the array’s shape
np_array.shape
(4, 4)
Get the number of items in the array
np_array.size
16
Get the size of the array in bytes
np_array.nbytes
128
Setting the data type
dtype parameter
np_array = np.array(list_of_lists, dtype=np.int8)
np_array
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]], dtype=int8)
Size reduction
np_array.nbytes
16
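The byte counts above follow from the number of items times the bytes used per item. Here is a minimal sketch, assuming an explicitly 64-bit array so the numbers match on any platform; `itemsize` and `astype` are standard NumPy features, while the variable names are illustrative.

```python
import numpy as np

# Build a 4 x 4 array with an explicit 64-bit integer type.
wide = np.arange(1, 17, dtype=np.int64).reshape(4, 4)

# nbytes is the number of items times the bytes used by each item.
wide_bytes = wide.size * wide.itemsize        # 16 items * 8 bytes

# astype returns a new array with the requested type; the original is unchanged.
narrow = wide.astype(np.int8)
narrow_bytes = narrow.size * narrow.itemsize  # 16 items * 1 byte

print(wide.nbytes, narrow.nbytes)
```

Using a smaller type is only safe when every value fits in it; int8 holds values from -128 to 127.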
The data type is fixed when the array is created
Values assigned later are cast to that type, so data may be truncated if the data type is restrictive.
np_array[0][0] = 1.7344567
np_array[0][0]
1
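If the fractional part matters, one option is to convert to a floating-point array before assigning. This is a sketch with a small made-up array named `small`, not the `np_array` from the cells above.

```python
import numpy as np

small = np.array([[1, 2], [3, 4]], dtype=np.int8)

# Assigning a float into an integer array drops the fractional part.
small[0, 0] = 1.7344567
truncated = small[0, 0]

# To keep the fraction, convert to a floating-point array first.
as_float = small.astype(np.float64)
as_float[0, 0] = 1.7344567

print(truncated)       # 1
print(as_float[0, 0])  # 1.7344567
```

The conversion makes a new array, so `small` itself keeps its int8 type.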
Array Slicing
- Slicing can be used to get a view representing a sub-array.
- The slice is a view of the original array; the data is not copied to a new data structure.
- The slice is taken in the form: array[ rows, columns ]
np_array
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]], dtype=int8)
np_array[2:, :3]
array([[ 9, 10, 11],
[13, 14, 15]], dtype=int8)
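Because a slice is a view, writing through it changes the original array. The sketch below demonstrates this with a hypothetical array named `grid`, and shows `copy()` as one way to get independent data.

```python
import numpy as np

grid = np.arange(1, 17).reshape(4, 4)

# A slice shares memory with the array it came from.
view = grid[2:, :3]
view[0, 0] = 99                 # also changes grid[2, 0]
shares = np.shares_memory(grid, view)

# copy() produces independent data, so the original is untouched.
independent = grid[2:, :3].copy()
independent[0, 0] = -1

print(grid[2, 0])  # 99
print(shares)      # True
```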
Math operations
- Unlike nested lists, arrays support arithmetic operators that perform mathematical operations on the data, element by element
Create two 3 x 3 arrays
np_array_1 = np.arange(9).reshape(3,3)
np_array_1
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
np_array_2 = np.arange(10, 19).reshape(3,3)
np_array_2
array([[10, 11, 12],
[13, 14, 15],
[16, 17, 18]])
Multiply the arrays (element by element)
np_array_1 * np_array_2
array([[ 0, 11, 24],
[ 39, 56, 75],
[ 96, 119, 144]])
Add the arrays
np_array_1 + np_array_2
array([[10, 12, 14],
[16, 18, 20],
[22, 24, 26]])
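To make the contrast with plain lists concrete, here is a short sketch applying the same operators to a list and to an array built from it. The variable names are illustrative.

```python
import numpy as np

plain = [1, 2, 3]
array = np.array(plain)

# On lists, * repeats and + concatenates.
repeated = plain * 2             # [1, 2, 3, 1, 2, 3]
joined = plain + [10, 20, 30]    # [1, 2, 3, 10, 20, 30]

# On arrays, the same operators work element by element.
doubled = array * 2                      # array([2, 4, 6])
summed = array + np.array([10, 20, 30])  # array([11, 22, 33])

print(repeated, joined)
print(doubled, summed)
```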
Matrix operations
Transpose
np_array.T
array([[ 1, 5, 9, 13],
[ 2, 6, 10, 14],
[ 3, 7, 11, 15],
[ 4, 8, 12, 16]], dtype=int8)
Dot product
np_array_1.dot(np_array_2)
array([[ 45, 48, 51],
[162, 174, 186],
[279, 300, 321]])
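The dot product is easy to confuse with the `*` operator used earlier. As a sketch, the code below rebuilds the same two 3 x 3 arrays under the shorter names `a` and `b` (chosen for this example) and compares the two operations; the `@` operator is Python's matrix multiplication operator and gives the same result as `dot` for 2-D arrays.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = np.arange(10, 19).reshape(3, 3)

# * multiplies matching positions.
elementwise = a * b

# @ (or dot) multiplies rows of a by columns of b and sums the products.
matrix_product = a @ b
same_as_dot = np.array_equal(matrix_product, a.dot(b))

print(elementwise[0])     # [ 0 11 24]
print(matrix_product[0])  # [45 48 51]
```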
5.5 Use the Pandas DataFrame
- One of the most widely used data structures for data science
- A table-like, two-dimensional data structure
Create a DataFrame
import pandas as pd
first_names = ['henry', 'rolly', 'molly', 'frank', 'david', 'steven', 'gwen', 'arthur']
last_names = ['smith', 'brocker', 'stein', 'bach', 'spencer', 'de wilde', 'mason', 'davis']
ages = [43, 23, 78, 56, 26, 14, 46, 92]
df = pd.DataFrame({ 'first': first_names, 'last': last_names, 'age': ages})
df
| | age | first | last |
|---|---|---|---|
| 0 | 43 | henry | smith |
| 1 | 23 | rolly | brocker |
| 2 | 78 | molly | stein |
| 3 | 56 | frank | bach |
| 4 | 26 | david | spencer |
| 5 | 14 | steven | de wilde |
| 6 | 46 | gwen | mason |
| 7 | 92 | arthur | davis |
Head - looking at the top
df.head(10)
| | age | first | last |
|---|---|---|---|
| 0 | 43 | henry | smith |
| 1 | 23 | rolly | brocker |
| 2 | 78 | molly | stein |
| 3 | 56 | frank | bach |
| 4 | 26 | david | spencer |
| 5 | 14 | steven | de wilde |
| 6 | 46 | gwen | mason |
| 7 | 92 | arthur | davis |
Setting number of rows returned with head
df.head(3)
| | age | first | last |
|---|---|---|---|
| 0 | 43 | henry | smith |
| 1 | 23 | rolly | brocker |
| 2 | 78 | molly | stein |
Tail - looking at the bottom
df.tail(2)
| | age | first | last |
|---|---|---|---|
| 6 | 46 | gwen | mason |
| 7 | 92 | arthur | davis |
Describe - descriptive statistics
df.describe()
| | age |
|---|---|
| count | 8.000000 |
| mean | 47.250000 |
| std | 27.227874 |
| min | 14.000000 |
| 25% | 25.250000 |
| 50% | 44.500000 |
| 75% | 61.500000 |
| max | 92.000000 |
Access one column
df['first']
0 henry
1 rolly
2 molly
3 frank
4 david
5 steven
6 gwen
7 arthur
Name: first, dtype: object
Slice a column
df['first'][4:]
4 david
5 steven
6 gwen
7 arthur
Name: first, dtype: object
Use conditions to filter
df[df['age'] > 50]
| | age | first | last |
|---|---|---|---|
| 2 | 78 | molly | stein |
| 3 | 56 | frank | bach |
| 7 | 92 | arthur | davis |
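Conditions can also be combined. The sketch below rebuilds the same DataFrame so it runs on its own, then filters on two conditions at once; the names `over_50` and `twenty_to_fifty` are illustrative.

```python
import pandas as pd

first_names = ['henry', 'rolly', 'molly', 'frank', 'david', 'steven', 'gwen', 'arthur']
last_names = ['smith', 'brocker', 'stein', 'bach', 'spencer', 'de wilde', 'mason', 'davis']
ages = [43, 23, 78, 56, 26, 14, 46, 92]
df = pd.DataFrame({'first': first_names, 'last': last_names, 'age': ages})

# A comparison on a column yields a boolean Series used to select rows.
over_50 = df[df['age'] > 50]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
twenty_to_fifty = df[(df['age'] > 20) & (df['age'] < 50)]

print(list(over_50['first']))
print(list(twenty_to_fifty['first']))
```

The parentheses are needed because `&` binds more tightly than the comparison operators.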
5.6 Use the pandas Series
- A one-dimensional labeled array
- Contains data of only one type
- Similar to a column in a spreadsheet
Create a series
pd_series = pd.Series( [1, 2, 3 ] )
pd_series
0 1
1 2
2 3
dtype: int64
Series introspection methods
f"This series is made up of {pd_series.size} items whose data type is {pd_series.dtype}"
'This series is made up of 3 items whose data type is int64'
A Pandas DataFrame is composed of Pandas Series.
age = df.age
type( age )
pandas.core.series.Series
Some useful helper methods of a Series
Mean
pd_series = pd.Series([ 1, 2, 3, 5, 6, 6, 6, 7, 8])
pd_series.mean()
4.888888888888889
Unique
pd_series.unique()
array([1, 2, 3, 5, 6, 7, 8])
Max
pd_series.max()
8
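These helpers can be gathered into a quick summary. The sketch below uses the same values under the hypothetical name `series`, and adds `value_counts`, another standard Series method that is not covered in this lesson.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 5, 6, 6, 6, 7, 8])

# Collect a few summary values in one place.
summary = {
    'min': series.min(),
    'max': series.max(),
    'mean': series.mean(),
    'unique_count': len(series.unique()),
}

# value_counts tallies how often each value appears; look up the value 6 by label.
times_six_appears = series.value_counts().loc[6]

print(summary)
print(times_six_appears)
```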