Lesson 5: Python Data Structures

Pragmatic AI Labs

This notebook was produced by Pragmatic AI Labs. You can continue learning about these topics by:

Buying a copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
Reading an online copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
Watching the video Essential Machine Learning and AI with Python and Jupyter Notebook on Safari Books Online
Watching video AWS Certified Machine Learning-Speciality
Purchasing the video Essential Machine Learning and AI with Python and Jupyter Notebook
Viewing more content at noahgift.com

5.1 Use lists and tuples

Sequences

Lists, tuples, and strings are all Python sequences, and share many of the same methods.

Creating an empty list

empty = []
empty

[]

Using square brackets with initial values

numbers = [1, 2, 3]
numbers

[1, 2, 3]

Casting an iterable

Any iterable can be cast to a list

numbers = list(range(10))
numbers

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Creating using multiplication

num_players = 10
scores = [0] * num_players
scores

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
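
One caveat with multiplication: the items are repeated by reference. That is harmless for immutable values such as 0, but it can surprise when the repeated item is itself a list. A minimal sketch (the names shared_rows and independent_rows are made up for illustration):

```python
# Multiplying a list repeats references to the same items.
# With a nested list, every "row" is the same inner list object.
shared_rows = [[0] * 3] * 2
shared_rows[0][0] = 99          # shows up in both rows, since they are one object

# A list comprehension builds a fresh inner list for each row.
independent_rows = [[0] * 3 for _ in range(2)]
independent_rows[0][0] = 99     # only the first row changes

print(shared_rows)       # [[99, 0, 0], [99, 0, 0]]
print(independent_rows)  # [[99, 0, 0], [0, 0, 0]]
```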

Mixing data types

Lists can contain multiple data types

mixed = ['a', 1, 2.0, [13], {}]
mixed

['a', 1, 2.0, [13], {}]

Indexing

Items in lists can be accessed using indices in a similar fashion to strings.

Access first item

numbers[0]

Access last item

numbers[-1]

Access any item

numbers[4]

Adding to a list

Append to the end of a list

letters = ['a']
letters.append('c')
letters

['a', 'c']

Insert at beginning of list

letters.insert(0, 'b')
letters

['b', 'a', 'c']

Insert at arbitrary position

letters.insert(2, 'c')
letters

['b', 'a', 'c', 'c']

Extending with another list

more_letters = ['e', 'f', 'g']
letters.extend(more_letters)
letters

['b', 'a', 'c', 'c', 'e', 'f', 'g']

Change item at some position

letters[3] = 'd'
letters

['b', 'a', 'c', 'd', 'e', 'f', 'g']

Swap two items

letters[0], letters[1] = letters[1], letters[0]
letters

['a', 'b', 'c', 'd', 'e', 'f', 'g']

Removing items from a list

Pop from the end

letters = ['a', 'b', 'c', 'd', 'e', 'f']
letters.pop()
letters

['a', 'b', 'c', 'd', 'e']

Pop by index

letters.pop(2)
letters

['a', 'b', 'd', 'e']

Remove specific item

letters.remove('d')
letters

['a', 'b', 'e']
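
Two details of remove are easy to miss: it deletes only the first matching item, and it raises a ValueError when the item is not present. A small sketch, using a made-up list named colors:

```python
# remove() deletes only the first matching item.
colors = ['red', 'blue', 'red', 'green']
colors.remove('red')
print(colors)  # ['blue', 'red', 'green']

# Removing a value that is not in the list raises ValueError,
# so either check membership first or catch the exception.
if 'pink' in colors:
    colors.remove('pink')

try:
    colors.remove('pink')
    removed = True
except ValueError:
    removed = False

print(removed)  # False
```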

Create tuple using parentheses

tup = (1, 2, 3)
tup

(1, 2, 3)

Create tuple with commas

tup = 1, 2, 3
tup

(1, 2, 3)

Create empty tuple

tup = ()
tup

()

Create tuple with single item

tup = 1,
tup

(1,)

Behaviors shared by lists and tuples

The following sequence behaviors are shared by lists and tuples

Check item in sequence

3 in (1, 2, 3, 4, 5)

True

Check item not in sequence

'a' not in [1, 2, 3, 4, 5]

True

Slicing

Setting start, slice to the end

letters = 'a', 'b', 'c', 'd', 'e', 'f'
letters[3:]

('d', 'e', 'f')

Set end, slice from beginning

letters[:4]

('a', 'b', 'c', 'd')

Index from end of sequence

letters[-4:]

('c', 'd', 'e', 'f')

Setting step

letters[1::-2]

('b',)

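Unpacking

The number of names on the left must match the number of items in the sequence. A mismatch raises a ValueError, as this example shows.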

first, middle = [1, 2, 3]

f"first = {first},  middle = {middle},  last = {last}"

    ---------------------------------------------------------------------------

    ValueError                                Traceback (most recent call last)

    <ipython-input-38-c24a37f354b9> in <module>()
    ----> 1 first, middle = [1, 2, 3]
          2 
          3 f"first = {first},  middle = {middle},  last = {last}"


    ValueError: too many values to unpack (expected 2)

Extended unpacking

first, *middle, last = (1, 2, 3, 4, 5)

f"first = {first},  middle = {middle},  last = {last}"

'first = 1,  middle = [2, 3, 4],  last = 5'
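
A few more cases of the same idea, as a hedged sketch with made-up names: the starred name always receives a list (possibly empty), unpacking works on any iterable including strings, and it is commonly used in for loops over pairs.

```python
# The starred name collects whatever is left over, always as a list.
start, *between, end = (1, 2)
print(start, between, end)   # 1 [] 2

# Any iterable can be unpacked, including a string.
head, *tail = 'abc'
print(head, tail)            # a ['b', 'c']

# Unpacking each pair while looping.
total = 0
for name, score in [('ann', 3), ('bob', 5)]:
    total += score
print(total)                 # 8
```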

Using list as Stack

A stack is a LIFO (last in, first out) data structure which can be simulated using a list

Push onto the stack using append

stack = []
stack.append('first on')
stack.append('second on')
stack.append('third on')
stack

['first on', 'second on', 'third on']

Retrieve items, last one first using pop

f"Retrieved first: {stack.pop()!r}, retrieved second: {stack.pop()!r}, retrieved last: {stack.pop()!r}"

"Retrieved first: 'third on', retrieved second: 'second on', retrieved last: 'first on'"

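As an illustration of why LIFO order is useful, here is a small sketch that uses a list as a stack to check whether brackets are balanced. The function name is_balanced is made up for this example and is not part of the lesson.

```python
def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    openers, closers = "([{", ")]}"
    stack = []
    for char in text:
        if char in openers:
            stack.append(char)                       # push an opening bracket
        elif char in closers:
            expected = openers[closers.index(char)]  # the opener this closer needs
            # The most recently pushed opener must be the matching one.
            if not stack or stack.pop() != expected:
                return False
    return not stack                                 # leftovers are unclosed brackets

print(is_balanced("[(1, 2), {3: 4}]"))  # True
print(is_balanced("[(1, 2]"))           # False
```
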
5.2 Explore dictionaries

Dictionaries are mappings of key value pairs.

Create an empty dict using constructor

dictionary = dict()
dictionary

{}

Create a dictionary based on key/value pairs

key_values = [['key-1','value-1'], ['key-2', 'value-2']]
dictionary = dict(key_values)
dictionary

{'key-1': 'value-1', 'key-2': 'value-2'}

Create an empty dict using curly braces

dictionary = {}
dictionary

{}

Use curly braces to create a dictionary with initial key/values

dictionary = {'key-1': 'value-1',
              'key-2': 'value-2'}

dictionary

{'key-1': 'value-1', 'key-2': 'value-2'}

Access value using key

dictionary['key-1']

'value-1'

Add a key/value pair to an existing dictionary

dictionary['key-3'] = 'value-3'

dictionary

{'key-1': 'value-1', 'key-2': 'value-2', 'key-3': 'value-3'}

Update value for existing key

dictionary['key-2'] = 'new-value-2'
dictionary['key-2']

'new-value-2'

Get keys

list(dictionary.keys())

['key-1', 'key-2', 'key-3']

Get values

dictionary.values()

dict_values(['value-1', 'new-value-2', 'value-3'])

Get iterable of key/value pairs

dictionary.items()

dict_items([('key-1', 'value-1'), ('key-2', 'new-value-2'), ('key-3', 'value-3')])

Use items in for loop

for key, value in dictionary.items():
  print(f"{key}: {value}")

key-1: value-1
key-2: new-value-2
key-3: value-3

Check if dictionary has key

The ‘in’ syntax we used with sequences checks the dict's keys for membership.

'key-5' in dictionary

False

Get method

dictionary.get("bad key", "default value")

'default value'
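
A common use of get with a default is counting, since it avoids a KeyError the first time a key is seen. A minimal sketch with made-up data:

```python
words = ['spam', 'eggs', 'spam', 'ham', 'spam']

counts = {}
for word in words:
    # Use 0 when the word has not been seen yet, then add one.
    counts[word] = counts.get(word, 0) + 1

print(counts)                 # {'spam': 3, 'eggs': 1, 'ham': 1}

# Without a default, get returns None for a missing key instead of raising.
print(counts.get('bacon'))    # None
```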

Remove item

del dictionary['key-1']
dictionary

{'key-2': 'new-value-2', 'key-3': 'value-3'}

Keys must be immutable

List as key

Lists are mutable and not hashable

items = ['item-1', 'item-2', 'item-3']

map = {}

map[items] = "some-value"

    ---------------------------------------------------------------------------

    TypeError                                 Traceback (most recent call last)

    <ipython-input-66-25faa77a670a> in <module>()
          3 map = {}
          4 
    ----> 5 map[items] = "some-value"
    

    TypeError: unhashable type: 'list'

Tuple as a key

Tuples are immutable and hence hashable

items = 'item-1', 'item-2', 'item-3'
map = {}
map[items] = "some-value"

map

{('item-1', 'item-2', 'item-3'): 'some-value'}
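
Tuple keys are handy for things like grid coordinates. One caveat: a tuple is only hashable when everything inside it is hashable. A short sketch, with made-up names grid and hashable:

```python
# Coordinates as dictionary keys.
grid = {}
grid[(0, 0)] = 'start'
grid[(2, 3)] = 'goal'
print(grid[(2, 3)])   # goal

# A tuple that contains a list cannot be hashed, so it cannot be a key.
try:
    hash((1, [2, 3]))
    hashable = True
except TypeError:
    hashable = False

print(hashable)       # False
```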

5.3 Dive into sets

Create set from tuple or list

letters = 'a', 'a', 'a', 'b', 'c'
unique_letters = set(letters)
unique_letters

{'a', 'b', 'c'}

Create set from a string

unique_chars = set('mississippi')
unique_chars

{'i', 'm', 'p', 's'}

Create set using curly braces

unique_num = {1, 1, 2, 3, 4, 5, 5}
unique_num

{1, 2, 3, 4, 5}

Adding to a set

unique_num.add(6)
unique_num

{1, 2, 3, 4, 5, 6}

Popping from a set

The pop method removes and returns an arbitrary element of the set

unique_num.pop()

Indexing

Sets have no order, and hence cannot be accessed via indexing

unique_num[4]

    ---------------------------------------------------------------------------

    TypeError                                 Traceback (most recent call last)

    <ipython-input-75-c928415e5703> in <module>()
    ----> 1 unique_num[4]
    
    TypeError: 'set' object does not support indexing

Checking membership

3 in unique_num

True

Set operations

s1 = {1, 2, 3, 4, 5, 6, 7}
s2 = {0, 2, 4, 6, 8}

Items in first set, but not in the second

s1 - s2

{1, 3, 5, 7}

Items in either or both sets

s1 | s2

{0, 1, 2, 3, 4, 5, 6, 7, 8}

Items in both sets

s1 & s2

{2, 4, 6}

Items in either set, but not both

s1 ^ s2

{0, 1, 3, 5, 7, 8}
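
Each operator above also has a method form. The sketch below repeats the same two sets so it runs on its own; the main practical difference is that the methods accept any iterable, while the operators require both sides to be sets.

```python
s1 = {1, 2, 3, 4, 5, 6, 7}
s2 = {0, 2, 4, 6, 8}

# Method equivalents of the operators.
print(s1.difference(s2) == s1 - s2)            # True
print(s1.union(s2) == s1 | s2)                 # True
print(s1.intersection(s2) == s1 & s2)          # True
print(s1.symmetric_difference(s2) == s1 ^ s2)  # True

# Methods take any iterable; the operators raise TypeError for a list.
evens = [0, 2, 4, 6, 8]
common = s1.intersection(evens)
print(common == {2, 4, 6})                     # True
```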

5.4 Work with the numpy array

NumPy is an open source numerical computing library for Python. The NumPy array is a data structure representing multidimensional arrays, optimized for both memory and performance.

Create a numpy array from a list of lists

import numpy as np
list_of_lists = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]

np_array = np.array(list_of_lists)

np_array

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])

Initialize an array of zeros

zeros_array = np.zeros( (4, 5) )
zeros_array

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

Initialize an array of ones

ones_array = np.ones( (6, 6) )
ones_array

array([[1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.]])

Using arange

nine = np.arange( 9 )
nine

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

Using reshape

nine.reshape(3,3)

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Introspection

Get the data type

np_array.dtype

dtype('int64')

Get the array’s shape

np_array.shape

(4, 4)

Get the number of items in the array

np_array.size

Get the size of the array in bytes

np_array.nbytes

Setting the data type

dtype parameter

np_array = np.array(list_of_lists, dtype=np.int8)
np_array

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]], dtype=int8)

Size reduction

np_array.nbytes

The data type setting is immutable

Data may be truncated if the data type is restrictive.

np_array[0][0] = 1.7344567
np_array[0][0]
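
The effect of the dtype on size and on stored values can be seen side by side in this sketch. It assumes NumPy is installed, and the names values, wide, narrow, and as_float are made up for illustration. The int64 dtype is set explicitly because the default integer type can differ between platforms.

```python
import numpy as np

values = [[1, 2, 3, 4], [5, 6, 7, 8]]

wide = np.array(values, dtype=np.int64)    # 8 bytes per item
narrow = np.array(values, dtype=np.int8)   # 1 byte per item
print(wide.nbytes, narrow.nbytes)          # 64 8

# Assigning a float into an integer array drops the fractional part.
narrow[0, 0] = 1.7344567
print(narrow[0, 0])                        # 1

# astype returns a new array with another dtype; the original keeps its own.
as_float = narrow.astype(np.float64)
print(as_float.dtype, narrow.dtype)        # float64 int8
```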

Array Slicing

Slicing can be used to get a view representing a sub-array.
The slice is a view of the original array; the data is not copied to a new data structure.
The slice is taken in the form: array[ rows, columns ]

np_array

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]], dtype=int8)

np_array[2:, :3]

array([[ 9, 10, 11],
       [13, 14, 15]], dtype=int8)
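
Because a slice is a view, writing to it changes the original array. The sketch below shows this, along with copy for when an independent array is wanted. It assumes NumPy is installed; the names grid, corner, and separate are made up for illustration.

```python
import numpy as np

grid = np.arange(1, 17).reshape(4, 4)

corner = grid[2:, :3]            # a view: shares memory with grid
corner[0, 0] = -1                # writes through to grid
print(grid[2, 0])                # -1

separate = grid[2:, :3].copy()   # an independent copy of the same region
separate[0, 0] = 100
print(grid[2, 0])                # still -1

print(np.shares_memory(grid, corner))    # True
print(np.shares_memory(grid, separate))  # False
```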

Math operations

Unlike nested lists, NumPy arrays apply mathematical operations element by element.

Create two 3 x 3 arrays

np_array_1 = np.arange(9).reshape(3,3)
np_array_1

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

np_array_2 = np.arange(10, 19).reshape(3,3)
np_array_2

array([[10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])

Multiply the arrays

np_array_1 * np_array_2

array([[  0,  11,  24],
       [ 39,  56,  75],
       [ 96, 119, 144]])

Add the arrays

np_array_1 + np_array_2

array([[10, 12, 14],
       [16, 18, 20],
       [22, 24, 26]])

Matrix operations

Transpose

np_array.T

array([[ 1,  5,  9, 13],
       [ 2,  6, 10, 14],
       [ 3,  7, 11, 15],
       [ 4,  8, 12, 16]], dtype=int8)

Dot product

np_array_1.dot(np_array_2)

array([[ 45,  48,  51],
       [162, 174, 186],
       [279, 300, 321]])
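
The difference between * and the dot product is easy to mix up: * multiplies matching elements, while dot (or the equivalent @ operator) performs matrix multiplication. A short sketch, assuming NumPy is installed and using made-up names a and b for the same two arrays:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = np.arange(10, 19).reshape(3, 3)

elementwise = a * b        # multiplies matching positions
matrix_product = a @ b     # matrix multiplication, same result as a.dot(b)

print(np.array_equal(matrix_product, a.dot(b)))  # True
print(elementwise[1, 1], matrix_product[1, 1])   # 56 174
```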

5.5 Use the Pandas DataFrame

One of the most widely used data structures in data science
A table-like, two-dimensional data structure

Create a DataFrame

import pandas as pd
first_names = ['henry', 'rolly', 'molly', 'frank', 'david', 'steven', 'gwen', 'arthur']
last_names = ['smith', 'brocker', 'stein', 'bach', 'spencer', 'de wilde', 'mason', 'davis']
ages = [43, 23, 78, 56, 26, 14, 46, 92]

df = pd.DataFrame({ 'first': first_names, 'last': last_names, 'age': ages})
df

	age	first	last
0	43	henry	smith
1	23	rolly	brocker
2	78	molly	stein
3	56	frank	bach
4	26	david	spencer
5	14	steven	de wilde
6	46	gwen	mason
7	92	arthur	davis

Head - looking at the top

df.head(10)

	age	first	last
0	43	henry	smith
1	23	rolly	brocker
2	78	molly	stein
3	56	frank	bach
4	26	david	spencer
5	14	steven	de wilde
6	46	gwen	mason
7	92	arthur	davis

Setting number of rows returned with head

df.head(3)

Tail - looking at the bottom

df.tail(2)

	age	first	last
6	46	gwen	mason
7	92	arthur	davis

Describe - descriptive statistics

df.describe()

	age
count	8.000000
mean	47.250000
std	27.227874
min	14.000000
25%	25.250000
50%	44.500000
75%	61.500000
max	92.000000

Access one column

df['first']

   henry
   rolly
   molly
   frank
   david
  steven
    gwen
  arthur
Name: first, dtype: object

Slice a column

df['first'][4:]

   david
  steven
    gwen
  arthur
Name: first, dtype: object

Use conditions to filter

df[df['age'] > 50]

	age	first	last
2	78	molly	stein
3	56	frank	bach
7	92	arthur	davis
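
The filter works because the comparison produces a boolean Series with one True or False per row, which then selects the matching rows. Conditions can be combined with & and |, each wrapped in its own parentheses. A minimal sketch, assuming pandas is installed and using a small made-up DataFrame named people:

```python
import pandas as pd

people = pd.DataFrame({'first': ['henry', 'rolly', 'molly', 'frank'],
                       'age': [43, 23, 78, 56]})

# The condition alone is a boolean Series.
over_50 = people['age'] > 50
print(over_50.tolist())               # [False, False, True, True]

# Combine conditions with & (and) or | (or); parentheses are required.
mid_range = people[(people['age'] > 30) & (people['age'] < 60)]
print(mid_range['first'].tolist())    # ['henry', 'frank']
```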

5.6 Use the pandas Series

A one-dimensional labeled array
Contains data of only one type
Similar to a column in a spreadsheet

Create a series

pd_series = pd.Series( [1, 2, 3 ] )
pd_series

  1
  2
  3
dtype: int64

Series introspection methods

f"This series is made up of {pd_series.size} items whose data type is {pd_series.dtype}"

'This series is made up of 3 items whose data type is int64'

A Pandas DataFrame is composed of Pandas Series.

age = df.age
type( age )

pandas.core.series.Series

Some useful helper methods of a Series

mean

pd_series = pd.Series([ 1, 2, 3, 5, 6, 6, 6, 7, 8])
pd_series.mean()

4.888888888888889

Unique

pd_series.unique()

array([1, 2, 3, 5, 6, 7, 8])

Max

pd_series.max()
