Lesson 13 Sorting

Pragmatic AI Labs

This notebook was produced by Pragmatic AI Labs. You can continue learning about these topics by:

Buying a copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
Reading an online copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
Watching the video Essential Machine Learning and AI with Python and Jupyter Notebook on Safari Books Online
Watching the video AWS Certified Machine Learning-Specialty
Purchasing the video Essential Machine Learning and AI with Python and Jupyter Notebook
Viewing more content at noahgift.com

13.1 Sort in Python

Understanding Sorting

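Python has two built-in ways to sort: the sorted() function, which returns a new list from any iterable, and the list.sort() method, which sorts a list in place.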

World Food Facts DataSet

Original Data Source: https://www.kaggle.com/openfoodfacts/world-food-facts
Modified Source: https://www.kaggle.com/lwodarzek/nutrition-table-clustering/output

Ingest

import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/noahgift/food/master/data/features.en.openfoodfacts.org.products.csv")
df.drop(["Unnamed: 0", "exceeded", "g_sum", "energy_100g"], axis=1, inplace=True)  # drop four columns we don't need
df = df.drop(df.index[[1, 11877]])  # drop two outlier rows
df.rename(index=str, columns={"reconstructed_energy": "energy_100g"}, inplace=True)
df.head()

	fat_100g	carbohydrates_100g	sugars_100g	proteins_100g	salt_100g	energy_100g	product
0	28.57	64.29	14.29	3.57	0.00000	2267.85	Banana Chips Sweetened (Whole)
2	57.14	17.86	3.57	17.86	1.22428	2835.70	Organic Salted Nut Mix
3	18.75	57.81	15.62	14.06	0.13970	1953.04	Organic Muesli
4	36.67	36.67	3.33	16.67	1.60782	2336.91	Zen Party Mix
5	18.18	60.00	21.82	14.55	0.02286	1976.37	Cinnamon Nut Granola

Using built-in sorting

Convert Pandas DataFrame Columns into a list

food_facts = list(df.columns.values)
food_facts

['fat_100g',
 'carbohydrates_100g',
 'sugars_100g',
 'proteins_100g',
 'salt_100g',
 'energy_100g',
 'product']

Alphabetical Sort

sorted(food_facts)

['carbohydrates_100g',
 'energy_100g',
 'fat_100g',
 'product',
 'proteins_100g',
 'salt_100g',
 'sugars_100g']

Reverse Alphabetical Sort

sorted(food_facts, reverse=True)

['sugars_100g',
 'salt_100g',
 'proteins_100g',
 'product',
 'fat_100g',
 'energy_100g',
 'carbohydrates_100g']

Using the built-in list sort method

list.sort() sorts the list in place and returns None. It is available only on lists, not on other iterables.

food_facts = list(df.columns.values)
print(f"Before sort: {food_facts}")
food_facts.sort()
print(f"After sort: {food_facts}")

Before sort: ['fat_100g', 'carbohydrates_100g', 'sugars_100g', 'proteins_100g', 'salt_100g', 'energy_100g', 'product']
After sort: ['carbohydrates_100g', 'energy_100g', 'fat_100g', 'product', 'proteins_100g', 'salt_100g', 'sugars_100g']
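
The difference between the two approaches can be sketched as follows. This is a minimal illustration, not part of the original notebook: the column names are copied from the output above, and the variable names are made up for the example.

```python
columns = ["fat_100g", "carbohydrates_100g", "sugars_100g"]

# sorted() builds a new list and leaves the original untouched
new_list = sorted(columns)
print(columns)      # still in the original order
print(new_list)

# sorted() accepts any iterable, for example a tuple
from_tuple = sorted(("salt_100g", "energy_100g"))
print(from_tuple)

# list.sort() reorders the list in place and returns None
result = columns.sort()
print(result)
print(columns)
```

A common mistake is writing `columns = columns.sort()`, which replaces the list with `None`.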

Timing built-in sort function vs list sort method

list method

food_facts = list(df.columns.values)

%%timeit -n 3 -r 3
food_facts.sort()

3 loops, best of 3: 489 ns per loop

built in function

food_facts = list(df.columns.values)

%%timeit -n 3 -r 3
sorted(food_facts)

3 loops, best of 3: 656 ns per loop
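
One caveat about the timings above: after the first call to food_facts.sort() the list is already sorted, so the later loops time a sort of sorted input. A rough sketch of a more even comparison, using the standard-library timeit module outside of a notebook, copies the unsorted list on every run. The helper names are hypothetical, and the absolute numbers will vary by machine.

```python
import timeit

columns = ["fat_100g", "carbohydrates_100g", "sugars_100g",
           "proteins_100g", "salt_100g", "energy_100g", "product"]


def sort_in_place():
    data = list(columns)  # fresh unsorted copy on every run
    data.sort()
    return data


def sort_with_builtin():
    data = list(columns)  # same copy cost, so the comparison stays even
    return sorted(data)


runs = 10_000
in_place_time = timeit.timeit(sort_in_place, number=runs)
builtin_time = timeit.timeit(sort_with_builtin, number=runs)

print(f"list.sort(): {in_place_time:.4f} s for {runs} runs")
print(f"sorted():    {builtin_time:.4f} s for {runs} runs")
```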

Sorting Dictionary

Calling sorted() on a dictionary iterates over its keys, so the result is a sorted list of keys.

food_facts_row = df.head(1).to_dict()
food_facts_row

{'carbohydrates_100g': {'0': 64.29},
 'energy_100g': {'0': 2267.85},
 'fat_100g': {'0': 28.57},
 'product': {'0': 'Banana Chips Sweetened (Whole)'},
 'proteins_100g': {'0': 3.57},
 'salt_100g': {'0': 0.0},
 'sugars_100g': {'0': 14.29}}

Reverse sort of the dictionary keys

sorted(food_facts_row, reverse=True)

['sugars_100g',
 'salt_100g',
 'proteins_100g',
 'product',
 'fat_100g',
 'energy_100g',
 'carbohydrates_100g']
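
Sorting by value rather than by key is a common follow-up. The sketch below assumes a flattened version of the first row (numeric columns only, values copied from the output above); the names `nutrients` and `ranked` are illustrative.

```python
# Numeric values of the first row, flattened into a plain dict
nutrients = {
    "carbohydrates_100g": 64.29,
    "energy_100g": 2267.85,
    "fat_100g": 28.57,
    "proteins_100g": 3.57,
    "salt_100g": 0.0,
    "sugars_100g": 14.29,
}

# items() yields (key, value) pairs; the key function picks the value
by_value = sorted(nutrients.items(), key=lambda pair: pair[1], reverse=True)

# dicts keep insertion order, so this dict is ordered largest value first
ranked = dict(by_value)
print(list(ranked))
```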

df["product"].head().values

array(['Banana Chips Sweetened (Whole)', 'Organic Salted Nut Mix',
       'Organic Muesli', 'Zen Party Mix', 'Cinnamon Nut Granola'],
      dtype=object)

Sorting A Generator Pipeline

def dataframe_rows(df=df, column="product", chunks=10):
    """Yield successive lists of up to `chunks` values from one column"""

    count_row = df.shape[0]
    rows = list(df[column].values)
    for i in range(0, count_row, chunks):
        yield rows[i:i + chunks]

rows = dataframe_rows()
next(rows)

['Banana Chips Sweetened (Whole)',
 'Organic Salted Nut Mix',
 'Organic Muesli',
 'Zen Party Mix',
 'Cinnamon Nut Granola',
 'Organic Hazelnuts',
 'Organic Oat Groats',
 'Energy Power Mix',
 'Antioxidant Mix - Berries & Chocolate',
 'Organic Quinoa Coconut Granola With Mango']

next(rows)

['Fire Roasted Hatch Green Chile Almonds',
 'Peanut Butter Power Chews',
 'Organic Unswt Berry Coconut Granola',
 'Roasted Salted Black Pepper Cashews',
 'Thai Curry Roasted Cashews',
 'Wasabi Tamari Almonds',
 'Organic Red Quinoa',
 'Dark Chocolate Coconut Chews',
 'Organic Unsweetened Granola, Cinnamon Almond',
 'Organic Blueberry Almond Granola']

sorted_row = (sorted(row) for row in rows )
print(next(sorted_row))
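
Two details of the pipeline above are easy to miss. Because rows had already yielded two chunks, next(sorted_row) returns the third chunk, and each chunk is sorted on its own rather than the whole column being sorted. A self-contained sketch with a short list shows the same behavior; the `chunked` helper and the sample list are illustrative stand-ins for the DataFrame-backed generator.

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


products = ["Zen Party Mix", "Organic Muesli", "Cinnamon Nut Granola",
            "Organic Hazelnuts", "Energy Power Mix"]

# Nothing is sorted yet: the generator expression is lazy
sorted_chunks = (sorted(chunk) for chunk in chunked(products, 2))

first = next(sorted_chunks)      # sorts only the first chunk
remaining = list(sorted_chunks)  # drains and sorts the rest

print(first)
print(remaining)
```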

13.2 Create custom sorting functions

Building a Shuffle Function

food_items = ['Chocolate Nut Crunch', 'Cranberries', 'Curry Lentil Soup Mix', 
                'Milk Chocolate Peanut Butter Malt Balls', 'Organic Harvest Pilaf', 
                'Organic Tamari Pumpkin Seed', 'Split Pea Soup Mix', 
                'Swiss-Style Muesli', "Whole Wheat 'N Honey Fig Bars", 
                'Yogurt Pretzels']

from random import sample

def shuffle_list(items):
    """Return a shuffled copy of items; the original list is unchanged"""

    shuffled = sample(items, len(items))
    return shuffled

shuffled_food_items = shuffle_list(food_items)
shuffled_food_items

['Milk Chocolate Peanut Butter Malt Balls',
 'Organic Harvest Pilaf',
 'Curry Lentil Soup Mix',
 'Yogurt Pretzels',
 'Organic Tamari Pumpkin Seed',
 'Chocolate Nut Crunch',
 "Whole Wheat 'N Honey Fig Bars",
 'Split Pea Soup Mix',
 'Cranberries',
 'Swiss-Style Muesli']
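
For comparison, the standard library also offers random.shuffle, which reorders a list in place instead of returning a copy. The sketch below contrasts the two; the seed value and the short list are arbitrary choices for illustration, used only to make repeated runs reproducible.

```python
import random

items = ["Cranberries", "Yogurt Pretzels",
         "Organic Harvest Pilaf", "Split Pea Soup Mix"]

rng = random.Random(42)  # arbitrary seed, for reproducible runs

# sample() returns a new shuffled list and leaves `items` alone
shuffled_copy = rng.sample(items, len(items))

# shuffle() reorders the list it is given and returns None
in_place = list(items)
rng.shuffle(in_place)

print(shuffled_copy)
print(in_place)
print(items)  # original order preserved
```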

Custom Sort Functions

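Highly Customized Sort

A key function returns the value that sorted() compares for each item. Here the favorite snack always sorts first, and every other item is ordered by the length of its name.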

def best_snack(item):
  if item == "Chocolate Nut Crunch":
    return 1
  return len(item) 

sorted(shuffled_food_items, key=best_snack)

['Chocolate Nut Crunch',
 'Cranberries',
 'Yogurt Pretzels',
 'Split Pea Soup Mix',
 'Swiss-Style Muesli',
 'Organic Harvest Pilaf',
 'Curry Lentil Soup Mix',
 'Organic Tamari Pumpkin Seed',
 "Whole Wheat 'N Honey Fig Bars",
 'Milk Chocolate Peanut Butter Malt Balls']
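
Items with equal keys keep their input order, because Python's sort is stable. When a deterministic tie-break is wanted, one option is to return a tuple from the key function, since tuples compare element by element. The function `snack_key` below is a hypothetical variant of best_snack, shown only as a sketch.

```python
snacks = ["Swiss-Style Muesli", "Split Pea Soup Mix",
          "Cranberries", "Chocolate Nut Crunch"]


def snack_key(item):
    """Favorite first, then shorter names, then alphabetical order."""
    is_favorite = item == "Chocolate Nut Crunch"
    # False sorts before True, so `not is_favorite` puts the favorite first
    return (not is_favorite, len(item), item)


ranked = sorted(snacks, key=snack_key)
print(ranked)
```

The two 18-character names tie on length, so the third tuple element decides their order alphabetically.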

Sorting Objects

class Food:
  def __init__(self, product, protein):
    self.product = product
    self.protein = protein
  def __repr__(self):
    return f"Food: {self.product}, Protein: {self.protein}"

pairs = df[["product", "proteins_100g"]].head().values.tolist()
pairs

[['Banana Chips Sweetened (Whole)', 3.57],
 ['Organic Salted Nut Mix', 17.86],
 ['Organic Muesli', 14.06],
 ['Zen Party Mix', 16.67],
 ['Cinnamon Nut Granola', 14.55]]

pairs = df[["product", "proteins_100g"]].head().values.tolist()
foods = [Food(item[0], item[1]) for item in pairs]
foods

[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
 Food: Organic Salted Nut Mix, Protein: 17.86,
 Food: Organic Muesli, Protein: 14.06,
 Food: Zen Party Mix, Protein: 16.67,
 Food: Cinnamon Nut Granola, Protein: 14.55]

sorted(foods, key=lambda food: food.protein)

[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
 Food: Organic Muesli, Protein: 14.06,
 Food: Cinnamon Nut Granola, Protein: 14.55,
 Food: Zen Party Mix, Protein: 16.67,
 Food: Organic Salted Nut Mix, Protein: 17.86]

foods[0].__dict__
type(foods[0])

__main__.Food
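
An alternative to the lambda is operator.attrgetter from the standard library, which builds the same kind of key function from an attribute name. The sketch below uses a hypothetical `FoodRecord` dataclass as a stand-in for the Food class, with three rows copied from the output above.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class FoodRecord:
    """Illustrative stand-in for the Food class above."""
    product: str
    protein: float


foods = [
    FoodRecord("Banana Chips Sweetened (Whole)", 3.57),
    FoodRecord("Organic Salted Nut Mix", 17.86),
    FoodRecord("Organic Muesli", 14.06),
]

# attrgetter("protein") behaves like: lambda food: food.protein
by_protein_desc = sorted(foods, key=attrgetter("protein"), reverse=True)

# The same key works with max() and min()
top = max(foods, key=attrgetter("protein"))

print([food.product for food in by_protein_desc])
print(top.product)
```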

13.3 Sort in pandas

Sort by One Column: Carbohydrates

df.sort_values(by=["carbohydrates_100g"], ascending=False).head(5)

	carbohydrates_100g	sugars_100g	salt_100g	energy_100g	product
42012	100.0	85.71	0.000	1700.0	Spongebob Squarepants Valentine Candy Card Kit
31827	100.0	80.00	0.000	1700.0	Marvel Avengers Assemble, Classroom Candy Mail...
31661	100.0	100.00	0.000	1700.0	White Crystal Sugar
31665	100.0	0.00	0.254	1700.0	Dried Habanero Chiles
42366	100.0	88.89	0.000	1700.0	Iced Tea Mix, Lemon

Sort by Two Columns: Fat, Salt

df.sort_values(by=["fat_100g", "salt_100g"], ascending=[False, False]).head(10)

	fat_100g	carbohydrates_100g	sugars_100g	proteins_100g	salt_100g	energy_100g	product
8390	100.0	20.00	0.00	0.00	1.524	4240.00	Horseradish Sauce
44709	100.0	17.86	3.57	10.71	0.381	4385.69	Roasted Pecans
295	100.0	0.00	0.00	0.00	0.000	3900.00	Ventura, Soybean - Peanut Frying Oil Blend
5122	100.0	0.00	0.00	0.00	0.000	3900.00	Corn Oil
5123	100.0	0.00	0.00	0.00	0.000	3900.00	Canola Oil
5124	100.0	0.00	0.00	0.00	0.000	3900.00	Vegetable Oil
5125	100.0	0.00	0.00	0.00	0.000	3900.00	Vegetable Shortening
5671	100.0	0.00	0.00	0.00	0.000	3900.00	Organic Coconut Oil
5797	100.0	0.00	0.00	0.00	0.000	3900.00	Premium Sesame Oil (100% Pure)
5798	100.0	0.00	0.00	0.00	0.000	3900.00	Sesame Oil
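
The ascending list is matched to the by list position by position, so each column can have its own direction. The sketch below builds a small DataFrame from four rows shown earlier in this lesson (the frame name `sample_df` is illustrative) and sorts fat from high to low while breaking ties with salt from low to high.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "product": ["Organic Muesli", "Horseradish Sauce",
                "Corn Oil", "Roasted Pecans"],
    "fat_100g": [18.75, 100.0, 100.0, 100.0],
    "salt_100g": [0.1397, 1.524, 0.0, 0.381],
})

# First key descending, second key ascending
ranked = sample_df.sort_values(
    by=["fat_100g", "salt_100g"],
    ascending=[False, True],
)

print(ranked["product"].tolist())
```

sort_values returns a new DataFrame and keeps the original index labels, so `sample_df` itself is unchanged.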

Groupby

def high_protein(value):
  """Labels a proteins_100g value as high protein (above 80) or low protein"""
  
  if value > 80:
    return "high_protein"
  return "low_protein"

df["high_protein"] = df["proteins_100g"].apply(high_protein)
df.head()

	fat_100g	carbohydrates_100g	sugars_100g	proteins_100g	salt_100g	energy_100g	product	high_protein
0	28.57	64.29	14.29	3.57	0.00000	2267.85	Banana Chips Sweetened (Whole)	low_protein
2	57.14	17.86	3.57	17.86	1.22428	2835.70	Organic Salted Nut Mix	low_protein
3	18.75	57.81	15.62	14.06	0.13970	1953.04	Organic Muesli	low_protein
4	36.67	36.67	3.33	16.67	1.60782	2336.91	Zen Party Mix	low_protein
5	18.18	60.00	21.82	14.55	0.02286	1976.37	Cinnamon Nut Granola	low_protein

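# numeric_only=True skips the text column "product", which pandas 2.0+ no longer drops automatically
df.groupby("high_protein").median(numeric_only=True)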

	fat_100g	carbohydrates_100g	sugars_100g	proteins_100g	salt_100g	energy_100g
high_protein
high_protein	1.665	3.335	1.665	93.18	0.5207	1700.00
low_protein	3.170	22.390	5.880	4.00	0.6350	1121.54

df.groupby("high_protein").describe()

	carbohydrates_100g								energy_100g		...	salt_100g		sugars_100g
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
high_protein
high_protein	4.0	7.350000	10.724610	0.0	0.00	3.335	10.685	22.73	4.0	1795.09500	...	4.203065	14.77772	4.0	4.242500	6.458671	0.0	0.00	1.665	5.9075	13.64
low_protein	45022.0	34.056436	29.557504	0.0	7.44	22.390	61.540	100.00	45022.0	1111.22544	...	1.440180	2032.00000	45022.0	16.006122	21.496335	-1.2	1.57	5.880	23.0800	100.00

2 rows × 48 columns
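
The label-then-group pattern above can be tried on a small scale. The sketch below uses the five protein values from df.head() and, as an assumption made purely for illustration, a lower threshold of 15 so that both groups contain rows; `label_protein` and `sample_df` are hypothetical names.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "product": ["Banana Chips Sweetened (Whole)", "Organic Salted Nut Mix",
                "Organic Muesli", "Zen Party Mix", "Cinnamon Nut Granola"],
    "proteins_100g": [3.57, 17.86, 14.06, 16.67, 14.55],
})


def label_protein(value, threshold=15.0):
    """Return a category label; the threshold of 15 is illustrative only."""
    return "high_protein" if value > threshold else "low_protein"


sample_df["category"] = sample_df["proteins_100g"].apply(label_protein)

# Group keys come back sorted alphabetically by default (sort=True)
medians = sample_df.groupby("category")["proteins_100g"].median()
counts = sample_df.groupby("category").size()

print(medians)
print(counts)
```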