Lesson 13 Sorting
Pragmatic AI Labs
This notebook was produced by Pragmatic AI Labs.
13.1 Sort in Python
Understanding Sorting
Python has powerful built-in sorting
World Food Facts DataSet
- Original Data Source: https://www.kaggle.com/openfoodfacts/world-food-facts
- Modified Source: https://www.kaggle.com/lwodarzek/nutrition-table-clustering/output
Ingest
import pandas as pd
df = pd.read_csv(
"https://raw.githubusercontent.com/noahgift/food/master/data/features.en.openfoodfacts.org.products.csv")
df.drop(["Unnamed: 0", "exceeded", "g_sum", "energy_100g"], axis=1, inplace=True) # drop four columns we don't need
df = df.drop(df.index[[1,11877]]) # drop two outlier rows
df.rename(index=str, columns={"reconstructed_energy": "energy_100g"}, inplace=True)
df.head()
index | fat_100g | carbohydrates_100g | sugars_100g | proteins_100g | salt_100g | energy_100g | product |
---|---|---|---|---|---|---|---|
0 | 28.57 | 64.29 | 14.29 | 3.57 | 0.00000 | 2267.85 | Banana Chips Sweetened (Whole) |
2 | 57.14 | 17.86 | 3.57 | 17.86 | 1.22428 | 2835.70 | Organic Salted Nut Mix |
3 | 18.75 | 57.81 | 15.62 | 14.06 | 0.13970 | 1953.04 | Organic Muesli |
4 | 36.67 | 36.67 | 3.33 | 16.67 | 1.60782 | 2336.91 | Zen Party Mix |
5 | 18.18 | 60.00 | 21.82 | 14.55 | 0.02286 | 1976.37 | Cinnamon Nut Granola |
Using built-in sorting
Convert Pandas DataFrame Columns into a list
food_facts = list(df.columns.values)
food_facts
['fat_100g',
'carbohydrates_100g',
'sugars_100g',
'proteins_100g',
'salt_100g',
'energy_100g',
'product']
Alphabetical Sort
sorted(food_facts)
['carbohydrates_100g',
'energy_100g',
'fat_100g',
'product',
'proteins_100g',
'salt_100g',
'sugars_100g']
Reverse Alphabetical Sort
sorted(food_facts, reverse=True)
['sugars_100g',
'salt_100g',
'proteins_100g',
'product',
'fat_100g',
'energy_100g',
'carbohydrates_100g']
Using the built-in list sort method
The sort() method only works on a list. It sorts the list in place and returns None.
food_facts = list(df.columns.values)
print(f"Before sort: {food_facts}")
food_facts.sort()
print(f"After sort: {food_facts}")
Before sort: ['fat_100g', 'carbohydrates_100g', 'sugars_100g', 'proteins_100g', 'salt_100g', 'energy_100g', 'product']
After sort: ['carbohydrates_100g', 'energy_100g', 'fat_100g', 'product', 'proteins_100g', 'salt_100g', 'sugars_100g']
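The contrast between the two approaches can be sketched on a small list. The `columns` list below is illustrative (a few column names copied from the DataFrame above), not part of the original notebook.

```python
# Illustrative list; values mirror a few column names from the DataFrame above.
columns = ["fat_100g", "carbohydrates_100g", "sugars_100g"]

# sorted() accepts any iterable and returns a new list; the input is unchanged.
new_list = sorted(columns)
print(new_list)   # ['carbohydrates_100g', 'fat_100g', 'sugars_100g']
print(columns)    # ['fat_100g', 'carbohydrates_100g', 'sugars_100g']

# sorted() also works on non-list iterables such as tuples.
from_tuple = sorted(("b", "c", "a"))
print(from_tuple)  # ['a', 'b', 'c']

# list.sort() reorders the list in place and returns None.
result = columns.sort()
print(result)     # None
print(columns)    # ['carbohydrates_100g', 'fat_100g', 'sugars_100g']
```

Because sort() returns None, writing `columns = columns.sort()` would discard the list, which is a common mistake.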
Timing built-in sort function vs list sort method
list method
food_facts = list(df.columns.values)
%%timeit -n 3 -r 3
food_facts.sort()
3 loops, best of 3: 489 ns per loop
built-in function
food_facts = list(df.columns.values)
%%timeit -n 3 -r 3
sorted(food_facts)
3 loops, best of 3: 656 ns per loop
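One caveat when timing the in-place method: after the first loop the list is already sorted, so later loops measure sorting sorted input. A minimal sketch with the standard-library timeit module, assuming a fresh copy per run is an acceptable overhead, keeps the input the same for every run. The helper names `time_list_sort` and `time_sorted` and the loop count are hypothetical choices for this illustration.

```python
import timeit

# Illustrative unsorted list; same column names as the DataFrame above.
food_facts = ["fat_100g", "carbohydrates_100g", "sugars_100g",
              "proteins_100g", "salt_100g", "energy_100g", "product"]

def time_list_sort():
    # Copy first so every run sorts the same unsorted input.
    items = list(food_facts)
    items.sort()
    return items

def time_sorted():
    # sorted() never modifies its input, so no explicit copy is needed.
    return sorted(food_facts)

number = 10_000  # assumed loop count; adjust as needed
sort_seconds = timeit.timeit(time_list_sort, number=number)
sorted_seconds = timeit.timeit(time_sorted, number=number)
print(f"list.sort() on a copy: {sort_seconds / number * 1e9:.0f} ns per loop")
print(f"sorted():              {sorted_seconds / number * 1e9:.0f} ns per loop")
```

Both variants now pay for building a new list, so any remaining gap is small and will vary by machine and Python version.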
Sorting a Dictionary
Convert the first row to a dictionary
food_facts_row = df.head(1).to_dict()
food_facts_row
{'carbohydrates_100g': {'0': 64.29},
'energy_100g': {'0': 2267.85},
'fat_100g': {'0': 28.57},
'product': {'0': 'Banana Chips Sweetened (Whole)'},
'proteins_100g': {'0': 3.57},
'salt_100g': {'0': 0.0},
'sugars_100g': {'0': 14.29}}
Reverse sort the dictionary keys (sorted() on a dictionary iterates over its keys and returns them as a list)
sorted(food_facts_row, reverse=True)
['sugars_100g',
'salt_100g',
'proteins_100g',
'product',
'fat_100g',
'energy_100g',
'carbohydrates_100g']
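Since sorted() on a dictionary only looks at keys, sorting by value needs the (key, value) pairs and a key function. The following is a sketch using a flat, illustrative `nutrients` dictionary built from the first row's values shown above; the name is not from the original notebook.

```python
# Illustrative nutrient values, taken from the first row shown above.
nutrients = {
    "fat_100g": 28.57,
    "carbohydrates_100g": 64.29,
    "sugars_100g": 14.29,
    "proteins_100g": 3.57,
    "salt_100g": 0.0,
}

# Iterating a dict yields its keys, so sorted() returns the sorted keys.
by_key = sorted(nutrients)

# Sort (key, value) pairs by value, largest first.
by_value = sorted(nutrients.items(), key=lambda pair: pair[1], reverse=True)

# Rebuild a dict; dicts preserve insertion order in Python 3.7+.
ordered = dict(by_value)
print(by_key)
print(list(ordered))
# ['carbohydrates_100g', 'fat_100g', 'sugars_100g', 'proteins_100g', 'salt_100g']
```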
df["product"].head().values
array(['Banana Chips Sweetened (Whole)', 'Organic Salted Nut Mix',
'Organic Muesli', 'Zen Party Mix', 'Cinnamon Nut Granola'],
dtype=object)
Sorting A Generator Pipeline
def dataframe_rows(df=df, column="product", chunks=10):
    count_row = df.shape[0]
    rows = list(df[column].values)
    for i in range(0, count_row, chunks):
        yield rows[i:i + chunks]
rows = dataframe_rows()
next(rows)
['Banana Chips Sweetened (Whole)',
'Organic Salted Nut Mix',
'Organic Muesli',
'Zen Party Mix',
'Cinnamon Nut Granola',
'Organic Hazelnuts',
'Organic Oat Groats',
'Energy Power Mix',
'Antioxidant Mix - Berries & Chocolate',
'Organic Quinoa Coconut Granola With Mango']
next(rows)
['Fire Roasted Hatch Green Chile Almonds',
'Peanut Butter Power Chews',
'Organic Unswt Berry Coconut Granola',
'Roasted Salted Black Pepper Cashews',
'Thai Curry Roasted Cashews',
'Wasabi Tamari Almonds',
'Organic Red Quinoa',
'Dark Chocolate Coconut Chews',
'Organic Unsweetened Granola, Cinnamon Almond',
'Organic Blueberry Almond Granola']
sorted_row = (sorted(row) for row in rows)
print(next(sorted_row))
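The same pipeline idea can be sketched without the DataFrame. The `chunked` helper and the short `products` list below are hypothetical stand-ins for `dataframe_rows` and the product column; the point is that each chunk is sorted lazily, only when requested.

```python
def chunked(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Illustrative product names taken from the output above.
products = ["Organic Muesli", "Zen Party Mix", "Cinnamon Nut Granola",
            "Banana Chips Sweetened (Whole)", "Energy Power Mix"]

# Generator expression: each chunk is sorted only when it is requested.
sorted_chunks = (sorted(chunk) for chunk in chunked(products, 2))

first = next(sorted_chunks)
print(first)  # ['Organic Muesli', 'Zen Party Mix']

# Consuming the rest of the generator sorts the remaining chunks.
remaining = list(sorted_chunks)
print(remaining)
```

Each chunk is sorted independently, so the stream as a whole is not globally ordered. If a single sorted stream is needed, heapq.merge from the standard library can combine already-sorted chunks.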
13.2 Create custom sorting functions
Building a Shuffle Function
food_items = ['Chocolate Nut Crunch', 'Cranberries', 'Curry Lentil Soup Mix',
'Milk Chocolate Peanut Butter Malt Balls', 'Organic Harvest Pilaf',
'Organic Tamari Pumpkin Seed', 'Split Pea Soup Mix',
'Swiss-Style Muesli', "Whole Wheat 'N Honey Fig Bars",
'Yogurt Pretzels']
from random import sample
def shuffle_list(items):
    """Randomly Shuffles List"""
    shuffled = sample(items, len(items))
    return shuffled
shuffled_food_items = shuffle_list(food_items)
shuffled_food_items
['Milk Chocolate Peanut Butter Malt Balls',
'Organic Harvest Pilaf',
'Curry Lentil Soup Mix',
'Yogurt Pretzels',
'Organic Tamari Pumpkin Seed',
'Chocolate Nut Crunch',
"Whole Wheat 'N Honey Fig Bars",
'Split Pea Soup Mix',
'Cranberries',
'Swiss-Style Muesli']
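A shuffled result differs on every run, which makes examples hard to reproduce. The sketch below, assuming a fixed seed is acceptable for demonstration, uses a seeded random.Random instance and contrasts sample() (new list) with shuffle() (in place). The `items` list and the seed value are illustrative.

```python
import random

# Illustrative items; any list works.
items = ["Cranberries", "Yogurt Pretzels", "Organic Muesli", "Zen Party Mix"]

# A seeded Random instance makes the shuffle repeatable (seed value is arbitrary).
rng = random.Random(42)

# sample() returns a new shuffled list and leaves the input untouched.
shuffled_copy = rng.sample(items, len(items))

# shuffle() reorders a list in place and returns None.
in_place = list(items)
rng.shuffle(in_place)

print(shuffled_copy)
print(in_place)
```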
Custom Sort Functions
Highly Customized Sort
def best_snack(item):
    if item == "Chocolate Nut Crunch":
        return 1
    return len(item)
sorted(shuffled_food_items, key=best_snack)
['Chocolate Nut Crunch',
'Cranberries',
'Yogurt Pretzels',
'Split Pea Soup Mix',
'Swiss-Style Muesli',
'Organic Harvest Pilaf',
'Curry Lentil Soup Mix',
'Organic Tamari Pumpkin Seed',
"Whole Wheat 'N Honey Fig Bars",
'Milk Chocolate Peanut Butter Malt Balls']
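In the output above, items with equal key values (for example the two 18-character names) keep their shuffled order, because Python's sort is stable. A key function can return a tuple to break such ties explicitly. The `length_then_name` function below is a hypothetical example, not part of the original notebook.

```python
food_items = ["Swiss-Style Muesli", "Split Pea Soup Mix", "Cranberries",
              "Chocolate Nut Crunch", "Yogurt Pretzels"]

# Stable sort: the two 18-character names keep their input order.
by_length = sorted(food_items, key=len)

def length_then_name(item):
    """Sort key: shorter names first, ties broken alphabetically."""
    return (len(item), item)

# Tuples compare element by element, so the name decides ties on length.
ranked = sorted(food_items, key=length_then_name)

print(by_length)
print(ranked)
```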
Sorting Objects
class Food:
    def __init__(self, product, protein):
        self.product = product
        self.protein = protein

    def __repr__(self):
        return f"Food: {self.product}, Protein: {self.protein}"
pairs = df[["product", "proteins_100g"]].head().values.tolist()
pairs
[['Banana Chips Sweetened (Whole)', 3.57],
['Organic Salted Nut Mix', 17.86],
['Organic Muesli', 14.06],
['Zen Party Mix', 16.67],
['Cinnamon Nut Granola', 14.55]]
pairs = df[["product", "proteins_100g"]].head().values.tolist()
foods = [Food(item[0], item[1]) for item in pairs]
foods
[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
Food: Organic Salted Nut Mix, Protein: 17.86,
Food: Organic Muesli, Protein: 14.06,
Food: Zen Party Mix, Protein: 16.67,
Food: Cinnamon Nut Granola, Protein: 14.55]
sorted(foods, key=lambda food: food.protein)
[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
Food: Organic Muesli, Protein: 14.06,
Food: Cinnamon Nut Granola, Protein: 14.55,
Food: Zen Party Mix, Protein: 16.67,
Food: Organic Salted Nut Mix, Protein: 17.86]
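An alternative to the lambda is operator.attrgetter from the standard library, sketched here with reverse=True to put the highest protein first. The Food class is redefined so the example runs on its own, and the three sample foods reuse values from the output above.

```python
from operator import attrgetter

class Food:
    """Same shape as the Food class above, redefined so this sketch runs alone."""
    def __init__(self, product, protein):
        self.product = product
        self.protein = protein

    def __repr__(self):
        return f"Food: {self.product}, Protein: {self.protein}"

foods = [Food("Organic Muesli", 14.06),
         Food("Banana Chips Sweetened (Whole)", 3.57),
         Food("Zen Party Mix", 16.67)]

# attrgetter("protein") behaves like lambda food: food.protein.
highest_first = sorted(foods, key=attrgetter("protein"), reverse=True)
print(highest_first)
```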
foods[0].__dict__
type(foods[0])
__main__.Food
13.3 Sort in pandas
Sort by One Column: Carbohydrates
df.sort_values(by=["carbohydrates_100g"], ascending=False).head(5)
index | fat_100g | carbohydrates_100g | sugars_100g | proteins_100g | salt_100g | energy_100g | product |
---|---|---|---|---|---|---|---|
42012 | 0.0 | 100.0 | 85.71 | 0.0 | 0.000 | 1700.0 | Spongebob Squarepants Valentine Candy Card Kit |
31827 | 0.0 | 100.0 | 80.00 | 0.0 | 0.000 | 1700.0 | Marvel Avengers Assemble, Classroom Candy Mail... |
31661 | 0.0 | 100.0 | 100.00 | 0.0 | 0.000 | 1700.0 | White Crystal Sugar |
31665 | 0.0 | 100.0 | 0.00 | 0.0 | 0.254 | 1700.0 | Dried Habanero Chiles |
42366 | 0.0 | 100.0 | 88.89 | 0.0 | 0.000 | 1700.0 | Iced Tea Mix, Lemon |
Sort by Two Columns: Fat, Salt
df.sort_values(by=["fat_100g", "salt_100g"], ascending=[False, False]).head(10)
index | fat_100g | carbohydrates_100g | sugars_100g | proteins_100g | salt_100g | energy_100g | product |
---|---|---|---|---|---|---|---|
8390 | 100.0 | 20.00 | 0.00 | 0.00 | 1.524 | 4240.00 | Horseradish Sauce |
44709 | 100.0 | 17.86 | 3.57 | 10.71 | 0.381 | 4385.69 | Roasted Pecans |
295 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Ventura, Soybean - Peanut Frying Oil Blend |
5122 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Corn Oil |
5123 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Canola Oil |
5124 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Vegetable Oil |
5125 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Vegetable Shortening |
5671 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Organic Coconut Oil |
5797 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Premium Sesame Oil (100% Pure) |
5798 | 100.0 | 0.00 | 0.00 | 0.00 | 0.000 | 3900.00 | Sesame Oil |
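The `ascending` list lines up with the `by` list, so each column can have its own direction. A minimal sketch on a tiny frame follows; the `snacks` DataFrame and its values are made up for illustration and are not from the dataset.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not taken from the dataset.
snacks = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "fat_100g": [100.0, 50.0, 100.0, 50.0],
    "salt_100g": [0.0, 2.0, 1.5, 0.5],
})

# Fat descending first; within equal fat, salt ascending.
ordered = snacks.sort_values(by=["fat_100g", "salt_100g"],
                             ascending=[False, True])

print(ordered["product"].tolist())  # ['A', 'C', 'D', 'B']
print(ordered.index.tolist())       # [0, 2, 3, 1]
```

sort_values returns a new DataFrame and keeps the original index labels, which is why the index values in the sorted output above appear out of order.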
Groupby
def high_protein(row):
    """Creates a high or low protein category"""
    if row > 80:
        return "high_protein"
    return "low_protein"
df["high_protein"] = df["proteins_100g"].apply(high_protein)
df.head()
index | fat_100g | carbohydrates_100g | sugars_100g | proteins_100g | salt_100g | energy_100g | product | high_protein |
---|---|---|---|---|---|---|---|---|
0 | 28.57 | 64.29 | 14.29 | 3.57 | 0.00000 | 2267.85 | Banana Chips Sweetened (Whole) | low_protein |
2 | 57.14 | 17.86 | 3.57 | 17.86 | 1.22428 | 2835.70 | Organic Salted Nut Mix | low_protein |
3 | 18.75 | 57.81 | 15.62 | 14.06 | 0.13970 | 1953.04 | Organic Muesli | low_protein |
4 | 36.67 | 36.67 | 3.33 | 16.67 | 1.60782 | 2336.91 | Zen Party Mix | low_protein |
5 | 18.18 | 60.00 | 21.82 | 14.55 | 0.02286 | 1976.37 | Cinnamon Nut Granola | low_protein |
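The same labeling can be done without calling a Python function per value. The sketch below uses numpy.where as a vectorized alternative, keeping the 80 threshold from high_protein(). The `df_small` frame and its protein values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up protein values for illustration only.
df_small = pd.DataFrame({"proteins_100g": [3.57, 93.18, 80.0, 85.0]})

# Vectorized equivalent of applying high_protein() to each value:
# strictly greater than 80 is "high_protein", everything else "low_protein".
df_small["high_protein"] = np.where(df_small["proteins_100g"] > 80,
                                    "high_protein", "low_protein")

# Median protein per category.
medians = df_small.groupby("high_protein")["proteins_100g"].median()
print(df_small)
print(medians)
```

A value of exactly 80.0 falls in the low_protein group because the comparison is strict.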
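df.groupby("high_protein").median(numeric_only=True) # numeric_only skips the text "product" column, which newer pandas versions would otherwise reject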
high_protein | fat_100g | carbohydrates_100g | sugars_100g | proteins_100g | salt_100g | energy_100g |
---|---|---|---|---|---|---|
high_protein | 1.665 | 3.335 | 1.665 | 93.18 | 0.5207 | 1700.00 |
low_protein | 3.170 | 22.390 | 5.880 | 4.00 | 0.6350 | 1121.54 |
df.groupby("high_protein").describe()
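high_protein | carbohydrates_100g count | carbohydrates_100g mean | carbohydrates_100g std | carbohydrates_100g min | carbohydrates_100g 25% | carbohydrates_100g 50% | carbohydrates_100g 75% | carbohydrates_100g max | energy_100g count | energy_100g mean | ... | salt_100g 75% | salt_100g max | sugars_100g count | sugars_100g mean | sugars_100g std | sugars_100g min | sugars_100g 25% | sugars_100g 50% | sugars_100g 75% | sugars_100g max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|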
high_protein | 4.0 | 7.350000 | 10.724610 | 0.0 | 0.00 | 3.335 | 10.685 | 22.73 | 4.0 | 1795.09500 | ... | 4.203065 | 14.77772 | 4.0 | 4.242500 | 6.458671 | 0.0 | 0.00 | 1.665 | 5.9075 | 13.64 |
low_protein | 45022.0 | 34.056436 | 29.557504 | 0.0 | 7.44 | 22.390 | 61.540 | 100.00 | 45022.0 | 1111.22544 | ... | 1.440180 | 2032.00000 | 45022.0 | 16.006122 | 21.496335 | -1.2 | 1.57 | 5.880 | 23.0800 | 100.00 |
2 rows × 48 columns