Lesson 13 Sorting

Pragmatic AI Labs

This notebook was produced by Pragmatic AI Labs.

13.1 Sort in Python

Understanding Sorting

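Python has powerful built-in sorting: the sorted() function accepts any iterable and returns a new list, while the list.sort() method sorts a list in place.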

World Food Facts Dataset

  • Original Data Source: https://www.kaggle.com/openfoodfacts/world-food-facts
  • Modified Source: https://www.kaggle.com/lwodarzek/nutrition-table-clustering/output

Ingest
import pandas as pd
df = pd.read_csv(
    "https://raw.githubusercontent.com/noahgift/food/master/data/features.en.openfoodfacts.org.products.csv")
df.drop(["Unnamed: 0", "exceeded", "g_sum", "energy_100g"], axis=1, inplace=True)  # drop four columns we don't need
df = df.drop(df.index[[1, 11877]])  # drop two outlier rows
df.rename(index=str, columns={"reconstructed_energy": "energy_100g"}, inplace=True)
df.head()
fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g energy_100g product
0 28.57 64.29 14.29 3.57 0.00000 2267.85 Banana Chips Sweetened (Whole)
2 57.14 17.86 3.57 17.86 1.22428 2835.70 Organic Salted Nut Mix
3 18.75 57.81 15.62 14.06 0.13970 1953.04 Organic Muesli
4 36.67 36.67 3.33 16.67 1.60782 2336.91 Zen Party Mix
5 18.18 60.00 21.82 14.55 0.02286 1976.37 Cinnamon Nut Granola

Using built-in sorting

Convert Pandas DataFrame Columns into a list

food_facts = list(df.columns.values)
food_facts
['fat_100g',
 'carbohydrates_100g',
 'sugars_100g',
 'proteins_100g',
 'salt_100g',
 'energy_100g',
 'product']
Alphabetical Sort
sorted(food_facts)
['carbohydrates_100g',
 'energy_100g',
 'fat_100g',
 'product',
 'proteins_100g',
 'salt_100g',
 'sugars_100g']
Reverse Alphabetical Sort
sorted(food_facts, reverse=True)
['sugars_100g',
 'salt_100g',
 'proteins_100g',
 'product',
 'fat_100g',
 'energy_100g',
 'carbohydrates_100g']
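
The column names above are all lowercase, so plain alphabetical order works. With mixed-case strings, the default ordering compares code points, which places every uppercase letter before any lowercase one. A minimal sketch, assuming some hypothetical mixed-case labels (they are not in the dataset), shows how `key=str.lower` gives a case-insensitive sort:

```python
# Hypothetical mixed-case labels, used only for illustration.
labels = ["salt_100g", "Fat_100g", "energy_100g", "Product"]

# Default ordering: uppercase "F" and "P" come before all lowercase letters.
default_order = sorted(labels)

# key=str.lower compares lowercased copies; the original strings are returned.
case_insensitive = sorted(labels, key=str.lower)

print(default_order)
print(case_insensitive)
```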
Using the built-in list sort method

The sort() method exists only on lists. It sorts the list in place and returns None.

food_facts = list(df.columns.values)
print(f"Before sort: {food_facts}")
food_facts.sort()
print(f"After sort: {food_facts}")

Before sort: ['fat_100g', 'carbohydrates_100g', 'sugars_100g', 'proteins_100g', 'salt_100g', 'energy_100g', 'product']
After sort: ['carbohydrates_100g', 'energy_100g', 'fat_100g', 'product', 'proteins_100g', 'salt_100g', 'sugars_100g']
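
The difference between the two approaches can be sketched with a tuple of column names (a shortened, illustrative subset of the columns above): `sorted()` accepts any iterable and returns a new list, while `list.sort()` changes the list it is called on and returns `None`.

```python
columns = ("fat_100g", "carbohydrates_100g", "sugars_100g")  # a tuple, not a list

# sorted() works on any iterable and leaves the input untouched.
ordered = sorted(columns)

# list.sort() reorders the list in place; its return value is None.
as_list = list(columns)
result = as_list.sort()

print(ordered)
print(as_list)
print(result)
```

A common mistake is writing `items = items.sort()`, which replaces the list with `None`.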

Timing built-in sort function vs list sort method

list method

food_facts = list(df.columns.values)
%%timeit -n 3 -r 3
food_facts.sort()


3 loops, best of 3: 489 ns per loop

built in function

food_facts = list(df.columns.values)
%%timeit -n 3 -r 3
sorted(food_facts)
3 loops, best of 3: 656 ns per loop
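
One caveat with timing `food_facts.sort()` directly: after the first loop the list is already sorted, so later loops measure sorting of sorted input. Outside a notebook, a rough comparison can be sketched with the standard `timeit` module. The list of 1,000 random floats is an assumption chosen for illustration, and each timed call works from the same unsorted data; absolute numbers will vary by machine.

```python
import random
import timeit

random.seed(0)
values = [random.random() for _ in range(1_000)]  # illustrative data, not the food dataset

# Copy inside the timed call so the in-place sort always starts from unsorted data.
# Note that this timing therefore includes the cost of the copy.
in_place_seconds = timeit.timeit(lambda: list(values).sort(), number=200)

# sorted() builds and returns a new list on every call.
new_list_seconds = timeit.timeit(lambda: sorted(values), number=200)

print(f"list.sort(): {in_place_seconds:.4f} s for 200 runs")
print(f"sorted():    {new_list_seconds:.4f} s for 200 runs")
```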

Sorting Dictionary

Convert the first row to a dictionary

food_facts_row = df.head(1).to_dict()
food_facts_row
{'carbohydrates_100g': {'0': 64.29},
 'energy_100g': {'0': 2267.85},
 'fat_100g': {'0': 28.57},
 'product': {'0': 'Banana Chips Sweetened (Whole)'},
 'proteins_100g': {'0': 3.57},
 'salt_100g': {'0': 0.0},
 'sugars_100g': {'0': 14.29}}

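Reverse sort the dictionary keys

sorted() iterates over a dictionary's keys, so the result is a list of keys, not a dictionary.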

sorted(food_facts_row, reverse=True)
['sugars_100g',
 'salt_100g',
 'proteins_100g',
 'product',
 'fat_100g',
 'energy_100g',
 'carbohydrates_100g']
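
Sorting by value rather than by key is a common follow-up. A minimal sketch, using a flat dictionary with the numeric values copied from the first row above (the nested `{'0': ...}` layer is dropped for readability):

```python
# Nutrient values for "Banana Chips Sweetened (Whole)", flattened for illustration.
nutrients = {
    "fat_100g": 28.57,
    "carbohydrates_100g": 64.29,
    "sugars_100g": 14.29,
    "proteins_100g": 3.57,
    "salt_100g": 0.0,
}

# items() yields (key, value) pairs; the key function selects the value to sort on.
by_value = sorted(nutrients.items(), key=lambda pair: pair[1], reverse=True)

# Dicts keep insertion order (Python 3.7+), so this rebuilds a value-ordered dict.
ordered = dict(by_value)
print(list(ordered))
```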
df["product"].head().values
array(['Banana Chips Sweetened (Whole)', 'Organic Salted Nut Mix',
       'Organic Muesli', 'Zen Party Mix', 'Cinnamon Nut Granola'],
      dtype=object)

Sorting A Generator Pipeline

def dataframe_rows(df=df, column="product", chunks=10):
    """Yield the values of `column` as lists of up to `chunks` items."""

    count_row = df.shape[0]
    rows = list(df[column].values)
    for i in range(0, count_row, chunks):
        yield rows[i:i + chunks]


rows = dataframe_rows()
next(rows)

['Banana Chips Sweetened (Whole)',
 'Organic Salted Nut Mix',
 'Organic Muesli',
 'Zen Party Mix',
 'Cinnamon Nut Granola',
 'Organic Hazelnuts',
 'Organic Oat Groats',
 'Energy Power Mix',
 'Antioxidant Mix - Berries & Chocolate',
 'Organic Quinoa Coconut Granola With Mango']
next(rows)
['Fire Roasted Hatch Green Chile Almonds',
 'Peanut Butter Power Chews',
 'Organic Unswt Berry Coconut Granola',
 'Roasted Salted Black Pepper Cashews',
 'Thai Curry Roasted Cashews',
 'Wasabi Tamari Almonds',
 'Organic Red Quinoa',
 'Dark Chocolate Coconut Chews',
 'Organic Unsweetened Granola, Cinnamon Almond',
 'Organic Blueberry Almond Granola']
# Lazily sort each remaining chunk; nothing is sorted until next() is called
sorted_row = (sorted(row) for row in rows)
print(next(sorted_row))
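
The same pipeline can be sketched without pandas. The helper name `chunked` and the short product list are assumptions made for illustration. Each chunk is sorted independently, and only when it is requested, so the overall sequence is not globally sorted.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


products = [
    "Zen Party Mix",
    "Organic Muesli",
    "Energy Power Mix",
    "Banana Chips Sweetened (Whole)",
    "Cinnamon Nut Granola",
]

# A generator expression: no chunk is sorted until it is consumed.
sorted_chunks = (sorted(chunk) for chunk in chunked(products, 2))

first = next(sorted_chunks)   # sorts only the first chunk
rest = list(sorted_chunks)    # consumes and sorts the remaining chunks

print(first)
print(rest)
```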

13.2 Create custom sorting functions

Building a Shuffle Function

food_items = ['Chocolate Nut Crunch', 'Cranberries', 'Curry Lentil Soup Mix', 
                'Milk Chocolate Peanut Butter Malt Balls', 'Organic Harvest Pilaf', 
                'Organic Tamari Pumpkin Seed', 'Split Pea Soup Mix', 
                'Swiss-Style Muesli', "Whole Wheat 'N Honey Fig Bars", 
                'Yogurt Pretzels']

from random import sample

def shuffle_list(items):
  """Return a new list with the items in random order; the input is unchanged."""

  shuffled = sample(items, len(items))
  return shuffled
  
shuffled_food_items = shuffle_list(food_items)
shuffled_food_items
['Milk Chocolate Peanut Butter Malt Balls',
 'Organic Harvest Pilaf',
 'Curry Lentil Soup Mix',
 'Yogurt Pretzels',
 'Organic Tamari Pumpkin Seed',
 'Chocolate Nut Crunch',
 "Whole Wheat 'N Honey Fig Bars",
 'Split Pea Soup Mix',
 'Cranberries',
 'Swiss-Style Muesli']
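
For comparison, the standard library also offers `random.shuffle`, which reorders a list in place instead of returning a copy. The sketch below uses a seeded `random.Random` instance so the result is repeatable; the seed value 42 and the shortened snack list are arbitrary choices for illustration.

```python
import random

snacks = ["Cranberries", "Yogurt Pretzels", "Organic Harvest Pilaf", "Split Pea Soup Mix"]

rng = random.Random(42)  # seeded generator, so reruns give the same order

# sample() returns a new list, as shuffle_list does; snacks is not modified.
shuffled_copy = rng.sample(snacks, len(snacks))

# shuffle() reorders the list it is given and returns None.
in_place = list(snacks)
rng.shuffle(in_place)

print(shuffled_copy)
print(in_place)
```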

Custom Sort Functions

Highly Customized Sort

def best_snack(item):
  """Sort key: rank "Chocolate Nut Crunch" first, then order by name length."""
  if item == "Chocolate Nut Crunch":
    return 1
  return len(item)

sorted(shuffled_food_items, key=best_snack)
['Chocolate Nut Crunch',
 'Cranberries',
 'Yogurt Pretzels',
 'Split Pea Soup Mix',
 'Swiss-Style Muesli',
 'Organic Harvest Pilaf',
 'Curry Lentil Soup Mix',
 'Organic Tamari Pumpkin Seed',
 "Whole Wheat 'N Honey Fig Bars",
 'Milk Chocolate Peanut Butter Malt Balls']
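
In the output above, items with the same length keep their shuffled order, because Python's sort is stable. When ties should be broken deterministically, a key function can return a tuple, which is compared element by element. A minimal sketch with a hypothetical key named `favorite_then_length` and a shortened list:

```python
snacks = [
    "Swiss-Style Muesli",
    "Cranberries",
    "Split Pea Soup Mix",
    "Chocolate Nut Crunch",
    "Yogurt Pretzels",
]


def favorite_then_length(item):
    """Hypothetical key: favorite first, then shorter names, then alphabetical."""
    is_not_favorite = item != "Chocolate Nut Crunch"  # False sorts before True
    return (is_not_favorite, len(item), item)


ranked = sorted(snacks, key=favorite_then_length)
print(ranked)
```

The two 18-character names tie on length, so the third tuple element decides their order alphabetically.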

Sorting Objects

class Food:
  def __init__(self, product, protein):
    self.product = product
    self.protein = protein
  def __repr__(self):
    return f"Food: {self.product}, Protein: {self.protein}"
pairs = df[["product", "proteins_100g"]].head().values.tolist()
pairs
[['Banana Chips Sweetened (Whole)', 3.57],
 ['Organic Salted Nut Mix', 17.86],
 ['Organic Muesli', 14.06],
 ['Zen Party Mix', 16.67],
 ['Cinnamon Nut Granola', 14.55]]
pairs = df[["product", "proteins_100g"]].head().values.tolist()
foods = [Food(item[0], item[1]) for item in pairs]
foods
[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
 Food: Organic Salted Nut Mix, Protein: 17.86,
 Food: Organic Muesli, Protein: 14.06,
 Food: Zen Party Mix, Protein: 16.67,
 Food: Cinnamon Nut Granola, Protein: 14.55]
sorted(foods, key=lambda food: food.protein)

[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
 Food: Organic Muesli, Protein: 14.06,
 Food: Cinnamon Nut Granola, Protein: 14.55,
 Food: Zen Party Mix, Protein: 16.67,
 Food: Organic Salted Nut Mix, Protein: 17.86]
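
An alternative to the lambda is `operator.attrgetter`, which builds the same kind of key function from an attribute name. The sketch below redefines the `Food` class so it runs on its own and uses three of the products shown above; `reverse=True` puts the highest protein first, and `max` accepts the same key.

```python
from operator import attrgetter


class Food:
    def __init__(self, product, protein):
        self.product = product
        self.protein = protein

    def __repr__(self):
        return f"Food: {self.product}, Protein: {self.protein}"


foods = [
    Food("Banana Chips Sweetened (Whole)", 3.57),
    Food("Organic Salted Nut Mix", 17.86),
    Food("Organic Muesli", 14.06),
]

# attrgetter("protein") behaves like lambda food: food.protein
by_protein_desc = sorted(foods, key=attrgetter("protein"), reverse=True)

# max() and min() accept the same key argument as sorted().
top = max(foods, key=attrgetter("protein"))

print(by_protein_desc)
print(top)
```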
foods[0].__dict__
type(foods[0])
__main__.Food

13.3 Sort in pandas

Sort by One Column: Carbohydrates

df.sort_values(by=["carbohydrates_100g"], ascending=False).head(5)
fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g energy_100g product
42012 0.0 100.0 85.71 0.0 0.000 1700.0 Spongebob Squarepants Valentine Candy Card Kit
31827 0.0 100.0 80.00 0.0 0.000 1700.0 Marvel Avengers Assemble, Classroom Candy Mail...
31661 0.0 100.0 100.00 0.0 0.000 1700.0 White Crystal Sugar
31665 0.0 100.0 0.00 0.0 0.254 1700.0 Dried Habanero Chiles
42366 0.0 100.0 88.89 0.0 0.000 1700.0 Iced Tea Mix, Lemon
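
When only the top few rows are needed, `DataFrame.nlargest` is an alternative to sorting and then calling `head`. A minimal sketch on a three-row sample whose values are copied from the `df.head()` output earlier in this lesson (the name `sample_df` is an assumption):

```python
import pandas as pd

# Three rows copied from the head of the dataset, for illustration only.
sample_df = pd.DataFrame({
    "product": ["Banana Chips Sweetened (Whole)", "Organic Salted Nut Mix", "Organic Muesli"],
    "carbohydrates_100g": [64.29, 17.86, 57.81],
})

# sort_values returns a new DataFrame; the original index labels travel with the rows.
top_sorted = sample_df.sort_values(by="carbohydrates_100g", ascending=False).head(2)

# nlargest selects the n rows with the largest values in the given column.
top_nlargest = sample_df.nlargest(2, "carbohydrates_100g")

print(top_sorted)
print(top_nlargest)
```

With tied values, as in the 100.0 rows above, the two approaches may not return the tied rows in the same order.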

Sort by Two Columns: Fat, Salt

df.sort_values(by=["fat_100g", "salt_100g"], ascending=[False, False]).head(10)
fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g energy_100g product
8390 100.0 20.00 0.00 0.00 1.524 4240.00 Horseradish Sauce
44709 100.0 17.86 3.57 10.71 0.381 4385.69 Roasted Pecans
295 100.0 0.00 0.00 0.00 0.000 3900.00 Ventura, Soybean - Peanut Frying Oil Blend
5122 100.0 0.00 0.00 0.00 0.000 3900.00 Corn Oil
5123 100.0 0.00 0.00 0.00 0.000 3900.00 Canola Oil
5124 100.0 0.00 0.00 0.00 0.000 3900.00 Vegetable Oil
5125 100.0 0.00 0.00 0.00 0.000 3900.00 Vegetable Shortening
5671 100.0 0.00 0.00 0.00 0.000 3900.00 Organic Coconut Oil
5797 100.0 0.00 0.00 0.00 0.000 3900.00 Premium Sesame Oil (100% Pure)
5798 100.0 0.00 0.00 0.00 0.000 3900.00 Sesame Oil
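
The `ascending` list pairs up with the `by` list, so each column can have its own direction. A minimal sketch on a four-row sample (values copied from outputs in this lesson; the name `sample_df` is an assumption) sorts fat from high to low and breaks ties with salt from low to high:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "product": ["Horseradish Sauce", "Roasted Pecans", "Corn Oil", "Organic Muesli"],
    "fat_100g": [100.0, 100.0, 100.0, 18.75],
    "salt_100g": [1.524, 0.381, 0.0, 0.1397],
})

# First key: fat descending. Second key (used only for ties in fat): salt ascending.
ranked = sample_df.sort_values(by=["fat_100g", "salt_100g"], ascending=[False, True])

print(ranked)
```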

Groupby

def high_protein(protein):
  """Label a proteins_100g value as high_protein (above 80) or low_protein"""

  if protein > 80:
    return "high_protein"
  return "low_protein"

df["high_protein"] = df["proteins_100g"].apply(high_protein)
df.head()
fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g energy_100g product high_protein
0 28.57 64.29 14.29 3.57 0.00000 2267.85 Banana Chips Sweetened (Whole) low_protein
2 57.14 17.86 3.57 17.86 1.22428 2835.70 Organic Salted Nut Mix low_protein
3 18.75 57.81 15.62 14.06 0.13970 1953.04 Organic Muesli low_protein
4 36.67 36.67 3.33 16.67 1.60782 2336.91 Zen Party Mix low_protein
5 18.18 60.00 21.82 14.55 0.02286 1976.37 Cinnamon Nut Granola low_protein
df.groupby("high_protein").median(numeric_only=True)  # skip the non-numeric product column
fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g energy_100g
high_protein
high_protein 1.665 3.335 1.665 93.18 0.5207 1700.00
low_protein 3.170 22.390 5.880 4.00 0.6350 1121.54
df.groupby("high_protein").describe()
carbohydrates_100g energy_100g ... salt_100g sugars_100g
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
high_protein
high_protein 4.0 7.350000 10.724610 0.0 0.00 3.335 10.685 22.73 4.0 1795.09500 ... 4.203065 14.77772 4.0 4.242500 6.458671 0.0 0.00 1.665 5.9075 13.64
low_protein 45022.0 34.056436 29.557504 0.0 7.44 22.390 61.540 100.00 45022.0 1111.22544 ... 1.440180 2032.00000 45022.0 16.006122 21.496335 -1.2 1.57 5.880 23.0800 100.00

2 rows × 48 columns
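
Grouped results are ordinary DataFrames, so they can be sorted like any other. A minimal sketch on the five rows from `df.head()`: the 15 g threshold and the `protein_level` labels are assumptions chosen so that both groups are populated in this tiny sample, and the numeric columns are selected explicitly before aggregating.

```python
import pandas as pd

foods = pd.DataFrame({
    "product": [
        "Banana Chips Sweetened (Whole)",
        "Organic Salted Nut Mix",
        "Organic Muesli",
        "Zen Party Mix",
        "Cinnamon Nut Granola",
    ],
    "proteins_100g": [3.57, 17.86, 14.06, 16.67, 14.55],
    "energy_100g": [2267.85, 2835.70, 1953.04, 2336.91, 1976.37],
})

# Hypothetical threshold (15 g), used only so this small sample has two groups.
foods["protein_level"] = foods["proteins_100g"].apply(
    lambda grams: "higher" if grams > 15 else "lower"
)

# Aggregate per group, then order the groups by median energy, highest first.
medians = (
    foods.groupby("protein_level")[["proteins_100g", "energy_100g"]]
    .median()
    .sort_values(by="energy_100g", ascending=False)
)

print(medians)
```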