
Python Data Analysis

Jupyter Notebooks

  • jupyter - web based environment for notebook documents
  • allows us to have python code along with headings, charts, tables, etc
  • the entire flow -
    • there is a notebook server running behind the scenes, started from our terminal
    • what we do in the browser is communicated to this server to be executed
    • finally, the output is returned to be displayed in the browser
  • anaconda - automates setup of a lot of data science related libraries etc on our computer
  • the steps i used to install anaconda on my linux pc are here
  • i also had to run conda config --set auto_activate_base False to prevent the environment from being activated on terminal startup
  • activate the environment in shell using source ~/anaconda3/bin/activate
  • to install packages, use conda install <package-name>
  • to run the jupyter notebook, type jupyter-notebook in shell
  • i am doing all of this from the folder where i want to store my notebooks
  • when i created a new notebook, i could see that importing pandas and matplotlib was working by default without me installing anything
  • execute a cell by using the run button / using shift + enter. it executes the current cell and
    • inserts a new cell below if it is the last cell
    • moves to the next cell below if it is not the last cell
  • to only execute the current cell, use cmd + enter
  • jupyter notebooks have autosave
  • to shut down, use file -> shutdown or ctrl + c on the terminal where we ran jupyter-notebook
  • all the code in a cell is executed, but the output is only shown for the last line in the cell
  • there are two kinds of mode - command mode and editing mode
    • go into command mode by clicking anywhere outside / hitting escape
    • go into editing mode by clicking inside a cell / hitting enter when focus is on a cell
  • actions like inserting cells etc can be done using shortcuts when in command mode
  • we can navigate through cells using up and down arrow keys when in command mode
  • use “a” to insert a cell above, “b” to insert a cell below
  • use “dd” to delete a cell
  • undo cell operation e.g. undo deleting a cell using “z”, redo a cell operation using shift + z
  • a notebook can have different kinds of cells - e.g. code cells, markdown cells, etc. there is a dropdown that lets us select the type of the current cell
  • we have a little restart icon which we can use to restart the kernel / stop button to interrupt the kernel. use case - e.g. if one cell ends up executing an infinite loop, we will not be able to execute any other cell
  • we can run all the cells from top to bottom using run -> run all cells
  • to see the documentation of something -
    • pd.read_csv - ensure cursor is somewhere on read_csv and do shift + tab - this opens the documentation in a popup
    • pd.read_csv? - this can be the last statement in a cell, and then the documentation is displayed in the output. the output cell might be bigger and easier to browse through than a popup

Dataframes and Datasets

  • textual data can be stored in csv, json, sql, etc
  • dataframes - 2 dimensional data - rows and columns
  • in jupyter, we can take a brief look at the data as seen by pandas using the following -
    
    house_data = pd.read_csv("data/kc_house_data.csv")
    house_data
    
  • by default, pandas shows us the first 5 and last 5 rows, and the first 10 and last 10 columns in jupyter. we can configure this though, e.g. using pd.set_option as sketched below
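  • a small sketch of tweaking these display options via pd.set_option (these are the standard pandas display options) -

    import pandas as pd

    # show up to 100 rows and all columns when displaying a dataframe in jupyter
    pd.set_option("display.max_rows", 100)
    pd.set_option("display.max_columns", None)
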
  • to view all the columns - house_data.columns
  • to view the number of rows - len(house_data)
  • to see the number of rows and columns in one go -
    
    house_data.shape
    # (21613, 21)
    
  • we can construct a new dataframe using for e.g. the first x / last x rows of the existing dataframe
    
    first_7 = house_data.head(7)
    last_6 = house_data.tail(6)
    
  • by default, pandas will assign data types using the following logic - if it is numeric, assign int if data has no decimals, else float. everything else is assigned to the object data type
  • we can view these datatypes assigned by pandas using info. note it also shows the number of non-null values for each column, the index information, the memory used, etc
    
    house_data.info()
      
    # <class 'pandas.core.frame.DataFrame'>
      # RangeIndex: 21613 entries, 0 to 21612
    # Data columns (total 21 columns):
    #  #   Column         Non-Null Count  Dtype  
    # ---  ------         --------------  -----  
    #  0   id             21613 non-null  int64  
    #  1   date           21613 non-null  object 
    #  2   price          21613 non-null  float64
    #  3   bedrooms       21613 non-null  int64  
    # ...
    # memory usage: 3.5+ MB
    
  • specifying a separator manually if the separator used in csv is not comma -
    
    netflix = pd.read_csv("data/netflix_titles.csv", sep="|")
    netflix
    
  • if we want to provide custom column names, we use names. in this case, the first row already had the headers but not in a format we liked, so we also pass 0 for the header attribute, so that pandas can skip the first row, and use the column names we provide
    
    headers = ('sumlev', 'region', 'division', 'state', 'name')
    
    nst = pd.read_csv("data/nst-est2020.csv", names=headers, header=0)
    nst
    
  • finally, if we would like to use one of the existing columns as the index column, we can specify that as well using the index_col parameter
    
    mount_everest_deaths = pd.read_csv("data/mount_everest_deaths.csv", index_col="No.")
    mount_everest_deaths
    
  • note - we can also pass in the column position instead of the column name for index col, like this - index_col=0

Basic Operations

  • finding the minimum / maximum - it returns a series data structure, and we get the min / max for every column when we perform this operation
    
    house_data.min()
    
    # id                       1000102
    # date             20140502T000000
    # price                    75000.0
    # bedrooms                       0
    # bathrooms                    0.0
    
  • sum - sums all values. e.g. if a column only holds a 1 or a 0, it gives us the number of rows with a 1 for that attribute. for string-like columns, it might concatenate them like strings. to prevent that, we also pass in the numeric_only argument
    
    house_data.sum(numeric_only=True)
    
    # price            1.167293e+10
    # bedrooms         7.285400e+04
    # bathrooms        4.570625e+04
    # sqft_living      4.495287e+07
    # sqft_lot         3.265069e+08
    # floors           3.229650e+04
    
  • similarly, we also have count (gives the non null values for all columns), mean, median and mode
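  • a quick sketch of these, assuming the same house_data dataframe loaded earlier -

    house_data.count()                    # non-null values per column
    house_data.mean(numeric_only=True)    # mean of every numeric column
    house_data.median(numeric_only=True)  # median of every numeric column
    house_data["bedrooms"].mode()         # most frequent value(s) in a column
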
  • describe - automatically gives a bunch of statistics around all numeric columns -
    
    titanic.describe()
    
    #            pclass    survived       sibsp       parch
    # count 1309.000000 1309.000000 1309.000000 1309.000000
    # mean     2.294882    0.381971    0.498854    0.385027
    # std      0.837836    0.486055    1.041658    0.865560
    # min      1.000000    0.000000    0.000000    0.000000
    # 25%      2.000000    0.000000    0.000000    0.000000
    # 50%      3.000000    0.000000    0.000000    0.000000
    # 75%      3.000000    1.000000    1.000000    0.000000
    # max      3.000000    1.000000    8.000000    9.000000
    
  • to get stats around non-numeric columns, we can set include to object. top gives the value that occurs the most number of times, while freq gives the number of times
    
    titanic.describe(include = 'object')
    
    #                         name   sex   age    ticket  fare  cabin  embarked  boat  body  home.dest
    # count                   1309  1309  1309      1309  1309  1309       1309  1309  1309       1309
    # unique                  1307     2    99       929   282   187          4    28   122        370
    # top     Connolly, Miss. Kate  male     ?  CA. 2343  8.05     ?          S     ?     ?          ?
    # freq                       2   843   263        11    60  1014        914   823  1188        564
    

Series and Columns

  • selecting a single column -
    
    titanic["name"]
    
    # 0                         Allen, Miss. Elisabeth Walton
    # 1                        Allison, Master. Hudson Trevor
    #                              ...                       
    # 1307                                Zakarian, Mr. Ortin
    # 1308                                 Zimmerman, Mr. Leo
    
  • note - this is of type pandas series
    
    type(titanic["name"])
    
    # pandas.core.series.Series
    
  • series - one dimensional array with labels
  • for instance, i think when we select a column, the labels are the index column of the dataframe. e.g. if i index the dataframe using show id -
    
    netflix_titles["type"]
    
    # show_id
    # s1         Movie
    # s2       TV Show
    #           ...   
    # s8806      Movie
    # s8807      Movie
    
  • when we call functions like sum for instance (refer above)
    • the value in the series is the sum for every column
    • the labels are the column names
  • functions like describe which returned a dataframe when called on a dataframe, will return a series when called on a series
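  • e.g. a small sketch - calling describe on a single column gives back a series; a numeric column gives count / mean / std / quartiles, while an object column gives count / unique / top / freq -

    houses["price"].describe()
    titanic["sex"].describe()
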
  • for functions like sum -
    • when calling sum on a dataframe, we got a series
    • when calling sum on a series, we will get a single value
    
    houses["price"].sum()  # 11672925008.0
    
  • we can access the labels of a series using index, and the underlying values using values -
    
    houses_min = houses.min()
    houses_min
    # id                       1000102
    # date             20140502T000000
    # price                    75000.0
    
    houses_min.index
    # Index(['id', 'date', 'price'], dtype='object')
    
    houses_min.values
    # array([1000102, '20140502T000000', 75000.0], dtype=object)
    

Intermediate Operations

  • unique - gives the unique values in a series. the return type of such methods is a numpy array
    
    houses["bedrooms"].unique()  # array([ 3,  2,  4,  5,  1,  6,  7,  0,  8,  9, 11, 10, 33])
    type(houses["bedrooms"].unique())  # numpy.ndarray
    
  • nunique - number of unique values. by default, dropna is True
    
    netflix_titles["director"].nunique(), netflix_titles["director"].nunique(dropna=False)
    # (4528, 4529)
    
  • nlargest - n largest values. by default, n is 5
    
    houses['price'].nlargest(n=7)
    
    # 7252    7700000.0
    # 3914    7062500.0
    # 9254    6885000.0
    # 4411    5570000.0
    # 1448    5350000.0
    # 1315    5300000.0
    # 1164    5110800.0
    
  • caveat - handling duplicates - e.g. imagine class has 3 unique values - 1st class, 2nd class and 3rd class. when we call nlargest with n set to 709, we get 709 values, each with value 3. when we call it with 710, we get 709 values for 3, and 1 value for 2. but what if we wanted all values for the last value that comes when using nlargest? we can set the keep parameter. when keep is all, we get 986 total values, even though n was 710. other possible values for keep are first (default) and last (probably the last row with the value as 2nd class would be returned in this case?)
    
    len(titanic['pclass'].nlargest(709)), len(titanic['pclass'].nlargest(709, keep='all'))  # (709, 709)
    len(titanic['pclass'].nlargest(710)), len(titanic['pclass'].nlargest(710, keep='all'))  # (710, 986)
    
  • similarly, we can call it on dataframes as well - we will need to specify the column names as well this time around though -
    
    houses.nlargest(5, "price")
    
    #              id            date     price bedrooms
    # 7252 6762700020 20141013T000000 7700000.0        6
    # 3914 9808700762 20140611T000000 7062500.0        5
    # 9254 9208900037 20140919T000000 6885000.0        6
    
  • we access a single column like this - netflix_titles["title"]. to access multiple columns, we can use the following syntax. note that even though we just pass one parameter, what we get back is still a dataframe, and not a series like we would get when using netflix_titles["title"]. note - remember that this creates a new dataframe
    
    netflix_titles[["title"]]
    
    #                         title
    # show_id
    # s1       Dick Johnson Is Dead
    # s2       Blood & Water
    # s3       Ganglands
    
    houses[["bedrooms", "bathrooms"]]
    
    #     bedrooms	bathrooms
    # 0          3       1.00
    # 1          3       2.25
    # 2          2       1.00
    
  • value counts - counts of unique values. sorts in descending order of counts by default. we can use the ascending parameter to sort it in ascending order of counts
    
    houses["bedrooms"].value_counts()
    
    # 3     9824
    # 4     6882
    # 2     2760
    
  • we can also have value counts for a dataframe. if we do it for all columns, we might end up having 1 value per row, as any two rows having same values for all columns is rare. we would typically perform this on a subset of columns like below. note - we still get back a series - it feels like the label of the series is comprised of multiple attributes, but it is still a pandas series and not a dataframe
    
    houses[["bedrooms", "bathrooms"]].value_counts()
    
    # bedrooms  bathrooms
    # 4         2.50         2502
    # 3         2.50         2357
    # 2         1.00         1558
    
  • small note - there are multiple ways of doing a thing, maybe we should try being efficient. i make these mistakes frequently - using the first value in the output of value counts instead of using mode directly, etc -
    • sort_values("Attack").head(20) vs nsmallest(20, "Attack")
    • .value_counts().head(1).index[0] vs mode
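  • a small sketch of those two equivalent pairs, using the houses dataframe instead of the examples above -

    # 20 cheapest houses - same rows, the nsmallest version reads better
    houses.sort_values("price").head(20)
    houses.nsmallest(20, "price")

    # most common bedroom count - mode is more direct than value_counts
    houses["bedrooms"].value_counts().head(1).index[0]
    houses["bedrooms"].mode()[0]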

Plotting Basics

  • in case of a series, we plot values against labels. if i try to for e.g. do houses["bedrooms"].plot(), it would not make much sense, since we would be plotting number of bedrooms against an index that might be like a house identifier
  • so, we can for e.g. plot value counts of bedrooms - this way, we would be plotting number of houses with the given number of bedrooms against number of bedrooms - as we see below, 3 bedrooms are the most common
    
    houses['bedrooms'].value_counts().plot(kind='bar')
    

  • above, we tried plotting a pandas series. we can also plot dataframes
  • e.g. try looking at the general distribution between bedrooms and bathrooms, by plotting one against another. we will have to customize both the x and y axis in this case, otherwise again, we might end up plotting all attributes against the autogenerated index
    
    houses.plot(x="bedrooms", y="bathrooms", kind="scatter")
    

Index

  • both dataframes and series in pandas have labels / indices
  • by default, a range index is used - auto incrementing index that goes 0, 1, 2, and so on
  • when we select a column in a dataframe, the labels used for the series is the same as the one used for the original dataframe
  • e.g. if we have a csv containing the stock related data for a particular stock, we can set the index column to be date, to easily get the low and high price for a particular date. we can set the index to a column manually by calling set index. note - like most things, this too returns a new dataframe instead of mutating the original dataframe
    
    bitcoin["High"]
    # 0      147.488007
    # 1      146.929993
    # 2      139.889999
    
    
    bitcoin = bitcoin.set_index("Date")
    #                       High         Low          Open         Close
    # Date
    # 2013-04-29 23:59:59   147.488007   134.000000   134.444000   144.539993
    # 2013-04-30 23:59:59   146.929993   134.050003   144.000000   139.000000
    # 2013-05-01 23:59:59   139.889999   107.720001   139.000000   116.989998
    
    bitcoin["High"]
    # Date
    # 2013-04-29 23:59:59      147.488007
    # 2013-04-30 23:59:59      146.929993
    # 2013-05-01 23:59:59      139.889999
    
  • if for e.g. we were to call bitcoin["High"].plot() after setting the index, the plot would make a lot more sense - high price against date, so how the price of bitcoin changed over days / years. without the re-indexing, it would display the price of bitcoin against an auto-incrementing integer, which would not have made much sense
  • we can also do it when reading the csv using the index col parameter as seen earlier
    
    happiness_indexed = pd.read_csv("data/world-happiness-report-2021.csv", index_col="Country name")
    happiness_indexed
    
    #                   Healthy life expectancy  Freedom to make life choices
    # Country name  
    #      Finland                       72.000                         0.949
    #      Denmark                       72.700                         0.946
    #  Switzerland                       74.400                         0.919
    

Sorting

  • sort values - present on both series and dataframes. the default sort order is ascending, and it is not in place. with most commands, i was reassigning the actual variable itself; another option with these functions is passing inplace=True
    
    happiness_indexed.sort_values("Healthy life expectancy", ascending=False, inplace=True)
      
    #               Healthy life expectancy  Freedom to make life choices
    # Country name
    #    Singapore                   76.953                         0.927
    #    Hong Kong                   76.820                         0.717
    #    Japan                       75.100                         0.796
    
  • sorting by multiple columns - descending by number of bedrooms, ascending by number of bathrooms -
    
    houses = pd.read_csv("data/kc_house_data.csv")
    houses.sort_values(["bedrooms", "bathrooms"], ascending=[False, True])
    
    #        bedrooms  bathrooms
    # 15870        33       1.75
    # 8757         11       3.00
    # 15161        10       2.00
    
  • when sorting by a text column, e.g. name, the sorting will use the ascii value, so Arjun comes before abhishek. we can use the key function in pandas to provide a custom lambda to use when sorting rows -
    
    titanic.sort_values("name", inplace=True, key=lambda name: name.str.lower())
    
  • sort index - helps sort the data by the index / labels -
    
    happiness.sort_index(inplace=True)
    
  • we can call sort values / sort index on pandas series as well - the difference here is providing the column would not be required when sorting by values
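  • e.g. a small sketch of sorting a series directly - no column argument needed -

    houses["price"].sort_values(ascending=False)
    happiness["Healthy life expectancy"].sort_index()
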
  • by default, value counts sorts on frequency. however, this might not make sense when we try to plot it - houses with 3 bedrooms would appear before houses with 1 bedroom on the x axis. so, we sort by the number of bedrooms i.e. the index
    
    bedrooms_stats = houses["bedrooms"].value_counts()
    bedrooms_stats
    
    # bedrooms
    # 3         9824
    # 4         6882
    # 2         2760
    # 5         1601
    # 6          272
    # 1          199
    
    bedrooms_stats.sort_index(inplace=True)
    bedrooms_stats.plot(kind="bar")
    

Indexing

  • we already tried accessing data using columns. using one column gives us a pandas series, while using multiple columns gives a dataframe. we only need a pair of square braces for accessing columns
  • to access particular rows, we can use loc / iloc
  • e.g. our data is indexed using country name. we can access the data for a particular country using loc. output format i believe is a series, where the labels are the column names
    
    happiness
    #                Healthy life expectancy   Freedom to make life choices
    # Country name
    # Afghanistan    52.493                    0.382
    # Albania        68.999                    0.785
    # Algeria        66.005                    0.480
    # Argentina      69.000                    0.828
    
    happiness.loc["Yemen"]
    # Healthy life expectancy        57.122
    # Freedom to make life choices   0.602
    
  • just like when accessing columns, if we use an additional pair of square braces, we start getting a dataframe instead
    
    happiness.loc[["Yemen"]]
    
    #               Healthy life expectancy   Freedom to make life choices
    # Country name
    # Yemen                          57.122                          0.602
    
  • we can also use slicing with loc - e.g. get all the rows between denmark to france. note - remember to sort using index first for this to work properly
    
    happiness.sort_index(inplace=True)
    happiness.loc["Denmark" : "France"]
    
    #                Healthy life expectancy   Freedom to make life choices
    # Country name  
    #      Denmark                    72.700                          0.946
    #      Ecuador                    68.800                          0.842
    #        Egypt                    61.998                          0.749
    #      Finland                    72.000                          0.949
    #       France                    74.000                          0.822
    
  • iloc - access rows using integer position-based indexing
  • e.g. i want the 20th country alphabetically - i may not know what it is. i can however access it using iloc. again, i get back a series
    
    happiness.iloc[19]
    
    # Healthy life expectancy         62.000
    # Freedom to make life choices     0.959
    
  • e.g. if i wanted the 1st, 3rd and 5th countries, i add an additional pair of square braces, and again, get back a dataframe this time around -
    
    happiness.iloc[[0, 2, 4]]
      
    #                Healthy life expectancy   Freedom to make life choices
    # Country name
    #  Afghanistan                    52.493                          0.382
    #      Algeria                    66.005                          0.480
    #      Armenia                    67.055                          0.825
    
  • finally, again with iloc, we can also use slicing. we will use integer positions, where we can specify start, end and optionally a step
    
    happiness.iloc[0:5]
    
  • so, loc uses values of index, iloc uses numeric position
  • again, we can use both loc and iloc on series as well
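  • e.g. a small sketch of loc / iloc on a series - selecting a column gives a series, and we index into it the same way -

    happiness["Healthy life expectancy"].loc["Denmark"]   # by label
    happiness["Healthy life expectancy"].iloc[0]          # by position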

Filtering

  • carefully look at the three steps we follow below for filtering - we can use a column to get a series, we generate a boolean series from it by using conditions, and finally we get the rows that hold true for the corresponding position in the boolean series
    
    df
    
    #     name                            sex      age
    # 0   Allen, Miss. Elisabeth Walton   female   29
    # 1   Allison, Master. Hudson Trevor  male     0.9167
    # 2   Allison, Miss. Helen Loraine    female   2
      
      
    df['sex']
    
    # 0    female
    # 1      male
    # 2    female
    
    
    df['sex'] == 'female'
    
    # 0     True
    # 1    False
    # 2     True
    
    
    df[df['sex'] == 'female']
    
    #                                              name      sex  age
    # 0                   Allen, Miss. Elisabeth Walton   female   29
    # 2                    Allison, Miss. Helen Loraine   female    2
    
  • we saw == above. we can also use the other comparison operators like !=, >=, >, <=, <, etc
    
    houses[houses['price'] > 5000000]
    
    #               id             date      price  bedrooms  bathrooms
    # 1164  1247600105  20141020T000000  5110800.0         5       5.25
    # 3914  9808700762  20140611T000000  7062500.0         5       4.50
    # 7252  6762700020  20141013T000000  7700000.0         6       8.00
    # 9254  9208900037  20140919T000000  6885000.0         6       7.75
    
  • series have a method called between which we can use. e.g. find houses with bedrooms in the range 5 to 7 -
    
    houses[houses['bedrooms'].between(5, 7)].value_counts('bedrooms')
    
    # bedrooms
    # 5         1601
    # 6         272
    # 7         38
    
  • we can use isin, e.g. find netflix movies in india or south korea -
    
    netflix[netflix['country'].isin(['India', 'South Korea'])].value_counts('country')
    
    # country
    # India          972
    # South Korea    199
    
  • we can combine conditions using boolean operators -
    
    women = titanic['sex'] == 'female'
    died = titanic['survived'] == 0
    titanic[women & died]
    
    #         sex  survived  pclass    cabin
    #   2  female         0       1  C22 C26
    #   4  female         0       1  C22 C26
    # 105  female         0       1      A29
    
  • note - doing it in one line - do not forget parentheses, otherwise python cannot parse it correctly due to priority -
    
    titanic[(titanic['sex'] == 'female') & (titanic['survived'] == 0)]
    
  • similarly, we can use | for or, ~ for negation
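  • e.g. a small sketch of | and ~ - the same parentheses caveat applies -

    # passengers in first or second class
    titanic[(titanic['pclass'] == 1) | (titanic['pclass'] == 2)]

    # everything except first class
    titanic[~(titanic['pclass'] == 1)]
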
  • isna - returns true for rows where the column is missing a value -
    
    netflix[netflix['director'].isna()]
    
    #      show_id   type                                      title   director
    # 1         s2     TV   Show Blood & Water                              NaN
    # 3         s4     TV   Show Jailbirds New Orleans                      NaN
    # 4         s5     TV   Show Kota Factory                               NaN
    # 10       s11     TV   Show Vendetta: Truth, Lies and The Mafia        NaN
    # 14       s15     TV   Show Crime Stories: India Detectives            NaN
    
  • my understanding - everywhere above, we are trying to filter using a column value. we can use the index as well though - recall - we saw in series that we can access the labels using index. my understanding - the point is, whatever we did using dataframe[column], can be done using dataframe.index as well -
    
    countries[countries.index != 'Denmark']
    

Modifying Columns and Indices

  • dropping columns - we use the drop method. we need to specify the columns to drop, and the axis. the same drop method can be used to drop rows as well, hence we need to specify the axis. axis can be set to -
    • 0 / index to drop rows
    • 1 / columns to drop columns
    
    bitcoin.drop(labels=['Name', 'Symbol'], axis='columns')
    
  • another way to do this is to just pass in the columns parameter directly, instead of passing in labels and axis
    
    bitcoin.drop(columns=['Name', 'Symbol'])
    
  • till now, we saw dropping columns. we can also drop rows using one of the following ways -
    
    # method 1
    countries.drop(labels=['Denmark', 'Finland', 'Iceland'], axis='index')
    
    # method 2 - shorthand and my favorite of the three
    countries.drop(index=['Denmark', 'Finland', 'Iceland'])
    
    # method 3 - it is the first positional argument
    # so we can skip providing the "index" kwarg
    countries.drop(['Denmark', 'Finland', 'Iceland'])
    
  • drop all countries except the first 10. we can pass an index series as well
    
    countries.drop(countries.index[10:])
    
  • creating a new column with a constant value -
    
    titanic['constant'] = 'something'
    
  • creating a new column with dynamic values -
    
    # number of relatives = number of parents and children + number of siblings and spouses
    titanic["relatives"] = titanic["parch"] + titanic["sibsp"]
    
  • renaming columns - i think the arguments are similar to drop - instead of labels and axis, we pass mapper and axis
    
    mapper = { 'Regional indicator': 'regional_indicator', 'Ladder score': 'ladder_score' }
      
    countries.rename(mapper=mapper, axis='columns')
    countries.rename(columns=mapper)
    
  • similarly, we can rename indices -
    
    mapper = { 'Netherlands': 'the_netherlands' }
    
    countries.rename(mapper=mapper, axis='index')
    countries.rename(index=mapper)
    countries.rename(mapper)
    
  • a complex problem - find the show called “Evil”, and change its index label to s6666 inplace
    
    evil_index = netflix[netflix['title'] == 'Evil'].index[0]
    netflix.rename(index={ evil_index: 's6666' }, inplace=True)
    

Updating Values

  • my understanding - we have already seen tricks to change column names / index names using rename. now we look at replace - the way of renaming the actual values inside
  • we can use replace as follows. again, it is not in place by default, so we need to pass true for in place explicitly
    
    titanic.replace({'sex': { 'female': 'F', 'male': 'M' }}, inplace=True)
    
  • this method is supported for series as well and not just dataframes. so, we can use the technique below as well -
    
    titanic['sex'] = titanic['sex'].replace({ 'female': 'F', 'male': 'M' })
    # OR
    titanic['sex'].replace({ 'female': 'F', 'male': 'M' }, inplace=True)
    
  • in the titanic dataset, all unknown values in the age column hold ?. we can replace them with none. note the use of dropna in value counts, i do not think this was mentioned earlier
    
    titanic.value_counts('age')
    # age
    # ?         263
    # 24         47
    # ...
    
    titanic.replace({ 'age': { '?': None } }, inplace=True)
    titanic.value_counts('age', dropna=False)
    # age
    # NaN       263
    # 24         47
    # ...
    
  • we can use replace when all values that match “x” in a column “a” need to be replaced
  • but imagine i want to replace values in a column “a”, but only for specific rows -
  • we can use loc for both use cases above
    
    countries.loc[['Denmark', 'Sweden', 'Norway'], ['Regional indicator']] = 'Scandinavia'
    
  • setting multiple columns to a single value -
    
    countries.loc[['Finland', 'Denmark'], ['upperwhisker', 'lowerwhisker']] = 4.5
    
  • setting multiple columns, each to its own specific value -
    
    countries.loc[['Finland', 'Denmark'], ['upperwhisker', 'lowerwhisker']] = [4.5, 2.8]
    
  • till now, even in here, we have tried accessing rows whose indices we know
  • however, loc can be passed the boolean pandas series we saw in filtering as well -
    
    houses.loc[houses['bedrooms'] >= 10]
    #         id           date              price       bedrooms   bathrooms
    # 8757    1773100755   20140821T000000   520000.0    11         3.00
    # 13314   627300145    20140814T000000   1148000.0   10         5.25
    # 15161   5566100170   20141029T000000   650000.0    10         2.00
    # 15870   2402100895   20140625T000000   640000.0    33         1.75
    
  • advantage of the above - we can now conditionally update certain rows. we have already seen how to update rows using loc, and we know how to filter rows based on conditions
    
    houses.loc[houses['bedrooms'] >= 10, ['bedrooms']] = 9999 
    houses.loc[houses['bedrooms'] == 9999]
    #         id           date              price       bedrooms   bathrooms
    # 8757    1773100755   20140821T000000   520000.0    9999       3.00
    # 13314   627300145    20140814T000000   1148000.0   9999       5.25
    # 15161   5566100170   20141029T000000   650000.0    9999       2.00
    # 15870   2402100895   20140625T000000   640000.0    9999       1.75
    
  • a complex problem - add a new column ‘luxurious’ - set it to ‘yes’ for houses with grade > 12 and view = 4, and set it to ‘no’ for others -
    
    good_view = houses['view'] == 4
    good_quality = houses['grade'] > 12
    houses[good_view & good_quality]
    #        price      view  grade
    # 9254   6885000.0  4     13
    # 14556  2888000.0  4     13
    # 19017  3800000.0  4     13
    
    houses['luxurious'] = 'no'
    houses.loc[good_view & good_quality, ['luxurious']] = 'yes'
      
    houses[houses['luxurious'] == 'yes']
    #        price      view  grade  luxurious
    # 9254   6885000.0  4     13     yes
    # 14556  2888000.0  4     13     yes
    # 19017  3800000.0  4     13     yes
    

Data Types

  • we have a dataset that holds ? for missing values in the csv
    
    titanic.info()
    # #   Column     Non-Null Count  Dtype 
    #---  ------     --------------  ----- 
    # 0   pclass     1309 non-null   int64 
    # 1   survived   1309 non-null   int64 
    # 2   name       1309 non-null   object
    # 4   age        1309 non-null   object
    
    titanic['age'].value_counts()
    # age
    # ?         263
    # 24         47
    # 22         43
    # 21         41
    
  • issue - we cannot do things like finding the mean age
  • solution - we convert the data type of the age column. first, we replace ? with none, then we cast it to type float -
    
    titanic['age'] = titanic['age'].astype('float')
    # ValueError: could not convert string to float: '?'
    
    titanic['age'] = titanic['age'].replace({ '?': None }).astype('float')
    
    titanic.info()
    # #   Column     Non-Null Count  Dtype
    #---  ------     --------------  -----
    # 0   pclass     1309 non-null   int64
    # 1   survived   1309 non-null   int64
    # 2   name       1309 non-null   object
    # 4   age        1046 non-null   float64
    
    titanic['age'].value_counts(dropna=False)
    # age
    # NaN        263
    # 24.0000     47
    # 22.0000     43
    
  • now, we can use numeric functions like for e.g. titanic['age'].mean()
  • another option - to numeric - it is a more aggressive alternative to the one we saw earlier. earlier, we manually ran replace for all the question marks to be replaced by none, and then did the type conversion from object to float. now, with the below approach, we will say try converting it to numeric, and if you cannot, just put a none in there. the default value of errors is raise i.e. raise an exception when you encounter an error. we typically change it to coerce for getting the behavior we described
    
    titanic['body'] = pd.to_numeric(titanic['body'], errors='coerce')
    
  • category type - useful when a column has a set of finite possible values, e.g. gender
  • advantage - less memory usage etc
  • by default, this is the situation. specially look at the dtype column and memory usage in the output
    
    titanic['sex'].value_counts()
    # sex
    # male      843
    # female    466
    
    titanic.info()
    #  #   Column     Non-Null Count  Dtype
    # ---  ------     --------------  -----
    #  3   sex        1309 non-null   object
    # memory usage: 143.3+ KB
    
  • when we manually cast sex to the category type, the output looks like follows. look how the dtype is now changed and the memory usage too has reduced
    
    titanic['sex'] = titanic['sex'].astype('category')
    
    titanic['sex'].value_counts()
    # sex
    # male      843
    # female    466
    
    titanic.info()
    #  #   Column     Non-Null Count  Dtype  
    # ---  ------     --------------  -----  
    #  3   sex        1309 non-null   category
    # memory usage: 134.5+ KB
      
    titanic['sex']
    # 0       female
    # 1         male
    #          ...  
    # 1307      male
    # 1308      male
    # Name: sex, Length: 1309, dtype: category
    # Categories (2, object): ['female', 'male']
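
  • to see the saving for just this column rather than the whole dataframe, we can compare the column's own memory usage before and after the cast - a small sketch -

    titanic['sex'].astype('object').memory_usage(deep=True)    # object column
    titanic['sex'].astype('category').memory_usage(deep=True)  # category column - much smaller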
    

NA Values

  • is na - returns true for cells that do not contain a value. can be called on both the dataframe and series. look at the last statement, where we generate a boolean series that represent all rows which contain null for league, and then use it as a filter condition
    
    game_stats = pd.read_csv('data/game_stats.csv')
    game_stats
    #    name      league      points  assists  rebounds
    # 0  bob       nba         22.0    5.0      10.0
    # 1  jessie    NaN         10.0    NaN      2.0
    # 2  stu       euroleague  NaN     NaN      NaN
    # 3  jackson   aba         9.0     NaN      2.0
    # 4  timothee  NaN         8.0     NaN      NaN
    # 5  steph     nba         49.0    8.0      10.0
    # 6  NaN       NaN         NaN     NaN      NaN
    
    game_stats.isna()
    #    name   league  points  assists  rebounds
    # 0  False  False   False   False    False
    # 1  False  True    False   True     False
    # 2  False  False   True    True     True
    # 3  False  False   False   True     False
    # 4  False  True    False   True     True
    # 5  False  False   False   False    False
    # 6  True   True    True    True     True
    
    game_stats['league'].isna()
    # 0    False
    # 1     True
    # 2    False
    # 3    False
    # 4     True
    # 5    False
    # 6     True
    
    game_stats[game_stats['league'].isna()]
    #    name      league      points  assists  rebounds
    # 1  jessie    NaN         10.0    NaN      2.0
    # 4  timothee  NaN         8.0     NaN      NaN
    # 6  NaN       NaN         NaN     NaN      NaN
    
  • drop na - dropping rows with missing values. it too creates a new copy unless we specify in place explicitly
  • to drop rows where any of the columns hold null, specify the how parameter as any. note - this is also the default i.e. when we call drop na without any parameters
    
    game_stats.dropna(how='any')
    
    #    name   league  points  assists  rebounds
    # 0  bob    nba     22.0    5.0      10.0
    # 5  steph  nba     49.0    8.0      10.0
    
  • drop only rows where all the columns hold null - specify the how parameter as all
    
    game_stats.dropna(how='all')
    
    #    name      league      points  assists  rebounds
    # 0  bob       nba         22.0    5.0      10.0
    # 1  jessie    NaN         10.0    NaN      2.0
    # 2  stu       euroleague  NaN     NaN      NaN
    # 3  jackson   aba         9.0     NaN      2.0
    # 4  timothee  NaN         8.0     NaN      NaN
    # 5  steph     nba         49.0    8.0      10.0  
    
  • drop rows where any of the specified columns is missing a value
    
    game_stats.dropna(subset=['points', 'rebounds'])
    
    #    name     league  points  assists  rebounds
    # 0  bob      nba     22.0    5.0      10.0
    # 1  jessie   NaN     10.0    NaN      2.0
    # 3  jackson  aba     9.0     NaN      2.0
    # 5  steph    nba     49.0    8.0      10.0
    
  • finally, we can drop columns as well by setting the axis parameter, e.g. drop all columns where any of the rows contain missing values for it
    
    netflix.dropna(how='any', axis=1)
    
  • note - drop na works for series as well
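  • e.g. a small sketch - dropping the missing values from a single column of the same game_stats data -

    game_stats['assists'].dropna()
    # 0    5.0
    # 5    8.0
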
  • we can use fill na to fill the cells missing values with a particular value
  • if we call it directly with a value, it would apply to all columns -
    
    game_stats.fillna(0)
    
    #    name      league      points  assists  rebounds
    # 0  bob       nba         22.0    5.0      10.0
    # 1  jessie    0           10.0    0.0      2.0
    # 2  stu       euroleague  0.0     0.0      0.0
    # 3  jackson   aba         9.0     0.0      2.0
    # 4  timothee  0           8.0     0.0      0.0
    # 5  steph     nba         49.0    8.0      10.0
    # 6  0         0           0.0     0.0      0.0
    
  • we can however specify specific columns like so -
    
    game_stats.fillna({ 'points': 10.0, 'assists': 0 })
    
    #    name       league      points  assists  rebounds
    # 0  bob        nba         22.0    5.0      10.0
    # 1  jessie     NaN         10.0    0.0      2.0
    # 2  stu        euroleague  10.0    0.0      NaN
    # 3  jackson    aba         9.0     0.0      2.0
    # 4  timothee   NaN         8.0     0.0      NaN
    # 5  steph      nba         49.0    8.0      10.0
    # 6  NaN        NaN         10.0    0.0      NaN
    
  • fun exercise - instead of using fill na, use loc for updating values where it is missing
    
    netflix.loc[netflix['rating'].isna(), 'rating'] = 'TV-MA'  # using loc + isna
    netflix.fillna({ 'rating': 'TV-MA' })                      # equivalent using fillna
    
  • assume we have two columns in a sales table for shipping and billing addresses. we would like to default the shipping address to the billing address wherever shipping address is missing. we can do it as follows -
    
    sales.fillna({ 'shipping_zip': sales['billing_zip'] }, inplace=True)
    
  • my understanding - based on above point, we can specify a series as well for the value, and it will fill using the corresponding value in the series wherever a null is encountered
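  • a small sketch of the same idea done at the series level, with the same hypothetical sales columns -

    sales['shipping_zip'] = sales['shipping_zip'].fillna(sales['billing_zip'])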

Dates and Times

  • dates can be present in lots of different formats - months can be in words or numbers, days can come before months or the other way around, separator can be - or /, etc
  • my understanding - the to datetime function does a (mostly) great job at auto detecting the date time format
    
    pd.to_datetime('31 Dec. 2019')  # Timestamp('2019-12-31 00:00:00')
    pd.to_datetime('12/31/2019')  # Timestamp('2019-12-31 00:00:00')
    
  • we can however pass it parameters to configure its behavior as well in case of ambiguity
  • e.g. look below how we use day first and year first to get different outputs for the same input -
    
    pd.to_datetime('10-11-12')  # Timestamp('2012-10-11 00:00:00')
    pd.to_datetime('10-11-12', dayfirst=True)  # Timestamp('2012-11-10 00:00:00')
    pd.to_datetime('10-11-12', yearfirst=True, dayfirst=True)  # Timestamp('2010-12-11 00:00:00')
    
  • we can use the more powerful format as well. format codes reference
    
    pd.to_datetime('10-11-12', format='%y-%d-%m')  # Timestamp('2010-12-11 00:00:00')
    
    meetings = ['Dec 11 2019 Meeting', 'Jan 15 2024 Meeting', 'Mar 7 2024 Meeting']
    pd.to_datetime(meetings, format='%b %d %Y Meeting')
    # DatetimeIndex(['2019-12-11', '2024-01-15', '2024-03-07'], dtype='datetime64[ns]', freq=None)
    
  • pandas’ default behavior - try parsing a column as numeric if possible, else fall back to the object data type. so, a datetime column is read in as object, and we convert it to datetime using to_datetime -
    
    ufos.info()
    # ...
    # 1   city            87888 non-null  object 
    # 2   state           82890 non-null  object 
    # 3   date_time       86938 non-null  object
    
    ufos['date_time'] = pd.to_datetime(ufos['date_time'])
    
    ufos.info()
    # ...
    # 1   city            87888 non-null  object 
    # 2   state           82890 non-null  object 
    # 3   date_time       86938 non-null  datetime64[ns]
    
  • we can also specify the datetime columns upfront while reading a csv, instead of converting it later -
    
    ufos = pd.read_csv('data/nuforc_reports.csv', parse_dates=['date_time'])
    ufos.info()
    # ...
    # 1   city            87888 non-null  object        
    # 2   state           82890 non-null  object        
    # 3   date_time       86938 non-null  datetime64[ns]
    
  • there are also keyword arguments for specifying the datetime format etc in the read csv call, refer documentation
  • we can access the date time properties object on the column of type datetime64. we access it using dt. full list of properties available
  • e.g. view the top 10 years with the most ufo sightings. we first need to extract just the year from the datetime column, and then, we can chain value counts with nlargest to get the top 10 years
    
    ufos['date_time'].dt.year.value_counts().nlargest(10).plot(kind='bar')
    

  • comparing datetime columns
    • notice how we provide strings and pandas can parse it for us automatically. the example below will give us all the ufo sightings since 12am on 22 december, 2019
      
      ufos[ufos['date_time'] > '2019-12-22']
      
    • we already saw how we can access properties on a datetime column. we can use it to perform filtering as well. the example below will give us all sightings where the hour was 2, i.e. it could have happened at 2:30am, 2:49am, etc
      
      ufos[ufos['date_time'].dt.hour == 2.0]
      
  • time deltas - we get this when we subtract two datetime objects. e.g. get the people who waited the longest to report after seeing a ufo
    
    ufos['posted'] - ufos['date_time']
    # 0          9 days 05:17:00
    # 1          6 days 05:30:00
    #                ...        
    # 88123      1 days 02:00:00
    # 88124      1 days 02:00:00
    # Length: 88125, dtype: timedelta64[ns]
    
    (ufos['posted'] - ufos['date_time']).nlargest(5)
    # 86762   18463 days 00:00:00
    # 87145   18353 days 22:30:00
    # 721     18314 days 03:00:00
    # 1576    18287 days 00:00:00
    # 1580    18240 days 08:00:00
    # dtype: timedelta64[ns]
    
  • just like in datetime, we can also access properties of time deltas. full list here
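  • e.g. a small sketch - the whole-days component of the delay, and the full delay in seconds -

    delay = ufos['posted'] - ufos['date_time']
    delay.dt.days              # whole days part of each timedelta
    delay.dt.total_seconds()   # entire timedelta expressed in seconds
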
  • a complex example -
    • find the homes sold between may 1st 2014 and may 1st 2015
    • create a bar plot showing the total number of sales per month in that period
    • the x axis should be in calendar order (1-12)
  • we filter to get all houses sold in the time period. then, we extract the month and perform value counts on it. next, we sort by index i.e. by month, since by default value counts will sort by counts. finally, we plot it
    houses['date'] = pd.to_datetime(houses['date'])  # assuming the date column is still a string at this point
    houses_sold = houses[houses['date'].between('05-01-2014', '05-01-2015')]
    houses_sold['date'].dt.month.value_counts().sort_index().plot(kind='bar')
    

  • i wanted to get the week of year. i could not access any such property on dt. so, i instead tried to format the date into a string. then, we can use the format codes we mentioned earlier as well
  • e.g. create a line plot showing the total number of sales by week of the year number (1-52)
  • first we obtain the week number the house was sold in. then, we obtain the value counts for each of the week. then we sort by the index i.e. the week number, because otherwise the x axis of the line plot will not be sorted and look weird - recall that value counts will sort by counts and not index
    
    plt.figure(figsize=(15, 5))
    plt.xticks(range(52))
    houses['date'].dt.strftime('%V').value_counts().sort_index().plot(kind='line')
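
  • note - newer pandas versions also expose the week number directly via dt.isocalendar() (added around pandas 1.1), which can be cleaner than formatting to a string - a small sketch -

    houses['date'].dt.isocalendar().week.value_counts().sort_index().plot(kind='line')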
    

Matplotlib

  • till now, whenever we called plot on a pandas series, it was actually calling matplotlib behind the scenes
  • however, it can have limitations, which is when we might want to interact with matplotlib
  • most common way of importing matplotlib -
    
    import matplotlib.pyplot as plt
    
  • when we do the following, it defaults the values on the x axis to 0, 1, 2, …
    
    plt.plot([2, 6, 2, 4, 8])
    

  • we can specify values for both x and y as follows -
    
    salaries = [20000, 50000, 60000, 100000, 250000, 150000]
    ages = [20, 25, 30, 32, 45, 65]
    plt.plot(ages, salaries)
    

  • note - when we call plot in jupyter notebook, before the actual graph, we see a line like this - [<matplotlib.lines.Line2D at 0x7cd66c249180>]. this is actually the output of plot, but jupyter is smart enough to show us the output at the back of it as well. we will have to call plt.show() if we are not working on a jupyter notebook
  • matplotlib terminology - the top level container is a figure
  • a figure can have multiple axes
  • each axes is a combination of labels, data, etc
  • assume we have the following sample data -
    
    nums = range(6)
    nums_squared = [num**2 for num in nums]
    nums_cubed = [num**3 for num in nums]
    
  • when we have the following code, all of them are plotted on the same figure and the same axes
    
    plt.plot(nums)
    plt.plot(nums_squared)
    plt.plot(nums_cubed)
    

  • we call figure to create a new figure and make it the current figure. so when we call plot, it basically plots on the current active figure. so, with the code below, all of them are plotted on different figures
    
    plt.figure(figsize=(4,3))
    plt.plot(nums)
    
    plt.figure(figsize=(4,3))
    plt.plot(nums_squared)
    
    plt.figure(figsize=(4,3))
    plt.plot(nums_cubed)
    

  • note how we control the size of a figure in matplotlib - we can pass figsize and dpi or dots per inch to figure. i usually just touch figsize, which defaults to 6.4, 4.8
  • we can specify the linestyle when plotting as follows. notice the shorthand at the third call as well
    
    plt.plot(nums, nums, linestyle='dashed')
    plt.plot(nums, nums_squared, linestyle='dotted')
    plt.plot(nums, nums_cubed, linestyle='-.')
    

  • when calling plot, we can specify parameters like color, linewidth, etc as well if needed
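  • e.g. a small sketch of a couple of those keyword arguments -

    plt.plot(nums, nums_squared, color='red', linewidth=3, linestyle='dashed')
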
  • we can also specify markers and their styling
    
    plt.plot(nums, nums_cubed, marker='o')
    

  • we can use title to set a title for the axes, and xlabel / ylabel to set labels for the x and y axis individually
    
    plt.plot(nums, nums_squared)
    plt.title("Squares of Numbers")
    plt.xlabel("Input")
    plt.ylabel("Squares")
    

  • remember - all these methods we see - plot, title, xlabel and ylabel, and others that we see later - also accept a ton of options to control the size, spacing, color, positioning, etc. refer documentation as and when needed
  • when we try to plot the below, look at the default graph. notice how on x axis for e.g., matplotlib itself decided that it should start the ticks from 3 etc
    
    nums = [3, 3.5, 4, 7, 9]
    nums_squared = [num**2 for num in nums]
    plt.plot(nums, nums_squared, marker='o')
    

  • option 1 - we can add custom ticks using xticks and yticks. it serves two purposes -
    • we can only provide the first argument. this controls what ticks should show up
    • we can provide the second argument as well. this controls what the actual tick should be named inside the graph
    
    plt.plot(nums, nums_squared, marker='o')
    plt.xticks([1, 2, 3, 4, 7, 8, 9], ['one', 'two', 'three', 'four', 'seven', 'eight', 'nine'])
    

  • option 2 - we can just modify the limits. e.g. we would like the x axis to start from -2 and end at 15 for some reason
    
    plt.plot(nums, nums_squared, marker='o')
    plt.xlim(-2, 15)
    

  • legend - helps distinguish between the different graphs using labels when they are on the same axes in the same figure
    
    nums = [1, 2, 3, 4]
    nums_squared = [num ** 2 for num in nums]
    nums_cubed = [num ** 3 for num in nums]
      
    plt.plot(nums, nums, label='linear')
    plt.plot(nums, nums_squared, label='squared')
    plt.plot(nums, nums_cubed, label='cubed')
      
    plt.legend()
    

  • plotting bar charts. assume we have the following data -
    
    plants = ['spinach', 'turnip', 'rhubarb', 'broccoli', 'kale']
    died = [10, 25, 5, 30, 21]
    germinated = [74, 88, 56, 69, 59]
    
  • by default, the different charts would be one on top of another -
    
    plt.bar(plants, germinated)
    plt.bar(plants, died)
    

  • this is how i got them to show one beside another -
    • i ensured the width of the first graph is positive while the second one is negative, so that they appear on either side of the x tick
    • i also set the width to 0.25, as this ensures the right spacing. if for e.g. i used 0.5, the second bar of a tick would touch the first bar of the next tick
    • i set align to edge. this aligns them to the edge of the tick. the default is center (refer the graph created by default above)
    
    plt.bar(plants, germinated, width=0.25, align='edge')
    plt.bar(plants, died, width=-0.25, align='edge')
    

  • bar also receives another keyword argument - bottom
    
    plt.bar(plants, germinated, bottom=[20, 20, 20, 20, 20])
    plt.ylim(0, 120)
    

  • use case of the stupidity above 🤣 - we can get the different bars to stack one on top of another. the y coordinates of one graph becomes the bottom of another
    
    plt.bar(plants, died, bottom=germinated, label='died')
    plt.bar(plants, germinated, label='germinated')
    plt.legend()
    

  • we can use barh instead of bar for horizontal bar graphs. notice how for stacking, the bottom changes to left
    
    plt.barh(plants, died, left=germinated, label='died')
    plt.barh(plants, germinated, label='germinated')
    plt.legend()
    

  • histogram - assume we have the following data. note - i did a value count to explain the distribution of data -
    
    nums = [1,2,2,3,5,4,2,2,1,1,3,4,4,2,1,5,2,3,4,5]
    
    { num: nums.count(num) for num in nums }
    # {1: 4, 2: 6, 3: 3, 4: 4, 5: 3}
    
  • when i try to create a histogram on this data, it looks as follows by default -
    
    plt.hist(nums)
    

  • we can configure the bins as follows. my understanding - 1 and 2 together have frequency of 10, 3 has frequency of 3, while 4 and 5 together have frequency of 7. now, the range has been divided into three parts - 1-2.33, 2.33-3.67 and 3.67-5 - and the histogram has been plotted accordingly
    
    plt.hist(nums, bins=3)
    

  • histograms are a little different i feel because unlike pie chart, bar graph, etc where we give the actual values to be plotted, here, we only give a series of values and it automatically calculates the frequency and bins them accordingly
  • a realistic example - comparing the distribution of ages of people travelling in first class vs third class on the titanic. observation - more younger people were travelling in third class, whereas more older people were travelling in first class. also, note how we change the alpha to visualize them simultaneously
    
    titanic = pd.read_csv('/content/drive/MyDrive/Python - Basic Data Analysis/titanic.csv')
      
    # clean the age column
    titanic.replace({ 'age': { '?': None } }, inplace=True)
    titanic['age'] = titanic['age'].astype('float')
      
    # extract the ages of first and third class
    first_class_ages = titanic[titanic['pclass'] == 1]['age']
    third_class_ages = titanic[titanic['pclass'] == 3]['age']
      
    # plot the ages
    plt.hist(first_class_ages, alpha=0.5, label='first')
    plt.hist(third_class_ages, alpha=0.5, label='third')
      
    plt.legend()
    

  • pie charts
    • we use the explode parameter to disconnect the sectors from the pie chart. the fraction determines how far out the sectors would be from the pie. the order is the same as the order of the labels
    • we use the autopct parameter to add percentages inside the sectors. we are using autopct='%0.0f%%' here; if we had used for e.g. autopct='%.2f', it would have shown in this form - 57.73 (with 2 decimal places and without the %)
    1
    2
    
    plt.pie(costs, labels=labels, autopct='%0.0f%%', explode=(0, 0.1, 0, 0, 0.1))
    plt.show()
    

  • subplots - multiple axes in the same figure
    • we use subplot to tell the dimensions and the correct subplot index. in the example below, we say 1 row, 3 columns, and go 1, 2 and 3 respectively for the index
    • title is used for individual axes headings while suptitle is used for the figure heading
    • we call tight layout, as it helps python adjust the padding around subplots
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    
    nums = [1, 2, 3, 4, 5]
    nums_squared = [num ** 2 for num in nums]
    nums_cubed = [num ** 3 for num in nums]
      
    plt.figure(figsize=(12, 4))
    plt.suptitle("Polynomials")
      
    plt.subplot(1, 3, 1)
    plt.title("X")
    plt.plot(nums, nums)
      
    plt.subplot(1, 3, 2)
    plt.title("X Squared")
    plt.plot(nums, nums_squared)
      
    plt.subplot(1, 3, 3)
    plt.title("X Cubed")
    plt.plot(nums, nums_cubed)
      
    plt.tight_layout()
    plt.show()
    

  • now, imagine if we go back to our titanic example, and we want to plot all three classes - first second and third in different subplots -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    
    titanic = pd.read_csv('/content/drive/MyDrive/Python - Basic Data Analysis/titanic.csv')
    titanic['age'] = pd.to_numeric(titanic['age'], errors='coerce')
      
    first_class_ages = titanic[titanic['pclass'] == 1]['age']
    second_class_ages = titanic[titanic['pclass'] == 2]['age']
    third_class_ages = titanic[titanic['pclass'] == 3]['age']
      
    plt.figure(figsize=(12, 4))
    plt.suptitle('titanic class vs age distribution')
      
    plt.subplot(1, 3, 1)
    plt.title('1st class')
    plt.hist(first_class_ages)
      
    plt.subplot(1, 3, 2)
    plt.title('2nd class')
    plt.hist(second_class_ages)
      
    plt.subplot(1, 3, 3)
    plt.title('3rd class')
    plt.hist(third_class_ages)
      
    plt.tight_layout()
    plt.show()
    

  • issue - we know that the scale in the third vs other plots are different i.e. a lot more people are travelling in the third class than in the 2nd and 1st class. this is not evident right off the bat by looking at the graph. hence, we can specify the sharey parameter
    1
    2
    3
    4
    5
    6
    
    # ...
    axes = plt.subplot(1, 3, 1)
    # ...
    plt.subplot(1, 3, 2, sharey=axes)
    # ...
    plt.subplot(1, 3, 3, sharey=axes)
    

  • question - write the code for achieving the below. note - do not use the plot method of pandas series and dataframes
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    
    houses = pd.read_csv('/content/drive/MyDrive/Python - Basic Data Analysis/kc_house_data.csv', parse_dates=['date'])
    
    sales_by_month = houses['date'].dt.month.value_counts().sort_index()
    # date
    # 1      978
    # 2     1250
    # 3     1875
    # 4     2231
    # 5     2414
    # 6     2180
    # 7     2211
    # 8     1940
    # 9     1774
    # 10    1878
    # 11    1411
    # 12    1471
    
    sales_by_week_day = houses['date'].dt.day_of_week.value_counts().sort_index()
    # date
    # 0    4099
    # 1    4715
    # 2    4603
    # 3    3994
    # 4    3685
    # 5     287
    # 6     230
    
    plt.figure(figsize=(10, 4))
    
    plt.subplot(1, 2, 1)
    week_days = ['Mon', 'Tue', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
    plt.title('Sales by Week Day')
    plt.xticks(sales_by_week_day.index, week_days)
    plt.bar(sales_by_week_day.index, sales_by_week_day.values)
    
    plt.subplot(1, 2, 2)
    plt.xticks(range(1, 13))
    plt.title('Sales by Month')
    plt.plot(sales_by_month.index, sales_by_month.values)
    
    plt.tight_layout()
    plt.show()
    
  • note how we use index and values that we discussed here
  • we also had to sort by index first before beginning to plot, because value counts sorts by values by default
  • notice the use of xticks for renaming the labels for weekdays. i had to do the same thing for months as well, otherwise the default was 2, 4, 6, 8…

Matplotlib + Pandas

  • plotting a pandas series
    1
    
    titanic['sex'].value_counts().plot(kind='pie')
    

  • plotting a pandas dataframe - note how it is making a bar for all columns automatically
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    house_area
    #        sqft_living  sqft_lot
    # 12777  13540        307752
    # 7252   12050        27600
    # 3914   10040        37325
    # 9254   9890         31374
    # 8092   9640         13068
    
    house_area.plot(kind='bar')
    

  • ufo sightings by month - we use this series in the next few points, and this is what our data looks like -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
    ufo_sightings_by_month
    # 1.0      5979
    # 2.0      4559
    # 3.0      5494
    # 4.0      5817
    # 5.0      6063
    # 6.0      8357
    # 7.0     10682
    # 8.0      8997
    # 9.0      8498
    # 10.0     8371
    # 11.0     7596
    # 12.0     6525
    
  • for providing parameters like title, we have two options -
    • option 1 - in the same line. disadvantage - fewer options to configure styling etc
      1
      
      ufo_sightings_by_month.plot(kind='bar', title='UFO Sightings by Month', xlabel='month', ylabel='num. of sightings')
      
    • option 2 - the central idea is that instead of interacting only with the pandas plot api, we mix in calls to matplotlib apis directly, like we saw in the matplotlib section. advantage - now we can configure styling etc
      1
      2
      3
      4
      5
      
      ufo_sightings_by_month.plot(kind='bar')
      
      plt.title('UFO Sightings by Month')
      plt.xlabel('month')
      plt.ylabel('num. of sightings')
      

  • now, we would like to use month abbreviations instead. we have multiple options -
    • option 1 - use rename to rename indices
      1
      2
      3
      4
      5
      
      months_lookup = { idx + 1: months[idx] for idx in range(12) }
      # {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
          
      ufo_sightings_by_month_abbrev = ufo_sightings_by_month.rename(index=months_lookup)
      ufo_sightings_by_month_abbrev.plot(kind='bar', title='UFO Sightings by Month')
      
    • option 2 - use xticks. this is useful if we just want to modify plots but it might make testing etc difficult
      1
      2
      
      ufo_sightings_by_month.plot(kind='bar', title='UFO Sightings by Month')
      plt.xticks(range(12), labels=months)
      

  • by default, bar charts for dataframes look as follows. understand that pandas comes with reasonable defaults and helpers. so much effort was required from our end when doing this manually using matplotlib - specifying labels and legends, specifying the align property with a negative width, etc
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    salaries
    #                    BasePay    OvertimePay  OtherPay
    # EmployeeName
    # NATHANIEL FORD     167411.18  0.00         400184.25
    # GARY JIMENEZ       155966.02  245131.88    137811.38
    # ALBERT PARDINI     212739.13  106088.18    16452.60
    # CHRISTOPHER CHONG  77916.00   56120.71     198306.90
    
    salaries.plot(kind='barh')
    

  • making a stacked version too is so much easier compared to doing it via matplotlib manually by specifying bottom / left etc
    1
    
    salaries.plot(kind='barh', stacked=True)
    

  • the usual way - .plot(kind='hist'). it creates all graphs in the same axes
    1
    
    salaries.plot(kind='hist')
    

  • calling .hist() directly. it creates different axes for the different columns - feels like subplots
    1
    
    salaries.hist()
    

  • box plot - this too helps visualize the distribution of values, like a histogram. a rough summary -
    • we have a line at the median (the green line)
    • the box spans the interquartile range (25th to 75th percentile), and the bulk of the distribution lies between the two whiskers (the two standalone blue lines)
    • the fliers depict the outliers (the circles). e.g. one house had 33 or so bedrooms, which shows up as a flier in the boxplot
    1
    
    houses['bedrooms'].plot(kind='box')
    

  • we can view the list of configuration parameters here. e.g. we can disable the fliers
    1
    
    houses[['bedrooms', 'bathrooms']].plot(kind='box', showfliers=False)
    

  • scatter plot - how different variables, e.g. bedrooms and bathrooms, correlate to each other. refer this for different configuration options
    1
    
    houses.plot(kind='scatter', x='bedrooms', y='bathrooms', marker='x', c='#2ca02c')
    

  • adding multiple graphs to the same axes on the same figure - same as we saw in matplotlib i.e. we need to call figure on plt for creating a new figure, else the current active figure is used
  • e.g. - ufo sightings have a shape attribute. find the 5 most common shapes, and plot them on the same axes. use a legend to differentiate between them. plot them for the range 2000-2018
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    common_shapes = ufos['shape'].value_counts().nlargest(5)
    
    for common_shape in common_shapes.index:
        years = ufos[ufos['shape'] == common_shape]['date_time'].dt.year
        years.value_counts().sort_index().plot(kind='line', label=common_shape)
    
    plt.legend()
    plt.xlim(2000, 2018)
    plt.title('UFO Sightings by Shape (2000-2018)')
    

  • e.g. plot how blinding lights performed on the charts. note how we can specify the x and y attributes when plotting dataframes. also, note how we can invert the y axis - a rank is better when lower, so inverting the axis shows a better rank as a peak and a worse rank as a trough
    1
    2
    3
    4
    5
    6
    
    billboard_charts = pd.read_csv('/content/drive/MyDrive/Python - Basic Data Analysis/billboard_charts.csv', parse_dates=['date'])
    blinding_lights = billboard_charts[billboard_charts['song'] == 'Blinding Lights']
      
    blinding_lights.plot(y='rank', x='date')
    plt.gca().invert_yaxis()
    plt.title('Blinding Lights Chart Performance')
    

  • when we try plotting a dataframe, the different columns would be plotted on the same axes by default
    1
    
    salaries.plot(kind='hist')
    

  • we can create subplots instead just by passing in keyword arguments
    1
    
    salaries.plot(kind='hist', subplots=True)
    

  • we can configure other parameters like layout (the dimensions), sharex / sharey, etc as well, already discussed in matplotlib
    1
    2
    
    salaries.plot(kind='hist', subplots=True, layout=(1, 3), figsize=(20, 5), sharex=True, bins=30)
    plt.tight_layout()
    

  • note, my understanding - the above method of passing in true for the subplots keyword argument works because we wanted to plot the different columns of the same dataframe. what if we wanted to plot entirely different series etc on the same figure on different axes. we use a combination of interacting with matplotlib apis directly and through pandas apis. apis used -
    • subplots can be called for setting the dimensions of the subplot, setting figure size, etc. it returns both the figure and the axes created in the process. the axes we receive has the same rows / columns as the dimensions we specify. note that parameters like sharex / sharey can be passed into this subplots call as well
    • note how we pass in axes argument to plot of pandas series / dataframe
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
      
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
      
    data_2000 = ufos[ufos['date_time'].dt.year == 2000]['date_time'].dt.month.value_counts().sort_index()
    data_2000.plot(kind='barh', ax=axes[0][0], ylabel='', title=2000)
    axes[0][0].set_yticks(range(12), labels=months)
      
    data_2001 = ufos[ufos['date_time'].dt.year == 2001]['date_time'].dt.month.value_counts().sort_index()
    data_2001.plot(kind='barh', ax=axes[0][1], ylabel='', title=2001)
    axes[0][1].set_yticks(range(12), labels=months)
      
    data_2002 = ufos[ufos['date_time'].dt.year == 2002]['date_time'].dt.month.value_counts().sort_index()
    data_2002.plot(kind='barh', ax=axes[0][2], ylabel='', title=2002)
    axes[0][2].set_yticks(range(12), labels=months)
      
    data_2003 = ufos[ufos['date_time'].dt.year == 2003]['date_time'].dt.month.value_counts().sort_index()
    data_2003.plot(kind='barh', ax=axes[1][0], ylabel='', title=2003)
    axes[1][0].set_yticks(range(12), labels=months)
      
    data_2004 = ufos[ufos['date_time'].dt.year == 2004]['date_time'].dt.month.value_counts().sort_index()
    data_2004.plot(kind='barh', ax=axes[1][1], ylabel='', title=2004)
    axes[1][1].set_yticks(range(12), labels=months)
      
    data_2005 = ufos[ufos['date_time'].dt.year == 2005]['date_time'].dt.month.value_counts().sort_index()
    data_2005.plot(kind='barh', ax=axes[1][2], ylabel='', title=2005)
    axes[1][2].set_yticks(range(12), labels=months)
      
    plt.suptitle(f'UFO Sightings by Months (2000-2005)')
      
    plt.tight_layout()
    

  • for e.g. reproduce the graph below -
    • this time around, there is just one axes. so, we can call set_xticks, set_title, etc on this one axes itself
    • again, since this is ranks of songs, we invert the y axis
    • the labels on the x axis were another challenge here, but easy when using xticks
    • pandas, matplotlib, etc are smart enough to understand dates even if we specify them as strings - note how we pass plain strings for the dates to between
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    
    years = [2016, 2017, 2018, 2019, 2020]
    christmases = [f'{year}-12-25' for year in years]
    # ['2016-12-25', '2017-12-25', '2018-12-25', '2019-12-25', '2020-12-25']
      
    songs = [
        { 'song': 'All I Want For Christmas Is You', 'artist': 'Mariah Carey' },
        { 'song': 'Rockin\' Around The Christmas Tree', 'artist': 'Brenda Lee' },
        { 'song': 'Jingle Bell Rock', 'artist': 'Bobby Helms' }
    ]
      
    period = billboard_charts['date'].between(christmases[0], christmases[-1])
      
    _, axes = plt.subplots(1, 1, figsize=(10, 7))
      
    plt.gca().invert_yaxis()
      
    years = [2016, 2017, 2018, 2019, 2020]
    christmas_values = [pd.to_datetime(f'12-25-{year}') for year in years]
    christmas_labels = [f'Xmas {year}' for year in years]
      
    axes.set_xticks(christmas_values, christmas_labels)
    axes.set_title('Christmas Songs on the Hot')
      
    for song in songs:
        condition = (billboard_charts['song'] == song['song']) & (billboard_charts['artist'] == song['artist'])
        billboard_charts[condition & period].plot(kind='line', x='date', y='rank', ax=axes, label=song['song'], xlabel='')
      
    plt.legend(loc='upper left')
    plt.tight_layout()
    plt.show()
    
  • for saving a figure to a local file, use plt.savefig('path.png')
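  • a minimal sketch - the data, file name and dpi below are arbitrary; savefig writes out whatever figure is currently active -
    plt.plot([1, 2, 3], [1, 4, 9])
    plt.savefig('squares.png', dpi=150, bbox_inches='tight')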

Grouping and Aggregation

  • assume i have data for stocks of different cars like below -
    1
    2
    3
    4
    5
    6
    
    car_stocks
    
    #    Symbol  Date        Open        High        Low         Close       Adj Close   Volume
    # 0  RIVN    2021-11-10  106.750000  119.459999  95.199997   100.730003  100.730003  103679500
    # 1  RIVN    2021-11-11  114.625000  125.000000  108.010002  122.989998  122.989998  83668200
    # 2  RIVN    2021-11-12  128.645004  135.199997  125.250000  129.949997  129.949997  50437500
    
  • to get the mean of a particular stock, i can do the following -
    1
    
    car_stocks[car_stocks['Symbol'] == 'RIVN']['Close'].mean()  # 127.523
    
  • but what if i wanted the mean for each of the stocks in the dataframe individually? i can do it as follows
    1
    2
    3
    4
    5
    6
    7
    
    car_stocks.groupby('Symbol')['Close'].mean()
    
    # Symbol
    # GM       62.164615
    # LCID     49.829231
    # RIVN    127.523077
    # Name: Close, dtype: float64
    
  • notice how groupby gives us a pandas data frame group by object
    1
    2
    3
    
    car_stocks.groupby('Symbol')
    
    # <pandas.core.groupby.generic.DataFrameGroupBy object at 0x77f885d61a90>
    
  • we can call ngroups to see the number of groups -
    1
    
    car_stocks.groupby('Symbol').ngroups  # 3
    
  • we can call groups to see the actual groups. it is a dictionary, where the keys are the actual keys we used to group, while the values are the indices of the rows
    1
    2
    3
    4
    5
    
    car_stocks.groupby('Symbol').groups
    
    # {'GM': [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38],
    # 'LCID': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25],
    # 'RIVN': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]}
    
  • iterating over the dataframe group by object using a for loop - we get back a tuple of the form (group name, dataframe). of course, each dataframe only contains the rows belonging to that group (the column we grouped by is still present in it). use case - useful when the aggregation functions available to us by default are not enough, and we want to run some custom functionality
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    
    for name, group in car_stocks.groupby('Symbol'):
        print(name)
        print('--------------------')
        print(group[['Low', 'High']].describe())
        print('\n')
     
    # GM
    # --------------------
    #              Low       High
    # mean   61.051539  63.129231
    # min    57.730000  60.560001
    # max    62.630001  65.180000
    # 
    # 
    # LCID
    # --------------------
    #              Low       High
    # mean   46.442539  51.811538
    # min    39.341000  45.000000
    # max    50.709999  57.750000
    # 
    # 
    # RIVN
    # --------------------
    #               Low        High
    # mean   119.150000  135.309230
    # min     95.199997  114.500000
    # max    153.779999  179.470001
    
  • when we tried calculating the mean of the closing price earlier -
    • when we did car_stocks.groupby('Symbol'), we got back a dataframe group by object
    • when we added a car_stocks.groupby('Symbol')['Close'], we got back a series group by object
    • we finally called car_stocks.groupby('Symbol')['Close'].mean() to get back the mean of closing price for each symbol (i.e. stock)
  • if we would have called mean on the dataframe group by object directly, we would have gotten back a dataframe -
    1
    2
    3
    4
    5
    6
    7
    
    car_stocks.groupby('Symbol').mean()
      
    #         Open        High        Low
    # Symbol
    # GM      61.937693   63.129231   61.051539
    # LCID    48.761538   51.811538   46.442539
    # RIVN    127.710000  135.309230  119.150000
    
  • split, apply, combine - this is a workflow. going back to the closing price mean by stock example -
    • we split into different parts - e.g. forming groups using groupby
    • we then apply a function on each of the parts - e.g. performing a mean on each of these groups individually
    • we finally combine the results from each of these parts - we get back a series containing the mean for each group (a manual sketch of this workflow below)
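  • a rough sketch of the same workflow done by hand, just to make the three steps explicit (reusing the car_stocks dataframe from above) -
    means = {}
    for symbol, group in car_stocks.groupby('Symbol'):  # split into groups
        means[symbol] = group['Close'].mean()           # apply a function to each group
    pd.Series(means)                                    # combine the per-group results
    # same result as car_stocks.groupby('Symbol')['Close'].mean()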
  • we can also run multiple aggregation functions at once - below, we run it on both dataframe group by object and series group by object. running it on the dataframe group by object results in hierarchical columns -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    
    car_stocks.groupby('Symbol')['Close'].agg(['mean', 'min', 'max'])
    
    #         mean        min         max
    # Symbol   
    # GM      62.164615   59.270000   64.610001
    # LCID    49.829231   40.750000   55.520000
    # RIVN    127.523077  100.730003  172.009995
    
    car_stocks.groupby('Symbol').agg(['mean', 'min', 'max'])
    #         Open                                High                                Low
    #         mean        min         max         mean        min         max         mean        min        max
    # Symbol
    # GM      61.937693   57.849998   64.330002   63.129231   60.560001   65.180000   61.051539   57.730000  62.630001
    # LCID    48.761538   42.299999   56.200001   51.811538   45.000000   57.750000   46.442539   39.341000  50.709999
    # RIVN    127.710000  106.750000  163.800003  135.309230  114.500000  179.470001  119.150000  95.199997  153.779999
    
  • we can go more granular as well - we can run specific aggregation functions for specific columns -
    1
    2
    3
    4
    5
    6
    7
    8
    
    car_stocks.groupby('Symbol').agg({ 'Open': ['min', 'max'],  'Close': ['mean'] })
    
    #         Open                    Close
    #         min         max         mean
    # Symbol     
    # GM      57.849998   64.330002   62.164615
    # LCID    42.299999   56.200001   49.829231
    # RIVN    106.750000  163.800003  127.523077
    
  • we can provide custom functions to agg as well - understand that this could very well have been a function from a library, and we would just have to pass its reference -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    def range(x):
        return x.max() - x.min()
      
    car_stocks.groupby('Symbol')['Open'].agg(range)
    
    # Symbol
    # GM       6.480004
    # LCID    13.900002
    # RIVN    57.050003
    
  • x is a pandas series, and range is called for every group - for all open prices for a particular stock, one at a time. (aside - naming the function range shadows python's built-in range, so a different name might be safer)
  • another example - this time, our custom aggregation function is called for multiple attributes, but everything is still the same. just that the output changes from a series to a dataframe, but the aggregation function is still called on a per attribute, per group basis
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    def count_nulls(x):
        return len(x) - x.count()
      
    titanic.groupby('pclass').agg(count_nulls)
    
    #         survived  age  sex
    # pclass
    # 1       0         39   0
    # 2       0         16   0
    # 3       0         208  0
    
  • named aggregations - we just saw nested columns above, when we try to do multiple aggregations on multiple columns at once. this can make accessing data more complicated, since we would have to use hierarchical columns. in general, we might want to give a custom name to the result of our aggregation. we can do so using named aggregations -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    car_stocks.groupby('Symbol').agg(
        close_avg=('Close', 'mean'),
        close_max=('Close', 'max'),
    )
    
    #         close_avg   close_max
    # Symbol
    # GM      62.164615   64.610001
    # LCID    49.829231   55.520000
    # RIVN    127.523077  172.009995
    
  • example - we have statistics on a per player basis for laliga, with columns for team name, shots taken, and shots on target
  • we would like to generate the plot below - x axis of the plot is shared
  • generating the relevant data -
    • first, we find total shots and shots on target by a team
    • for this, we group by team, and perform sum aggregations for shots and shots on target
    • we calculate accuracy using these two
    • finally, we sort the data based on accuracy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    
    team_stats = laliga.groupby('Team').agg(
        total=('Shots', 'sum'),
        on_target=('Shots on target', 'sum')
    )
    team_stats['accuracy'] = team_stats['on_target'] / team_stats['total']
    team_stats.sort_values(['accuracy'], inplace=True)
    team_stats
    
    #                   total  on_target  accuracy
    # Team   
    # SD Eibar          422    153        0.362559
    # D. Alavés         299    109        0.364548
    # CD Leganés        334    132        0.395210
    # R. Valladolid CF  319    131        0.410658
    # SD Huesca         343    142        0.413994
    
  • generating the plot -
    • most accurate teams - top 5 rows, least accurate teams - bottom 5. use head and tail to obtain them
    • we have entirely different pandas plots that we would like to plot on the same figure on different axes. so, we use subplots. subplots can apart from dimensions, receive the sharex parameter
    • note how we pass the axes received from subplots to plot
    • since the x axis is shared, setting the xticks on one axes applies to both
    1
    2
    3
    4
    5
    6
    7
    8
    
    fig, axes = plt.subplots(2, 1, sharex=True)
    
    team_stats.tail(5).plot(kind='barh', y='accuracy', ax=axes[0], legend=False, title='Most Accurate Teams', color='green')
    team_stats.head(5).plot(kind='barh', y='accuracy', ax=axes[1], legend=False, title='Least Accurate Teams', color='red')
    
    axes[0].set_xticks([0.1, 0.2, 0.3, 0.4, 0.5])
    
    plt.tight_layout()
    
  • we discuss hierarchical indexing next, but we can group by levels of hierarchical indices as well. we need to specify the level keyword argument for that
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    
    state_pops
    
    #              population
    # state  year 
    # AK     1990  553290.0
    #        1991  570193.0
    #        1992  588736.0
    # ...    ...   ...
    # WY     2009  559851.0
    #        2010  564222.0
    #        2011  567329.0
    
    state_pops.groupby(level=['year']).sum()
    
    # year  
    # 1990  499245628.0
    # 1991  505961884.0
    # ...
    # 2012  631398915.0
    # 2013  635872764.0
    
  • note - we specify name in this case, but we could have specified the level - 0, 1 etc as well
  • if we see, the components of the hierarchical index are named, so specifying their names directly without the level keyword argument inside of groupby would have worked as well
    1
    2
    3
    4
    5
    6
    7
    8
    
    state_pops.groupby('year').sum()
    
    # year  
    # 1990  499245628.0
    # 1991  505961884.0
    # ...
    # 2012  631398915.0
    # 2013  635872764.0
    
  • summary -
    • we saw grouping using attributes by now
    • but then we might want to group by index / components of a hierarchical index as well
    • hence we could use the level keyword argument
    • but then, we could use the same syntax as attributes for indices as well i.e. omit the level keyword argument

Hierarchical Indexing

  • also called multi indexing
  • when we group by a single column, we get the following result -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    mean_by_sex = titanic.groupby('sex')['age'].mean()
    
    mean_by_sex.index
    
    # Index(['female', 'male'], dtype='object', name='sex')
      
    mean_by_sex
    
    # sex
    # female    28.687071
    # male      30.585233
    
  • however, when we group by multiple columns, we get the following result -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    
    mean_by_pclass_and_sex = titanic.groupby(['pclass', 'sex'])['age'].mean()
    
    mean_by_pclass_and_sex.index
    
    # MultiIndex([(1, 'female'),
    #             (1,   'male'),
    #             (2, 'female'),
    #             (2,   'male'),
    #             (3, 'female'),
    #             (3,   'male')],
    #            names=['pclass', 'sex'])
    
    mean_by_pclass_and_sex
    
    # pclass  sex
    # 1       female    37.037594
    #         male      41.029250
    # 2       female    27.499191
    #         male      30.815401
    # 3       female    22.185307
    #         male      25.962273
    
  • so, labels instead of being a plain index are now multi index
  • above, we showed a multi index with a series, below is an example of a multi index with a dataframe. the index in this case is the same as the one we got when doing a mean of age, only the entire data structure changes from a series to a dataframe
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    titanic.groupby(['pclass', 'sex']).mean(numeric_only=True)
    
    #                 survived  age        sibsp     parch     fare
    # pclass  sex
    # 1       female  0.965278  37.037594  0.555556  0.472222  37.037594
    #         male    0.340782  41.029250  0.340782  0.279330  41.029250
    # 2       female  0.886792  27.499191  0.500000  0.650943  27.499191
    #         male    0.146199  30.815401  0.327485  0.192982  30.815401
    # 3       female  0.490741  22.185307  0.791667  0.731481  22.185307
    #         male    0.152130  25.962273  0.470588  0.255578  25.962273
    
  • typically when setting up an index, we want it to -
    • be unique - having the same index for multiple rows in a dataframe does not give an error, but it is typically not advisable - e.g. loc would give us multiple rows (see the tiny sketch after this list)
    • make our data easily accessible - use for e.g. semantic index / natural key
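  • a tiny sketch of the non-unique index behaviour mentioned above (the dataframe here is made up) -
    df = pd.DataFrame({'price': [10, 12, 7]}, index=['a', 'a', 'b'])
    df.loc['a']
    #    price
    # a     10
    # a     12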
  • imagine we have the following dataframe -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    state_pops = pd.read_csv('data/state_pops.csv')
    state_pops
    
    #       state  year  population
    # 0     AL     2012  4817528.0
    # 1     AL     2010  4785570.0
    # ...   ...    ...   ...
    # 1270  USA    2011  311582564.0
    # 1271  USA    2012  313873685.0
    
  • we can set up a custom hierarchical index for this dataset
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    state_pops.set_index(['state', 'year'], inplace=True)
    state_pops
    
    #              population
    # state  year  
    # AL     2012  4817528.0
    #        2010  4785570.0
    #        2011  4801627.0
    # USA    2013  316128839.0
    #        2009  306771529.0
    #        2010  309326295.0
    
  • if we try sorting the index, by default, the data is sorted in the order of levels - e.g. the data is sorted first by state, and for a state, the rows are sorted by years
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    state_pops.sort_index()
    
    #              population
    # state  year
    # AK     1990  553290.0
    #        1991  570193.0
    #        1992  588736.0
    # WY     2009  559851.0
    #        2010  564222.0
    #        2011  567329.0
    
  • assume we want to sort the data by years only. so, all the data for the lowest year should come first and so on. we can do the below -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    state_pops.sort_index(level=1)
    
    #             population
    # state year 
    # AK    1990  553290.0
    # AL    1990  4050055.0
    # ...   ...   ...
    # WV    2013  1854304.0
    # WY    2013  582658.0
    
  • finally, assume we would like to sort in ascending order of state but then descending order of year. we can do the below -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    state_pops.sort_index(level=[0, 1], ascending=[True, False])
    
    #              population
    # state  year
    # AK     2013  735132.0
    #        2012  730307.0
    # ...    ...   ...
    # WY     1994  480283.0
    #        1993  473081.0
    
  • finally - we were using numbers for levels till now, but names are supported as well - e.g. we can use state_pops.sort_index(level=['year'], inplace=True)
  • indexing - behavior around slicing etc is pretty similar to what we studied here, just that we need to be wary of levels
  • accessing by the first level only - we get back a dataframe, and not a series
    1
    2
    3
    4
    5
    6
    7
    
    state_pops.loc['WY']
    
    #       population
    # year
    # 1990  453690.0
    # 1991  459260.0
    # 1992  466251.0
    
  • accessing by all levels - we get back a series whose index is made up of the dataframe's columns. we need to provide a tuple with the values for all the levels.
    1
    2
    3
    
    state_pops.loc[('WY', 2013)]
    
    # population    582658.0
    
  • note - we can still use slicing etc when using tuples -
    1
    2
    3
    4
    5
    6
    7
    8
    
    state_pops.loc[('WY', 2010) : ('WY', 2013)]
    
    #              population
    # state  year
    # WY     2010  564222.0
    #        2011  567329.0
    #        2012  576626.0
    #        2013  582658.0
    
  • till now, we saw accessing using the 1st level and all levels. what if we would like to access using some intermediate level(s)?
  • first, recall from updating, if we have a normal dataframe without the hierarchical indexing, we would use loc as follows (remember that : by itself means everything - all indices / all columns depending on where it is used) -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    
    titanic
    #    pclass  survived
    # 0  1       1
    # 1  1       1
    # 2  1       0
    
    titanic.loc[:, 'pclass']
    # 0       1
    # 1       1
    # 2       1
    
    titanic.loc[:, ['pclass']]
    #   pclass
    # 0 1
    # 1 1
    # 2 1
    
    titanic.loc[:, :]
    #    pclass  survived
    # 0  1       1
    # 1  1       1
    # 2  1       0
    
  • so, extending the above to a dataframe with hierarchical indexing, my understanding is that we need extra commas for the extra levels. back to our original question of how to access using selective levels - we can use : for the levels for which we want everything, a singular value like a, a range like a:b, or selected values like [a, b], depending on the use case
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    
    state_pops.loc[:,:,]
    #             population
    # state year
    # AK    1990  553290.0
    #       1991  570193.0
    # ...   ...   ...
    # WY    2009  559851.0
    #       2010  564222.0
    
    # since we specify only one value for the year level
    # pandas drops this level from the index altogether
    state_pops.loc[:,2010,:]
    #        population
    # state
    # AK     713868.0
    # AL     4785570.0
    # AR     2922280.0
    # AZ     6408790.0
    # CA     37333601.0
    # CO     5048196.0
    
    state_pops.loc[:,[2010,2013],:]
    #              population
    # state  year 
    # AK     2010  713868.0
    #        2013  735132.0
    # AL     2010  4785570.0
    #        2013  4833722.0
    # ...    ...   ...
    # WV     2010  1854146.0
    #        2013  1854304.0
    # WY     2010  564222.0
    #        2013  582658.0
    
    state_pops.loc[:,2010:2012,:]
    #              population
    # state  year  
    # AK     2010  713868.0
    #        2011  723375.0
    #        2012  730307.0
    # AL     2010  4785570.0
    #        2011  4801627.0
    # ...    ...   ...
    # WV     2011  1855184.0
    #        2012  1856680.0
    # WY     2010  564222.0
    #        2011  567329.0
    #        2012  576626.0
    
  • cross section or xs is another useful alternative to the loc syntax when using hierarchical indexing - a small sketch below. i will stick to loc for now though
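  • a small sketch of what the xs equivalent could look like for state_pops -
    # every state for the year 2010 - comparable to state_pops.loc[:, 2010, :]
    state_pops.xs(2010, level='year')

    # keep the 'year' level in the result instead of dropping it
    state_pops.xs(2010, level='year', drop_level=False)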

Accessing Hierarchical Index Values

  • for accessing all values of a column, we use the syntax df['col_name'], but this would not work for index column(s)
  • to access the values of an index when a dataframe does not have hierarchical indexing, we use df.index
  • what if we wanted to access the components of a hierarchical index? assume our dataframe looks like this -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    #              population
    # state  year
    # AK     1990  553290.0
    #        1991  570193.0
    #        1992  588736.0
    # ...    ...   ...
    # WY     2009  559851.0
    #        2010  564222.0
    #        2011  567329.0
    
  • to access the index values of a particular position, we can use the following -
    1
    2
    3
    
    state_pops.index[0]  # ('AK', 1990)
    state_pops.index[1]  # ('AK', 1991)
    state_pops.index[2]  # ('AK', 1992)
    
  • to access all the index values, we have two options according to my understanding -
  • option 1 - access via the levels property. but, it will only have the unique values - it would not be an accurate representation of our data
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    state_pops.index.levels
    # FrozenList([['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'VT', 'WA', 'WI', 'WV', 'WY'],
    #             [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013]]
    
    state_pops.index.levels[0]
    # Index(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI',
    #       'VT', 'WA', 'WI', 'WV', 'WY']
    
    state_pops.index.levels[1]
    # Index([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
    #        2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013],
    
  • option 2 - accessing via get_level_values. use case - recall how we performed filtering using column attributes - df[df['col'] > 500]. we can do the same using option 2 - our conditions will now look like df[df.index.get_level_values(1) > 500] (a small sketch on state_pops after the output below)
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    state_pops.index.get_level_values(0)
    # Index(['AK', 'AK', 'AK', 'AK', 'AK', 'AK', 'AK', 'AK', 'AK', 'AK',
    #        ...
    #        'WY', 'WY', 'WY', 'WY', 'WY', 'WY', 'WY', 'WY', 'WY', 'WY'],
    
    state_pops.index.get_level_values(1)[:50]
    # Index([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
            # 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
            # 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
            # 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
            # 1990, 1991],
    

Hierarchical Columns

  • we typically work with them when we use groupings and aggregations without for e.g. flattening them using named aggregations. assume we have created the following dataframe -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    
    titanic_stats = titanic.groupby(['pclass', 'sex']).agg({ 
        'fare': ['sum', 'mean'],
        'age': ['min', 'max', 'mean']
    })
    titanic_stats
    
    #                 fare                  age
    #                 sum        mean       min     max   mean
    # pclass  sex
    # 1       female  4926.0000  37.037594  2.0000  76.0  37.037594
    #         male    6195.4167  41.029250  0.9167  80.0  41.029250
    # 2       female  2832.4167  27.499191  0.9167  60.0  27.499191
    #         male    4868.8333  30.815401  0.6667  70.0  30.815401
    # 3       female  3372.1667  22.185307  0.1667  63.0  22.185307
    #         male    9060.8333  25.962273  0.3333  74.0  25.962273
    
  • now, if we try inspecting the columns property of the dataframe, we see the below -
    1
    2
    3
    4
    5
    6
    7
    
    titanic_stats.columns
    # MultiIndex([('fare',  'sum'),
    #             ('fare', 'mean'),
    #             ( 'age',  'min'),
    #             ( 'age',  'max'),
    #             ( 'age', 'mean')],
    #            )
    
  • to access the individual columns, we have the two options below. recall that when we access a column, we get back a series - the labels of this series are the index of the original dataframe (which in this case is a hierarchical index), while the values of the series are the values of the column. note - the second option is preferred / more efficient i think, because we access the desired data in one go -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    titanic_stats['fare']['sum']  # option 1
    titanic_stats[('fare', 'sum')]  # option 2
    
    # pclass  sex
    # 1       female  4926.0000
    #         male    6195.4167
    # 2       female  2832.4167
    #         male    4868.8333
    # 3       female  3372.1667
    #         male    9060.8333
    

Unstack

  • helps pivot a level of the index into columns. if we do not specify the level, the innermost (last) level is used
  • assume we have the following series -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    titanic_age_stats = titanic.groupby(['pclass', 'sex'])['age'].mean()
    titanic_age_stats
    
    # pclass  sex
    # 1       female  37.037594
    #         male    41.029250
    # 2       female  27.499191
    #         male    30.815401
    # 3       female  22.185307
    #         male    25.962273
    
  • when we try plotting it, we get the following. recollection of how plotting of a pandas series works by default - the x axis is the index (which is a hierarchical / multi index in this case), the y axis is the values
    1
    
    titanic_age_stats.plot(kind='bar')
    

  • when we unstack without any arguments, the below is what happens - the innermost level of sex becomes a column
    1
    2
    3
    4
    5
    6
    7
    
    titanic_age_stats.unstack()
    
    # sex     female    male
    # pclass
    # 1       37.037594  41.029250
    # 2       27.499191  30.815401
    # 3       22.185307  25.962273
    
  • now when we try plotting this, we get the below. recollection of how plotting for a dataframe works - we get a bar for every attribute for every index. the values of these attributes form the y axis, the labels the x axis
    1
    
    titanic_age_stats.unstack().plot(kind='bar')
    

  • we can also specify the level we would like to unstack using -
    1
    2
    3
    4
    5
    6
    7
    8
    
    titanic_age_stats.unstack(level='pclass')
    
    # pclass  1          2          3
    # sex
    # female  37.037594  27.499191  22.185307
    # male    41.029250  30.815401  25.962273
    
    titanic_age_stats.unstack(level='pclass').plot(kind='bar')
    

  • note, my understanding - till now we have performed unstack on a series with a hierarchical index. this results in a dataframe, where the columns come from the level that we unstack, and that level is removed from the hierarchical index
  • complicating things because i am bored - when we try unstacking a dataframe with hierarchical columns - we get an additional level of hierarchical columns
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    
    titanic_age_stats = titanic.groupby(['pclass', 'sex']).agg({
        'age': ['min', 'max', 'mean']
    })
    titanic_age_stats
    #                               age
    #                 min     max   mean
    # pclass  sex
    # 1       female  2.0000  76.0  37.037594
    #         male    0.9167  80.0  41.029250
    # 2       female  0.9167  60.0  27.499191
    #         male    0.6667  70.0  30.815401
    # 3       female  0.1667  63.0  22.185307
    #         male    0.3333  74.0  25.962273
    
    titanic_age_stats.unstack()
    #                                       age
    #         min             max           mean
    # sex     female  male    female  male  female     male
    # pclass
    # 1       2.0000  0.9167  76.0    80.0  37.037594  41.029250
    # 2       0.9167  0.6667  60.0    70.0  27.499191  30.815401
    # 3       0.1667  0.3333  63.0    74.0  22.185307  25.962273
    

Textual Data

  • by default, pandas assigns type object to columns if they cannot be assigned numeric data types. object data type encompasses strings, numbers, arrays, etc everything
  • my understanding - even if a column is of type object, we can access string methods on it. the other option i believe is to convert it to string type first using astype
  • we can access string methods using str
    1
    2
    3
    4
    5
    6
    7
    
    titanic['name'].str.lower()
    
    # 0      allen, miss. elisabeth walton
    # 1     allison, master. hudson trevor
    #                                  ...
    # 1307             zakarian, mr. ortin
    # 1308              zimmerman, mr. leo
    
  • understand that we just used lower on the column, but pandas was smart enough to apply it to the entire series. this is also applicable to string indexing. e.g. the cabin column looks like below - it is a combination of deck and cabin number, and we make a new column just for deck as follows
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
    titanic['cabin']
    
    # 0       B5
    # 1  C22 C26
    # 2  C22 C26
    # 3  C22 C26
    
    titanic['deck'] = titanic['cabin'].str[0]
    titanic['deck']
    # 0  B
    # 1  C
    # 2  C
    # 3  C
    
  • we can use slicing etc as well
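  • a small sketch of slicing, still using the cabin column from above -
    # everything after the deck letter
    titanic['cabin'].str[1:]

    # .str.slice is the method form of the same thing
    titanic['cabin'].str.slice(1)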
  • strip - strips whitespaces by default
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    
    s = pd.Series(['1. Hawk.   ', '2. Pickle!\n', '3. Melonhead?\t'])
    s
    
    # 0        1. Hawk.   
    # 1       2. Pickle!\n
    # 2    3. Melonhead?\t
    
    s.str.strip()
    
    # 0         1. Hawk.
    # 1       2. Pickle!
    # 2    3. Melonhead?
    
  • note - more features of the strip api (a small sketch below) -
    • specify the characters to strip using the to_strip parameter
    • it also has different versions - lstrip and rstrip to only strip from beginning / end
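  • a small sketch of those strip variants (outputs omitted) -
    # strip a custom set of characters from both ends instead of just whitespace
    s.str.strip(to_strip=' .!?\n\t')

    # lstrip / rstrip only touch one end
    s.str.rstrip()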
  • split - split strings into components. by default, the output would be a list for every string
    1
    2
    3
    4
    5
    6
    7
    
    titanic['home.dest'].str.split('/')
    
    # 0                      [St Louis, MO]
    # 1  [Montreal, PQ ,  Chesterville, ON]
    # 2  [Montreal, PQ ,  Chesterville, ON]
    # 3  [Montreal, PQ ,  Chesterville, ON]
    # 4  [Montreal, PQ ,  Chesterville, ON]
    
  • we can make each element its own series / column by setting the expand option to true
    1
    2
    3
    4
    5
    6
    
    titanic['home.dest'].str.split('/', expand=True)
    
    #    0             1                 2
    # 0  St Louis, MO  None              None
    # 1  Montreal, PQ  Chesterville, ON  None
    # 2  Montreal, PQ  Chesterville, ON  None
    
  • note - more features of the split api (a small sketch below) -
    • a regex instead of a normal string to split on
    • we can cap the number of splits using the n parameter - no more splits are created beyond it, and the remainder is kept together in the last column
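  • a small sketch of those split options - note that the regex keyword needs a reasonably recent pandas (1.4+, i believe) -
    # at most one split - the remainder stays together in the last column
    titanic['home.dest'].str.split('/', n=1, expand=True)

    # split on a regex - here, a slash with optional surrounding whitespace
    titanic['home.dest'].str.split(r'\s*/\s*', regex=True, expand=True)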
  • replace - we have already seen replace, but this is the replace method available for string data type
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    ufos['duration']
    # 0    5 seconds
    # 1  3-5 seconds
    # 2          NaN
    # 3   10 seconds
    
    ufos['duration'].str.replace('seconds', 's')
    # 0    5 s
    # 1  3-5 s
    # 2    NaN
    # 3   10 s
    
  • above was a simple use case, but replace can get a lot more powerful - refer the docs - we can match using a regex, and instead of passing in the replacement string, we can pass a callable which is called with each regex match
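  • a hedged sketch of both options on the duration column from above - the transformations themselves are arbitrary -
    # regex version - turn '5 seconds' / '10 seconds' into '5s' / '10s'
    ufos['duration'].str.replace(r'(\d+)\s*seconds', r'\1s', regex=True)

    # callable version - the function receives each match object
    ufos['duration'].str.replace(r'[a-z]+', lambda m: m.group(0).upper(), regex=True)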
  • contains - returns a boolean series
  • again, instead of a plain string, we can pass in a regex to match as well
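  • a small sketch of contains, using the titanic names from earlier -
    # boolean series - which names contain the literal text 'Miss'
    titanic['name'].str.contains('Miss')

    # typical use - filtering. a regex works too, and na=False treats missing values as non-matches
    titanic[titanic['name'].str.contains(r'Miss\.|Mrs\.', regex=True, na=False)]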
  • a complex example - imagine the movies in our dataset have a “genres” column, which are separated by pipes. we can find the genre value counts as follows using explode -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    
    movies['genres']
    
    # 0   Animation|Comedy|Family
    # 1  Adventure|Fantasy|Family
    # 2            Romance|Comedy
    # 3      Comedy|Drama|Romance
    
    movies['genres'].str.split('|').explode().value_counts()
    
    # Drama            20054
    # Comedy           13067
    # Thriller          7565
    # Romance           6662
    # Action            6542
    

Apply and Map

  • apply - runs a function on every value of the series
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
    titanic['age']
    # 0  29.0000
    # 1   0.9167
    # 2   2.0000
    # 3  30.0000
    # 4  25.0000
      
    titanic['age'].apply(lambda x: (x, x * 365))
    # 0               (29.0, 10585.0)
    # 1  (0.9167, 334.59549999999996)
    # 2                  (2.0, 730.0)
    # 3               (30.0, 10950.0)
    # 4                (25.0, 9125.0)
    
  • in case our function requires arguments, we can pass them as so -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    
    titanic['fare']
    
    # 0  211.3375
    # 1  151.5500
    # 2  151.5500
    
    def currency_converter(amount, denomination, multiplier):
        return f'{denomination}{amount * multiplier}'
      
    titanic['fare'].apply(currency_converter, args=('$', 23))
    
    # 0  $4860.7625
    # 1    $3485.65
    # 2    $3485.65
    
  • till now, we saw apply for series. when using apply on a dataframe, it will call the function for all columns by default. so, if we return back one value per column, we get back a series, where the labels are column names
    1
    2
    3
    4
    5
    
    titanic[['age', 'fare', 'pclass']].apply(lambda col: col.max() - col.min())
    
    # age        79.8333
    # fare      512.3292
    # pclass      2.0000
    
  • we can change it to be called for all rows instead. use case - we have a complex calculation that involves multiple columns of the row. e.g. we have two columns, representing (number of siblings and spouses) and (number of parents and children) respectively. we can get the family size by adding the two. we need to pass in the axis argument, which is index by default
    1
    
    titanic['relatives'] = titanic.apply(lambda row: row['sibsp'] + row['parch'], axis='columns')
    
  • note - doing titanic['relatives'] = titanic['sibsp'] + titanic['parch'] would also have worked in this case
  • map (for series) - we pass it a dictionary, and it will replace any values matching the key of the dictionary with the value for that key
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    
    titanic['pclass']
      
    # 0       1
    # 1       1
    #         ...
    # 1307    3
    # 1308    3
      
    titanic['pclass'].map({ 1: '1st', 2: '2nd', 3: '3rd' })
      
    # 0       1st
    # 1       1st
    #         ... 
    # 1307    3rd
    # 1308    3rd
    
  • we can also pass a function to map, and map and apply will work in the same way in this case
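  • a tiny (made-up) example of passing a function to map -
    titanic['pclass'].map(lambda p: f'class {p}')

    # 0       class 1
    # 1       class 1
    #         ...
    # 1307    class 3
    # 1308    class 3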
  • when we use map on dataframes, the function is run on every cell of the dataframe. recall how apply was only run along one of the axes - so, the function was passed either the entire row or the entire column. (note - on dataframes, map was called applymap in pandas versions before 2.1)
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
    titanic[['name', 'home.dest']]
    
    #    name                            home.dest
    # 0  Allen, Miss. Elisabeth Walton   St Louis, MO
    # 1  Allison, Master. Hudson Trevor  Montreal, PQ / Chesterville, ON
    # 2  Allison, Miss. Helen Loraine    Montreal, PQ / Chesterville, ON
    
    titanic[['name', 'home.dest']].map(lambda str: str.capitalize())
    
    #    name                            home.dest
    # 0  Allen, miss. elisabeth walton   St louis, mo
    # 1  Allison, master. hudson trevor  Montreal, pq / chesterville, on
    # 2  Allison, miss. helen loraine    Montreal, pq / chesterville, on
    

Combining Dataframes

Concat

  • concat - concatenate series / dataframes
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    
    import pandas as pd
    
    s1 = pd.Series(['a', 'b', 'c'])
    s2 = pd.Series(['d', 'e', 'f'])
    
    pd.concat([s1, s2])
    # 0    a
    # 1    b
    # 2    c
    # 0    d
    # 1    e
    # 2    f
    
  • we can set ignore index to true if our index was not semantic. notice the difference in the index values above and below
    1
    2
    3
    4
    5
    6
    7
    
    pd.concat([s1, s2], ignore_index=True)
    # 0    a
    # 1    b
    # 2    c
    # 3    d
    # 4    e
    # 5    f
    
  • we can concatenate side by side (along the columns axis) as follows -
    1
    2
    3
    4
    5
    
    pd.concat([s1, s2], axis='columns')
    #   0 1
    # 0 a d
    # 1 b e
    # 2 c f
    
  • however, this is not just putting side by side - it is actually using the index values to join. e.g. -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    
    food = pd.Series(
        data=['avocado', 'blueberry', 'cucumber'],
        index=['a', 'b', 'c']
    )
      
    animals = pd.Series(
        data=['dolphin', 'bear', 'chameleon'],
        index=['d', 'b', 'c']
    )
      
    pd.concat([food, animals], axis='columns')
    
    #    0          1
    # a  avocado    NaN
    # b  blueberry  bear
    # c  cucumber   chameleon
    # d  NaN        dolphin
    
  • notice the column names would be numeric by default. we can change that using the keys keyword argument
    1
    2
    3
    4
    5
    6
    7
    
    pd.concat([food, animals], axis='columns', keys=['khana', 'janwar'])
    
    #    khana      janwar
    # a  avocado    NaN
    # b  blueberry  bear
    # c  cucumber   chameleon
    # d  NaN        dolphin
    
  • note - we saw NaN earlier, because the join is outer by default. we can set it to inner as well
    1
    2
    3
    4
    5
    
    pd.concat([food, animals], axis='columns', join='inner')
    
    #    0          1
    # b  blueberry  bear
    # c  cucumber   chameleon
    
  • till now, we were combining series. now, we combine dataframes. assume we have the data below -
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    
    harvest_21 = pd.DataFrame(
        [['potatoes', 9001], ['garlic', 1350], ['onions', 87511]],
        columns=['crop', 'qty']
    )
    #    crop      qty
    # 0  potatoes  9001
    # 1  garlic    1350
    # 2  onions    87511
    
    harvest_22 = pd.DataFrame(
        [[1600, 'garlic'], [560, 'spinach'], [999, 'turnips'], [1000, 'onions']],
        columns=['qty', 'crop']
    )
    #    qty   crop
    # 0  1600  garlic
    # 1  560   spinach
    # 2  999   turnips
    # 3  1000  onions
    
  • when we try to concatenate the two dataframes, we get the below. note - even though the ordering of columns for the two dataframes were different, pandas combines them using the column names
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    pd.concat([harvest_21, harvest_22])
    #    crop      qty
    # 0  potatoes  9001
    # 1  garlic    1350
    # 2  onions    87511
    # 0  garlic    1600
    # 1  spinach   560
    # 2  turnips   999
    # 3  onions    1000
    
  • assume we have another dataframe with an extra column -
    1
    2
    3
    4
    5
    6
    7
    8
    
    harvest_23 = pd.DataFrame(
        [['potatoes', 900, 500], ['garlic', 1350, 1200], ['onions', 875, 950]],
        columns=['crop', 'qty', 'profit']
    )
    #    crop      qty   profit
    # 0  potatoes  900   500
    # 1  garlic    1350  1200
    # 2  onions    875   950
    
  • if we now try concatenating two dataframes whose columns differ, we get NaN for the missing columns
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    pd.concat([harvest_22, harvest_23])
    #    qty   crop      profit
    # 0  1600  garlic    NaN
    # 1  560   spinach   NaN
    # 2  999   turnips   NaN
    # 3  1000  onions    NaN
    # 0  900   potatoes  500.0
    # 1  1350  garlic    1200.0
    # 2  875   onions    950.0
    
  • to change this behavior, we can specify inner for the join type
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    pd.concat([harvest_22, harvest_23], join='inner')
    #    qty   crop
    # 0  1600  garlic
    # 1  560   spinach
    # 2  999   turnips
    # 3  1000  onions
    # 0  900   potatoes
    # 1  1350  garlic
    # 2  875   onions
    
  • the ignore_index parameter behaves in the same way as already discussed for series
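    for example, a minimal sketch reusing the harvest dataframes from above (the output is what i would expect) -
    
    pd.concat([harvest_21, harvest_22], ignore_index=True)
    #    crop      qty
    # 0  potatoes  9001
    # 1  garlic    1350
    # 2  onions    87511
    # 3  garlic    1600
    # 4  spinach   560
    # 5  turnips   999
    # 6  onions    1000
    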
  • we can also set up hierarchical indexing using the keys parameter - e.g. it is typical to analyze files for different years simultaneously, and we might want to encode this information in the form of a hierarchical index for the dataframe
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    
    pd.concat([harvest_21, harvest_22, harvest_23], join='inner', keys=[2021, 2022, 2023])
    #          crop      qty
    # 2021  0  potatoes  9001
    #       1  garlic    1350
    #       2  onions    87511
    # 2022  0  garlic    1600
    #       1  spinach   560
    #       2  turnips   999
    #       3  onions    1000
    # 2023  0  potatoes  900
    #       1  garlic    1350
    #       2  onions    875
    
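  • one benefit of the hierarchical index - we can pull out a single year with loc. a minimal sketch, assuming we store the combined result in a variable (the name harvests is mine) -
    
    harvests = pd.concat([harvest_21, harvest_22, harvest_23], join='inner', keys=[2021, 2022, 2023])
    harvests.loc[2022]
    #    crop     qty
    # 0  garlic   1600
    # 1  spinach  560
    # 2  turnips  999
    # 3  onions   1000
    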

Merge

  • it is closer to a database style join and is more flexible than concat, since we can combine using columns instead of relying on the index
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    
    teams = pd.DataFrame(
        [
            ["Suns", "Phoenix", 20, 4], 
            ["Mavericks", "Dallas", 11, 12], 
            ["Rockets", "Houston", 7, 16],
            ["Nuggets", "Denver", 11, 12]
        ], 
        columns=["team", "city", "wins", "losses"]
    )
    #    team       city     wins  losses
    # 0  Suns       Phoenix  20    4
    # 1  Mavericks  Dallas   11    12
    # 2  Rockets    Houston  7     16
    # 3  Nuggets    Denver   11    12
      
    cities = pd.DataFrame(
        [
            ["Houston", "Texas", 2310000], 
            ["Phoenix", "Arizona", 1630000], 
            ["San Diego", "California", 1410000],
            ["Dallas", "Texas", 1310000]
        ],
        columns=["city", "state", "population"]
    )
    #    city       state       population
    # 0  Houston    Texas       2310000
    # 1  Phoenix    Arizona     1630000
    # 2  San Diego  California  1410000
    # 3  Dallas     Texas       1310000
    
  • now, if we perform a merge, an inner join is performed using the common column name automatically -
    1
    2
    3
    4
    5
    
    teams.merge(cities)
    #    team       city     wins  losses  state    population
    # 0  Suns       Phoenix  20    4       Arizona  1630000
    # 1  Mavericks  Dallas   11    12      Texas    1310000
    # 2  Rockets    Houston  7     16      Texas    2310000
    
  • we can set the how parameter for join type. as we saw, it is inner by default, but we can set it to outer, left, right, etc
    1
    2
    3
    4
    5
    6
    7
    
    teams.merge(cities, how='left')
    
    #    team       city     wins  losses  state    population
    # 0  Suns       Phoenix  20    4       Arizona  1630000.0
    # 1  Mavericks  Dallas   11    12      Texas    1310000.0
    # 2  Rockets    Houston  7     16      Texas    2310000.0
    # 3  Nuggets    Denver   11    12      NaN      NaN
    
  • a cross join is also available (how='cross') - it pairs every row of one dataframe with every row of the other
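    a minimal sketch with the teams and cities dataframes from above - 4 rows × 4 rows should give 16 combinations, and the shared city column should get the usual _x / _y suffixes since it is not used as a join key -
    
    teams.merge(cities, how='cross')
    # 16 rows, with columns team, city_x, wins, losses, city_y, state, population
    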
  • by default, the common column name was picked up automatically. we can, however, specify the column(s) explicitly using the on keyword argument
    1
    
    teams.merge(cities, on='city')
    
  • note - we can specify multiple columns for the on parameter as well based on use case
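    a minimal sketch of what that might look like - the sales and stores dataframes and their columns here are hypothetical -
    
    # hypothetical dataframes that both have a city and a state column
    sales.merge(stores, on=['city', 'state'])
    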
  • what if the two dataframes share column names that are not being used for joining? pandas suffixes them with _x and _y by default. e.g. below, the name column is used for the join, so it appears only once. however, the score column is not, and therefore appears twice with suffixes
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    
    midterm = pd.DataFrame(
        [['shameek', 42], ['colt', 45]],
        columns=['name', 'score']
    )
    
    final = pd.DataFrame(
        [['shameek', 85], ['colt', 97]],
        columns=['name', 'score']
    )
    
    midterm.merge(final, on='name')
    
    #     name    score_x  score_y
    # 0  shameek  42       85
    # 1  colt     45       97
    
  • we can however, specify the suffixes to append -
    1
    2
    3
    4
    5
    
    midterm.merge(final, on='name', suffixes=['_midterm', '_final'])
    
    #    name     score_midterm  score_final
    # 0  shameek  42             85
    # 1  colt     45             97
    
  • also note how we had to specify on explicitly - otherwise both name and score would be used for the join, and since no row has the same name and score in both tables, we would end up with an empty result
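    a minimal sketch of that failure mode (the output is what i would expect from pandas) -
    
    midterm.merge(final)
    # Empty DataFrame
    # Columns: [name, score]
    # Index: []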

Seaborn

Relational Plots

  • uses matplotlib underneath, and works well with pandas
  • typically imported as sns
    1
    
    import seaborn as sns
    
  • to play around with seaborn, we can use any of its built in example datasets via load_dataset - it returns a pandas dataframe
    1
    2
    3
    4
    5
    6
    7
    
    tips = sns.load_dataset('tips')
    tips
    
    #    total_bill  tip   sex     smoker  day  time    size
    # 0  16.99       1.01  Female  No      Sun  Dinner  2
    # 1  10.34       1.66  Male    No      Sun  Dinner  3
    # 2  21.01       3.50  Male    No      Sun  Dinner  3
    
  • note - to apply seaborn's default theme, which looks nicer, run the following
    1
    
    sns.set_theme()
    
  • for a scatterplot, we can do the following -
    1
    
    sns.scatterplot(tips, x='total_bill', y='tip')
    

  • note - the exact same result could have been achieved without seaborn as well -
    1
    
    tips.plot(kind='scatter', x='total_bill', y='tip')
    
  • but now, look how we can simply pass hue to get differently colored scatter plots on the same axes -
    1
    
    sns.scatterplot(tips, x='total_bill', y='tip', hue='sex')
    

  • further, we can pass in style for different scatter plots based on marker on the same axes
    1
    
    sns.scatterplot(tips, x='total_bill', y='tip', hue='sex', style='smoker')
    

  • note - if we use the same column for hue and style, the marker and color both change, thus maybe improving readability
    1
    
    sns.scatterplot(tips, x='total_bill', y='tip', hue='sex', style='sex')
    

  • e.g. the tips dataset has a size column, which represents the number of people at the table. we can pass it via the size keyword argument, which scales the marker size
    1
    
    sns.scatterplot(tips, x='total_bill', y='tip', size='size')
    

  • assume we have a dataset for flights like so, i.e. 12 records per year, one for each month -
    1
    2
    3
    4
    5
    6
    7
    
    flights = sns.load_dataset('flights')
    flights
    
    #    year  month  passengers
    # 0  1949  Jan    112
    # 1  1949  Feb    118
    # 2  1949  Mar    132
    
  • e.g. we create a lineplot below, but we do not specify how it should aggregate the multiple records it gets per year. by default, it uses mean as the estimator
    1
    
    sns.lineplot(flights, x='year', y='passengers')
    

  • if we wanted to achieve this ourselves using matplotlib, we would have to group it and then use the aggregation function like below -
    1
    
    flights.groupby('year')['passengers'].mean().plot()
    

  • estimators are pandas aggregation functions. we can provide a different estimator, e.g. sum, like so -
    1
    
    sns.lineplot(flights, x='year', y='passengers', estimator='sum')
    

  • note how seaborn also adds a confidence interval to the plot. we can control its width, method, etc using the errorbar parameter. setting it to None removes it completely
    1
    
    sns.lineplot(flights, x='year', y='passengers', estimator='sum', errorbar=None)
    
  • my understanding - seaborn has two kinds of plots - figure level plots and axes level plots. the ones we saw above - lineplot and scatterplot - are axes level plots. their corresponding figure level plot is relplot, i.e. relational plot
    1
    2
    3
    4
    5
    
    # initial
    sns.scatterplot(data=tips, x='total_bill', y='tip')
    
    # using relplot
    sns.relplot(data=tips, x='total_bill', y='tip', kind='scatter')
    
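  • similarly, lineplot has a figure level counterpart - relplot with kind='line'. a minimal sketch using the flights dataset from above -
    
    sns.relplot(data=flights, x='year', y='passengers', kind='line')
    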
  • but now, we can easily put different subplots / different axes on the same figure
  • e.g. assume we would like to have different columns for the different values of sex
    1
    
    sns.relplot(data=tips, x='total_bill', y='tip', col='sex')
    

  • a more involved example. break into -
    • columns using sex
    • rows using time - lunch or dinner
    • different colors for smokers and non smokers
    1
    
    sns.relplot(data=tips, x='total_bill', y='tip', row='time', col='sex', hue='smoker')
    

  • controlling figure size for axes level plots - we make the figure call first
    1
    2
    
    plt.figure(figsize=(4, 3))
    sns.scatterplot(data=tips, x='total_bill', y='tip')
    
  • controlling figure size for figure level plots - relplot creates a figure for us bts, so we cannot call plt.figure() ourselves. instead, we control the size of each facet i.e. subplot using height and aspect (width = aspect × height)
    1
    
    sns.relplot(data=tips, x='total_bill', y='tip', row='time', col='sex', hue='smoker', height=3, aspect=2)
    

Distribution Plots

  • relation plots - relation between two things x and y
  • distribution plots - distribution of data, e.g. histogram
  • histogram example - assume we try to visualize the tips dataset -
    1
    
    sns.histplot(data=tips, x='tip')
    

  • if we use hue, by default the histograms are layered one on top of another. the opacity is set so that they are see through -
    1
    
    sns.histplot(data=tips, x='tip', hue='smoker')
    

  • we can configure it to be stacked instead of appearing one on top of another
    1
    
    sns.histplot(data=tips, x='tip', hue='smoker', multiple='stack')
    

  • we can also set multiple to dodge, so that the bars appear side by side. note how i also configure bins in this case
    1
    
    sns.histplot(data=tips, x='tip', hue='smoker', multiple='dodge', bins=5)
    

  • finally, we can add the kde curve to the histogram plot as well by setting kde to true
    1
    
    sns.histplot(data=tips, x='tip', hue='smoker', kde=True)
    

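  • note - the penguin examples below assume the dataset has been loaded the same way as tips -
    
    penguins = sns.load_dataset('penguins')
    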
  • above, we overlaid the kde curve on top of the histogram. however, we can draw a standalone kde curve as well. below, we visualize the body mass of the different penguin species simultaneously
    1
    
    sns.kdeplot(data=penguins, x='body_mass_g', hue='species')
    

  • finally, we can also control the smoothness of the curve by adjusting the bandwidth
    1
    
    sns.kdeplot(data=penguins, x='body_mass_g', hue='species', bw_adjust=0.4)
    

  • histograms / kde plots are also called univariate distribution plots i.e. we only look at the distribution of a single feature
  • we can look at bivariate distribution plots as well i.e. analyze two features at once, both on x and y axis
  • kde bivariate distribution plots - drawn as smooth, contour like curves (like the contours of a hollow, i believe)
    1
    
    sns.kdeplot(data=penguins, x='bill_length_mm', y='flipper_length_mm', hue='species')
    

  • histogram bivariate distribution plots - try looking for the concentrated coloring (like a heat map)
    1
    
    sns.histplot(data=penguins, x='bill_length_mm', y='flipper_length_mm', hue='species')
    

  • rugplots - ticks along the x or y axis to show the presence of an observation
    1
    
    sns.rugplot(data=penguins, x='body_mass_g')
    

  • this is not very useful by itself - rugplots shine when combined with other plots. e.g. below, it is difficult to tell from the scatterplot alone where the majority of the values lie, so we supplement it with a rugplot (diamonds here is presumably the seaborn diamonds dataset, via load_dataset)
    1
    2
    
    sns.scatterplot(data=diamonds, x='carat', y='price', s=2)
    sns.rugplot(data=diamonds, x='carat', y='price', alpha=0.005)
    

  • we use displot as the figure level plot for distribution plots, no surprises here
    1
    
    sns.displot(data=penguins, kind='kde', x='body_mass_g', col='species', row='island', height=2, aspect=2, hue='sex')
    

Categorical Plots

  • count plot - displays counts. unlike histograms, which are typically used for numerical data, count plots are typically used for categorical / non numerical data
    1
    
    sns.countplot(data=penguins, x='species', hue='sex')
    

  • to achieve something similar when using matplotlib by itself, i did the following -
    1
    
    penguins[['species', 'sex']].value_counts().unstack('sex').plot(kind='bar')
    

  • issue - if we tried to make a scatterplot for categorical data, it would be hard to comment on the density, since points overlap -
    1
    
    sns.scatterplot(data=titanic, x='pclass', y='age')
    

  • solution 1 - we can use stripplot - it introduces a little bit of jitter to improve readability -
    1
    
    sns.stripplot(data=titanic, x='pclass', y='age')
    

  • solution 2 - we can use swarmplot - it ensures points are non overlapping to improve readability. my understanding - use this only for smaller / sampled datasets, otherwise achieving this can become difficult
    1
    2
    
    plt.figure(figsize=(10, 4))
    sns.swarmplot(data=titanic, x='pclass', y='age')
    

  • note how i had to adjust the figure size, otherwise i get the warning - UserWarning: 15.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
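  • my addition - one way around that warning is to plot a random sample of the data instead of every row -
    
    sns.swarmplot(data=titanic.sample(300, random_state=42), x='pclass', y='age')
    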
  • box plots - help visualize the distribution of data easily (often split by category). features -
    • q1 represents the 25th percentile
    • q3 represents the 75th percentile
    • the median value is plotted in between
    • the range between q1 and q3 is called the iqr or inter quartile range
    • the lines surrounding the box are called whiskers. they extend up to 1.5 × iqr beyond q1 and q3 by default (see the sketch after the plot below)
    • finally, points outside the whiskers are plotted as outliers
    1
    
    sns.boxplot(data=titanic, x='age')
    

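  • the numbers behind the box can be checked by hand - a minimal sketch, assuming the default 1.5 × iqr whisker rule -
    
    q1 = titanic['age'].quantile(0.25)
    q3 = titanic['age'].quantile(0.75)
    iqr = q3 - q1
    
    # whiskers reach the furthest observations within 1.5 * iqr of the box
    lower_whisker_limit = q1 - 1.5 * iqr
    upper_whisker_limit = q3 + 1.5 * iqr
    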
  • using boxplot for categorical data -
    1
    
    sns.boxplot(data=titanic, x='pclass', y='age', hue='sex')
    

  • combining boxplot and swarmplot. a small reminder from matplotlib - they go into the same figure and axes since we do not call plt.figure() in between
    1
    2
    
    sns.boxplot(data=penguins, y='body_mass_g', x='species')
    sns.swarmplot(data=penguins, y='body_mass_g', x='species', color='black')
    

  • violin plot - has the box plot at the center along with the kde curve. carefully look at the black line to see the median, inter quartile range and whiskers
    1
    
    sns.violinplot(data=titanic, x='pclass', y='age')
    

  • note - if we add a hue, it creates different violin plots side by side
    1
    
    sns.violinplot(data=titanic, x='pclass', y='age', hue='sex')
    

  • we can however, change this behavior by providing the split parameter
    1
    
    sns.violinplot(data=titanic, x='pclass', y='age', hue='sex', split=True)
    

  • bar plot - again, compare the difference from matplotlib, where there is no calculation - it just plots - while seaborn groups the data and applies an estimator, like we saw in line plots
    1
    
    sns.barplot(data=titanic, y='pclass', x='survived', hue='sex', estimator='sum', orient='h')
    

  • the black line is the error bar (confidence interval) that seaborn computes for the estimate, just like in line plots
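    a minimal sketch - the same errorbar parameter we used for line plots should turn it off here as well (newer seaborn versions) -
    
    sns.barplot(data=titanic, y='pclass', x='survived', hue='sex', estimator='sum', orient='h', errorbar=None)
    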
  • plotting the same thing using matplotlib -
    1
    
    titanic.groupby(['pclass', 'sex'])['survived'].sum().unstack().plot(kind='barh')
    

  • categorical plot (catplot) - the corresponding figure level plot. not covering it in detail since there is nothing new; a minimal sketch follows for completeness
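    
    sns.catplot(data=titanic, x='pclass', y='age', kind='box', col='sex')
    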
This post is licensed under CC BY 4.0 by the author.
