4. Lists and Arrays#
Follow along!
Remember that to best make use of this tutorial, it is highly recommended that you make your own notebook and type every piece of code yourself!
In the previous section, we discussed working with variables that contained a single string, integer, or float, and how to check their type and convert them from one type to another. While this may have been interesting, real modern data is “big”, so we need to be able to deal with collections of data and not just one number at a time. Thankfully, Python has a native list type for exactly this, and we’ll also introduce the extremely useful numpy array (numpy.ndarray) to handle many data points.
These data types are similar to those you just learned about except that now we have to worry about where our data is and not just what it is. For example, if you have loaded into a variable the air temperature of Evanston, Illinois for the last 100 years, you might want to do an analysis of just the summer highs over time, so how do you access just those data points? Alternately, you might be attempting to train a weather model, so you want to predict each time point based on the previous one in time, so how do you access elements in sequence? These are the general techniques we’ll explore in this section. The upcoming section on loops will teach you how to apply operations and functions iteratively and in sequences (as in our time-series hypothetical), and the section on conditional statements will give you the tools to select elements conditionally, so that you can only look at summer months, for example. However, before we can use those techniques, we need to familiarize ourselves with the basics of lists and arrays in Python.
Please open a new Jupyter notebook and follow along by copying all Python commands!
4.1. Lists#
The Python list
is one of the most flexible data structures in any programming language. Most simply, a list operates as an ordered container of other variables, regardless of their type. We’ll see shortly that Python lists can be appended to, iterated over, and edited. Their elements, or list entries, can be accessed with indices, and we’ll see later that we can easily apply functions to all elements of a list with list comprehensions. Simply put, if you only had Python lists to work with, you could probably get most computational tasks accomplished.
Most simply, we indicate a Python list using square brackets, []
. When entering a list manually, we separate elements with commas, as shown below.
list1 = ['abcd', 1, 3.1415, False]
print(list1)
print(type(list1))
['abcd', 1, 3.1415, False]
<class 'list'>
The above list contains a string, integer, float, and boolean data point in turn, but the variable list1
is type list
.
There isn’t any restriction on what can be put in lists, so we’re free to do things like:
crazy_list = ['abcd', [12, 4, 'no'], ['yes', [True, 'unique element']], 'False', False]
print(crazy_list)
empty_list = []
print(empty_list)
all_lists = [list1, crazy_list, empty_list]
print(all_lists)
['abcd', [12, 4, 'no'], ['yes', [True, 'unique element']], 'False', False]
[]
[['abcd', 1, 3.1415, False], ['abcd', [12, 4, 'no'], ['yes', [True, 'unique element']], 'False', False], []]
That is, we can have nested lists inside of lists!
4.1.1. Accessing List Elements#
Of course, creating lists is all great and good, but once we’ve collected data, we need to be able to pull it back out. To do this we use indices, which are just integers that indicate which element of the list we’d like to access, where the first element of a list is indicated by a 0, the second by a 1, the third by a 2, and so on. Specifically, we put these indices again into a square bracket notation as shown below:
animals = ['dogs', 'cats', 'birds']
first_animal = animals[0] ## This is the SYNTAX for the FIRST ELEMENT
second_animal = animals[1] ## This is the SYNTAX for the SECOND ELEMENT
print(animals)
print(first_animal)
print(second_animal)
['dogs', 'cats', 'birds']
dogs
cats
So we use the syntax list[index]
to grab the index+1
th element from the list. This applies regardless of what type the list element is, for example
third_crazy_element = crazy_list[2]
print(third_crazy_element)
second_third_crazy_element = third_crazy_element[1]
print(second_third_crazy_element)
second_second_third_crazy_element = second_third_crazy_element[1]
print(second_second_third_crazy_element)
other_element = crazy_list[2][1][1] ## What's happening here!?
print(other_element)
['yes', [True, 'unique element']]
[True, 'unique element']
unique element
unique element
Each of the elements was a list, so we could continue asking for indices. (If I’d added another layer you could have asked for the fifth element.) The last line, which defines other_element, shows how we can chain index requests together to reach into nested lists. This may not be obvious at first, but as we continue, we’ll see that Python commands can often be chained into one-liners like the one above.
There are then some caveats to this. In Python, you cannot ask for an element at an index that does not exist. Consider the following example:
tenth_animal = animals[9] ## Index 9 is asking for the TENTH element
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[5], line 1
----> 1 tenth_animal = animals[9] ## Index 9 is asking for the TENTH element
IndexError: list index out of range
This should result in an IndexError: list index out of range
because the index you asked for, 9, was beyond the range of this 3-element list. (There is no 10th element to return!)
Lists are often flexible and we don’t always know their length, so we can use the len function to return the number of elements in a list.
third_crazy_element = crazy_list[2]
print(f"The length of the third element is {len(third_crazy_element)}")
second_third_crazy_element = third_crazy_element[1]
print(f"The length of the second-third element is {len(second_third_crazy_element)}")
## Write down your prediction before running!
# second_second_third_crazy_element = second_third_crazy_element[1]
# print(f"The length of the second-second-third element is {len(second_second_third_crazy_element)}")
The length of the third element is 2
The length of the second-third element is 2
This also offers an easy way to access the last element in a list:
longer_list = ['a', 'b', 'c', 'd', 'e', 1, 2, 3]
len_list = len(longer_list)
## Guess which one will work before uncommenting!
# last_element = longer_list[len_list]
# last_element = longer_list[len_list - 1]
print(last_element)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[7], line 8
3 len_list = len(longer_list)
5 ## Guess which one will work before uncommenting!
6 # last_element = longer_list[len_list]
7 # last_element = longer_list[len_list - 1]
----> 8 print(last_element)
NameError: name 'last_element' is not defined
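With both candidate lines still commented out, last_element is never defined, which is why the cell above ends in a NameError; uncomment one of them to test your guess.
However, Python also supports negative indexing, where negative integers count from the end of the list toward the front.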
last_el = longer_list[-1]
second_to_last_el = longer_list[-2]
print(last_el)
print(second_to_last_el)
print(longer_list[-20]) ## Does this work!?
3
2
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[8], line 7
4 print(last_el)
5 print(second_to_last_el)
----> 7 print(longer_list[-20]) ## Does this work!?
IndexError: list index out of range
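As a quick check of how negative and positive indices line up, here is a minimal sketch (the list letters is made up for illustration). For a list of length n, index -k refers to the same element as index n - k, so the valid indices run from -n up to n - 1.

```python
letters = ['a', 'b', 'c', 'd', 'e']  # hypothetical example list
n = len(letters)

# Index -1 and index n - 1 both point at the last element
print(letters[-1], letters[n - 1])   # e e

# Index -n and index 0 both point at the first element
print(letters[-n], letters[0])       # a a

# Index -2 is the same element as index n - 2
print(letters[-2] == letters[n - 2]) # True
```

Anything outside the range -n to n - 1 raises the IndexError shown above.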
Most importantly, we can also access list elements via slicing: we can ask for a “slice” of the list using the list[start:end:increment] notation, which gives elements starting at index start, up to but not including end, in steps of size increment. For example, consider
print(longer_list[::2]) # This grabs every other element starting at 0
print(longer_list[1::2]) # This grabs every other element starting at 1
print(longer_list[0:6:3]) # Every third element starting at 0, ending at 6
['a', 'c', 'e', 2]
['b', 'd', 1, 3]
['a', 'd']
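A few more slicing patterns are worth trying. The sketch below uses a made-up list called digits and shows what happens when start or end is left out, when negative numbers are used, and when a slice runs past the end of the list.

```python
digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical example list

first_three = digits[:3]        # start omitted: begin at index 0
from_seven = digits[7:]         # end omitted: run to the end of the list
last_two = digits[-2:]          # negative indices work in slices too
reversed_digits = digits[::-1]  # a negative increment walks backwards
too_far = digits[8:100]         # a slice past the end is cut short, no IndexError

print(first_three)      # [0, 1, 2]
print(from_seven)       # [7, 8, 9]
print(last_two)         # [8, 9]
print(reversed_digits)  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(too_far)          # [8, 9]
```

Note the contrast with single-element indexing: digits[100] raises an IndexError, while the slice digits[8:100] quietly returns whatever elements exist.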
This might seem confusing, or really simple, but having a strong grasp of indexing is essential to coding in Python, so please take the time to complete the following exercises:
4.1.2. Exercises#
Make a list containing the integers 1 through 13 and the first 8 letters of the English alphabet. Save into new variables the:
    First 5 elements of the list
    The last 12 elements of the list
    The odd-indexed elements of the list
    Every other even-indexed element of the list
Put the names of yourself and three friends into a list called names and put your corresponding favorite colors into colors. Confirm that these lists have the same length using the len function and an appropriately formatted print statement. Define a variable idx = 0 and write a print statement that prints out you and your friends’ names and corresponding favorite colors by accessing names[idx] and colors[idx]. Copy and paste this printing code four times and insert idx = idx + 1 in between each print. Can you describe each of you and your friends’ favorite colors in turn?
Create several variables containing an integer, boolean, string, and list data type in turn. Print the result of applying the list function to them in turn.
Consider the following code and explain the output in the context of what you know about boolean variables.
print(bool([True]))
print(bool(['True']))
print(bool(['False']))
print(bool([False]))
print(bool([False, False, False]))
print(bool([]))
4.1.3. Modifying Lists#
Now that you can access list elements, it’s important to note that lists are mutable data types, meaning that they can be changed (mutated). That is, once we’ve defined a variable as a list, we can come back, after the fact, and edit the elements, size, and structure of the list. This is not always the case with data structures, so it’s worth pointing out. There are many ways to modify lists in Python, and I’ll point out the most useful here.
4.1.3.1. Modifying Elements#
Consider our earlier crazy_list
. We can always change any specific element by directing an assignment operation at it. For example:
print(crazy_list)
best_list_element = crazy_list[2]
print(best_list_element)
crazy_list[2] = 4
print(crazy_list)
print(crazy_list[2])
['abcd', [12, 4, 'no'], ['yes', [True, 'unique element']], 'False', False]
['yes', [True, 'unique element']]
['abcd', [12, 4, 'no'], 4, 'False', False]
4
We see that we’ve changed the 3rd element of crazy_list from ['yes', [True, 'unique element']] (which we saved as best_list_element) to 4.
4.1.3.2. Adding Elements#
Elements can be added to lists in several ways. First of all, we can concatenate lists with the +
operation:
some_animals = ['dog', 'cat', 'gerbil']
other_animals = ['iguana', 'flying squirrel', 'parakeet']
print(some_animals)
print(other_animals)
many_animals = some_animals + other_animals
print(many_animals)
['dog', 'cat', 'gerbil']
['iguana', 'flying squirrel', 'parakeet']
['dog', 'cat', 'gerbil', 'iguana', 'flying squirrel', 'parakeet']
But we can also use the list.append
and list.insert
methods to add elements to either the end or a specific index of a list, respectively.
many_animals.append('chinchilla')
print(many_animals)
many_animals.insert(2, 'turtle')
print(many_animals)
many_animals.insert(2, 'tortoise')
print(many_animals)
['dog', 'cat', 'gerbil', 'iguana', 'flying squirrel', 'parakeet', 'chinchilla']
['dog', 'cat', 'turtle', 'gerbil', 'iguana', 'flying squirrel', 'parakeet', 'chinchilla']
['dog', 'cat', 'tortoise', 'turtle', 'gerbil', 'iguana', 'flying squirrel', 'parakeet', 'chinchilla']
Additionally, we can multiply lists using the *
operator as with strings. Try print(4*many_animals)
and print(24*[True, 0])
and see what happens!
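One difference between these approaches is easy to miss: the + operation builds a new list and leaves the originals alone, while list.append changes the existing list in place and returns nothing useful. A minimal sketch, using made-up lists pets and more_pets:

```python
pets = ['dog', 'cat']          # hypothetical example lists
more_pets = ['gerbil']

combined = pets + more_pets    # + builds a NEW list
print(combined)                # ['dog', 'cat', 'gerbil']
print(pets)                    # ['dog', 'cat']  (unchanged)

pets.append('iguana')          # append modifies 'pets' itself
print(pets)                    # ['dog', 'cat', 'iguana']

result = pets.append('newt')   # append returns None, not the list
print(result)                  # None
print(pets)                    # ['dog', 'cat', 'iguana', 'newt']
```

So writing pets = pets.append('newt') would replace your list with None, which is a common early mistake.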
4.1.3.3. Removing Elements#
Similarly, we can remove elements from either the end or a specific index of a list using list.pop
.
last_animal = many_animals.pop()
print(last_animal)
print(many_animals)
middle_animal = many_animals.pop(3)
print(middle_animal)
print(many_animals)
chinchilla
['dog', 'cat', 'tortoise', 'turtle', 'gerbil', 'iguana', 'flying squirrel', 'parakeet']
turtle
['dog', 'cat', 'tortoise', 'gerbil', 'iguana', 'flying squirrel', 'parakeet']
Notice that the list.pop method returns the removed value, so we can save it into a variable. This can be useful if you need to iterate through a list and use each element for something one time.
You can also use the list.remove method to remove elements by value: it removes the first instance of the value you give it.
many_animals.insert(2, 'turtle')
many_animals.insert(-1, 'turtle')
many_animals.insert(0, 'turtle')
print(many_animals)
many_animals.remove('turtle')
print(many_animals)
many_animals.remove('turtle')
print(many_animals)
many_animals.remove('turtle')
print(many_animals)
['turtle', 'dog', 'cat', 'turtle', 'tortoise', 'gerbil', 'iguana', 'flying squirrel', 'turtle', 'parakeet']
['dog', 'cat', 'turtle', 'tortoise', 'gerbil', 'iguana', 'flying squirrel', 'turtle', 'parakeet']
['dog', 'cat', 'tortoise', 'gerbil', 'iguana', 'flying squirrel', 'turtle', 'parakeet']
['dog', 'cat', 'tortoise', 'gerbil', 'iguana', 'flying squirrel', 'parakeet']
4.1.3.4. Copying Lists#
In previous examples we have done things like x = y
to copy values to multiple variables with impunity. Unfortunately we can’t continue this with lists. To see why, consider the following:
print(many_animals)
copied_list = many_animals
copied_list[4] = 'tiger'
print(copied_list)
print(many_animals) ## This list has a tiger in it now!
['dog', 'cat', 'tortoise', 'gerbil', 'iguana', 'flying squirrel', 'parakeet']
['dog', 'cat', 'tortoise', 'gerbil', 'tiger', 'flying squirrel', 'parakeet']
['dog', 'cat', 'tortoise', 'gerbil', 'tiger', 'flying squirrel', 'parakeet']
We can see that copied_list wasn’t really a copy, because modifying it also changed the original list, many_animals. This is because in Python, variables don’t actually contain the data we’ve assigned to them; instead, they point to that data in the computer’s memory. So when we type copied_list = many_animals, we’re just telling Python that both copied_list and many_animals should point to the same data in memory.
This may seem like an odd point to make, but this is an insidious problem to encounter because it will raise no error messages, yet data that you don’t expect to change might be modified if you don’t copy lists correctly. The easiest way to truly create a new copy of a list is to use the copy = old[:] syntax, where the [:] simply says “create a slice of everything,” which Python dutifully places in an actually new list (a new part of your computer’s memory is allocated).
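To make the difference concrete, here is a small sketch with a made-up list called original. It compares plain assignment with the [:] slice and with the built-in list.copy method, using the is operator to ask whether two names point at the very same list. It also shows one caveat: both kinds of copy are "shallow", so a list nested inside is still shared. The standard library's copy.deepcopy is one way around that.

```python
import copy

original = ['dog', 'cat', ['nested', 'list']]  # hypothetical example list

alias = original                 # a second name for the SAME list
slice_copy = original[:]         # a new list with the same elements
method_copy = original.copy()    # list.copy() does the same job as [:]

print(alias is original)         # True
print(slice_copy is original)    # False
print(slice_copy == original)    # True (same contents, different list)

slice_copy[0] = 'tiger'
print(original[0])               # dog (the original is untouched)

# Both copies are "shallow": the inner list is still shared
slice_copy[2][0] = 'changed'
print(original[2])               # ['changed', 'list']

# copy.deepcopy also copies the nested lists
deep_copy = copy.deepcopy(original)
deep_copy[2][0] = 'independent'
print(original[2])               # ['changed', 'list']
```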
4.1.4. Other List Information#
There are several other useful things you can do with lists that we’ll discuss below. You should not worry about memorizing these, just try to keep in mind that these sort of things can be done with lists.
4.1.4.1. The list
Function#
Just as we saw with other data types, we can attempt to convert variables into a list with the list function. For some data types, this will simply add a pair of square brackets around your data, but for other data types that you’ll eventually learn about, such as arrays, this type conversion can be more significant. In particular, Python thinks of lists as collections of stuff, so for data types that can’t really be thought of that way, like a single integer or float, list will throw a TypeError, as you’ll see below. Try out the code below to see how this works.
my_str = "the quick brown fox"
my_int = 6543
my_float = 65.43
my_bool = True or False
print(list(my_str)) ## What happened here!?
## What errors do these lines create?
# print(list(my_int))
# print(list(my_float))
# print(list(my_bool))
## What about this?
# print([my_int])
# print(type([my_int]))
# print(list([my_int]))
# print(type(list([my_int])))
['t', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x']
4.1.4.2. Lists have Length#
We’ve already mentioned this, but it can be very useful to know how many elements are in your lists. This is done via the len function, like this:
my_list = ['a', 1234, 'list_element', True, True, 4+6]
list_len = len(my_list)
print(f"There are {list_len} elements in 'my_list'")
There are 6 elements in 'my_list'
We can see that the len
function returns the number of elements in a list. This may seem simple, but it can be very useful in many contexts.
We’ll see that this output is always a non-negative integer, where we can get a length of 0 by asking for the length of an empty list like so:
empty_list = [] ## Square brackets with nothing in between!
print("This is an empty list:", empty_list)
zero_length = len(empty_list)
print(f"The length of the empty list is {zero_length}")
This is an empty list: []
The length of the empty list is 0
We’ll use empty lists frequently in the section on loops to initialize code that will be repeated many times, using the list.append method, for example, to fill the list with useful calculations.
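As a preview of that pattern, here is a minimal sketch that starts from an empty list and fills it one calculation at a time. The name squares and the numbers are made up for illustration; a loop would later replace the repeated lines.

```python
squares = []             # start with an empty list
print(len(squares))      # 0

squares.append(1 * 1)    # each append adds one result to the end
squares.append(2 * 2)
squares.append(3 * 3)

print(squares)           # [1, 4, 9]
print(len(squares))      # 3
```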
4.1.4.3. The Range Function#
We’ll talk in a moment about a bunch of useful list functions, but the range
function is too useful to not highlight specifically. The range
function can be used to make collections of integers to cover a specific range (hence the name). We can use the list
type conversion function to return a list of these numbers. Try out the example below and use help(range)
to read the documentation.
start, stop, step = 0, 10, 2
my_range1 = range(start, stop, step)
print(my_range1)
print(type(my_range1)) ## What type does 'my_range1' appear to be?
## Apply the list() function:
my_range1 = list(my_range1)
print(my_range1)
print(type(my_range1))
## If we don't supply a step size, it increments by 1
print(list(range(start, stop)))
## If we don't supply a starting point, it assumes 0
print(list(range(stop)))
## We can use negative step sizes
print(list(range(stop, start, -step)))
## Is 'stop'=10 in this list?
print(list(range(10)))
## What happens if we try to use a float?
# print(list(range(10.1)))
range(0, 10, 2)
<class 'range'>
[0, 2, 4, 6, 8]
<class 'list'>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[10, 8, 6, 4, 2]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Note that range only works for integers and that the stop
input is excluded from the range.
This function will again be extremely useful in the section on loops, but having a way to get a regularly spaced range of numbers is something that could be handy in a variety of circumstances.
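You don't always need to convert a range into a list before using it. The sketch below (with a made-up variable evens) shows that a range object already supports len, indexing, and the in check, and that an "impossible" range is simply empty rather than an error.

```python
evens = range(0, 10, 2)   # represents 0, 2, 4, 6, 8

print(len(evens))         # 5
print(evens[1])           # 2
print(evens[-1])          # 8
print(6 in evens)         # True
print(10 in evens)        # False (the stop value is excluded)

# Counting up from 5 to 0 is impossible with a positive step, so this is empty
print(list(range(5, 0)))  # []
```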
4.1.4.4. Useful List Functions#
As a final note, I just want to highlight a few really useful list functions.
If we have a list of all “numbers”, then the sum
, max
, and min
functions can be very useful!
num_list = list([3, 6.3, 8, 12.52, 4, 4, 3, True, 3, 0.1])
list_max = max(num_list)
list_min = min(num_list)
list_sum = sum(num_list)
print(f"The max of 'num_list' is {list_max}")
print(f"The min of 'num_list' is {list_min}")
print(f"The sum of 'num_list' is {list_sum}")
The max of 'num_list' is 12.52
The min of 'num_list' is 0.1
The sum of 'num_list' is 44.92
We can sort lists using the sorted
function or the list.sort
method. The difference between these is that sorted
returns a copy of the list that is now sorted, while list.sort
will sort the list in-place. This is somewhat important, because if the order of your original list is important, then you don’t want to mess that up with a list.sort
.
orig_names = ['Eric', 'Madhav', 'Bill', 'Caroline', 'Tiffany']
names_to_sort = orig_names[:]
sorted_names = sorted(names_to_sort)
print(names_to_sort)
print(sorted_names)
names_to_sort.sort()
print(names_to_sort)
## We can reverse the sort order with the "reverse" keyword:
sorted_names = sorted(names_to_sort, reverse=True)
names_to_sort.sort(reverse=True)
print(sorted_names)
print(names_to_sort)
['Eric', 'Madhav', 'Bill', 'Caroline', 'Tiffany']
['Bill', 'Caroline', 'Eric', 'Madhav', 'Tiffany']
['Bill', 'Caroline', 'Eric', 'Madhav', 'Tiffany']
['Tiffany', 'Madhav', 'Eric', 'Caroline', 'Bill']
['Tiffany', 'Madhav', 'Eric', 'Caroline', 'Bill']
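Two details about sorting tend to surprise people, so here is a short sketch using a made-up list of names. First, strings are compared character by character, with uppercase letters ordered before lowercase ones; the optional key argument lets you sort by a transformed value instead (here str.lower, to ignore case). Second, list.sort returns None, so its result should not be assigned to a variable.

```python
names = ['eric', 'Madhav', 'Bill']   # hypothetical example list

by_default = sorted(names)                # uppercase letters sort before lowercase
by_lower = sorted(names, key=str.lower)   # compare the lowercased names instead

print(by_default)   # ['Bill', 'Madhav', 'eric']
print(by_lower)     # ['Bill', 'eric', 'Madhav']

result = names.sort()   # sorts 'names' in place and returns None
print(result)           # None
print(names)            # ['Bill', 'Madhav', 'eric']
```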
We can count the number of times a certain element appears in a list with the list.count
method. Using the num_list
from the earlier example, we can see:
num_3 = num_list.count(3)
num_4 = num_list.count(4)
num_8 = num_list.count(8)
print(f"There are {num_3} 3s in 'num_list'")
print(f"There are {num_4} 4s in 'num_list'")
print(f"There are {num_8} 8s in 'num_list'")
There are 3 3s in 'num_list'
There are 2 4s in 'num_list'
There are 1 8s in 'num_list'
Finally, we can check if an element exists in a list using the in
operation. If an element exists in the list, we can find the index of the first instance of it using the list.index
function.
print(3 in num_list)
print(5 in num_list)
idx_3 = num_list.index(3)
idx_5 = num_list.index(5) ## What is the error message here?
print(f"There is a 3 in 'num_list' at index {idx_3}")
True
False
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[24], line 5
2 print(5 in num_list)
4 idx_3 = num_list.index(3)
----> 5 idx_5 = num_list.index(5) ## What is the error message here?
7 print(f"There is a 3 in 'num_list' at index {idx_3}")
ValueError: 5 is not in list
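One way to avoid this ValueError is to check with in before calling list.index. The sketch below wraps that check in a small helper; the function name find_index is made up for this example, and it uses an if statement and a function definition, both of which are covered properly in later sections.

```python
num_list = [3, 6.3, 8, 4, 4, 3]   # shortened, hypothetical version of the list above

def find_index(values, target):
    """Return the index of the first match, or None if target is absent."""
    if target in values:
        return values.index(target)
    return None

print(find_index(num_list, 4))   # 3
print(find_index(num_list, 5))   # None
```

Returning None for a missing value is just one possible design; depending on the task, letting the ValueError happen may be the clearer signal that something is wrong.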
If this is your first time working with lists, don’t worry about trying to memorize everything you’ve seen here. Coding is a skill and you will remember things that you use more frequently and look up things that you don’t. I have been coding in Python for years and I constantly look up functions or syntaxes. These notes on lists have thus been given to you as a reference and to show you the things that you could do with a Python list, which you should try and remember. That is, you should know that there’s a function for sorting lists, but it’s ok if you don’t remember the exact syntax off the top of your head. For a good reference on list methods, see python.org.
4.1.5. Exercises#
Create a nested list using your names list from exercise 4.1.2 by placing names in between two square brackets. Save this list to a variable names_colors. What is the length of this list? What is the length of names_colors[0]?
Undo the extra bracketing with names_colors = names_colors[0]. Set idx=0 and use idx as an index to replace each element of names_colors with a list [names[idx], colors[idx]] (where colors is the same list from 4.1.2). You should now have a list of lists, where each sub-list has a name and a color.
You’ve had a falling out and need to remove the last friend from names_colors. However, at the same time, you’ve made two new friends! Add one of them (and their favorite color) right after your name+color using list.insert. Add the other friend to the end of names_colors using list.append.
Use sorted to sort names_colors. How did it sort the list? Execute the following code and try to explain what happened differently.
sorted_friends = sorted(names_colors, key = lambda x: x[1])
Create a list containing 10 numbers of your choice. Use the sum and len functions to calculate the average of the list elements. Print this result to the screen in a formatted string.
Let’s practice using the range function:
    Create a list of all the integers from -12 to 8.
    Create a list containing the first 10,000 even integers.
    Create a list of the first 5 positive integers that are divisible by 7 as list_of_sevens. Use idx = list(range(len(list_of_sevens))) to make a list of indices for the elements in list_of_sevens. Use ii = idx.pop() and list_of_sevens[ii] several times to print out the elements in reverse order in a formatted string.
Consider the problem here. Save the sample DNA as a string and use string formatting to print the number of A’s, C’s, G’s and T’s to the screen.
Convert a string containing a DNA sequence into the corresponding RNA, as indicated here. This process involves replacing all T’s in the DNA with U’s. As a hint, consider using list.index or str.replace (replace is a string method; lists do not have one).
4.2. Numpy Arrays#
Hopefully the first part of this section has convinced you that Python lists are useful and flexible tools for holding multiple data points. However, Python lists are sometimes too flexible: in allowing for the collection of arbitrary data, they lose some potentially useful functionality. In particular, applying an operation to each element in a list requires that we access each list element individually (which you will still do in the section on loops), which can quickly become very slow! To fill the need for a slightly more structured collection of data, the numpy package was developed as “The fundamental package for scientific computing in Python.” Going forward as a data scientist, familiarity with numpy will be essential.
Here we’ll introduce you to the numpy.ndarray
, which is the data structure we’re looking for. We’ll see that this structure, informally known as a numpy array, trades some of the list’s flexibility for functionality and speed. But before we talk about working with numpy, let’s spend a moment talking about Python packages.
4.2.1. Importing Packages in Python#
In the introduction, I noted that Python has a large, active online community of developers. What this means is that packages (also called modules) of Python code are constantly being created and disseminated for all sorts of applications. This is facilitated by the Python programming language via the powerful import operation that lets a programmer “import” the contents of any other package on their computer into their own Python code. That is, if you write a couple of tools for doing some useful data analysis, you can send them to me (or put them on the internet!) and I can use import to give my code access to your tools. This may sound somewhat complicated or abstract, and it can be, which is why environments like Anaconda can be especially useful.
In particular, advanced projects may involve building on dozens of other people’s code in many packages. It could quickly become tedious to download and manage new packages by hand. To assist with this, Python installs generally include the pip tool, and the Anaconda Navigator has the environments tab. As a bonus, Anaconda also includes common packages, such as numpy and matplotlib, pre-installed. To install a new package using Anaconda, you can search for the package in the environments tab. If you are not using Anaconda, follow the instructions here; typically the important command will be pip install my_package. Again, this is mostly for your reference later on; anyone using Anaconda will not have to worry about the next step.
The important information on importing packages is then how to use the import
functionality in your code. Typically what this looks like is the line import numpy as np
near the top of your code. In a Jupyter notebook, put this in the first cell in the notebook and run it, so that all cells have access to the package. What this command tells Python is that you are importing the numpy package, and you’re calling it np
, so that numpy functions can be accessed as np.function_I_want
.
There are several options for the import statement. For example, you can import all functions without the nickname using from numpy import *, where * is a wildcard character, so that Python reads “from the numpy package, import everything matching *”; since * matches everything, this imports all of the numpy functions. This is not usually recommended, because nothing guarantees that a package won’t have a function with the same name as something you already have. Suppose, hypothetically, that numpy had its own print function: after the wildcard import, the name print would refer to numpy’s version and silently replace the built-in one. Requiring a prefix avoids this, so there’s no ambiguity between np.print and print. Also, a plain import numpy imports the numpy functions so that they can be accessed with numpy.function; the as np just makes things easier to type.
You can also import specific functions or sub-modules from a package using the from package import func1, func2
syntax. This is useful when you know you’re only going to be using a few things from the package.
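The import styles described above can be compared side by side. The sketch below uses Python's built-in math module as a stand-in, only because it is guaranteed to be installed; the same syntax applies to numpy.

```python
import math                  # access functions as math.sqrt, math.pi, ...
import math as m             # same module under a shorter nickname
from math import sqrt, pi    # bring only selected names in directly

print(math.sqrt(16))   # 4.0
print(m.sqrt(16))      # 4.0
print(sqrt(16))        # 4.0

# All three routes lead to the same module and the same function
print(m is math)           # True
print(sqrt is math.sqrt)   # True
```

The prefixed forms (math.sqrt, m.sqrt) make it obvious where a function came from, which is the main argument for import numpy as np over a wildcard import.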
TLDR: importing packages
Going forward, we’ll often use import numpy as np
to tell Python we want to access numpy functions. Throughout the rest of the tutorial, if you see np.function
, it is assuming that you have imported numpy at the beginning of your code.
4.2.2. Numpy Arrays#
As promised, we will now introduce the numpy.ndarray
, or the numpy (N-dimensional) array. Most simply, an array operates like a matrix that you may know from math class, or like an Excel spreadsheet. Elements are laid in a grid so that each element can be identified by its row and column. The caveat is that arrays can only contain one data type, unlike lists.
To make an array, we can apply the np.array
function to a list. We can also use the np.empty
function to make an “empty” array of a given size or the np.zeros
and np.ones
functions to make arrays of zeros or ones, respectively.
import numpy as np ## Needed once per notebook before any np.function call
num_list = list([3, 6.3, 8, 12.52, 4, 4, 3, True, 3, 0.1])
num_arr = np.array(num_list)
print(num_arr)
print(type(num_arr))
print(num_arr.dtype)
n_rows, n_cols = 5, 3
empty_arr = np.empty((n_rows, n_cols)) ## SYNTAX ALERT NOTE THE DOUBLE
## PARENTHESES
zero_arr = np.zeros((n_rows, n_cols))
one_arr = np.ones((n_rows, n_cols))
print(f"\nArrays of size {n_rows} x {n_cols}")
print("\nAn empty array:")
print(empty_arr)
print("\nA zero array:")
print(zero_arr)
print("\nA ones array:")
print(one_arr)
In these examples, you can see that converting a list of mixed types (num_list
has ints, floats, and booleans) coerced everything into one type in num_arr
(floats, in this case). You can check the datatype using the numpy.ndarray.dtype
attribute. We haven’t talked about attributes yet, but you can think of them as properties of a piece of data.
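To see this coercion on a smaller scale, here is a minimal sketch with three made-up lists. It assumes numpy is installed; the exact integer and string type names that numpy reports can vary between platforms and versions, so the comments describe them only loosely.

```python
import numpy as np

int_arr = np.array([1, 2, 3])             # all integers stay integers
float_arr = np.array([1, 2.5, True])      # int, float, bool -> all floats
str_arr = np.array([1, 2.5, 'three'])     # anything mixed with a string -> all strings

print(int_arr.dtype)     # an integer type (int64 on many systems)
print(float_arr.dtype)   # float64
print(str_arr.dtype)     # a fixed-width text type whose name starts with <U

print(float_arr)         # True has become 1.0
```

Numpy picks the most general type that can hold every element, which is why a single string in a list of numbers turns the whole array into text.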
You should also see that the np.empty
, np.zeros
, and np.ones
functions all made 5 by 3 arrays. The important note is that the input to these functions is a tuple of the dimension sizes. Tuples are another data type that we won’t spend time on that are indicated like a list, with elements separated by commas, but using parentheses, ()
, instead of square brackets.
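Since tuples show up whenever we describe an array's size, here is a brief sketch of how they behave, using a made-up tuple called shape. They are indexed like lists, but unlike lists they cannot be modified after creation.

```python
shape = (5, 3)            # hypothetical tuple of dimension sizes

print(type(shape))        # <class 'tuple'>
print(len(shape))         # 2
print(shape[0])           # 5
print(shape[-1])          # 3

n_rows, n_cols = shape    # a tuple can be "unpacked" into separate variables
print(n_rows, n_cols)     # 5 3

single = (5,)             # a one-element tuple needs the trailing comma
not_a_tuple = (5)         # without the comma this is just the integer 5
print(type(single))       # <class 'tuple'>
print(type(not_a_tuple))  # <class 'int'>

## Tuples are immutable, so this line would raise a TypeError:
# shape[0] = 10
```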
As a final note, you can see that empty_arr is definitely not empty! This is because np.empty just tells Python to set aside some space in memory for the array data, without writing anything into it. Since your computer generally doesn’t actually “delete” things by setting all the bits to zero, when you ask Python what’s in the empty array, it looks at the bits that have been allocated and interprets them according to dtype (float in this case). So this part of memory could have previously held some file I’d written, but when interpreted as an array of floats, it looks like some weird numbers. As a result, this function, while ostensibly faster than np.zeros or np.ones, can be more dangerous because it’s harder to tell real data from uninitialized elements.
4.2.3. Array Size, Dimension, and Type#
Keeping track of your arrays is essential in more complicated programs. To do this, it is often useful to check that the arrays you are creating are the correct shape, dimension, and type. To check these, you can use the shape
, ndim
, and dtype
attributes.
n_rows, n_cols = 5, 3
zero_arr = np.zeros((n_rows, n_cols))
arr_shape = zero_arr.shape
arr_ndim = zero_arr.ndim
arr_datatype = zero_arr.dtype
print(f"The shape of 'zero_arr' is {arr_shape}")
print(f"The number of dimensions of 'zero_arr' is {arr_ndim}")
print(f"The type of data in 'zero_arr' is {arr_datatype}")
The shape of 'zero_arr' is (5, 3)
The number of dimensions of 'zero_arr' is 2
The type of data in 'zero_arr' is float64
Note that the shape
attribute returns a tuple, so that we can use it to create similarly sized arrays. We can access tuple elements using normal list indexing (tuple[index]
) and we can use the numpy.empty_like
, numpy.zeros_like
, numpy.ones_like
functions to create empty, zero-filled, and one-filled arrays of the same size as an input array, respectively.
more_cols = np.zeros((arr_shape[0], 10))
print(more_cols)
fewer_rows = np.ones((2, arr_shape[1]))
print(fewer_rows)
same_but_ones = np.ones_like(zero_arr)
print(same_but_ones)
4.2.4. Accessing Array Elements#
As noted just above, accessing tuple elements uses the same syntax as for lists: my_tuple[index_I_want]
. The same syntax holds for arrays, except now we have multiple dimensions whose indices we need to specify. The syntax for array indexing is my_arr[index1, index2, index3]
where each subsequent index corresponds to each subsequent dimension of the array: the first index indicates the row, the second the column, the third the “sheet”, etc. We can use the slicing operations above on any dimension, using different slices in different dimensions to grab different sections of the array.
# 2 * 28 strings = 56 elements, which fit a 7 x 8 array (8 x 8 would need 64)
big_arr = np.array(2 * (list('abcdefghijklmnopqrstuvwxyz') + ['aa', 'bb'])).reshape(7, 8)
print(big_arr)
print(f"The shape of 'big_arr' is {big_arr.shape}")
print(f"The number of dimensions of 'big_arr' is {big_arr.ndim}")
print(f"The type of data in 'big_arr' is {big_arr.dtype}\n")
first_row = big_arr[0]
first_col = big_arr[:, 0]
print("The first row is:", first_row)
print("The first column is:", first_col)
subset1 = big_arr[1:3, 2:6]
subset2 = big_arr[-3:, :5]
print("\n'subset1' is:")
print(subset1)
print(f"The shape of 'subset1' is {subset1.shape}")
print("'subset2' is:")
print(subset2)
print(f"The shape of 'subset2' is {subset2.shape}")
We can also access array elements via masking, which we’ll talk more about in the section on conditional statements, and by providing lists or arrays of indices directly. For example, if we want to access the first and third rows of an array directly, we can use my_arr[[0, 2]]
.
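Here is a small sketch of both ideas, assuming a 4x5 array of the integers 0 to 19 (my_arr and mask are illustrative names):

```python
import numpy as np

my_arr = np.arange(20).reshape(4, 5)   # rows hold 0-4, 5-9, 10-14, 15-19

# A list of indices selects whole rows, in the order given.
first_and_third = my_arr[[0, 2]]
print(first_and_third)

# A boolean mask (same shape as the array) keeps the elements where it is True.
mask = my_arr > 15
print(mask.shape)
print(my_arr[mask])                    # a 1-D array of the selected values
```

Note that indexing with a mask always returns a one-dimensional array, because the selected elements generally don't form a rectangle.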
4.2.5. Useful Array Functions#
As you may have seen in the previous example, we can use the numpy.reshape
function to change the shape of an array, as long as the total number of elements stays the same. So if we use the range
function to make a sequence of numbers, we can do np.array(range(20)).reshape(4, 5)
to make a 4x5 array.
reshaped_arr = np.array(range(20)).reshape(4, 5)
print(reshaped_arr)
Perhaps the most useful thing that we can do with arrays is vectorize our code by applying operations element-wise. For example,
added_arr = reshaped_arr + 3.4
minus_arr = added_arr - reshaped_arr
mult_arr = (4 * reshaped_arr / 5.345)**2.3
print(added_arr, "\n")
print(minus_arr, "\n")
print(mult_arr)
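To see what vectorizing buys us, here is a short comparison with a plain list (values and arr are illustrative names). With a list we need a loop or a list comprehension; with an array the same arithmetic expression is applied to every element:

```python
import numpy as np

values = list(range(5))
arr = np.array(values)

# With a list, an element-wise operation needs an explicit loop or comprehension.
list_result = [2 * v + 1 for v in values]

# With an array, the same expression is applied to every element at once.
arr_result = 2 * arr + 1

print(list_result)
print(arr_result)
```

Be careful with the list version: 2 * values on a plain list repeats the list rather than doubling each entry.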
If we have a row or column array whose length matches the corresponding dimension of another array, numpy will repeat it along the other dimension so that the operation is applied row-wise or column-wise. This is known as broadcasting.
row_arr = np.arange(5)
col_arr = np.arange(4).reshape(-1, 1) ## WHAT DOES THE -1 DO HERE?
## (check the shape!)
## (try removing the reshape)
print("\n'reshaped_arr':")
print(reshaped_arr)
print("\n'row_arr':")
print(row_arr)
print("\n'col_arr':")
print(col_arr)
row_added = reshaped_arr + row_arr
col_added = reshaped_arr + col_arr
print("\nAdding a row array:")
print(row_added)
print("\nAdding a column array:")
print(col_added)
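If the shapes are hard to keep straight, a quick check like the following sketch can help (grid, row, and col are illustrative names). It prints the shape of each result and shows what happens when the shapes do not line up:

```python
import numpy as np

grid = np.zeros((4, 5))
row = np.arange(5)                  # shape (5,)
col = np.arange(4).reshape(4, 1)    # shape (4, 1)

print((grid + row).shape)           # (4, 5): row is repeated down the rows
print((grid + col).shape)           # (4, 5): col is repeated across the columns

# A flat array of length 4 does not line up with the last dimension (length 5).
try:
    grid + np.arange(4)
    failed = False
except ValueError as err:
    failed = True
    print("Broadcasting failed:", err)
```

numpy compares shapes starting from the last dimension, so a flat array is always treated as a row.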
Generally, arrays have pre-set sizes, but the numpy.concatenate
, numpy.hstack
, and numpy.vstack
functions can be used to combine arrays (if their shapes match!). By default, the numpy.concatenate
function joins arrays along the first dimension (axis=0) unless you pass a different axis, numpy.hstack
joins arrays side by side (adding columns), and numpy.vstack
stacks them on top of each other (adding rows).
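A quick shape check, assuming two small 2x3 arrays named a and b for illustration, shows which dimension each function joins along:

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((2, 3))

joined_default = np.concatenate([a, b])        # axis=0 unless told otherwise
joined_axis1 = np.concatenate([a, b], axis=1)  # join along the columns instead
v_stacked = np.vstack([a, b])                  # b goes underneath a
h_stacked = np.hstack([a, b])                  # b goes to the right of a

print(joined_default.shape)   # (4, 3)
print(joined_axis1.shape)     # (2, 6)
print(v_stacked.shape)        # (4, 3)
print(h_stacked.shape)        # (2, 6)
```

All three functions take a list (or tuple) of arrays as a single argument, not the arrays as separate arguments.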
Alongside the numpy.ndarray.reshape
method, the numpy.ndarray.squeeze
method “squeezes” the array by removing any dimensions of length 1.
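For instance, in this small sketch (the array names are illustrative), squeezing a column array gives back a flat array:

```python
import numpy as np

col = np.arange(4).reshape(4, 1)   # a column array with shape (4, 1)
flat = col.squeeze()               # the length-1 dimension is dropped

print(col.shape)                   # (4, 1)
print(flat.shape)                  # (4,)

nested = np.zeros((1, 3, 1))
print(nested.squeeze().shape)      # (3,)
```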
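We can still use the built-in sum
, min
, and max
on one-dimensional arrays, but now we can also use the numpy versions numpy.sum
, numpy.min
, and numpy.max
, which work on arrays of any dimension and let us specify the dimension (or axis) along which to perform the operation. The example below uses numpy.mean, which takes the same axis argument: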
arr_mean = np.mean(reshaped_arr)
row_mean = np.mean(reshaped_arr, axis=1)
col_mean = np.mean(reshaped_arr, axis=0)
print(f"The mean of 'reshaped_arr' is {arr_mean}")
print(f"The row-wise mean of 'reshaped_arr' is {row_mean}")
print(f"The column-wise mean of 'reshaped_arr' is {col_mean}")
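The same axis argument works for numpy.sum, numpy.min, and numpy.max. As a small sketch, assuming a 4x5 array of the integers 0 to 19 (arr is an illustrative name):

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

total = np.sum(arr)              # no axis: sum over every element (190)
col_totals = np.sum(arr, axis=0) # one total per column: 30, 34, 38, 42, 46
row_max = np.max(arr, axis=1)    # one maximum per row: 4, 9, 14, 19
col_min = np.min(arr, axis=0)    # one minimum per column: 0, 1, 2, 3, 4

print(total)
print(col_totals)
print(row_max)
print(col_min)
```

A helpful way to remember the convention: the axis you name is the one that disappears from the shape of the result.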
The numpy package also has a host of functions for scientific and numeric purposes. The mathematical functions in particular are especially useful, including numpy.mean
, numpy.exp
, numpy.log
, and numpy.round
. You can use numpy.info
to access detailed numpy documentation. If you ever want a basic math operation in Python, Google “python numpy my_operation” first!
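As a brief illustration of a few of these functions (the input values are made up for the example), each one is applied element-wise to the whole array:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])

round_trip = np.log(np.exp(x))                    # recovers x up to floating-point error
rounded = np.round(np.array([1.234, 5.678]), 1)   # round to 1 decimal place
average = np.mean(x)

print(round_trip)
print(rounded)
print(average)
```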
4.2.6. Exercises#
1. The numpy analog to range is numpy.arange. Use it to make an array of integers from -50 to 50. Divide by 25 to make an array of 101 equally spaced numbers from -2 to 2.
2. Use numpy.info to look up how the numpy.linspace function works. Use it to make an array of 101 numbers between -2 and 2 named my_grid.
3. Create an array of indices for my_grid (using arange??). Add this array of indices to my_grid.
4. Create a new array omitting the last element using slicing. Name this array my_grid2. What is len(my_grid2)? Reshape my_grid2 into a 10x10 array named my_arr. What is len(my_arr)? How does it seem that len works? Try reshaping into a 25x4 array or a 4x25 array as a hint!
5. Create a new array y_values from my_grid3 = np.linspace(-2, 2, 100) using the function \(y = \frac{6}{5}x^{-2} + x - 3.4\). (\(y\) is y_values and my_grid3 is \(x\).) What are the maximum and minimum values of y_values? Use the numpy.argmin and numpy.argmax functions to find the index locations of these maximum and minimum values, and save them as max_idx and min_idx. Use max_idx and min_idx to show the values of my_grid3 at which the min and max occur!
6. Apply numpy.unique to the list num_list from earlier in the section. What does it seem to do?