ICS 32 Winter 2022, Project #3: Try Not to Breathe

Background

We saw in the previous project that our Python programs are capable of connecting to the "outside world" around them — to other programs running on the same machine, or even to other programs running on different machines in faraway places. This is a powerful thing for a program to be able to do, because it is no longer limited to taking its input from a user or from a file stored locally; its input is now potentially anything that's accessible via the Internet, making it possible to solve a vast array of new problems and process a much broader collection of information. Once you have the ability to connect your programs to others, a whole new world opens up. Suddenly, the idea that you should be able to write a program that combines, say, Google search queries, the Internet Movie Database, and your favorite social network to find people who like movies similar to the ones you like doesn't seem so far-fetched.

But we also saw that getting programs to share information is tricky, for (at least) two reasons. Firstly, there's a software engineering problem: A protocol has to be designed that both programs can use to have their conversation. Secondly, there's a social problem: If the same person (or group of people) isn't writing both programs, it's necessary for them to agree on the protocol ahead of time, then to implement it. This second problem has a potentially catastrophic effect on our ability to make things work — how could you ever convince Google to agree to use your protocol just to communicate with you?

In practice, both of these problems are largely solved by the presence of standards, such as those defined by the World Wide Web Consortium and the Internet Engineering Task Force. Standards help by providing detailed communication protocols whose details have already been hammered out, with the intention of handling the most common set of needs that will arise in programs. This eliminates the need to design one's own protocol (where the standard protocols will suffice, which is more often than you might think) and allows programs to be combined in arbitrary ways; as long as they support the protocol, they've taken a big step toward being able to interoperate with each other. What's more, standard protocols often have standard implementations, so that you won't have to code up the details yourself as you did in the previous project. For example, Python has built-in support for a number of standard Internet protocols, including HTTP (HyperText Transfer Protocol, the protocol that your browser uses to download web pages) among others.

At first blush, HTTP doesn't seem all that important. It appears to be a protocol that will allow you to write programs that download web pages (i.e., that allow you to write programs that play the same role that web browsers do). But it turns out that HTTP is a lot more important than that, since it is the protocol that underlies a much wider variety of traffic on the Internet than you might first imagine. This is not limited only to the conversation that your browser has with a web server in order to download a web page, though that conversation most often uses HTTP (or its more secure variant, HTTPS). HTTP also underlies a growing variety of program-to-program communications using web protocols, where web sites or other software systems communicate directly with what are broadly called web services, fetching data and also making changes to it. This is why you can post tweets to Twitter using either their web site, a client application on your laptop, or a smartphone app; all of these applications use the same protocol to communicate with the Twitter service, differing only in the form of user interface they provide.

Fortunately, since HTTP support is built directly into Python, we can write programs that use these web services without having to handle low-level details of the protocol, though there are some details that you'll need to be familiar with if you want to use the provided implementation effectively. We'll be discussing some of these details in lecture soon, and these will be accompanied by a code example, which will give you some background in the tools you'll need to solve these kinds of problems in Python.

This project gives you the opportunity to explore a small part of the vast sea of possibilities presented by web APIs and web services. You'll likely find that you spend a fair amount of your time in this project understanding the web API you'll need — being able to navigate technical documentation and gradually build an understanding of another system is a vital skill in building real software — and that the amount of code you need might not be as much as you expect when you first read the project write-up. As always, work incrementally rather than trying to work on the entire project all at once; there is partial credit available for a partial solution, as long as the portions that you've finished are stable and correct. When you're done, you'll have taken a valuable step toward being able to build Python programs that interact with web services, which opens up your ability to write programs for yourself that are real and useful.

Additionally, you'll get what might be your first experience with writing classes in Python, which will broaden your ability to write clean, expressive Python programs, a topic we'll continue revisiting and refining throughout the rest of this course. Along with that, you'll learn about why it can be a powerful technique to write multiple, similar classes in a way that leaves them intentionally identical in at least one aspect of how they behave.

The problem

Perhaps particularly for people with chronic respiratory problems, but certainly for everyone, the quality of the air we breathe can have a dramatic impact on our short- and long-term health. When I was a kid growing up in southern California in the 1980s, there were days when I went to school but none of us was permitted to play outside during recess due to what, in those days, were called "smog alerts," which tuned me in to the idea, from an early age, that air quality matters. The gray-brown skies of my youth are mostly a thing of the past, but, nonetheless, there are some days when you really want to avoid breathing the outside air as much as possible. The tricky part is knowing which days they are, because you often can't look out the window and see definitively what the quality of the air is; much of what makes the air problematic is invisible, more so than when I was a child.

Nowadays, the Internet provides a valuable resource to help us to monitor and manage the impact of air quality. In your work on this project, you'll write a program that can answer a question similar to the following: Where are some places where the air quality is unhealthy within 30 miles of where I am now?

To do that, though, we'll need some information that we won't have at our fingertips; it's not our ambition to build an air quality sensor and drive around in a 30-mile area looking for an unhealthy reading. But thanks to the ubiquitous Internet of today, we'll be able to obtain and use (free of charge) information that will allow us to answer a question like this without ever leaving the house. What we'll need are two things:

A collection of air quality sensors that can provide us with up-to-the-minute data about air quality all over the United States (and, to a lesser extent, the rest of the world). We don't need the sensors, of course, but we need their output.
A geocoding service that can tell us things like "Where is Bren Hall in Irvine, California?" or "What's the street address at this latitude and longitude?"

Given the ability to obtain answers to those kinds of questions and use them as input to our program, the rest of the problem is reduced to interpreting that input appropriately and performing the right calculations on it.

Because we're building a program in a problem domain that's new to us, though, we'll need to know some things about it. We don't need to become experts in air quality measurement or the intricacies of geographic algorithms and mapping, but we need to know enough about those things to be able to build what we seek to build. When we build programs, we're in the automation business, but we have to know something about what we're automating, even if we don't have to know everything.

How is air quality measured?

In the United States, the usual technique for reporting on air quality is called the Air Quality Index (AQI), so we'll use that same technique. A basic explanation of AQI is available below:

AQI Basics

Reading through that document, you'll see that a standardized scale is used to describe risk levels — 175 is in a range considered to be unhealthy, for example — but it turns out that the same scale is used to describe the risks posed by different pollutants: ozone, carbon monoxide, particulates of various sizes, and so on. But, of course, the risk posed by those pollutants is different — how much ozone is too much doesn't necessarily correspond to how much carbon monoxide is too much — so what we really need to know are two things:

For the pollutant whose risk we're assessing, we first need a measurement of the concentration of that pollutant (i.e., how much of that pollutant is in the air).
Given that concentration, we need a formula for translating it to an AQI value that adequately communicates its risk.

The concentrations of different pollutants are measured differently. The risk posed by them is different, too, so the formula for translating concentrations to AQI is also different for each.

Of course, we aren't building a sensor to measure the concentration of pollutants in the air, so we'll need an online source for that data. But that source won't give us the AQI value, so it'll be up to us to determine it ourselves.

Determining the AQI value

We'll be considering only one pollutant, which is commonly referred to as PM2.5, which is a shorthand for "particulates smaller than 2.5 microns" (i.e., smaller than 2.5 millionths of a meter). Sensors generally report concentrations of PM2.5 in µg/m³ (micrograms per cubic meter). So how do we convert that concentration to an AQI value? We do so by following this procedure:

If the concentration is between...	Then...
0.0 ≤ µg/m³ < 12.1	0.0 µg/m³ is an AQI of 0, and 12.0 µg/m³ is an AQI of 50. Every other value in this range is proportional to those (e.g., 6.0 is halfway between 0.0 and 12.0, so the AQI would be halfway between 0 and 50, i.e., 25).
12.1 ≤ µg/m³ < 35.5	12.1 µg/m³ is an AQI of 51, and 35.4 µg/m³ is an AQI of 100. Every other value in this range is proportional to those (e.g., 23.75 is halfway between 12.1 and 35.4, so the AQI would be halfway between 51 and 100, i.e., 75.5, which rounds up to 76).
35.5 ≤ µg/m³ < 55.5	35.5 µg/m³ is an AQI of 101, and 55.4 µg/m³ is an AQI of 150. Every other value in this range is proportional to those (e.g., 45.45 is halfway between 35.5 and 55.4, so the AQI would be halfway between 101 and 150, i.e., 125.5, which rounds up to 126).
55.5 ≤ µg/m³ < 150.5	55.5 µg/m³ is an AQI of 151, and 150.4 µg/m³ is an AQI of 200. Every other value in this range is proportional to those (e.g., 102.95 is halfway between 55.5 and 150.4, so the AQI would be halfway between 151 and 200, i.e., 175.5, which rounds up to 176).
150.5 ≤ µg/m³ < 250.5	150.5 µg/m³ is an AQI of 201, and 250.4 µg/m³ is an AQI of 300. Every other value in this range is proportional to those (e.g., 200.45 is halfway between 150.5 and 250.4, so the AQI would be halfway between 201 and 300, i.e., 250.5, which rounds up to 251).
250.5 ≤ µg/m³ < 350.5	250.5 µg/m³ is an AQI of 301, and 350.4 µg/m³ is an AQI of 400. Every other value in this range is proportional to those (e.g., 300.45 is halfway between 250.5 and 350.4, so the AQI would be halfway between 301 and 400, i.e., 350.5, which rounds up to 351).
350.5 ≤ µg/m³ < 500.5	350.5 µg/m³ is an AQI of 401, and 500.4 µg/m³ is an AQI of 500. Every other value in this range is proportional to those (e.g., 425.45 is halfway between 350.5 and 500.4, so the AQI would be halfway between 401 and 500, i.e., 450.5, which rounds up to 451).
at or above 500.5	The AQI reading is "off the charts" (i.e., the highest meaningful reading is 500), so we'll report it as 501.

The thing to notice is that the formula changes slightly as we move up the scale, but each row in the formula works the same way: Each uses a technique generally called linear interpolation, which basically means "Given the value at each endpoint, assume that the rest of the values are represented by a straight line in between." In that case, some relatively straightforward algebra will get us where we need to go.

Note, too, that AQI is always reported as an integer value, and that we always round to the nearest integer (i.e., if the formula above yields 24.7, the AQI would be reported as 25).
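To make the table and the rounding rule concrete, here is a minimal sketch of one way the conversion might be written. The function name pm25_to_aqi and the table layout are illustrative choices, not part of the project requirements, and rejecting negative concentrations is an assumption, since the write-up doesn't say what to do with them. One detail worth noticing: Python's built-in round rounds exact halves to the nearest even number (round(250.5) gives 250), and ordinary float arithmetic can land a hair below a half, so this sketch does the interpolation with the decimal module and rounds halves up, which matches the examples in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# One entry per row of the table: (lowest concentration, highest listed
# concentration, AQI at the lowest, AQI at the highest, exclusive upper bound).
_PM25_ROWS = [
    ("0.0", "12.0", 0, 50, "12.1"),
    ("12.1", "35.4", 51, 100, "35.5"),
    ("35.5", "55.4", 101, 150, "55.5"),
    ("55.5", "150.4", 151, 200, "150.5"),
    ("150.5", "250.4", 201, 300, "250.5"),
    ("250.5", "350.4", 301, 400, "350.5"),
    ("350.5", "500.4", 401, 500, "500.5"),
]


def pm25_to_aqi(concentration: float) -> int:
    """Convert a PM2.5 concentration in µg/m³ to an integer AQI value."""
    c = Decimal(str(concentration))
    if c < 0:
        # Assumption: a negative concentration is treated as invalid input.
        raise ValueError("concentration cannot be negative")

    for low, high, aqi_low, aqi_high, upper in _PM25_ROWS:
        if c < Decimal(upper):
            low, high = Decimal(low), Decimal(high)
            # Linear interpolation between the two endpoints of this row.
            aqi = aqi_low + (c - low) * (aqi_high - aqi_low) / (high - low)
            return int(aqi.quantize(Decimal("1"), rounding=ROUND_HALF_UP))

    # At or above 500.5, the reading is "off the charts."
    return 501
```

For example, pm25_to_aqi(6.0) gives 25 and pm25_to_aqi(23.75) gives 76, agreeing with the first two rows of the table.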

Latitudes, longitudes, and geocoding

Before you get too much farther, if you don't know how the latitude and longitude system works, take a look at the link below. (Don't feel bad if you don't, but you do need to understand this in order to solve this problem!)

Wikipedia: Geographic coordinate system (Latitude and longitude)

In particular, note the limits on allowable latitudes and longitudes, as well as the difference between North and South latitude and between West and East longitude. And note, too, that latitude and longitude, generally, don't work the same way, so once you've understood one, you'll still need to be sure you've wrapped your mind around the other. There aren't a lot of details, but if you haven't thought about them in a while (or if you've never seen them before), it's worth taking a few minutes to get your understanding sorted out before continuing.

What is geocoding?

The word geocoding sounds like some kind of programming technique, but it's actually something else: It's a process for converting the descriptions of places on the Earth into their locations and back again. In other words, it allows us to answer questions such as these:

What is the latitude and longitude where Bren Hall in Irvine, California is located?
What is located at latitude 33.674381°N and longitude 117.865975°W?

The first of those questions is what we'd call forward geocoding (i.e., taking the description of a location and turning it into geographic coordinates). The second is what we'd instead call reverse geocoding (i.e., taking geographic coordinates and describing what's there).

Of course, answering questions like these requires an enormous amount of data that we don't have, so it won't be up to us to determine these answers; instead, we'll obtain them online as we need them.

Determining the distances between two locations

One of the fundamental operations your program needs is to be able to determine the distance between two locations on Earth. Before you can do that, though, we first need to agree on what is meant by "distance." The Earth is (more or less) spherical and a particular location (i.e., a latitude and longitude) specifies a point somewhere on its surface. When we consider the distance between two such locations, there are two ways to think about it:

A straight line traveling through the interior of the sphere, with the two locations as the endpoints of the line. We might call this the straight-line distance between the locations.
The shortest arc that travels along the surface of the sphere that has the two locations as the endpoints of the arc. The length of such an arc is called the great-circle distance between the two locations.

As is often the case, there's a tension between what's easier to implement and what's actually required. The straight-line distance would presumably be easier to calculate, but if our goal is to calculate distances that people might travel, it's a misleading answer, because it assumes that people travel from one location on Earth to another by boring a hole in the Earth! The great-circle distance makes a lot more sense when we consider the distances between locations on Earth, because people would tend to travel either along the Earth's surface (e.g., by walking, bicycling, or riding in a car) or roughly parallel to it (e.g., in an airplane).

So, when calculating the distance between two locations, your goal is to calculate the great-circle distance between them. Of course, we'll need a formula to do it, and, as luck would have it, there's a relatively simple formula that's plenty precise for our needs.

The equirectangular approximation

Given two points on the surface of the Earth expressed in terms of latitudes and longitudes, we can calculate the distance between them by using a formula we might call the equirectangular approximation. It's an approximation because it's based on a slightly imprecise "rounding off" of reality, in which we imagine that if you laid the entire Earth's surface out flat, latitudes would be horizontal lines equally spaced from each other and longitudes would be vertical lines equally spaced from each other. Then we imagine that flat surface "wrapped back around" a sphere. While this isn't quite accurate, the approximation is not far from reality, particularly in the context of the shorter distances that we'll be interested in here, so we'll use this simpler model, since it also leads to a simple and fast formula for calculating distances.

Given that, how do we calculate our distances? Some mild algebra and trigonometry (since we're dealing with spheres and angles) is all we need.

let dlat be the difference in the latitudes of the two points, in radians
let dlon be the difference in the longitudes of the two points, in radians
let alat be the average of the two latitudes, in radians
let R be the radius of the Earth, in miles (3958.8)
let x = dlon * cos(alat)
let d = sqrt(x² + dlat²) * R

After going through those steps, d will be a reasonably close approximation of the distance between the two points, expressed in miles.
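The steps above translate almost line for line into Python. What follows is a sketch rather than a required design: the function name and parameter order are made up for illustration, and it assumes coordinates arrive as signed degrees (north and east positive, south and west negative). It also makes no attempt to handle pairs of points on opposite sides of the 180° meridian, which shouldn't matter at the short ranges we're interested in.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def equirectangular_distance(lat1: float, lon1: float,
                             lat2: float, lon2: float) -> float:
    """Approximate the distance in miles between two points given in degrees."""
    dlat = math.radians(lat2 - lat1)        # difference in latitudes
    dlon = math.radians(lon2 - lon1)        # difference in longitudes
    alat = math.radians((lat1 + lat2) / 2)  # average of the two latitudes
    x = dlon * math.cos(alat)
    return math.sqrt(x * x + dlat * dlat) * EARTH_RADIUS_MILES
```

As a sanity check, two points one degree of latitude apart on the same longitude come out to a little over 69 miles, which is the radius of the Earth multiplied by one degree expressed in radians.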

Where will we get our data?

While we'll be implementing some calculations of our own, the most meaningful input to our program will need to be obtained online, which raises the question of where we're going to get the information and how we're going to make sense out of it.

Air quality data from PurpleAir's API

PurpleAir is a company that sells Internet-aware air quality monitoring devices. Many of those devices are configured to be connected to the Internet, in which case they send their data back to PurpleAir, with some owners sharing that data publicly; it's that public data that we'll be using in this project.

PurpleAir actually provides two separate APIs containing its sensor data, one that's called the "legacy" API (i.e., it's been around longer) and another that's called the "experimental" API (i.e., it's newer, but its output is shorter and simpler). Of these, we'll depend on the experimental API.

Downloading the experimental API data for all of PurpleAir's public sensors is a simple matter of visiting the following URL:

https://www.purpleair.com/data.json

It's not a bad idea to save a copy of this file in the same directory as your program's code. It will vary over time, but you'll need a stable copy that you can test with, so you don't have to download this huge amount of data every time you run your program as you build it — something that PurpleAir ultimately won't allow (see the section titled Limitations below).

Let's take a look at what some of the data looks like, as of this writing. Looking at the overall format, we can recognize it as the JSON format we saw when we learned about Web APIs. Its basic arrangement appears to be the following:

All in all, what we got back was one large JSON object.
Its first field is called version, which presumably indicates the current version of the API.
Its second field is called fields, whose value is a list of strings.
Its third field is called data, whose value is a list of lists, where each sublist appears to have the same number of elements that the fields list had. (That's not an accident. What we've got is a complete set of information from each sensor.)

So, what will we want to know about each sensor?

The second element in each sensor's list is the one named pm, which indicates its current reading of the concentration of PM2.5 in the air. It's being reported in µg/m³.
The fifth element in each sensor's list is the one named age, which specifies how many seconds it's been since the sensor last reported its value to PurpleAir. We'll ignore any sensor that hasn't reported a value in the last hour.
The 26th element in each sensor's list is the one named Type, which specifies whether the sensor is indoors or outdoors. When the value is 0, the sensor is outdoors; when the value is 1, the sensor is indoors. We're not interested in sensors that are indoors, since our goal is reporting on outdoor air quality.
The 28th element in each sensor's list is the one named Lat, which is a latitude, in degrees, where the sensor has been placed.
The 29th element in each sensor's list is the one named Lon, which is a longitude, in degrees, where the sensor has been placed.

Any sensor that doesn't have these elements, or that has these elements but they have values that aren't what they're expected to be (e.g., they're null instead of a number) should be ignored.
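Here is one hedged sketch of how those rules might be applied to the parsed JSON object. Rather than hard-coding positions like "second" or "28th," it looks each position up by name in the fields list; that's a design choice on my part, not something the API requires. The function name usable_sensors is hypothetical, and treating a sensor whose age is exactly 3600 seconds as still usable is an assumption about where the one-hour boundary falls.

```python
REQUIRED_FIELDS = ("pm", "age", "Type", "Lat", "Lon")
ONE_HOUR_IN_SECONDS = 3600


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so it's excluded explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def usable_sensors(purpleair: dict) -> list:
    """Return one dictionary per recent, outdoor sensor with complete data."""
    fields = purpleair["fields"]
    positions = {name: fields.index(name) for name in REQUIRED_FIELDS}

    sensors = []
    for row in purpleair["data"]:
        try:
            values = {name: row[i] for name, i in positions.items()}
        except IndexError:
            continue  # the row is missing one of the elements we need
        if not all(_is_number(v) for v in values.values()):
            continue  # e.g., a null where a number was expected
        if values["age"] > ONE_HOUR_IN_SECONDS:
            continue  # hasn't reported recently enough
        if values["Type"] != 0:
            continue  # not an outdoor sensor
        sensors.append(values)
    return sensors
```

Each dictionary in the result has the keys pm, age, Type, Lat, and Lon, which is everything the rest of the program needs to know about a sensor.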

Geocoding via Nominatim's API

Nominatim is a web API that provides geocoding services using an open set of map data called OpenStreetMap. Specifically, we'll be interested in using it for two things:

Forward geocoding, which means that we want to take a description such as Bren Hall, Irvine, CA and find out its latitude and longitude.
Reverse geocoding, which means that we have a latitude and longitude, such as 33.5935341°N and 117.874846°W, and we want a description of what's there.

Nominatim's API has fairly extensive documentation that describes its use, so you'll want to take a look through that to understand the services it provides and how to access them. See if you can construct URLs that find the answers to the two examples above. Don't worry if it takes a little while, but do spend some time working on that problem before you try to reach out to Nominatim's API from your program; you can't use tools that you don't understand how to use.

Nominatim API documentation

Nominatim's API is capable of returning information in a variety of formats, but we'll need to agree on what format we're using — because, as you'll see in the next section of the write-up, we'll need to know what format your program can handle, so we can test it properly — so we'll need to agree to always pass this query parameter in the URLs given to Nominatim's API, even if there are other options available:

format=json
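Whatever URLs you end up constructing, it's worth letting Python's urllib.parse module handle the encoding of query parameters, since descriptions like Bren Hall, Irvine, CA contain spaces and commas that can't appear in a URL as-is. The sketch below shows the technique; the base URL, the /search and /reverse paths, and the q, lat, and lon parameter names are my reading of the Nominatim documentation, so treat them as assumptions and confirm them there. The function names are hypothetical.

```python
import urllib.parse

# Assumption: the public Nominatim instance; check the documentation.
BASE_NOMINATIM_URL = "https://nominatim.openstreetmap.org"


def build_search_url(description: str) -> str:
    """Build a forward geocoding URL for a free-form description."""
    query = urllib.parse.urlencode([("q", description), ("format", "json")])
    return f"{BASE_NOMINATIM_URL}/search?{query}"


def build_reverse_url(latitude: float, longitude: float) -> str:
    """Build a reverse geocoding URL for a latitude and longitude in degrees."""
    query = urllib.parse.urlencode(
        [("lat", latitude), ("lon", longitude), ("format", "json")])
    return f"{BASE_NOMINATIM_URL}/reverse?{query}"
```

Notice that urlencode turns each space into a + and each comma into %2C, so build_search_url("Bren Hall, Irvine, CA") ends with q=Bren+Hall%2C+Irvine%2C+CA&format=json.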

Testing without the APIs

One of the challenges when you work on a project such as this is that your ability to test the program (or even to run it and see its output) is at least partly dependent on the performance of the API. If the API isn't functioning properly, your program won't function properly either, but when you're building a program, it's good to be able to tell the difference between a program that isn't working because it's broken in some way and one that's working fine but dependent on something outside of it that's not working.

For that reason, your program will need a way to obtain its information from a file stored on your hard drive, instead of reaching out to the API. This will allow you to test your program with known-good data, which you'll mostly want to do, except when you're specifically working on the parts of the program where you're reaching out to the APIs. In the next sections of this write-up, you'll see how we'll make that possible.
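One way to picture the idea, as a rough sketch only: two small classes that obtain JSON data from different places but offer the same load method, so the rest of the program doesn't need to know which one it was given. The class names, the load method, and the count_sensors function are all hypothetical, and the real program's design may well differ.

```python
import json
import urllib.request


class JsonFromFile:
    """Obtains JSON data from a file stored locally."""

    def __init__(self, path: str):
        self._path = path

    def load(self):
        with open(self._path, encoding="utf-8") as f:
            return json.load(f)


class JsonFromUrl:
    """Obtains JSON data by downloading it from a URL."""

    def __init__(self, url: str):
        self._url = url

    def load(self):
        with urllib.request.urlopen(self._url) as response:
            return json.loads(response.read().decode("utf-8"))


def count_sensors(source) -> int:
    # Works with any object that has a load method returning PurpleAir-style
    # data; it neither knows nor cares where the data came from.
    return len(source.load()["data"])
```

Because count_sensors only ever calls load, swapping JsonFromUrl for JsonFromFile changes where the data comes from without changing any of the code that uses it.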

The program

Your program will read a sequence of lines of input from the Python shell that configure its behavior, then generate and print some output consistent with that configuration. The general goal of the program is this: Given a "center" point, a range (in miles), and an AQI threshold, describe the locations within the given range of the center point having the n worst AQI values that are at least as much as the threshold. (That's a mouthful, so you'll want to read that sentence a few times; there's a lot going on there. Read further, too, and you'll see an example that will help to clarify.)

The input

The first thing your program does is read several lines of input that describe the job you want it to do. Your program should not print any prompts to the user; it should just blindly read this input, expecting that the user understands how to use the program already.

The first line of input will be in one of two formats:
- CENTER NOMINATIM location, where location is any arbitrary, non-empty string describing the "center" point of our analysis. For example, if this line of input said CENTER NOMINATIM Bren Hall, Irvine, CA, the center of our analysis is Bren Hall on the campus of UC Irvine. The word NOMINATIM indicates that we'll use Nominatim's API to determine the precise location (i.e., the latitude and longitude) of our center point.
- CENTER FILE path, where path is the path to a file on your hard drive containing the result of a previous call to Nominatim. The file needs to exist. The expectation is the file will contain data in the same format that Nominatim would have given you, but will allow you to test your work without having to call the API every time — important, because Nominatim imposes limitations on how often you can call into it, and because this could allow you to make large parts of the program work without having hooked up the APIs at all.
The second line of input will be in the following format:
- RANGE miles, where miles is a positive integer number of miles. For example, if this line of input said RANGE 30, then the range of our analysis is 30 miles from the center location.
The third line of input will be in the following format:
- THRESHOLD AQI, where AQI is a non-negative integer specifying the AQI threshold, which means we're interested in finding places that have AQI values at least as high as that threshold. It is safe to assume that the AQI threshold is non-negative, though it could be zero.
The fourth line of input will be in the following format:
- MAX number, where number is the maximum number of locations we want to find in our search, which you can assume would be a positive integer. For example, if this line of input said MAX 5, then we're looking for up to five locations where the AQI value is at or above the AQI threshold.
The fifth line of input will be in one of two formats:
- AQI PURPLEAIR, which means that we want to obtain our air quality information from PurpleAir's API.
- AQI FILE path, where path is the path to a file on your hard drive containing the result of a previous call to PurpleAir's API with all of the sensor data in it.
The sixth line of input will be in one of two formats:
- REVERSE NOMINATIM, which means that we want to use the Nominatim API to do reverse geocoding, i.e., to determine a description of where problematic air quality sensors are located.
- REVERSE FILES path1 path2 ..., which means that we want to use files stored on our hard drive containing the results of previous calls to Nominatim's reverse geocoding API instead. Paths are separated by spaces — which means they can't contain spaces — and we expect there to be at least as many paths listed as the number we passed to MAX (e.g., if we said MAX 5 previously, then we'd specify at least five files containing reverse geocoding data).
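Two of these lines deserve a little care when they're split apart: the CENTER line, because a location can contain spaces, and the REVERSE line, because it carries a variable number of paths. The sketch below shows one possible way to handle each with str.split; the function names and return shapes are illustrative only, and, in keeping with the note about invalid input, neither does any validation.

```python
def parse_center_line(line: str) -> tuple:
    """Split a CENTER line into its source word and the rest of the line."""
    # Limiting the split to two cuts keeps the location or path in one piece,
    # even when it contains spaces.
    _keyword, source, rest = line.strip().split(maxsplit=2)
    return source, rest


def parse_reverse_line(line: str) -> tuple:
    """Split a REVERSE line into its source word and a list of paths."""
    words = line.split()
    return words[1], words[2:]
```

Given CENTER NOMINATIM Bren Hall, Irvine, CA, the first function returns ("NOMINATIM", "Bren Hall, Irvine, CA"); given REVERSE NOMINATIM, the second returns ("NOMINATIM", []), since there are no paths to collect.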

We will not be testing invalid inputs in the Python shell, so you can feel free to handle them in any way you'd like — up to and including a program crash.

The output

After reading all of the input, you'd first display the latitude and longitude of the center location, with latitudes and longitudes shown in the following format.

CENTER 33.64324045/N 117.84185686276017/W
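A small helper can produce that format from a pair of numbers. This sketch assumes coordinates are stored as signed degrees, with negative latitudes meaning South and negative longitudes meaning West; the example above only shows /N and /W, so the /S and /E suffixes are an inference on my part, and the function name is hypothetical.

```python
def format_coordinates(latitude: float, longitude: float) -> str:
    """Format signed degrees as 'latitude/N-or-S longitude/E-or-W'."""
    lat_direction = "S" if latitude < 0 else "N"
    lon_direction = "W" if longitude < 0 else "E"
    # The sign is carried by the direction letter, so only the magnitude
    # of each number is printed.
    return f"{abs(latitude)}/{lat_direction} {abs(longitude)}/{lon_direction}"
```

With this, print("CENTER", format_coordinates(33.5, -117.25)) would display CENTER 33.5/N 117.25/W.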

Then, you'd use the information that's either stored in the specified files or downloaded from the specified APIs to find the sensors that are in the specified range of the center location, then determine which of those sensors have the highest AQI values and, for any of them that are at or above the AQI threshold, display information about the first n of them. For example, suppose the input was as follows:

CENTER NOMINATIM Bren Hall, Irvine, CA
RANGE 30
THRESHOLD 150
MAX 5
AQI PURPLEAIR
REVERSE NOMINATIM

This means we're looking for up to five locations within 30 miles of Bren Hall at UC Irvine where the AQI value is at least 150. Given a choice (i.e., if there are more than five locations with AQI values that meet the threshold), we want to show information about the five locations with the highest AQI values. You would display these in descending order of their AQI (i.e., the highest AQI first, then the second-highest, and so on), and it doesn't matter what order you choose for two or more locations whose AQIs are the same. For each location, you'd print three lines of output:

AQI AQI_value, where AQI_value is the AQI value you calculated for this location.
latitude longitude, which is the latitude and longitude for this location, in the same format as you printed the center location's latitude and longitude.
description, which is the full description of the location.
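The selection described above boils down to three steps: keep what meets the threshold, sort in descending order of AQI, and take at most the first n. As a sketch, assuming each reading has already been reduced to a hypothetical (aqi, latitude, longitude) tuple for a sensor within range:

```python
def worst_locations(readings: list, threshold: int, max_count: int) -> list:
    """Return up to max_count readings at or above threshold, worst first."""
    qualifying = [r for r in readings if r[0] >= threshold]
    # Sorting on the AQI alone leaves ties in whatever order they arrived,
    # which is fine, since the order of ties doesn't matter.
    qualifying.sort(key=lambda reading: reading[0], reverse=True)
    return qualifying[:max_count]
```

It's worth doing the range and threshold filtering before any reverse geocoding, so that descriptions are only looked up for the few locations that will actually be printed.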

A complete example that uses the APIs

As I was writing this, I ran a test, whose results I'm showing below. Note that the output you're seeing is wholly dependent on data from PurpleAir's sensors at the moment I ran the test, as well as the geocoding service done by Nominatim's API, so if you run the same test, you will almost certainly obtain different results, but this is a good demonstration of the output format that's required here.

CENTER NOMINATIM Bren Hall, Irvine, CA
RANGE 30
THRESHOLD 100
MAX 5
AQI PURPLEAIR
REVERSE NOMINATIM
CENTER 33.64324045/N 117.84185686276017/W
AQI 180
33.53814/N 117.5998/W
Garcilla Drive, Orange County, California, 92690, United States of America
AQI 157
33.690376/N 118.03055/W
Orange County, California, United States of America
AQI 154
33.68315/N 117.66642/W
Alton Parkway, Foothill Ranch, Lake Forest, Orange County, California, 92610, United States of America
AQI 152
33.816/N 118.23275/W
Arco, Tesoro Carson Refinery, Bangle, Carson, Los Angeles County, California, 90810, United States of America
AQI 151
33.86117/N 117.96228/W
1880, West Southgate Avenue, Fullerton, Orange County, California, 92833, United States of America

A complete example that uses locally-stored data

I recommend that you do the majority of your testing with locally-stored data. Testing requires not only running a program, but also knowing what the output is supposed to be; only then can you know whether you've got the correct output. But when you're writing a program that reads data from an API that will likely give you different data every time you call it, it becomes difficult to know what the right answer is.

So, as a first step in this direction, you'll find some example data below. Download these files and store them in the same directory as your program's code.

Once you're finished with your program, you should be able to run the following test and see the results shown below.

CENTER FILE nominatim_center.json
RANGE 30
THRESHOLD 50
MAX 3
AQI FILE purpleair.json
REVERSE FILES nominatim_reverse1.json nominatim_reverse2.json nominatim_reverse3.json
CENTER 33.6432477/N 117.84186526398847/W
AQI 159
33.838673/N 118.29809/W
West Carson, Los Angeles County, California, 90502, United States
AQI 65
33.716675/N 118.309906/W
1498, West Hamilton Avenue, Los Angeles, Los Angeles County, California, 90731, United States
AQI 54
33.753635/N 117.85664/W
1040, Stafford Street, Logan, Santa Ana, Orange County, California, 92701, United States

What to do in the case of API failure

In this project, we face the problem that our program may be written perfectly, yet still might fail in some circumstances. This is because we're dependent on two APIs sending us the data we need, in the format we expect, without which our program can't generate its output. Yet, the APIs are themselves software, and software fails; our communication with the APIs is done via a computer network, and computer networks fail, too. So we'll need to account for these possibilities in our design, and also have a mechanism for testing them.

First, we'll need to decide what it means for the APIs to have failed. To do that, we'll attack the problem from the opposite angle: What does success look like?

The HTTP status code in the response to all of our API requests was 200. Any other status code is considered a failure, regardless of the data that was sent in the response.
The content of all of our API requests was formatted as we expected (e.g., it was in JSON format if that's what we expected).
If we used a file on our hard drive in place of a call to an API, the file existed and the contents of the file were formatted as we expected (e.g., it was in JSON format if that's what we expected).
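
As a rough sketch of the first two success conditions, the check might look something like the function below. The name is_successful_response is hypothetical, and the check only confirms that the body parses as JSON; a complete solution would presumably also verify that the parsed data contains the values the program needs.

```python
import json


def is_successful_response(status_code: int, body_text: str) -> bool:
    """Return True only if the status code is 200 and the body parses as JSON."""
    if status_code != 200:
        return False
    try:
        json.loads(body_text)
    except json.JSONDecodeError:
        # The body was not valid JSON, so it was not formatted as expected.
        return False
    return True
```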

In any other case, we'll say that our program has failed, and we'll print an alternatively-formatted set of output — entirely separate from the normal one — that briefly describes the first failure you encountered.

The first line of output will simply be the word FAILED.
If the first failure you encountered was due to the use of an API...
- The second line of output will contain the HTTP status code of the first API request that failed, as well as the URL that you connected to. (Note that the status code might still be 200, if the failure was due to missing or misformatted content.)
  - If there was no HTTP status code (e.g., because your computer is not connected to the Internet and can't contact the server at all), then you would print the URL here, but not the status code, since there would be no status code to print.
- The third line of output will be exactly one of these three phrases:
  - NOT 200 (if an API request returned a status code other than 200)
  - FORMAT (if an API request returned data that had missing or misformatted content)
  - NETWORK (if an API request couldn't be sent at all because, for example, there was no network connectivity)
If the first failure you encountered was due to a file on your hard drive that you were using in place of a call to an API...
- The second line of output will contain the path to the file that you attempted to use.
- The third line of output will be exactly one of these two phrases:
  - MISSING (if the file does not exist or couldn't be opened)
  - FORMAT (if the file could be opened, but its contents had missing or misformatted content)

For example, if your program makes an API request whose response contains the HTTP status code 429, your output would be something like this (albeit with the actual URL that failed):

FAILED
429 https://whatever.the.url.that/failed/was?including=its&parameters=please
NOT 200

Or if your program tried to use the file D:\Examples\Python\purpleair.json but that file didn't exist, your output would be this instead:

FAILED
D:\Examples\Python\purpleair.json
MISSING

To be clear, you'll print this alternative output (and only this alternative output) if any of the API requests or usages of files fails; otherwise, you'll follow the requirements above and print output describing the center location and any locations where air quality is problematic.
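
One possible way to organize this, offered only as a sketch, is to describe a failure with an exception and build the alternative output from it. The names DataFailure, failure_report, and load_json_file are hypothetical and are not required by the project; the example shows only the file-based case, and an API-based class could raise the same exception with a status code and URL.

```python
import json


class DataFailure(Exception):
    """Describes the first failure: what was being read and why it failed."""

    def __init__(self, target, reason, status=None):
        super().__init__(reason)
        self.target = target  # the URL or file path that was used
        self.reason = reason  # 'NOT 200', 'FORMAT', 'NETWORK', or 'MISSING'
        self.status = status  # HTTP status code, or None if there was none


def failure_report(failure):
    """Build the three-line alternative output for a failure."""
    if failure.status is None:
        second_line = str(failure.target)
    else:
        second_line = f'{failure.status} {failure.target}'
    return '\n'.join(['FAILED', second_line, failure.reason])


def load_json_file(path):
    """Read and parse a JSON file, translating problems into DataFailure."""
    try:
        with open(path, 'rb') as json_file:
            raw = json_file.read()
    except OSError:
        # The file does not exist or could not be opened.
        raise DataFailure(path, 'MISSING') from None
    try:
        return json.loads(raw)
    except ValueError:
        # Covers both invalid JSON and bytes that cannot be decoded as text.
        raise DataFailure(path, 'FORMAT') from None
```

The main module could then catch DataFailure in one place and print failure_report(failure) instead of the normal output.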

Design requirements and advice

As with the previous project, you'll be required to design your program using multiple Python modules (i.e., multiple .py files), each encapsulating a different major part of the program. We'll leave you some flexibility in determining where to draw the line between what's in one module and what's in another, but the module that you'd execute to run your program must be named precisely project3.py.

Fetching our data with classes

There are three points in your program where you'll need to fetch data from either an API or a file:

When you use forward geocoding to determine the location of the center of your analysis.
When you need to obtain information from air quality sensors.
When you use reverse geocoding to determine the description of where a problematic air quality sensor is.

In each of these three cases, there are two separate ways to solve the problem: one using an API and the other using a file. In each case, you'll be required to implement Python classes, each of which contains attributes that configure it, if necessary (e.g., the path to a file that should be read), and a method that obtains the data. Classes that obtain the same data must share an interface (i.e., they must have a method with the same name, the same parameters, and the same type of return value), so that you can build objects of these types when you read your program's input, then execute them later without knowing which types of objects they actually are.

(This is one key benefit in using classes in Python; we can treat different kinds of objects with similar capabilities the same way, which avoids us having to use if statements to differentiate. We saw an example of this in lecture, when we talked about duck typing.)
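
To make the idea of a shared interface concrete, here is a minimal sketch of two classes that obtain the same kind of data in different ways. The class names, the method name fetch, and the function load_sensor_data are all hypothetical, and failure handling is left out to keep the example short.

```python
import json
import urllib.request


class AirQualityFromFile:
    """Obtains air quality data from a locally-stored JSON file."""

    def __init__(self, path):
        self._path = path

    def fetch(self):
        with open(self._path, 'rb') as json_file:
            return json.loads(json_file.read())


class AirQualityFromApi:
    """Obtains air quality data by downloading JSON from a URL."""

    def __init__(self, url):
        self._url = url

    def fetch(self):
        with urllib.request.urlopen(self._url) as response:
            return json.loads(response.read())


def load_sensor_data(source):
    """Works with any object that has a fetch() method (duck typing)."""
    return source.fetch()
```

Because both classes have a fetch method with the same parameters and the same type of return value, load_sensor_data never needs an if statement to find out which kind of object it was given.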

Where should I start?

There are lots of ways to start this project, but your goal, as always, is to find stable ground as often as possible. One problem you know you'll need to solve is generating the final report, so you could begin by generating a portion of it — maybe just some details of the output report that are formatted correctly, even if the data is hard-coded. Now you're on stable ground.

Another problem you know you'll need to solve is calculating an AQI value, given a PM2.5 concentration; you might consider continuing with that. You can test this using the Python shell or assert-based tests before proceeding, and then you're on stable ground. Continue with the equirectangular approximation of distances between points on the Earth, then test that. Now you're on stable ground again.
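
As an illustration of what a small, testable piece with assert-based tests might look like, here is a sketch of the standard equirectangular distance approximation. The function name is hypothetical, and the Earth radius of 3958.8 miles is an assumption; use whatever formula and radius the project specifies. The sketch also does not attempt to handle pairs of points on opposite sides of the 180-degree meridian.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed value; use the one the project specifies


def equirectangular_distance(lat1, lon1, lat2, lon2):
    """Approximate the distance in miles between two points given in degrees."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    average_lat = math.radians((lat1 + lat2) / 2)
    x = dlon * math.cos(average_lat)
    return math.sqrt(x * x + dlat * dlat) * EARTH_RADIUS_MILES


# Assert-based tests: a point is zero miles from itself, and one degree of
# latitude is a little over 69 miles with the radius assumed above.
assert equirectangular_distance(33.64, -117.84, 33.64, -117.84) == 0.0
assert abs(equirectangular_distance(0.0, 0.0, 1.0, 0.0) - 69.094) < 0.01
```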

From there, you might continue by implementing a module that obtains the air quality data from PurpleAir's API, perhaps first by implementing the class that reads that data from a file, then later implementing the class that loads it from the web instead. (You'll want the part that reads from a file pretty early on, because there are limitations on how often you can ask PurpleAir to send you all of its sensor data, so better not to keep asking repeatedly.)

Once you've got these implemented, you might continue with forward and reverse geocoding using Nominatim — again, first by implementing the classes that read this data from a file, then later implementing the classes that load them from Nominatim's API instead.

Now you'd have a lot of pieces in place, and you can start thinking about how to tie them together. At this point, you may feel like you don't have a program yet, but that's not so out of the ordinary when you work on a large project; it's often quite a while before you have something that runs an entire end-to-end process, because you first need to build and test a lot of smaller-scale tools. In that sense, this project is a pretty realistic view into what it takes to build realistic programs that interact with complex sets of inputs and outputs.

But, again, there are lots of sequences that could lead to a good solution, and you'll want to consider how you can achieve partial solutions that nonetheless meet the requirements partially, because partial credit is available for those. Still, if you find a way to approach this that's different than what I've suggested, but that leads you to a complete program that meets the design requirements, that's fine; we don't care what order you implement it in, ultimately, but we're happy to help you find an ordering if you're not sure what to work on next.

Limitations

Third-party libraries

Remember that, as stated in the Project Guide, third-party libraries — libraries that are not part of Python's standard library — are off-limits in your work unless they are explicitly permitted. This includes, for example, code you might find online that communicates with Nominatim's or PurpleAir's APIs, or third-party libraries such as requests that are commonly used for HTTP-based communication. The intent here is that you be the one to write that code, because that's one of the learning objectives here.

Respecting the terms of service of the APIs we'll use

The APIs we're using in this project are subject to terms of service, which is to say that there are restrictions around how we're permitted to use them. In particular, we'll need to be cognizant of the following restrictions:

PurpleAir doesn't have an explicitly specified rate limit, but all indications are that they limit the use of their API, particularly since we're going to be fetching data from all of their sensors. In light testing, I've frequently received an HTTP status code of 429 and an error message stating that I need to wait a certain number of seconds before trying again. So, overall, test this sparingly (i.e., use AQI FILE instead of AQI PURPLEAIR as your input much more often than not).
Nominatim has a rate limit of one request per second. Since there may be multiple reverse geocoding lookups required in one run of your program, you'll need to be sure that your program "pauses" for one second between subsequent requests.
Nominatim requires that we set a header called Referer, which specifies information about where the request came from. Set the Referer header as follows:
- https://www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project3/YOUR_UCINETID
- (Replace YOUR_UCINETID with your UCInetID. If, for example, your UCI email address is boo@uci.edu, you would replace YOUR_UCINETID with boo instead.)

More details on Nominatim's usage policy can be found here.
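
The two Nominatim requirements above, the Referer header and the one-second pause, might be handled along these lines. This is a sketch rather than a required design: the function names and the send parameter are hypothetical, and only urllib.request.Request and time.sleep come from Python's standard library.

```python
import time
import urllib.request

REFERER_PREFIX = 'https://www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project3/'


def build_nominatim_request(url, ucinetid):
    """Create a request object that carries the required Referer header."""
    return urllib.request.Request(
        url, headers={'Referer': REFERER_PREFIX + ucinetid})


def send_with_pauses(requests, send, pause=time.sleep):
    """Send each request in order, pausing one second between requests.

    send is any function that takes one request and returns its result;
    pause is replaceable so that tests need not actually wait.
    """
    results = []
    for index, request in enumerate(requests):
        if index > 0:
            pause(1)
        results.append(send(request))
    return results
```

Note that the pause applies to all Nominatim requests made in one run, so a forward geocoding request followed quickly by a reverse geocoding request would need a pause between them, too.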

Deliverables

Gathering your files for submission

We've written automation tools to help us to manage your submissions and report your scores, but these tools require us to know that everyone's submission will be structured the same way. For this reason, we're providing you a tool that can gather your files into a single file whose format we can count on, which you'll then submit to Canvas.

To submit your work, follow these instructions:

Make sure that all of the .py files that make up your program are in the same directory.
Download the Python script linked below, storing it in the same directory as your program:
- make_project3_submission.py
Run the Python script you downloaded in the previous step. It will gather all of the .py files in the same directory (except for ones that it intentionally skips), verify that they're readable as text, and then generate a submission file named project3.zip in the same directory.
- If there are any issues — files in the wrong format, for example — they'll be reported to you.
- The files included in the submission will be listed in the script's output; you'll want to read that output to ensure that all of the files you want to be submitted are included.
Submit the submission file project3.zip (and only that file) to Canvas.

Note, too, that if you submit separate files, create your own Zip file arranged in your own way, or otherwise don't follow these instructions, we reserve the right to score your project as low as zero. There are no exceptions to this rule.

Be aware that you're responsible for submitting the version of the project that you want graded. We won't regrade a project simply because you submitted the wrong version accidentally.

Can I submit after the deadline?

Yes, it is possible, subject to the late work policy for this course, which is described in the section titled Late work at this link.

What do I do if Canvas slightly adjusts my filename?

Canvas will sometimes modify your filenames when you submit them (e.g., when you submit the same file twice, it will change the name of your second submission to end in -1.zip instead of just .zip). In general, this is fine; as long as the file you submitted has the correct name, we'll be able to obtain it with that same name, even if Canvas adjusts it.