ICS 32A Fall 2023
Project 1: Digging in the Dirt

Due date and time: Monday, October 23, 11:59pm


Introduction

While there are clear differences between the operating systems that run personal computers, one of the things that they all have in common is the notion of a file system, whose role is to manage the creation, arrangement, updating, and deletion of files. Storage devices such as hard drives and USB sticks are actually a lot more complicated than they look, but a file system provides an abstraction over that complexity, replacing it with a few simple concepts like files, directories, and paths.

Like many programming language libraries, Python's standard library provides a pre-built implementation of a variety of file system operations. It is possible to create files, copy and move files, search in directories, and manipulate paths. Once you know the abstract concepts that a file system is centered around, you can use Python's standard library to access a file system in terms of those same abstract concepts. We've seen already that there is a str type in Python that knows how to manage sequences of characters and a float type that knows how to store and manipulate floating-point numbers; because these are built into Python, you can use them without having to know every tiny detail of how they work internally. Similarly, the Python standard library offers types like Path that can handle the details of manipulating a file system, meaning that you won't have to understand every detail of how they work internally in order to make good use of them.

This project will give you a chance to explore a few parts of the Python standard library that you may not have seen before; provide practice in reading and digesting technical documentation in order to determine what functions in the standard library are appropriate in helping you to solve a problem; introduce you to some features of Python you might not have had the opportunity to use before, such as exception handling; and ask you to use recursion to traverse a recursive data structure (in this case, the file system).


File systems

Whether you run Windows, macOS, or some flavor of Linux, you've no doubt had experience with at least some of the following concepts, though you may not be familiar with all of the terminology, and not all of you will have seen the same subset of these concepts. To ensure that we're all on the same page, here are some things you'll need to know about file systems.

If most of these terms seem unfamiliar, it's worth doing a bit of research and experimentation to be sure you understand what they are before you proceed too far through this project. In addition to being useful in the context of the problem at hand, this is good practical knowledge that you'll need if you want to be successful as a programmer.


The program

The program you'll be writing for this project is one that can find and display the paths of all of the files in a directory (and, potentially, all of its subdirectories, and their subdirectories, and so on), and then take action on some of those files that have interesting characteristics. Both the notion of interesting characters and taking action will be configurable and each can work in a few different ways, but the core act of finding files will always be the same. One of your goals should be to avoid rewriting the same code over and over again (e.g., multiple functions that each perform a search for files with slightly different characteristics) whenever possible; in ICS 32A, we begin to concern ourselves more strictly with design issues, keeping an eye on how best to solve a program, as opposed to writing something that "just works."

The input

Your program will take input from the console in the following format. It should not prompt the user in any way. The intent here is not to write a user-friendly user interface; what you're actually doing is building a program that we can test automatically, so it's vital that your program reads inputs and writes outputs precisely as specified below.

A few additional notes

Since one of the goals of this project is to introduce you to the use of recursion to solve real problems, the search that includes subdirectories and their subdirectories must be implemented as a recursive Python function that processes all of the files in a directory and then processes any subdirectories recursively. (Note that this rules out certain features of the Python standard library — searches like this are actually built into the library, but would circumvent the learning goal here. More on that later.)

Outside of the occurrence of symbolic links, which we're ignoring for the purposes of this project, directory structures are hierarchical (i.e., directories have subdirectories inside of them, and those subdirectories have the same basic structure as their "parents").

An example of the program's execution

The following is an example of the program's execution, as it should work when you're done. Boldfaced, purple text indicates input, while italicized text indicates output. The directories and files shown are hypothetical, but the structure of the input and output is demonstrated as described above.

To reiterate a point from earlier, your program should not prompt the user in any way; it should read input, assuming that the user is aware of the proper format to use.


    ​R C:\Test\Project1\Example​
    C:\Test\Project1\Example\test1.txt
    C:\Test\Project1\Example\test2.txt
    C:\Test\Project1\Example\Sub\meee.txt
    C:\Test\Project1\Example\Sub\test1.txt
    C:\Test\Project1\Example\Sub\youu.txt
    C:\Test\Project1\Example\Zzz\zzz.py
    ​N​
    ERROR
    ​N test1.txt​
    C:\Test\Project1\Example\test1.txt
    C:\Test\Project1\Example\Sub\test1.txt
    ​Q​
    ERROR
    ​F​
    This is a line of text
    Hello, my name is Boo

Portability

It should be possible to run your program on any operating system — Windows, macOS, Linux, etc. — so long as there is a Python 3.11 implementation for it. By and large, this doesn't require much special handling: Python is already reasonably independent of its platform. However, since you're dealing with a filesystem, you do have to be cognizant of how filesystems are different from one operating system to another. Most notably, paths are written differently:

The way to isolate this distinction is to find tools in Python's standard library that isolate them for you. Don't store paths as strings; store them as path objects instead. (More on this below.)

Handling failure

You'll want to handle failure carefully in this program. In general, your program should not crash just because one activity fails; it should instead quietly skip the offending file or directory and continue, if possible. For example:

Organization of your program

There are a few things to note about how your program is organized.

A word about documentation

You are required to include type annotations and docstrings on every one of your functions. This will not only make your design clear to us when we're grading your project, but will also clarify your own design for yourself; details like this will help you to read and understand your own program. I find, as a general rule, that I will always write (or, at the very least, consider and understand) what the types my parameters and return value will be before I write any function; knowing the types is part of knowing what the function is supposed to do.

For example, if you were to write a function find_max that finds the maximum integer in a list of integers, you might start the function this way:


def find_max(numbers: list[int]) -> int:
    '''Finds and returns the maximum number in a list of numbers'''
    ...

Other aspects of how this project (and others) will be graded are described on the front page of the Project Guide. Keep in mind, in particular, that part of your score will be based on how well your program is designed; the front page of the Project Guide describes this in more detail.

An important design goal

As you write your program, you'll want to be on the lookout for complex functions that can be made simpler by breaking them up into multiple smaller functions. Taking complexity and putting it into a function with a well-chosen name and clearly-named parameters is a great way to tame that complexity; this is the first step in building an ability to write significantly larger programs than you've written so far, so be sure that you're getting some practice in taking that step in this project. One of the criteria we'll use in grading your project is to assess the quality of its design; in this project, the largest factor affecting quality is the extent to which functions were broken up when they are long or cumbersome.

A good rule of thumb is to consider what you would say if I asked you "What does this function do?" If the answer is more than a sentence or so — and probably a short one, at that! — it's probably doing too much, and might better be implemented as multiple functions that each do part of the job. (One of the good things about writing docstrings in your functions is that it forces you to test this; if your docstring needs to be especially long, your function is probably too complex.)


Wait... what kind of crazy user interface is this?

Unlike programs you may have written in the past, this program has no graphical user interface, or even an attempt at a "user-friendly" console interface. There are no prompts asking for particular input; we just assume that the program's user knows precisely what to do. It's a fair question to wonder why there isn't any kind of user interface. Not all programs require the user-friendly interfaces that are important in application software like Microsoft Word or iTunes, for the simple reason that humans aren't the primary users of all software. As we'll see going forward in this course, much of what happens in sofware is programs talking directly to other programs; in cases like that, each program generally presumes that the other one is aware of the right way to communicate, and will give up quickly if the other program isn't ready to play by those rules.

Now consider again the requirements of the program you're being asked to write for this project. It waits for requests to be sent in via the console — though they could almost as easily be sent across the Internet, if we preferred, which we'll see in the next project — in a predefined format, then responds using another predefined format. Our program, in essence, can be thought of in the same way as a web server; it's the engine on top of which lots of other interesting software could be built. When an attempt in being made to search for a file, that could be coming from a web form filled out by a user, from another program that needs to perform a search, or from a human user in the Python shell.

While we won't be building these other interesting parts, suffice it to say that there's a place for software with no meaningful user interface; it can serve as the foundation upon which other software can be built. You can think of your program as that foundation.

Additionally, we're using a strategy like this one to assist us in automating the grading of the correctness of your project. Since everyone's input and output have to be formatted in the same way, we will be able to grade your output without manually looking at every line.


The Python Standard Library documentation

At python.org, you'll find a comprehensive set of documentation about the Python language and its standard library. Two good places to start are these pages:

You'll probably need to spend a fair amount of time, especially early in your work on this project, looking at the Standard Library documentation and assessing what functions in that library might be able to assist you in your solution. The Standard Library documentation is mostly broken down by module. It's tough sometimes to know where to look, as there are so many modules, so we'll sometimes try to point you in the right direction. For this project, you'll want to pay special attention to (at least) these modules: pathlib, os, and shutil, though you might also find useful tools in other modules. Feel free to poke around and see what's available; if it's in the Standard Library, you can feel free to use it, except for one limitation pointed out below.

Note, too, that part of understanding what's available in the Standard Library is careful, targeted experimentation. If you're not sure what a library function does just by reading its documentation, you can fire up IDLE and experiment with it; one of the nice things about an interpreted programming language like Python is that this experimentation can be done in the interpreter, without having to make changes to your program until you're more sure that you've found library functions that will help. For the most part, this kind of experimentation is harmless — if things don't work the way you expect, you'll see an error message or behavior that doesn't match what you expect — though you do need to be a little bit careful calling functions that do potentially destructive things like deleting files.

A note about versioning

As you're reading through Python's documentation, be sure that you're reading the right version of it. Near the top of each page is a drop-down list that allows you to select a version. Be sure you're reading the documentation for version 3.11 — documentation for any version beginning with "3.11" (e.g., 3.11.5, 3.11.6rc1, etc.) is fine. (In particular, if search engines lead you to Python documentation, you'll find that they quite often lead you to the documentation for older or newer versions of Python than 3.11. But you can usually just select "3.11" in the drop-down list in order to bring up the corresponding page of the documentation for our version of Python.)

Be aware, as well, that there is a recently released 3.12.0 version of Python, but we aren't going to be using it in this course, so you'll want to be sure you're staying away from library functionality that was added or changed in 3.12.0. The safest approach is to run the same version we'll be using to grade your work: 3.11.5.

Suggestions and limitations on what you can use

For the most part, if you find a function or type that can assist in your solution, you can feel free to use it; note that some care is sometimes required in ensuring that the function you find is actually a good fit. A careful reading of the documentation — and sometimes some reasoned experimentation — are a good way to know whether something might be suitable, but sometimes you won't realize that something isn't a good fit until after you've tried it. And that's fine; no one — no matter how experienced — makes the right design decisions every time.

You should consider looking at the pathlib library, which has tools for storing and manipulating paths in a way that is, more or less, portable between operating systems. Since one of your requirements is to support any operating system — as opposed to just one — you'll want to use the Path type in the pathlib library to represent paths. We'll see more about how to do that in lecture, but it's definitely the case that you'll want to avoid simply using strings for this purpose.

There are a few functions in the Python Standard Library that are off-limits for your use in this project. In general, you are required to write your own recursive function to search the filesystem (i.e., a function that searches in one directory, then recursively searches its subdirectories), which is one of the skills I'd like you to build in this project. That rules out anything that does a complete search in a single function call. For example, os.walk and os.fwalk — both of which perform a full traversal of files in a directory structure — fall into this category, along with the glob and rglob methods of the Path type (and probably others that I've not listed here). If you find yourself solving the problem of finding files without writing a recursive function to traverse them, you've used something you shouldn't.

(Note that, in general, it's safe to assume that you can use anything I don't specifically call out as being off-limits. I do sometimes leave certain things off-limits, mainly so that you achieve the learning objectives of the project; in this case, one of the key objectives is learning about recursion.)


The value of working incrementally

You may be accustomed to solving relatively small, mostly self-contained problems that you can handle all at once. This program, while not giant by real-world standards, consists of more moving parts than you might be used to writing, so you will likely find that attempting to write this program all at once will lead you astray, even if an all-inclusive, everything-at-once approach worked well for you in past coursework.

A better approach for this project — and one that will increase in importance this quarter, as the problems we solve get larger and more complex — is to look for stable ground as often as you can find it. Rather than trying to write the entire program, write some small part of it that you understand well, then find a way to verify that the part you wrote works as you expected (e.g., by writing a function and then calling it in the Python interpreter manually). When it does, you've reached stable ground, and you're ready to choose the next small step you should take, again taking care to choose something that you'll be able to verify after you're done with it. Ideally, you'll find yourself reaching stable ground quite often — sometimes, every few minutes, if things are going well — and this will help you to feel confident that you're making progress.

If you find yourself stuck on one problem and have no idea how to move further, find another positive step you can take. For example, if you're not sure how to find all of the files in a directory structure, write a function that simply returns a hard-coded list of file paths and use that temporarily, then move on and work on something else, eventually working your way back to the places that were causing you trouble. The goal is always to be making some kind of progress, and it's not necessary to write the program in a particular order (e.g., the order in which things happen in the user interface).

Also, when you reach what you believe to be stable ground, it's not a bad idea to make a copy of your Python module before proceeding. That way, if your next step doesn't go as you planned, you maintain the option of "rolling back" to the previous, stable version and trying again. It also gives you something stable to turn in if you find that the deadline arrives and you're not finished; it's always better to turn in a partial program that does something correctly, rather than one that doesn't run. Don't feel like you need to keep every stable version, but it's not a bad idea to at least have the most recent one or two. (One of the practical skills you'll need to start thinking about, if you haven't already, is staying organized. If you have a couple of versions of a file and find that you're often not sure which is which, then you need a better organizational scheme — better file names, better directory names, or whatever helps it to be more obvious to you.)

So, what steps should I take?

There are a variety of ways to build this program from beginning to end, so don't go looking for the "perfect way" to do it. Find a small step you can take, take it, and then find another. Of course, there are missteps you might take along the way, but the best way to learn how to write programs this way is simply to do it; you'll learn as much from your missteps as you will from the ones that work out well.

As an example, though, think about the fact that the program is built around the core notion of "Find all of the unique files in a directory, its subdirectories, their subdirectories, and so on, and return a list of their paths." That sounds like a pretty good step to start with, except that it's actually a bigger step than it sounds like. So you could start even smaller: write a function that finds the files in a directory but ignores subdirectories. Once you can do that, add handling for one level of subdirectories (but assume they have no subdirectories inside of them). Then consider unbounded subdirectory depth and handle that scenario.

Whenever you're working on something and you feel you've bitten off more than you can chew, think about ways to break the problem into smaller ones; eventually, you'll be left with a problem you can think through and complete.

If you're not sure what step to take next, feel free to ask us; we'll help you find a task that will lead you to stable ground.


Sorting and lexicographical ordering

Sorting with key functions

When you want to sort numbers, there is a natural and obvious way that they should be ordered. If you don't say otherwise, the numbers will be sorted into ascending order, which means that smaller numbers would appear in the sorted order before larger ones. So, for example, if you sorted the list [4, 2, 7, 5, 1, 8, 6, 3] in a Python shell, it would become [1, 2, 3, 4, 5, 6, 7, 8].


>>> x = [4, 2, 7, 5, 1, 8, 6, 3]
>>> x.sort()
>>> x
    [1, 2, 3, 4, 5, 6, 7, 8]

We would say, then, that numbers have a natural ordering, which is that they be ordered in an ascending fashion — smaller ones before larger ones.

That's not to say that you can't sort them in some other way. For example, if you wanted to sort the numbers into descending order instead, you can pass a key function to the sort method, whose job is to generate a sort key for each value in the list. The values will then be sorted on the order of their sort keys, rather than the values themselves. For example, suppose you had this function:


def negate(n: 'number') -> 'number':
    return -n

If you passed that function as sort's key function, then you would be sorting the values on the basis of their negations. This would have the effect of making bigger numbers appear smaller and smaller numbers appear bigger, which would cause the elements to be sorted in the reverse of the order they'd appear by default — so they'd end up in descending, rather than ascending, order.


>>> x = [4, 2, 7, 5, 1, 8, 6, 3]
>>> x.sort(key = negate)
>>> x
    [8, 7, 6, 5, 4, 3, 2, 1]

You can use that same technique to sort values that aren't numbers, provided that you have a key function that can generate a sort key for each one. For example, you could sort strings based on their lengths — with shorter ones appearing before longer ones in the sorted order — by using the built-in len function as the key function:


>>> x = ['Boo', 'is', 'happy', 'this', 'afternoon']
>>> x.sort(key = len)
>>> x
    ['is', 'Boo', 'this', 'happy', 'afternoon']

So, generally, we would say that list's sort method works like this:

Lexicographical ordering: The natural ordering of strings

It's rare that someone is surprised when they learn that the natural ordering of numbers is ascending, because that tends to be the way they've learned about sorting numbers their whole lives. But what is the natural ordering of values that aren't numeric? The answer depends on what types of values we're talking about. Because it's germane to this project, let's focus our attention on the natural ordering of strings. First thing's first: Let's try an example and see what happens.


>>> x = ['f', 'b', 'a', 'h', 'c', 'g', 'd', 'e']
>>> x.sort()
>>> x
    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

Having sorted a bunch of one-character strings, each containing a single lowercase letter, the strings were sorted in alphabetical order. This seems somewhat natural for a lot of people, because a lot of us have been sorting strings alphabetically our whole lives, too. But are strings really sorted alphabetically? Let's try another example, where some of the letters are uppercase and others are lowercase.


>>> x = ['F', 'b', 'A', 'h', 'C', 'g', 'D', 'e']
>>> x.sort()
>>> x
    ['A', 'C', 'D', 'F', 'b', 'e', 'g', 'h']

That's a somewhat more curious result. The strings weren't sorted entirely alphabetically this time. The uppercase letters all appear before any of the lowercase letters, but the uppercase letters are sorted alphabetically with respect to one another and so are the lowercase letters. That seems like a strange rule, though; why should uppercase letters be treated as completely separate from lowercase ones?

What about strings of varying length? What happens with those?


>>> x = ['alex', 'hello', 'boo', 'alexander', 'booboo', 'bio']
>>> x.sort()
>>> x
    ['alex', 'alexander', 'bio', 'boo', 'booboo', 'hello']

This is more like the standard ordering you might see in a printed dictionary or in the index at the back of a book, where words are alphabetized. When comparing two words, the following algorithm is applied:

But things aren't quite so clear when we mix in uppercase letters again.


>>> x = ['alex', 'hello', 'Boo', 'Alexander', 'booboo', 'bio']
>>> x.sort()
>>> x
    ['Alexander', 'Boo', 'alex', 'bio', 'booboo', 'hello']

It turns out that there's an algorithm that explains all of this, which is called lexicographical ordering. The lexicographical ordering of strings works almost like the alphabetical ordering you may have seen in printed books, but there is a key difference. It centers around the idea that we can take two characters and decide which one comes first. We'll say each character has a numeric ord value, so a character comes before another if its ord value is smaller.

Now the only remaining question is what the ord value of a character is. Python provides a built-in function called ord that can tell you this for any single-character string.


>>> ord('A')
    65
>>> ord('B')
    66
>>> ord('Z')
    90
>>> ord('a')
    97
>>> ord('b')
    98
>>> ord('z')
    122

And so we see the reason why uppercase letters sort earlier than lowercase ones; their ord values are smaller. The ord values are not arbitrary, as it turns out. Python uses a character set called Unicode, which specifies a numeric code for every possible character, including not just English letters, but also a variety of letters in other printed languages, along with symbols, emojis, and so on. Each of these characters has a code associated with it, and it's this code that is returned by ord. It turns out that Unicode specifies that uppercase English letters all have codes smaller than any lowercase English letters, so that explains some of the strangeness we were seeing before; it's not so strange after all. And we can find the ord value for any character this way:


>>> ord('>')
    62
>>> ord('}')
    125
>>> ord('⇒')
    8658

So, all you need to do if you want to understand the natural ordering of a sequence of strings is to find out the ord values for each of the characters of that string. In general, smaller ord values come before larger ones. It's really that simple.


Testing

Testing this program is going to seem cumbersome, because it will require creating directory structures and files in various configurations, then running the program to see how it behaves. You might find that it's worth your time to automate some scenarios, by writing short Python functions (perhaps in a separate module) that create directories and files in interesting combinations. You won't likely find that it will require writing a lot of code, but it will pay you back (and then some!) as you test your program.

It's quite common in real-world software development to write code whose sole use is as a development tool; it's not part of the program, per se, and will not be given to users of the program, but is strictly meant to make it easier to build the program. You'll be well-served to explore ways to use Python to lighten your testing burden; if there are five interesting scenarios you've identified, you should write five Python functions that can set them up for you automatically, so you can get them back any time you need to test them. If it takes a half-hour to write the test programs, but it takes five minutes to set up the tests manually, as soon as you've set up the tests six times, the time spent writing the test program will begin paying you back. Computers automate tasks that are otherwise cumbersome, and testing certainly falls into that category. (We'll see this theme repeated throughout the course.)

Sanity-checking your output

We are also providing a tool that you can use to sanity-check whether you've followed some of the basic requirements above. It will only give you a "passing" result in these circumstances:

Note that many additional test inputs will be used when we grade your project, as there are a number of combinations — searching recursively or not, multiple search characteristics, multiple acitons — that are possible. The way to understand the sanity checker's output is to think of it this way: Just because the sanity checker says your program passes doesn't mean it's close to perfect, but if you cannot get the sanity checker to report that your program passes, it surely will not pass all of our automated tests.

Running the sanity checker is simple. First, download the Python module linked below:

Put that file into the same directory as your project1.py file. Running the project1_sanitycheck.py module — for example, by loading it in IDLE and pressing F5 (or selecting Run Module from the Run menu) — will run the sanity checker and report a result, which will be printed to the Python shell.

Note, too, that this program's output depends not only on the program's input, but also on the presence of files and directories. For this reason, every time you run the sanity checker, it will create a test directory — a subdirectory in the same directory as project1_sanitycheck.py — and place some files into it that have the appropriate characteristics. So that you can understand its output better, the sanity checker will leave that subdirectory behind, though you can, of course, feel free to delete it when you're done with it.


Deliverables

Submit your project1.py file (and no others) to Canvas. There are a few rules to be aware of.

Can I submit after the deadline?

Yes, it is possible, subject to the late work policy for this course, which is described in the section titled Late work at this link. Beyond the late work deadline described there, we will no longer accept submissions.

What do I do if Canvas adjusts my filename?

Canvas will sometimes modify your filenames when you submit them (e.g., by adding a numbering scheme like -1 or a long sequence of hexadecimal digits to its name). In general, this is fine; as long as the file you submitted has the correct name prior to submission, we'll be able to obtain it with that same name, even if Canvas adjusts it.