ICS 32A Fall 2023
Notes and Examples: Files, Paths, and File Systems


Files and file systems

So far, the only kind of input our Python programs can accept comes from the keyboard, in response to a call to the built-in input() function. Of course, that's a nice start, but it feels awfully limiting. If we think about the programs we actually use, we quickly realize that they take input from sources other than just what we type. Another quite common source of input is information stored in a file system.

What is a file system?

If you've ever used a personal computer — a desktop machine or a laptop, for example — there's a good chance that you've interacted with a file system, even if you've never heard the term before. A file system is software that manages how information is stored on a storage device such as a hard drive or a USB stick. There are a number of different kinds of file systems in use — sometimes more than one kind on the same operating system! — but they mostly share the same basic characteristics, while differing mainly in the fine-grained details. So, if you know about those shared characteristics, you'll quickly find yourself at home using just about any file system on just about any operating system.

The basic abstraction in a file system is that of a file. A file is a container in which a sequence of bytes is stored. Each byte is effectively a sequence of eight "digits" that are either 1 or 0; each of these digits is called a bit. The bytes in each file are interpreted differently depending on what kind of file it is, and it should be noted that this is largely a matter of what the program reading the file expects that it should contain (e.g., text, an image, a song); the file system itself is mostly unconcerned with what's in each file, except for the metadata associated with the file, which keeps track of things like who owns the file, who has access to the file, and when the file was last modified. The file system manages the containers in which the bytes are stored, but cares little about the bytes inside of each file, other than to make sure that a file's contents don't change unless some program asks for them to be changed.
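To make the idea of bytes and bits a little more concrete, here is a small sketch of how the text Boo might look as bytes. It assumes the text is stored using the UTF-8 encoding, which is a common choice but not something these notes have established; the variable names are made up for illustration.

```python
# The text 'Boo' turned into bytes, assuming the UTF-8 encoding.
text = 'Boo'
data = text.encode('utf-8')

# Each byte can be viewed as a number from 0 to 255.
byte_values = list(data)

# Each byte can also be written out as eight bits (1s and 0s).
bit_strings = [format(value, '08b') for value in byte_values]

print(byte_values)    # [66, 111, 111]
print(bit_strings)    # ['01000010', '01101111', '01101111']
```

The same three bytes could just as easily be treated as three small numbers by a program that expected numbers instead of text, which is the sense in which the meaning of a file's bytes depends on the program reading them.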

The most basic things we'll want to do with files in Python are reading data from them (so we can use it as input to our program) and writing data to them (so we can "save" it, or use it in another program). We'll start our story there.


Files

Interacting with files in Python requires first that we establish a sort of a connection to them. The values we store in variables are part of our program; they're part of what Python knows or calculates while the program runs. But the contents of files lie outside of our program, so we'll need some way to cross that boundary.

In Python, we do that by opening a file, which can most easily be done using the built-in function open(). If we call open() and pass it a single argument that is the path to a file — an indication of where the file is — then we'll get back a file object that we can use to interact with it.


>>> the_file = open('D:\\Examples\\data\\myfile.txt')

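When we open a file, we also have to establish what we want to do with it, which requires us to specify a couple of things: whether we intend to read from the file or write to it, and whether we expect the file to contain text or something other than text.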

If you pass a second argument to the built-in open() function, you can specify both of these choices; the default, if you don't pass a second argument, is that you intend to read text. So, we would expect the_file to be a file object from which we can read text. Let's see what we got.


>>> type(the_file)
    <class '_io.TextIOWrapper'>

It's not especially important that we know exactly what a _io.TextIOWrapper is, but the name at least gives us the sense that it provides input or output capabilities (what is often abbreviated in computing as I/O) and that it deals with text. In this course, I'll just refer to these as "file objects."

I tend to prefer to be explicit about whether I intend to read from a file or write to it, so I'll generally pass the second argument. (It's a little more to type, but it makes clear that I intend to read from the file, as opposed to my simply having forgotten to say anything.) The way we say that we want to read text from the file is to pass the string literal 'r' in the second argument.


>>> the_file = open('D:\\Examples\\data\\myfile.txt', 'r')

Once you have a file object, you can read from it or write to it, which we'll return to shortly. But one important detail that we should consider first is that the notion of "opening" might make you wonder if there exists an inverse notion of "closing," as well. Indeed there is, and, in fact, it's a vital one. When you're done using a file, you're always going to want to close it, which you do by calling the close() method on the file object.


>>> the_file.close()

Once the file is closed, you'll no longer be able to use the file object, but you'll have ensured that any operating system resources involved in keeping the file open will no longer be in use, and that other programs that may want to open the file will be able to do so. We'll always close files we've opened after we're done with them.
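Here is a small sketch of the whole life cycle of a file object: opening, checking its default mode, closing, and then seeing what happens if we try to use it afterward. So that it can run anywhere, it creates a throwaway file in a temporary directory using the tempfile module rather than using the D:\Examples path from above; the filename and variable names are assumptions made only for this illustration.

```python
import os
import tempfile

# Create a throwaway file so there is something to open.
temp_dir = tempfile.TemporaryDirectory()
example_path = os.path.join(temp_dir.name, 'myfile.txt')

setup_file = open(example_path, 'w')
setup_file.write('Boo\n')
setup_file.close()

# With no second argument, open() assumes we want to read text.
the_file = open(example_path)
default_mode = the_file.mode
closed_before = the_file.closed

the_file.close()
closed_after = the_file.closed

# Once a file object is closed, reading from it raises a ValueError.
try:
    the_file.readline()
    read_failed = False
except ValueError:
    read_failed = True

print(default_mode, closed_before, closed_after, read_failed)
# r False True True

# Remove the temporary directory and the file inside it.
temp_dir.cleanup()
```

The closed attribute, like mode, is not a method, so there are no parentheses after its name.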

Reading text from a file

If you open a file because you want to read text from it, then there are methods you can call to read that text. There are a number of methods available, but we'll only need a small handful of them, so let's focus on the ones we need; we'll see others later if we find a use for them.

The readline() method reads a line of text from the file. The file object has a sort of "cursor" inside of it, which keeps track of our current position in the file; each time we read a line, that cursor is moved to the beginning of the next line, so that each subsequent call to readline() gives us back the next line of text that we haven't yet seen.


>>> the_file = open('D:\\Examples\\data\\myfile.txt', 'r')
>>> the_file.readline()
    "'Boo'\n"
>>> the_file.readline()
    'is\n'
>>> the_file.readline()
    'happy\n'
>>> the_file.readline()
    'today'
>>> the_file.readline()
    ''
>>> the_file.readline()
    ''
>>> the_file.close()

The contents of the file we're reading in the example above look like this, with newlines on the end of every line except the last one.


'Boo'
is
happy
today

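There are a couple of wrinkles in the example above worth noting. First, each line returned by readline() includes the newline on the end of it, if there is one; the last line of this file has none, so none is returned. Second, once we've reached the end of the file, readline() returns the empty string, and it continues to do so on every subsequent call.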

Given those two facts, we can write a loop that prints the contents of a file.


the_file = open('D:\\Examples\\data\\myfile.txt', 'r')

while True:
    line = the_file.readline()

    if line == '':
        break
    elif line.endswith('\n'):
        line = line[:-1]

    print(line)

the_file.close()

The one trick that might seem strange there is this part: line = line[:-1]. Recall that this is slice notation, which can legally be done on strings (and returns a substring of that string). When line is a string, the expression line[:-1] gives you back a string containing everything but the last character of line. We're using this technique to eliminate the newline if it's there.
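If the slicing trick still feels slippery, it can help to isolate it in a small function and try it on a few strings. The function name below is made up for this sketch; it does the same thing as the elif branch in the loop above.

```python
def without_trailing_newline(line):
    # line[:-1] is everything except the last character, so we only
    # want to use it when that last character is a newline.
    if line.endswith('\n'):
        return line[:-1]

    return line


stripped_middle_line = without_trailing_newline('happy\n')
stripped_last_line = without_trailing_newline('today')
stripped_blank_line = without_trailing_newline('\n')

print(repr(stripped_middle_line))   # 'happy'
print(repr(stripped_last_line))     # 'today'
print(repr(stripped_blank_line))    # ''
```

Note that a blank line in the middle of a file comes back from readline() as '\n', not as '', which is why the empty string can safely be used to mean "end of file."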

Iterating through every line of a file is so common that there are a couple of techniques that automate it. One of them is the method readlines(), which returns all of the lines of the file at once, instead of one at a time; what you get back is a list of strings.


>>> the_file = open('D:\\Examples\\data\\myfile.txt', 'r')
>>> the_file.readlines()
    ["'Boo'\n", 'is\n', 'happy\n', 'today']
>>> the_file.close()

The good news is that readlines() provides a single method you can call to read an entire text file in a form that can be handy to use: as a list of its lines. However, the bad news is that it reads the entire file into that list at once. If you have no need to store the entire file at once, you might instead want to process one line at a time. It turns out that file objects that read text can be iterated using a for loop, in which case there is one iteration of the loop for each line of text. As with readline() and readlines(), you'll get the newline on the end of each line. Using that technique, we could rewrite our loop that prints the contents of a file more simply this way.


the_file = open('D:\\Examples\\data\\myfile.txt', 'r')

for line in the_file:
    if line.endswith('\n'):
        line = line[:-1]

    print(line)

the_file.close()
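The same for loop can be wrapped in a function that collects the lines instead of printing them. This is only a sketch: the function name read_lines is an assumption of mine, and the example builds its own temporary copy of the file using the tempfile module so that it can run on any operating system.

```python
import os
import tempfile


def read_lines(path):
    # Returns a list of the lines of text in the file at the given
    # path, with the newline removed from the end of each one.
    lines = []
    the_file = open(path, 'r')

    for line in the_file:
        if line.endswith('\n'):
            line = line[:-1]

        lines.append(line)

    the_file.close()
    return lines


# Build a throwaway file with the same contents as the example above.
temp_dir = tempfile.TemporaryDirectory()
example_path = os.path.join(temp_dir.name, 'myfile.txt')

setup_file = open(example_path, 'w')
setup_file.write("'Boo'\nis\nhappy\ntoday")
setup_file.close()

result = read_lines(example_path)
print(result)    # ["'Boo'", 'is', 'happy', 'today']

temp_dir.cleanup()
```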

Writing text to a file

Opening a file to write text into it is similar to how we opened it for reading; the only difference is the second argument we pass to open().


>>> output_file = open('D:\\Examples\\data\\stuff.txt', 'w')

The file object you'll get back from open() turns out to have the same type.


>>> type(output_file)
    <class '_io.TextIOWrapper'>

However, it is configured differently, expecting to write text into the file instead of read from it. In fact, we can ask a file object what its mode is, which means whether it is intended to read or write, by accessing its mode attribute. (Note that mode is not a method, so we don't put parentheses after its name. It's not something we can call; it's more akin to a variable that lives inside the object.)


>>> output_file.mode
    'w'

Once you've got a file object whose mode is writing, you can call the write() method to write text into it. You can pass only one argument to write(), a string, and whatever text is in that string will then be written to the file. Newlines aren't added by default, so if you want them, you'll need to include them in that string. Each call to write() returns the number of characters that were written, which is why the shell shows a number after each call.


>>> output_file.write('Hello there\n')
    12
>>> output_file.write('Boo is ')
    7
>>> output_file.write('perfect today')
    13
>>> output_file.close()

After writing this text and closing the file, the file's contents will be:


Hello there
Boo is perfect today

Note that when you've written text to a file, closing the file when you're done becomes more than just good hygiene; it's essential to writing a program that works. It turns out that there's more going on than meets the eye when you write to a file. Writing data into a file on, say, a hard disk involves a fair amount of overhead, so that writing a lot of it isn't much slower than writing only a tiny amount. (It takes longer for a storage device to decide where to write it than the actual writing, as it turns out.) For this reason, file objects use a technique called buffering, which means that they don't write the text immediately. Instead, they store it internally in what's called a buffer. Once in a while, when there's enough text stored in the file object to make it worth writing to the file, the text is written and the buffer is emptied. If you're writing a lot of text to a file, but writing it a little bit at a time, this can make the entire process significantly faster, because the overhead of all of the tiny writes is eliminated.

The problem is that the buffer is only written when it's explicitly flushed. (Flushing is the act of taking the text in the buffer and writing it into the file, then emptying the buffer.) One thing that causes a buffer to flush is when the buffer's capacity is exceeded; that happens automatically at some point. But when you're done writing to the file, there will probably be text still in the buffer. One of the things that happens when you close a file is that the buffer is flushed. So, you'll really want to be sure that you close files when you're done with them, particularly when you're writing to them; otherwise, the text that was buffered but never flushed to the file may never appear in the file, even though your program ran to completion with no apparent errors.
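One way to convince yourself that the text really arrived is to write it, close the file, and then open the same file again for reading. The sketch below does that in a temporary directory created with the tempfile module, so the path is not the D:\Examples one used above; that substitution is my own, made so the example runs anywhere.

```python
import os
import tempfile

temp_dir = tempfile.TemporaryDirectory()
output_path = os.path.join(temp_dir.name, 'stuff.txt')

# Write the text in three small pieces, as in the shell session above.
output_file = open(output_path, 'w')
mode_while_writing = output_file.mode
output_file.write('Hello there\n')
output_file.write('Boo is ')
output_file.write('perfect today')

# Closing the file flushes whatever is still sitting in the buffer.
output_file.close()

# Now open the same file again, this time to read what was written.
input_file = open(output_path, 'r')
lines_written = input_file.readlines()
input_file.close()

print(mode_while_writing)   # w
print(lines_written)        # ['Hello there\n', 'Boo is perfect today']

temp_dir.cleanup()
```

Notice that the second and third calls to write() ended up on the same line of the file, because no newline was written between them.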


Paths

You've no doubt seen before that each file on a file system has a name, which we quite often call its filename; the filename is one piece of metadata associated with each file. But there's more to identifying a file than its name. Because there are so many files stored on a typical file system — as I was writing this originally, I asked Windows to count how many files are stored on my laptop and found that the answer was about 700,000! — there needs to be some way to keep them all organized, so we can find files not only by their names, but by some other sort of categorization. While operating systems are gradually adding progressively better search capabilities, there is still an underlying reality that hasn't changed much in the last few decades: File systems are quite often a hierarchy of directories, with each directory containing both files and other directories. So, if we want to uniquely identify a file on our storage devices, we have to specify not only the file's name, but also where the file is stored in that hierarchy; without knowing more about the location of the file, the file system won't easily be able to find it. The location of a file is identified uniquely using a path.

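Different operating systems use different conventions for paths. The two most common are Windows paths, which typically begin with a drive letter followed by a colon and use backslashes to separate directories (e.g., D:\Examples\data\myfile.txt), and POSIX paths, used on macOS and Linux, which begin at a single root directory named / and use forward slashes to separate directories (e.g., /home/boo/data/myfile.txt).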

The complexities of these rules aside, the important thing to realize is that there are slightly different rules on different operating systems, though the ideas are similar on all of them: that directories form a recursive hierarchy (directories containing other directories, which contain other directories) is fairly standard — even mobile operating systems like Android and iOS have this notion, albeit more or less invisible to users — and the differences are mainly minor details. Still, if we want to write Python programs that work with file systems correctly regardless of operating system, as you're doing in Project 1, you're best off using the right kinds of tools for the job, so that you won't find yourself making assumptions (such as the character that separates directories in a path) that are correct on one operating system and wrong on another.


Finding what you need in the Python Standard Library

Python is not just a programming language. When you install Python, you also get the Python Standard Library, which is a large collection of pre-built components that solve a wide variety of commonly-occurring real-world programming problems, so that you won't have to. We've talked before about the benefits of using a library that already exists, especially one that's been in use for a long time by a large number of people. But, as a practical matter, there is still a problem to be dealt with. When you have a problem to solve, how do you know whether the Python Standard Library solves it? And how do you find the right component to use?

Of course, one way to solve that problem is simply to use an Internet search engine and poke around online to see what information you can find. This is a fine approach sometimes, but what you'll find is that you have to develop a sense of what information is believable and reasonable, and what information isn't. You also have to be wary of when advice is correct but inapplicable, such as someone offering details of Python 2 — an older version of Python that is nonetheless still used quite widely — when you're using Python 3.

But I would suggest not giving in to the Google urge immediately. Especially when you're first learning, there's value in spending a little time hunting for your own solutions to problems. You tend to find solutions not only to the problem you have now, but to five related problems you don't realize you have yet. You begin to develop a sense for what kinds of things you would find in the Standard Library, and notice commonalities in the way those components are designed, which can help you write better code yourself. Learning the "lay of the land" in computing takes time, and there's no short-circuiting that process. When you talk to people that seem to have these things all figured out already, you don't realize just how much time those people have put into learning their craft. Don't worry; you'll get to that point, too, but you'll have to put in that time.

So, I suggest starting by taking a look through the Python Standard Library. Go to the front page of the library documentation, which you'll find here:

Look through the table of contents. Don't feel like you have to memorize everything you see, and don't worry if many of the terms are things you don't recognize. If you see things that you're curious about and you're not in a huge hurry, satisfy your curiosity and take a quick look at them; in so doing, you'll find yourself learning all kinds of terms that you haven't heard before. But, fundamentally, what you're looking for are things that can help you with the particular problem you have. So, for example, if you're looking for components that might help you deal with files, filenames, paths, and the like, then see which modules in the Python Standard Library sound like they might apply. (Go ahead. Check some of them out now! I'll wait...)

There are several modules in the Python Standard Library that should stand out as you look through the list. One whole section of the library contains tools used for File and Directory Access, which sounds like it should definitely have something to do with the problem at hand. You might also notice a section of the table of contents titled Generic Operating System Services, too, which might contain useful tools, since file systems are part of the operating system.

Importing a module

Most Python programs aren't entirely self-contained, in the sense that not all of the code that comprises the program is written in a single script. Instead, it is most often the case that the Python scripts you write will need to use code that's written in other modules. A Python module is similar to what we've been calling a Python script; you write modules as text in files whose names end in .py, just like scripts, and modules consist of Python code. The difference is that modules, on their own, don't do anything. Their job is to provide tools to programs, not themselves to be programs.

When you want to make use of code in another module, you'll need to first import it. Importing a module makes the tools provided by that module — functions, types, etc. — available in the module where it was imported. You import a module using a statement called import.


>>> import math
>>> math.sqrt(9)
    3.0
>>> math.gcd(15, 21)
    3

Note that import makes every definition in the module available, but each name must be qualified by the name of that module. In the example above, we imported the math module, which is part of the Python Standard Library. This made a variety of functions, such as sqrt() and gcd(), available to us. However, to call those functions, we had to precede their names with the name of the module (math) and a dot. I should point out that this, in and of itself, is not a bad thing; the acronym gcd might mean different things in different contexts, but math.gcd more clearly looks like what it is (greatest common divisor). More often than not, I'll import modules this way, and I'll use their names whenever I use the definitions inside of them; I'm less concerned about typing less and more concerned about being able to read a program later and quickly understand it.

That said, there is a way to import a definition from a module and make it directly available without qualification, using a variant of the import statement that you might call from ... import. For example, if we wanted to import the sqrt() function from the math module this way, we might do this.


>>> from math import sqrt
>>> sqrt(9)
    3.0

I do this somewhat more rarely, reserving it for cases where I think the name of what I'm importing makes its meaning pretty self-evident.
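It's worth seeing that the two styles of import give you access to the very same function; the only difference is the name you use to reach it. The variable names in this short sketch are made up for illustration.

```python
import math
from math import sqrt

# The qualified and unqualified names lead to the same function.
qualified_result = math.sqrt(9)
unqualified_result = sqrt(9)
same_function = sqrt is math.sqrt

print(qualified_result)     # 3.0
print(unqualified_result)   # 3.0
print(same_function)        # True
```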

One thing I never do, but that is supported in Python, is to import every definition in a module and allow it to be used without qualification.


>>> from math import *
>>> sqrt(9)
    3.0
>>> gcd(15, 21)
    3

The main reason that I never do this is because the meaning of a program will potentially change every time that other module does. As new versions of Python are released, new functions may be added to the math module. (For example, the gcd function didn't exist until Python 3.5, which was released in the summer of 2015.) If I have a long-lived program that says from math import *, the meaning of a name may suddenly change, because I may upgrade Python and suddenly be importing new functions that I didn't used to be importing. I'm a big believer in using techniques that scale and that stand the test of time. So, I avoid anything that makes it more difficult to write large programs, and I also avoid anything that will allow someone else's change — someone adding a function to a module I didn't write, say — to change the meaning of my program.

The reason it's less dangerous to say from math import sqrt is that I've only imported the one name. Designers of libraries are reluctant to change the meaning of an existing function, so it's pretty safe to assume that math.sqrt will always mean the same thing in Python. If new functions are added to the math library, that won't change the meaning of from math import sqrt, so that's safer ground.


Manipulating paths using Python's pathlib

Having taken a look through the Python Standard Library, you should have noticed a library called pathlib, whose documentation is linked below:

Take a quick look through that documentation, again focusing on getting a broad idea of what's there and what it can do. Don't worry if you don't understand everything you're seeing, and don't worry if you feel like a fish out of water because you've never read documentation like this before. See what things resonate and just get a broad mental picture of what's available. Then you can use the Python shell to experiment with it to find out more about how it works.

Creating a Path object

Path objects represent paths on a file system. They aren't files and they aren't strings; they're paths, which means they are explicitly intended to represent the way that a file system keeps track of where a file is. Since there are substantial commonalities between different kinds of file systems, there can be one kind of object that represents those commonalities. (Interestingly, there are also ways to represent the differences, as we'll see. Path objects handle all of those details, so you won't have to.)

In a Python shell, you first need to import the pathlib module and, specifically, it helps to import the Path type from it, which you can do like this:


>>> from pathlib import Path

(I'm using the from ... import syntax here, because I think the meaning of the word Path is pretty self-evident, while adding pathlib. to the front of it won't really make it any clearer.)

Now, anytime you use the word Path, you're specifically asking for the Path type in the pathlib library. Having imported it, you can now create objects of the Path type.


>>> p = Path('D:\\Examples\\data')
>>> p
    WindowsPath('D:/Examples/data')

Did you notice what happened when we showed the value of p in the shell? Its type appears to have changed! What we created was a Path object, but its type is something else called a WindowsPath!


>>> type(p)
    <class 'pathlib.WindowsPath'>

So, what happened? The answer is that creating a Path object automatically gives you the right kind of Path object depending on what operating system you're running. I ran this example on Windows, which is why I got a WindowsPath; if, instead, you did the same thing on macOS or another operating system that uses POSIX-style paths, you'd get a PosixPath object instead.

Another minor detail to note is that our backslashes got turned into forward slashes. That's mainly because the various Path types endeavor to hide as many of the differences between file systems as possible; internally, when we use Path objects to access the actual file system, they do the right thing in the right circumstance automatically.
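If you're curious what the "other" kind of path looks like without switching computers, pathlib also provides PureWindowsPath and PurePosixPath, which represent each style of path without touching the actual file system, so both can be created on any operating system. The sketch below is only meant for poking around; the POSIX path in it is a made-up example.

```python
from pathlib import PurePosixPath, PureWindowsPath

windows_path = PureWindowsPath('D:\\Examples\\data')
posix_path = PurePosixPath('/Examples/data')

# Printing a path shows it the way its own operating system writes it.
print(windows_path)    # D:\Examples\data
print(posix_path)      # /Examples/data

# The parts attribute breaks a path into its components.
windows_parts = windows_path.parts
posix_parts = posix_path.parts
print(windows_parts)   # ('D:\\', 'Examples', 'data')
print(posix_parts)     # ('/', 'Examples', 'data')

# On Windows, either kind of slash means the same thing.
slashes_agree = PureWindowsPath('D:/Examples/data') == windows_path
print(slashes_agree)   # True
```

In everyday code, you would normally just say Path and let Python choose the right kind for you.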

What can you do with Path objects?

There are all kinds of useful things you can do with Path objects. This isn't an exhaustive list, but the examples below should give you an idea of what's available and how to use them.

When you manipulate paths, you often find yourself combining them together. That's easily done using the Path type; the / operator, when used between two Paths, combines the two paths together into a single one.


>>> p = Path('D:\\Examples\\data')
>>> q = p / Path('myfile.txt')
>>> q
    WindowsPath('D:/Examples/data/myfile.txt')

You can do the same thing to combine a Path with a string, too, which can be a handy shorthand.


>>> r = p / 'test.txt'
>>> r
    WindowsPath('D:/Examples/data/test.txt')
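The / operator can be chained, too, so a path can be built up a piece at a time. Here is a short sketch using relative paths, chosen so that the results are the same on any operating system; the variable names are made up for illustration.

```python
from pathlib import Path

# Build up a path one piece at a time.
base = Path('Examples') / 'data'
full = base / 'myfile.txt'

# The parts attribute shows the components that were combined.
full_parts = full.parts
print(full_parts)    # ('Examples', 'data', 'myfile.txt')

# Combining a Path with a Path or with a string gives equal results.
with_path = base / Path('myfile.txt')
print(with_path == full)    # True
```

Notice that no slash of either kind appears in this code; the Path objects decide which separator to use.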

If you want to know if a Path object represents something that actually exists on your hard drive, you could just ask it:


>>> p.exists()
    True

If you want to know if it's a directory or a file, you can ask it those things, as well.


>>> p.is_file()
    False
>>> p.is_dir()
    True

If it's a file, you might like to open it; that's supported, too. Opening a Path is a lot like the built-in open function you've probably seen previously, except that it doesn't need a parameter specifying the location of the file, since the Path acts as that location already.


>>> f = q.open('r')
>>> f.readlines()
    ["'Boo'\n", 'is\n', 'happy\n', 'today']
>>> f.close()
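Putting those questions together, here is a sketch that asks the same things of a directory, of a file inside it, and of a path that doesn't exist at all. It creates its own temporary directory with the tempfile module so that it runs anywhere; the filenames are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path

temp_dir = tempfile.TemporaryDirectory()
p = Path(temp_dir.name)
q = p / 'myfile.txt'
missing = p / 'no_such_file.txt'

# Create the file by opening the Path for writing.
f = q.open('w')
f.write('Boo\nis\nhappy\ntoday')
f.close()

directory_answers = [p.exists(), p.is_dir(), p.is_file()]
file_answers = [q.exists(), q.is_dir(), q.is_file()]
missing_answers = [missing.exists(), missing.is_dir(), missing.is_file()]

print(directory_answers)   # [True, True, False]
print(file_answers)        # [True, False, True]
print(missing_answers)     # [False, False, False]

# Open the same Path again, this time for reading.
f = q.open('r')
lines = f.readlines()
f.close()
print(lines)               # ['Boo\n', 'is\n', 'happy\n', 'today']

temp_dir.cleanup()
```

A Path can describe a place where nothing exists yet, which is exactly what you need when you're about to create a new file.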

If it's a directory, you might like to know what's in it. A method called iterdir can tell you the answer to that, though there's one wrinkle: It returns something called a generator (which is a story for ICS 33), but you can easily turn a generator into a list by simply calling the built-in function list.


>>> list(p.iterdir())
    [WindowsPath('D:/Examples/data/myfile.txt'), WindowsPath('D:/Examples/data/test.txt')]

Or you could iterate through the result with a for loop, rather than making a list out of it.


>>> for x in p.iterdir():
...     print(x)
...
    D:\Examples\data\myfile.txt
    D:\Examples\data\test.txt
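Combining iterdir with is_file and is_dir lets you sort out what a directory contains. The sketch below builds a small temporary directory of its own, so the names in it are made up. Since I wouldn't count on iterdir returning things in any particular order, the results are sorted before being printed.

```python
import tempfile
from pathlib import Path

temp_dir = tempfile.TemporaryDirectory()
p = Path(temp_dir.name)

# Create two empty files and one subdirectory to look at.
for filename in ['myfile.txt', 'test.txt']:
    f = (p / filename).open('w')
    f.close()

(p / 'subdir').mkdir()

file_names = []
dir_names = []

for entry in p.iterdir():
    # entry is a Path, so we can ask it the usual questions.
    if entry.is_file():
        file_names.append(entry.name)
    elif entry.is_dir():
        dir_names.append(entry.name)

file_names.sort()
dir_names.sort()

print(file_names)   # ['myfile.txt', 'test.txt']
print(dir_names)    # ['subdir']

temp_dir.cleanup()
```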

There are lots of other things you can do with Path objects — get a filename, get a filename's extension, get a path's "parent", and so on — and I'd encourage you to take a look through the documentation to see what's available.
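As a quick taste of those, here is a sketch of a few attributes you'll find in the pathlib documentation: name, suffix, stem, and parent. It uses a relative path so that the results are the same on any operating system.

```python
from pathlib import Path

q = Path('Examples') / 'data' / 'myfile.txt'

filename = q.name        # the last component of the path
extension = q.suffix     # the extension, including the dot
stem = q.stem            # the filename without its extension
parent = q.parent        # the path of the containing directory

print(filename)     # myfile.txt
print(extension)    # .txt
print(stem)         # myfile
print(parent == Path('Examples') / 'data')    # True
```

Like mode on a file object, these are attributes rather than methods, so there are no parentheses after their names.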

Why use pathlib and not os.path?

The pathlib library is a relatively recent addition to the Python Standard Library, having been added in version 3.4. Prior to that, there was a library called os.path, which still exists but is a lot less useful. It provides functions that manipulate strings instead of Path objects, but that still hide some of the differences between file systems, such as joining the parts of a path with the right kind of slashes depending on what operating system you're using.

But using strings and manipulating them with os.path nonetheless leaves you with a lot of room to make mistakes. Strings are simply text; you can store any text you want in a string, whether it's a valid path or not. So, you're better off using a tool that was built for the job at hand: When you want to manipulate paths, use the pathlib library.
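To see the difference side by side, here is a sketch that builds the same path both ways. The relative path is the same made-up one used for illustration earlier; what matters is the type of each result.

```python
import os.path
from pathlib import Path

# os.path works with ordinary strings.
joined_string = os.path.join('Examples', 'data', 'myfile.txt')

# pathlib works with Path objects.
joined_path = Path('Examples') / 'data' / 'myfile.txt'

string_type_name = type(joined_string).__name__
print(string_type_name)    # str

# Both approaches describe the same location.
describe_same_place = Path(joined_string) == joined_path
print(describe_same_place)    # True

# With os.path, getting the filename is a function applied to a string;
# with pathlib, it's something the path itself knows.
name_from_string = os.path.basename(joined_string)
name_from_path = joined_path.name
print(name_from_string == name_from_path)    # True
```

Since joined_string is just a str, nothing stops you from accidentally passing it somewhere that expects a person's name or a line of text, while a Path object makes its purpose clear wherever it goes.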