A Quick Note on Simple and Efficient File Reading

Reading information from text files is a common and important operation in Python. In this lecture we will first discuss various options for reading files, and characterize them based on simplicity and efficiency (mostly with regard to efficiency in the use of space/memory, which is important when reading very large files). In a nutshell, you should avoid using the .read() and .readlines() methods, and instead directly iterate over open file objects (using a standard for loop or a for loop in a comprehension). Doing so requires less code and consumes less memory; this simple method is almost always the right way to read files.

The second section in these notes defines the parse_lines generator function (available in the goody module) and discusses how to use it: it supplies a general and easy way to read files whose lines are records: a fixed number of fields of values, possibly of different types. One parameter binds to a tuple/list of function objects, each specifying how to convert the field in its position (a substring of each line) into a value of the appropriate type. We will discuss how this function is implemented when we discuss generators (notice that parse_lines uses a yield statement, not a return statement) in Week #4.

The third section discusses binary (non-text) files. There are many uses of binary files: here we base our presentation on the standard pickle module, which easily allows programmers to store data structures (typically large/complex ones, that a program spends a long time building) into files; these files can then be easily (and quickly) read in a subsequent program, restoring the contents of the complicated data structure.

------------------------------------------------------------------------------
Simple and Space Efficient Code:

To start, let's suppose we are reading a file where each line in the file is just a string of text. The simplest and most efficient way to read such a file is iterating over an "open" (file) object.

  for line in open(file_name):  # where file_name is a string naming a file
      process(line)             # where process is a function or some code block

Typically, we need to strip off the newline character at the end of each line as it is read from the file, before it is processed further. We can strip off this information by calling the .rstrip('\n') method (right strip). The code would become

  for line in open(file_name):
      process(line.rstrip('\n'))

or

  for line in open(file_name):
      line = line.rstrip('\n')  # re-bind line (also rebound on next loop iteration)
      process(line)

Technically, the os module binds the name linesep to the character(s) forming a newline on that operating system: for PCs os.linesep is '\r\n'; on Macs it is '\n'. So, we could call .rstrip(os.linesep) on the line. But when Python reads a line from a text file, it converts any newline character(s) to just the single character '\n', regardless of the operating system. This makes it easier to write Python code that works on all operating systems.

Note that if we call .rstrip() with no arguments, all white-space characters (including the newline character(s), spaces, and tabs) are stripped from the right end of the string. We can use this simpler alternative if we don't meaningfully process whitespace at the ends of lines, but if we need to preserve all text but the newline character at the end, we must call .rstrip('\n').
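To see the difference between these two calls, here is a small hedged example (the string literal is invented for illustration):

  line = 'some text  \n'
  print(repr(line.rstrip()))      # prints 'some text'  : ALL trailing whitespace gone
  print(repr(line.rstrip('\n')))  # prints 'some text  ': trailing spaces preserved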
Recall that there are NO MUTATOR methods on strings: so, when we call the line.rstrip(...) method, it DOES NOT MUTATE the string object associated with line, but instead it produces a reference to a NEW STRING object that has the same contents, except with the requested character(s) stripped off its right end; we pass this new string object as an argument to the process function, or we rebind line to that new string.

  CORRECT                          INCORRECT
  for line in open(file_name):     for line in open(file_name):
      line = line.rstrip('\n')         line.rstrip('\n')  # LINE HAS NO EFFECT!
      process(line)                    process(line)

The for loops in the code fragments above are all space efficient, because at any time Python stores only one line of the file in memory (although the file itself is likely cached/stored in a memory buffer).

Note that if we wanted to create a list of lines read from the file (where the newline character(s) are removed from each string), we can write the following simple comprehension.

  line_list = [line.rstrip('\n') for line in open(file_name)]

Note that in this example, line_list occupies about the same amount of space as the file: all lines are stored in memory (one per list entry) at the same time. If we can process lines INDEPENDENTLY of each other (process each one without needing to know the contents of any previous or subsequent lines) we should NOT write code to store all the lines in a list, because it is not space efficient. We often process files by iterating through their lines, and an open file is iterable in the same way a list is.

Finally, in ICS-32 you learned that we can use open as a context manager in a with statement, which handles file exceptions and automatically closes the file when the context manager finishes (which is often useful/important, but not always). With the open context manager, we would write the above code fragments as

  with open(file_name) as open_file:
      for line in open_file:
          process(line.rstrip('\n'))

or

  with open(file_name) as open_file:
      line_list = [line.rstrip('\n') for line in open_file]

Using context managers does not change the space efficiency of the file reading. We will study context managers so that you can write your own during Week #3.

----------------------------------------
The .readlines() and .read() methods: Less Simple and Less Efficient

We can apply the .readlines() method to an open file: it produces a list of all the lines in the file, where each line still ends in the newline character '\n'. So, if open_file refers to the following open file,

  Line 1
  Line 2
  Line 3

calling open_file.readlines() returns the list ['Line 1\n', 'Line 2\n', 'Line 3\n'].

If we wanted to call .readlines() and process every string in the file (without the '\n' characters at the end) we would write

  for line in open(file_name).readlines():
      process(line.rstrip('\n'))

or

  for line in open(file_name).readlines():
      line = line.rstrip('\n')  # re-bind line (also rebound on next loop iteration)
      process(line)

Notice that this code is LONGER than the loops written in the previous section, and it is LESS SPACE EFFICIENT, because it first computes a list of all the lines in the file (storing it in memory along with the file) and then it iterates over the strings in that list; the loop in the previous section stores in memory only one line at a time from the file, while it processes that line, not an entire list of lines.
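We can observe this difference concretely. Below is a hedged sketch (the file name 'big.txt' is hypothetical; it should name a large text file) that uses the standard tracemalloc module to report the peak memory each style allocates; the readlines version's peak grows with the file size, while the iterating version's does not.

  import tracemalloc

  def count_long_lines_iterating(file_name):
      # one line of the file is in memory at a time
      return sum(1 for line in open(file_name) if len(line.rstrip('\n')) > 40)

  def count_long_lines_readlines(file_name):
      # the entire list of lines is in memory at once
      return sum(1 for line in open(file_name).readlines() if len(line.rstrip('\n')) > 40)

  for counter in (count_long_lines_iterating, count_long_lines_readlines):
      tracemalloc.start()
      counter('big.txt')                     # hypothetical large text file
      print(counter.__name__, 'peak bytes:', tracemalloc.get_traced_memory()[1])
      tracemalloc.stop()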
If we wanted to create a list of lines without the '\n' characters at the end by calling .readlines(), we could write

  line_list = [line.rstrip('\n') for line in open(file_name).readlines()]

which is similar to, but also MORE COMPLICATED THAN, the code in the previous section that does the same thing. This code fragment takes TWICE THE AMOUNT OF SPACE to create line_list, because calling .readlines() creates a list of all the lines in the file, and then the comprehension creates another/second list of the lines without '\n'. It is INEFFICIENT.

Finally, if we wanted to compute a list of lines WITH all the '\n' characters at the end, then calling

  line_list = open(file_name).readlines()

is simpler than the comprehension code below, which does this same task.

  line_list = [line for line in open(file_name)]

It also occupies the equivalent amount of storage. So this is one of the few examples (and not really a common one) where calling the .readlines() method is useful.

----------

We can also apply the .read() method to an open file: it produces one giant string that contains all the lines in the file, each ended by the newline character '\n'. So, if open_file refers to the following open file,

  Line 1
  Line 2
  Line 3

calling open_file.read() returns the string 'Line 1\nLine 2\nLine 3\n'.

We can split this string into a list of strings by calling the .split method.

  line_list = open(file_name).read().split('\n')

and this code is a bit simpler than the comprehension we have seen before, which does (almost) the same task:

  line_list = [line.rstrip('\n') for line in open(file_name)]

(one subtle difference: if the file ends with a '\n', the .split('\n') call produces an extra empty string at the end of line_list). But the .read().split('\n') code above is LESS SPACE EFFICIENT than the simpler comprehension, because it stores BOTH the ENTIRE FILE AS A STRING and a LIST OF ALL THE LINES IN THE FILE at the same time; the comprehension stores the list of all the lines in the file, but not a string whose contents is the entire file itself.

Likewise, if we wanted to process every string in the file (without the '\n' characters at the ends), we can write

  for line in open(file_name).read().split('\n'):
      process(line)

This for loop code is MORE COMPLICATED than the for loop in the first section, and it USES SPACE MUCH LESS EFFICIENTLY: this code stores the entire file (and a list of lines in the file) in memory at one time; the loop in the first section stores in memory only one line of the file at a time. Note that because of the call to split after read, there is no need for a call to rstrip inside the loop: this code breaks the big string into a list of lines by splitting on (and removing) the '\n' characters at the end of each line.

Bottom Line: There is little to be gained when reading files by calling the .readlines() or the .read() method. The simplest and most space efficient way to read a file is to iterate directly over the "open" file with a standard for loop, or a for loop inside a comprehension. You should use this simplest/most efficient form in your code to receive full credit.

------------------------------------------------------------------------------
Reading Files and Parsing their Contents:

Some text files contain lines that store other types or mixed-types of information. Suppose that we wanted to read a text file that stored strings representing numbers (one number per line). We can easily rewrite our original code to the following, calling the int conversion function on each rstripped line.

  for line in open(file_name):
      process( int(line.rstrip('\n')) )

Here we are assuming process takes an integer value as an argument.
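For example, here is a self-contained hedged sketch of this pattern (the file name 'numbers.txt' and the summing task are made up for illustration):

  total = 0
  for line in open('numbers.txt'):     # hypothetical file: one integer per line
      total += int(line.rstrip('\n'))  # convert each stripped line to an int
  print('sum =', total)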
In some files each line is a "record": a fixed number of fields of values, with possibly different types, separated by some special character (often a space or punctuation character like a comma or colon). To process each record in a file, we must

  (1) read its line
  (2) separate its fields of values (still each value is a string)
  (3) call a conversion function for each string to get its value

The goody module contains the parse_lines function (technically a generator function, which we will study in Week #4 in ICS-33) that easily supports reading records from files (similarly to how lines are read from "open" files). We can define this (generator) function simply as follows (don't worry about how it is defined now, because you likely don't know what generator functions are, but I want to illustrate that the code is simple for the behavior that I describe below).

  def parse_lines(open_file, sep, conversions):
      for line in open_file:
          yield [conv(item) for conv,item in zip(conversions,line.rstrip('\n').split(sep))]

Here sep is the special character used to separate the fields in the record; conversions is a tuple (or list: technically it can be anything that is iterable) of function objects: they are applied in sequence to the string values extracted from the separated fields. When we iterate over a call to parse_lines (similar to iterating over a call to "open"), the index variable is bound to a list of the values of the fields in the record.

For example, the following file contains fields of a name (str) followed by two test scores (ints), all separated by commas (like a .csv file in Excel).

  Bob Smith,75,80
  Mary Jones,85,90

We could read this file and print out the names of each student and their average test score by

  for fields in parse_lines( open(file_name), ',' , (str,int,int) ):
      print(fields[0], (fields[1]+fields[2])/2)

Here fields is repeatedly bound to a 3-list containing a name (str) followed by two test scores (ints): fields is first bound to ['Bob Smith', 75, 80] and then to ['Mary Jones', 85, 90].

-----Start: Details of behavior with bad arguments

1) Note that if we specified conversions as (str,int) it would produce the 2-lists ['Bob Smith', 75] followed by ['Mary Jones', 85] (because looping over a zip stops when one of its arguments runs out of values: here each line contains more field values than conversion functions). If we supplied just two conversion functions, accessing fields[2] in the code above would raise an IndexError exception.

2) Likewise (because looping over a zip stops when one of its arguments runs out of values), if a line contains a name and 3 integer values, only the name and first two integers would be returned in the 3-list: the line

  Paul White,80,75,85

returns only the 3-list ['Paul White', 80, 75]. So parse_lines would not raise any exceptions in the code above; instead it incorrectly reads the file contents with no warning. We could define a more complicated parse_lines function that checked and immediately raised an exception if the number of separated field values in a record was not equal to the length of the tuple of conversion functions (a sketch appears after the unpacking example below).

-----End: Details of behavior with bad arguments

A simpler way to write such code is to use multiple index variables and unpacking (as we do when we write: for k,v in adict.items()). I have found that students often don't understand the power and simplicity of unpacking; you should use this simple Python feature in your code to receive full credit. Unpacking is covered in detail in the review lecture notes.

  for name, test1, test2 in parse_lines(open(file_name),',',(str,int,int)):
      print(name, (test1+test2)/2)

With this for loop, the first error noted above would also raise an exception because there would not be three values to unpack into name, test1, and test2; the second error would again go unnoticed.
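Here is the hedged sketch promised above: a stricter variant of parse_lines (the name strict_parse_lines is invented here and is not part of the goody module) that raises an exception when a record has the wrong number of fields.

  def strict_parse_lines(open_file, sep, conversions):
      conversions = tuple(conversions)          # fix a length for any iterable argument
      for line in open_file:
          fields = line.rstrip('\n').split(sep)
          if len(fields) != len(conversions):   # field count must match exactly
              raise ValueError('wrong number of fields in record: ' + repr(line))
          yield [conv(item) for conv, item in zip(conversions, fields)]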
Finally, note that besides using the standard conversion function(s) like str and int, we can define our own more complicated conversion function(s). For example, suppose that each record in the file specified a string name, some number of int quiz results separated by colons, and an int final exam, with these three fields (name, quizzes, final) separated by commas. Such a file might look like

  Bob Smith,75:80,90
  Mary Jones,85:90:77,85

Here Bob took two quizzes but Mary took three. We can process both lines by defining

  def quiz_list(scores):
      return [int(q) for q in scores.split(':')]

and then write

  for name,quizzes,final in parse_lines(open(file_name),',',(str,quiz_list,int)):
      print(name, sum(quizzes)/len(quizzes), final)

which would print

  Bob Smith 77.5 90
  Mary Jones 84.0 85

Note that 77.5 is (75+80)/2 and 84.0 is (85+90+77)/3. If we instead wrote

  for fields in parse_lines(open(file_name),',',(str,quiz_list,int)):
      print(fields)

it would print the following 3-lists: index 1 of each is a list of quiz scores.

  ['Bob Smith', [75, 80], 90]
  ['Mary Jones', [85, 90, 77], 85]

Of course, we can also use lambdas (also covered in the review lecture) instead of named functions; below we have substituted a lambda for the quiz_list function.

  for name,quizzes,final in parse_lines(open(file_name),',',
                    (str, lambda scores : [int(q) for q in scores.split(':')], int)):
      print(name,quizzes,final)

which prints the same values, although unpacked rather than in 3-lists (because each field is printed as its own argument):

  Bob Smith [75, 80] 90
  Mary Jones [85, 90, 77] 85

------------------------------------------------------------------------------
Text and Binary Files: Pickling

Python programs (.py files) are actually text files that are read by Python itself. Some Python programs also read or write (text and/or binary) files when executed. In this section we will briefly survey binary files (comparing them to text files) and how to use binary files with the pickle module to save the state of complicated data structures that a program creates, so they can be efficiently stored and loaded (read back) into a subsequent program.

Text files contain ASCII (really Unicode, but the distinction is not important here) characters. We can use standard text-editors (like the one in Eclipse) to create/examine/update text files. We have seen how to read strings from the lines in a text file (throughout this lecture note) and convert these strings into other types (in the previous section). We might store a float value in a text file as the characters "2.99792458E8". If we store each character as one byte (8 bits) of information, then it takes 12 bytes to store this float; in addition, if we call the float(...) conversion function after reading this string (to get its true float value: a value on which we can perform arithmetic) the conversion takes additional time. But all float values are stored in a special 64 bit format (equivalent to 8 bytes).
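To make the 8-byte claim concrete, here is a small hedged illustration using the standard struct module (which is not otherwise used in these notes):

  import struct
  as_text  = '2.99792458E8'                  # 12 characters: 12 bytes in a text file
  as_bytes = struct.pack('d', 2.99792458E8)  # 'd' packs a 64-bit (8 byte) double
  print(len(as_text), len(as_bytes))         # prints: 12 8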
Instead of converting a float value to a string, storing it in a file, reading it back from the file, and converting it back into a float (a lengthy process, needing 12 bytes of storage in the file for this example), we can write a float value directly into a (binary) file using only 8 bytes, and then read back those 8 bytes, with no conversion when writing or reading the file. So we gain efficiency (time and space) storing information in binary files; what we give up is the ability to easily "see" the contents of files: although we can still "edit" them with a text editor, they look garbled. For example, the text editor would interpret the 8 bytes in a float number as 8 characters (which they are not, and so the characters would look very weird)!

The topic of data representation in computers/files at the lowest level (bits and bytes) is covered in depth in ICS-51. Once you have such knowledge, you can easily UNDERSTAND how Python can read and write binary files. For now, we will just illustrate USING binary files in conjunction with the pickle module, to do something useful and important, that can be understood on its own.

For the discussion below of the pickle module (from the standard Python library) we will focus on two useful functions, which we will describe and illustrate briefly (avoiding unnecessary -but still interesting- details).

  pickle.dump(object, open-file)  # called for effect; returns None
  pickle.load(open-file)          # returns value of pickled data structure

Here is a short program that stores the pickled version of a Python dictionary. When files are opened with only one argument (a name), the second argument by default is 'r', which means r(ead). For the code below, we must explicitly specify a second argument 'wb'.

  import pickle
  adict = dict(a=1,b=2,c=3)                     # create a dict
  with open('pickletest.dat', 'wb') as storage: # file mode is w(rite)b(inary)
      pickle.dump(adict,storage)                # store dict in a binary file

When this code executes, it writes a 38 byte binary file named pickletest.dat, storing the 3-item dictionary. Run this code and then try to load this file into a text-editor and observe the results.

We can read the data file for this dictionary using the following program.

  import pickle
  with open('pickletest.dat', 'rb') as storage: # file mode is r(ead)b(inary)
      adict = pickle.load(storage)              # restore dict from file
  print(adict)                                  # print original/pickled dict

which prints: {'a': 1, 'b': 2, 'c': 3}

If we needed to pickle more than one data structure, we can put each data structure in a list, pickle the list, then read back the list, and bind each individual data structure (maybe even using unpacking); a sketch of this idea appears at the end of this section.

The documentation says the following can be pickled

  1) None, True, and False
  2) integers, floating point numbers, complex numbers
  3) strings, bytes, bytearrays
  4) tuples, lists, sets, and dictionaries containing only picklable objects
  5) functions defined at the top level of a module (using def, not lambda)
  6) built-in functions defined at the top level of a module
  7) classes that are defined at the top level of a module
  8) instances of such classes whose __dict__ or the result of calling
     __getstate__() is picklable

Attempts to pickle unpicklable objects will raise the PicklingError exception.

This explains only the tip of the pickling iceberg, but it encapsulates some interesting aspects of pickling and binary files that you might find useful when you write programs.
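Here is the hedged sketch mentioned above, pickling two data structures at once by putting them in one list (the file name 'both.dat' and the data are made up for illustration):

  import pickle
  adict = dict(a=1, b=2, c=3)
  alist = [10, 20, 30]
  with open('both.dat', 'wb') as storage:
      pickle.dump([adict, alist], storage)  # pickle one list holding both structures

  with open('both.dat', 'rb') as storage:
      adict, alist = pickle.load(storage)   # unpacking rebinds each structure
  print(adict, alist)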
------------------------------------------------------------------------------
Questions:

1) Suppose that we want to process the lower case version of every word on every line (where the words on a line are separated by spaces) in a file named file.txt. Which of the following code fragments correctly does so? For those that don't, explain why they fail. For example, if the file contained the three lines:

  See spot
  See spot run
  Run spot run

it should process the following words in the following order: 'see', 'spot', 'see', 'spot', 'run', 'run', 'spot', 'run'.

  for line in open('file.txt'):
      for word in line.rstrip().lower().split():
          process(word)

  for line in open('file.txt'):
      for word in line.rstrip().split().lower():
          process(word)

  for line in open('file.txt'):
      for word in line.lower().rstrip().split():
          process(word)

  for line in open('file.txt'):
      for word in line.lower().split().rstrip():
          process(word)

  for line in open('file.txt'):
      for word in line.split().rstrip().lower():
          process(word)

  for line in open('file.txt'):
      for word in line.split().lower().rstrip():
          process(word)

  for word in open('file.txt').read().lower().split('\n'):
      process(word)

  for word in open('file.txt').read().split('\n').lower():
      process(word)