Path: news.service.uci.edu!csulb.edu!canoe.uoregon.edu!logbridge.uoregon.edu!netnews.com!nntp.abs.net!uunet!dca.uu.net!news.baymountain.net!not-for-mail From: Tim Peters Newsgroups: comp.lang.python Subject: RE: Newbie: finding the key/index of the min/max element Date: Sun, 05 May 2002 05:22:04 -0400 Organization: None Lines: 136 Message-ID: NNTP-Posting-Host: mail.python.org Mime-Version: 1.0 Content-Type: text/plain; charset=Windows-1252 Content-Transfer-Encoding: 7BIT X-Trace: news.baymountain.net 1020593547 11922 63.102.49.29 (5 May 2002 10:12:27 GMT) X-Complaints-To: abuse@baymountain.net NNTP-Posting-Date: 5 May 2002 10:12:27 GMT In-reply-to: <74rA8.41449$8D3.1220071@news1.tin.it> X-MIMEOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal X-Priority: 3 (Normal) X-MSMail-priority: Normal X-Spam-Status: No, hits=-6.0 required=5.0 tests=IN_REP_TO,BODY_PYTHON_ZOPE version=2.11 Errors-To: python-list-admin@python.org X-BeenThere: python-list@python.org X-Mailman-Version: 2.0.10 (101270) Precedence: bulk List-Help: List-Post: List-Subscribe: , List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: Errors-To: python-list-admin@python.org X-BeenThere: python-list@python.org X-Original-To: python-list@python.org Xref: news.service.uci.edu comp.lang.python:199582 [David Eppstein] > ... > I'm pretty sure it would usually be better just to sort the list and > return the middle item, instead of using a linear-time median > finding algorithm, despite the nonlinear big-O time of a sort. Many years ago I had to write a linear-time selector in Pascal. It was so horridly painful I never tried it again. But I'm old enough now to confront my fears , so I tried it in Python, and was pleasantly surprised at how easily it went. Its performance isn't bad, considering it's written in Python, and how much effort went into speeding Python's C sort. Here are the seconds each took for finding the median of various sizes of arrays of random doubles (these are the means of 3 runs at each size): n clever sort ------- ------ ------ 2 0.000 0.000 3 0.000 0.000 5 0.000 0.000 9 0.000 0.000 17 0.000 0.000 33 0.000 0.000 65 0.001 0.000 129 0.002 0.000 257 0.002 0.000 513 0.004 0.001 1025 0.008 0.002 2049 0.018 0.004 4097 0.037 0.009 8193 0.088 0.020 16385 0.166 0.047 32769 0.303 0.106 65537 0.638 0.236 131073 1.276 0.528 262145 2.579 1.152 524289 5.245 2.523 1048577 10.674 5.460 2097153 22.063 11.863 I got bored then. This is using -O with current CVS Python, which is somewhat zippier than currently released Pythons. For contrast, the Python version doesn't take much longer to find the median than it takes just to fill the array with random doubles (random.random() is coded in Python too). There are obvious but unpleasant ways to speed the Python version (it does a lot of data copying), which I'll skip. exorcising-old-demons-ly y'rs - tim # Find the rank'th-smallest value in a, in worst-case quadratic time. def short_find(a, rank): a.sort() return a[rank - 1] # Find the rank'th-smallest value in a, in worst-case linear time. def find(a, rank): n = len(a) assert 1 <= rank <= n if n <= 7: return short_find(a, rank) # Find median of median-of-7's. medians = [short_find(a[i : i+7], 4) for i in xrange(0, n-6, 7)] median = find(medians, (len(medians) + 1) // 2) # Partition around the median. # a[:i] <= median # a[j+1:] >= median i, j = 0, n-1 while i <= j: while a[i] < median: i += 1 while a[j] > median: j -= 1 if i <= j: a[i], a[j] = a[j], a[i] i += 1 j -= 1 if rank <= i: return find(a[:i], rank) else: return find(a[i:], rank - i) def tryone(n, tries): from random import random from time import clock as now x = [None] * n median_rank = (n+1) // 2 sum1 = sum2 = 0.0 for i in range(tries): for i in xrange(n): x[i] = random() y = x[:] start = now() got1 = find(x, median_rank) elapsed = now() - start sum1 += elapsed start = now() got2 = short_find(y, median_rank) elapsed = now() - start sum2 += elapsed if got1 != got2: print "OUCH got1 %r got2 %r" % (got1, got2) return sum1, sum2 def drive(tries): for i in range(23): n = 2**i + 1 fast, slow = tryone(n, tries) print "%8d %7.3f %7.3f" % (n, fast/tries, slow/tries) if __name__ == "__main__": drive(3) PS: If you're more interested in expected time than worst-case time, replacing the median-of-median-of-7s business with median = short_find([a[0], a[-1], a[n//2]], 2) yields a Python median-finder that's usually significantly faster than the sort method starting in the range of 512K to 1M elements. PPS: If you do care about worst-case time, the Python version can be made significantly faster by boosting the median-of-7 gimmick to median-of-k for larger odd values of k. Fun for the whole family: plot speedups against k until you're all baffled .