My daily routine - as a developer writing code for a living - involves visits to sites like Slashdot, Reddit and Hacker News. These visits usually end up with me downloading "stuff" that I can take back home for... research. I have a directory set up on my PC (at work) for these "R&D" materials, and I periodically empty it onto a rewritable DVD that I carry back and forth between work and home (Edit, April 2011: DVDs are now all but obsolete; USB sticks are used in their place, but the article is still valid - keep reading).
After a busy month of hacking, I realized that a lot of "R&D" material had piled up, and decided to burn a DVD and take it home - only to find that more than 6GB were patiently waiting in the research folder...
And at that point, the question formed in my mind:
What would be the optimal way of filling up a DVD from a myriad of files, making sure that it gets filled up as much as possible - with no free space left unused?
Think about it. When you have more than a hundred files, there are literally millions of combinations you can try. Which one is the best?
Is it really millions? Well, here's a quick "brute-force" attempt:
#!/usr/bin/env python
import sys

g_maxSize = 4480                   # we'll count in megabytes, and a DVD is ~4480M

def main():
    # the sets formed from the combinations
    collectionsToTest = []         # will be stored in this variable
    # Now read integers from STDIN
    for newNo in sys.stdin.readlines():
        try:
            # Each integer is the size
            newNo = int(newNo)     # in MB of one of the files
        except:
            print "%s is not numeric" % newNo
            continue
        newWing = [x+[newNo] for x in collectionsToTest]
        collectionsToTest.append([newNo])
        collectionsToTest.extend(newWing)
    # Now check all the sets
    maximumLessThanMaxSize = 0     # and find the best one
    for set in collectionsToTest:  # (the one with the closest
        total = 0                  #  sum to g_maxSize)
        for elem in set:
            total += elem
        if total > g_maxSize:
            continue
        if total > maximumLessThanMaxSize:
            maximumLessThanMaxSize = total
            print "Set:", set, "maximum:", \
                maximumLessThanMaxSize

if __name__ == "__main__":
    main()
Each new number read from the input grows the collection of sets to test; after feeding in 4400, 10, 50 and 70, collectionsToTest contains:
[4400]                (4400 is input)
[10]                  (10 is input)
[4400, 10]
[50]                  (50 is input)
[4400, 50]
[10, 50]
[4400, 10, 50]
[70]                  (70 is input)
[4400, 70]
[10, 70]
[4400, 10, 70]
[50, 70]
[4400, 50, 70]
[10, 50, 70]
[4400, 10, 50, 70]

...and the following result from the prototype:
bash$ ./brutus.py
4400
10
50
70
(Pressing Ctrl-D)
Set: [4400] maximum: 4400
Set: [4400, 10] maximum: 4410
Set: [4400, 50] maximum: 4450
Set: [4400, 10, 50] maximum: 4460
Set: [4400, 70] maximum: 4470
Set: [4400, 10, 70] maximum: 4480

Now let's revisit the set creation: each new number that we input to a previous collection of N sets adds one set with a single element (the new number) and N new sets (the previous sets, augmented to include the new number). This means that we go from a collection of N sets to one with N+1+N = 2N+1 sets. The work to perform therefore, as we add elements, grows at a rate faster than 2^N; REALLY fast... After just 20 numbers, we have more than a million combinations to test:
bash$ perl -e \
    'for($i=1; $i<20; $i++) { print $i."\n"; }' | \
    ./brutus.py
(you'll run out of memory, you better Ctrl-C while you're ahead)
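To get a feel for that growth without actually enumerating anything, here is a minimal sketch (not part of the scripts above) that just counts how many candidate sets exist after N inputs:

# Each new input turns N candidate sets into N + 1 + N = 2N + 1 sets,
# i.e. after N inputs we hold every non-empty subset: 2**N - 1 of them.
def countSets(n):
    sets = 0
    for _ in range(n):
        sets = 2 * sets + 1
    return sets

for n in (4, 10, 20, 30):
    print("%2d inputs -> %d candidate sets" % (n, countSets(n)))
# 20 inputs already mean 1,048,575 sets; 30 inputs, over a billion.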
How do we cope?
Our problem (optimally choosing a subset of integers from a given set to fill up a fixed-size container) belongs to a class of problems that share a common trait: they are hard to solve by brute force, but they allow you to easily find an optimal solution for problem "N+1" if you already know the solution for problem "N". That is exactly the territory of dynamic programming - what we have here is in fact a form of the classic 0/1 knapsack problem, with each file's "value" equal to its size.
Let's look at it in detail for our problem. We'll build a table:
Dynamic programming solver for my "R&D" directory migration

         |  1  |  2  |  3  | ... | 198
 1 MB    |     |     |     |     |
 2 MB    |     |     |     |     |
 3 MB    |     |     |     |     |
 ...     |     |     |     |     |
 4480 MB |     |     |     |     |
The row index provides the container size: we start from 1MB and move all the way up to 4480MB. The column index is the number of files we use from my directory to fill up the container. I currently have a total of 198 files in the directory, so the table has 198 columns.
The table stores in each cell the optimal (maximum) sum of filesizes that we can achieve for that cell - that is, for each combination of (container size, first j files). Notice that we have arranged the files in a particular (arbitrary, but fixed) order, so that "the first j files" always refers to the same j files.
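To make the cell contents concrete, here is a small sketch (a hypothetical helper, not part of the article's scripts) that brute-forces a few cells for the toy input [4400, 10, 50, 70] and shows what the table would hold:

from itertools import combinations

fileSizes = [4400, 10, 50, 70]

def cellValue(containerSize, firstJ):
    # Best achievable sum <= containerSize using only the first J files.
    # Brute force over a handful of files - purely for illustration.
    best = 0
    for r in range(1, firstJ + 1):
        for combo in combinations(fileSizes[:firstJ], r):
            s = sum(combo)
            if s <= containerSize and s > best:
                best = s
    return best

print("cell (60 MB, first 2 files)   = %d" % cellValue(60, 2))    # 10
print("cell (60 MB, first 3 files)   = %d" % cellValue(60, 3))    # 60 (10+50)
print("cell (4480 MB, first 4 files) = %d" % cellValue(4480, 4))  # 4480 (4400+10+70)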
Now imagine that we have filled up ALL lines up to line j, and ALL cells of line j+1 up to column k. How do we fill cell (j+1, k+1)?
         | Column k | Column k+1
Line j   |    a     |    b
Line j+1 |    c     |    ?
...      |          |
In order to fill cell (j+1, k+1) of line j+1, we need to see whether the optimal sum of filesizes stays the same as it was using the first k files (i.e. the same as in cell (j+1,k), "c"), or whether we can do better by adding the extra file (file k+1) to the mix.
To do that, we compare two candidates: the value "c" on the left (the best we can do without file k+1), and the size of file k+1 plus the optimal result for the space that remains after placing it - a value that sits in an already-filled, earlier line of column k. Whichever is larger goes into the cell. In code:
#!/usr/bin/env python
import sys

g_maxSize = 4480

def main():
    # Read file sizes from standard input as integers:
    fileSizes = []
    for key in sys.stdin.readlines():
        try:
            key = int(key)
            fileSizes.append(key)
        except:
            print "%s is not numeric" % key
            continue
    print "Total of ", len(fileSizes), "items"
    print fileSizes

    # Time for dynamic programming
    optimalResults = {}

    # Line: from 0 .. g_maxSize
    for containerSize in xrange(0, g_maxSize+1):
        # Column: enumerate gives tuple of (index, file size)
        for idx, fileSize in enumerate(fileSizes):
            # We will index the "optimalResults" via tuples:
            # The current cell
            cellCurrent = (containerSize, idx)
            # The cell on our left
            cellOnTheLeftOfCurrent = (containerSize, idx-1)
            # Does the file on column "idx" fit inside containerSize?
            if containerSize < fileSize:
                # the file in column idx doesn't fit in containerSize
                # so copy optimal value from the cell on its left
                optimalResults[cellCurrent] = \
                    optimalResults.get(cellOnTheLeftOfCurrent, 0)
            else:
                # It fits!
                # Using file of column "idx", the remaining space is...
                remainingSpace = containerSize - fileSize
                # and for that remaining space, the optimal result
                # using the first idx-1 files has been found to be:
                optimalResultOfRemainingSpace = \
                    optimalResults.get((remainingSpace, idx-1), 0)
                # so let's check: do we improve things by using "idx"?
                if optimalResultOfRemainingSpace + fileSize > \
                        optimalResults.get(cellOnTheLeftOfCurrent, 0):
                    # we improved the best result, store it!
                    optimalResults[cellCurrent] = \
                        optimalResultOfRemainingSpace + fileSize
                else:
                    # no improvement by using "idx" - copy from left...
                    optimalResults[cellCurrent] = \
                        optimalResults[cellOnTheLeftOfCurrent]

    print "Attainable:", optimalResults[(g_maxSize, len(fileSizes)-1)]

if __name__ == "__main__":
    main()
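It is worth pausing on why this is tractable: the table has only (g_maxSize + 1) x len(fileSizes) cells, and each cell is filled with a constant amount of work. A quick back-of-the-envelope comparison (a standalone sketch, not part of the script above):

# Work estimates: dynamic programming cells vs. brute-force subsets.
g_maxSize = 4480    # container size in MB, as in the scripts above
numFiles  = 198     # the files currently in my "R&D" directory

dpCells    = (g_maxSize + 1) * numFiles   # 887,238 cells to fill
bruteForce = 2 ** numFiles - 1            # every non-empty subset

print("Dynamic programming cells: %d" % dpCells)
print("Brute-force subsets      : %d" % bruteForce)   # a 60-digit number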
bash$ ./dynprog1.py
4400
10
50
70
(Pressing Ctrl-D)
Total of  4 items
[4400, 10, 50, 70]
Attainable: 4480

We've found the optimal result (4480) - but how do we get to it? What series of files led us to it?
Simple - we augment the algorithm to keep, for each cell, not only the optimal sum but also the last step (the last file) used to reach it. To reproduce the series of files that leads to the final (optimal) result, we then start from that result and walk backwards, subtracting at each point the last "step" used to reach the current total:
    # Time for dynamic programming
    optimalResults = {}
    lastStep = {}
    ...
            # Does the file on column "idx" fit inside containerSize?
            if containerSize < fileSize:
                # the file in column idx doesn't fit in containerSize
                # so copy optimal value from the cell on its left
                optimalResults[cellCurrent] = \
                    optimalResults.get(cellOnTheLeftOfCurrent, 0)
                # Copy last step from cell on the left...
                lastStep[cellCurrent] = \
                    lastStep.get(cellOnTheLeftOfCurrent, 0)
            else:
                ...
                # so let's check: do we improve things by using "idx"?
                if optimalResultOfRemainingSpace + fileSize > \
                        optimalResults.get(cellOnTheLeftOfCurrent, 0):
                    # we improved the best result, store it!
                    optimalResults[cellCurrent] = \
                        optimalResultOfRemainingSpace + fileSize
                    # Store last step - i.e. using file "idx"
                    lastStep[cellCurrent] = fileSize
                else:
                    # no improvement by using "idx" - copy from left...
                    optimalResults[cellCurrent] = \
                        optimalResults[cellOnTheLeftOfCurrent]
                    # Copy last step from cell on the left...
                    lastStep[cellCurrent] = \
                        lastStep.get(cellOnTheLeftOfCurrent, 0)
    ...
    # Navigate backwards... starting from the optimal result:
    total = optimalResults[(g_maxSize, len(fileSizes)-1)]
    while total > 0:
        # The last step we used to get to "total" was...
        lastFileSize = lastStep[(total, len(fileSizes)-1)]
        print total, "removing", lastFileSize
        # And the one before the last step was... (loop)
        total -= lastFileSize
bash$ ./dynprog2.py
4400
10
50
70
(Pressing Ctrl-D)
Total of  4 items
[4400, 10, 50, 70]
Attainable: 4480
4480 removing 70
4410 removing 10
4400 removing 4400

And now we can enjoy the real fruits of our labour: feeding the brute-force implementation with the numbers 1-99 would lead to some 2^99 sets - the universe would end while we waited for the results to appear... Instead, our dynamic programming script only takes a couple of seconds:
(Note that numbers from 1 to 99 add up to 99*100/2 = 4950, so they accumulate to well above our target of 4480 - and the optimal sum of 4480 is achieved by dropping some files out, just as we wanted):
bash$ perl -e 'for($i=1; $i<100; $i++) { print $i."\n"; }' | \
    ./dynprog2.py
Total of  99 items
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
Attainable: 4480
4480 removing 95
4385 removing 94
4291 removing 93
4198 removing 92
4106 removing 91
4015 removing 90
...
...
21 removing 6
15 removing 5
10 removing 4
6 removing 3
3 removing 2
1 removing 1

It works. We can now optimally fill storage space.
You can download an easy-to-use version here.
Usage is as plain as it gets:
bash$ fitToSize.py inputDirectory/ outputDirectory/ 4475
(you can choose the desired capacity; 4475 is just an example)
P.S. Using MBs to count is in fact cheating: the correct solution would be to use the filesystem's allocation block as the step size between lines. Still, megabytes are close enough for my needs - using sectors (512 bytes) as the step would make the table 2048 times taller, and the execution correspondingly slower.
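If one did want block-level precision, a minimal sketch of the idea follows - collectSizesInBlocks is a hypothetical helper (not part of the downloadable script), and the 2048-byte block size is only an assumption, to be adjusted for your own filesystem or medium:

import os

def collectSizesInBlocks(directory, blockSize=2048):
    # Walk the directory and return each file's size rounded UP to a
    # whole number of allocation blocks - i.e. the space it actually
    # occupies on the target medium. blockSize is an assumption.
    sizes = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            byteSize = os.path.getsize(os.path.join(root, name))
            sizes.append((byteSize + blockSize - 1) // blockSize)
    return sizes

The container capacity would then also be expressed in blocks (4480 MB is roughly 2.3 million 2048-byte blocks), which is exactly where the slow-down mentioned above comes from: the table grows one line per block instead of one line per megabyte.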
Edit, May 2011: An interesting discussion arose when I tried to implement this algorithm in F# - F# turned out to be slower than Python!