Sorting a million 32-bit integers in 2MB of RAM using Python
Original post from Guido Van Rossum:
Sorting a million 32-bit integers in 2MB of RAM using Python
The post analysis the above post:
Analyzing "Sorting a million 32-bit integers in 2MB of RAM using Python"
Key point: usage of array, heapq and tempfile#
With generators.
Using temporary files to store sorted chunks of integers, read those files with a generator/iterator, and merge them using heapq.merge
.
Each time only a small chunk of integers is in memory, next chunk is read when the previous one is sorted.
循环顶着 Memory 上线,对 chunk 进行 sort,写入 n
个文件。
对于每个文件,每次读取一点 m
,并用 generator iterator 给出 sort 好的数据。通过 heapq 来进行 merge。使用到的最多内存为 n*m
。
Summary#
What's the issue?#
A 32-bit integer is 4 byte. So a million 32-bit integers are 4MB. Which exceed the 2MB limit.
We couldn't simplistically using sorted
method to sort those numbers, which loads every number into memory.
A million integer source#
struct
module#
Converts between Python values and C structs represented as Python bytes object.
It's more compact compared to saving as a txt.
import random
import struct
numbers = [random.randint(-2**31, 2**31 - 1) for _ in range(1_000_000)]
with open("random_numbers.bin", "wb") as file:
# 'i' is the letter used by the formatting mini-language of the
# struct module to turn a Python number into a C like signed integer
# 'i' is signed integer with 4 bytes size
file.write(struct.pack(len(numbers) * "i", *numbers))
Read limited size data into memory#
array
module#
Compactly represent an array of basic values. Using much less memory than a list.
In array("i")
, the "i"
is the typecode. Which means elements in the array should be type i
, which is C type signed int and python type int.
def int_array_from_file(int_file, buffer_size=4000):
int_array = array("i")
int_array.frombytes(int_file.read(buffer_size))
return int_array
The above code means we read chunks of 4000 bytes from the file and convert them to an array of python integers. Not the entire file into memory at once.
Generator to yield integers#
def read_buffered_ints(int_file):
while True:
int_array = int_array_from_file(int_file)
if not int_array:
return None
yield from int_array
Generator are lazy, they don't take much memory.
Write each sorted chunk to a temporary file and got their generator#
int_arrays_generators = []
while True:
int_array = int_array_from_file(int_file, 4000*10)
if not int_array:
break
# For each chunk, we create a temp file...
temp_file = tempfile.TemporaryFile()
# we sort the arrays, convert them back into bytes,
# and put them in the file...
array("i", sorted(int_array)).tofile(temp_file)
# then save a generator ready to give us back those ints,
# in our list
temp_file.seek(0)
generator = read_buffered_ints(temp_file)
int_arrays_generators.append(generator)
Each generator hold a temporary file, which is a sorted chunk of 4000*10 bytes. Each of them only take 4000 bytes one time in memory.
Merge the sorted chunks using heapq.merge
#
heapq
module#
Implementations of heap queue algorithm.
heapq.merge
merge multiple sorted inputs into a single sorted output. Returns an iterator over the sorted values.
import heapq
int_buffer = array("i")
# heapq.merge give us all integer IN ORDER
for integer in heapq.merge(*int_arrays_generators):
# We store them all in an array, and once the array is big enough,
# we append its content into the final result file
int_buffer.append(integer)
if len(int_buffer) >= 1000:
int_buffer.tofile(final_result)
del int_buffer[:] # empty the array, it reached the size limit
# For the last array if it didn't reach 1000
if int_buffer:
int_buffer.tofile(final_result)
Source code#
https://gist.github.com/ksamuel/2352d2c10d89f9bed3da6d45ac4c8ce3