Python optimized base64 writer for streamed files











I needed to make a base64 file encoder where you can control the read buffer size. This is what I came up with, and it's quite fast. It could probably be simpler while still keeping its performance characteristics. Any suggestions?



import base64
from binascii import b2a_base64
from math import ceil


def chunked_base64_encode(input, input_size, output, read_size=1024):
    """
    Read a file in configurable sized chunks and write it base64
    encoded to an output file.

    This is an optimization over ``base64.encode`` which only reads 57
    bytes at a time from the input file. Normally this is OK if the
    file in question is opened with ``open`` because Python will
    actually read the data into a larger buffer and only feed out
    57 bytes at a time. But if the input file is something like a
    file stream that's read over the network, only 57 bytes will be
    read at a time. This is very slow if the file stream is not
    buffered some other way.

    This is the case for MongoDB GridFS. The GridOut file returned by
    GridFS is not a normal file on disk. Instead it's a file read in
    256 KB chunks from MongoDB. If you read from it 57 bytes at a time,
    GridFS will read 256 KB then make lots of copies of that chunk
    to return only 57 bytes at a time. By reading in chunks equal
    to the GridFS chunk size, performance is 300 times better.

    Performance comparison:

        File size 10 MB
        Save to MongoDB took 0.271495819092 seconds
        Fast Base 64 encode (chunk size 261120) took 0.250380992889 seconds
        Base 64 encode (chunk size 57) took 62.9280769825 seconds

        File size 100 MB
        Save to MongoDB took 0.994009971619 seconds
        Fast Base 64 encode (chunk size 261120) took 2.78231501579 seconds
        Base 64 encode (chunk size 57) took 645.734956026 seconds

    For regular files on disk, there is no noticeable performance gain
    for this function over ``base64.encode`` because of Python's built
    in buffering for disk files.

    Args:
        input (file): File like object (implements ``read()``).
        input_size (int): Size of file in bytes.
        output (file): File like object (implements ``write()``).
        read_size (int): How many bytes to read from ``input`` at
            a time.
    """
    # 57 bytes of input will be 76 bytes of base64
    chunk_size = base64.MAXBINSIZE
    base64_line_size = base64.MAXLINESIZE
    # Read size needs to be in increments of chunk size for base64
    # output to be RFC 3548 compliant.
    read_size = read_size - (read_size % chunk_size)
    num_reads = int(ceil(input_size / float(read_size)))
    # RFC 3548 says lines should be 76 chars
    base64_lines_per_read = read_size / chunk_size

    input.seek(0)
    for r in xrange(num_reads):
        is_last_read = r == num_reads - 1
        s = input.read(read_size)
        if not s:
            # If this were to happen, then ``input_size`` is wrong or
            # the file is corrupt.
            raise ValueError(
                u'Expected to need to read %d times but got no data back on read %d' % (
                    num_reads, r + 1))

        data = b2a_base64(s)

        if is_last_read:
            # The last chunk will be smaller than the others so the
            # line count needs to be calculated. b2a_base64 adds a line
            # break so we don't count that char
            base64_lines_per_read = int(ceil((len(data) - 1) / float(base64_line_size)))

        # Split the data chunks into base64_lines_per_read number of
        # lines, each 76 chars long.
        for l in xrange(base64_lines_per_read):
            is_last_line = l == base64_lines_per_read - 1
            pos = l * base64_line_size
            line = data[pos:pos + base64_line_size]
            output.write(line)

            if not (is_last_line and is_last_read):
                # The very last line will already have a \n because of
                # b2a_base64. The other lines will not so we add it
                output.write('\n')
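
For context, a minimal usage sketch with a GridFS file might look like the following (the database name and file id are hypothetical; it assumes pymongo/gridfs, where the GridOut object exposes ``read()``, ``seek()``, ``length`` and ``chunk_size``):

import gridfs
from pymongo import MongoClient

db = MongoClient().my_database      # hypothetical database name
fs = gridfs.GridFS(db)
grid_out = fs.get(some_file_id)     # hypothetical file id; GridOut implements read()/seek()

with open('encoded.b64', 'wb') as output:
    # Match the read size to the GridFS chunk size, as in the benchmarks above
    chunked_base64_encode(grid_out, grid_out.length, output,
                          read_size=grid_out.chunk_size)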









python python-2.x base64






asked Sep 24 at 19:47 by six8
edited Sep 28 at 4:35 by Heslacher


  • Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
    – Heslacher
    Sep 28 at 4:35










  • Got it. Thanks.
    – six8
    Sep 28 at 18:36


















2 Answers
The first thing I notice is that you are using Python 2. This is almost certainly wrong: Python 3 is faster for most applications, and Python 2 is going EOL in 15 months.



Other than that, my main comment would be that this would probably benefit from async, since this is an IO-heavy function: you could be doing the computation while waiting for a different IO task to finish.
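
For illustration only (this is not the OP's code): a minimal Python 3 asyncio sketch of that idea, pushing the blocking read/write calls into a thread pool so several streams can be encoded concurrently. The 76-character line wrapping from the original function is omitted for brevity.

import asyncio
from binascii import b2a_base64

async def encode_stream(input_file, output_file, read_size=255 * 1024):
    # Hypothetical helper: encode one stream, offloading the blocking IO to
    # the default thread pool so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    while True:
        chunk = await loop.run_in_executor(None, input_file.read, read_size)
        if not chunk:
            break
        await loop.run_in_executor(None, output_file.write, b2a_base64(chunk))

async def encode_many(pairs):
    # Encode several (input, output) pairs concurrently: while one stream
    # waits on the network, another can be encoding or writing.
    await asyncio.gather(*(encode_stream(i, o) for i, o in pairs))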






answered Sep 25 at 21:44 by Oscar Smith
  • This is a legacy application. Upgrading Python is not always a practical option.
    – six8
    Sep 26 at 2:56










  • I figured it was worth a try. I know the pain of legacy environments (CentOS servers that only supported Python 2.5). In that case my async advice won't work either. Your best bet for speeding this up then is probably docs.python.org/2/library/base64.html
    – Oscar Smith
    Sep 26 at 15:56










  • In this case, base64.encode is the bottleneck because of its 57-byte read size.
    – six8
    Sep 26 at 17:26


















I ended up using bytearray as an input and output buffer. I found that if the output was something that doesn't buffer output (like a socket), writing 77 bytes at a time would be very slow. Also, my original code rounded the read size to be advantageous for base64, but not advantageous for MongoDB. It's better for the read size to match the MongoDB chunk size exactly. So the input is read into a bytearray with the exact size passed in, but then consumed in smaller base64-sized chunks.



import base64
from binascii import b2a_base64

# Note: DEFAULT_READ_SIZE is referenced but not defined in this snippet; the
# value below is an assumption matching the GridFS chunk size used in the
# question's benchmarks (261120 bytes).
DEFAULT_READ_SIZE = 255 * 1024


def chunked_encode(
        input, output, read_size=DEFAULT_READ_SIZE,
        write_size=(base64.MAXLINESIZE + 1) * 64):
    """
    Read a file in configurable sized chunks and write it base64
    encoded to an output file.

    Args:
        input (file): File like object (implements ``read()``).
        output (file): File like object (implements ``write()``).
        read_size (int): How many bytes to read from ``input`` at
            a time. More efficient if in increments of 57.
        write_size (int): How many bytes to write at a time. More efficient
            if in increments of 77.
    """
    # 57 bytes of input will be 76 bytes of base64
    chunk_size = base64.MAXBINSIZE
    base64_line_size = base64.MAXLINESIZE
    # Read size needs to be in increments of chunk size for base64
    # output to be RFC 3548 compliant.
    buffer_read_size = max(chunk_size, read_size - (read_size % chunk_size))

    input.seek(0)

    read_buffer = bytearray()
    write_buffer = bytearray()

    while True:
        # Read from file and store in buffer until we have enough data
        # to meet buffer_read_size
        if input:
            s = input.read(read_size)
            if s:
                read_buffer.extend(s)
                if len(read_buffer) < buffer_read_size:
                    # Need more data
                    continue
            else:
                # Nothing left to read
                input = None

        if not len(read_buffer):
            # Nothing in buffer to read, finished
            break

        # Base 64 encode up to buffer_read_size and remove the trailing
        # line break.
        data = memoryview(b2a_base64(read_buffer[:buffer_read_size]))[:-1]
        # Put any unread data back into the buffer
        read_buffer = read_buffer[buffer_read_size:]

        # Read the data in chunks of base64_line_size and append a
        # linebreak
        for pos in xrange(0, len(data), base64_line_size):
            write_buffer.extend(data[pos:pos + base64_line_size])
            write_buffer.extend('\n')

            if len(write_buffer) >= write_size:
                # Flush write buffer
                output.write(write_buffer)
                del write_buffer[:]

    if len(write_buffer):
        output.write(write_buffer)
        del write_buffer[:]
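
For example, streaming a GridFS file straight to an unbuffered socket might look like this (the endpoint is hypothetical, and ``grid_out`` is assumed to be a pymongo GridOut as in the question):

import socket

sock = socket.create_connection(('example.com', 9000))   # hypothetical endpoint
output = sock.makefile('wb')

# read_size matches the MongoDB chunk size exactly; write_size (left at its
# default of 77 * 64 bytes) controls how much base64 is buffered per write().
chunked_encode(grid_out, output, read_size=grid_out.chunk_size)

output.flush()
sock.close()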





answered Sep 28 at 18:39 by six8