Use this Python script to encode binary

People transfer information in many different ways. On the Internet, the primary format is text, which is how you are reading this article. However, there is other data on the Internet, such as images and sound files. Posting an image online or attaching a document to an email may seem easy, until you learn that HTTP/1.1 and SMTP are text-based protocols. To pass safely through a text-only channel, data has to be represented as a subset of ASCII text (specifically, the printable characters 33 to 126).

A digital image is already encoded as binary data by the computer that displays it. In other words, a digital image is not like a physical image printed on paper: it is a collection of computer shorthand that is decoded by whatever image viewer you use to look at it, whether that is a web browser, a photo-editing application, or any other software that can display pictures.

To re-encode an image in ASCII, it is common to use Base64, a system of binary-to-text encoding rules that can represent binary data as an ASCII string. Here is a single black pixel, saved in WebP format and encoded in Base64:

UklGRiYAAABXRUJQVlA4IBoAAAAwAQCdASoBAAEAAIAOJaQAA3AA/v9gGAAAAA==
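
As a quick check, the string above can be decoded with Python's standard base64 module. This is only a minimal sketch: the variable names are illustrative, and it inspects the container header rather than rendering the image.

```python
import base64

encoded = "UklGRiYAAABXRUJQVlA4IBoAAAAwAQCdASoBAAEAAIAOJaQAA3AA/v9gGAAAAA=="
data = base64.b64decode(encoded)

# A WebP file is a RIFF container:
# bytes 0-3 hold "RIFF" and bytes 8-11 hold "WEBP".
print(len(encoded), "characters ->", len(data), "bytes")
print(data[:4], data[8:12])

# Encoding the bytes again should reproduce the original string.
roundtrip = base64.b64encode(data).decode()
print(roundtrip == encoded)
```

The 64 text characters carry fewer raw bytes than that, which is the size overhead discussed in the next section.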

Convert binary files

The main purpose of such a converter is to put a binary file into a form that can be sent over a channel with a limited range of supported symbols. A good example is any text-based network protocol, where all transmitted binary data must be reversibly converted to a pure text form that contains no control characters. ASCII codes 0 to 31 are control characters, and they are lost when transmitting over any logical channel that does not allow the endpoints to transmit full eight-bit bytes (binary) with codes from 0 to 255.

The standard solution nowadays for this purpose is the Base64 algorithm, as shown above and defined in the IETF's RFC 4648. That RFC also describes Base32 and Base16 as possible variations. The main point is that they share the same trait: their bases are all powers of two. The wider the range of supported symbols (codes), the more efficient the conversion result. The output will be bigger than the input, but the question is how much bigger. For example, Base64 encoding gives about 33% larger output, because three input bytes (eight valuable bits each) are translated into four output bytes (six valuable bits each, 2^6 = 64). So the ratio is always 4/3; that is, the output is larger by 1/3, or 33.(3)%. In practice, Base32 is much less efficient, because five input bytes (eight valuable bits each) translate into eight output bytes (five valuable bits each, 2^5 = 32), and the ratio is 8/5; that is, the output is larger by 3/5, or 60%. In this context, it is hard to speak of any efficiency for Base16, as its output is larger by 100%: each eight-bit byte is represented by two bytes carrying four valuable bits each (also called nibbles, 2^4 = 16). It is not even a translation, but simply a hexadecimal view of each eight-bit byte.
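
These growth figures can be checked with Python's standard base64 module. The sketch below is illustrative only: the 15-byte payload is an arbitrary choice, picked because 15 is a multiple of both 3 and 5, so none of the encoders has to add padding.

```python
import base64

payload = bytes(range(15))  # arbitrary 15-byte sample, no padding needed

sizes = {}
for name, encode in (("Base64", base64.b64encode),
                     ("Base32", base64.b32encode),
                     ("Base16", base64.b16encode)):
    encoded = encode(payload)
    sizes[name] = len(encoded)
    # Relative growth of the output compared with the input
    growth = (len(encoded) - len(payload)) / len(payload)
    print(f"{name}: {len(payload)} -> {len(encoded)} bytes (+{growth:.0%})")
```

For inputs whose length is not a multiple of the block size, padding characters make the output slightly larger than these ratios suggest.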

  1. Base64 (Input: eight bits, Output: six bits):
    • LCM(8, 6) = 8*6/GCD(8,6) = 24 bit
    • Input: 24/8 = 3 bytes
    • Output: 24/6 = 4 bytes
    • Ratio (Output/Input): 4/3
  2. Base32 (Input: eight bits, Output: five bits):
    • LCM(8, 5) = 8*5/GCD(8,5) = 40 bit
    • Input: 40/8 = 5 bytes
    • Output: 40/5 = 8 bytes
    • Ratio (Output/Input): 8/5
  3. Base16 (Input: eight bits, Output: four bits):
    • LCM(8, 4) = 8*4/GCD(8,4) = 8 bit
    • Input: 8/8 = 1 byte
    • Output: 8/4 = 2 bytes
    • Ratio (Output/Input): 2/1
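
The same arithmetic can be written as a small helper. This is a sketch, and the function name block_sizes is made up for illustration; it simply finds the smallest number of bits that splits evenly into both eight-bit input bytes and n-bit output symbols.

```python
from math import gcd

def block_sizes(bits_per_symbol, bits_per_byte=8):
    """Return (input bytes, output symbols) for the smallest whole block."""
    lcm = bits_per_byte * bits_per_symbol // gcd(bits_per_byte, bits_per_symbol)
    return lcm // bits_per_byte, lcm // bits_per_symbol

for bits in (6, 5, 4):
    n_in, n_out = block_sizes(bits)
    print(f"Base{2 ** bits}: {n_in} bytes in, {n_out} bytes out, "
          f"ratio {n_out}/{n_in}")
```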

Use this Python script

This solution is very simple, but that simplicity comes with a significant computational cost. The entire input file can be treated as one large number in base 256. It may be a really large number that requires thousands of bits. Then you just need to convert this large number to a different base. That's all.

with open('input_file', 'rb') as f: in_data = int.from_bytes(f.read(), 'big')
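
To make the idea concrete before reading the full script, here is a minimal in-memory sketch of the same conversion. The helper names to_digits and from_digits are hypothetical, and base 10 is used only so the digits are easy to read.

```python
def to_digits(data, base):
    """Digits of the big integer in the given base, least significant first."""
    n = int.from_bytes(data, 'big')
    digits = [n % base]
    n //= base
    while n:
        digits.append(n % base)
        n //= base
    return digits

def from_digits(digits, base):
    """Rebuild the integer from its digits and turn it back into bytes."""
    n = 0
    for d in reversed(digits):  # most significant digit first
        n = n * base + d
    return n.to_bytes((n.bit_length() + 7) // 8, 'big')

number = int.from_bytes(b'hi', 'big')  # 0x68 * 256 + 0x69
digits = to_digits(b'hi', 10)
restored = from_digits(digits, 10)
print(number, digits, restored)
```

Each digit would then be mapped to one printable character of the chosen alphabet.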

#!/usr/bin/env python3

from sys import argv
from math import ceil

base = 42
# printable ASCII characters 33..126 (raw string, so the backslash is literal)
abc = r'''!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'''

def to_base(fn_src, fn_dst, base=base):
  out_data = []

  # represent a file as one big integer
  with open(fn_src, 'rb') as f:
    in_data = int.from_bytes(f.read(), 'big')

  # convert the big integer to base N, least significant digit first
  d, r = in_data % base, in_data // base
  out_data.append(abc[d])
  while r:
    d, r = r % base, r // base
    out_data.append(abc[d])

  # write the result as a string to a file
  with open(fn_dst, 'wb') as f:
    f.write(''.join(out_data).encode())

def from_base(fn_src, fn_dst, base=base):
  out_data = 0

  # read one long string into memory at once
  with open(fn_src, 'rb') as f:
    in_data = f.read().decode()

  # convert the big base-N number back to an integer
  # (digits are stored least significant first)
  for i, ch in enumerate(in_data):
    out_data = abc.index(ch)*(base**i) + out_data

  # write the big integer to a file as a sequence of bytes
  with open(fn_dst, 'wb') as f:
    f.write(out_data.to_bytes(ceil(out_data.bit_length()/8), 'big'))

def usage():
  print(f'usage: {argv[0]} <-e|-d> src dst [base={base}]')
  raise SystemExit(1)

def main():
  if len(argv) == 5:
    n = int(argv[4])
  elif len(argv) == 4:
    n = base  # fall back to the module-level default
  else:
    usage()

  # the base cannot exceed the number of symbols in the alphabet
  if not 2 <= n <= len(abc):
    usage()

  if argv[1] == '-e':
    to_base(argv[2], argv[3], n)
  elif argv[1] == '-d':
    from_base(argv[2], argv[3], n)
  else:
    usage()

if __name__ == '__main__':
  main()
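
One property of the big-number approach is worth knowing: an integer has no leading zeros, so a file that starts with zero bytes does not survive the round trip unchanged. The sketch below shows the effect and one possible workaround, which is an assumption of this example and not part of the script above: prefix a 0x01 sentinel byte before converting, and drop it again after decoding. The names pack and unpack are hypothetical.

```python
def pack(data):
    """Bytes -> integer, with a 0x01 sentinel protecting leading zero bytes."""
    return int.from_bytes(b'\x01' + data, 'big')

def unpack(n):
    """Integer -> bytes, dropping the sentinel byte added by pack()."""
    raw = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return raw[1:]

sample = b'\x00\x00abc'

# Without the sentinel, the two leading zero bytes disappear.
naive = int.from_bytes(sample, 'big')
lossy = naive.to_bytes((naive.bit_length() + 7) // 8, 'big')
print(lossy)

# With the sentinel, the original bytes come back intact.
print(unpack(pack(sample)))
```

The cost is at most one extra output character per file.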
