Thursday, December 26, 2024

Encryption: ciphers, digests, salt, IV




What is encryption
Encryption is a method of turning data into an unusable form that can be made useful only by means of decryption. The purpose is to make data available solely to those who can decrypt it (i.e. make it usable). Typically, data needs to be encrypted to make sure it cannot be obtained in case of unauthorized access. It is the last line of defense after an attacker has managed to break through authorization systems and access control.

This doesn't mean all data needs to be encrypted, because authorization and access systems are often enough, and there is a performance penalty for encrypting and decrypting data. Whether and when data gets encrypted is a matter of application planning and risk assessment, and sometimes it is also a regulatory requirement, such as with HIPAA or GDPR.

Data can be encrypted at rest, such as on disk, or in transit, such as between two parties communicating over the Internet.

Here you will learn how to encrypt and decrypt data using a password, also known as symmetric encryption. This password must be known to both parties exchanging information.
Cipher, digest, salt, iterations, IV
To properly and securely use encryption, there are a few notions that need to be explained.

A cipher is the algorithm used for encryption. For example, AES256 is a cipher. The idea of a cipher is what most people will think of when it comes to encryption.

A digest is basically a hash function that is used to scramble and lengthen the password (i.e. the encryption key) before it's used by the cipher. Why is this done? For one, it creates a well randomized, uniform-length hash of a key that works better for encryption. It's also very suitable for "salting", which is the next one to talk about.

The "salt" is a method of defeating so-called "rainbow" tables. A rainbow table attempts to match known hashed values with precomputed data in an effort to guess a password. It works because an attacker knows that two hashed values will look exactly the same if the originals were the same. However, if you add a salt value to the hashing, then they won't. It's called "salt" because it's sort of mixed with the key to produce something different. Usually, salt is randomly generated for each key and stored with it. In order to match known hashes, the attacker would have to precompute rainbow tables for a great many random values, which is generally not feasible.
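The effect of salting can be seen in a few lines of Python. This is only a sketch of the idea using the standard "hashlib" module; it is not how Golf derives keys internally, and the helper name "salted_digest" is made up for this illustration.

```python
import hashlib
import os

def salted_digest(password: str, salt: bytes) -> str:
    """Hash the salt together with the password and return the hex digest."""
    return hashlib.sha256(salt + password.encode("utf-8")).hexdigest()

password = "my_password"

# Without salt, the same password always gives the same digest,
# which is exactly what a precomputed (rainbow) table relies on.
unsalted_a = hashlib.sha256(password.encode("utf-8")).hexdigest()
unsalted_b = hashlib.sha256(password.encode("utf-8")).hexdigest()

# With a random 16-byte salt per key, the digests no longer match.
salt_a = os.urandom(16)
salt_b = os.urandom(16)
salted_a = salted_digest(password, salt_a)
salted_b = salted_digest(password, salt_b)

print("unsalted digests equal:", unsalted_a == unsalted_b)
print("salted digests equal:  ", salted_a == salted_b)
```

The salt is not secret; it only has to be stored so the same digest can be recomputed later.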

You will often hear about "iterations" in encryption. An iteration is a single cycle in which a key and salt are mixed in such a way as to make guessing the key harder. This is done many times to make it computationally difficult for an attacker to reverse-guess the key, hence "iterations" (plural). Typically, the minimum required number of iterations is 1000, but it can be different. If you start with a really strong password, you generally need fewer.
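As an illustration of iterations, here is a sketch using PBKDF2 from Python's standard library. PBKDF2 is a stand-in chosen for this example; the article does not say which derivation function Golf uses, and "derive_key" is a hypothetical helper.

```python
import hashlib
import os

def derive_key(password: str, salt: bytes, iterations: int) -> bytes:
    """Derive a 32-byte key with PBKDF2-HMAC-SHA256 (an assumed stand-in)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )

salt = os.urandom(16)

# Same password and salt, different iteration counts: the keys differ,
# so both sides must agree on the count.
key_1000 = derive_key("my_password", salt, 1000)
key_2000 = derive_key("my_password", salt, 2000)

print(len(key_1000), "bytes")
print("keys equal:", key_1000 == key_2000)
```

Each extra iteration costs the legitimate user very little once, but multiplies the work for an attacker who has to repeat it for every guessed password.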

IV (or "Initialization Vector") is typically a random value that's used for encryption of each message. Salt is used for producing a key based on a password, while IV is used when you already have a key and are now encrypting messages. The purpose of IV is to make the same messages appear different when encrypted. Sometimes, IV also has a sequential component, so it's made of a random string plus a sequence that constantly increases. This makes "replay" attacks difficult. In a replay attack, the attacker doesn't need to decrypt a message; rather, an encrypted message is "sniffed" (i.e. intercepted between the sender and receiver) and then replayed, hoping to repeat the action already performed. In reality, most high-level protocols already have a sequence in place, where each message includes an increasing packet number, so in most cases IV doesn't need it.
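An IV with a random part plus an increasing sequence could be built along these lines. This is a sketch only: the 8+8 byte layout and the names "IVGenerator" and "sequence_of" are assumptions for illustration, not something Golf prescribes.

```python
import os
import struct

class IVGenerator:
    """Builds 16-byte IVs: 8 random bytes followed by an 8-byte counter."""

    def __init__(self) -> None:
        self.prefix = os.urandom(8)  # random part, fixed per generator
        self.sequence = 0            # sequential part, increases per message

    def next_iv(self) -> bytes:
        iv = self.prefix + struct.pack(">Q", self.sequence)
        self.sequence += 1
        return iv

def sequence_of(iv: bytes) -> int:
    """Extract the sequence number from an IV built by IVGenerator."""
    return struct.unpack(">Q", iv[8:])[0]

gen = IVGenerator()
first = gen.next_iv()
second = gen.next_iv()

print(first.hex())
print(second.hex())
```

A receiver that remembers the highest sequence number it has seen can reject a replayed message whose IV carries an older one.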
Prerequisites
This example uses the Golf framework. Install it first.
Encryption example
To run the examples here, create an application "enc" in a directory of its own (see mgrg for more on Golf's program manager):
mkdir enc_example
cd enc_example
gg -k enc

To encrypt data, use the encrypt-data statement. The simplest form is to encrypt a null-terminated string. Create a file "encrypt.golf" and copy this:
 begin-handler /encrypt public
     set-string str = "This contains a secret code, which is Open Sesame!"
     // Encrypt
     encrypt-data str to enc_str password "my_password"
     p-out enc_str
     @
     // Decrypt
     decrypt-data enc_str password "my_password" to dec_str
     p-out dec_str
     @
 end-handler

You can see the basic usage of encrypt-data and decrypt-data. You supply data (original or encrypted), the password, and off you go. The data is encrypted and then decrypted, yielding the original.

In the source code, a string variable "enc_str" (which is created as a "char *") will contain the encrypted version of "This contains a secret code, which is Open Sesame!" and "dec_str" will be the decrypted data which must be exactly the same.

To run this code from command line, make the application first:
gg -q

Then have Golf produce the bash code to run it. The request path is "/encrypt", which in our case is handled by function "void encrypt()" defined in source file "encrypt.golf". In Golf, these names always match, making it easy to write, read and execute code. Use the "-r" option in gg to specify the request path and get the code you need to run the program:
gg -r --req="/encrypt" --silent-header --exec

You will get a response like this:
72ddd44c10e9693be6ac77caabc64e05f809290a109df7cfc57400948cb888cd23c7e98e15bcf21b25ab1337ddc6d02094232111aa20a2d548c08f230b6d56e9
This contains a secret code, which is Open Sesame!

What you have here is the encrypted data, and then this encrypted data is decrypted using the same password. Unsurprisingly, the result matches the string you encrypted in the first place.

Note that by default encrypt-data will produce the encrypted value in a human-readable hexadecimal form, meaning it consists of hexadecimal characters "0" to "9" and "a" to "f". This way you can store the encrypted data in a regular string. For instance, it may go into a JSON document or into a VARCHAR column in a database, or pretty much anywhere else. However, you can also produce binary encrypted data. More on that in a bit.
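To see how the hexadecimal form relates to the underlying bytes, the sample output shown above can be converted in Python. This is just an illustration of hex encoding in general; it doesn't use Golf or perform any decryption.

```python
# The hexadecimal value printed by the example above
hex_value = (
    "72ddd44c10e9693be6ac77caabc64e05f809290a109df7cfc57400948cb888cd"
    "23c7e98e15bcf21b25ab1337ddc6d02094232111aa20a2d548c08f230b6d56e9"
)

# Two hexadecimal characters represent one byte
raw = bytes.fromhex(hex_value)

print("hex characters:", len(hex_value))
print("raw bytes:     ", len(raw))
print("contains a null byte:", b"\x00" in raw)

# Converting back gives the same text, so nothing is lost either way
print("round trip ok:", raw.hex() == hex_value)
```

The hexadecimal form is twice as long as the raw bytes, which is the price of being able to store it as ordinary text.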
Encrypt data into a binary result
In the previous example, the resulting encrypted data is in a human-readable hexadecimal form. You can also create binary encrypted data, which is not a human-readable string and is also shorter. To do that, use the "binary" clause. Replace the code in "encrypt.golf" with:
 begin-handler /encrypt public
     set-string str = "This contains a secret code, which is Open Sesame!"
     // Encrypt
     encrypt-data str to enc_str password "my_password" binary
     // Save the encrypted data to a file
     write-file "encrypted_data" from enc_str
     get-app directory to app_dir
     @Encrypted data written to file <<p-out app_dir>>/encrypted_data
     // Decrypt data
     decrypt-data enc_str password "my_password" binary to dec_str
     p-out dec_str
     @
 end-handler

Binary encrypted data may contain null bytes, so a null terminator can't tell you where it ends; what matters is its length in bytes. Golf keeps track of how many bytes a string holds, which is why the code above can pass "enc_str" whole; the "output-length" clause is there if you want that length in a variable. In this code, the encrypted data in variable "enc_str" is written to file "encrypted_data". When a file is written without a path, it's always written in the application home directory (see directories), so you'd use get-app to get that directory.

When decrypting data, the "input-length" clause says how many bytes the encrypted data has. The code above doesn't need it because "enc_str" is passed whole. When encryption and decryption are decoupled, i.e. running in separate programs, you'd make sure this length is made available.

Notice also that when data is encrypted as "binary" (meaning producing a binary output), the decryption must use the same.

Make the application:
gg -q

Run it the same as before:
gg -r --req="/encrypt" --silent-header --exec

The result is:
Encrypted data written to file /var/lib/gg/enc/app/encrypted_data
This contains a secret code, which is Open Sesame!

The decrypted data is exactly the same as the original.

You can see the actual encrypted data written to the file by using "octal dump" ("od") Linux utility:
od -c /var/lib/gg/enc/app/encrypted_data

with the result like:
$ od -c /var/lib/gg/enc/app/encrypted_data
0000000   r 335 324   L 020 351   i   ; 346 254   w 312 253 306   N 005
0000020 370  \t   )  \n 020 235 367 317 305   t  \0 224 214 270 210 315
0000040   # 307 351 216 025 274 362 033   % 253 023   7 335 306 320
0000060 224   #   ! 021 252     242 325   H 300 217   #  \v   m   V 351
0000100

There you have it. You will notice the data is binary and it actually contains the null byte(s).
Encrypt binary data
The data to encrypt in these examples is a string, i.e. null-terminated. You can encrypt binary data just as easily by specifying it whole (since Golf keeps track of how many bytes are there!), or by specifying its length in the "input-length" clause. For example, copy this to "encrypt.golf":
 begin-handler /encrypt public
     set-string str = "This c\000ontains a secret code, which is Open Sesame!"
     // Encrypt
     encrypt-data str to enc_str password "my_password" input-length 12
     p-out enc_str
     @
     // Decrypt
     decrypt-data enc_str password "my_password" to dec_str
     // Output binary data; present null byte as octal \000

     string-length dec_str to res_len
     start-loop repeat res_len use i start-with 0
         if-true dec_str[i] equal 0
             p-out "\\000"
         else-if
             pf-out "%c", dec_str[i]
         end-if
     end-loop
     @
 end-handler

This will encrypt 12 bytes at memory location "str" regardless of any null bytes. In this case that's "This c" followed by a null byte followed by "ontai", but it can be any kind of binary data, for example the contents of a JPG file.

On the decrypt side, the code here obtains the number of bytes decrypted with string-length (the "output-length" clause can provide it as well). Finally, the decrypted data is shown to be exactly the original and the null byte is presented in a typical octal representation.

Make the application:
gg -q

Run it the same as before:
gg -r --req="/encrypt" --silent-header --exec

The result is:
6bea45c2f901c0913c87fccb9b347d0a
This c\000ontai

The encrypted value is shorter because the data is shorter in this case too, and the result matches exactly the original.
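The sizes of the encrypted values in this article are consistent with a 16-byte block cipher that always pads the data up to the next block boundary. The sketch below assumes that kind of padding (PKCS#7-style); the article doesn't state which mode or padding Golf uses by default, so treat this as an estimate rather than a specification. The name "padded_length" is made up for this example.

```python
BLOCK_SIZE = 16  # AES block size in bytes

def padded_length(plain_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Encrypted length, assuming padding always adds 1..block_size bytes."""
    return (plain_len // block_size + 1) * block_size

# 12 bytes is the binary example, 50 bytes is the "Open Sesame" string
for n in (12, 16, 50):
    encrypted = padded_length(n)
    print(n, "bytes ->", encrypted, "bytes ->", 2 * encrypted, "hex characters")
```

Under this assumption, 12 bytes become 16 bytes (32 hexadecimal characters) and the 50-byte string becomes 64 bytes (128 hexadecimal characters), which matches the outputs shown.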
Use any cipher or digest
The encryption used by default is AES256 and SHA256 hashing from the standard OpenSSL library, both of which are widely used in cryptography. You can however use any available cipher and digest (i.e. hash) that is supported by OpenSSL (even the custom ones you provide).

To see which algorithms are available, do this in command line:
#get list of cipher providers
openssl list -cipher-algorithms

#get list of digest providers
openssl list -digest-algorithms

These two will provide a list of cipher and digest (hash) algorithms. Some of them may be weaker than the default ones chosen by Golf, and others may be there just for backward compatibility with older systems. Yet others may be quite new and may not have had enough time to be validated to the extent you want. So be careful when choosing these algorithms and be sure to know why you're changing the default ones. That said, here's an example of using Camellia-256 (i.e. "CAMELLIA-256-CFB1") encryption with "SHA3-512" digest. Replace the code in "encrypt.golf" with:
 begin-handler /encrypt public
     set-string str = "This contains a secret code, which is Open Sesame!"
     // Encrypt data
     encrypt-data str to enc_str password "my_password" \
         cipher "CAMELLIA-256-CFB1" digest "SHA3-512"
     p-out enc_str
     @
     // Decrypt data
     decrypt-data enc_str password "my_password"  to dec_str \
         cipher "CAMELLIA-256-CFB1" digest "SHA3-512"
     p-out dec_str
     @
 end-handler

Make the application:
gg -q

Run it:
gg -r --req="/encrypt" --silent-header --exec

In this case the result is:
f4d64d920756f7220516567727cef2c47443973de03449915d50a1d2e5e8558e7e06914532a0b0bf13842f67f0a268c98da6
This contains a secret code, which is Open Sesame!

Again, you get the original data. Note you have to use the same cipher and digest in both encrypt-data and decrypt-data!

You can of course produce the binary encrypted value just like before by using "binary" and "output-length" clauses.

If you've got external systems that encrypt data, and you know which cipher and digest they use, you can match those and make your code interoperable. Golf uses the standard OpenSSL library, so chances are that other software does too.
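The two digests named in this section differ in output size, which is easy to check outside Golf. The sketch below uses Python's "hashlib", where these algorithms are spelled "sha256" and "sha3_512"; it only illustrates the digests themselves, not how Golf or another system feeds them into key derivation.

```python
import hashlib

# Python's spelling of the digests used in this section
for name in ("sha256", "sha3_512"):
    digest = hashlib.new(name, b"my_password").digest()
    print(name, "->", len(digest) * 8, "bits")

# Both are always present in Python 3.6 and later
print("sha3_512" in hashlib.algorithms_guaranteed)
```

When matching an external system, both sides have to agree on the exact algorithm, since even the same password hashed with a different digest yields a different key.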
Using salt
To add a salt to encryption, use the "salt" clause. You can generate random salt by using the random-string statement (or random-crypto if there is a need). Here is the code for "encrypt.golf":
 begin-handler /encrypt public
     set-string str = "This contains a secret code, which is Open Sesame!"
     // Get salt
     random-string to rs length 16
     // Encrypt data
     encrypt-data str to enc_str password "my_password" salt rs
     @Salt used is <<p-out rs>>, and the encrypted string is <<p-out enc_str>>
     // Decrypt data
     decrypt-data enc_str password "my_password" salt rs to dec_str
     p-out dec_str
     @
 end-handler

Make the application:
gg -q

Run it a few times:
gg -r --req="/encrypt" --silent-header --exec
gg -r --req="/encrypt" --silent-header --exec
gg -r --req="/encrypt" --silent-header --exec

The result:
Salt used is VA9agPKxL9hf3bMd, and the encrypted string is 3272aa49c9b10cb2edf5d8a5e23803a5aa153c1b124296d318e3b3ad22bc911d1c0889d195d800c2bd92153ef7688e8d1cd368dbca3c5250d456f05c81ce0fdd
This contains a secret code, which is Open Sesame!
Salt used is FeWcGkBO5hQ1uo1A, and the encrypted string is 48b97314c1bc88952c798dfde7a416180dda6b00361217ea25278791c43b34f9c2e31cab6d9f4f28eea59baa70aadb4e8f1ed0709db81dff19f24cb7677c7371
This contains a secret code, which is Open Sesame!
Salt used is nCQClR0NMjdetTEf, and the encrypted string is f19cdd9c1ddec487157ac727b2c8d0cdeb728a4ecaf838ca8585e279447bcdce83f7f95fa53b054775be1bb2de3b95f2e66a8b26b216ea18aa8b47f3d177e917
This contains a secret code, which is Open Sesame!

As you can see, a random salt value (16 bytes long in this case) is generated for each encryption, and the encrypted value is different each time, even though the data being encrypted was the same! This makes it difficult to crack encryption like this.

Of course, to decrypt, you must record the salt and use it exactly as you did when encrypting. In the code here, variable "rs" holds the salt. If you store the encrypted values in the database, you'd likely store the salt right next to it.
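One simple way to keep the salt next to the encrypted value is to store both in a single record. The sketch below uses a JSON document in Python, with the first salt and encrypted string from the output above; the field names "salt" and "data" and the helper names are arbitrary choices for this illustration.

```python
import json
from typing import Tuple

def pack_record(salt: str, encrypted_hex: str) -> str:
    """Store the salt and the encrypted value together as one JSON string."""
    return json.dumps({"salt": salt, "data": encrypted_hex})

def unpack_record(record: str) -> Tuple[str, str]:
    """Get back the salt and the encrypted value needed for decryption."""
    obj = json.loads(record)
    return obj["salt"], obj["data"]

# Values taken from the first run shown above
salt = "VA9agPKxL9hf3bMd"
encrypted = (
    "3272aa49c9b10cb2edf5d8a5e23803a5aa153c1b124296d318e3b3ad22bc911d"
    "1c0889d195d800c2bd92153ef7688e8d1cd368dbca3c5250d456f05c81ce0fdd"
)

record = pack_record(salt, encrypted)
print(record)
print(unpack_record(record) == (salt, encrypted))
```

The salt doesn't have to be kept secret, so storing it in the clear alongside the data is fine; the password is what must stay private.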
Initialization vector
In practice, you wouldn't use a different salt value for each message. It creates a new key every time, and that can reduce performance. And there's really no need for it: the use of salt is to make each key (even the same ones) much harder to guess. Once you've done that, you might not need to do it again, or often.

Instead, you'd use an IV (Initialization Vector) for each message. It's usually a random string that makes the same messages appear different when encrypted, so an observer can't tell that a message was repeated. Here is the new code for "encrypt.golf":
 begin-handler /encrypt public
     // Get salt
     random-string to rs length 16
     // Encrypt data
     start-loop repeat 10 use i start-with 0
         random-string to iv length 16
         encrypt-data "The same message" to enc_str password "my_password" salt rs iterations 2000 init-vector iv cache
         @The encrypted string is <<p-out enc_str>>
         // Decrypt data
         decrypt-data enc_str password "my_password" salt rs iterations 2000 init-vector iv to dec_str cache
         p-out dec_str
         @
     end-loop
 end-handler

Make the application:
gg -q

Run it a few times:
gg -r --req="/encrypt" --silent-header --exec
gg -r --req="/encrypt" --silent-header --exec
gg -r --req="/encrypt" --silent-header --exec

The result may be:
The encrypted string is 787909d332fd84ba939c594e24c421b00ba46d9c9a776c47d3d0a9ca6fccb1a2
The same message
The encrypted string is 7fae887e3ae469b666cff79a68270ea3d11b771dc58a299971d5b49a1f7db1be
The same message
The encrypted string is 59f95c3e4457d89f611c4f8bd53dd5fa9f8c3bbe748ed7d5aeb939ad633199d7
The same message
The encrypted string is 00f218d0bbe7b618a0c2970da0b09e043a47798004502b76bc4a3f6afc626056
The same message
The encrypted string is 6819349496b9f573743f5ef65e27ac26f0d64574d39227cc4e85e517f108a5dd
The same message
The encrypted string is a2833338cf636602881377a024c974906caa16d1f7c47c78d9efdff128918d58
The same message
The encrypted string is 04c914cd9338fcba9acb550a79188bebbbb134c34441dfd540473dd8a1e6be40
The same message
The encrypted string is 05f0d51561d59edf05befd9fad243e0737e4a98af357a9764cba84bcc55cf4d5
The same message
The encrypted string is ae594c4d6e72c05c186383e63c89d93880c8a8a085bf9367bdfd772e3c163458
The same message
The encrypted string is 2b28cdf5a67a5a036139fd410112735aa96bc341a170dafb56818dc78efe2e00
The same message

You can see that the same message appears different when encrypted, though when decrypted it's again the same. Of course, the password, salt, number of iterations, and init-vector must be the same for both encryption and decryption.

Note the use of the "cache" clause in encrypt-data and decrypt-data. It effectively caches the computed key (given password, salt, cipher/digest algorithms and number of iterations), so it's not computed each time through the loop. With "cache" the key is computed once, and then a different IV (in the "init-vector" clause) is used for each message.

If you want to occasionally rebuild the key, use "clear-cache" clause, which supplies a boolean. If true, the key is recomputed, otherwise it's left alone. See encrypt-data for more on this.
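The idea behind caching the key can be sketched in Python with a memoized derivation function. This is an analogy only: PBKDF2 and "lru_cache" are assumptions for the illustration and say nothing about how Golf implements the "cache" and "clear-cache" clauses.

```python
import functools
import hashlib

@functools.lru_cache(maxsize=None)
def derive_key(password: str, salt: bytes, iterations: int) -> bytes:
    """Expensive key derivation, computed once per distinct set of inputs."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)

salt = b"0123456789abcdef"  # hypothetical fixed salt for this sketch

# Ten messages, one key: only the first call does the actual work
for _ in range(10):
    key = derive_key("my_password", salt, 2000)

info = derive_key.cache_info()
print("computed:", info.misses, "reused:", info.hits)

# Forcing the key to be rebuilt next time, similar in spirit to "clear-cache"
derive_key.cache_clear()
```

The saving grows with the number of iterations, since that is the part of the work that is skipped on every reuse.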
Conclusion
You have learned how to encrypt and decrypt data using different ciphers, digests, salt and IV values in Golf. You can also create a human-readable encrypted value and a binary output, as well as encrypt both strings and binary values (like documents).

Wednesday, December 25, 2024

Golf 136 released

  • Any number expression can now use a string subscript as a number, for instance:
    set-string str='hello'
    set-number num = 10+str[0]

    A character is treated as an unsigned number ranging from 0-255 (i.e. unsigned byte).

Tuesday, December 24, 2024

Golf 132 released

  • Individual bytes of a string (binary or text) can now be set using set-string by specifying the byte with a number expression within []. Since Golf is a memory-safe language, setting a byte this way is subject to a range check. For instance:
    set-string str[10] = 'a'
  • An individual byte of a string (binary or text) can now be obtained (as a number) with set-number using a number expression within []. Since Golf is a memory-safe language, getting a byte this way is subject to a range check. For instance:
    set-number byte = str[10]
Note that Golf is a very high level language, and it generally does not start with low-level constructs, such as setting and retrieving bytes from memory; rather its statements perform tasks that take typically many lines of code in other languages. So it makes sense an addition like this would be a "side-note" undertaken later in the language; it's not the focus of it. Still, Golf is also a high performance language and so the above two new capabilities are implemented with that in mind, with the minimum of overhead.  

Sunday, December 15, 2024

Distributed computing made easy




What is distributed computing
Distributed computing is two or more servers communicating for a common purpose. Typically, some tasks are divvied up between a number of computers, and they all work together to accomplish it. Note that "separate servers" may mean physically separate computers. It may also mean virtual servers such as Virtual Private Servers (VPS) or containers, that may share the same physical hardware, though they appear as separate computers on the network.

There are many reasons why you might need this kind of setup. It may be that resources needed to complete the task aren't all on a single computer. For instance, your application may rely on multiple databases, each residing on a different computer. Or, you may need to distribute requests to your application because a single computer isn't enough to handle them all at the same time. In other cases, you are using remote services (like a REST API-based one, for instance), and those by nature reside somewhere else.

In any case, the computers comprising your distributed system may be on a local network, or they may be worldwide, or some combination of those. The throughput (how many bytes per second can be exchanged via network) and latency (how long it takes for a packet to travel via network) will obviously vary: for a local network you'd have a higher throughput and lower latency, and for Internet servers it will be the opposite. Plan accordingly based on the quality of service you'd expect.
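A rough first-order estimate of how throughput and latency combine is the latency plus the payload size divided by the throughput. The sketch below uses made-up figures for a local network and an Internet link, purely to show the arithmetic; real numbers depend on your network, and "transfer_time" is a hypothetical helper.

```python
def transfer_time(size_bytes: float, throughput_bytes_per_s: float, latency_s: float) -> float:
    """Estimated seconds to deliver a payload: latency plus transmission time."""
    return latency_s + size_bytes / throughput_bytes_per_s

payload = 1_000_000  # 1 MB reply, hypothetical

# Hypothetical local network: 100 MB/s throughput, 0.5 ms latency
lan = transfer_time(payload, 100_000_000, 0.0005)

# Hypothetical Internet link: 5 MB/s throughput, 50 ms latency
wan = transfer_time(payload, 5_000_000, 0.05)

print(f"local network: {lan:.4f} s")
print(f"Internet:      {wan:.4f} s")
```

For small payloads latency dominates, and for large ones throughput does, which is worth keeping in mind when deciding how chatty the calls between servers should be.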
How servers communicate
Depending on your network(s) setup, different kinds of communication are called for. If two servers reside on a local network, then they would typically use the fastest possible means of communication. A local network typically means a secure network, because nobody else has access to it but you. So you would not need TLS/SSL or any other kind of secure protocol as that would just slow things down.

If two servers are on the Internet though, then you must use a secure protocol (like TLS/SSL or some other) because your communication may be spied on, or worse, affected by man-in-the-middle attacks.
Local network distributed computing
Most of the time, your distributed system would be on a local network. Such network may be separate and private in a physical sense, or (more commonly) in a virtual sense, where some kind of a Private Cloud Network is established for you by the Cloud provider. It's likely that separation is enforced by specialized hardware (such as routers and firewalls) and secure protocols that keep networks belonging to different customers separate. This way, a "local" network can be established even if computers on it are a world apart, though typically they reside as a part of a larger local network.

Either way, as far as your application is concerned, you are looking at a local network. Thus, the example here will be for such a case, as it's most likely what you'll have. A local network means different parts of your application residing on different servers will use some efficient protocol based on TCP/IP. One such protocol is FastCGI, a high-performance binary protocol for communication between servers, clients, and in general programs of all kinds, and that's the one used by Golf. So in principle, the setup will look like this (there'll be more details later):


Next, in theory you should have two servers, however in this example both servers will be on the same localhost (i.e. "127.0.0.1"). This is just for simplicity; the code is exactly the same if you have two different servers on a local network - simply use another IP (such as "192.168.0.15" for instance) for your "remote" server instead of local "127.0.0.1". The two servers do not even necessarily need to be physically two different computers. You can start a Virtual Machine (VM) on your computer and host another virtual computer there. Popular free software like VirtualBox or KVM Hypervisor can help you do that.

In any case, in this example you will start two simple application servers; they will communicate with one another. The first one will be called "local" and the other one "remote" server. The local application server will make a request to the remote one.
Local server
On a local server, create a new directory for your local application server source code:
mkdir ~/local_server
cd ~/local_server

and then create a new file "status.golf" with the following:
 begin-handler /status public
     silent-header
     get-param server
     get-param days

     pf-out "/server/remote-status/days=%s", days to payload
     pf-out "%s:3800", server to srv_location

     new-remote srv location srv_location \
         method "GET" url-path payload \
         timeout 30

     call-remote srv
     read-remote srv data dt
     @Output is: [<<p-out dt>>]
 end-handler

The code here is very simple. new-remote will create a new connection to a remote server, running on the IP address given by input parameter "server" (and obtained with get-param) on TCP port 3800. The URL payload created in string variable "payload" is passed to the remote server. If it doesn't reply in 30 seconds, then the call would time out. Then you're using call-remote to actually make a call to the remote server (which is served by application "server" and by request handler "remote-status.golf" below), and finally read-remote to get the reply from it. For simplicity, error handling is omitted here, but you can easily detect a timeout, any network errors, any errors from the remote server, including error code and error text, etc. See the above statements for more on this.
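To make the two strings built by pf-out more concrete, here is a small Python sketch that constructs the same location and URL path and then splits the path into the pieces described above. It doesn't make any network call, and the functions "build_call" and "split_url_path" are hypothetical helpers for illustration, not Golf's own parsing.

```python
from typing import Dict, Tuple

def build_call(server: str, days: str, port: int = 3800) -> Tuple[str, str]:
    """Mirror the two pf-out lines: where to connect and what to request."""
    location = f"{server}:{port}"
    url_path = f"/server/remote-status/days={days}"
    return location, url_path

def split_url_path(url_path: str) -> Tuple[str, str, Dict[str, str]]:
    """Split a path into application name, request handler and parameters."""
    parts = url_path.strip("/").split("/")
    app, handler = parts[0], parts[1]
    params = dict(p.split("=", 1) for p in parts[2:])
    return app, handler, params

location, url_path = build_call("127.0.0.1", "18")
print(location)
print(url_path)
print(split_url_path(url_path))
```

Read this way, the path names the application ("server"), the request handler ("remote-status") and the input parameter ("days") that the remote side picks up with get-param.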
Make and start the local server
Next, create a local application:
gg -k client

Make the application (i.e. compile the source code and build the native executable):
gg -q

Finally, start the local application server:
mgrg -w 2 client

This will start 2 server instances of a local application server.
Remote server
Okay, now you have a local server. Next, you'll set up a remote server. Log in to your remote server and create a new directory for your remote application server:
mkdir ~/remote_server
cd ~/remote_server

Then create file "remote-status.golf" with this code:
 begin-handler /remote-status public
     silent-header
     get-param days

     pf-out "Status in the past %s days is okay", days
 end-handler

This is super simple, and it just replies that the status is okay; it accepts the number of days to check for status and displays that back. In a real service, you might query a database to check for status (see run-query).
Make and start remote server
First create your application:
gg -k server

Then make your program:
gg -q

And finally start the server:
mgrg -w 2 -p 3800 server

This will start 2 daemon processes running as background servers. They will serve requests from your local server.

Note that if you're running this example on different computers, some Linux distributions come with a firewall, and you may need to use ufw or firewall-cmd to make port 3800 accessible here. Also if you're using SELinux on this server, you may either need to allow binding to port 3800, or make SELinux permissive (with "sudo setenforce 0").
Run distributed calls
There are a number of ways you can call the remote service you created. These are calls made from your local server, so change directory to it:
cd ~/local_server

Here are various ways to call the remote application server:
  • Execute a command-line program on local server that calls remote application server:


    To do this, use "-r" option of gg utility to generate shell commands you can use to call your program:
    gg -r --req "/status/days=18/server=127.0.0.1" --exec

    Here, you're saying that you want to make a request "status" (which is in source file "status.golf" on your local server). You are also saying that input parameter "days" should have a value of "18" and also that input parameter "server" should have a value of "127.0.0.1" - see get-param statements in the above file "status.golf". If you actually have a different server with a different IP, use it instead of "127.0.0.1".
    The result will be:
    Output is: [Status in the past 18 days is okay]

    where the part in between "[..]" comes from the remote server, and the "Output is: " part comes from the command line Golf program you executed.
  • Call remote application server directly from a command-line program:


    Do this:
    gg -r --req "/status/days=18/server=127.0.0.1" --exec --service --remote="127.0.0.1:3800"

    The result is, as expected:
    Status in the past 18 days is okay

    In this case, the output comes straight from the remote server, so the "Output is: " part is missing. The above simply copies the output from a remote service to the standard output.
  • Use a command-line utility to contact local application server, which then calls the remote server, which replies back to local application server, which replies back to your command-line utility:


    You will use cgi-fcgi to do this:
    gg -r --req "/status/server=127.0.0.1/days=10" --exec --service

    The result is:
    Output is: [Status in the past 10 days is okay]

    which is what you'd expect. In this case we first send a request to your local application server, which sends it to a remote service, so there is "Output is: " output.
You have different options when designing your distributed systems, and this article shows how easy it is to implement them.