Incompressible Encodings
>> Hello, my name is Daniel Wichs, I'm a senior scientist at NTT research and a professor at Northeastern University. Today I want to tell you about incompressible encodings. This is a recent work from Crypto 2020 and it's a joint work with Tal Moran. So let me start with a question. How much space would it take to store all of Wikipedia? So it turns out that you can download Wikipedia for offline use and some reasonable version of it is about 50 gigabytes in size. So as you'd expect, it's a lot of data, it's quite large. But there's another way to store Wikipedia which is just to store the link www.wikipedia.org that only takes 17 bytes. And for all intents and purposes as long as you have a connection to the internet storing this link is as good as storing the Wikipedia data. You can access a Wikipedia with this link whenever you want. And the point I want to make is that when it comes to public data like Wikipedia, even though the data is huge, it's trivial to compress it down because it is public just by storing a small link to it. And the question for this talk is, can we come up with an incompressible representation of public data like Wikipedia? In other words can we take Wikipedia and represent it in some way such that this representation requires the full 50 gigabytes of storage store, even for someone who has the link to the underlying Wikipedia data and can get the underlying data for free. So let me actually tell you what this means in more detail. So this is the notion of incompressible encodings that we'll focus on in this work. So incompressible encoding consists of an encoding algorithm and a decoding algorithm, these are public algorithms. There's no secret key. Anybody can run these algorithms. The encoding algorithm takes some data m, let's say the Wikipedia data and encodes it in some probabilistic randomized way to derive a codeword c. And the codeword c, you can think of it as just an alternate representation of the Wikipedia data. Anybody can come and decode the codeword to recover the underlying data m. And the correctness property we want here is that no matter what data you start with, if you encode the data m and then decode it, you get back the original data m. This should hold with probably one over the randomness of the encoding procedure. Now for security, we want to consider an adversary that knows the underlying data m, let's say has a link to Wikipedia and can access the Wikipedia data for free does not pay for storing it. The goal of the adversary is to compress this codeword that we created this new randomized representation of the Wikipedia data. So the adversary consists of two procedures a compression procedure and a decompression procedure. The compression procedure takes its input the codeword c and output some smaller compressed value w and the decompression procedure takes w and its goal is to recover the codeword c. And a security property says that no efficient adversary should be able to succeed in this game with better than negligible property. So there are two parameters of interest in this problem. One is the codeword size, which we'll denote by alpha, and ideally we want the codeword size alpha to be as close as possible to the original data size. In other words we don't want the encoding to add too much overhead to the data. The second parameter is the incompressibility parameter beta and that tells us how much space, how much storage and adversary needs to use in order to store the codeword. And ideally, we want this beta to be as close as possible to the codeword size alpha, which should also be as close as possible to the original data size. So I want to mention that there is a trivial construction of incompressible encodings that achieves very poor parameters. So the trivial construction is just take the data m and add some randomness, concatenate some randomness to it and store the original data m plus the concatenated randomness as the codeword. And now even an adversary that knows the underlying data m cannot compress the randomness. So the incompressibility, so we ensure that this construction is incompressible with incompressibility parameter beta that just corresponds to the size of this randomness we added. So essentially the adversary cannot compress the red part of the codeword. So this gets us a scheme where alpha the size of the codeword, is the original data size m plus the incompressible parameter beta. And it turns out that you cannot do better than this information theoretically. So this is not what we want for this we want to focus on what I will call good incompressible encodings. So here, the codeword size should be as close as possible to the data size, should be just one plus little o one of the data size. And the incompressibility should be as essential as large as the entire codeword the adversary cannot compress the codeword almost at all, the incompressible parameter beta is one minus little o one of the data size or the codeword size. And in essence, what this means is that we're somehow want to take the randomness of the encoding procedure and spread it around in some clever way throughout the codeword in such a way that's impossible for the adversary to separate out the randomness and the data, and only store the randomness and rely on the fact that it can get the data for free. We want to make sure it's impossible that adversary accesses essentially this entire code word which contains both the randomness and data and some carefully intertwined way and cannot compress it down using the fact that it knows the data parts. So this notion of incompressible encodings was defined actually in a prior work of Damgard-Ganesh and Orlandi from crypto 2019. They defined a variant of this notion, they had a different name for it. As a tool or a building block for a more complex cryptographic primitive that they called Proofs of Replicated Storage. And I'm not going to talk about what these are. But in this context of constructing these Proofs of Replicated Storage, they also constructed incompressible encodings albeit with some major caveats. So in particular, their construction relied on the random Oracle models, the heuristic construction and it was not known whether you could do this in the standard model, the encoding and decoding time of the construction was quadratic in the data size. And in particular, here we want to apply this, we want to use these types of incompressible encodings on fairly large data like Wikipedia data, 50 gigabytes in size. So quadratic runtime on such huge data is really impractical. And lastly the proof of security for their construction was flawed or someone incompleted, didn't consider general adversaries. And the slope was actually also noticed by concurrent work of Garg-Lu and Waters. And they managed to give a fixed proof for this construction but this required actually quite a lot of effort. It was a highly non-trivial and subtle proof to proof the original construction of Damgard-Ganesh and Orlandi secure. So in our work, we give a new construction of these types of incompressible encodings, our construction already achieved some form of security in the Common Reference String Model come Random String Model without the use of Random Oracles. We have a linear encoding time, linear in the data size. So we get rid of the quadratic and we have a fairly simple proof of security. In fact, I'm hoping to show you a slightly simplified form of it and the stock. We also give some lower bounds and negative results showing that our construction is optimal in some aspects and lastly we give a new application of this notion of incompressible encodings to something called big-key cryptography. And so I want to tell you about this application, hopefully it'll give you some intuition about why incompressible encodings are interesting and useful, and also some intuition about what their real goal is or what it is that they're trying to achieve. So, the application of big-key cryptography is concerned with the problem of system compromise. So, a computer system can become compromised either because the user downloads a malware or remote attacker manages to hack into it. And when this happens, the remote attacker gains control over the system and any cryptographic keys that are stored on the system can easily be exfiltrated or just downloaded out of the system by the attacker and therefore, any security that these cryptographic keys were meant to provide is going to be completely lost. And the idea of big-key cryptography is to mitigate against such attacks by making the secret keys intentionally huge on the order of many gigabytes to even terabytes. And the idea is that by having a very large secret key it would make it harder to exfiltrate such a secret key. Either because the adversary's bandwidth to the compromised system is just not large enough to exfiltrate such a large key or because it might not be cost-effective to have to download so much data of compromised system and store so much data to be able to use the key in the future, especially if the attacker wants to do this on some mass scale or because the system might have some other mechanisms let's say firewall that would detect such large amounts of leakage out of the compromised system and block it in some way. So there's been a lot of work on this idea building big-key crypto systems. So crypto systems where the secret key can be set arbitrarily huge and these crypto systems should testify two goals. So one is security, security should hold even if a large amount of data about the secret key is out, as long as it's not the entire secret key. So when you have an attacker download let's say 90% of the data of the secret key, the security of the system should be preserved. And the second property is that even though the secret key of the system can be huge, many gigabytes or terabytes, we still want the crypto system to remain efficient even though the secret is huge. And particularly this means that the crypto system can even read the entire secret key during each cryptographic operation because that would already be too inefficient. So it can only read some small number of bits of the secret key during each operation, then it performs. And so there's been a lot of work constructing these types of crypto systems but one common problem for all these works is that they require the user to waste a lot of their storage the storage on their computer in storing this huge secret key which is useless for any other purpose, other than providing security. And users might not want to do this. So that's the problem that we address here. And the new idea in our work is let's make the secret key useful instead of just having a secret key with some useless, random data that the cryptographic scheme picks, let's have a secret key that stores let's say the Wikipedia data at which a user might want to store in their system anyway or the user's movie collection or music collection et cetera and the data that the user would want to store on their system. Anyway, we want to convert it. We want to use that as the secret key. Now we think about this for a few seconds. Well, is it a good idea to use Wikipedia as a secret key? No, that sounds like a terrible idea. Wikipedia is not secret, it's public, it's online, Anyone can access it whenever they want. So it's not what we're suggesting. We're suggesting to use an incompressible encoding of Wikipedia as a secret key. Now, even though Wikipedia is public the incompressible encoding is randomized. And therefore the accuracy does not know the value of this incompressible encoding. Moreover, because it's incompressible in order for the adversary to steal, to exfiltrate the entire secret key, it would have to download a very large amount of data out of the compromised system. So there's some hope that this could provide security and we show how to build public encryption schemes and the setting that make use of a secret key which is an incompressible coding of some useful data like Wikipedia. So the secret key is an incompressible encoding of useful data and security ensures that the adversary will need to exfiltrate almost entire key to break the security of this critical system. So in the last few minutes, let me give you a very brief overview of our construction of incompressible encodings. And for this part, we're going to pretend we have something a real beautiful cryptographic object called Lossy Trapdoor Permutations. It turns out we don't quite have an object that's this beautiful and in the full construction, we relax this notion somewhat in order to be able to get our full construction. So Lossy Trapdoor Permutation is a function f we just key by some public key pk and it maps end bits to end bits. And we can sample the public key in one of two indistinguishable modes. In injective mode, this function of fPK is a permutation, and there's in fact, a trapdoor that allows us to invert it efficiently. And in the Lossy mode, if we sample the public in Lossy mode, then if we take some value, random value x and give you fpk of x, then this loses a lot of information about x. And in particular, the image size of the function is very small, much smaller than two to the n and so fpk of x does not contain all the information about x. Okay, so using this type of Lossy Trapdoor Permutation, here's the encoding of a message m using long random CRS come random string. So the encoding just consists of sampling the public key of this Lossy Trapdoor Permutation in injected mode, along with the trapdoor. And the encoding is just going to take the message m, x over it with a common reference string, come random string and invert the trapdoor permutation on this value. And then Coding will just be the public key and the inverse x. So this is something anybody can decode by just taking fpk of x, x over it with the CRS. And that will recover the original message. Now, to add the security, we're going to in the proof, we're going to switch to choosing the value x uniformly at random. So the x component of the codeword is going to be chosen uniformly random and we're going to set the CRS to be fpk of x, x over the message. And if you look at it for a second this distribution is exactly equivalent. It's just a different way of sampling the exact same distribution. And in particular, the relation between the CRS and X is preserved. Now in the second step, we're going to switch the public key to Lossy mode. And now when we do this, then the Codeword part, sorry then the CRS fpk of x, x over m only leaks some small amount of information about the random value x. In other words, even if that resists these, the CRS then the value x and the codeword has a lot of entropy. And because it has a lot of entropy it's incompressible. So what we did here is that we actually start to show that the code word and the CRS are indistinguishable from a different way of sampling them where we placed information about the message and the CRS and the codeword actually is truly random, has a lot of real entropy. And therefore even given the CRS the Codeword is incompressible that's the main idea behind the proof. I just want to make two remarks, our full constructions rely on a relaxed notion of Lossy Trapdoor Permutations which we're able to construct from either the decisional residuoisity or the learning with errors assumption. So in particular, we don't actually know how to construct trapdoor permutations from LWE from any postquantum assumption but the relaxed notion that we need for our actual construction, we can achieve from post quantum assumptions that get post quantum security. I want to mention two caveats of the construction. So one is that in order to make this work, the CRS needs to be long essentially as long as the message size. And also this construction achieves a weak form of selective security where the adversary decides to choose the message before seeing the CRS. And we show that both of these caveats are inherent. We show this by black-box separation and one can overcome them only in the random oracle model. Unless I want to just end with an interesting open question. I think one of the most interesting open questions in this area all of the constructions of incompressible encodings from our work and prior work required the use of some public key crypto assumptions some sort of trapdoor permutations or trapdoor functions. And one of the interesting open question is can you construct and incompressible encodings without relying on public key crypto, using one way functions or just the random oracle model. We conjecture this is not possible, but we don't know. So I want to end with that open questions and thank you very much for listening.
SUMMARY :
in order for the adversary to steal,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Daniel Wichs | PERSON | 0.99+ |
second step | QUANTITY | 0.99+ |
NTT | ORGANIZATION | 0.99+ |
two caveats | QUANTITY | 0.99+ |
17 bytes | QUANTITY | 0.99+ |
50 gigabytes | QUANTITY | 0.99+ |
two remarks | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
two procedures | QUANTITY | 0.99+ |
Wikipedia | ORGANIZATION | 0.99+ |
www.wikipedia.org | OTHER | 0.99+ |
two goals | QUANTITY | 0.99+ |
second parameter | QUANTITY | 0.99+ |
second property | QUANTITY | 0.99+ |
each operation | QUANTITY | 0.99+ |
two parameters | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
Orlandi | PERSON | 0.98+ |
Tal Moran | PERSON | 0.97+ |
Today | DATE | 0.97+ |
one common problem | QUANTITY | 0.97+ |
One | QUANTITY | 0.97+ |
Garg-Lu | ORGANIZATION | 0.96+ |
Damgard-Ganesh | PERSON | 0.96+ |
Northeastern University | ORGANIZATION | 0.96+ |
two | QUANTITY | 0.95+ |
two indistinguishable modes | QUANTITY | 0.94+ |
Crypto 2020 | ORGANIZATION | 0.94+ |
about 50 gigabytes | QUANTITY | 0.94+ |
each cryptographic | QUANTITY | 0.94+ |
CRS | ORGANIZATION | 0.94+ |
Wikipedia | TITLE | 0.93+ |
90% of the data | QUANTITY | 0.89+ |
LWE | ORGANIZATION | 0.89+ |
Oracle | ORGANIZATION | 0.84+ |
terabytes | QUANTITY | 0.83+ |
Waters | ORGANIZATION | 0.79+ |
one way | QUANTITY | 0.77+ |
seconds | QUANTITY | 0.74+ |
Lossy Trapdoor | OTHER | 0.71+ |
Proofs of Replicated Storage | OTHER | 0.64+ |
2019 | DATE | 0.62+ |
second | QUANTITY | 0.56+ |
much | QUANTITY | 0.55+ |
lot of | QUANTITY | 0.54+ |
caveats | QUANTITY | 0.51+ |
gigabytes | QUANTITY | 0.48+ |
crypto | TITLE | 0.33+ |