> If you signal me, that this was understandable and want
> more, you will - i promise
Well, I do signal you that this was understandable and that I want more! In fact I hope that you'll deal us some good cards about "their delicate and secured data" and about the "lots of algorithms" used inside the black boxes... because we want to have some more light inside all those black and dark boxes :-)

Little essay about the various methods and viewpoints of crunching.

Part I: Introduction

By Joa Koester (JoKo2000(at)hotmail(point)com)

This essay is free to be copied and read and spread as long as you are decent enough to leave my name in it. If you have corrections, comments, better examples than the ones used in this essay, etc. - drop me a line.

But on with the essay...

When I recently visited fravia+'s page on the net, I was amazed by the knowledge packed together, the philosophy and the attitude contained in the writings. But I missed the **other side** a little bit. That is, us programmers, condemned to write software for banks, insurance companies etc., so they can make a lot of money, ripping a lot of people off. These companies are often serious about data hiding and are always eager to have their delicate data secured.

One way of securing data is crunching (packing) it. This has two valid points:

- the data is secure (coz the crunching algorithm is not made public)
- the data is more compact, meaning it can be transferred more easily; the risk of a disc producing a read/write error and vaporising personal data is definitely lower when the data only takes 50 KByte on a disk rather than 200 KByte (of course, if a read/write error happens exactly in these 50 KByte, the file is also gone ;)

This brings us to the question:

WHAT IS CRUNCHING?

Well, a pseudo-mathematical answer would be: everything that reduces a certain amount of data in its SPACE, but not in its INFORMATION (lossless crunching, that is; there are also some quality (information) reducing algorithms out there - jpeg for example).

So we have some data IN:

    AAAAACCCCCDDDDDDCCCCCefg

and we have a black box:

      /-------\
     /         \
    | Holy crap |
    | happening |
    |   here    |
    -------------

and we have more or less unreadable data OUT:

    @$/)!%A3yfg

So, what's the principle of the black box? Is there one Super-Algorithm that you are not allowed to know ("Behold my son, this knowledge is not for your eyes")? Of course not.

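To make the IN / black box / OUT picture a bit more concrete, here is a minimal sketch of one thing that could happen inside such a box, assuming plain run-length encoding. This is only an illustration: the essay does not say which algorithm produced the OUT data shown, and the names `rle_encode` and `rle_decode` are made up for this example.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


data_in = "AAAAACCCCCDDDDDDCCCCCefg"   # the IN data from the essay
packed = rle_encode(data_in)           # what this toy black box emits
restored = rle_decode(packed)          # the receiver undoes it

print(packed)
print(restored == data_in)             # lossless: nothing of the information is gone
```

The long runs shrink to one pair each, while the lone chars e, f and g each cost a full pair of their own, so not every part of the data gets smaller.
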
There are principles. And there are lots of algorithms, not just one. And you ALREADY know a lot of these principles from everyday life.

When you have to cross a busy road, you more or less wait for the traffic light to turn green. You stand there and see a red light. Maybe the light is shaped like a standing person or a cross or something else. But it is red and you already know that this means: Hey, stop, don't go. If you go, you are violating at least three different traffic laws, obstructing the cars and starting the next world war. Besides, you put your life in danger ;) And all this information is in just the little red light on the other side of the street.

Are all red lights this way? No. If you, for instance, are a musician and you are about to record a song, you will press record & play on your recorder and a RED light will show up, telling you that you had better not make thousands of mistakes and should record what you are doing properly. The red light can be a circle, a square, even a text, it doesn't matter. But it will be red!!!

Dr. Watson, what do you think?

Well... What we have here is a case of crunching: the various pieces of information are condensed into as few different symbols as possible. In both examples, only one symbol (the red light) is needed to get the MEANING (the information) transmitted.

Right, Watson, right. And could we switch the information contained in the symbols? That is, make the red light on the recorder tell us when to stop before a traffic light?

No. They are both red lights, that's true. But the red light on the recorder has nothing to do with crossing a road and the traffic light has nothing to do with us recording a song. The CONTEXT is different. The only thing that is similar is the color.

Hm, hm. Condensing information (many source symbols -> fewer destination symbols) and keeping the CONTEXT in mind. Sounds pretty good to me.

Kind of

    switch (context) {
        case traffic:
            if (red_Light) {...} else {...}
            break;
        case recording_music:
            if (red_Light) {...} else {...}
            break;
        default:
            No_Condensed_Symbols();
            break;
    }

ain't it?

In all crunching we will always have something that tells us which context to switch to, and because of this we will know how the following symbol(s) is/are to be interpreted.

Dr. Watson, do all interpretations depend on only one symbol?

Hm, I would say no. There may be cases where this is true, but in most cases there is more than one symbol defining exactly what's going on. There are crossroads where streets lead straight ahead and right, and there are crossroads where cars will drive left or straight ahead or right. This will depend on which part of the crossroad the car stands on, so that the traffic going straight ahead can go while the traffic turning right still has to wait for ITS specific traffic light to switch from red to green. Another example would be the position of the light. If the positions of the red light and the green light were switched, there would be some chaos, I bet.

Sounds reasonable, Watson. You say that there are symbols for a general context which are fine-tuned through other symbols defining the exact context to be interpreted?

Exactly.

But how do you think it is possible that all people know that they have to stop at a red sign and go at a green one?

Well, I would say that they know because someone told them. The parents, perhaps. Or they are taught so in school. In fact, to crunch and decrunch information correctly, both the sender and the receiver have to use the same way of interpreting the data. Society has ruled that a red traffic light is a STOP. And so traffic lights will switch to red when it is time to stop the traffic for one direction. And on the other side YOU get taught by society that you had better not run across the street when you have a red light, or else you play 'Frogger' for real...

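The two ideas from the dialogue (a symbol only means something inside a context, and both sides must interpret it the same way) can be sketched roughly like this. Everything here is hypothetical and only for illustration: the table `SHARED_MODEL`, the `ctx:` prefix used as a context-switch symbol, and the function `decode` are assumptions, not something taken from a real cruncher.

```python
# Hypothetical table that sender and receiver have both agreed on:
# (context, symbol) -> meaning.
SHARED_MODEL = {
    ("traffic", "red"): "stop and wait",
    ("traffic", "green"): "go",
    ("recording", "red"): "recording is running",
}


def decode(stream, model=SHARED_MODEL):
    """Interpret each symbol according to the most recent context switch."""
    context = None
    meanings = []
    for token in stream:
        if token.startswith("ctx:"):
            # A context-switch symbol: it changes how the following
            # symbols are read, but carries no meaning of its own.
            context = token[len("ctx:"):]
        else:
            meanings.append(model.get((context, token), "no condensed symbol"))
    return meanings


stream = ["ctx:traffic", "red", "green", "ctx:recording", "red"]
print(decode(stream))
# ['stop and wait', 'go', 'recording is running']
```

The same symbol "red" comes out with two different meanings, depending only on the context symbol that preceded it. A receiver holding a different table would decode the very same stream into something else.
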
So, put in one sentence: both the sender and the receiver use the same MODEL of data interpretation.

Dr. Watson, what if I would like to crunch the information transmitted in the red traffic light?

This would be nearly impossible. The whole meaning of the traffic light is already emitted in only one symbol (the red light, I mean now). There is a point where the amount of information can't be reduced any more without getting complications elsewhere. Imagine one would pack all three lights (red, green and yellow) into one light that would change its color depending on its actual state. OK, you would have less metal, only one light to look at and less glass to clean. The normal case would be condensed - not in interpretation but in material. The routine of Green - Yellow - Red - Yellow - Green... would stay. So far so good.

But traffic lights have the ability to send more complex signals, too. When, for example, there is a traffic jam ahead and the police notice this, they can (at least where I live) make the green and yellow lights blink together to signal an upcoming jam, so that the drivers can react to this signal. If all lights were built into one case, one would have to think of a special flashing pattern, flashing at a special speed, or something like that. Not very good. Especially for older drivers whose reaction times may be slower - they would concentrate more on interpreting the flashing signal than on the traffic itself, increasing the risk of causing an accident.

One other point would be the shape of the light. A standing man in red and a walking man in green would mean a complex shape of glass illuminated by complex circuitry. This would mean that if one part activated falsely, you would have, for example, a red man standing with one green leg walking. Very confusing, eh?

So condensing one thing past the point where its information content (also known as ENTROPY) is at its maximum leads to enlarging other parts, giving them biiiig overhead.

How do we know that this process is worth doing all this? Well, a certain student once came up with exactly this question and he answered it himself: it depends on the probability of the individual symbols. If some symbols appear statistically so often in our stream of perception (analyzing, reading buffer data, etc.) that we can condense them enough that, even with the enlargement of the other symbols (which are not so frequent), we get an overall crunching, then it's worth it. The name of the student was Huffman...

For example, you have:

    aaaaaaaaaabbbbbcccdd   (10 x a, 5 x b, 3 x c, 2 x d)

That is 20 chars. At 8 bits each you would have 20 x 8 bits = 160 bits. If you now transmitted the 'a' chars with 7 bits, left the 'b' chars at 8 bits, and transmitted the 'c' and 'd' chars with 9 bits, you would have

    10 x 7 =  70
     5 x 8 =  40
     3 x 9 =  27
     2 x 9 =  18
             ___
             155

So we would save 5 bits. Not much, but better than nothing. Of course you have to come up with an optimized coding for these values, as you wouldn't want to work out by hand which char you should code with which number of bits without getting it confused with the handling of the other chars. But Huffman found a perfect algorithm giving you the optimized value table for your chars (but this is for the next part ;)

To condense the intro a little bit:

- To crunch is a way to transmit a certain amount of information in fewer symbols than normally needed.
- How you crunch/decrunch depends on the actual context of the data actually received.
- Both the sender and the receiver will build up the same way of interpreting the data, building up the same model.
- When transforming long information symbols into shorter packages of symbols and thus reducing the output, we will face the case that there will be some (hopefully rare) symbols getting transformed into LONGER symbols.

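The bit counting from the aaaaaaaaaabbbbbcccdd example can be checked with a few lines. This is only a sketch that redoes the arithmetic; it does not build a Huffman table. The last value is an addition not found in the essay: assuming each char is coded on its own and only its frequency matters, the Shannon entropy gives a rough floor for how few bits any such code could need.

```python
from collections import Counter
from math import log2

# The example message: 10 x a, 5 x b, 3 x c, 2 x d
message = "a" * 10 + "b" * 5 + "c" * 3 + "d" * 2
counts = Counter(message)

# Plain storage: every char takes 8 bits.
fixed_bits = len(message) * 8

# Bit lengths from the essay's example ('b' simply keeps its 8 bits).
code_lengths = {"a": 7, "b": 8, "c": 9, "d": 9}
variable_bits = sum(counts[ch] * code_lengths[ch] for ch in counts)

# Assumption (not from the essay): entropy of the char frequencies,
# in bits for the whole message.
entropy_bits = sum(n * log2(len(message) / n) for n in counts.values())

print(fixed_bits, variable_bits, fixed_bits - variable_bits)   # 160 155 5
print(round(entropy_bits, 1))
```

The entropy figure comes out at roughly 35 bits, far below the 155 of the hand-made coding, which hints at how much room an optimized value table has to work with.
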
If we have totally random data, crunching also happens totally at random, making our efforts nil. That is, BTW, the reason why packing an already packed zip or rar file is useless in almost all cases - the data written by those packers is nearly perfectly random.

I hope you enjoyed this intro. If you signal me that this was understandable and that you want more, you will get it - I promise.

Greetings from a programmer

Joa