data formats, protocols, algorithms of the Web
Chapter K Basic resources, data formats, and protocols of the Web
Richard Joyce Montana
The various data formats and protocols of the Web are described in numerous places; however, we cover them together here for several reasons. It is worthwhile to review those you have seen before and to introduce those you may not have seen. It is also useful to see them side by side, since any given task carried out over the Web will typically pass its information through more than one encoding format and often transmit it via more than one protocol.
First of all, what do we mean when we say “data format”? As we move through the world we are constantly presented with an enormous amount of information. When you walk to your car you may see trees and sky and grass, a city sidewalk scene, your car outside, and part of its inside as you see through the windshield, the keys in your hand… the list can go on as far and with as much detail as you like. “Detail” is at the heart of the matter. We factor out the very few bits of information that are important to our task at hand, say the position and distance of the lock, the choice of key and its orientation, and we throw all the rest away. We may have observed a passing bicyclist, but it is forgotten along with 99.9999’ish percent of all the other information. We pick out a few pertinent pieces of information, represent them symbolically, then reason with those symbols to produce knowledge or action that is relevant - that is, which helps us in the real world. This ability is certainly one of the greatest achievements of the human mind.
What symbols shall we use, what will they mean, and how will we manipulate them? In other words, what algebraic structure shall we use, for what is an algebra but a set along with operation(s) which combine elements of the set to produce other elements in the set? Perhaps at this point you are thinking about Boolean Algebra. At the lowest level we use the set containing 1 and 0, on which 16 distinct two-input operations can be defined, though we don’t need most of them. Working directly at this level would be quite tedious, so we choose sets of other symbols which are closer to our understanding and use software to automatically convert them to and from 1’s and 0’s.
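The count of 16 can be checked by brute force. The following is a minimal sketch rather than anything prescribed by the text: it treats an operation as a rule assigning an output of 0 or 1 to each of the four possible input pairs, and the names `inputs`, `operations`, and the three example tables are chosen here for illustration only.

```python
from itertools import product

# The four possible input pairs: (0, 0), (0, 1), (1, 0), (1, 1).
inputs = list(product((0, 1), repeat=2))

# A two-input operation is fully described by its output for each pair,
# so every tuple of four bits is one distinct operation.
operations = list(product((0, 1), repeat=len(inputs)))

# A few familiar operations, written as output tables in the same order.
and_table = tuple(a & b for a, b in inputs)
or_table = tuple(a | b for a, b in inputs)
xor_table = tuple(a ^ b for a, b in inputs)

print(len(operations))          # number of distinct two-input operations
print(and_table in operations)  # AND is one of them
```

Most of the 16 tables (for example, the one that always outputs 0) are rarely useful on their own, which is consistent with the remark that we don’t need most of them.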
We define data formats based on the type of information we represent using them. Thus we will have one data format for text, a different format for pictures, another for audio, and so on. They are further subdivided, so we may have one format for spoken audio and a different one for music. We have in fact gotten quite carried away: there are, for example, several different audio formats depending on what type of device you use, how much space you have, where you might transmit the audio at some future date, and even the company for which you work. It is safe to say that the various data formats in use number in the thousands. So what is worthy of our time? The rest of this chapter covers the most common data formats in use over the Web for text, images, video, and audio.
We should take a moment to draw a distinction between a data format and a file format. It is common, though not required, that a file holds a single type of data, often along with some extra information. For this text we will use the two notions interchangeably. An image may be stored in gif, jpg, png, or tif format, and each of these would correspond to a single file.
We also note that complete format specifications are often designed by committee and are almost always broad and burdensome, as they try to be too accommodating. In reality, there is usually a small set of pathways (sometimes just one) through each specification which accounts for virtually all usage. We ignore extensions and ‘features’ which are largely unused and/or repetitive and which add to the learning burden without benefit.
Say we see the following pattern of 1’s and 0’s:
0100 0001 0100 0010 0100 0011 0100 0100
What is it? The answer depends on which data format we consider it to be. Without context, or the choice of a data format, the string has no meaning whatsoever. If it is in an ASCII text file, it is the string “ABCD”. If we find it in an image file, it is probably a translucent grey pixel. In a binary file, it may be the number 1,094,861,636. For the common types of data we now look at the most common file formats.
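The three readings of that bit pattern can be reproduced in a few lines. This is a sketch under stated assumptions: the pixel reading assumes four 8-bit channels in red, green, blue, alpha order, and the integer reading assumes an unsigned 32-bit value with the most significant byte first. Neither assumption is stated in the text, and the variable names are illustrative.

```python
bits = "0100 0001 0100 0010 0100 0011 0100 0100"

# Drop the spaces and regroup the 32 bits into four 8-bit bytes.
bitstring = bits.replace(" ", "")
data = bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))

# Reading 1: ASCII text, one character per byte.
as_text = data.decode("ascii")

# Reading 2: one pixel, assuming (red, green, blue, alpha) channel order.
as_pixel = tuple(data)

# Reading 3: an unsigned 32-bit integer, assuming big-endian byte order.
as_int = int.from_bytes(data, byteorder="big")

print(as_text)   # ABCD
print(as_pixel)  # (65, 66, 67, 68)
print(as_int)    # 1094861636
```

The pixel values show why the text calls it a translucent grey: the three colour channels are nearly equal and fairly dark, and an alpha of 68 out of 255 is mostly transparent.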
Text
One of the earliest and still very widely used formats for representing textual data is ASCII. 7-bit ASCII uses the low-order 7 bits of a byte to represent a text character. Seven positions, each of which can be either a ‘1’ or a ‘0’, give us 128 different combinations, and therefore 128 different characters which can be represented. This worked (and still works) pretty well for text in the English language. 7-bit ASCII encodes all the symbols you see on your keyboard plus a few more. Figure xxx is a chart which shows each of the 128 ASCII bit patterns and what it represents.
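As a small illustration of the mapping between characters and bit patterns, the sketch below converts a character to its ASCII code and prints it as a byte, written as two groups of four bits to match the notation used in this chapter. The helper name `ascii_bits` is hypothetical, introduced only for this example.

```python
NUM_CODES = 2 ** 7  # seven bits, each 0 or 1, give 128 combinations


def ascii_bits(ch: str) -> str:
    """Return the byte for a 7-bit ASCII character as 'xxxx xxxx'."""
    code = ord(ch)
    if code >= NUM_CODES:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    bits = format(code, "08b")  # pad to a full byte; the high bit stays 0
    return bits[:4] + " " + bits[4:]


for ch in "A$z":
    print(repr(ch), ord(ch), ascii_bits(ch))
# 'A' 65 0100 0001
# '$' 36 0010 0100
# 'z' 122 0111 1010
```

Because only the low-order 7 bits are used, the leftmost bit of every ASCII byte is 0.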
Remember that every file on your computer is filled with a bunch of 1’s and 0’s. These are grouped 8 at a time into ‘bytes’. An image, a text file, an mp3 recording, a movie, an sms message… every single file contains only a series of bytes, each of which is a grouping of eight 1’s and/or 0’s.
Notice that the characters numbered from 32 up to 126 are all found on the keys of your keyboard. For example, the dollar sign character ‘$’ is represented by the bit string 0010 0100. The characters numbered below 32, along with 127 (delete), are called control characters. They are not printed; rather, each is a directive for the receiving device to do something special. For a computer display, the carriage return, 0000 1101, means to move the cursor back to the first column of the row. When received by a dot matrix printer, it means to return the print head to the start of the line. The line feed, 0000 1010, means to move the cursor down a line or to advance the printer paper one line, respectively. The original notion was that computers would be connected to each other and to a myriad of devices, and would use ASCII over serial communications links to communicate. Many of the control functions envisaged and represented by the other ASCII codes below 32 are rarely used.
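The split between printable and control characters can be sketched as follows, assuming the standard ASCII layout in which codes 0 through 31 and code 127 are control characters and codes 32 through 126 are printable. The function name `is_control` is illustrative.

```python
def is_control(code: int) -> bool:
    """True for ASCII control codes: 0 through 31, plus 127 (delete)."""
    return code < 32 or code == 127


control_codes = [c for c in range(128) if is_control(c)]
printable_codes = [c for c in range(128) if not is_control(c)]

print(len(control_codes), len(printable_codes))  # 33 95

# Carriage return and line feed, shown as full bytes.
print(format(ord("\r"), "08b"))  # 00001101
print(format(ord("\n"), "08b"))  # 00001010
```

In Python source code, `"\r"` and `"\n"` are the usual escape sequences for the carriage return and line feed characters described above.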