Monday, November 07, 2005

Text files in Windows and Unix

This is the best explanation I have seen of a topic that bothers almost everyone who has dealt with text files on both Windows and Linux. Picked up from the Knoppix FAQ.
----------------------
Excursion: Files under Windows and UNIX: Two worlds collide!

Are you old enough to remember, or even to have used, a mechanical typewriter? Maybe you still have one in your attic somewhere...
With such a mechanical typewriter, when you reached the end of the line, you had to pull a lever. Two things then happened mechanically. In two or three rattles the paper was moved up one or more lines (the line feed, or LF), and as you pulled the lever through from the far right to the left, the carriage with the paper returned to the start of the line (the carriage return, or CR).

In your PC and on your hard disk, the characters of a text line are stored as ASCII-encoded characters in successive bytes. The end of each line has to be marked in some way.

The first computer "screens" were no more than electromechanically controlled typewriters or TELEX terminals, so they had something like the carriage return / line feed mechanism of the old typewriters in them. These mechanisms were controlled by the ASCII characters CR and LF. The MS-DOS developers, and later the Windows developers, simply used these ASCII characters to indicate the end of text lines in their text files. For instance, if you type abc, press Enter, and type def in Notepad under Windows, then internally the ASCII characters "abcCRLFdef" are stored in the file. And when Notepad reads this, it displays:

abc
def

The editor interprets the CRLF characters as an indication that it has to start at the beginning of a new line. That's the way it should be.
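
To make the stored bytes concrete, here is a small Python sketch of the "abcCRLFdef" example. The helper name byte_values is made up for this illustration; the numbers are the standard ASCII codes, where CR is 13 and LF is 10.

```python
# The text "abc", a Windows line ending, then "def", as raw bytes.
# \r is CR (ASCII 13) and \n is LF (ASCII 10).
windows_text = b"abc\r\ndef"


def byte_values(data):
    """Return the numeric value of every byte in data (hypothetical helper)."""
    return list(data)


print(byte_values(windows_text))
# [97, 98, 99, 13, 10, 100, 101, 102]
```

The pair 13, 10 in the middle is the CRLF that Notepad writes between the two lines.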

The UNIX developers (at Bell Labs) decided that one character would actually do, since the two always come in pairs, and moreover the electronics of computer screens can handle this command perfectly well on their own. So in Unix/Linux only the LF character is used. When vi / gvim reads the ASCII string "abcLFdef", it is shown correctly as:

abc
def
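
With both conventions described, a quick way to see which one a file uses is to count the endings in its raw bytes. The following is only a sketch; the function name count_line_endings is an assumption made for this example.

```python
def count_line_endings(data):
    """Count Windows (CRLF), Unix (bare LF) and bare CR endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair also contains one LF and one CR, so subtract the pairs.
    bare_lf = data.count(b"\n") - crlf
    bare_cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": bare_lf, "CR": bare_cr}


print(count_line_endings(b"abc\r\ndef"))  # Windows style
print(count_line_endings(b"abc\ndef"))    # Unix style
```

A file that reports only CRLF endings was most likely written on Windows, and one that reports only LF endings on Unix/Linux.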

The big fuss starts when you want to edit Linux text files directly under Windows, or vice versa. When Notepad under Windows reads the ASCII string "abcLFdefLFghi", it shows it as:

abc
   def
      ghi

It performs only the line feed and not the carriage return, so each new line does not start at the beginning of the line.
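
The effect of a line feed without a carriage return can be modelled in a few lines of Python. This is a toy model of a typewriter-like device, not a reproduction of how Notepad actually works; the function name render_like_teletype is invented for the sketch.

```python
def render_like_teletype(text):
    """Render text on a device where LF only moves down one row
    and CR only moves back to column 0 (a simplified model)."""
    rows = [[]]
    row, col = 0, 0
    for ch in text:
        if ch == "\n":        # line feed: go down, keep the column
            row += 1
            if row == len(rows):
                rows.append([])
        elif ch == "\r":      # carriage return: back to the left margin
            col = 0
        else:
            line = rows[row]
            while len(line) < col:   # pad with spaces up to the column
                line.append(" ")
            if col < len(line):
                line[col] = ch       # overwrite an existing character
            else:
                line.append(ch)
            col += 1
    return "\n".join("".join(line) for line in rows)


print(render_like_teletype("abc\ndef\nghi"))      # LF only: staircase
print(render_like_teletype("abc\r\ndef\r\nghi"))  # CRLF: aligned lines
```

With LF alone each line begins where the previous one stopped; with CRLF every line begins at the left margin.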

When vi / gvim reads a Windows file with the string: "abcCRLFdefCRLFghi" then it shows this as:

abc^M
def^M
ghi

It replaces the (invisible) ASCII control character CR with a pair of visible ASCII characters (^M) and, because of the LF, starts at the beginning of a new line.
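
Why ^M and not some other letter? In caret notation a control character is written as ^ followed by the character whose ASCII code is 64 higher, and CR (13) plus 64 gives 77, which is M. The sketch below mimics that display; caret_notation is a hypothetical helper, not part of vi.

```python
def caret_notation(text):
    """Show ASCII control characters (except LF) as ^X, similar to what
    vi displays. LF is kept so that lines still break."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 32 and ch != "\n":
            out.append("^" + chr(code + 64))  # CR (13) becomes ^M (77)
        else:
            out.append(ch)
    return "".join(out)


print(caret_notation("abc\r\ndef\r\nghi"))
```

The output is the three lines abc^M, def^M and ghi, matching the display above.
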
So that's why you need this cumbersome translation between Windows text files and Unix text files, replacing CRLFs with LFs and vice versa.
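
The translation itself is a simple byte replacement. Here is a minimal Python sketch, assuming the file content has already been read as raw bytes; the names to_unix and to_windows are made up for the example.

```python
def to_unix(data):
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data):
    """Replace every line ending with CRLF.
    Converting to LF first keeps existing CRLF pairs from
    turning into CR CR LF."""
    return to_unix(data).replace(b"\n", b"\r\n")


print(to_unix(b"abc\r\ndef\r\nghi"))
print(to_windows(b"abc\ndef\nghi"))
```

When applying this to real files, open them in binary mode ("rb" and "wb") so that Python does not translate the line endings on its own while reading or writing.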
