ReadLine() vs Read() to Get CR and LF Efficiently?

Developer https://www.devze.com 2023-04-02 01:50 Source: Internet
I am working on a C# program to determine the line length for each row in multiple large text files with 100,000+ rows before importing them using an SSIS package. I will also be checking other values on each line to verify they are correct before importing them into my database using SSIS.

For example, I am expecting a line length of 3000 characters and then a CR at 3001 and LF at 3002, so overall a total of 3002 characters.

When using ReadLine(), a CR or LF is treated as the end of the line, so I can't check the CR or LF characters. I had just been checking that the line was 3000 characters long to determine whether the length was correct. I have just encountered an issue where the file has an LF at position 3001 but is missing the CR. So ReadLine() says the line is 3000 chars, which is correct, but it will fail in my SSIS package because it is missing a CR.

I have verified that Read() will read each char one at a time, so I can determine whether each line has a CR and LF, but this seems rather unproductive, and since some files I will encounter have upwards of 5,000,000 rows, it seems very inefficient. I will also then need to add each char to a string, or use ReadBlock() and convert a char array into a string, so that I can check other values in the line.

Does anyone have any ideas on an efficient way to check each line for CR and LF, and for other values on a given line, without wasting resources and while finishing in a relatively timely manner?


"I have verified that Read() will read each char 1 at a time and I can determine if each line has a CR and LF but this seems rather unproductive"

Think about this. Do you think ReadLine() has a magic wand and does not have to read each char?

Just create your own ReadMyLine(). Something has to read the chars, it doesn't matter if that's your code or the lib. I/O will be buffered by the Stream and Windows.
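A minimal sketch of what such a ReadMyLine() might look like, assuming any TextReader (for example a StreamReader) as input. The class name LineChecker and the out parameter reporting the terminator are illustrative choices, not part of any library:

```csharp
using System;
using System.IO;
using System.Text;

static class LineChecker
{
    // Hypothetical helper: returns the next line without its terminator,
    // or null once the input is exhausted. 'terminator' receives "\r\n",
    // "\n", "\r", or "" when the last line has no terminator at all.
    public static string ReadMyLine(TextReader reader, out string terminator)
    {
        terminator = "";
        var line = new StringBuilder();
        int c;

        while ((c = reader.Read()) != -1)
        {
            if (c == '\r')
            {
                if (reader.Peek() == '\n')
                {
                    reader.Read();          // consume the LF of a CR LF pair
                    terminator = "\r\n";
                }
                else
                {
                    terminator = "\r";      // lone CR
                }
                return line.ToString();
            }

            if (c == '\n')
            {
                terminator = "\n";          // lone LF, e.g. the missing-CR case
                return line.ToString();
            }

            line.Append((char)c);
        }

        // End of input: hand back any unterminated trailing text, else null.
        return line.Length > 0 ? line.ToString() : null;
    }
}
```

The caller can then check both things per row, e.g. that the returned line has Length == 3000 and that terminator == "\r\n", and still run the other field checks on the returned string.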


Can you use the overload of StreamReader.Read (or TextReader.Read) that accepts three parameters: a char buffer (in your case a 3002-character array), a start index into that buffer (0 if you refill the whole buffer on every pass of your loop), and the number of characters to read (3002)? From the filled buffer, you can check the last two characters for your conditional evaluation of CR and LF.


I believe you will find this version to be efficient:

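    // requires: using System.IO; using System.Text;
    static bool CheckFile(string filename)
    {
        const int BUFFER_SIZE = 3002; // 3000 data chars + CR + LF

        using (var reader = new StreamReader(filename, Encoding.ASCII, false, BUFFER_SIZE))
        {
            var buffer = new char[BUFFER_SIZE];
            int charsRead;

            // ReadBlock keeps reading until the buffer is full or the file ends,
            // whereas Read may return fewer chars than requested. The index
            // argument is a position in the buffer, so it stays at 0.
            while ((charsRead = reader.ReadBlock(buffer, 0, BUFFER_SIZE)) > 0)
            {
                if (charsRead != BUFFER_SIZE
                    || buffer[BUFFER_SIZE - 2] != '\r'
                    || buffer[BUFFER_SIZE - 1] != '\n')
                {
                    // the file does not conform
                    return false;
                }
            }
        }

        return true;
    }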

The reason I'm optimistic about this is that according to the docs, efficiency is increased if the size of the underlying buffer is matched to the buffer that is used for reading. Caveat: this code has not been tested or timed.


I may be missing something here, but is the data in each line always exactly 3000 characters (excluding CR and LF)?

If so, why not just read each line and then take only the first 3000 characters, using string.Substring()? This way you don't have to worry about exactly how the string is terminated.

For example:

    using (StreamReader sr = new StreamReader("TestFile.txt"))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            // string data = line.Substring(0, 3000);
            // edit: if a line is sometimes shorter than 3000 chars...
            string data = line.Substring(0, Math.Min(line.Length, 3000));
            // do something with data
        }
    }


I think I have finally figured out the code to get exactly what I want. Thoughts? The main issue I was encountering was that I am not guaranteed that my line length is going to be correct. Otherwise the method mentioned by @Paul Keister would have worked great, and it did when I tested it. Thanks for the help!

    // file is an already opened StreamReader; requires: using System.Text;
    int asciiValue = 0;

    while (asciiValue != -1)
    {
        bool endOfRow = false;
        bool endOfRowValid = true;

        // StringBuilder avoids allocating a new string for every char
        var currentLine = new StringBuilder();

        while (!endOfRow)
        {
            asciiValue = file.Read();

            if (asciiValue == 10 || asciiValue == 13)
            {
                // a row is valid only if CR (13) is immediately followed by LF (10)
                if (asciiValue == 13 && file.Peek() == 10)
                    file.Read(); // consume the LF
                else
                    endOfRowValid = false;

                endOfRow = true;
            }
            else if (asciiValue != -1)
                currentLine.Append((char)asciiValue);
            else
                endOfRow = true; // end of file
        }

        // check endOfRowValid, currentLine.Length and the other values here
    }

Edit: I forgot to mention that this seems to be just as efficient as using ReadLine(). I was really afraid this wouldn't have worked as well. It appears I was wrong.
