C# · December 24, 2021

C# - Decoding a file stream as UTF-8

I have a very large XML document (about 120 MB) and I do not want to load it into memory all at once. My goal is to check whether the file is valid UTF-8.

Is there a quick way to check this without reading the whole file into memory as a byte[]?

I am using VSTS 2008 and C#.

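When I load an XML document that contains invalid byte sequences with XmlDocument, an exception is thrown. When I read the whole file into a byte array and then decode it as UTF-8, there is no exception. Any ideas?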

Here is a screenshot showing the content of my XML file, or you can download a copy of the file from here.

Edit 1:

    using System;
    using System.IO;
    using System.Text;
    using System.Xml;

    class Program
    {
        // Reads the entire file into a byte array (this is the step I want to avoid).
        public static byte[] RawReadingTest(string fileName)
        {
            byte[] buff = null;
            try
            {
                using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
                using (BinaryReader br = new BinaryReader(fs))
                {
                    long numBytes = new FileInfo(fileName).Length;
                    buff = br.ReadBytes((int)numBytes);
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            return buff;
        }

        // Loading through XmlDocument: this throws on the invalid byte sequence.
        static void XMLtest()
        {
            try
            {
                XmlDocument xDoc = new XmlDocument();
                xDoc.Load("c:\\abc.xml");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        static void Main()
        {
            try
            {
                XMLtest();
                Encoding ae = Encoding.GetEncoding("utf-8");
                string filename = "c:\\abc.xml";
                // Decoding the raw bytes: no exception is thrown here.
                ae.GetString(RawReadingTest(filename));
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }

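Edit 2: With new UTF8Encoding(true, true) an exception is thrown, but with new UTF8Encoding(false, true) no exception is thrown. I am confused: the second parameter is the one that should control whether an exception is thrown on invalid byte sequences, so why does the first parameter matter?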

    public static void TestTextReader2()
    {
        try
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader("c:\\a.xml", new UTF8Encoding(true, true)))
            {
                int bufferSize = 10 * 1024 * 1024; // Could be anything
                char[] buffer = new char[bufferSize];

                // Read from the file until the end of the file is reached.
                int actualsize = sr.Read(buffer, 0, bufferSize);
                while (actualsize > 0)
                {
                    actualsize = sr.Read(buffer, 0, bufferSize);
                }
            }
        }
        catch (Exception e)
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }

Solution:

    // Fragment from inside a method: pathToFile, GoodUTF8File and BadUTF8File
    // are defined by the surrounding code.
    var buffer = new char[32768];
    using (var stream = new StreamReader(pathToFile, new UTF8Encoding(true, true)))
    {
        while (true)
            try
            {
                // Decode one chunk at a time; 0 means end of file with no invalid bytes seen.
                if (stream.Read(buffer, 0, buffer.Length) == 0)
                    return GoodUTF8File;
            }
            catch (ArgumentException)
            {
                // DecoderFallbackException derives from ArgumentException.
                return BadUTF8File;
            }
    }
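
For reference, the same chunked approach can be written as a self-contained helper. This is only a sketch: the names Utf8Check, IsValidUtf8 and IsValidUtf8File are hypothetical and do not come from the original post. It also turns off StreamReader's byte order mark detection. That choice rests on an assumption I have not confirmed against the asker's file: when the file starts with a BOM and the supplied encoding emits no preamble (first constructor argument false), the reader's default BOM detection may swap in a lenient UTF-8 decoder, which would explain the behaviour described in Edit 2. With detection disabled, the strict encoding passed in is the one that is used, whichever value the first argument has.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, not part of the original answer.
public static class Utf8Check
{
    // Returns true if the whole stream decodes as UTF-8 without errors.
    // Only one buffer of characters is held in memory at a time.
    public static bool IsValidUtf8(Stream input)
    {
        // Second argument (throwOnInvalidBytes) = true: invalid sequences
        // raise DecoderFallbackException instead of becoming U+FFFD.
        var strict = new UTF8Encoding(false, true);
        var buffer = new char[32 * 1024];

        // Third argument false: do not auto-detect the encoding from a BOM,
        // so the strict encoding above is never replaced by the reader.
        using (var reader = new StreamReader(input, strict, false))
        {
            try
            {
                // Read returns 0 at end of stream.
                while (reader.Read(buffer, 0, buffer.Length) > 0)
                {
                }
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }

    // Convenience overload for a file path.
    public static bool IsValidUtf8File(string path)
    {
        using (var file = File.OpenRead(path))
        {
            return IsValidUtf8(file);
        }
    }
}
```

A call such as Utf8Check.IsValidUtf8File("c:\\abc.xml") would then report whether the file decodes cleanly. How a multi-byte sequence cut off at the very end of the file is treated may differ between runtime versions, so that edge case is worth checking separately.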