Exercise 2 — Filtering UTF8

This week's exercise is to write a C program that determines whether a byte stream contains UTF-8 data. Please name the source file "filter-utf8.c".

UTF-8 is a somewhat complex binary format: it is a multi-byte encoding (a character occupies one to four bytes), and a byte sequence can be either "well-formed" or "ill-formed". Please see this chapter from the Unicode standard for the full story; the most important 11 pages of that chapter are at https://www.cs.fsu.edu/~langley/CIS4930/2023-Spring/ch03--50-60.pdf.
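As a rough illustration of the multi-byte structure, the first byte of a sequence determines how many bytes the sequence should occupy. The sketch below uses hypothetical helper names (utf8_expected_len and utf8_is_continuation are not part of the assignment) and checks only the first byte; the Unicode chapter places further restrictions on the second byte after E0, ED, F0 and F4, which this sketch does not cover.

```c
#include <stddef.h>

/* Hypothetical helper: how many bytes should a sequence that starts
 * with lead byte b occupy?  Returns 0 if b can never begin a
 * well-formed sequence: a continuation byte (0x80..0xBF), an
 * overlong lead (0xC0..0xC1), or a byte above 0xF4. */
size_t utf8_expected_len(unsigned char b)
{
    if (b <= 0x7F) return 1;                /* ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;   /* two-byte lead */
    if (b >= 0xE0 && b <= 0xEF) return 3;   /* three-byte lead */
    if (b >= 0xF0 && b <= 0xF4) return 4;   /* four-byte lead */
    return 0;
}

/* Continuation (trailing) bytes have the bit pattern 10xxxxxx. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

For instance, the byte 0x81 is a continuation byte, so it cannot start a sequence on its own; that is one way a stray byte can make the input ill-formed.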

Your program should read from stdin (file descriptor 0) and, depending on the flags passed on the command line, do the following:

If two or three options are passed on the command line, your program should do each of the indicated things; for instance, if both "-u" and "-a" are passed, then the original file should be displayed. If all three are passed, the original file should be displayed, with any ill-formed UTF-8 encodings noted in-line with the display (not delayed).
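One possible way to collect the flags is the POSIX getopt function, which accepts them either separately ("-a -u") or bundled ("-aup"). This is only a sketch: struct filter_flags and parse_flags are hypothetical names, and the code records which flags were seen without assuming anything about what each flag should do.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>   /* getopt, optind (POSIX) */

/* Hypothetical container: one field per flag letter. */
struct filter_flags {
    int a;
    int u;
    int p;
};

/* Parses -a, -u and -p, given separately or bundled (e.g. "-aup").
 * Returns 0 on success, -1 if getopt reports an unknown option. */
int parse_flags(int argc, char *argv[], struct filter_flags *f)
{
    int c;

    f->a = f->u = f->p = 0;
    optind = 1;                       /* start scanning at argv[1] */
    while ((c = getopt(argc, argv, "aup")) != -1) {
        switch (c) {
        case 'a': f->a = 1; break;
        case 'u': f->u = 1; break;
        case 'p': f->p = 1; break;
        default:  return -1;          /* getopt returned '?' */
        }
    }
    return 0;
}
```

A main function would call parse_flags(argc, argv, &flags) once and then consult flags.a, flags.u and flags.p while processing the input.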

For example, if you give no flags, then you should get no output:

    
$ ./filter-utf8 < Nikkei-excerpt.txt
$ 

If you give both the "-a" and "-u" flags, you get the original text:

    
[langley@localhost Exercise3]$ ./filter-utf8 -a -u < Nikkei-excerpt.txt
米連邦準備理事会(FRB)は1日開いた米連邦公開市場委員会(FOMC)で0.25%の利上げを決めた。利上げ幅は2会合連続で縮小し、通常のペースに戻った。同時に公表した声明文では政策金利の先行きについて「継続的な引き上げが適切」とした前回までの表現を維持し、利上げの停止時期がまだ先であることを示唆した。

If you give all three, error messages are intermingled with both the ASCII and UTF-8 output:

$ ./filter-utf8 -aup < Nikkei-excerpt-w-bogus-character.txt 

There is a single ill-formed byte at offset 0: 81
米連邦準備理事会(FRB)は1日開いた米連邦公開市場委員会(FOMC)で0.25%の利上げを決めた。利上げ幅は2会合連続で縮小し、通常のペースに戻った。同時に公表した声明文では政策金利の先行きについて「継続的な引き上げが適切」とした前回までの表現を維持し、利上げの停止時期がまだ先であることを示唆した。
		       
$ ./filter-utf8 -aup < Nikkei-excerpt-w-bogus-character-v2.txt

There is a single ill-formed byte at offset 0: 81
米連邦準備理事会(FRB)は1日開いた米連邦公開市場委員会(FOMC)で0.25%の利上げを決めた。利上げ幅は2会合連続で縮小し、通常のペースに戻った。同時に公表した声明文では政策金利の先行きについて「継続的な引き上げが適切」とした前回までの表現を維持し、利上げの停止時期がまだ先であることを示唆した。AND
There is a single ill-formed byte at offset 429: 81

$
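The message in the transcripts above could be produced along the following lines. This is a sketch under assumptions: format_ill_formed is a hypothetical name, the offset is taken to be a byte offset from the start of the input, and the byte is printed as two uppercase hex digits (the sample value 81 does not reveal the intended letter case). Any blank lines around the message are left to the caller.

```c
#include <stdio.h>
#include <stddef.h>

/* Formats a report for one ill-formed byte into out (capacity cap).
 * Returns snprintf's result: the number of characters the complete
 * message needs, not counting the terminating NUL. */
int format_ill_formed(char *out, size_t cap, size_t offset, unsigned char byte)
{
    return snprintf(out, cap,
                    "There is a single ill-formed byte at offset %zu: %02X\n",
                    offset, (unsigned)byte);
}
```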
  

The following files might be of some use:

If you are using a Windows machine, please do check your md5sums. Windows has a very unpleasant habit of converting UTF-8 files to other binary representations. Checking is a good idea on any operating system, but it is especially important on Windows.

Here's a sample skeleton version. However, the approach it takes requires a lot of detailed state transitions, so unless you are interested in state machines you will probably want to write this as a series of if/else statements instead.
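To give a feel for the if/else style, here is a minimal sketch that follows the well-formed byte ranges in the Unicode chapter (Table 3-7). The names seq_len and count_ill_formed are hypothetical, the code works on a buffer already in memory rather than on stdin, and it resumes one byte after each ill-formed byte, which is just one simple recovery policy and may not be the one expected here.

```c
#include <stddef.h>

/* Length (1-4) of the well-formed sequence starting at s, or 0 if
 * the bytes at s are ill-formed or the sequence is cut short. */
static size_t seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range, 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;
    else if (s[0] == 0xE0) { len = 3; lo = 0xA0; }   /* no overlongs */
    else if (s[0] >= 0xE1 && s[0] <= 0xEC) len = 3;
    else if (s[0] == 0xED) { len = 3; hi = 0x9F; }   /* no surrogates */
    else if (s[0] == 0xEE || s[0] == 0xEF) len = 3;
    else if (s[0] == 0xF0) { len = 4; lo = 0x90; }   /* no overlongs */
    else if (s[0] >= 0xF1 && s[0] <= 0xF3) len = 4;
    else if (s[0] == 0xF4) { len = 4; hi = 0x8F; }   /* <= U+10FFFF */
    else return 0;                       /* 80..C1 or F5..FF */

    if (n < len) return 0;               /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Counts ill-formed bytes in buf.  If first_bad is non-NULL and at
 * least one bad byte exists, *first_bad gets the first bad offset. */
size_t count_ill_formed(const unsigned char *buf, size_t n, size_t *first_bad)
{
    size_t bad = 0, i = 0;

    while (i < n) {
        size_t len = seq_len(buf + i, n - i);
        if (len == 0) {
            if (bad == 0 && first_bad) *first_bad = i;
            bad++;
            i++;                         /* skip just the bad byte */
        } else {
            i += len;                    /* skip the whole sequence */
        }
    }
    return bad;
}
```

A real solution would additionally have to deal with sequences that straddle two reads from stdin, which this in-memory sketch sidesteps.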