This post is about an interesting bug I found in MyStudio IDE which I felt needed to be shared here.
I saw a "File Not Supported" error (declared here) when opening the requirements.txt file in the repository.
This error is meant for binary files like images, PDFs, or executables, so it didn't make sense that this file would trigger it.
I switched to VSCode and opened the same file. It worked fine; I could see the content and make changes with ease.
Take a moment and check the error declaration link above. See if you can find why this was happening.
...Done? Okay, good!
I began searching for clues as to what was different about this file and, a few moments later, as I glanced at the bottom-right of the window, I realized what was wrong.
The file was in UTF-16 LE encoding! This tracks because I worked on it in a Windows environment and, as you might know, Windows uses UTF-16 internally and some of its tools write UTF-16 text files by default.
Did you know the Windows kernel and its native APIs still work with UTF-16 strings? This is part of Microsoft's commitment to backwards compatibility with legacy software on Windows.
LE stands for Little Endian and it determines the byte-ordering used in a file. Here's a Wikipedia link that explains it better.
In simple terms, it determines the order in which the bytes of each multi-byte value are stored in a file. In this case, UTF-16 LE means the text is UTF-16 encoded and each 16-bit unit is stored with its least significant byte first.
Still with me? Don't worry. All this tech jargon was only meant to help you understand what UTF-16 LE means.
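To make the byte-ordering idea concrete, here is a minimal sketch using only the Rust standard library. The helper names `utf16_le_bytes` and `utf16_be_bytes` are made up for this illustration and are not part of MyStudio.

```rust
/// Encodes `text` as UTF-16 with the low byte of each unit first (little-endian).
fn utf16_le_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// Encodes `text` as UTF-16 with the high byte of each unit first (big-endian).
fn utf16_be_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16()
        .flat_map(|unit| unit.to_be_bytes())
        .collect()
}

fn main() {
    // 'A' is U+0041, so its single 16-bit code unit is 0x0041.
    // Little-endian stores 0x41 first, big-endian stores 0x00 first.
    assert_eq!(utf16_le_bytes("A"), vec![0x41, 0x00]);
    assert_eq!(utf16_be_bytes("A"), vec![0x00, 0x41]);
}
```

Same character, same encoding, but the two bytes swap places depending on the endianness.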
Rust + UTF-8
Rust uses UTF-8 internally, so its standard library assumes UTF-8 whenever it turns input into a String; for example, std::fs::read_to_string rejects anything that isn't valid UTF-8.
There are a few exceptions, like String::from_utf16, but file IO doesn't work with UTF-16 out of the box.
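Here is a small sketch of what that looks like in practice, using only the standard library. The scratch file name and the `decode_utf16_le` helper are assumptions made for this example, not MyStudio code.

```rust
use std::fs;
use std::io;

/// Decodes UTF-16 LE bytes (BOM already stripped) into a String.
/// Returns None for odd-length input or unpaired surrogates.
fn decode_utf16_le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() -> io::Result<()> {
    // Hypothetical scratch file holding "hi" as UTF-16 LE with a BOM (FF FE).
    let path = std::env::temp_dir().join("utf16_demo.txt");
    fs::write(&path, [0xFF, 0xFE, b'h', 0x00, b'i', 0x00])?;

    // read_to_string validates UTF-8, so it fails on this file.
    let err = fs::read_to_string(&path).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::InvalidData);

    // Reading raw bytes always works; decoding them is then up to us.
    let bytes = fs::read(&path)?;
    let text = decode_utf16_le(&bytes[2..]).expect("valid UTF-16 LE");
    assert_eq!(text, "hi");

    fs::remove_file(&path)?;
    Ok(())
}
```

So the bytes themselves are never the problem. The trouble starts only when something insists they must be UTF-8.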
For instance, we don't have to concern ourselves with all of this in, say, Java.
ByteArrayOutputStream would take care of storing the given input file as raw bytes without interpreting it. We could also go with an Apache Commons JAR to abstract this even further.
So, what now?
Once I realized what went wrong, I wanted to add a GUI element showing the currently open file's encoding. This write-up focuses on the software implementation. I'll make a new post about the GUI adventure in the near future.
I went to crates.io, which is a registry of third-party packages for Rust. In Rust-land, a package is called a crate. I don't know why that is though :\
Anyway, the goal was to locate a crate that can work with different file encodings and ideally also handles endianness. In a text editor, a user should be able to open a text file of any encoding and work on it without issue, just like I did with VSCode when testing this bug.
So, I found this crate called
encoding_rs which supports a wide variety of encoding standards. The exact list can be found here. I was very excited to work on this but then came more problems.
Working without encoding_rs_io
I kind-of brought this on myself by choosing not to use the companion crate, encoding_rs_io, which integrates with std::io to transcode non-UTF-8 files as they are read.
I tried to approach this in different ways, like obtaining u8 slices (byte arrays) from the UTF-16 content and saving them to disk while preserving UTF-16 LE, because the standard IO functions expect u8 slices.
I could've gone down the rabbit hole of bit-shifting every 16-bit unit that doesn't fit in a u8 into separate bytes and trying to save those, but I never got that far.
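For the curious, that rabbit hole could look roughly like the sketch below. This is not what MyStudio does; `encode_utf16_le_with_bom` and `save_as_utf16_le` are hypothetical helpers, and it assumes the file should start with the UTF-16 LE byte order mark (FF FE).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Splits each 16-bit code unit into two u8 values, low byte first,
/// and prefixes the result with the UTF-16 LE byte order mark.
fn encode_utf16_le_with_bom(text: &str) -> Vec<u8> {
    let mut bytes = vec![0xFF, 0xFE];
    for unit in text.encode_utf16() {
        bytes.push((unit & 0xFF) as u8); // low byte
        bytes.push((unit >> 8) as u8); // high byte
    }
    bytes
}

/// Hypothetical helper: saves `text` to `path` as UTF-16 LE.
fn save_as_utf16_le(path: &Path, text: &str) -> io::Result<()> {
    fs::write(path, encode_utf16_le_with_bom(text))
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("utf16_le_out.txt");
    save_as_utf16_le(&path, "pip")?;

    // Every ASCII character becomes two bytes: the character, then 0x00.
    let written = fs::read(&path)?;
    assert_eq!(written, [0xFF, 0xFE, b'p', 0, b'i', 0, b'p', 0]);

    fs::remove_file(&path)
}
```

The shift and mask do the same job as `u16::to_le_bytes`; they are spelled out here only to show what happens to each unit.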
Looking back, I should've gone with the encoding_rs_io crate and integrated it into libmystudio, but we developers can be stubborn at times!
Anyway, this went on for a few more days during which I kept failing at this task, and after that I couldn't find the time anymore as I had to prioritize the day job. I finally decided it was time to use a workaround and move this to the TODO state.
The catch is that Rust strings are always UTF-8, so writing a String's bytes through the file IO APIs produces a UTF-8 file. This means that when MyStudio tries to save a UTF-16 encoded file, the file is either overwritten as UTF-8 or MyStudio panics.
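A tiny sketch of how the encoding gets lost, assuming the editor decodes the file into a String and later writes that String's bytes back. The function name `bytes_written_by_naive_save` is invented for this example.

```rust
/// Decodes UTF-16 LE bytes (BOM already stripped) into a String and returns
/// the bytes a plain String write would put on disk, which are always UTF-8.
/// A trailing odd byte, if any, is ignored by chunks_exact.
fn bytes_written_by_naive_save(utf16_le: &[u8]) -> Option<Vec<u8>> {
    let units: Vec<u16> = utf16_le
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    let text = String::from_utf16(&units).ok()?;
    Some(text.into_bytes())
}

fn main() {
    // "hi" as it was stored on disk in UTF-16 LE: four bytes.
    let original = [b'h', 0x00, b'i', 0x00];

    // After a decode-and-save round trip, only two UTF-8 bytes remain.
    let saved = bytes_written_by_naive_save(&original).expect("valid UTF-16");
    assert_eq!(saved, b"hi".to_vec());
    assert_ne!(saved, original.to_vec());
}
```

The text is intact, but the file on disk is no longer the UTF-16 LE file the user opened.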
Hence, I chose to not allow saving a file if it's not using UTF-8 encoding. This way, I can prevent potential data loss or file content mismatch. This is temporary and once I find the time, I plan on fixing it.
Check my Git commit here.
The first thing to do is update libmystudio to detect UTF-8 and UTF-16 encodings and return the file's contents.
If we encounter an "unsupported file type", like a binary file or a text file with UTF-32 encoding, we return None, which is a special value in Rust that means "nothing here".
None is one of the two variants of the Option type (the other is Some), and you can think of Option as the Optional class in Java-land.
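The idea can be sketched with the standard library alone. This is not the actual libmystudio code: `decode_text` and `decode_utf16` are hypothetical names, and the sketch assumes UTF-16 files always start with a byte order mark (BOM), which is not guaranteed in the wild.

```rust
/// Decodes UTF-16 bytes using the given byte-order conversion.
fn decode_utf16(bytes: &[u8], to_unit: fn([u8; 2]) -> u16) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| to_unit([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

/// Returns the text for UTF-8 and BOM-prefixed UTF-16 input, None otherwise.
fn decode_text(bytes: &[u8]) -> Option<String> {
    match bytes {
        // UTF-32 BOMs. The LE one begins like the UTF-16 LE BOM, so it is
        // checked first. Assumption: a UTF-16 LE file never starts with U+0000.
        [0xFF, 0xFE, 0x00, 0x00, ..] | [0x00, 0x00, 0xFE, 0xFF, ..] => None,
        [0xFF, 0xFE, rest @ ..] => decode_utf16(rest, u16::from_le_bytes),
        [0xFE, 0xFF, rest @ ..] => decode_utf16(rest, u16::from_be_bytes),
        // Optional UTF-8 BOM.
        [0xEF, 0xBB, 0xBF, rest @ ..] => String::from_utf8(rest.to_vec()).ok(),
        // Anything else must be valid UTF-8, or it is treated as unsupported.
        _ => String::from_utf8(bytes.to_vec()).ok(),
    }
}

fn main() {
    let utf16_le = [0xFF, 0xFE, b'h', 0x00, b'i', 0x00];
    assert_eq!(decode_text(&utf16_le), Some("hi".to_string()));
    assert_eq!(decode_text(b"hi"), Some("hi".to_string()));

    // The first bytes of a PNG are not valid UTF-8, so we get None.
    let png_magic = [0x89, b'P', b'N', b'G'];
    assert_eq!(decode_text(&png_magic), None);
}
```

The caller only has to check for None to decide whether to show the "File Not Supported" error.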
Lastly, when saving file changes, we use similar logic to determine whether the file's encoding is UTF-8 and, if not, we return a "Write support for xxx encoding is unavailable." error, which is shown in the status bar.
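A minimal sketch of such a guard is below. The `Encoding` enum and `check_writable` function are assumptions made for illustration; the real commit may be structured differently.

```rust
/// Hypothetical set of encodings the editor can detect.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

impl Encoding {
    /// Human-readable name used in status bar messages.
    fn label(self) -> &'static str {
        match self {
            Encoding::Utf8 => "UTF-8",
            Encoding::Utf16Le => "UTF-16 LE",
            Encoding::Utf16Be => "UTF-16 BE",
        }
    }
}

/// Allows saving only UTF-8 files; anything else yields a status bar message.
fn check_writable(encoding: Encoding) -> Result<(), String> {
    match encoding {
        Encoding::Utf8 => Ok(()),
        other => Err(format!(
            "Write support for {} encoding is unavailable.",
            other.label()
        )),
    }
}

fn main() {
    for encoding in [Encoding::Utf8, Encoding::Utf16Le, Encoding::Utf16Be] {
        match check_writable(encoding) {
            Ok(()) => println!("{}: safe to save", encoding.label()),
            Err(message) => println!("{}", message),
        }
    }
}
```

Returning a Result keeps the decision in one place, and the caller just forwards the error text to the status bar.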
I hope you've learned something interesting here. Give a Like to this post (on the right) and don't forget to add a comment on your thoughts about this.
You can also @ me on Twitter.
Bye for now :-)