When more memory is less

Quick question: when can increasing the size of your data structures lead to less memory being allocated?

While you consider your answer to the conundrum, I’ll lay out the scenario for mine.

I have a Lexer I’ve written in C# for an expression evaluation language. A Lexer (more properly a lexical analyzer, sometimes called a tokeniser) takes a string of text and splits it into tokens which can be usefully consumed by a parser to do something cool.

Now the Lex function on my Lexer class takes a single string as its argument and returns an IEnumerable<Token>.
Tokens look like this:

struct Token
{
  public string Text;
  public int Position;
  public TokenType Type;
}

Text is the part of the original expression which this token represents – I take a substring of the original string.

Position is where the token occurs in the original string – used for producing useful error messages when parsing fails.

TokenType is an enum which specifies what sort of token it is. It’s the most functionally important part of the lexer’s output, but you can ignore it for the purposes of this discussion.
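For context, the overall shape of the lexer is roughly this – just a sketch with the scanning logic elided, and the TokenType members shown are placeholders, since only the signature matters for what follows:

using System.Collections.Generic;

public enum TokenType { Number, Identifier, Operator /* ... */ }

public class Lexer
{
  // Walks the expression from left to right, yielding one Token per lexeme.
  public IEnumerable<Token> Lex(string expression)
  {
    // ... scanning logic elided ...
    yield break;
  }
}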

Now I’m parsing 2 large text files with this lexer (around 35MB of data). Memory requirements for this and the other data structures involved are just under 500MB – so non-trivial.

When looking at performance I noticed the huge number of memory allocations that were being done by the lexer. Now I’m parsing over 300,000 expressions with this lexer, so there are well over four million tokens being generated. The tokens are kept alive as long as the files are open in the app, as I use them for more than just parsing. That means I have over four million strings allocated. Thinking about it that way, I decided to change my implementation of Token to the following:

struct Token
{
  public string Expression;
  public int Position;
  public int Length;
  public TokenType Type;
  public string Text { get { return Expression.Substring(Position, Length); } }
}

I’ve replaced the Text field with an Expression field and added a Length field. Notice that Text is now a property with a custom getter and no backing storage, so it doesn’t take up any memory in the struct. The total net change in size is a 4-byte increase.

I refactor my code to store the entire expression in the Expression field and the length of the token in the Length field. The Text property gives the same result as before, but computes its value from the other information in the token.
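In practice, the change inside the lexer looks something like this. MakeToken is a hypothetical helper rather than my actual code, but it shows the idea – every token shares a reference to the one expression string it was lexed from, and no new string is allocated per token:

Token MakeToken(string expression, int start, int length, TokenType type)
{
  return new Token
  {
    Expression = expression, // shared reference - no new string allocated
    Position = start,
    Length = length,
    Type = type
  };
}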

So my four million tokens now take up an extra 16MB of space (four million tokens times 4 extra bytes), right?

Actually the program posted a 90MB decrease in memory usage. That’s nearly a full 20% less. How come?

This optimisation is all about strings, and relies on two properties of strings – immutability and “referenceness”.

Strings are immutable reference objects. That means you are really just passing around references to memory that never changes: once a string has been allocated, its contents stay the same until the garbage collector collects it. So I can have lots of copies of a reference to that string without ever needing to allocate more memory. If the string is “changed” somewhere, a new string is allocated and used instead, preserving the original, so all other references are safe.
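To make that concrete, here’s a quick illustration (it has nothing to do with the lexer itself, just plain strings):

string expression = "price * quantity";
string copy = expression;                       // copies the reference, not the characters
bool sameObject = ReferenceEquals(expression, copy);  // true - one allocation, two references

string changed = expression + " - discount";    // a "change" allocates a brand new string
// expression and copy still refer to the original, untouched "price * quantity"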

So I re-use my already allocated expression string instead of allocating a new one. The net result is four million fewer allocations (a 10% speed boost) and 90MB less memory allocated. Why 90MB for 35MB of text data? Well, my text data is stored as UTF-8, whereas internally a .NET string is encoded as UTF-16, so the strings are approximately twice as big in memory. Each string also carries an int length indicator, which is another 4 bytes per string – across four million strings, that adds 16MB. So that’s about 86MB, which is exactly the same as 90MB when you are writing a blog post.
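For what it’s worth, here’s that accounting as a back-of-the-envelope calculation (all of the figures are approximate, and the variable names are just for illustration):

long utf8TextBytes   = 35L * 1024 * 1024;          // ~35MB of UTF-8 text in the source files
long utf16TextBytes  = utf8TextBytes * 2;          // ~70MB once held as UTF-16 .NET strings
long stringCount     = 4000000;                    // a little over four million token strings no longer allocated
long lengthFieldCost = stringCount * sizeof(int);  // ~16MB of per-string length fields that go away with them
long totalSaved      = utf16TextBytes + lengthFieldCost;  // ~86MB - close enough to the observed 90MB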
