The String Memory Gotcha

April 27, 2005 4:44 PM

<edit date="December 2010"> This blog post describes an old Java bug that has since (to the best of my knowledge) been fixed and all affected versions EOL'd. Regardless, it remains a cautionary tale about the problem of leaky abstractions, and why it's important for developers to have some idea of what's going on under the hood. </edit>

Every Java standard library I have seen uses the same internal representation for String objects: a char[] array holding the string data, and two ints representing the offset into the array at which the String starts, and the length of the String. So a String with an array of [ 't', 'h', 'e', ' ', 'f', 'i', 's', 'h' ], an offset of 4 and a length of 4 would be the string "fish".
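A minimal sketch of that representation (the field names value/offset/count match the old Sun sources, but this is an illustration, not the real java.lang.String):

```java
// Illustrative only: mimics the array/offset/length layout described above.
public class SharedString {
    private final char[] value;
    private final int offset;
    private final int count;

    SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    public char charAt(int index) {
        return value[offset + index];
    }

    @Override
    public String toString() {
        // This String constructor copies count chars starting at offset.
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        char[] data = { 't', 'h', 'e', ' ', 'f', 'i', 's', 'h' };
        SharedString fish = new SharedString(data, 4, 4);
        System.out.println(fish); // prints "fish"
    }
}
```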

This gives the JDK a certain amount of flexibility in how it handles Strings: for example it could efficiently create a pool of constant strings backed by just a single array and a bunch of different pointers. It also leads to some potential problems.

In the String source I looked at (and I'm pretty sure this is consistent across all Java standard library implementations), all of the major String constructors do the 'safe' thing when creating a new String object - they make a copy of only that bit of the incoming char[] array that they need, and throw the rest away. This is necessary because String objects must be immutable, and if they keep hold of a char[] array that may be modified outside the string, interesting things can happen.

String has, however, a package-private "quick" constructor that just takes a char array, offset and length and blats them directly into its private fields with no sanity checks, saving the time and memory overhead of array allocation/copying. One situation this constructor is used in is String#substring(). If you call substring(), you will get back a new String object containing a pointer to the same char[] array as the original string, just with a new offset and length to match the chunk you were after.

As such, substring() calls are incredibly fast: you're just allocating a new object and copying a pointer and two int values into it. On the other hand, it means that if you use substring() to extract a small chunk from a large string and then throw the large string away, the full data of that large string will continue to hang around in memory until all its substrings have been garbage-collected.

Which could mean you end up carrying around the complete works of Shakespeare in memory, even though all you wanted to hang on to was "What a piece of work is man!"
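The defensive copy that avoids this looks something like the sketch below, assuming the shared-array substring() behaviour described above. On the affected JDKs, passing the substring through new String(...) copies just the characters you want, letting the original's big backing array be garbage-collected:

```java
// A sketch of the defensive copy. On JDKs where substring() shares the
// parent's char[], the extra new String(...) trims the result down to
// just the characters it actually needs.
public class TrimmedSubstring {
    public static String extract(String large, int from, int to) {
        return new String(large.substring(from, to));
    }

    public static void main(String[] args) {
        String hamlet = "imagine the complete works here... What a piece of work is man! ...";
        int start = hamlet.indexOf("What");
        System.out.println(extract(hamlet, start, start + 28)); // What a piece of work is man!
    }
}
```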

Another place this constructor is called from is StringBuffer. StringBuffer also stores its internal state as a char[] array and an integer length, so when you call StringBuffer#toString(), it sneaks those values directly into the String that is produced. To maintain the String's immutability, a flag is set so that any subsequent operation on the StringBuffer causes it to regenerate its internal array.

(This makes sense because the most common case is toString() being the last thing called on a StringBuffer before it is thrown away, so most of the time you save the cost of regenerating the array.)

The potential problem again lies in the size of the char[] array being passed around. The size of the array isn't bound by the size of the String represented by the buffer, but by the buffer's capacity. So if you initialise your StringBuffer to have a large capacity, then any String generated from it will occupy memory according to that capacity, regardless of the length of the resulting string.
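There is no public API that lets you see the size of the array inside the resulting String, but the capacity/length mismatch is easy to demonstrate on the buffer itself. A sketch of the scenario (on the affected JDKs, the String produced by toString() would share this 32k array):

```java
// Builds a tiny string through a buffer with a 32k capacity, mirroring
// the pattern described in this post.
public class OversizedBuffer {
    public static StringBuffer buildBuffer(String data) {
        StringBuffer buffer = new StringBuffer(32768); // capacity, not length
        buffer.append(data);
        return buffer;
    }

    public static void main(String[] args) {
        StringBuffer buffer = buildBuffer("fish");
        System.out.println(buffer.toString().length()); // 4
        System.out.println(buffer.capacity());          // 32768
    }
}
```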

How did this become relevant?

Well, some guys at work were running a profiler against Jira to work out why a particular instance was running out of memory. It turned out that the JDBC drivers of a certain commercial database vendor (who shall not be named because their license agreement probably prohibits the publishing of profiling data) were consistently producing strings that contained arrays of 32,768 characters, regardless of the length of the string being represented.

Our assumption is that because 32k is the largest size these drivers comfortably support for character data, they allocate a StringBuffer of that size, pour the data into it, and then toString() it to send it into the rest of the world.

Just to put the icing on the cake, if you have data larger than 32k characters, you overflow the StringBuffer. When a StringBuffer overflows, it automatically doubles its capacity.
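You can see the doubling by appending one character past the initial capacity. (The exact growth formula varies by JDK version; the old Sun sources used (oldCapacity + 1) * 2, so "doubles" is approximate.)

```java
// Demonstrates StringBuffer's capacity growth: appending one character
// past the initial capacity roughly doubles it.
public class BufferDoubling {
    public static int grownCapacity(int initialCapacity) {
        StringBuffer buffer = new StringBuffer(initialCapacity);
        for (int i = 0; i <= initialCapacity; i++) { // one char past capacity
            buffer.append('x');
        }
        return buffer.capacity();
    }

    public static void main(String[] args) {
        System.out.println(grownCapacity(32768)); // at least 65536
    }
}
```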

As a result, every single String retrieved from the database takes up some multiple of 64KB of memory (Java uses two-byte Unicode characters internally), most of it empty, wasted bytes.

Ouch.

The first computer I owned had 64KB of memory, and almost half of that was read-only. Which means every String object coming out of that driver is at least twice the size of a game of Wizball.

This turned out to be false. According to a reddit comment: “The c64 used bank switching to allow for a full 64KB of RAM and still provide for ROM and memory-mapped I/O.”

One possible workaround: the constructor new String(String s) seems to "do the right thing", trimming the internal array down to the right size during construction. So all you have to do is make an immediate copy of every String you get from the drivers, making the artificially bloated original garbage as soon as possible.
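That workaround is a one-liner; a hypothetical helper (the JDBC call in the comment is illustrative, not from the post):

```java
// Copy every String coming back from the driver so its oversized
// backing array becomes garbage immediately. Usage would look like
// trim(resultSet.getString(1)) -- illustrative, not a real column.
public class StringTrimmer {
    public static String trim(String bloated) {
        return bloated == null ? null : new String(bloated);
    }

    public static void main(String[] args) {
        System.out.println(trim("fish")); // a copy with a right-sized array
    }
}
```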

Still, ouch.

1 TrackBack

Listed below are links to blogs that reference this entry: The String Memory Gotcha.

TrackBack URL for this entry: http://fishbowl.pastiche.org/mt-tb.cgi/633

Memory Leak from The Cramer Family on April 28, 2005 2:05 PM

Just found this article on a possible memory leak in Java. I've been under the understanding that StringBuffers were the preferred method of building large strings. While this is still the case, it turns out that when you call the StringBuffer.toSt... Read More

9 Comments

I can't add anything to this -- an interesting read, thanks -- other than, woah! Wizball was my favourite game on the Amstrad way back when...

The StringBuffer/String char[] sharing "feature" changed in Java 5.0.

We found a similar problem perhaps with the same drivers - strings returned from ResultSet were rather large. Chucking them through new String() for cached reference data solved the problem - in Sun JRE 1.4, there is this comment (the new array is of minimal length):

// Perhaps this constructor is being called
// in order to trim the baggage, so make a copy of the array.

BTW, the reusing-StringBuffer-anti-pattern is in the Java2 Performance and Idiom Guide, 2000.

The Eclipse people have made similar experiences: http://bugs.eclipse.org/bugs/show_bug.cgi?id=84872

I'm glad I finally found someone else with the same problem.

About a year ago we had the same problem with the same nameless commercial database. Out of sheer curiosity, we illegally decompiled the Java classes. Wow, what a mess! The database company also has an application server product. How can their app server function using these drivers? Well, by not using them, of course! For their application server product the database company uses third-party drivers from DataDirect. Nice.

Nice post. Is the database DB2? I think I have faced this weird "32K" string issue in the past with the DB2 app driver (JDBC Type 2 driver).

I also ran into this issue when manipulating XML data. It looks like the String/StringBuffer issue has been around quite a while. I think the change in memory management between 1.3 and 1.4 seems to have brought it into the spotlight, though. Java 1.5 does indeed clean up this mess.

Matt


The SUN bug report can be found here:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4724129


An IBM forum entry (where I was a little frustrated - sorry) can be found here:
http://www-106.ibm.com/developerworks/forums/dw_thread.jsp?forum=367&thread=68634&cat=10

Re - JC Mann's comments above - I totally agree.

I had a problem with ResultSet.getMetaData - calling this threw UnsupportedOperationException despite the call being fully documented as working.

Some investigation on the support web site yielded that many people had raised this as a bug.

The initial support response stated that the software does not necessarily conform to documented behaviour - WTF - why bother writing it then!!

Eventually a bug was raised and fixed in a later release. So I tried it like this :-

VendorNameResultSet r = .......
VendorNameResultSetMetaData m = r.getMetaData();

// Fine so far - looks like the bug is fixed
// so lets do something useful with the MetaData

int i = m.getColumnCount();

// WTF - it throws UnsupportedOperationException

Yes, the bug had been fixed by returning a ResultSetMetaData object that threw UnsupportedOperationException on every method!!

This company's JDBC development team is a joke.

What a find! I've been tearing my hair out looking for a slow memory leak at work, and this may be the culprit.

Previously: Competition in a Nutshell

Next: Apple: What's Next?