
Find the Artifacts!


This post and the related work were made possible by what I learned from the clustimage module and the blog posts of Dr. Erdogan Taskesen. I highly recommend anyone interested in computer vision read them. The Hacker Factor blog by Dr. Neal Krawetz was also extremely helpful. This is a case study on how I used image hashing concepts in a game development context.

In the interest of privacy I’ve purposely only mentioned former colleagues by title. If you’d like me to mention you by name please let me know and I’d be happy to edit it.

Prelude
#

It was 2021. I was working in the art department at Bethesda Game Studios, on Starfield. One day a Slack message went out letting everyone know that there were certain artifacts scattered across the texture data that we could not ship with. What were they? QR codes, aka the ubiquitous square doodads you must scan with your phone to read an infuriating number of public notices. For legal reasons (I guess?) these things could not be in the final product. But they were there nonetheless.

Why did these exist in our game? They were being used in “texture maps”, which are images that add surface detail to virtual 3D objects. In the game world you could find these codes on the backs of cereal boxes, robot parts, crates, and other bits and bobs in the virtual world that we collectively called “clutter”.

The good news was that these were purely “filler”. They only existed to make the world look more detailed. None of these codes mattered from a gameplay perspective.

a grid of black and white squares representing a 'QR Code' pattern
an example QR code

The bad news was the search space was big. The game was inching towards the ship date, and the texture dataset had grown to tens of thousands of high resolution images scattered across hundreds of directories, referenced in thousands of materials. Not “big data” huge, but these things could be in lots of places: weapons, outfits, spaceships, robots, architecture, almost everything but people and creatures. Big enough that a manual search would be a serious cognitohazard to some poor QA technician’s brain.

Glory to Arstotzka
#

Real-world artifacts like QR codes had been explicitly banned by the art guidelines. But BGS (like many studios) outsourced quite a bit of “clutter” asset development. Outsourcing made enforcing standards more complicated. Vendors were in different time zones, communicated via translators, and delivered their assets in large “batches” which needed to be accepted or rejected within a set time frame.

Handling this data fell onto the shoulders of in-house “integration” artists whose job it was to QC, organize, and ingest it into the game.

An example of a texture map
As game devs know, texture files are a mishmash of many image planes crammed together in a flat projection called “UV space”. Here is an example I cobbled together with public domain textures. Note the sneaky QR code in the lower left. It could end up being on the bottom of a soda can or the size of a building. It’s impossible to know without more information.

Imagine searching hundreds of such images for these artifacts, in addition to checking all the other guidelines, and doing all your other duties, hundreds of times a month. Doing outsource integration might not even be your only job! It’s like the world’s hardest game of Papers, Please. Absolutely impossible.

It appeared that most QR codes entered the pipeline this way, although I don’t blame vendors. It’s common for artists to grab mundane things and paste them like this to fill visual space. In fact this practice predates electronic computers (see below).

Photocollage by Hannah Höch
A 1919 photocollage by Hannah Höch that hangs in Berlin’s Staatliche Museen. Today Hannah would be demonetized instantly by some algorithm, although I suppose being a Dadaist in post-Weimar Germany was worse.

If a texture artist never got the memo about QR codes then it’s reasonable that they might throw some into a world like Starfield, where sci-fi corporations exist, making sci-fi packaged goods with sci-fi product labels.

None of the QR codes were malicious, it turns out. Although one did lead to a personal ArtStation account. Nobody caught much flak for this as far as I know. It wasn’t a huge problem.

However, contemporary IP law makes this a murky practice, and large publishers with deep pockets (who are juicy targets for lawsuits) are extremely paranoid about avoiding litigation. At least that’s what we all believed. I don’t actually know if an attorney was involved in this process. So how to fix it?

Fix It By Hand!
#

The aforementioned Slack message asked everyone in the art department to please keep an eye out and enter bugs manually. We already had a custom sci-fi doodad to use instead of QR codes, so the fix was simply swapping them out. We just had to know where they were.

In a procedural, open-world game, however, finding things organically is not systematic. Clutter assets were small and highly reused, tucked into nooks and crannies in hundreds of randomly generated spaces, and we had no tracking of where these codes might be.

A screenshot of the custom sci-fi 'cube code' meant to be used in Starfield
The sci-fi ‘cube code’, designed by a lead artist at BGS. This is what was supposed to be used instead of real QR codes. Image from Reddit user AbuckingNMS.

Please See Attached CV
#

Manually searching for QR codes stuck in my craw. It wasn’t an unreasonable request on the face of it, but QR codes are inherently designed to be discoverable by computer vision. Not fancy deep neural whatsits, but good old-fashioned computer vision from the dark ages of the 1990s.

The go-to tool for this sort of job was typically OpenCV. There was even a turnkey QR code detector which, given an image file, returned the bounds (and optionally the contents) of any detected QR code. It was as easy as something like this Python code:

import os
import cv2

tex_dir = r'C:\your_game\assets\textures'
detector = cv2.QRCodeDetector()

for root, _, files in os.walk(tex_dir):
    for file in files:
        if not file.endswith('.tif'):  # or whatever
            continue
        fullpath = os.path.join(root, file)
        img = cv2.imread(fullpath)
        if img is None:  # skip files OpenCV can't read
            continue
        # detect() returns (found, corner points); we only need the flag
        found, _ = detector.detect(img)
        if found:
            print('QR code detected!: %s' % fullpath)
The results of running a QR detector; if a code is found, a green square highlights it
On the left the QR Code is detected. On the right nothing is found (all clear).

I didn’t care about the contents of the QR code. If there was something even remotely QR code-ish present then that texture needed to be reviewed.

This approach gave good results and some interesting failure cases. QR codes are really clever and cool, but I won’t get into them here because this story is mainly about what came next. Suffice to say the sorts of things an artist might do to a QR code can actually invalidate it, and computer vision isn’t always rock solid. I decided to err on the side of caution. I tweaked the knobs so it spat out 100 or so images, far more than actually existed. It was then manageable to look through them by hand and discard false positives. I also kept an eye out for the images that had already been flagged manually, to see what the false negative rate was.

It looked good on a sample set. I pulled down all the data, which took several hours. The script was set to scan this with OpenCV and write out any hits as a small thumbnail image with the findings outlined in a gaudy chartreuse.

It actually worked, and found some previously unknown QR codes. That’s good! But it also found something else.

Something really bad.

The Big Problem
#

Here is what I expected to get back from the first few positive hits (mocked up with public domain textures from opengameart.org):

Multiple different textures, each with a QR code within them

Here is a mock-up of what I actually got back:

Many duplicates of a texture, each with a QR code within them

What’s with the duplicates? I thought my script was broken, so I checked a few by hand. No, the script worked. These were actually different files in different places but with the same visual data.

But I wasn’t looking for duplicates, I was looking for QR codes. The images with QR codes represented a very small fraction of the total data. If I was seeing this many duplicates in such a tiny sample, how many existed in the whole? This could potentially signal massive memory waste. Well, shit.

I made a note to bring this up with the leads at our next meeting, but I didn’t have to wait that long.

Simultaneous Invention
#

A day or two later I got a message from a lead artist. It turns out that he had noticed the exact same issue on his side of things. Profiling data from the Tools Department had also shown that texture data was keeping the game from fitting in memory. People had closed in on this from all sides.

His ideas for a fix relied on the fact that the graphics programmers had (very wisely) created a text-based material file format. That is, all the descriptions of every object’s surface appearance in the game were stored as JSON. That included texture references. That meant Tech Art could inspect and edit this data (carefully) outside the game editor (the editor itself did not have this functionality; I’m not sure if it’s been added since).
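
The real format is proprietary, so here’s a purely hypothetical sketch of what a text-based material might look like and why plain text makes it so easy to inspect, with a few lines of Python (all field names and paths are made up):

import json

# Purely hypothetical material shape; the actual BGS format is different.
material = json.loads("""
{
  "name": "clutter_cereal_box_a",
  "textures": {
    "diffuse": "textures/clutter/cereal_box_a_d.tif",
    "normal": "textures/clutter/cereal_box_a_n.tif",
    "roughness": "textures/clutter/cereal_box_a_r.tif"
  }
}
""")

# Because the material is plain text, listing every texture it references is trivial
for slot, path in material['textures'].items():
    print(slot, '->', path)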

He wanted a tool to quickly search and replace this text to fix the duplicate issue. We could fix the materials, then delete any newly “orphaned” textures. Sounds simple enough, but how many duplicates were there? We would need to know exactly what they were, and their file paths. Trying to manually find duplicates was a lot like trying to manually find QR Codes. Could computer vision help us here too?

The Solution
#

Finding QR codes is one thing. This was entirely different. Most textures in Starfield were 2048x2048 pixels or larger. To find true duplicates, each image would need to be compared, pixel by pixel, to every single other image. That is n(n-1)/2 comparisons. Not feasible. Also, what about images that were visually close, but not numerically identical? We might want to combine those as well, right?
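
To get a sense of scale, here’s a quick back-of-the-envelope calculation (the texture count is a hypothetical round number, not the real figure):

# Why brute-force pixel comparison doesn't scale.
n = 50_000                    # hypothetical texture count ("tens of thousands")
pairs = n * (n - 1) // 2      # ~1.25 billion image pairs
pixels = 2048 * 2048          # ~4.2 million pixels compared per pair
print(pairs, pairs * pixels)  # on the order of 10^15 pixel comparisons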

This made me think of Shazam, TinEye, and other services that instantly found matches to their inputs. These things predated modern “AI” solutions. They were doing something more straightforward.

Dab on em
#

Searching for this answer led me to hashing. Hashing is the process of taking some input and producing a fixed-size, numerical “fingerprint” called a hash. Because a hash is derived from its source data but is much smaller, comparing two hashes tells you something meaningful about the original inputs, only much faster.

For example, here is the size of Tolstoy’s War and Peace, from Project Gutenberg, compared with its MD5 hash:

An image of the MD5 hash of war and peace.txt

The document is 3.7MB, and the hash is 128 bits. If you wanted to know if two copies of War and Peace were identical you could compare their hashes almost instantly and see, rather than checking every word.

A hash is the output of a mathematical function. There are as many different kinds of hashes as there are hashing functions. In cryptographic hash functions (like MD5) if a given input changes even slightly the output hash changes completely. This is critical to cryptography for reasons I am too dumb to fully understand.

For example, appending “lol” to War and Peace totally changes the hash:

An image of the MD5 hash of war and peace.txt with added lol
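
You can reproduce this at home with Python’s hashlib (the file path is just a placeholder):

import hashlib

text = open('war_and_peace.txt', 'rb').read()  # placeholder path

# One 128-bit fingerprint for ~3.7 MB of text
print(hashlib.md5(text).hexdigest())

# A tiny change to the input produces a completely different hash
print(hashlib.md5(text + b'lol').hexdigest())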

This is great for the cryptographers, but in the case of image comparison this isn’t what we want. In our case, if two images are only slightly different then the hash itself should only be slightly different. In fact, the “distance” between hashes should correspond to the visual “distance” between two images. But how do we hash a visual?

Average Hash
#

One way we could do this is to shrink the image down, convert it to grayscale, then threshold every pixel value against the average. This would give us a small, black and white image. Reading off each pixel as a 1 or 0 would give us what’s called the “average hash”.

A dog sunbathing on a deck, next to the 'average hash' of that image
The average hash of my dog Tito, sunbathing.
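
Here’s a minimal sketch of the idea using Pillow and NumPy (not the production code, just the concept):

import numpy as np
from PIL import Image

def average_hash(path, hash_size=8):
    # Shrink and convert to grayscale so only the coarse structure remains
    img = Image.open(path).convert('L').resize((hash_size, hash_size), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float32)
    # Threshold each pixel against the mean: one bit per pixel
    return (pixels > pixels.mean()).flatten()

def hash_distance(a, b):
    # Hamming distance: how many bits differ between two hashes
    return int(np.count_nonzero(a != b))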

Average hash is fast, easy, and intuitive. You can look at the hash and still basically see the image itself. It does have some shortcomings though. If an image had lots of noisy pixels with values hovering around the average, those pixels wouldn’t contribute much to the way the image looked, but they would contribute a lot to the character of the hash. Also, sorting the hashes becomes problematic, since there’s no way to know which pixels are more significant and which are less so. Is there another way?

Frequency Decomposition
#

In a normal image the pixels are arranged in the “spatial domain”. That is, the intensity of each pixel corresponds to the image’s intensity at that X,Y location. We look at the pixels, we see the picture. Pretty intuitive.

But there is another way to represent the data of an image and that is by representing the component frequency “terms” that, when combined, recreate the image. A single term looks like a 2D sine/cosine wave.

In this domain the image doesn’t look like much because each pixel stores information about the amplitude, orientation, and frequency of one “term”. To see the picture all the terms must be re-composed. Here is a visual example:

Tito next to his component frequencies. Each frequency gets added in with a weight value or “coefficient” (not pictured). Frequency decomposition code from Stephen Gruppetta.

Notice especially how the image comes together quite well within the first couple thousand terms. That’s because lower frequencies contribute far more to the image’s character than the higher frequencies. In other words, larger shapes matter more than smaller details. That’s why artists squint their eyes when blocking in the values on a drawing. Cracking an image open and exploring its frequency domain gets to the marrow of how we actually perceive things.

A very popular way to do this decomposition is the “Discrete Cosine Transform”. JPG images use this form of frequency decomposition to do compression, by discarding less important frequencies. Here is Tito decomposed using the DCT (I’ve cropped it to 8x8 for visibility; in reality his decomposition was much larger).

The image, components, and coefficients of the DCT
In the middle are the coefficients, on the right are the frequency components. The thing to notice is that coefficients in the upper left are brighter. That means they contribute more to the final look.
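
If you want to poke at this yourself, SciPy can do the decomposition in a couple of lines (the file name is a placeholder):

import numpy as np
from PIL import Image
from scipy.fft import dctn, idctn

img = np.asarray(Image.open('tito.jpg').convert('L'), dtype=np.float32)  # placeholder

# Forward 2D DCT: each coefficient weights one cosine "term"
coeffs = dctn(img, norm='ortho')

# Keep only the low-frequency corner (the big shapes), zero the rest
kept = np.zeros_like(coeffs)
kept[:32, :32] = coeffs[:32, :32]

# Recompose: the image is still recognizable from a handful of terms
approx = idctn(kept, norm='ortho')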

Perceptual Hashing
#

“PHash” captures the character of an image’s DCT. Instead of averaging the pixel values themselves it works by hashing the coefficients. This way the hash really gets to the essence of how an image looks. The size of this hash can be increased or decreased by expanding it to the right and down, which is analogous to increasing or decreasing the “quality/size” of a JPG image. The hashes can also be sorted such that like images will automatically move together, with likeness decreasing as the hash distance increases.

A visual depiction of p-hash
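
A rough sketch of the recipe, loosely following how the ImageHash module does it (shrink, DCT, keep the low-frequency corner, threshold against the median):

import numpy as np
from PIL import Image
from scipy.fft import dctn

def phash(path, hash_size=8, highfreq_factor=4):
    # Shrink to a small grayscale image so the DCT has something meaningful to chew on
    size = hash_size * highfreq_factor
    img = Image.open(path).convert('L').resize((size, size), Image.LANCZOS)
    coeffs = dctn(np.asarray(img, dtype=np.float32), norm='ortho')
    # Keep only the low-frequency block in the upper-left corner
    low = coeffs[:hash_size, :hash_size]
    # Threshold each coefficient against the median: one bit per coefficient
    return (low > np.median(low)).flatten()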

All of these operations are supported in the ImageHash module, including subtracting two hashes to get their “distance”. If you’d rather not do the legwork yourself the clustimage module does all of this for you. The result is a list of images that fit inside an arbitrary number of clusters based on your desired threshold.
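
Using the ImageHash module directly looks something like this (the file names are hypothetical):

from PIL import Image
import imagehash

h1 = imagehash.phash(Image.open('crate_a_diffuse.tif'))
h2 = imagehash.phash(Image.open('crate_b_diffuse.tif'))

# Subtraction gives the Hamming distance: 0 means perceptual duplicates
distance = h1 - h2
print(distance)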

Clustimage also supports additional hashes beyond ahash and phash. There isn’t a single “best” hash, each one has a different character that may suit different purposes. I recommend testing a few. I found that many kinds of hashing could give good results if you tweaked the knobs a little on each process. We ended up using phash because it worked well.

Green arrows point to textures that are within a small distance, therefore similar visually
A given texture and its matches when choosing a relatively close hash distance. Notice that it even picks up the textures with small variations. We found that a hash distance of around 10-20 percent of the maximum hash size was a good starting point. Use a hash distance of zero to find true duplicates. When using clustimage each hash is packed into an adjacency matrix and all hash distances are computed at once, returning all the clusters in your entire data set.
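
From memory, the clustimage end of things looked roughly like this; double-check the clustimage docs for exact parameter names:

from collections import defaultdict
from clustimage import Clustimage

# Cluster an entire texture directory by perceptual hash
cl = Clustimage(method='phash')
results = cl.fit_transform(r'C:\your_game\assets\textures')

# Group file paths by their assigned cluster label
clusters = defaultdict(list)
for label, path in zip(results['labels'], results['pathnames']):
    clusters[label].append(path)

for label, paths in clusters.items():
    if len(paths) > 1:
        print('cluster %s: %d similar textures' % (label, len(paths)))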

Now Fix it
#

Back to the past. I hacked up clustimage to spit out the data I needed into a text file. I then made a script to take that data and generate thumbnail images as well as metadata for each image and cluster. All of this got wrapped into a package and pushed out to artists with a PyQt UI that allowed them to easily browse image clusters and pick which ones to “search” and “replace”. It then executed that replacement (using some regex safety bumpers to avoid touching any other data in the material files).
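
The search-and-replace itself was conceptually simple. A stripped-down sketch of the idea (paths, file extension, and layout here are all hypothetical, not the shipped tool):

import re
from pathlib import Path

OLD = 'textures/clutter/crate_b_diffuse.tif'   # hypothetical duplicate
NEW = 'textures/clutter/crate_a_diffuse.tif'   # hypothetical canonical texture

for mat_file in Path(r'C:\your_game\assets\materials').rglob('*.mat'):
    text = mat_file.read_text()
    # Only match the exact quoted texture path so nothing else in the file is touched
    pattern = '"' + re.escape(OLD) + '"'
    if re.search(pattern, text):
        mat_file.write_text(re.sub(pattern, '"' + NEW + '"', text))
        print('patched', mat_file)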

An arrow traces the path of most significant to least significant bits of the hash
If you want to sort your phashes, order the bits this way, so the “most significant” bits come first, perceptually.

After art did several consolidation passes, the code department was able to provide a list of the many textures that were now “unused”. All of these I eventually deleted from Perforce in a giant, ultra-satisfying commit.

How Many??
#

Hide the pain

A shit ton. I can’t recall the exact number but it was hundreds, possibly thousands, of duplicates or near duplicates. If I recall correctly, at least 10GB (not compressed).

Fortunately we were able to remove these from the archives that otherwise would have shipped and wasted space on millions of customers’ machines. Not 10GB, since they are compressed, but a significant waste. During runtime the difference in memory was less noticeable. We found that most duplicate textures were not being used simultaneously. For example, several textures were duplicated across each weapon. There’s a lot of weapons, so that’s terrible, but since the player only draws one weapon at a time this didn’t really gobble up extra VRAM. So this solution was needed, but not miraculous.

Postmortem
#

Just like with QR codes, the root of this issue came down to workflow. Lots of artists, both internal and external, working under pressure with tight time constraints and without great tools. One vendor even duplicated an entire texture directory tree, I assume with the plan to merge it all back later, but that step was never called out and didn’t take place. I don’t blame the individuals involved; this is what always happens when humans are in these circumstances. The amount of data in a AAA game is gigantic and manual review simply breaks down. You see the exact same thing happen with big data all over the place (YouTube is a great example).

This could have been mitigated with actual ingest tooling, which unfortunately the Tech Art department did not have time to make before things got cooking. Analyzing images and saving metadata (including hashes), then using that data to prevent duplicates, would have gone a long way. It also would have created a pipeline step where other Quality Control checks could be implemented. Artists would definitely have appreciated a message that said “hey, this texture is identical to an existing one, use this one instead?”
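
A sketch of what that ingest-time check could look like, using ImageHash and a simple JSON store (everything here is hypothetical, not an existing BGS tool):

import json
from pathlib import Path
from PIL import Image
import imagehash

HASH_DB = Path('texture_hashes.json')  # hypothetical metadata store

def check_incoming(texture_path, max_distance=4):
    """Warn if an incoming texture perceptually matches one already ingested."""
    known = json.loads(HASH_DB.read_text()) if HASH_DB.exists() else {}
    new_hash = imagehash.phash(Image.open(texture_path))
    for existing_path, stored_hex in known.items():
        if new_hash - imagehash.hex_to_hash(stored_hex) <= max_distance:
            print('%s looks like a duplicate of %s' % (texture_path, existing_path))
            return existing_path
    known[str(texture_path)] = str(new_hash)
    HASH_DB.write_text(json.dumps(known, indent=2))
    return None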

In retrospect, having outsource ingest tools may be more important than having internal asset tools, just because of the volume of data, the “batch” system, and communication. If I were designing a texture ingest tool today I would calculate some hashes, and perhaps some other metadata, and save that out too, just to have it in the back pocket.

Code
#

All the original code for this is proprietary, but I’ve created a simple, open source version of the duplicate detector program here. This is a standalone program written in Rust instead of Python, and is a command line utility.

BBarker/image_dups_rust

Image duplicate finder written in Rust
