Murphy Stein is a PhD candidate in the NYU Vision Learning Graphics group.

Ken Perlin

He’s been working with the Kinect_

Hi Murphy, how’s it going?

Good. How are you Greg?

Good. I asked you to come in today to talk about some of the finger tracking work you’ve been doing with Ken Perlin and the guys in your lab, and the marching squares algorithm that you use to find fingers. How did you guys get started with finger tracking?

So we were really excited about the Kinect and its ability to do skeleton tracking and disappointed that it didn’t do anything more than tell you where the centers of the hands were. So we figured for the research that we wanted to do, which was really use the body tracking and the depth sensor itself to explain complicated 3D objects floating in air in front of us, that we needed something like finger tracking. We didn’t know how much we needed so we started with some simple ideas, and marching squares ended up being a decent middle ground between trying to write very sophisticated algorithms using machine learning and just using image processing techniques. And also it’s a well-known algorithm.

So what is the marching squares algorithm? So you’re starting with a depth image and using marching squares to get what?

Right, so I guess the best way to think about marching squares, first of all it’s a 2D version of a very famous 3D algorithm written by Bill Lorensen and Harvey Cline, so you have to credit them. And this is the 2D version. You start with some sort of grayscale image, and you’re asking the question, for example, show me all the contours. Trace all the contours for me around the boundary between the dark spots and the light spots. So if you think about a checkerboard pattern and you want to trace around the dark squares, then it’s going to give you back if you will contours, say red lines, around all of the black squares. So the algorithm takes in a grayscale image and it gives you back a list of contours.

So is that a little bit like the magic wand in Photoshop?

Yeah, it’s a lot like the magic wand tool. And just like in the magic wand tool, you have the ability to dial in what threshold you want to cross over to include those pixels or exclude those pixels.

So you’re setting which color you’re looking for and how tightly you’re looking for it?

Right. So you can think of it as a pre-processing step that you start with a grayscale image, which is let’s just say arbitrarily a bunch of pixels in range 0 to 255, and you want to say that everything below 128 is going to be black and everything above 128 is going to be white. So you binarize that grayscale image and turn it into this black and white image, not grayscale, and then you run the algorithm on that. It’s exactly the same as just telling the algorithm ahead of time just trace a line through 128. Is that clear?

Yeah, that makes sense. So the goal then is to find a boundary or a loop around the image of a continuous area of color. Is that the right idea?

Yeah. I’m hesitant to use the word “color” because it doesn’t really know much about RGB. In theory you could do it on R then do it on G then do it on B, but it doesn’t make a lot of sense to think about it in RGB. So I very much in my head have a black and white image in mind.

So a continuous area of brightness.

Brightness, yeah.

Does it find only one loop around a particular continuous area or does it find multiple ones?

It finds all of them, and that’s really cool. Another way to think about it in the case of a depth image coming from the Kinect, what you have is for every single pixel you have the distance away from the camera in millimeters let’s say. So if I had my hands up towards you and I knew at what depth my hands were roughly, say they were about 1000 millimeters – a meter – away from the camera and I wanted to trace a contour around everything that’s closer than 1000 millimeters then the way to think about what the depth image actually looks like is: if you were to take it from this view and put it flat, you have these hills where there was something closer to the camera and these depressions where there was something further away from the camera.

Another way to view the marching squares algorithm for depth images is that what it’s doing is it’s pouring water onto this topographical map and it’s showing you exactly where the water gets to. So in the case of my hands it would trace a line which is around my fingers because through these spots its really, really deep and so the water would come up to here if I knew that my hands were at a fixed distance.

So that way it will catch all of those spots including the two separate ones?

And it gets both my hand, yeah.

So and then you’re using that to then find your fingers. So in a minute we’ll get ito talking more about how that algorithm works to find that silhouette. Once you’ve got that contour of the hands how do you then find the ends of the fingers?

Yeah, that’s really a good question. People have been looking at these sorts of images processing techniques for 25 years and so we have been using some of the more well known techniques. The basic idea is that if you think about that contour as a long piece of rope that’s connected, the fingers look like a part of the rope where it got bent. If you were an ant walking along that rope, the ant would have to change direction really really quickly when it gets to the top of this finger. So it’s not really turning, not really turning, and then it’s turning really, really, really fast and then it’s coming down here. And so the way we talk about that is it has a high curvature. The basic notion of curvature actually has a precise definition which I don’t bother with. For what we’re doing you can think about it from the ants point of view how much did I turn compared to how far did I walk. If you look for places on that rope where it has to turn a maximum amount, so the most on the entire rope, that’s usually the fingers. That’s the trick we use.

So the tip of the finger and also at the valley in between the fingers, it’s almost turning around 180 degrees over a very short period?

That’s right and if you think about what the ant’s thinking about, it knows the difference between going here (point at top of finger) and coming down here (points at valley between fingers) and the reason is that if the ant is traveling from this side to this side of my hand, as it goes down into this valley it’s turning to the left. And if it goes up here it’s turning to the right. So you can just say fingers are turning to the right and that’s the one you’re going to care about. But it also is useful to know what these are here too.

By finding both those valleys and the tips can you figure out the length of the finger from that?

Yeah, definitely. Because you have the depth information from the original sensor and so yeah. You can look at what are the two points there. In practice you have to get pretty close to the Kinect camera in order to really resolve lots of different fingers. The thumb is always pretty easy and doing that (making an L-shape with thumb and index finger) definitely guarantees you got this guy (the index finger). But getting these valleys in practice is pretty tricky from the Kinect because the closer you get the bigger the image gets but unfortunately it cuts off right at about 50cm which is right when it starts getting good. Even at 40cm with the new near mode on the Kinect is frustratingly sort of still too far.

It seems like a lot of gestures except for a few wel- known expressive ones tend to mostly use the thumb and the first finger or a few fingers together

Yeah, there’s a ton of mileage just out of these two. Yeah, I mean I think ant eaters use that.

So, maybe let’s move over to the white board and talk a little bit about how marching squares works.

Yeah, cool. Awesome.

[…] Murphy moves to whiteboard […]

Marching squares

The final state of the whiteboard after Murphy finished explaining Marching Squares.

We’re talking about marching squares which is the idea that if you start with some image, which in this case is represented here by these pixels. Let’s say dark means it’s filled in (draws). So it’s not a very interesting picture, but these eight pixels and these four pixels are filled in. Marching squares asks: what is the contour that connects these two together? So for example given this image it would give you back this contour.

Whether you make this a single contour or two contours is a little bit up to you, but I’ll give you the high level. Ok so how does this work? I’m going to draw myself another set of pixels here.

The pixel values are inside these squares. In a grayscale image those values are 0 to 1 as floating point numbers or 0 to 255. By convention I’ll say each pixel is in 0 to 255. In one each of these spots is some value so let’s say 255 255 255 and on.

Let’s do it with 1s and 0s instead because it will be simpler.

So let me start this again. Let’s look more specifically at how it finds those contours. Let me draw those contours. So here I have 16 pixels and each pixel is in the range 0 to 1. We’ll say 1 is white and 0 is black. By convention we can decide we’re interested in drawing contours around the black stuff or around the white stuff. I’m just going to say we’re going to draw contours around the black stuff. So if these pixels here are black then we want to draw this contour. But now since these are pixels we’re not going to shade them we are going to draw the values and say this is 0, 0, 0, 1, 1, 1, 1, 1, 1.

And the first step that marching squares does is that it doesn’t use this grid. It uses the grid consisting of the centers of each of these points. Let me back up for a second. The way to view what marching squares is doing is it’s looking over the array of pixels and it’s trying to find all of those thresholds between neighboring pixels where you jump from below the threshold to above the threshold. And for us we’ll say the threshold for this image is 0.5. So it’s exactly in half; it’s grey. So you can sort of intuitively already see, right? The threshold jumps from 1 to 0 here, 1 to 0 here, 1 to 0 here, etc. So where we’ve drawn the contour makes sense with that new definition which is more precise.

Now the trick, I would say, of marching squares, the ingenious trick, is that it turns out that if you look at 2x2 groups of pixels, so here’s a 2x2 group of pixels, this guy could be a 1 or a 0, this guy could be a 1 or 0, this guy could be 1 or 0, this guy could be 1 or 0. If you ask how many different possibilities are there, each one has two possibilities. There’s four of them, so two times two is four, times two is eight, times two is 16. There’s 16 possibilities. What are those possibilities? Well, we can draw them out. And we can draw them out as follows. We can say okay, so this guy can be a 1, this guy can be a 1, this guy can be a 1, this guy can be a 1.

Or we can have 1, 1, 1, 0, etc. So I’m going to skip ahead and I’m going to write down all these possibilities over here. And I’m going instead of possibilities I’m going to call them codes. And remember there’s 16 codes, okay? And so we’ll call the 0 code is 0000. The 1 code is 0001. The 2 code is 0010. Zero to 15 is 16 codes. Okay, so now we have all the ones with a single 1 and the rest 0s, and we have the one that’s all 0s. Now we need to do the ones with two 1s and two 0s. So this is 0011, 0101, 1001, and continuing, 0110, 1100 and I think that’s it. There should be four of them, right? You agree? Nope, I’m missing one. 1001, 0010, do you see it? 0011 is there. 0101, this guy goes in here. Didn’t get them all.

1010

1010, well done. Extra points. Okay, now it’s all the ones with three 1s and one 0. So this is 0111, 1011, 1101, 1110 and the last one is all 1s and done.

Alright, so there’s 16 codes. So now coming back here, what do the codes mean? We can finish that up. We can call this guy is top-left, this guy is top-right, this guy is bottom-right and this guy is bottom-left, right? So arbitrarily I’ll just say the first column is top-left, second column is top-right, then bottom-right and then bottom-left.

So what this code is telling you is for the 16 possibilities what is it expecting for each pixel value, a zero or a one? The way to view zero or one is that it’s above the threshold or below the threshold. So zero means that it’s below the threshold and one means that it’s above the threshold. So that should be clear.

Leave those codes. Onto the algorithm. Any questions so far? You are the Ur student.

Nope.

Okay, so how do we use this? Basically we have to turn this image into now a collection of codes because for each two by two group of pixels… [draws at the whiteboard] So here’s a two by two group of pixels: one, one, zero, one. So it would get code 13. Similarly here is another set of two by two pixels. This guy is one, one, zero, zero which is number nine.

This guy is one, one, one, zero. He’s 14. Okay? And on and on and on and on. Right? So one thing to notice is that we started with an image which was four by four but we’re going to end up with codes that are three codes across and three codes down. So one smaller in each dimension. That’s because we’re thinking about groups of two pixels, not individual pixels.

Is that because for the pixels on the border there’s no pixel outside of it? So those pixels don’t have a neighboring one.

Yes, exactly. Well said. You nailed it. So, here’s the trick. So that was one trick. I guess maybe I should quickly finish these codes? Yeah. We’ll quickly finish these codes.

[Proceeding through the pixels in his image, labeling the codes based on looking up the patterns of each 2x2 set of pixels in the list of codes established earlier.]

So this is one zero zero one. If you see the code call it out. One one zero one. Thirteen again. Is that right? Oh, no. One zero zero one. Seven. Thank you.

And then this is zero zero zero zero. Zero. And this guy is zero one one zero. Eight. And this guy is one zero one one. Twelve. And this guy is zero zero one one. And this guy is zero one one one. Eleven. Nice. Thank you.

Okay, so the algorithm as a whole has two steps and this has a part A and a part B. We’ll go through all of them. So the first step is what we just did. You run through the entire image and you assign a code to each two by two set of pixels. So assign a code to two-by-two pixels.

So now we’ve created this code book where this top left pixel has the value of 13 in it. The next pixel to the right of him has the value nine. The top right most pixel has the value of 14, etc. Okay?

The next step is to transform the codes into contours. The reason that works is: let’s take a simple example here. So I’m going to represent my code using this diagram. So this is two codes here. Let’s say that this represents a boundary between black and white.

So this is going to be zeros here, which I’ll just fill in to represent black. And this guy’s going to be white, and this guy’s going to be white. I’m just going to leave them the way they are. So just to be clear, this guy has some code, right? His code is 0,1,1,0, which is eight. And this guy has some code which happens to be all ons(i.e. 1,1,1,1), which is 15. So now if you were going to put an edge here, you have a couple of things you have to decide. One is you have to remember, what did we agree on in the first place? We’re going to draw contours around black regions. The other thing we have to decide is which way are we going to go? Are we going to left around the black region or are we going to go right around the black region? That is to say, are we going to try to keep black on our left or are we going to try to keep black on our right?

It’s not the same, right? Say that we have this black region here and we’re a little ant. If we’re headed this way then we’re keeping black on our right, whereas if we’re heading this way then we’re keeping black on our left. So we can just arbitrarily decide, it doesn’t really matter, we want to go one way or the other. I’m going to keep black on my left, okay?

So I’m keeping black on my left. I’m marching through my codes now doing step two, and I’m trying to convert this into contours. So it’s pretty clear that if you think about what this image looked like it looked like some sort of black region next to some sort of white region. So you want to draw the line here. And since we agreed already that we want to keep black on our left then really we’re going to go this way around the contour. In a sense you can think I want to put an edge here and I want the edge to be facing up so I know which direction to go next. So that’s what you do. You put an edge here and the next space you’re going to explore is the pixel group, in this case a 2x2 pixel group, which is in the direction that the arrow is pointing in.

So this has all been a little bit fast and loose buy what you’re doing in step 2a is starting at the top-leftmost pixel code, and if you’re not already drawing a contour then you’re looking for an edge. And an edge is something that looks different than 0 and different than 15. Zero’s totally boring, you’re deep inside some black region. Fifteen is totally boring, you’re deep inside some white region. There’s no edges. All the other codes have interesting cases because it means you’re looking at some sort of black region or you’re looking at some sort of white region and you can cross over.

So starting here, the code here is 13. We’re not already drawing a contour because we just started. Code 13 is not 0 and it’s not 15, so it’s an interesting case, which means we start a contour. So we know exactly what it looks like, right? Thirteen corresponds to this case, where those are all white and this guy’s black. And so we know how we want this edge to go. We want it to go like that. There’s different ways you can think about it, but that’s sort of the simplest way to imagine drawing an edge. Let’s just see how we get that from the code and it should be straightforward how we get it for the rest of the codes too.

So our value is for 13, 1101, and I find this representation a little bit easier to think about. So 1101 corresponds to this guy being filled in. Now you want to keep black on your left, right? We already decided that. So the only way that makes sense is you’re going to come from this edge here and you’re going to head to this edge here. That’s keeping black on your left. Little ant is here, little ant is there and he’s headed this way. So great, let’s just keep track of our edges. You can do that in an array that’s the size of the image, you can do it in a linked list, it’s up to you. But you have to keep track of the edge. And the edge I’m just going to represent as a sequence of pairs. So this guy is going from some X value here, and the X value is given by the X value of these pixels. Remember each one of these corresponds to a pixel. I’m not going to write them out because it’s kind of annoying to have to compute them, but it’s easy when you have a computer. So it’s going to be some X, x0, and some y0. The y0 is coming from the Y-value in between these two pixels.

You have a choice here. There’s the Ford Focus version of this algorithm where you get a pixilated contour that has sharp edges, and then there’s the Jaguar of this algorithm which is that you get really nice, smooth contours that look bendy and undulate if that’s what your underlying image looks like. I’m going to show you both, but right now just keep in the back of your mind the X is obvious because you’re on this edge and it has to take the X value from these two pixels. The Y value is not obvious. You could arbitrarily say it’s going to be midway between these two, so this guy’s Y value – y1, y2 – could be midway between those y1 and y2, but that’s going to give you the Ford Focus version.

The more interesting version let’s come back to, the Jaguar, okay? But now continuing, so this edge corresponds to x0, y0. Let’s keep track of the edges here, so x0, y0 is an edge. So because this arrow’s pointing down, and y is going down because we’re an ant who’s keeping black on our left, then it says go explore the next code straight down because it goes through this edge. So the next one straight down is code 7, and code 7 corresponds to 1001. So you can just see, right? White on this side, white on this side, black, black. How do you want to draw the edge? Do you want to draw it like this? No, doesn’t make sense. You want to draw it straight down because that’s keeping black on your left.

So actually if you look here, we have three points that we figured out: this point, this point and now we’re at this point. I only drew one. Let’s say the next one is x1, y1 and that’s this guy here, and the next one will be x2, y2. As you go from here to here, it’s not clear in this case where the X value should be. It’s also not clear here where the X value should be. But the Y is totally clear. The Y has to be the Y value of this guy, which is the same as the Y value of this guy. So this one it has to be the same Y value as this, which is the same Y value as this. The question here is the X values, okay? Again because this is the Ford Focus version, when we don’t know the answer we’re just setting it to the average of the two which is a fine thing to do. Very fuel efficient. Okay, so this guy’s saying explore down right? He’s saying continue down. I don’t know if you can see that, but I’m going to draw the continuation of this.

This code value is 12. Twelve’s code says 1011. So 1011. It looks just like this as I’ve drawn it. So since we’re an ant keeping black on our left how do you want to draw the edge? Do you want to draw it straight down? Nope. You want to draw it in through this edge. That’s how it looks. That’s what the ant would do. So we’re going to add a point for where we’re connecting this edge to this here. So continuing because this arrow pointing into this edge says you want to explore to the right. When you continue exploring to the right we look up code. It says five. What is five? Five is 0011. Well it’s going to be really obvious by now you want to draw the arrow this way.

What should this be? This guy’s code is 11. 11 is 0111. So this is empty inside, right? So if you’re the ant keeping black on your left you want to go up through here. Okay?

Did I forget a point? Yes. 1, 2, 3, 4, 5, 6. So we want to be at x4,y4 x5, y5. And continuing this guy’s saying explore up this way. Okay, happy to do that. And this guy’s code is 8. Eight means 0110.

So what do we want to do? We want to draw the edge this way. Add a little point to our list of vertices.

At this point you actually have the middle square you’ve seen is all zeros, but you don’t actually have to look at it.

But I don’t explore it. Yeah. I’m not changing anything but I’m just now noticing that I’m coming around I’m just going to correct how I’ve drawn this so that it lines up with guy up here. So I said this is code 8 and 8 means 0110 which is dark 110. So I’m going to keep black on my left I go here and that guy should be…And then one more. So we’re exploring up here.

This guys code is 14. 14 looking here, 1110. Already done. So this guy, since these are all ones, he must want to come in here and just connect that and now we want to explore this way.

Coming in through here what is this guy? He’s 9. 9 means 1100. So black is down here. We want to go this way. So you can see we’ve drawn out this stop sign looking shape and just to redraw it for you we went down here, we went down here, we went down here, we went down here, we went down here, over here, up, over there. So it’s like a stop sign.

As Greg pointed out we didn’t explore this at all and we should think about that. Right? So what does that mean? Well, what just happened is when we got back to this spot we re-encountered the beginning of our contour. The really magical thing, nature just giving us something free about marching squares, is that — and you should think about this — you are always going to get back to that point For me it’s the clearest if you think about it like the topographic map that I was describing before. If you’re thinking about pouring water into that map the border around that hillside where the water is coming up to a certain level is definitely going to be a ring around that hill. There’s no question. So that ant is definitely going to get back to where he started. But I still think it’s kind of amazing and magical.

So since the ant got back to his starting point this contour is now done. Remember it’s the Ford Focus version but that’s ok. We actually have the values for x1, y1, x2, y2 so you can think of this as being a contour now, but we haven’t explored the whole image. In this case, as Greg pointed out, there is still one pixel we haven’t explored, which is this guy. In step A what you’re doing is you’re exploring codes in an order, like say top left to bottom right. If you need to make a new contour, if you encounter a code other than zero or 15 then make a contour and if not continue exploring the codes. So we just painstakingly went through a tiny step of A which is we found a contour at the very first code we encountered and then we went to step B and we created a whole contour. Now we finished, we got right back to where we started, which is this one, and we should continue marching along our pixels and see which codes have not been explored yet. So coming back to this representation, this guy, ok, then we went to this guy, then we went to this guy, this guy, this guy, this guy, this guy, this guy. So as we march along from top left to bottom right we say ok, is this guy explored? Yeah he’s explored. Is this guy explored? Yeah he’s explored. Explored, explored, (gasp) not explored. So we should re-run step B for this pixel, except this guy is totally boring. He’s exactly code zero. He doesn’t get a contour, so we’ve explored him. and continuing and continuing and continuing and continuing. So now we’ve done the whole image, and we have the Ford Focus version of the contour surrounding of these pixels. Maybe next time I can show you the Jaguar version.

Thank you.

Alright. Awesome.