Sometime in December 1975, a Kodak engineer named Steven Sasson captured the first digital photograph. Sasson’s prototype camera took 23 seconds to record an image onto a cassette tape. The picture was black and white and only 100 by 100 pixels in size before being interpolated for viewing on a television set.1 In the nearly forty years since that day, digital camera resolution has improved by approximately an order of magnitude per decade, a curve of improvement reminiscent of the legendary Moore’s Law, which predicted the geometric rate of increase in affordable computational power that has so changed our contemporary world. (See chart.)
I believe that this geometric increase in available image capture capacity (or the More Pixels Law, if you will) is changing our world in a manner analogous to the much better recognized Moore’s Law. This piece aims to explore the mechanisms by which this change will happen and to suggest some of the consequences for visual culture, the arts, and design of a world where first 100 megapixel and then gigapixel cameras are available and, eventually, ubiquitous.2
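The arithmetic behind this curve is simple enough to sketch. Here is a toy projection (my own illustration, not a rigorous model) that starts from Sasson’s 0.01-megapixel prototype and compounds an order of magnitude per decade:

```python
# A toy projection of the "More Pixels Law": camera resolution growing
# roughly an order of magnitude per decade, starting from Sasson's
# 0.01-megapixel (100 x 100) prototype in 1975.
def projected_megapixels(year, start_year=1975, start_mp=0.01):
    """Resolution in megapixels, assuming a tenfold increase per decade."""
    decades = (year - start_year) / 10.0
    return start_mp * 10 ** decades

for year in (1975, 1995, 2015, 2025):
    print(year, round(projected_megapixels(year), 2))
```

Run forward, the same compounding puts gigapixel-class (1,000-megapixel) sensors around 2025, the fiftieth anniversary of Sasson’s photograph.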
I will argue that the combination of the More Pixels Law with accompanying revolutions in mobile computing and computer vision is transforming the use of images in our culture. Digital images are moving from being pictures to being data. Rather than capturing solely what things look like, they’ve begun to capture something of what they are – albeit always filtered through the specific textures of digital sensing, storage, and computation. As a result, the visual vocabulary of data processing and computation is entering the realms of traditionally pictorial visual culture. And, further, cameras are taking their place at the center of a whole new kind of art and design practice that uses them not to create images, but to sense and track the world in order to create interactions.
The alchemy transmuting digital images from pictures into data involves three ingredients: high resolution cameras, an increase in the computational power available to process these images (especially in mobile devices), and radically improved software techniques for extracting interesting information from these images.
What does a higher resolution camera get you? We can get a sense of this by peeking ahead at the results of the next order-of-magnitude increase coming from the More Pixels Law. A group of researchers at Duke University working for the Defense Department recently announced the existence of a gigapixel camera, constructed from an array of small sensors usually used in smartphones. The images created with this camera capture so much of the scene that the researchers provide a Google Maps-style interface for diving deep into their details.
Images of this resolution make even objects that are quite small in the scene available to be detected by computer vision algorithms. In fact, as you can see in the wired.com illustration above, they capture details that are far smaller than could be distinguished by the human eye, details like faces, license plate numbers, and sign text that can be processed by current computer vision techniques, but would usually be too small in images captured with current cameras.
Obviously, a gigapixel camera produces huge image files that would be incredibly computationally intensive to process. Today, most consumer camera-computer setups cannot even stream 1080p video at 30 frames per second (a typical minimum standard for “real time” applications), let alone process the input from a gigapixel camera in order to detect the distant people and objects revealed by its incredible resolution.3 In fact, many contemporary computer vision projects use cameras such as the PS3 Eye, whose 640 by 480 resolution (which would have been top-of-the-line in 2000) is well below that of contemporary DSLRs, precisely because the lower resolution image means fewer pixels for the algorithms to process and, therefore, better performance. On the other hand, most high resolution cameras (for example, your typical 20 megapixel DSLR) are attached to incredibly primitive computers meant only to drive the user interface and store the images captured by the sensor.
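Some back-of-the-envelope arithmetic (my own, using the resolutions mentioned above) makes the scale of the problem concrete:

```python
# Back-of-the-envelope pixel throughput: why real-time gigapixel
# processing is so much more demanding than common camera resolutions.
def pixels_per_second(width, height, fps=30):
    return width * height * fps

ps3_eye   = pixels_per_second(640, 480)      # ~9.2 million px/s
hd_1080p  = pixels_per_second(1920, 1080)    # ~62 million px/s
gigapixel = 1_000_000_000 * 30               # 30 billion px/s

print(f"gigapixel vs 1080p: {gigapixel / hd_1080p:.0f}x more pixels per second")
```

Every one of those pixels must be touched, often several times, by any vision algorithm running over the stream, which is why lower resolution inputs still win in real-time work.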
The rise of the smartphone is changing this dynamic. Smartphones are rapidly becoming both the world’s most widely used computers and cameras. While smartphones still have less computational power than top-tier desktop computers, Moore’s Law is conspiring with the fierce competition in the market to improve them rapidly. One relevant point of comparison: the processors in today’s iPhones are more powerful than the SGI workstations that were used to do the 3D rendering for Star Wars: The Phantom Menace in 1999.4
Another factor accelerating image processing capacity is the rise of GPUs, or Graphical Processing Units, chips that are purpose-built to manipulate high resolution images and 3D graphics in real time. The rise of powerful, programmable GPUs is the most recent large-scale change in computer hardware design and existing computer vision software and research approaches are still being reconfigured to take full advantage of it. Similarly, powerful GPUs are just beginning to become widespread on smartphones where their capabilities are central to the performance of touchscreen interfaces and rich graphical games, but their power consumption is a major challenge for mobile battery capacity. We can expect to see dramatic improvements both in mobile GPU capacity and in programming techniques that take advantage of it.
The result of these improvements is that most consumer cameras today are connected to what would have been considered supercomputers less than a decade ago. These cameras’ resolution is improving on a curve equivalent to higher-end cameras. The sensors on current iPhones, for example, are as good as any that existed, even at the high end, only a few years ago (see the lower line on the chart). And these smartphones come with a vigorous market for applications that use these cameras in new and creative ways, from Microsoft’s Photosynth to Autodesk’s 123D Catch to an endless parade of augmented reality apps. This market constitutes a powerful pipeline for turning computer vision research into new interactions and aesthetic experiences for a huge set of people.
So what is this computer vision research doing? What data has the field managed to extract from images? The last few years have seen a number of prominent breakthroughs in computer vision, from the Google self-driving car to the Microsoft Kinect. The field is so rich and diverse that its progress is difficult to summarize concisely. I’ve written before about its history and evolution. In this section, I’ve chosen to highlight three recent research projects I find especially intriguing from an art and design perspective. They’re intended to capture some of the field’s diversity, and suggest some of how it operates, in the briefest possible survey. They range from detecting the most intimate biological processes of individuals, to tracking objects in the home, to analyzing the behavior of mass numbers of people in public space, all by extracting data from camera images.
First, take this 2012 SIGGRAPH paper demonstrating Eulerian Video Magnification. This technique allows researchers to amplify small changes in videos such as small movements or changes in color. In their impressive demonstration video, they use the technique to take a person’s pulse by amplifying the flush of their face and they track an infant’s breathing by exaggerating the movement of their chest:
As you can see in the video, this technique enables computer vision applications to substitute for sensors that formerly had to be attached to people, such as heart rate and breath monitors. This is a theme I’ll return to below when discussing the role of the camera in interactive art and design.
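The core idea can be sketched in a few lines. This is a deliberately minimal, one-dimensional sketch of my own, not the authors’ implementation: treat a single pixel’s intensity over time as a signal, amplify only a narrow temporal frequency band (here, one plausible for a human pulse), and reconstruct. The band edges and gain are illustrative assumptions; the real method applies this per pixel across a spatial pyramid of the video.

```python
import numpy as np

# Minimal 1-D sketch of Eulerian-style magnification: boost a narrow
# temporal frequency band of one pixel's intensity time-series.
def magnify(signal, fps, low_hz, high_hz, gain):
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    boosted = spectrum.copy()
    boosted[band] *= (1 + gain)          # amplify only the chosen band
    return np.fft.irfft(boosted, n=len(signal))

fps = 30
t = np.arange(10 * fps) / fps
# A faint 1.2 Hz "pulse" (72 bpm) riding on a constant skin tone of 0.5:
pixel = 0.5 + 0.001 * np.sin(2 * np.pi * 1.2 * t)
out = magnify(pixel, fps, low_hz=0.8, high_hz=2.0, gain=50)
# The imperceptible 0.002 peak-to-peak flicker becomes a visible ~0.1 swing:
print(round(float(np.ptp(out)), 3))
```

The same bandpass-and-amplify logic, applied to vertical motion rather than color, is what exaggerates the infant’s breathing in the demonstration video.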
Another recent project from the University of Virginia focuses on tracking objects instead of people: Kinsight: Localizing and Tracking Household Objects Using Depth-Camera Sensors. Kinsight uses a series of Kinects to observe the movement of people in an enclosed space. It performs object recognition so that it can tell what objects are present in each area before a person arrives and after they depart and it uses the resulting information to track objects as people move them around. Kinsight combines many current approaches in computer vision: it uses machine learning to determine what objects are visible from each Kinect, it uses depth cameras to track people and objects in space5, and it uses the Kinect’s skeleton tracking capability to determine which objects people interact with. The result is a comprehensive object monitoring system built with cameras as the only sensors.
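The bookkeeping at the heart of this approach can be illustrated with a toy sketch (the names and structure here are my own, not Kinsight’s): compare the set of recognized objects in a region before a person arrives with the set after they leave, and the differences tell you what was picked up or put down.

```python
# Toy version of Kinsight-style object tracking: diff the recognized
# objects in a region before and after a person passes through it.
def diff_region(before, after):
    """Return (removed, added) object sets for one observed region."""
    return before - after, after - before

kitchen_before = {"mug", "keys", "phone"}
kitchen_after  = {"mug", "phone"}
removed, added = diff_region(kitchen_before, kitchen_after)
print(removed, added)   # the person walked off with the keys
```

Chaining these diffs across regions, using skeleton tracking to attribute each change to the person present, is what lets the full system follow an object from room to room.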
Where the first project analyzed something as intimate as the flush of an individual’s skin and the second one tracked objects and products in the home, this third project deals with the movement of large numbers of people in public space.
In the last few years, Oxford University’s Active Vision Group has conducted a sustained research project on Coarse Gaze Estimation in Visual Surveillance. They’ve demonstrated the ability to detect large numbers of people simultaneously, locate the position of their heads, and estimate the direction that they’re looking, all from low resolution color video.
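The “coarse” in the name is the key design choice: rather than recovering an exact gaze vector, the system assigns each head a discrete direction label, which is far more robust to estimate from low resolution video. A toy sketch of that quantization step (the bin layout and labels are my own illustration, not the Oxford group’s):

```python
# Toy quantization of a continuous head direction into eight coarse
# bins, the kind of discrete label a coarse gaze estimator outputs.
LABELS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def coarse_gaze(angle_degrees, bins=8):
    """Map degrees (clockwise from north) to the nearest direction bin."""
    width = 360 / bins
    return int(((angle_degrees % 360) + width / 2) // width) % bins

print(LABELS[coarse_gaze(10)], LABELS[coarse_gaze(93)], LABELS[coarse_gaze(350)])
```

Aggregated over a crowd, a stream of such labels is enough to answer questions like “what are people looking at?” without resolving any individual’s eyes.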
While this work has obvious security and commercial applications, as an artist and designer I can also imagine a myriad of ways to use it in an installation or museum context, from making work that directly addresses issues of surveillance itself to building a system to provide metrics for museum designers to measure engagement with exhibits.
These techniques (and the many hundreds more published in computer vision journals every year) will only become more sophisticated and add new capacities as the More Pixels Law does its inevitable work in the coming decades. Further, applications which now seem infeasible due to limited camera or computational capacity will rapidly come within reach. We’ll be able to extract ever more data from our images and we’ll be able to do it on the tiny camera-computers we carry with us at all times.
Moore’s Law has been in effect for more than 50 years. In 1965, when Gordon Moore published his paper describing the effect, he cited data dating back to 1958. Even with that sustained rate of growth, the real cultural impacts caused by powerful affordable computation didn’t hit until the 80s with the popularization of the personal computer.
The More Pixels Law has been in effect for barely half the time of Moore’s Law. It has just entered the equivalent of Moore’s Law’s 80s stage, with cultural effects only beginning to accumulate. In the coming decades we can expect those effects to grow the way the impact of computation did in the 90s and 00s with the rise of the internet and smartphones. The tenuous experimental trends we see now will blossom into something enormous.
In these last sections, I’ll take a look at how artists and designers have reacted to the results of the More Pixels Law thus far. First, I’ll describe how the data derived from computer vision is being used to construct a new photographic rhetoric of realism. Artists and designers are applying (and in the process modifying and reinventing) long-established ideas about the source of authenticity in photographs – from Roland Barthes’ Reality Effect to the idea of the Photographic Trace established by early photographers like Henry Fox Talbot – to the new images that are the result of computer vision: debug screens, heads-up displays, data visualizations, and depth data.
Secondly, cameras have become the center of a new style of interactive art and design. Cameras and computer vision algorithms are used to track the movements of people which are then incorporated into the visual output of the pieces. These pieces sometimes use the data extracted by the computer vision algorithms to enhance the photographic images of the people themselves in a manner analogous to the visual effects used in movies. Other times they use the tracking data to embed the viewers in an embodied version of the data visualization vocabulary common to the Reality Effect.
In his essay, The Reality Effect, Roland Barthes described how literary realism creates an impression of authenticity by the inclusion of “superfluous details” such as “notations, data, descriptive details”. The annotated images produced by computer vision applications represent a contemporary, visual version of this Reality Effect. They add a forest of machine-generated “notations” on top of the original images lending the results a sense of realism analogous to the literary effect described by Barthes, but now backed by the objective, scientific aura of computer science research.
A great example of this can be found in Max Gadney’s account of in-screen sports graphics. Gadney describes the advances made by the company Sportsvision in capturing data about sports and integrating it into live broadcasts that cover sporting events. Gadney holds up the results as a model for designers working with data in other fields.
The in-screen graphics that Gadney praises use computer vision techniques to track objects in the scene being captured so that the graphics appear to be physically integrated into the scene even as objects move within it and cameras move around it, an approach that maximizes the visual density of the graphics and hence their reality effect.
Further, much of the funding for computer vision research in the last two decades has come from military and intelligence sources. As Matt Webb tweeted back in 2010 when the Kinect launched:
WW2 and ballistics gave us digital computers. Cold War decentralisation gave us the Internet. Terrorism and mass surveillance: Kinect.
It is partially because of this heritage that images overlain with computer vision-derived data displays carry such a strong sense of authority. The field’s military heritage extends the long tradition in photographic theory of leaning on photography’s scientific basis as a source of authority and objectivity.
In his 1844 book, The Pencil of Nature, early photography pioneer Henry Fox Talbot argued that unlike drawings, in which the artist necessarily intervened, photographs were scientific instruments created “by optical and chemical means alone” without human interference: “nature’s painting”.
In a similar vein, designers and artists have found themselves fascinated with the debug screens and technical illustrations frequently included in publications of computer vision research, artifacts which contain a strong sense of this objective, impersonal point of view.
Timo Arnall of design and invention firm Berg London has made maybe the masterpiece of this genre, a short film called Robot Readable World, which aggregates a large number of videos from scientific and technical demos showing the debug displays of various computer vision systems.
At their best these kinds of explorations go beyond merely borrowing the authority of computer vision’s Photographic Trace or Reality Effect for their own purposes and undertake a deep investigation of the texture of these systems and how they penetrate our daily lives. The New Aesthetic blog by James Bridle was a great example of this kind of work at its best, collecting everything from Adam Harvey’s deconstruction of the Viola-Jones face detection algorithm to the operation of Google’s self-driving car’s vision system.6
At their worst, designers and artists borrow directly from the computer vision visual style in a simple attempt to appropriate its authority. A recent brazen example is Ridley Scott’s film Prometheus, which appropriated the appearance of the point clouds captured by depth sensing systems such as the Kinect for the design of a super-advanced 3D display created by the film’s godlike “engineers”, even adding distortion and noise to the points in an attempt to represent the great age of the recordings.
For a decade or two, artists and designers have used physical sensors to track users in order to create interactive experiences. In his classic blog post Physical Computing’s Greatest Hits (and Misses), NYU ITP professor Tom Igoe catalogued some of the most common approaches from Sensor Gloves to Floor Pads to “Remote Hugs”. These projects use an array of sensors to locate people and track their movements (accelerometers, gyroscopes, tilt sensors, rangefinders), identify objects (RFID tags and readers, NFC chips), and monitor the users’ bodies (heart rate monitors, galvanic skin response circuits, stretch sensors to detect breathing).
Despite their long heritage and plethora of cool applications, sensor-based projects are plagued with problems. Physical sensors are expensive, fragile, and complex to work with. To interact with a physical computing project, a user frequently has to attach a sensor to their body. Even though sensor components have gradually fallen in price, they can’t compete with the results of Moore’s Law or the More Pixels Law in improvement and affordability. Unlike cameras, accelerometers don’t get orders of magnitude better after you buy them through software improvements alone. Finally, in order to build sensor-based projects, artists and designers must master some degree of embedded electronics as well as programming, making for a steeper learning curve and more barriers to entry.
There is a striking resemblance between the computer vision applications I described earlier and the sensors used in the physical computing projects in Igoe’s blog post. Kinect skeleton tracking and approaches like those described in the Oxford coarse gaze estimation work can track people’s bodies in place of accelerometers and gyroscopes; object-tracking techniques like those used in Kinsight can stand in for sensor systems based on RFID and NFC; and Eulerian Video Magnification can perform the heart rate monitoring and breath detection used in many body-sensing projects.
For these reasons, more and more interactive media art projects are using computer vision instead of physical sensors to track their users. Chris Milk’s The Treachery of Sanctuary is a canonical recent example. Engineered by a team led by artist and programmer James George, The Treachery of Sanctuary uses a Kinect to track the movements of its viewers, transforming their shadows into a flock of birds:
Like many other camera-based installation pieces, The Treachery of Sanctuary combines camera input with computer graphics to create a visual vocabulary that closely resembles that of modern cinematic visual effects. Rather than the Reality Effect of the photographic trace, these pieces integrate fantastical elements into an optically convincing image.
Another example of this approach is Face Substitution by Kyle McDonald and Arturo Castro. McDonald and Castro used a sophisticated face-tracking technique to replace the user’s face with those of a series of famous figures, from Steve Jobs to Salvador Dalí, Fidel Castro, and Marilyn Monroe:
Other projects have pursued aspects of the data visualization aesthetic I described above in an interactive context. A notable recent example was the release of the Google Interactive-Spaces toolkit, a suite of open source tools designed to embed interactive data displays into physical spaces.
Interactive-Spaces participates in a tradition that includes Jeff Han’s legendary TED touch-table demo. Projects in this tradition use interaction to construct the sensation of immediate and intuitive access to large and otherwise abstruse sets of data. They mix the immersive “special effects” quality of projects like The Treachery of Sanctuary and Face Substitution with the Reality Effect produced by dense data “notations” I described earlier to create a powerful result that combines fantasy with authority.
It is easy to underestimate phenomena like the More Pixels Law that change at a geometric rate. We have a powerful tendency to imagine the future through linear extrapolation of current trends. We tend to think something like: “Sure, digital images are improving quickly; they’ll probably improve as much in the next ten years as they did in the last ten.” In reality, the geometric rate of increase means that each decade’s absolute improvement is nine times as large as everything that came before it. In absolute terms, the quality of our sensors, the capabilities of our algorithms, the power of our computers: all of these things will increase more in the next decade than they have in their entire history. And that’s not counting any unexpected paradigm-shifting breakthroughs that speed things along.
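The arithmetic is worth making explicit. A small sketch, assuming the tenfold-per-decade rate discussed above:

```python
# With a tenfold increase per decade, each decade's absolute gain is
# nine times the total accumulated over all previous decades combined.
def decade_gains(start=1, decades=5):
    totals = [start * 10 ** d for d in range(decades + 1)]
    return [(new - old, old) for old, new in zip(totals, totals[1:])]

for gain, history in decade_gains():
    assert gain == 9 * history   # the next step dwarfs all prior progress

print(decade_gains()[-1])   # prints (90000, 10000)
```

Linear intuition expects each decade to add about as much as the last; geometric growth guarantees it adds nine times everything so far.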
It is time for us to start thinking about what we’re going to do with that capacity. How will we contextualize the output of these new mobile camera-computer systems? How will we train our artists and designers to work with them? How will they operate in an image culture where techniques described in computer vision journal articles are as important as those practiced by Renaissance masters and photographic virtuosos?
If the More Pixels Law holds, DARPA’s gigapixel camera will probably arrive in our homes and pockets around the time of the 50th anniversary of Sasson’s invention. By then our culture will have shifted based on our answers to these questions and the products and creative works that embody them. There are times, once or twice a century, when we reimagine how images work and build new practices around them. This, right now, is one of them. Say ‘cheese’.
See Steve Sasson's account of the creation of the camera for more details, including images of that first digital photograph on a television screen.↩
Of course, improvements in camera technology are not just about capturing more pixels in each image. While convenient to measure and compare, raw resolution is a seriously imperfect metric of image quality. Sensor size, filter pattern, glass quality, and many other factors all play a role. However, none of these factors seems to be scaling up at the rate of resolution, and higher resolution sensors will be able to take advantage of these parallel improvements at least as well as low resolution ones.↩
An additional limiting factor beyond processing power, USB bandwidth, plays a major role here, making it difficult to get the data from 1080p cameras to computers fast enough to be processed in real time. New I/O technology such as Thunderbolt will help with this problem, shifting the burden back onto processors to keep up. This I/O issue is another advantage for smartphones, whose cameras are internal, so their data doesn't have to travel over potentially slow peripheral connections such as USB.↩
See Star Wars: Episode I "The Phantom Menace" rendered on SGI for details. Many SGI workstations were used to render Episode I, and they ran continuously for a long period of time; still, the specs themselves illustrate the rate at which mobile computing is following its larger-scale counterpart and point towards some of what it should be capable of.↩
In the last year and a half, many notable computer vision applications have used the Kinect depth sensor as their camera of choice. The Kinect is not a traditional digital camera; it also includes an infrared projector that sprays a grid of dots onto the scene to facilitate the computer vision algorithms that are able to extract depth information from the camera's input. I believe that this and other "augmented" camera approaches represent a temporary phase where supplemental hardware acts as a stopgap for the primitive state of current computer vision algorithms. As the software improves, I expect these specialized hardware setups to disappear; the advantages of having an application that works on the billions of existing traditional camera sensors will simply be too overwhelming. And we're already seeing instances of this with, for example, Dense Tracking and Mapping, an approach that extracts Kinect-like depth information from a simple color camera.↩