The Kinect is a very cool device that has utility far beyond gaming. It and its RGBD kin pair a regular color camera with an infrared projector / sensor combination that provides dense depth data. What does this mean? It means that for every frame of color information, nearly every pixel in that frame can be associated with a depth value. This directly gives you the shape of things, and by comparing frames, their motions in three dimensions.
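To make "a depth value per pixel" concrete, here's a minimal sketch of back-projecting a depth frame into 3D coordinates, assuming a simple pinhole camera model with ballpark, uncalibrated intrinsics (the constants below are illustrative, not from a real calibration):

```python
import numpy as np

# Rough Kinect-style intrinsics for a 640x480 frame (illustrative values).
FX, FY = 525.0, 525.0   # focal lengths, in pixels
CX, CY = 319.5, 239.5   # optical center

def depth_to_points(depth):
    """Back-project a depth frame (meters, shape 480x640) into 3D points.

    Each pixel (u, v) with depth z maps to camera-space coordinates via
    the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    v, u = np.indices(depth.shape)   # per-pixel row (v) and column (u)
    z = depth
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    return np.dstack([x, y, z])      # 480x640x3 array of 3D coordinates
```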
Using our eyes, we can only measure the depth of one thing at a time. Of course, the brain has powerful machinery for maintaining detailed information about the structure of a scene, and infers shape from shadow and shading. However, when color is extremely uniform – for example, an object blending perfectly with the background – it's difficult to get both eyes pointed at the same point, and we get confused. The Kinect does not have this flaw! Since it actively projects information onto the scene, it has the super-human ability to track everything at once, discerning visually indistinct objects, even in darkness.
The main application: SCIENCE!
The research applications of the Kinect have not been neglected – it has taken computer vision and robotics research by storm. No longer are we fussing with unreliable stereoscopic vision systems – instead we have moved on to the meat of the vision problem: understanding visual scenes. The research in this area still yields surprisingly crude solutions, but they are rapidly improving.
However, I think that the interdisciplinary applicability of the Kinect hasn't been realized by researchers. Researchers certainly use sensors for their work, and by no means does the Kinect supplant those sensors. However, the ratio of the precision and density of the Kinect's data to its cost is unprecedented. I'd be curious to know how many of the animal researchers of the world know of the actual capabilities and potential uses of this device. Not that they could be faulted for not knowing – if there are no straightforward ways to apply the technology to their research in the near term, then why bother?
My suggestion is that the cost of purchasing and setting up the devices is far outweighed by the potential for medium and long-term gains. There is great utility in beginning data collection NOW, in order to have a large corpus to analyze in the future, and capture things that we can only collect now.
A few past Kinect projects
Why believe me – what do I know of the capabilities of the Kinect and related technologies? Well, I'm no expert, but I have used it in the past for a few school-related projects (I really need to take videos of these! The drumkit and robot projects are represented in the Robotics and Sound capstone videos):
The first Kinect project I worked on was a group “Capstone Project”, creating a “Virtual Drumkit” based on the RGBD demo. It also used WiiMotes (for lower-latency drumming and controls), and allowed you to place drums and play them (sorta).
This project took a slice of the depth information and applied various transformations to convert it to an audible spectrum. This created lots of eerie and sometimes theremin-ish sounds. The main idea here was that you could construct physical representations of sound which could then be “played” (a rough sketch of one such mapping follows this list).
This group capstone project emulated the gestures of both the arms and legs of the operator, while retaining balance – even while standing on one leg! In retrospect, the Nao robot provided a lot of APIs that we ignored – we ended up implementing our own inverse kinematics – but this was a worthwhile experience.
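As promised above, here's a rough sketch of one way a depth slice could be turned into sound – a hypothetical reconstruction, not the project's actual code. It treats a row of depth values as the magnitude spectrum of a short sound and inverse-FFTs it into a waveform:

```python
import numpy as np

SAMPLE_RATE = 44100

def slice_to_audio(depth_row, duration=0.1):
    """Map one row of depth values (meters) to a short audio snippet.

    Nearer objects produce louder partials: depth is inverted to get a
    magnitude spectrum, random phases are attached, and an inverse FFT
    turns it into a (noisy, eerie) waveform.
    """
    mags = 1.0 / np.clip(depth_row, 0.4, 4.0)        # near = loud
    phases = np.exp(2j * np.pi * np.random.rand(len(mags)))
    wave = np.fft.irfft(mags * phases)
    wave /= np.max(np.abs(wave)) + 1e-9              # normalize to [-1, 1]
    n = int(SAMPLE_RATE * duration)
    return np.tile(wave, n // len(wave) + 1)[:n]     # loop to fill duration
```

The real project experimented with several different transformations; this is just the simplest mapping I can think of that captures the "physical shapes become spectra" idea.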
In my experience, using these devices is greatly restricted by the current lack of easy and effective analysis of the data. The Point Cloud Library is making great strides towards a unified platform for working with this kind of data, but getting reliable, informative results requires you to be an expert in machine learning and computer vision.
A call to arms!
So, what do we do, now that we’ve armed ourselves with the observation that there is untapped value from using rich sensors for scientific analysis? Well, that depends on who you are. Here are a few examples:
Researchers. I would like to encourage researchers to investigate the potential applications of these technologies to their field. Moreover, I would like to encourage them to record their experiments or subjects using these cameras. While the experiment itself would not use this data, recording it might allow the same effort to yield more value once more sophisticated analysis is possible.
Businesses involved in physical tasks. For example, the effectiveness of workers could be tracked over time, and correlated with differences in the work environment. Farming conglomerates could analyze the habits of their animals, and accurately measure their crops. This allows for feedback at a much smaller time scale than looking purely at seasonal yield. It's interesting to me that there may be business applications for collecting this sort of information – it might be a way to gather lots of scientific data as a side effect.
Quantified self hobbyists. This is a community of people who are interested in studying themselves quantitatively, in order to refine their behaviors. An issue with this is that you end up needing to form the habit of recording all of the information that you are interested in, and this takes significant time. If the Kinect can start to automate this process in the near term, then it's worthwhile to start recording data in your own homes. Unfortunately, to do this currently, you'll need to waste many gigabytes of space on motionless scenes. I'd like to be able to point at a ready-made solution for this problem, but I couldn't find one.
Regular programmers. This is the category I fall in – and a perfect project would be implementing a utility that turns a number of Kinects into a motion-sensitive, compressing surveillance system (a sketch of the motion-detection core follows this list). I'm interested in making this once I get some free time. A more advanced project would be to create an application, based on existing vision algorithms, that would attempt to automate the selection of parameters and perform post-processing on results. Our existing algorithms could (and do) already yield scientifically interesting results, but the barrier to entry is too high for non-tech-savvy researchers.
Computer vision and machine learning experts. This is where the real legwork needs to happen. Thankfully, it will happen regardless of this post, as there are lots of researchers bent on solving these problems (many more people should work on these domains, in my humble opinion!). As described below, there are lots of gaps in our capabilities, and no unified way of applying our techniques.
Game content creators. If we can get a decent model of a scene, and how elements of it interact, then by definition we can likely simulate it! While probably quite a while away, this might allow us to do something like “generate a typical barn scene, with typically behaving cows”. I think that there's also an opportunity for people who collect such data to sell it to improve the quality of entertainment simulations – a further side effect of collecting this data for business and scientific purposes.
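To ground the surveillance utility mentioned above, here's a minimal sketch of its motion-detection core, assuming depth frames arrive as millimeter-valued uint16 numpy arrays (as libfreenect-style bindings provide); the actual frame capture and on-disk compression are left out:

```python
import numpy as np

CHANGE_MM = 30          # per-pixel depth change that counts as motion
MOTION_FRACTION = 0.01  # keep a frame when >1% of valid pixels moved

def frames_to_keep(frames):
    """Yield only the frames where the depth scene actually changed.

    `frames` is an iterable of uint16 depth images in millimeters;
    zeros mark pixels with no depth reading and are ignored.
    """
    prev = None
    for frame in frames:
        if prev is None:
            yield frame                       # always keep the first frame
        else:
            valid = (frame > 0) & (prev > 0)  # both readings usable
            moved = np.abs(frame.astype(np.int32)
                           - prev.astype(np.int32)) > CHANGE_MM
            if (moved & valid).mean() > MOTION_FRACTION:
                yield frame
        prev = frame
```

Because motionless scenes are dropped entirely, this directly addresses the wasted-gigabytes problem above; a real tool would additionally pack the kept frames into a compressed container.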
But this isn’t Science!?
Those familiar with the scientific method will be quick to observe that this is quite contrary to tradition. How can it be science if there are no control groups and no interventions?
My stance on this is that this particular aspect of the scientific method – demanding controls and interventions on the subjects – is a convenience that makes results feasible and strong even when data collection is small. This is great when the interventions are inherent to the subject of the study, such as treatments in a clinical study, or the extreme conditions necessary for a physics experiment. Controlled variables are a useful and necessary tool for making these experiments practical and meaningful, given the innate costs associated with performing these interventions and collecting data.
However, controls are also a problem – how can we be sure that there aren't dependencies on the chosen controls that either mask or exaggerate the effect of the independent, manipulated variable? It seems like controls are selected to be “normal” conditions, and are a somewhat qualitative element of experimental design. Hopefully the choices are justified by prior experiments, but this implies that the validity of our experiment is conditioned on the experiments that determined our choices. I suppose that's one reason to have a bibliography!
In order to determine the effect of the controls, while still conducting a properly controlled experiment, we would need to either swap roles – controlling the variables we were manipulating previously while varying the former controls – or endure the combinatorial explosion of a factorial experiment. The alternative is to shirk these responsibilities and use an observational study.
Observational studies are somewhat discouraged, as they do not lead to results as convincing as those that more controlled studies are capable of. However, they are acknowledged as quite useful tools in the social sciences. I think that by collecting massive quantities of data, we will eventually be able to automate the generation of informative, statistically significant multivariate correlations about the subject being studied.
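As a toy illustration of what automating this might look like: scan all pairs of recorded variables for significant correlations, correcting for the number of tests so that mass testing doesn't manufacture spurious findings. A sketch, assuming observations have already been reduced to a numeric feature table:

```python
import numpy as np
from scipy import stats

def significant_correlations(data, names, alpha=0.05):
    """Report variable pairs that correlate significantly.

    `data` is an (observations x variables) array. Testing every pair
    means many hypothesis tests, so a Bonferroni correction is applied
    to keep the chance of a spurious "discovery" in check.
    """
    n = data.shape[1]
    n_tests = n * (n - 1) // 2
    hits = []
    for i in range(n):
        for j in range(i + 1, n):
            r, p = stats.pearsonr(data[:, i], data[:, j])
            if p * n_tests < alpha:          # Bonferroni-corrected threshold
                hits.append((names[i], names[j], r))
    return hits
```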
It’s already happening
I searched around to find examples of the Kinect being used for scientific and medical applications. I found a number of interesting results:
- Diagnosing autism
- Analyzing glacier shift
- Potentially aiding surgery
- Applications to Physiotherapy
- Deb Roy: The Birth of a Word – tracking the first three years of a baby’s life. Doesn’t use a Kinect, but right on target.
However, none of these studies takes advantage of the huge number of these devices now in circulation, or uses their low price to justify deploying them in quantity. I also found several articles similar to my own that observe the trend of increasing interest in the Kinect's application to science; however, none I found mentioned more than the above examples. This is truly low-hanging fruit for many research domains!
Current state of analysis
Many people likely associate the Kinect with the ability to track human bodies. While the fundamentals of the sensor technology have been around for quite a while without being capitalized on, the approach used to achieve the reliable and efficient tracking seen on the Xbox is new. This paper outlining the technique was published in June 2011.
However, this approach relies on hundreds upon hundreds of hours of training data, crunched by even more hours of computation. In essence, they traded training time for very efficient and effective models at runtime. While the resulting skeleton motion data can be very useful for science, as some of the above applications demonstrate, it is restricted to human bodies.
I can't possibly survey all of the literature, and unfortunately have seen relatively little of it, though I have gotten an idea of the current status and difficulties of computer vision by talking with some graduate students in the field. Currently, versatile and reliable inference of the constituents of a scene, and how they are interacting, is well out of our reach.
Something I’d particularly like to see is the ability to machine-learn novel articulatory structures. In other words: observe that something is changing shape in some way, and figure out how the mechanical structure must work for that change to be possible. For example, this would allow animal psychologists to analyze the body language of animals, lending insight into their mental state. While talented field animal psychologists can become very attuned to the body language of their subjects, such qualitative readings are hard to use scientifically. If we can show that machine-learned categorizations effectively predict behavior, then we can use them to study the effect of stimuli.
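To hint at how such learning might start, here is a toy sketch (an illustrative simplification, not a published method): given 3D points tracked over time, group together points whose mutual distances barely change. These are candidate rigid parts – the first step toward recovering an articulated structure, since joints would then be sought where two parts stay adjacent.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

RIGID_TOL = 0.01  # meters: allowed wobble in a "fixed" distance

def rigid_parts(tracks):
    """Group tracked 3D points into candidate rigid parts.

    `tracks` has shape (frames, points, 3). Two points are treated as
    belonging to the same rigid part if the distance between them is
    nearly constant across all frames; the connected components of
    that relation are the parts.
    """
    # Pairwise displacement vectors per frame: (frames, points, points, 3)
    diffs = tracks[:, :, None, :] - tracks[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)       # (frames, points, points)
    stable = dists.std(axis=0) < RIGID_TOL       # distance barely varies?
    _, labels = connected_components(csr_matrix(stable), directed=False)
    return labels                                # part index per point
```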
The closest thing I found to this is the Articulated 3D project at Cornell. Their results are very cool, but do not exactly inspire confidence. While this is purportedly a method for inferring a variety of articulated structures, they do not fully demonstrate this. Instead, “To demonstrate the robustness of our algorithm, we collected point cloud data for several classes of boxes, including various sizes of standard packaging boxes and unusually shaped boxes.” Methinks they need to think outside the box! I look forward to their next publication! Time constraints likely caused the lack of diversity in this one.
Once we can begin to quantitatively understand arbitrary objects, we can begin to study their aggregate interactions, and perhaps find things we would have otherwise missed. By automating the entire process, powerful, presently nonexistent machine learning techniques could run extremely large-scale experiments. By making large-scale, multivariate, longitudinal studies feasible, sensors of this variety have the potential to uncover correlations we’ve missed, and perhaps suggest avenues for future exploration.
I wrote this post as part of a project for an “Innovation and Creativity” class. I probably wouldn’t have posted it otherwise – this deviates from my typical Haskell posts – but it is something that I have thought about a good deal, and think would really be helpful to the world.