IU Matters: Artificial Intelligence
WTIU PBS 2017
Description of the video:
Hi everyone. Welcome to my presentation. My name is John. Today I'm going to present our paper on the generation and evaluation of creative images from data: a class-to-class variational autoencoder approach. My co-authors are David Leake and David Crandall. First, some background. We know that deep generative models are widely used to create samples, and GANs in particular can generate creative variations of data; CycleGAN is a well-known example. VAEs, however, are generally less successful than GANs in terms of creativity: a common trick is to introduce creativity by perturbing the latent features, for example by sampling with a larger standard deviation. Here we borrow an idea from our earlier work, which we call the class-to-class methodology, or C2C for short. It assumes that there are patterns between each pair of classes: samples from one class differ from samples of another class in consistently similar ways, and we can use this to adapt a sample from one class into another. We also want a principled way to measure the creativity of generated samples, using a generate-and-test evaluation framework. So the contribution of this paper is two-fold. First, we propose a model that learns the patterns between classes and generates new samples by modifying existing samples. Second, we propose a general approach for evaluating the creativity of generated samples in any classification domain by treating generation as a one- or few-shot task. Our model is called the class-to-class variational autoencoder, or C2C-VAE for short. It relies on a standard VAE, shown on the left side of this diagram. Given two samples S1 and S2, we first use the encoder of the standard VAE to extract features f_S1 and f_S2. We then calculate their feature difference, f_delta. This f_delta is both the input and the expected output of the C2C-VAE: the C2C-VAE uses its own encoder to map the feature difference to an encoding, which is then reconstructed by a decoder. Like the standard VAE, the C2C-VAE is trained to minimize the reconstruction loss while enforcing that the encoding distribution stays close to a Gaussian. Now, why would this work? The underlying reasoning is very similar to CycleGAN. The reconstruction loss of the C2C-VAE ensures that a feature difference can be recreated, which plays the same role as the cycle-consistency loss in CycleGAN, and the reconstruction loss of the standard VAE ensures that a realistic sample can be reconstructed from a feature, which is similar to the adversarial loss in CycleGAN.
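To make the training step just described concrete, here is a minimal PyTorch-style sketch, assuming a pretrained standard VAE encoder is available; the module names, layer sizes, and the plain fully connected encoder and decoder are illustrative placeholders, not our actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, CODE_DIM = 64, 16  # illustrative sizes, not the paper's

class C2CVAE(nn.Module):
    """Sketch of a class-to-class VAE: it encodes the *difference* between
    the VAE features of two samples and learns to reconstruct it."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(FEAT_DIM, 2 * CODE_DIM)   # -> (mu, log_var)
        self.dec = nn.Linear(CODE_DIM, FEAT_DIM)

    def forward(self, f_delta):
        mu, log_var = self.enc(f_delta).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterize
        return self.dec(z), mu, log_var

def c2c_loss(model, vae_encoder, s1, s2):
    # The standard VAE encoder extracts features of both samples;
    # their difference is both the input and the reconstruction target.
    f_delta = vae_encoder(s1) - vae_encoder(s2)
    recon, mu, log_var = model(f_delta)
    recon_loss = F.mse_loss(recon, f_delta)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl  # keep the difference encoding close to a Gaussian

# Illustrative usage with random vectors standing in for images:
vae_encoder = nn.Linear(784, FEAT_DIM)      # placeholder for a trained VAE encoder
model = C2CVAE()
s1, s2 = torch.randn(8, 784), torch.randn(8, 784)
loss = c2c_loss(model, vae_encoder, s1, s2)
loss.backward()
```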
So how do we measure creativity? We test it with our evaluation scheme, which we call the generative one- or few-shot test, or GOFT for short. It works like this. In every test we choose one class as the target class, and the samples in that class are called the target samples. During training, the generator only sees one or a few of the target samples, while the evaluator gets to see all of them. At test time, the generator is asked to generate more target samples, while the evaluator scores the variety and the validity of those samples. In GOFT, novelty is equivalent to variety: because the generator only sees one sample during training, whatever variety it introduces is novel. Likewise, the validity of the samples stands in for their value. Pros and cons of GOFT: it is generic and lightweight, you can apply it to any classification domain, and it does not require extra knowledge such as human ratings. However, it does not apply if an evaluator cannot be built or is unreliable, and it does not apply if the generator is too weak to work in a one-shot setting. We evaluated our C2C-VAE using GOFT on MNIST and Fashion-MNIST. For every experiment, we choose each class in turn as the target class, and we also train a conditional VAE as a baseline for comparison. A few implementation details: the underlying VAE is very standard, and the C2C-VAE and the conditional VAE use almost identical architectures. The C2C-VAE generates a new sample by adapting a source sample; here we use the average sample of a source class, as shown in the diagram. In our first experiment, before applying GOFT, we tried sampling with a standard deviation of one. Here we see that both models can demonstrate variety in the generated samples, and the C2C-VAE's samples are creative because they are produced by adapting samples while still resembling MNIST or Fashion-MNIST. In the second experiment, the conditional VAE no longer generates creative samples, and the same thing happens again here. This is understandable: the conditional VAE is conditioned on one particular training class, so it can only generate samples of that class, while the C2C-VAE gains variety by adapting from different source classes. To reach the same variety, the conditional VAE needs a much larger change in standard deviation, sacrificing accuracy; for example, to achieve the variety shown in the earlier figure, we had to increase the standard deviation to 4.5 for MNIST and Fashion-MNIST. To conclude: the C2C-VAE learns interclass patterns and generates creative samples by adapting other samples, and GOFT is a lightweight, generally applicable creativity evaluation method for classification domains. That concludes our presentation. Thank you.
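As a rough illustration of the evaluation scheme described above, here is a sketch of a GOFT-style loop; train_generator, generate, and evaluator are hypothetical stand-ins, and the validity and variety measures shown are simplified placeholders rather than the exact metrics used in the paper.

```python
import numpy as np

def goft(dataset, classes, evaluator, train_generator, generate, k_shot=1, n_gen=100):
    """One/few-shot generative test: for each target class, the generator sees
    only k_shot target samples, then must generate more samples of that class."""
    results = {}
    for target in classes:
        shown = dataset[target][:k_shot]               # generator only sees these
        gen = train_generator(dataset, target, shown)  # may also use the other classes
        samples = generate(gen, n_gen)
        # Validity: fraction the evaluator (trained on *all* target samples)
        # labels as the target class.
        validity = np.mean([evaluator(x) == target for x in samples])
        # Variety: a crude stand-in, e.g. mean pairwise distance between samples.
        flat = np.stack([np.ravel(x) for x in samples])
        variety = np.mean(np.linalg.norm(flat[:, None] - flat[None, :], axis=-1))
        results[target] = (validity, variety)
    return results
```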
Description of the video:
I was amazed at the sheer madness you talked about: poster sessions jammed in like sardines, waiting lists to get into conferences, thousands of paper submissions with not enough reviewers. I know it can feel really overwhelming. I'm Norman Su, and I do research in human-computer interaction. And I'm David Crandall, and I work in computer vision. We're professors at Indiana University. This paper was inspired by talking to people at CVPR and hearing about both the excitement and frustration of working in computer vision. Norman, do you remember how this turned into a research paper? I remember us discussing, late one night at a local restaurant, how the growth of computer vision has emotionally affected individuals in the community. We often think of technology as not having emotions, but emotions drive us and point to what is possible. So by examining what people in computer vision are feeling, we can see what people feel they can and cannot do. David, has that been investigated before? I don't think so. That's why I was so excited to team up with an ethnographer like you. So we took a qualitative approach, which is well suited for identifying the how and why behind phenomena, in this case how people are emotionally responding to computer vision's growth. Because of the pandemic, we did this asynchronously: we asked computer vision academics and professionals to write a nonfiction story involving recent changes in vision and depicting an emotional event. Yep. In the end, 56 people wrote stories; 66 percent were academics and 38 percent were professionals. And the stories were just so interesting that we decided to write a paper for CVPR, to have a direct impact on the vision community. So what did the study reveal, Norman? Well, we found that people are experiencing a wide range of emotions. Folks are excited to be living through the deep learning revolution that changed everything. Here's one story about the power of deep learning. Around 2014, a student was working on a problem that their advisor thought was difficult. They recalled: thirty minutes later, we had results that outperformed the prior model by ten to twenty percent. I remember bouncing back into my advisor's office with a silly grin. And overall, people felt the prestige and popularity of computer vision had opened new doors for them. Students now decide among multiple prestigious internship offers. Despite these successes, though, Norman, about 80 percent of our stories depicted negative feelings, right? Deep learning has upended computer vision. People feel that they are engineering black boxes instead of investigating deeper questions like perception and cognition. One senior researcher was surprised that a student poster presenter didn't know about their work on the topic. The student smugly said, I don't read any paper before 2015. The senior researcher felt deflated. Other stories touched on the rising role of industry in computer vision. A striking story came from a student whose advisor receives a lot of industry funding. They wrote: everyone in the lab is assigned to work at a company for funding, and these companies make us work hard. This would be fine if what we were doing was research, but it is not. I learned that most students hated their company research. I remember that story too. They felt demoralized because that's not why they went to grad school. Many stories touched upon the implications of computer vision.
One student wrote that he could no longer be, quote, a simple, happy nerd, an ostrich researcher hiding my head in the sand and blaming others for the misuse of technology. Yeah, there seems to be a realization that we can't just hope that someone else will worry about the social implications of vision. Some of the most upsetting stories were about isolation. One student recounted feeling energized by talks at the Women in Machine Learning and careers-in-AI workshops. But her conference roommate asked why she was wasting her time with these workshops and suggested going to the core conference talks instead. She was hurt, and felt that the topics of these workshops shouldn't be sidelined. Here's another example of feeling sidelined. A professor talked about heartbreak when he returned to CVPR after several years of focusing on teaching. He said: I tried to talk to presenters, but it was clear that I did not have any insights to offer in return. They moved on and talked to others. I haven't returned to CVPR since. So David, what are the main takeaways of this study? Well, our goal was to uncover how people are really feeling when you give them a chance to share confidentially. Overall, we found that the rise of deep learning has tracked with the success of vision, but also with feelings of isolation and marginalization. Also, there seemed to be a feeling that a hyper-focus on quantitative metrics and benchmarks might be detrimental. Maybe this could be an opportunity to accept and expand qualitative metrics in computer vision? Yes. For example, instead of relying only on quantitative benchmarks, you might try deploying algorithms in the wild and qualitatively evaluating their effects. And we haven't spoken to enough people in under-represented populations, so future papers should address this limitation. Overall, our findings suggest we need to foster an environment in computer vision that is more inclusive of those who feel marginalized. Thank you all for listening. Our participants shared so many other interesting stories, so we hope you'll check out our full paper.
Description of the video:
Hi, this is Satoshi Tsutsui. I'm from Indiana University, and today I'm going to present a paper titled Whose Hand Is This? Person Identification from Egocentric Hand Gestures. This is joint work with Yanwei Fu from Fudan University and David Crandall from Indiana University. All right. We are interested in wearable cameras, or egocentric vision. These days, due to the popularity of AR and VR devices, egocentric vision is becoming more and more important. In this paper, we ask a very simple question: can we identify a subject from egocentric video of their hand? The reason we are interested in this question is that in egocentric vision, the subjects themselves are not visible, which means we can't use conventional vision-based identification. However, there is an exception: hands often appear in these videos. Let's see some examples. I will play a video of a person performing a hand gesture. Next, I will also play two videos of the same gesture, but only one of them is from the same person; the other is from a different person. All right, which one do you think is from the same person? Well, the answer is the bottom one. We can also play the same game on the depth images; let's see. Well, the answer is again the same one. I'm not sure whether this was easy for you or not, but this paper asks computer vision systems to do the same task. Okay, methodology. Our purpose was to investigate whether this task is possible for computer vision. So we build on a very standard 3D convolutional neural network designed for video classification, and train an end-to-end subject classifier from the video clips. But we don't just train an end-to-end classifier: we want to investigate what kind of information in the video is helpful for identifying the subject. Therefore, we prepared five types of input. The first is just RGB, and the second is just the depth image. We then use the depth information to create a third modality, binary hand, which is simply the binarized depth map and approximates the shape of the hand. The fourth is grayscale hand, which is the grayscale image masked by the hand shape. Finally, the fifth is the color image of the hand. We experimented on a gesture dataset that has 50 subjects. In order to keep the CNNs from relying on the background, we trained on clips recorded indoors and tested on outdoor clips. All right, results. In the table, each row represents a modality that I explained on the last slide. For each modality, I also annotated what kind of information it contains. As you can see, the more information we add, the more the accuracy gradually increases, up to 20 percent. This means that the 3D shape, skin texture, and skin color all make unique contributions to identifying the subjects. So far, we assumed that all gestures are included in the training set, but we also experimented with another split where we divided the data into seen and unseen gestures. Our results indicate that accuracy drops for the unseen case, but it is still possible to identify some subjects. In addition to these experiments, the paper includes several more analyses. For example, we present a case study investigating which gestures make it easy to identify the subject and which gestures make it harder. As another example, in order to interpret the prediction results, we analyze the attention maps of the CNNs. We also have a verification experiment and others. For the details, please read our paper.
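To illustrate the five input modalities described above, here is a sketch of how they might be derived from a single RGB-D frame before being stacked into clips for the 3D CNN; the array shapes, the simple grayscale conversion, and the depth threshold (including its direction) are assumptions for illustration, not the paper's exact preprocessing.

```python
import numpy as np

def make_modalities(rgb, depth, hand_thresh=0.6):
    """Derive the five inputs from one RGB-D frame.
    rgb: (H, W, 3) floats in [0, 1]; depth: (H, W) floats in [0, 1]."""
    gray = rgb.mean(axis=2)                                  # simple grayscale
    # Rough hand mask from the binarized depth map; the threshold direction
    # depends on the depth convention and is an assumption here.
    binary_hand = (depth > hand_thresh).astype(np.float32)
    gray_hand = gray * binary_hand                            # grayscale masked by hand shape
    rgb_hand = rgb * binary_hand[..., None]                   # color masked by hand shape
    return {
        "rgb": rgb,                  # appearance plus background
        "depth": depth,              # 3D shape
        "binary_hand": binary_hand,  # shape only
        "gray_hand": gray_hand,      # shape plus skin texture
        "rgb_hand": rgb_hand,        # shape, texture, and skin color
    }

frame_rgb = np.random.rand(112, 112, 3)
frame_depth = np.random.rand(112, 112)
modalities = make_modalities(frame_rgb, frame_depth)
```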
Description of the video:
Hello everyone. Welcome to our paper, Part Segmentation of Unseen Objects using Keypoint Guidance. I'm from Indiana University. In this paper, our goal is to segment the parts of an unseen object given only weak supervision in the form of the object's keypoints. Manually annotating object parts is extremely labor-intensive, so we are interested in whether we can use easier-to-obtain supervision, such as keypoints, to transfer part annotations from a known object set to an unseen object set that shares the same object parts. For example, quadruped animals share the same body parts but vary widely in shape and texture. To solve this problem, we propose a new end-to-end learning approach that can transfer keypoint-guided part annotations from a known object set to an unseen object set, and we evaluated it against several strong baselines and datasets. Our model consists of two primary modules, a visual module and a structural module. Both are fully convolutional units with skip connections. The visual module takes an image as input and transforms it into multiscale visual features, while the structural module generates multiscale structural features from the keypoint heatmaps. We then use transfer blocks to exchange information between the visual stream and the structural stream. We use a transfer block at each stage of the decoding phase of the visual and structural modules; the transfer blocks help generate rich feature representations by combining the visual and structural representations at multiple levels. This also allows the model to incorporate the keypoint information of an unseen object and generate part-aware feature representations. Finally, to predict the part labels, we first generate part exemplars using the visual-module output of multiple training objects, and use them to non-parametrically classify the visual-module convolutional features of an unseen object. This approach is inspired by prior work on few-shot semantic segmentation. It helps us properly segment object parts that have a limited number of training pixels due to small part size, occlusion, et cetera.
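Here is a minimal sketch of the non-parametric part labeling step just described: part exemplars are mean visual-module features collected from training objects, and each pixel of an unseen object is assigned the label of its most similar exemplar. The NumPy shapes and the cosine-similarity choice are assumptions for illustration.

```python
import numpy as np

def build_part_exemplars(features, part_masks, n_parts):
    """features: (N, C, H, W) visual-module features of training images;
    part_masks: (N, H, W) integer part labels. Returns (n_parts, C) exemplars."""
    exemplars = np.zeros((n_parts, features.shape[1]))
    for p in range(n_parts):
        sel = part_masks == p                         # pixels of part p across images
        feats = features.transpose(0, 2, 3, 1)[sel]   # (num_pixels, C)
        if len(feats):
            exemplars[p] = feats.mean(axis=0)
    return exemplars

def label_parts(feature_map, exemplars):
    """feature_map: (C, H, W) features of an unseen object.
    Assign each pixel the part whose exemplar is most similar (cosine)."""
    C, H, W = feature_map.shape
    f = feature_map.reshape(C, -1).T                               # (H*W, C)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    e = exemplars / (np.linalg.norm(exemplars, axis=1, keepdims=True) + 1e-8)
    sim = f @ e.T                                                  # (H*W, n_parts)
    return sim.argmax(axis=1).reshape(H, W)
```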
We first experimented on the PASCAL-Part dataset, which has images of five quadruped animal classes with both keypoint and part annotations. We also annotated 100 images of ten quadruped animal classes from the AwA dataset, and 100 images of four quadruped animal classes from the MS COCO dataset, with both keypoints and part maps. None of the animals in these two sets is present in the PASCAL-Part training data. Here we see the results on the PASCAL-Part dataset for four quadruped body parts. We applied five-fold cross-validation, using four animal classes for training and the held-out animal for testing. Our model outperforms all the baselines by large margins, especially for animals with articulated poses such as cat and dog, and the non-parametric approach performs significantly better than the parametric classifier for tail segmentation. We also conducted experiments with five parts, where our model again outperforms the baseline models by a high margin. Finally, we trained our model on the whole PASCAL-Part dataset and evaluated it on the AwA and COCO data; our approach performs best on these two datasets as well. Here are some qualitative results on the PASCAL-Part dataset, where we can see the performance difference between the parametric and non-parametric classifiers: it is clear that the non-parametric classifier segments the tail much better. And here are the results on the AwA and COCO data. Finally, in this paper we presented the novel problem of cross-class part segmentation using keypoint guidance and proposed a novel approach to solve it. We hope this will inspire more work on cross-class part segmentation. Thank you.
Description of the video:
Hi everyone. I have worked for many years on case-based reasoning, and today I'm going to present our research. I apologize that I can't show up in person or do this as a live session, so I had to record this presentation in advance; thank you for your understanding. Today I'm going to talk about our research on learning case adaptation, and on how deep learning can assist the different stages of a case-based reasoning system. The overall purpose of our research is to make CBR more on par with neural approaches by using techniques from deep learning, making CBR more accurate and more efficient; further down the line, we want to use CBR to benefit artificial intelligence in general, for example through the transparency of CBR and other aspects such as its adaptation knowledge. The first piece of research I'm going to talk about is learning adaptations from feature differences. I'm actually going to combine slides from three different presentations, so I will skip and streamline as we go. First of all, we all know that CBR has four stages. Given a query case, we first retrieve a similar case from the case base. After the retrieval, we try to reuse the solution of the retrieved case, and we revise and retain the new case if that is successful. In this presentation we focus on the adaptation, or reuse, stage of CBR. When there is a difference between the retrieved case and the query case, we often need to carry out adaptation. However, adaptation knowledge is hard to acquire. To automate this process, the case difference heuristic, CDH for short, has been proposed. The basic idea is that adaptation knowledge can be learned from every pair of cases: the difference in their problem descriptions is attributed to the difference in their solutions. There are many variations of this, but typically these pairs of differences are stored as rules. Given a query case and a retrieved case, we find the difference between the two, retrieve a rule with the same problem difference, and use the corresponding solution difference to adapt the retrieved case. Earlier CDH approaches rely on predefined features and hand-built rules. Our work instead trains a neural network on case pairs to predict the solution difference from the problem difference. It has good adaptation performance, especially when domain knowledge is lacking or the cases have high dimensionality. In our network CDH system, cases are stored in a case base and, given a query, we find the most similar case using k-nearest neighbor, as in standard CBR. For case adaptation, we train a network on pairs of cases: the network takes the problem difference between the cases, optionally together with context such as the problem description of the query case, and produces a solution difference. At query time, we retrieve a case with k-nearest neighbor, calculate the difference between its problem and the query problem, and pass that problem difference into the adaptation model, a fully connected neural network. The network produces a solution difference, which is added to the retrieved solution to produce the final solution.
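As an illustration of this neural case difference heuristic on a regression task, here is a minimal PyTorch sketch on toy data; the two-layer network, the use of plain problem differences as input, and the training schedule are illustrative choices rather than the exact configuration of our systems.

```python
import numpy as np
import torch
import torch.nn as nn

# Toy case base: problems are feature vectors, solutions are scalars.
rng = np.random.default_rng(0)
problems = rng.normal(size=(200, 5)).astype(np.float32)
solutions = (problems @ np.array([2., -1., .5, 0., 3.], dtype=np.float32))[:, None]

# Training pairs: the network learns the solution difference from the problem difference.
i, j = rng.integers(0, 200, size=(2, 2000))
x = torch.tensor(problems[i] - problems[j])
y = torch.tensor(solutions[i] - solutions[j])

adapt_net = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(adapt_net.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(adapt_net(x), y)
    loss.backward()
    opt.step()

def solve(query):
    """Retrieve the nearest case, then adapt its solution by the predicted difference."""
    d = np.linalg.norm(problems - query, axis=1)
    nearest = d.argmin()
    diff = torch.tensor(query - problems[nearest]).float()
    return float(solutions[nearest, 0] + adapt_net(diff).item())

print(solve(rng.normal(size=5).astype(np.float32)))
```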
We compared our network CDH with other systems on two different datasets; the first is an airfoil dataset. We compared the systems in a very specific setting: given a query, we remove a number of cases around that query from the case base. By doing this, the query becomes novel, because the similar cases have been removed, and we test the performance of the different systems under this scheme. The systems we tried include three-nearest-neighbor, one-nearest-neighbor, our CBR system with the network CDH, a global neural network regressor, and CBR with a rule-based CDH, where the CDH is implemented as rules rather than as a network. By removing different numbers of cases, we make the query more and more novel. What we see, consistently, is that removing more cases hurts all of the systems. In this experiment the network CDH does not outperform the global network most of the time, but in the rare situation where many cases are removed, the CDH begins to outperform the network. We think this is acceptable, because the airfoil data is actually a very simple dataset: the underlying pattern is learned well by the network even if we remove many cases. In contrast to the airfoil data, we also tested on a car dataset with many more features, both nominal and numerical, under the same setting where we remove cases to make the query novel. In this experiment the CDH outperforms all the other systems. Because the cars are very different from one another, with so many features, removing a number of cases damages the knowledge of all the systems a lot; however, the CDH can compensate with its adaptation knowledge: even if it does not retrieve a very similar case, it can still make a relatively accurate prediction by modifying the retrieved case. So far the CDH has been a network that predicts a solution difference from a problem difference over predefined features. We wanted to investigate whether deep learning can be used inside CDH as well. So we go to a deep network: we extract features from the problem description first, and then predict the solution difference from the feature difference. We tested this on the task of predicting age from facial images. Here is an example: given a query face, we retrieve a similar case and modify the age of that case. However, if the query is novel, the nearest case may be quite far away, so some adaptation of the retrieved solution is required. We evaluated on the IMDB-Wiki dataset. We use a convolutional network, pre-trained for a face recognition task, to extract features from the face images, and we compare four systems. The first is a baseline that outputs the average age of all the training samples. The second is a deep learning regressor that predicts age directly from an image. The third is a CBR system that uses nearest neighbor for retrieval and the deep-learning CDH for adaptation. The fourth is a CBR system that uses a Siamese network for retrieval and the deep-learning CDH for adaptation. If you don't know much about Siamese networks, that's okay: you can think of one as a network designed to measure the similarity between cases, so it is a better retrieval mechanism than plain nearest neighbor. The following figures show our architectures: the first shows the baseline regressor, the second our network for retrieval, and the third the deep-learning CDH for adaptation.
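Before walking through the details, here is a rough PyTorch-style sketch of how these pieces might fit together at query time for the age task; the feature extractor, the case base tensors, and the decision to feed the concatenated embeddings (rather than their difference) into the difference regressor are hypothetical placeholders, not the trained models from our experiments.

```python
import torch
import torch.nn as nn

EMB = 128  # illustrative embedding size

# Placeholders for the trained components: a face feature extractor shared by
# retrieval and adaptation, and a regressor that predicts an *age difference*.
feature_net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, EMB))  # stand-in for a CNN
diff_regressor = nn.Sequential(nn.Linear(2 * EMB, 64), nn.ReLU(), nn.Linear(64, 1))

case_images = torch.randn(500, 1, 64, 64)   # stored case problems (face crops)
case_ages = torch.rand(500) * 70 + 5        # stored case solutions

def predict_age(query_image):
    with torch.no_grad():
        q = feature_net(query_image.unsqueeze(0))             # (1, EMB)
        cases = feature_net(case_images)                      # (500, EMB)
        # Siamese-style retrieval: nearest case in the learned embedding space.
        nearest = int(torch.cdist(q, cases).argmin())
        # CDH adaptation: predict how much older or younger the query is than the case.
        pair = torch.cat([q, cases[nearest:nearest + 1]], dim=1)
        delta = diff_regressor(pair).squeeze()
    return float(case_ages[nearest] + delta)

print(predict_age(torch.randn(1, 64, 64)))
```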
For the baseline regressor, given a case, we extract its features and pass them into a regressor, shown on the left side, to predict the age directly. In our CBR system, given a query case, we first compare it with the other cases in the case base using a distance measure to find the most similar case. After we find the most similar case, we pass both cases to our adaptation network, the deep-learning CDH. It runs the same feature extraction on the retrieved case and the query, passes those features into another regressor, also a multilayer network, and this regressor predicts the age difference between the two. We then add this age difference to the age of the retrieved case to get our final solution. Note that we try to ensure a fair comparison by using the same feature extractor and the same architecture for the regressors R1 and R2. The difference is that, even though they have almost the same architecture, they are trained for different purposes: the baseline regressor is trained to predict the age directly, while our adaptation regressor is trained to predict the age difference from a pair of cases. We evaluated this under a normal setting, where we do cross-validation, and, just as with the earlier CDH experiments, under a novel-query setting where we make the query cases novel: here we train using cases from ages 25 to 70, and we test on cases from ages 0 to 25 and above 70 respectively. The network is trained pair by pair, with random pairs generated on the fly, and for validation every case in the validation set is paired with its nearest neighbor in the training set; all such pairs form the validation pair set. The reason for this kind of validation is that testing will go through the same process: for every test query, we pair it with its nearest neighbor in the training case base and carry out the adaptation accordingly. This table shows the average error of the systems under the different settings. Under the normal setting, the baseline regressor outperforms the other systems, including our CBR with the Siamese network for retrieval and the CDH for adaptation, which in turn outperforms the CBR with nearest-neighbor retrieval and the deep-learning CDH for adaptation. This agrees with our earlier CDH results, where we were working not with deep features but with predefined features. Interestingly, we also see that by using the Siamese network rather than nearest neighbor with L1 distance, our retrieval is consistently better, and because of that the adaptation starts from closer cases and the final error is also better. So the deep-learning CDH adaptation generally produces favorable results. There is only one exception: under the normal setting, the retrieval error is actually lower than the error after adaptation. We think this is because the retrieval stage is well trained and retrieves cases that are already close to the query, while the adaptation stage is trained on arbitrary pairs and is geared toward correcting relatively large differences between case pairs. This actually relates to the other research I will talk about later. Now for the novel-query setting.
Under the novel-query setting, we see that the CBR system using the Siamese network for retrieval and the CDH for adaptation is the best-performing system. So the results are consistent with our earlier ones. To reiterate: the CDH performs worse than a plain network regressor in easy task domains, but performs better than the regressor when dealing with novel queries, which supports the validity of the approach. Regarding related work: one line of prior work also extracts deep features, handles novel queries using CBR, and retains cases; that work was mostly about retrieval, while ours uses the same features for reuse, so the two could potentially be combined into a CBR system that covers all of these stages. In other related work, a network is used to extract features, the features are identified and modified, and then visualized back as an image, in order to produce counterfactual and semi-factual cases for explanation. Their work adapts the features generated by deep learning, while our work adapts solutions. In terms of future work, we are considering modifying features by reversing the deep-learning CDH. The current CDH modifies a solution difference according to a problem difference; we want to reverse this process, so that if we want to adapt something into a different solution, we know how to change its problem description. We are actually doing this now using the variational autoencoder class-to-class methodology mentioned earlier, where a variational autoencoder extracts features and another model learns the differences between the features of different classes. In conclusion, there is a trade-off between deep learning and CBR: the deep-learning CDH system consistently outperforms its counterpart, the neural network regressor, for novel queries, but loses under the normal setting, where queries follow the distribution of the data. However, we still see benefits to the CDH, because it brings other advantages of CBR, such as handling novelty and providing explanations, without changing the underlying model. Continuing with the CDH and the deep-learning CDH, we then asked whether we can apply this to a classification task and not just regression. Normally, classification is easier than regression for other methods, because you don't have to predict an exact number, you only need to predict the correct class. However, learning classification adaptation is challenging for the CDH approach. As an example, think about these three apartments. The first apartment has one bedroom, costs 700 per month, and is considered cheap. The second has two bedrooms, costs nine hundred, and is also considered cheap. The third has three bedrooms at a higher rent and is considered expensive. If we compare pairs of these three examples, we get two different rules: adding one bedroom going from the first to the second apartment, the price class remains cheap; but adding one bedroom going from the second to the third apartment, the class changes from cheap to expensive. Because nominal class labels and nominal attributes hide these subtle differences, learning adaptations for classification is harder. There has been previous work on learning adaptation for classification; a notable prior approach is EAC, which enhances the CDH using the Value Difference Metric.
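Written out in our own notation, the Value Difference Metric that the next passage describes compares two values a and b of a nominal feature F through their conditional class frequencies:

```latex
\mathrm{vdm}(a, b) \;=\; \sum_{c \,\in\, \mathrm{classes}}
  \left|\, \frac{N_{F=a,\,C=c}}{N_{F=a}} \;-\; \frac{N_{F=b,\,C=c}}{N_{F=b}} \,\right|
```

Here N with subscript F=a counts the cases where feature F takes value a, and N with subscript F=a, C=c counts those cases that are also classified as c.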
It is a metric that compares nominal values based on their conditional class frequencies: the distance between two nominal values a and b of a feature F sums, over the classes, the absolute difference between the fraction of cases with value a that fall into each class and the fraction of cases with value b that fall into that class. So if two values make the same contribution to the classification, the distance will be close to zero; if they behave very differently, the distance will be large. EAC also carries out an ensemble, where the solution of each source case is adapted by a majority vote over an ensemble of adaptation rules. Now we want to extend the network CDH to classification. The difficulty is that differences are hard to compute for nominal attributes and values, so we skip the explicit difference calculation altogether and trust the neural network to learn the difference calculation in the context of adaptation. So the classification CDH learns the mapping from problem one and problem two to solution two. This begs the question: is this really different from just predicting solution two directly from problem two? We will come back to this question later. We also propose a few variations of this basic design. The first variation also feeds the source solution into the model as an additional input. A further variation groups the case pairs by their source solution and trains specialized adaptation networks, one per group: each network is responsible for adapting from one particular source solution to all the other solutions, and the system selects the right specialized network based on the source solution of the retrieved case. Another variation borrows the idea of ensembling: instead of retrieving a single source case, we retrieve multiple source cases, adapt from all of them, and take a majority vote over the adapted solutions. The last variation takes this one step further by retrieving one source case from each class. By doing this we guarantee that we retrieve a diverse group of sources, adapting from different parts of the data rather than from a single cluster of similar cases, so the final solution is the vote over these diverse adaptations. We carried out a CBR evaluation of our classification CDH on several datasets, comparing all of our variants against a rule-based CDH, against a counterpart that is a plain neural network classifier, against some other machine learning algorithms, and against the EAC method.
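As a rough sketch of the last, class-diverse ensemble variant (one source case retrieved per class, each adapted, then a majority vote), here is a toy example; the adaptation network's call signature and the stand-in adapt function are illustrative assumptions, not our trained model.

```python
import numpy as np
from collections import Counter

def classify_with_cdh(query, case_problems, case_labels, adapt_net, classes):
    """For each class, retrieve the nearest case of that class, let the
    adaptation network predict the query's class from (source problem,
    source label, query problem), then take a majority vote."""
    votes = []
    for c in classes:
        idx = np.where(case_labels == c)[0]
        if len(idx) == 0:
            continue
        d = np.linalg.norm(case_problems[idx] - query, axis=1)
        src = idx[d.argmin()]
        votes.append(adapt_net(case_problems[src], case_labels[src], query))
    # Majority vote over the adapted predictions (ties broken arbitrarily).
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-in for a trained adaptation network: it keeps the source
# label when the problems are close, otherwise flips to the other class.
def toy_adapt_net(src_problem, src_label, query):
    return src_label if np.linalg.norm(src_problem - query) < 1.0 else (1 - src_label)

problems = np.random.randn(100, 4)
labels = (problems[:, 0] > 0).astype(int)
print(classify_with_cdh(np.random.randn(4), problems, labels, toy_adapt_net, classes=[0, 1]))
```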
This is the result of our comparison with EAC. We can see that the classification CDH performs about the same as the plain neural network and clearly improves on one-nearest-neighbor, and we can also see that it in fact starts with worse retrieval than EAC but ends up with better adapted solutions than EAC. The same pattern holds across more datasets. We also compared our accuracy with some of the other algorithms detailed in another paper; against those classifiers, the average rank of the classification CDH is 3.4. When we compare with the baseline neural network, we can see that even though the two have similar structure and capacity, they perform significantly differently across datasets. Going back to the earlier question, this gives us some confidence that what we are doing is genuinely different from directly predicting the solution from the query problem. In this table, the left side shows the datasets where the CDH performs better, and the right side the datasets where the baseline performs better. A few observations on the variants. The variant that also feeds in the source solution gives very similar accuracy; we think this is because the source solution is not very informative, since it is already implied by the source problem. Second, the specialized adaptation networks converge faster and are easier to train, but they are less accurate, since the data for each source-solution group is smaller. Third, plain ensembling doesn't really help: even though ensembling has been very useful in EAC, it is not useful for the CDH, we think because the neural network is already doing its own generalization. However, forcing the ensemble to draw from multiple classes does help: this is the last variant, which sometimes outperforms all the other variants, although it does not always improve on the retrieval result either. So, in conclusion, we learned adaptation knowledge for classification, and its performance is competitive with its neural network counterpart and with other classic algorithms, while also coming with the advantages of a CBR system. As a future direction: the adaptation maps a source problem and a target problem to a target solution, and in principle this mapping could just ignore the source problem and reason from scratch. We showed that there is a significant difference between our network and reasoning from scratch, but we are still not sure how that happens. To find out more, we want to explicitly input differences rather than the raw attributes and see whether the CDH can still predict a solution difference from them. In both the regression CDH and the classification CDH, we have also seen that adaptation does not necessarily improve on the result of retrieval, especially with arbitrary queries. What happens is that retrieval and adaptation are not synchronized: the adaptation is trained on arbitrary pairs of cases, not trained specifically to adapt the retrieved cases. That observation started our next piece of research: harmonizing case retrieval and adaptation with alternating optimization. I will first cover some background about alternating optimization, AO, and then talk about how we formulate the problem of balancing retrieval and adaptation. Before we do that, consider CBR from another perspective: it is not just four stages, it is multiple knowledge containers interacting with each other. Each step depends on the previous result, and the knowledge containers overlap and compensate for one another. So when we train the retrieval, we only care about the retrieval result; when we train the adaptation, we only care about the adapted result, with the previous components held fixed, hoping that improving one container will improve the overall performance.
But unfortunately, improving one component can also impair the whole. One example is this fancy car: even though the back wheel is very good, it is very bulky compared to the front wheel, and this imbalance deteriorates the overall performance. So the goal is to achieve a balance within the CBR system. This does not mean that retrieval and reuse will be equally weighted, and it does not mean that the best performance for each is pursued independently. As John Donne said, no man is an island; we extend that to say that no stage of the CBR cycle is an island. To balance retrieval and reuse, we want to acknowledge that there are multiple criteria for optimizing the components, that different choices can produce radically different results, and that the goal is to provide the best performance in terms of the components' interaction and the user's goals. Okay, now let's talk about alternating optimization. AO focuses on minimizing a function f by optimizing a parameter x, which is often multi-dimensional. It does this by partitioning x into non-overlapping subsets x1 to xt and minimizing f iteratively: in every cycle, AO minimizes f by optimizing one subset xi while keeping all the other subsets fixed. Two examples of AO algorithms are the EM algorithm and the k-means clustering algorithm. To make things easier, let me define some terminology. R is the retrieval function, parameterized by x_R. A is the adaptation function, parameterized by x_A. CB is the case base, q is a query, and Q is the query set. F stands for the loss function of the whole system; lowercase r is the case retrieved from the case base, and lowercase a is the adapted solution. The overall workflow is: we first go through a retrieval stage to get a retrieved case r, then adapt the retrieved case to the query q to produce a solution a, and finally the loss function F compares the adapted solution with the query case q to give us a loss. We want to minimize this loss by optimizing x_A and x_R, the parameters of retrieval and adaptation, in a back-and-forth manner: when we train R, we minimize this function while holding x_A fixed, and when we train A, we minimize the loss while holding x_R fixed. Alternating between these two steps will eventually converge, or we will run out of iterations. But there is more to the loss function. The goal can transcend system accuracy: there are criteria like the usefulness of the retrieved case as an explanation, the efficiency of retrieval, or the efficiency of adaptation. For simplicity, we balance two losses here. One is the adaptation loss, in this case how well the retrieved case can be adapted into the correct solution; the other is the retrieval loss, which measures whether the retrieved case is a competent explanation. So now we have two losses, F and G, and the objective for the retrieval step becomes a weighted combination: when optimizing x_R, we minimize both the retrieval loss and the adaptation loss, with the two losses weighted by a value alpha. The adaptation step is unchanged, because there x_R is fixed, so the retrieved case is fixed and the retrieval loss cannot be optimized; we are only training x_A. We then alternate between these two steps. If alpha is equal to zero, the weighted objective reduces to the original one, meaning we ignore the retrieval loss when training retrieval; if alpha is equal to one, the retrieval is trained to minimize the retrieval loss only, ignoring the adaptation loss.
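Here is a minimal, self-contained sketch of the alternating optimization loop just described, on toy one-dimensional data; the differentiable soft-attention retrieval, the tiny networks, and the concrete loss expressions are our own simplifications to keep the example runnable, whereas the actual system uses a Siamese retriever and a CDH adapter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy case base: one-dimensional problems with a noisy quadratic solution.
cb_problems = torch.linspace(-2, 2, 50).unsqueeze(1)
cb_solutions = cb_problems ** 2 + 0.1 * torch.randn_like(cb_problems)
queries = torch.linspace(-1.8, 1.8, 64).unsqueeze(1)
true_solutions = queries ** 2

retriever = nn.Linear(1, 8)                                              # x_R: embeds problems for matching
adapter = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))   # x_A: CDH-style adapter

def soft_retrieve(q):
    """Differentiable stand-in for retrieval: attention over the case base."""
    scores = -(retriever(q)[:, None] - retriever(cb_problems)[None]).pow(2).sum(-1)
    w = torch.softmax(scores, dim=1)
    return w @ cb_problems, w @ cb_solutions          # "retrieved" problem and solution

def losses(q, y, e=0.05):
    r_prob, r_sol = soft_retrieve(q)
    a = adapter(torch.cat([r_sol, q - r_prob], dim=1))         # adapt from the retrieved case
    f = ((a - y) ** 2).mean()                                   # adaptation loss F
    g = torch.clamp(((r_prob - q) ** 2).mean() - e, min=0.0)    # thresholded retrieval loss G
    return f, g

opt_r = torch.optim.Adam(retriever.parameters(), lr=1e-2)
opt_a = torch.optim.Adam(adapter.parameters(), lr=1e-2)
alpha = 0.5
for cycle in range(200):
    # Step: optimize x_R with the alpha-weighted objective while x_A is held fixed.
    f, g = losses(queries, true_solutions)
    opt_r.zero_grad(); (alpha * g + (1 - alpha) * f).backward(); opt_r.step()
    # Step: optimize x_A with the adaptation loss while x_R is held fixed.
    f, g = losses(queries, true_solutions)
    opt_a.zero_grad(); f.backward(); opt_a.step()
```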
To evaluate AO, we built a test system. For regression, we use the squared error between the adapted solution and the true solution as the adaptation loss, and the squared error between the retrieved case and the query case as the retrieval loss. Because we know it is unlikely to retrieve a case that perfectly matches the query, there will always be some error there, so we introduce a threshold e for the retrieval loss: if the retrieval loss is less than e, we consider the retrieval successful and count the error as zero. We built the test system to be similar to what we have done before in our CDH work: the retrieval uses a Siamese network, while the adaptation uses the CDH. The reason for choosing this test system is not just that it is convenient for us; it is also that the two components depend on each other's parameters, allowing comparisons between different training paradigms, they can both be trained using the two objectives just described, and they are both powerful enough to solve the majority of the queries, so the evaluation will not be impaired by a component that is simply too weak. This is the training algorithm; I will skip the details for now. The main thing here is that during training we prepare the pairs for each batch, then carry out the training of adaptation and retrieval on that batch, and this repeats for the next batch. So AO training is actually a generalization of the traditional training. If the batch size is set to the size of the case base, the training happens in one big batch, so it only happens once; in that regard it is similar to traditional training, where the two stages, retrieval and reuse, are trained in sequence. If alpha is equal to one, it is the same as independent training, because retrieval only cares about the retrieval loss; and if alpha is equal to zero, it is the same as adaptation-guided retrieval training, AGR, where retrieval is trained with the adaptation loss in mind only. We try to find an ideal setup in between by finding the right alpha, and finding that balance is difficult: it does not just involve finding the alpha, it also involves designing the loss functions F and G and other parameters such as the batch size and the number of training steps. We evaluated this idea on five datasets, one of which is an artificial dataset. The artificial dataset is designed so that it has four attributes and one final value, and it cannot be solved with simple decision rules. It turns out that, on average, AO performs better than both independent training and AGR training on most of the datasets. We also show the detailed errors for each dataset, and I will explain what is interesting there. Here we show a figure with three columns: the first column shows the error of retrieval, the second the error after adaptation, and the third how adaptation changes the error of retrieval. In every figure, the x-axis shows error bins, that is, how many times the error falls into each bin, and the y-axis shows the count for that bin. There are three bars in each bin. The black bar represents independent training; the lighter gray bar, the second one, represents AGR training, adaptation-guided retrieval.
The dark gray bar represents our AO training. And you can see something interesting in the retrieval results. Using AGR training, where retrieval only cares about adaptation, the error after adaptation is very low, but the retrieval error can be very high, because the retrieval error is not considered during training. If we use independent training, we get very good retrieval results, but we also get some very bad errors after adaptation, because adaptation and retrieval are not synchronized. With AO training, however, we see that it generally retrieves good cases and adapts them into good solutions as well. You can also see this in the third column: for AO training, the change in error induced by adaptation is consistently small and negative, while for independent training adaptation can make the error larger, and AGR training can cause a big change in error. We see the same pattern on the other datasets, which I will skip for now. So, as a result, we see that even with independent training, where the retrieval is well trained, the adaptation does not necessarily improve on the result of retrieval. With AGR training, the retrieval stage provides an initial solution, and the adaptation stage carries out a heavy correction. With AO training, the retrieval generally provides a good case, and the adaptation further modifies its solution to be closer to the correct solution. We think AO can be applied to the different knowledge containers and their relationships, and it lets us customize the balance toward other goals, such as explainability, efficiency, and accuracy, by tuning how the components are balanced for the task. As a future direction, we want to also include the retain and revise stages in AO, so it can apply to all four CBR stages and bring out the full benefit of the different knowledge containers. There are good candidates for this: we have seen some instances where the four stages of CBR depend on each other, and we think that moving on from here, we can bring the other stages in as well. And that concludes my presentation. Thank you very much.
The knowledge containers overlap, compensate for another at the SUID per container. So when we train that retrieval way, only care about the result. I clean adaptation we only care about are adapted. Result is that at based on fixed the previous components, nobody is transmitting one container can improve the overall performance. But unfortunately, we can also pair it. One example is this fancy car. Even though we break the back wheel. Very good and very bulky. The front wheel. This wheel deaf any deteriorates the overall comfort. Performance. So the goal is that we want to achieve in their balance with the CPR system. And this does not mean the retrieval reuse will be equally weighted. It doesn't mean the best performance for each is pursued independently. As John said, No man is an island. We extend that to know CPR cycle is an islet. So to balancing retrieval and we use, we want to acknowledge and respect that there are multiple criteria for optimizing components strengths, different choices can produce radically different the results. And the goal here is to provide the best performance in terms of components direction, and a user goes. Okay. Now we talk about alternating optimization. Ao, Ao, focus on the task of optimizing a function f. Bye. All focuses on minimizing a function f by optimize the parameter x, which is often of multiple dimension. And it does this by partitioning x into non-overlapping subsets for x one to x t as minimize effects iteratively. In every cycle, ALL minimize f by optimizing XI while keeping all the other subsets fixed. At a example of A0 is the EM algorithm where the E step and the two examples of ALS algorithm is an algorithm and the k-means clustering algorithm. To make things easier, we will define some terminology. And R is the retrieval function parameter plus x r. A is the adaptation function. A is the adaptation function parametrized by x. A. Cb is the k-space, q is the query, and q is the query set. F stands for the loss function for the whole system. At the r, small r is the case for the k-space, while small a is the adapted a solution are over or go is, we will first go through a retrieval stage to get our recursive case. And then at that are retrieved case to the query q and produce a solution. And lastly, the loss function f. We'll compare our adaptive resolution ways that query case Q to give us a loss. And we want to minimize this loss by optimizing. A and X are the parameters of retrieval, adaptation and trace. Back and forth matter. When we train, we are trying to minimize this function while holding x a. And when we train a, we are trying to minimize the loss function. Why Haldane SR fixed by alternating between steps 23 is eventually going to converge. Or we will run out of iterations. But wait, there's more to the loss function. The goal or the loss function can transcend the system accuracy. Cruces like usefulness of cases provided as explaination, efficiency of retrieval. Efficiency or pretty or adaptation. For simplicity, balance the two losses here. One is the data should loss. In this case, we use the adaptability of the face. Why is retrieve the loss? Whichever loss, which is whether the case is a competency explanation. So now we have two loss, F and G. So now what does f? Now we have two unknowns, f at g, two losses. So now we have two gases, f at g. And then. The overall loss function for step two becomes what we are optimizing. The next are, we are minimizing both retrieval and that they should loss. 
Where the two losses outweighed by a alpha value. Y0. Step 3 is unchanged because XR is, it's fixed and we're going to retrieve the case is fixed. And retrieval loss cannot be optimized. But we are teaching XA. And now we're going to alternate between step 3 at four. And if alpha is equal to 0, 4 becomes two, meaning we will ignore the retrieval with us. If alpha is one, then retrieve where people are is trained to minimize their fever loss only. Blurry vision at the attic Norway, the adaptation loss. To evaluate A0 would do to test their system. For regression. We use the squared error between a Data solution that the final solution as a tactician loss and the way you use the square error between the fifth case and the query case as direct cable loss. Because we know that it's unlikely you retrieve a case that perfectly resolves the query. So there will always be some error there. We introduced a caveat though E here, as the threshold for the retrieval loss. So if your retrieval loss is less than e, then we consider the retrieval is successful at the error is 0. We believe, assisted, similar to what we have done before in our CDS work, the retrieval is using a sadness that work. Why are the adaptation is using CDH? And the reason for choosing this test, the system, is not just because it's very convenient for us. And it's also because the, it depends that they got the dependent of each other or deeper down each other, allowing comparisons between different paradigms. And they can be traded using formula three and formula for the elbows powerful enough to solve majority of the queries. So the evaluation wouldn't be paired somebody because they are not powerful enough. This is the training algorithm. I will skip this for now. The amazing thing here is for every case. Amazing here is that during training, we will prepare the pairs for training for each batch. And then after a batch of Paris are prepared, we will carry out a training for adaptation and the retrieval. And this repeats for the next batch. So it'll trigger is actually a general dilation of the traditional training. If the batch size is set, the size of the k-space, that the training only happens for one big batch. So it only happens once. That regard. Add in that regard. It is similar to traditional training where the two stage and retrieval and reuse our trade in sequence. And if alpha is equal to 1 is the same as a dependent training, because retrieval only cares about retrieve or loss now. And if alpha is equal to 0, It's the same thing as HER at that teacher guided retrieve or training, the voice of trade ways that pathogen lossy mind only. And we try to find an ideal setup between by finding that right alpha and finding that balance is difficult. This does it just involve finding the Alpha also evolves, designed or loss functions f and g, and call your parameters b0. How you choose your batch size on your training steps. And we evaluate this idea, five datasets, one of which is an artificial dataset. This dataset is designed in a way, so it has the four attributes at a wife. At one final value. It can't be solved ways to support that decision rules. And it turns out, on average, at A0, performance better than both. It depends that at HR training for mostly datasets. And we also showing those detailed errors for each dataset. And we will explain what is interesting here. So here we show a figure of three columns. The first column shows the error of retrieval, coordination through ever after adaptation. 
Here we show a figure with three columns. The first column shows the error of retrieval, the second shows the error after adaptation, and the third shows how adaptation changes the error of the retrieved solution. In every figure, the x-axis shows bins of error magnitude, and the y-axis shows the count in each bin. There are three bars in each bin: the black bar represents independent training, the lighter gray bar represents adaptation-guided retrieval training, and the dark gray bar represents our AO training.

You can see something interesting in the retrieval results. With adaptation-guided retrieval training, where retrieval only cares about adaptation, the error after adaptation is very low, but the retrieval error can be very high, because the retrieval error is not considered during training. If we use independent training, retrieval gets very good results, but you also get some very bad errors after adaptation, because adaptation and retrieval are not synchronized. However, for AO training, we see that it generally retrieves good cases and also adapts them to solutions with low error. You can also see this in the figure: for AO training, the change induced by adaptation is consistently small and negative, while for independent training adaptation can cause an error change bigger than zero, and adaptation-guided retrieval training can cause a big change in error. We see the same pattern on the other datasets, which we will skip for now.

So, as a result, we see that with independent training, retrieval is well trained, but adaptation does not necessarily improve the results of retrieval. With adaptation-guided training, the retrieval stage provides an initial solution while the adaptation stage carries out heavy correction. With AO training, retrieval generally provides a good case, and adaptation further modifies the solution to be closer to the correct solution. We think that AO can be applied to other knowledge containers and their relationships. This enables customization to other goals such as explainability, efficiency, and accuracy, by balancing the components for the task at hand. As a future direction, we also want to include retain and revise in the AO cycle, so that we can apply it to all four CBR stages and bring out the full benefit of the different containers. In fact, there are good candidates for this: we have seen instances where the four stages of CBR can depend on each other, and we think that moving forward from here we can build on them. That concludes my presentation. Thank you very much.
Description of the video:
This is a lot of work. You could just take a picture of the bowl. Um, you could call me a technophile. I love using mobile apps to optimize my life. One reminds me to drink water: drink water now. And another tells me to meditate when I'm about to get stressed: meditate now. Meditation complete. Meditate now. Recently my friend and I discovered a new app. It's so amazing. It helps you log the food you eat; all you have to do is just take a photo, and the app literally tells you everything. Here we go. And it says it's a banana: it has 105 calories, 1.3 grams of protein, and it's organic, from Brazil. Isn't it amazing? I have to separate every ingredient so it recognizes them. I'm not sure what I'm eating, but I trust the app; it gives me everything I need. So why not? It's 100 calories and it says... hold up.

This is not how computer vision based dietary assessment should be. Dietary management requires more than an estimation of nutrients, portions, and calories. We need to understand why people want to eat healthier, what they need to achieve those goals, and how we can apply AI technologies to support them. Separating ingredients before taking a photo may seem silly, but that is because we scientists created recognition models that do not consider more diverse training data. As we move forward, we need to examine a more holistic approach to support individual health goals.
Description of the video:
Hi, I'm Satoshi from Indiana University. I'm going to present a computational model of word learning from the infant's point of view. This is work by Satoshi Tsutsui, Arjun Chandrasekaran, Md Reza, and David Crandall, who are from computer science, and Chen Yu from cognitive science.

The motivation of this study is to model infants' word learning mechanisms; infants begin learning words as early as six months old. If we take the learner's point of view, this task of word learning is very challenging due to something called referential uncertainty. Let's look at an example. Suppose that the child is looking at this view, and then the parent says, "Here's a turtle." In the child's view, he can see cars, and of course turtles, and some other objects. Then how can the child tell which object is the turtle? That is referential uncertainty. This situation often happens when children are learning the names of objects. In general, learning object names in everyday contexts requires young children not only to find and recognize visual objects, but also to map those objects to the names that they hear, and of course they have to resolve the referential uncertainty. Previous work addressed the problem using simulated images like we show here. These laboratory experiments can provide some insights; however, we believe that we should use a more realistic environment where infants naturally interact with their parents, and that is the uniqueness of our study.

Before we get into details, let me give the structure of this paper. Our goal is to build a model that, similar to the word learner, has to recognize visual objects and then associate those objects with the words. The model takes raw video and speech collected from the infant's point of view. We then systematically manipulate the input and use the model to investigate how different input data influence the word learning outcomes. Here's the environment we prepared: we have an infant and their parent playing freely with 24 toys, with the infants around 20 months old. Both of them have a camera mounted on the head with eye tracking, which means we can capture the views from their own perspective and also identify where they are looking. Let's see an example. This is a moment when the parent is naming a toy. This is a view of the same scene from the child's view, with the eye gaze visualized. The same scene looks like this from the parent's view. Moreover, we have objects detected like this, and because we have eye gaze, we can also identify the attended object.

To use this data for word learning, we use a model that can be trained directly from these captured images. Our model is a CNN, which has achieved very impressive performance in computer vision. Given an image, the model outputs a probability distribution over predefined objects. Please note that we do not argue that a CNN approximates the mechanisms that kids use in their brains; we just use it as an off-the-shelf learner, without any customization. For word learning, the model requires image and label pairs, and here's how we process the data to get them. We identify a naming utterance, which is a segment of parent speech containing the name of an object, for example, "That's a helmet." Recent studies showed that the moments during and after hearing a word are crucial for learners to associate objects and words. Because the average naming utterance is 1.5 to 2 seconds long, we use frames corresponding to 3 seconds, starting from the onset of the naming event. In this way, we can include the moments both during and after the naming event.
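As a rough sketch of that windowing step (the naming-event onsets, frame rate, and label names below are made-up placeholders), gathering the frames from the onset of each naming utterance through the following three seconds could look like this:

```python
def frames_for_naming_events(naming_events, fps=30, window_sec=3.0, n_frames_total=None):
    """Return (frame_index, label) training pairs covering the 3-second window
    that starts at each naming-utterance onset, so both the 'during' and
    'after' moments of the naming event are included."""
    pairs = []
    for onset_sec, label in naming_events:            # e.g. (12.4, "helmet")
        start = int(onset_sec * fps)
        end = start + int(window_sec * fps)
        if n_frames_total is not None:
            end = min(end, n_frames_total)
        pairs.extend((idx, label) for idx in range(start, end))
    return pairs

# Hypothetical usage: two naming events in one play session recorded at 30 fps.
events = [(12.4, "helmet"), (47.0, "turtle")]
training_pairs = frames_for_naming_events(events, fps=30, n_frames_total=90000)
```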
Let's look at some examples. Here's a frame from a moment when the parent named the ladybug. The referent object, the ladybug, is held by the parent, but the child can actually see many other objects. This is referential uncertainty. In the next example, the parent said, "It is a snowman." The child is holding multiple objects in his hands, so he has to decide which one is the snowman. This is another case of referential uncertainty. The last example is a moment when the parent said, "How about the helmet?" However, the helmet, which was the referent object, is only visible in the parent's view and is occluded in the child's view. This is an example that makes the word learning task very challenging and also ambiguous.

Frames corresponding to these naming events become our training data for the CNNs. However, we don't just use these images as they are. Human vision is foveated, which means the area around the eye gaze has higher resolution, whereas areas far from the gaze have lower resolution. Therefore, we simulate that effect using an acuity filter, as we show on the right side. After the training is done, we test the performance of the model. For testing, we prepare clean-background images of the objects, like we show here. These images are captured from multiple views to test the capability of the model, and we have a total of around 3,000 images. The metric we use is accuracy, which is the number of correctly classified images divided by the total number of test images. For example, suppose we have only four test images, like we show here; if the model correctly classifies three images out of four, the accuracy is 75%.

All right, we have finished explaining the methodology. Now let me explain the experiments we did. Specifically, we performed three studies. Study one investigates whether the word learning task can be addressed from raw images captured from the child's own point of view. Study two investigates the effect of different attention strategies during the naming event. Study three investigates the effect of the visual properties of the attended objects. For each experiment, we will explain the motivation, setup, and results.

The first study has multiple motivations. The first motivation is to demonstrate that an off-the-shelf CNN can learn the name-object mapping directly from the captured data. The second motivation is to compare the child and parent views and decide which view is better. The third motivation is to investigate the quantitative impact of the training data, specifically the number of naming events. To investigate this, we train a CNN for each view, and we also subsample the naming events into seven different subsets with different numbers of events. Here are the results of study one. The y-axis is the accuracy on the test data, and the x-axis is the number of naming events. The blue line represents the child view, and the orange line represents the parent view. With 200 or more naming events, models trained from infant data consistently outperform models trained from the parent data. As the quantity of training data increased, models trained from infant data performed better and better, whereas the trend for parent data saturated. These results suggest that the model can solve the name-object mapping from raw video directly recorded from the egocentric view, and also that infant data contains certain properties leading to better word learning compared to the parent data. Okay, let's move on to study two.
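As an aside before study two: one simple way to approximate the acuity filter mentioned above is to blend a blurred copy of the frame toward a sharp region around the gaze point. The blur strength and falloff radius here are illustrative choices, not the paper's exact filter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, gaze_xy, sigma_blur=8.0, fovea_radius=60.0):
    """Blend a heavily blurred copy of the frame with the original using a
    Gaussian weight centred on the gaze point, so resolution falls off with
    distance from gaze, roughly mimicking foveated human vision."""
    h, w = image.shape[:2]
    blurred = np.stack([gaussian_filter(image[..., c].astype(float), sigma_blur)
                        for c in range(image.shape[2])], axis=-1)
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2
    weight = np.exp(-dist2 / (2.0 * fovea_radius ** 2))[..., None]  # 1 at gaze, -> 0 far away
    return (weight * image + (1.0 - weight) * blurred).astype(image.dtype)
```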
The motivation of this study comes from the fact that the 3-second window we chose contains moments both during and after hearing the naming event. Therefore, infants can look at multiple objects, or alternatively they can continue looking at the same object. The purpose of this study is to investigate how these two attention strategies influence word learning. Please note that even if they keep looking at the same object, it might not be the object that is named.

To investigate this, we divide the naming events into two types: one is sustained attention and the other is distributed attention. We call an event sustained attention if the infant looks at a single object for more than 60 percent of the time; otherwise, we call it distributed attention. We visualize an example of each case at the bottom. For the top one, because the child looked at object one for more than 60 percent of the time, it is a sustained attention event. On the other hand, in the bottom one, the child looked at three objects and none of those looks exceeds 60 percent, so this is distributed attention. The threshold of 60% is chosen to make both sets roughly equal, and as a result we get several hundred events for each type. We then train a CNN for each set. Here are the results of experiment two, where we use blue for sustained attention and orange for distributed attention. The accuracy for sustained attention is higher than for distributed attention. This suggests that sustained attention on a single object while hearing the name of the object is better for learning object names. This finding is consistent with previous work stating that being able to keep sustained attention is crucial for infant development and healthy developmental outcomes.

This experiment investigated attention on the temporal scale, but attention over the spatial region is still unclear. This brings us to experiment three. Study two investigated the effect of temporal attention during the naming moment, but the effect of the sensory information selected and processed in the naming moment is still unclear. Therefore, we study how the visual properties of the attended object influence word learning. Specifically, previous work shows that objects attended by infants tend to be large in the view, providing higher-resolution images. To investigate this effect, we divided the naming events into two groups, the large object group and the small object group. A large object means that the bounding box of the attended object covers more than 6% of the field of view, while a small object covers less than 6%. This threshold of 6% is the median of the bounding box sizes, so that each set contains the same number of naming events. Here are some examples of large objects and small objects. We show the accuracy for large objects in blue and small objects in orange. Clearly, the model trained with large objects achieved significantly higher accuracy than the model trained with small objects. However, we have a concern: large objects in naming events may simply correlate with the sustained attention from study two, making the effect of visual properties unclear. Therefore, to distinguish these two covarying factors, we divided the naming events into sustained attention and distributed attention as in study two, and we did that for each of the large and small sets.
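As a minimal sketch of how the two splits just described might be computed, assuming per-frame gaze-to-object assignments and bounding-box areas are already available (the field names and example values are made up):

```python
def attention_type(attended_objects, threshold=0.6):
    """Label a naming event as 'sustained' if a single object holds the infant's
    gaze for more than `threshold` of the frames in the window, else 'distributed'."""
    if not attended_objects:
        return "distributed"
    counts = {}
    for obj in attended_objects:                      # one attended-object id per frame
        counts[obj] = counts.get(obj, 0) + 1
    top_fraction = max(counts.values()) / len(attended_objects)
    return "sustained" if top_fraction > threshold else "distributed"

def size_group(box_area, frame_area, threshold=0.06):
    """Label the attended object as 'large' when its bounding box covers more
    than 6% of the field of view, else 'small' (6% is the median in the talk)."""
    return "large" if box_area / frame_area > threshold else "small"

# Hypothetical event: 90 frames, the helmet attended in 70 of them.
frames = ["helmet"] * 70 + ["car"] * 20
print(attention_type(frames))        # -> "sustained"
print(size_group(40000, 640 * 480))  # ~13% of the frame -> "large"
```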
Here is the result. Fortunately, we observe the same pattern for both attention types. This means that the visual properties of the target object indeed have a direct and unique influence on word learning.

All right, let's summarize. This paper investigated how infants learn from ambiguous experiences, that is, under referential uncertainty. The uniqueness of our paper is the use of egocentric video and head-mounted eye trackers to capture the actual data that the infants are perceiving. We have three key findings. The first finding is that the information available from the infant's point of view is sufficient for a machine learning model to successfully associate object names with visual objects. Second, moments of sustained attention during parent naming events are better visual input for word learning compared with moments when infants show more distributed attention. Third, the in-the-moment properties of infant visual input influence word learning; in other words, a large object size in the field of view is better for the learner.

Of course, we have future work. First, because we extracted naming moments, we didn't use the entire speech, so using more speech is future work. That means we can investigate how infants learn not only object names but also other types of words, for example verbs. Lastly, we should also investigate the effect of social cues in child-parent interactions; this nonverbal communication should also influence word learning. That's it. Thank you very much for listening.
Description of the video:
Hello, thank you for joining this talk. My name is Hassan, and I'm going to present our paper on automatically detecting bystanders in photos to reduce privacy risks. This research was done in collaboration with David Crandall and my other co-authors.

Many photos captured in public places contain bystanders in addition to the photo subjects. When shared on online platforms, these photos reveal the identity, location, and other sensitive information of the bystanders, potentially an unbounded number of people. Many of these photos are shared publicly; thus, using facial recognition systems, people can be automatically identified and tracked. This threatens the privacy and safety of people appearing in these photos, including the bystanders who did not participate in taking or sharing these photos and may not even know about their existence.

To protect the privacy of bystanders, researchers have devised different solutions. For example, one line of work designed a QR code as a privacy tag to encode a privacy policy. People who do not want to be in others' photos can wear the tag to broadcast their privacy preference. The authors also created a protocol that can be implemented in cloud servers so that when the tag is detected in an uploaded photo, the corresponding privacy policy can be applied, such as obfuscating faces. Kennedy et al. proposed another cloud-based approach, where registered users can mark locations as private; the server notifies them whenever new images are uploaded that were taken in one of those marked locations. Another set of prior works attempts to prevent taking photos containing bystanders, or to remove them before sharing. In such systems, bystanders broadcast their privacy policy using, for example, mobile applications. The corresponding app on the photographer's side checks whether the bystander is in any recently taken photos using identifiable information such as the bystander's facial features. If detected, either the photo is deleted or the bystander is obfuscated.

Our goal is to eliminate the burden placed on the bystanders, where they need to be proactive to protect their privacy, use specialized applications, and share sensitive information such as facial features and location. Towards this goal, we propose a machine learning system that can distinguish between bystanders and subjects in an image. Once this is done, any privacy preserving mechanism, such as removing or obfuscating the bystander, can be applied automatically. This system can be deployed as a cloud service so that photo-taking devices and any social media platform can use it whenever photos are captured or uploaded. There are two primary benefits to this approach. First, it facilitates establishing a privacy-by-default policy by attempting to protect the privacy of bystanders without requiring them to use any specialized application or to share sensitive data such as facial features and location. Secondly, in addition to new photos, this system can be applied to images that already exist in devices or cloud servers.

In this work, we define a bystander as a person who is not a subject of the photo and thus not important for the meaning of the photo; for example, a person who was captured in a photo only because they were in the field of view and was not intentionally captured by the photographer. The context dependence of this definition signals the challenging nature of conceptualizing and distinguishing bystanders from subjects.
For example, the people in this photo look very different from one another, and the overall photo context needs to be considered to classify them as subject or bystander. Sometimes there is not much difference between the visual appearance of subjects and bystanders, further complicating the task of distinguishing them based on visual data.

So here we outline our approach to tackle this challenge. First, we seek to understand how humans classify subjects and bystanders, so we conducted a user study to collect data from people. From this data, we determined what reasoning or rationale people used when labeling a person as a subject or a bystander. Then we identified which high-level, intuitive features of a person in a photo are relevant to that human reasoning. Next, we removed features that contain redundant information and created a set of minimally correlated features. Then we built regression models to approximate those features using raw image data. And finally, we trained a classifier using the predicted high-level feature values. I'll go into the details of each step in later slides.

We started with a user study where participants were asked to label people in photos, such as those shown here, as subjects or bystanders. The images are taken from the Google Open Images dataset. For the labeling task, we asked: do you think the person in the green box is a subject or a bystander in this photo? Participants responded on a five-point Likert scale, and then they explained why they classified someone as a subject or a bystander. We curated a set of high-level characteristics of the person that are presumably related to the classification task, such as whether people in the photo are aware of and comfortable with being photographed, whether they were willing to be photographed, for example by intentionally posing, whether the photographer intentionally captured them, and whether they could be replaced with other random people without changing the meaning of the photo. We asked participants to rate people in images according to these features. Here is one example question, asking about the awareness feature: how strongly do you disagree or agree with the following statement: the person inside the green rectangle was aware of being photographed. We started with photos of 5,000 people, and each person was labeled by at least two annotators. Out of the 5,000 photos, 920 of them were not of actual people. Among the remainder, 56% of the people in photos were labeled as subjects and 37 percent were labeled as bystanders.

The next step is to determine what rationale the study participants used to distinguish between a subject and a bystander. For example, when we asked, why did you label the person in the green box as a subject, the most frequent responses were: this photo is focused on this person and they are taking up a large amount of space; the photo is about what this person was doing; and the person looks similar to, is doing the same activity as, or is interacting with other subjects in this photo. For bystander labels, the most frequent responses were: this photo is not focused on this person; they were captured by chance; and the person looks similar to, is doing the same activity as, or is interacting with other bystanders in this photo. In the next step, we investigate whether there is any relation between the reasonings we just saw and the high-level features that we collected data about. As an example, let's pick the most frequent reason for labeling a person as a bystander: this photo is not focused on this person.
How is this associated with the different high-level features? This table shows Spearman's rho as the correlation measure. All of the correlation coefficients are statistically significant, suggesting an association between not being the focus of a photo and the high-level features. We repeated this calculation for the other reasons provided by the study participants and observed similar results. In summary, the statistically significant correlations between the reasonings and the features suggest that the high-level features may be good predictors for classifying subjects and bystanders in photos.

In the next step, we study how these features are related among themselves, in order to remove redundant features. To do that, we perform exploratory factor analysis, including all the high-level features plus some other features that can be directly extracted from a photo, such as the size of the person, the distance from the photo center, and the total number of people in the photo. We extracted two factors. In the plot, you can see how each feature correlates with these two factors. These three features, pose, comfort, and willingness, are correlated more with factor one than factor two, and we refer to the underlying concept represented by this factor as the visual appearance of a person. Similarly, these three features load on the second latent factor, which indicates how prominent a person is in the image. All features in the same group contain similar information, so we just pick one from each group, namely pose from group one and size from group two. We keep replaceability, as it is associated equally with both factors, and photographer's intention, as it does not associate with either of the factors.

So we now have a refined set of features, and we can use them to train a classifier. But the problem is that, other than size, none of them can be obtained directly from an image. So there has to be an intermediate step to get these features from the raw image. This is what we do in this step: we infer the high-level features using raw image data. To do that, we utilize several existing deep learning models and pass the image through them to extract intermediate features. The extracted features are then used to train regression models that will be used to predict the high-level features. One of the existing models we use is OpenPose, from which we extract body joints and limb angles; together they specify the orientation of the body. We use another deep learning model to detect faces. The faces are then extracted and fed into a second model, which predicts probabilities of eight emotions based on the facial expression. We also extract features from a ResNet-50 model, which was originally trained to recognize objects, including people. So this is the full process of predicting high-level features from raw image data; a separate regression model was built for each of these high-level features. Finally, we use the predicted high-level feature values to train a classifier. Combining everything, here is the full process of building the final classifier: the raw image is used to extract intermediate features using several existing deep learning models; these intermediate features are then passed to regression models to predict the high-level features, which, along with size, are used to train the final classifier. This two-stage classification procedure resulted in 85% classification accuracy.
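To make the two-stage idea concrete, here is a hedged scikit-learn sketch: deep intermediate features (stand-ins for the OpenPose, emotion, and ResNet-50 outputs) are first mapped to a handful of high-level features with per-feature regressors, and a classifier is then trained on those predictions plus size. The arrays, feature names, and model choices are placeholders, not the actual data or models from this work.

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

HIGH_LEVEL = ["pose", "replaceability", "photographer_intention"]  # size is added directly

def train_two_stage(deep_feats, high_level_targets, size, labels):
    """Stage 1: one regressor per high-level feature from deep features.
    Stage 2: a classifier on the predicted high-level features plus size."""
    regressors, predicted = {}, []
    for i, name in enumerate(HIGH_LEVEL):
        reg = Ridge(alpha=1.0).fit(deep_feats, high_level_targets[:, i])
        regressors[name] = reg
        predicted.append(reg.predict(deep_feats))
    X = np.column_stack(predicted + [size])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return regressors, clf

def predict_two_stage(regressors, clf, deep_feats, size):
    predicted = [regressors[name].predict(deep_feats) for name in HIGH_LEVEL]
    X = np.column_stack(predicted + [size])
    return clf.predict(X)   # 1 = subject, 0 = bystander (an assumed encoding)

# Tiny synthetic example just to show the shapes involved.
rng = np.random.default_rng(0)
deep = rng.normal(size=(200, 64))
targets = rng.normal(size=(200, len(HIGH_LEVEL)))
size = rng.uniform(0.01, 0.5, size=200)
labels = rng.integers(0, 2, size=200)
regs, clf = train_two_stage(deep, targets, size, labels)
print(predict_two_stage(regs, clf, deep[:5], size[:5]))
```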
We built several other models for comparison. For example, another model trained directly on the features extracted from the ResNet, OpenPose, and emotion detection models achieved 78% accuracy. We hypothesize that the high-level features contain information more pertinent to the classification of subject versus bystander, with less noise, compared to the deep features from which they are derived; thus the two-stage approach resulted in higher accuracy.

What makes someone a bystander is still a subjective notion, and so it is hard to declare labels as conclusive. We grouped the images based on how many of the human annotators agreed with the final class label. For 34% of the images, all annotators agreed on the final class label, but for another 34% of the images only 67 percent of annotators agreed on the final class label, indicating ambiguity about the labels. We trained and evaluated the model separately on these two subsets of data. The classifier had 90% mean accuracy in the first case, with a corresponding area under the curve of 98%. For the more ambiguous images, where only 67 percent of annotators agreed on the class label, our model demonstrated 80 percent classification accuracy and an 89% area under the curve. To evaluate the generalizability of the model, we applied it to an entirely new dataset of 600 images taken from the MS COCO dataset. On this dataset, overall, the classifier had over 84% accuracy; it achieved more than 91% accuracy for photos with 100 percent agreement among the annotators, and over 78% accuracy for photos where only 67 percent of annotators were in agreement. These findings demonstrate the good generalization capability of our model. Here are some examples of correct classifications; people bounded by green and red rectangles are predicted as subjects and bystanders, respectively. These are examples of wrong classifications: again, green and red rectangles indicate predictions of subject and bystander as before, but here the ground truths are the opposite of the predictions.

To conclude, we propose a machine learning based model to detect bystanders in photos so that they can be automatically removed or obfuscated to protect their privacy. The distinction between subjects and bystanders is often very context dependent and sometimes subjective, as is evident from the findings of our user study; thus, detecting bystanders in images is a challenging problem. We approach this problem by identifying intuitive concepts underlying human reasoning and building a classifier based on only a few high-level features that map to those concepts. This resulted in higher classification accuracy. Additionally, the decisions made by our model are interpretable, as they are based on high-level, intuitive concepts. Our proposed solution eliminates the need for bystanders to be proactive to protect their privacy and to exchange sensitive information such as facial features and location. Our solution applies to all past, present, and future images, so it has the potential to protect bystander privacy at scale. With that, I would like to thank you all for your attention and express my gratitude to the co-authors of this paper. I look forward to hearing your feedback on this research.
Description of the video:
Hello. In our paper, HOPE-Net, a graph-based model for hand-object pose estimation, we worked on 2D and 3D pose estimation from a single RGB image in real time. The HOPE model uses a lightweight ResNet-10 as the image encoder, followed by a three-layer graph convolutional neural network to refine the initial estimates of the 2D coordinates, and our adaptive graph U-Net to convert the 2D coordinates to 3D coordinates. Our adaptive graph U-Net is a graph convolutional neural network with a U-Net structure. To this end, we designed three new operations: adaptive graph convolution, and trainable pooling and unpooling layers. With these new graph convolution operations, our model outperforms previous models in hand and object pose estimation, both in 2D and in 3D. Here are qualitative results on the First-Person Hand Action dataset. If you are interested in our paper, please come to our chat room. Thank you.
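To give a feel for the kind of graph convolution such a model builds on, here is a minimal, generic GCN layer in PyTorch operating on keypoint features with a fixed adjacency matrix. It only illustrates the basic aggregate-and-transform operation, not the paper's adaptive graph convolution or its trainable pooling and unpooling layers; the node count and edges below are placeholders.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One generic graph convolution: aggregate neighbour features with a
    row-normalised adjacency matrix, then apply a shared linear map."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        adj = adjacency + torch.eye(adjacency.size(0))          # keep self-connections
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):            # x: (batch, num_nodes, in_dim)
        return torch.relu(self.linear(self.adj @ x))

# Toy usage: 29 nodes (e.g. hand joints plus object corners) with 2-D inputs.
num_nodes = 29
adj = torch.zeros(num_nodes, num_nodes)
adj[0, 1] = adj[1, 0] = 1.0                     # placeholder skeleton edge
layer = SimpleGraphConv(in_dim=2, out_dim=16, adjacency=adj)
coords_2d = torch.randn(8, num_nodes, 2)
features = layer(coords_2d)                     # -> (8, 29, 16)
```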
Description of the video:
Hi everyone. This video is a short introduction to our paper called Interaction Graphs for Object Importance Estimation in On-Road Driving Videos. This is joint work between Ashish Tawari and Sujitha Martin at Honda Research Institute, and Zehua Zhang and David Crandall at Indiana University.

Driving is a complex task because it involves navigating highly dynamic, complex environments in which many different autonomous agents, like other drivers and pedestrians, are acting at the same time. Human drivers must make real-time decisions by combining information from multiple sources, especially what they see. Since people have foveated vision systems, they can only really focus on one spot in the scene at a time. So people must identify and attend to the most task-relevant objects in their visual field at any given time. In this paper, we propose a novel model for object importance estimation in on-road driving videos. Given a video that was recorded by a car-mounted, front-facing camera, we want to detect the important objects in the target frame. The important objects are defined as those affecting the driver's control decisions, and we assume that they are annotated by experienced drivers.

Learning to predict a driver's attention has become a popular topic in recent years due to potential applications in advanced driver assistance systems and autonomous driving. For example, in autonomous driving, the detected important objects can be used to help the system prioritize the important regions of the road scene, thus leading to better control decisions. Most existing work relies on the driver's eye gaze as the ground truth and predicts a pixel-level attention map, like the one shown on the left. But this is different from important object estimation: drivers will often look at driving-irrelevant objects, such as the beautiful scenery around them, and they may be attending to multiple objects at the same time. To overcome these problems, similar to a recent paper by Gao et al., we investigate how to directly estimate each object's importance to the ego vehicle for making driving decisions, without using eye gaze as an intermediate step. Our work is also related to eye gaze and object-level human attention estimation in egocentric or first-person videos. In that work, videos are recorded by head-mounted cameras and eye gaze trackers, and we can use cues like what people are touching. But in our problem, the cameras are mounted on cars, and so useful cues such as hand and head movements are not available, making this problem in many ways more challenging.

To solve the problem, our main idea is to leverage the frequent interactions among objects on the road, which are often overlooked by other methods but extremely helpful. In the example shown here, for instance, there's a car directly in front of us, which would prevent the ego vehicle from hitting any of the other objects on the road. So when making control decisions, the important object at the moment for us is the car, not the people. Here's an overview of our model. We first apply an off-the-shelf Mask R-CNN detector to obtain object proposals and only keep those proposals whose classes are related to the driving task, such as cars, buses, and pedestrians. Then the interactions are modeled using a novel interaction graph, with features of each task-relevant object, pooled from an I3D-based feature extractor, as the nodes, and with the interaction scores learned by the network itself as the edges.
An identity matrix is further added to the learned edge matrix to force self-attention. Through stacked graph convolution layers, object nodes interact with each other, and their features are updated from those of the nodes they closely interact with. The updated features are concatenated with a global context descriptor and passed through a multilayer perceptron for the final per-node importance score estimation. We implement our model with TensorFlow and Keras, and in order to alleviate data imbalance, hard negative mining is applied during training. There are a lot of other details; please refer to the paper for them.

We evaluated our model on the dataset introduced by Gao et al., and we followed the same evaluation protocol for a fair comparison, including using average precision as the metric. Here are some quantitative results. They show that our model outperforms the state of the art while using the least input and the easiest pre-processing. In other words, while the state-of-the-art method requires additional information, such as optical flow, goal information, and location information, our model only requires an RGB clip of 16 frames as the input. In addition, our model only requires object detection on the target frame as the preprocessing step, whereas the state-of-the-art technique needs to do tracking as well as object detection on each frame during pre-processing.

Here are some sample qualitative results. Our model can effectively suppress false positives and enhance true positives with the help of interaction graphs. When applied to the same example I showed before, our model suppresses the importance of the pedestrians, since the car in front would prevent us from hitting them, while the other models detect the pedestrians as important. In the second example, our model assigns similar importance scores to both pedestrians who are crossing the road together; their importance is enhanced by each other through the graph convolution. We also conducted ablation studies by removing certain parts of our model and training the remaining parts. Performance drops in all three of these experiments, showing that each component of the model contributes to the overall performance. For example, the interaction graphs are important for the model to leverage object interactions; the global descriptor provides global context, which is otherwise lost when extracting node features through pooling; and self-attention is crucial for each node to retain its own characteristics while interacting with others in graph convolution.

In conclusion, we proposed a novel framework for online object importance estimation in on-road driving videos with interaction graphs. Experiments show that interaction graphs help the model to suppress false positives as well as to enhance true positives, and thus lead to better performance than the state of the art, while using less input information and doing easier pre-processing. Our future work includes seeking better graph formulation methods and visualizing the graphs for better explainability. Here are the citations that we referred to throughout the talk. Thank you so much for listening, and we hope that you'll check out our paper. Thank you.
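A rough PyTorch sketch of the flow just described: pooled per-object features act as graph nodes, a learned edge matrix with an added identity is used for message passing, and each updated node is concatenated with a global context vector before an MLP scores its importance. The dimensions, the way edges are scored, and the layer sizes are assumptions for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

class InteractionImportance(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.edge_scorer = nn.Linear(2 * feat_dim, 1)   # learns pairwise interaction scores
        self.node_update = nn.Linear(feat_dim, feat_dim)
        self.head = nn.Sequential(                      # per-node importance score
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, node_feats, global_ctx):
        # node_feats: (N, feat_dim) pooled per detected object; global_ctx: (feat_dim,)
        n = node_feats.size(0)
        pairs = torch.cat([node_feats.unsqueeze(1).expand(n, n, -1),
                           node_feats.unsqueeze(0).expand(n, n, -1)], dim=-1)
        edges = torch.softmax(self.edge_scorer(pairs).squeeze(-1), dim=-1)
        edges = edges + torch.eye(n)                    # identity added to force self-attention
        updated = torch.relu(self.node_update(edges @ node_feats))
        ctx = global_ctx.unsqueeze(0).expand(n, -1)
        return torch.sigmoid(self.head(torch.cat([updated, ctx], dim=-1))).squeeze(-1)

# Toy usage: 5 detected objects with assumed 256-dimensional pooled features.
model = InteractionImportance()
scores = model(torch.randn(5, 256), torch.randn(256))   # -> (5,) importance scores
```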
Description of the video:
We'd like to tell you about a line of work entitled Active Viewing in Infants Facilitates Visual Object Learning: A Computational Modeling Approach. I'm David Crandall from Indiana University, presenting on behalf of my PhD student Satoshi Tsutsui and Professor Chen Yu at the University of Texas at Austin. The work I'll talk about here is part of a broader collaboration of an interdisciplinary team of psychologists and computer scientists. The psychologists include Chen Yu and Linda Smith at Indiana University and their labs of students and postdocs. The computer scientists also include Sven Bambach, who was a PhD student and postdoc at IU, and a current CS PhD student in my lab. My lab specifically works in computer vision and machine learning.

Because I'm a computer scientist, I thought I'd begin by explaining why computer vision researchers might be interested in an interdisciplinary collaboration like this one. So much progress has been made in computer vision and machine learning over just the last few years. Today's computer vision algorithms can now solve problems like object recognition, even on complex consumer images like this one, which seemed impossible just a few years ago. In fact, the popular press has even covered these developments with headlines like these, inspired by the fact that modern vision algorithms can sometimes match or even exceed the performance of people on a wide range of problems, from recognizing faces, to actions, to even emotions, to outperforming doctors in diagnosing disease. Most of this progress has been due to deep learning with artificial neural networks. The idea is that these neural network models can, given a large amount of training data, find patterns to be able to map raw incoming images to high-level semantics like the presence and locations of objects in the image. Deep learning and neural networks go back decades, but only within about the last, say, ten years have the pieces come together to make them work well in practice, including massive training datasets, usually tens of thousands to millions of images, and enormous amounts of computation power to learn these huge models with a huge number of parameters. In fact, deep neural networks have become so influential that the three computer scientists who helped create them, Bengio, Hinton, and LeCun, won the 2018 Turing Award for their development, which is the highest honor in computer science.

And yet, despite all of this progress, there are signs that something is not quite right. The same systems that supposedly can outperform humans on some tasks fail at other tasks that seem so simple, like self-driving cars hitting obvious obstacles, supposedly smart assistants being fooled into carrying out unintended actions, and AI systems that seem to exhibit bias, for example by failing to accurately recognize faces of some genders and races. Here's just one example. This image, when given to a modern image recognition system, is classified as a church. It's actually a picture of the Indiana Memorial Union, the student union building here at IU, but church is not unreasonable based on its visual appearance. But here are some other photos. You might think this is also of a church, but you'd be wrong. According to the deep network, it's a ladybug, and this is a Chihuahua, and this is a Christmas stocking. These are examples of something called adversarial examples, images that my student prepared that add some particular noise that deliberately confuses the classifier.
But it shows that for all their successes, deep networks are clearly not as robust at visual recognition as humans are. And here are a few other examples from a recent paper in Nature. You can add stickers to stop signs and cause deep networks to think they're speed limit signs instead. Deep networks hallucinate objects in texture patterns, as shown in the bottom left. In the top right, deep neural networks can identify stop signs, but not when they're rotated or skewed, probably because they never saw images like these in their training data. Even in natural images like those in the bottom right, the networks can be thrown off by texture and shape. These examples and many others show that although we've made significant progress on the computer vision learning problem, the models we learn are often very rigid, in that they fail to generalize to new contexts and they're confused by tiny amounts of noise. They also require vast amounts of training data and vast amounts of compute power.

So how can we begin to move past these shortcomings? Well, we as computer scientists are aware that there is already a system that knows how to overcome these problems, that is a more powerful learner than any state-of-the-art algorithm or supercomputer, and that is, of course, the human child. And that's why we are interested in interdisciplinary collaborations with developmental psychology. Of course, I don't have to tell you about kids' amazing abilities to rapidly learn to recognize and name new visual objects from limited examples. So what can we learn, as computer vision researchers, from kids' amazing abilities? Of course, the mechanism of learning is probably very different; kids are probably not using deep feed-forward convolutional neural networks of the particular type that our state-of-the-art algorithms are, of course. But besides that, another key difference between these two systems, that is, between modern computational visual learning algorithms and the visual learning systems of children, is that the training data input to the system is very different. To create the model for a car, for example, computer vision researchers would collect thousands or millions of images, typically from the web, and learn a deep neural network model. In contrast, the child's learning system collects training data in everyday interactions with objects. As they actively manipulate objects, they generate visual images for their visual learning system to use as training data. Moreover, their foveated visual system allows them to focus on one particular part of the visual field at a time, visualized here with the red crosshairs. The type of data collected by the infant is thus unlike any used for modern computer vision.

Our major goal in this line of work and in this talk has been to try to quantify and understand the unique properties of the training data that is collected by children during these everyday object interactions. Our results suggest that children naturally collect data that leads to highly efficient learning, and that inducing similar patterns in training datasets for computer vision can improve the learning outcomes for those models as well. So I'd like to tell you about the details of these studies. In order to actually show this, we use lightweight head-mounted cameras and eye gaze trackers to record a good approximation of a person's visual field as they go about everyday activities.
For the experiments reported here, 26 child-parent dyads were brought into a lab that was set up to resemble a home environment. The children ranged from 15 to 24 months old. There were 24 different toys available on the floor, and the parents were instructed simply to play with their child and the toys. Meanwhile, both the parent and child were outfitted with a head-mounted camera that recorded their field of view, as well as gaze trackers to identify where in the field of view each one was visually attending. All the cameras were synchronized, so that at each moment in time we have the view seen by the parent and their gaze position, as well as the view seen by the infant and their gaze position. Since we're interested in visual object learning, we began by annotating the position and identity of each of the 24 objects in each frame of the parent and child head-camera videos. We used a semi-automatic approach to doing this, by first running a state-of-the-art computer vision algorithm trained to locate the 24 toys and then cleaning up the annotations with human annotators. So from the child-parent dyads, we have two different visual datasets, one from the perspective of parents and one from children, synchronized together from the same environment. There were a total of about 100 minutes of data, and we only took the frames in which people were looking at an object, which yielded a total of about 500 thousand frames, each annotated with the position and identity of the 24 objects.

Given these two datasets, one from parents and one from children, we then asked which of these two views would be better for training a computer vision object model. To do this, we trained two separate convolutional neural network models, one on the parent data and one on the child data. The model architecture and learning algorithms were exactly the same, so the only difference was that one model was trained with child head-camera data and the other was trained with the parent data. We then tested the ability and accuracy of these models to recognize the 24 objects in a separate dataset consisting of the same objects, but taken in a controlled environment from different angles with a black background. We found, in fact, that the model trained with the child's data performed significantly better than the model trained with the parents' data. This suggests that there may be special properties of the training data collected by the child that make it more amenable to efficient training.

So then the question is, well, what is the difference in these datasets? Here's an example of visual instances of a particular toy object, a football helmet, in the child view on the left and instances in the parent view on the right. In other words, each of these montages consists of cropped helmets that appear in the field of view of the children or the adults. Comparing these two views, it's clear that they're different: toddlers' views of the object were generally larger in size, perhaps because children have shorter arms and are closer to the ground. We can quantify this size difference. The plots here show histograms of object size as a fraction of the field of view for children and parents.
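The size statistic used in these histograms is straightforward to compute from the annotations. A small sketch, assuming bounding boxes are given as (x, y, width, height) in pixels, which is an assumption about the annotation format:

```python
import numpy as np

def object_size_fractions(boxes, frame_width, frame_height):
    """Fraction of the field of view occupied by each annotated object instance."""
    frame_area = float(frame_width * frame_height)
    return np.array([(w * h) / frame_area for (_, _, w, h) in boxes])

# Hypothetical annotations from one view.
child_boxes = [(100, 80, 300, 200), (20, 40, 250, 260)]
fracs = object_size_fractions(child_boxes, 640, 480)
print(fracs.mean())   # mean fraction of the visual field, as plotted in the histograms
```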
The mean size for objects across the child view is about 13 percent of the visual field, whereas it was only about 5% for the parents. Overall, the children saw large instances of objects significantly more often than the parents, which may make the learning problem easier for both kids and algorithms trained on this data. However, in addition to size, it's also clear that toddler and adult views have important differences in the distribution of training data. It seems that children see a greater diversity of views than the parents do. Here's one way we tried to visualize this difference. These figures show all cropped instances of one particular object, the blue car, in the parent data on the left and the toddler data on the right. The images here are arranged by projecting the image data itself, with each image represented using its pixel values as a mathematical vector in high-dimensional space, into a 2D space using something called multidimensional scaling, or MDS. MDS tries to find a 2D arrangement of the images such that the 2D distances between images are proportional to how different their visual appearance is. We can see from the figure on the right that the parents' visual data was concentrated on sort of a core set of canonical views of the object. The toddler view on the left also had a concentration in this core, but it also had many images outside it, indicating a greater diversity of views of the object. Quantitative measures of diversity also bear out this point. Here we show histograms of the degree of difference between pairs of instances, averaged over all objects. Both children and parents see a lot of instances that are visually similar, but the toddlers also see many more instances that are visually diverse.

So it appears that children may collect some combination of these two. In other words, toddlers spend hours every day playing with toys, actively manipulating them, and creating training data by self-selecting object views. This creates more diverse views for children than for parents. So from the point of view of a learning system, there's a trade-off in the distribution of training data that you collect. Having many similar high-quality examples, say in ideal lighting conditions and from canonical views, may be important for the system to build core prototypes of what it is to be an object, but this may also limit the system's ability to generalize and recognize new instances in the future. In contrast, having a highly diverse dataset may help generalization, but having too many outliers may make it difficult for the model to learn important core patterns. It appears that toddlers may naturally collect a mixture of these two in a relative proportion that strikes a good trade-off and makes their visual learning problem easier. To further explore this hypothesis, we took the child data and split it into two subsets, one consisting of highly similar examples and one consisting of highly diverse examples. We then created new training datasets that consisted of various proportions of these diverse and similar subsets, trained separate models with each of these datasets, and measured their accuracy on the separate test dataset.
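A minimal sketch of how such mixed training sets might be assembled, assuming the similar and diverse subsets have already been identified (the splitting itself, for example by pairwise appearance distance, is not shown, and the file names are placeholders):

```python
import random

def mix_training_set(similar, diverse, n_per_class, diverse_fraction=0.5, seed=0):
    """Sample a training set of n_per_class examples that draws a given
    fraction from the 'diverse' subset and the rest from the 'similar' subset."""
    rng = random.Random(seed)
    n_div = int(round(n_per_class * diverse_fraction))
    n_sim = n_per_class - n_div
    return rng.sample(diverse, n_div) + rng.sample(similar, n_sim)

# Hypothetical usage for one object class with 200 training examples,
# mixing 50% similar and 50% diverse views as in the best-performing setting.
similar_views = [f"sim_{i}.jpg" for i in range(500)]
diverse_views = [f"div_{i}.jpg" for i in range(500)]
train_files = mix_training_set(similar_views, diverse_views, n_per_class=200)
```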
This plot shows the results. The four groups of bars show what happens as we increase the number of training examples per toy object; basically, more training examples is always better. But looking at, say, the 200 training examples case, we see that having only diverse views or only similar views creates models that perform relatively poorly. However, a combination of about 50 percent of each yields a model that works best. In fact, that works as well as, or even slightly better than, the actual data that was collected by children, which is the blue bar on the left. So these results are consistent with our hypothesis that children naturally collect a distribution of similar and diverse training examples that is perhaps ideally efficient for learning robust object models.

Finally, we tested whether this insight from our analysis of kids' camera images could generalize beyond just our dataset of 24 toys and actually help us improve computer vision models on the typical datasets and problems that we use in the computer vision community. To do this, we took training images from a subset of the MS COCO dataset, which is widely used in computer vision, consisting of about 50 thousand images of 12 object categories such as airplanes, bicycles, buses, and cars. A sample of the training images is shown on the left. As our test dataset, we used images from the ShapeNet dataset, as shown on the right. Using the insight from the child studies, we divided the training data into similar and diverse subsets and then constructed datasets with different combinations of these two. Here are results showing similar plots to the ones we presented for the child data. Again, having more training images is always better. Focusing on the 200 training images per class case again, the leftmost bar shows what happens when we simply use the COCO dataset off the shelf, which is what computer vision researchers would typically always do. However, we find that by dividing the data into similar and diverse subsets and then taking a combination of the two, we can actually achieve significantly better results, and this combination is better than using either the diverse or similar subsets alone. Although these results are preliminary, they suggest that the training data collected by kids as they naturally interact with objects has a special distribution that is highly efficient for learning, and computer vision and machine learning researchers may be able to improve their algorithms by using this insight from kids.

I should note here that we've done many variants of the experiments I presented, and the details are in our papers. For example, we've simulated the effects of foveated vision by blurring image frames outside of the gaze point. Foveated vision introduces another sort of trade-off into a learning system. On one hand, having a limited high-resolution view may help focus the learning system's attention on a target object, but it also may prevent the system from building a coherent model of an object if only part of it can be seen at any given time. We've also trained models on a per-child basis, and the results suggest that some children collected data that is more effective for training our deep learning models than others. Analyzing how data varies across children and across objects may yield additional insights into the important properties of training datasets. With that, I'd like to thank you very much for listening to our talk. You can find more information on our websites, and we'd like to thank the sponsors of this work.
Description of the video:
The title of our talk is Combining Deep Learning and Case-Based Reasoning for Robust, Accurate, and Explainable Systems. I'm David Crandall, an Associate Professor in Computer Science here at the School of Informatics, Computing and Engineering at Indiana University. I work in computer vision and machine learning. This is a collaboration with David Leake, a professor in computer science also here at IU, who's an expert in case-based reasoning. You'll hear more from him in a few minutes.

So much progress has been made in AI over just the last few years. For example, in computer vision, algorithms can now solve problems on complex consumer images like this one that seemed impossible just a few years ago. You've probably seen headlines in the popular press like these, inspired by the fact that modern vision algorithms can sometimes match or even exceed the performance of people on a range of problems, from recognizing faces or actions or even emotions, to outperforming doctors in diagnosing disease. Most of this progress has been due to the development of deep learning with neural networks. The origins of deep learning and neural networks go back many decades, but it was just within the last five or ten years that the pieces came together to make them work well in practice. These pieces are massive training datasets, unprecedented amounts of computation power, and significant algorithmic and engineering advances in the models themselves. Deep neural networks have become so influential that these three researchers, Bengio, Hinton, and LeCun, won the Turing Award in 2018 for their development.

And yet, despite all of this progress, there are signs that something is not quite right. The same systems that can outperform humans on some tasks fail catastrophically on other tasks that seem much simpler: self-driving cars have hit obvious obstacles, supposedly smart assistants are easily fooled into carrying out unintended actions, and AI systems seem to exhibit bias, for example by failing to accurately recognize faces of some genders and races. Here's just one example. This image, when given to a modern image recognition system, is classified as a church. It's actually a picture of the Indiana Memorial Union here on campus; it's the main student union here at IU. But church is not an unreasonable estimate based on its appearance. But here are some other photos. You might think that these are also of a church, but you'd be wrong. According to the modern deep network, this is a ladybug, this is a Chihuahua, and this is a Christmas stocking. Of course, these are adversarial examples: images that one of my students prepared by adding particular noise patterns that deliberately confuse the classifier. However, this example shows that despite all their successes, deep networks are clearly not cueing on the same visual features that humans are. Here are a few other examples from a recent paper in Nature. You can add stickers to stop signs, like in the top left, and cause deep networks to think they're speed limit signs instead. Deep networks hallucinate objects in texture patterns, as shown on the bottom left. In the top right, deep networks can identify stop signs when they're upright, but not when they're rotated or skewed, presumably because the network never saw these transformations in the training data. Even in natural images like those in the bottom right, the network can be thrown off by texture and shape similarities to other objects.
So deep learning has many strengths, but also many weaknesses, including that it requires large training datasets. Deep networks can sometimes fail to generalize to new contexts, and they can also learn unintended biases in training datasets. Neural networks do not naturally offer explainability, and they're notoriously difficult to debug. They require vast resources for training, and it's difficult to retrain them when a few new examples come by. Neural networks cannot easily incorporate human knowledge, because they're designed to learn patterns automatically from large-scale training data. Finally, deep learning seems to be susceptible to adversarial attacks like the ones that I showed. In a recent talk, Yoshua Bengio, who was one of the pioneers of deep neural networks, characterized the situation like this. He said that there are two broad categories of problems that require intelligence. System One problems are those that a human can do quickly and typically without even thinking about it, like recognizing objects or faces, recognizing spoken words, and so on. System Two problems, on the other hand, typically require some conscious thought on our part, and humans tend to be slower at these. This might be things like recognizing a new object we've never seen before, solving a word problem, or driving into a new neighborhood and having to make sense of unfamiliar streets. He said that current deep learning seems to work well for System One problems, but that we need new techniques for System Two problems. In his talk, he advocated for developing new network architectures and training paradigms that can overcome these problems. Here at IU, we've been exploring trying to overcome these problems by integrating ideas from deep learning and from a more traditional AI technique with complementary strengths and weaknesses called case-based reasoning. Now I'm going to turn things over to David Leake to describe case-based reasoning and how we've been exploring using it. So, as we're looking for methods to achieve better performance on the System Two tasks, the ones for which humans consciously reason, it's reasonable to ask: how do humans do this reasoning? Can we get hints of techniques to use based on observing what people do? Much expert reasoning is based on specific prior experiences. An illustration of this that I like is from a book on the use of anecdotes in medicine, describing a conference where a distinguished professor is asked not for results from the latest journals, but instead for an anecdote: has anyone had an experience with this? I think that's the sort of question that we've all very commonly encountered. Case-based reasoning is a process of reasoning based on the insight that memory plays a very large part in human problem-solving. The case-based reasoning process is one that relies on memory to provide prior examples, then analogical mapping to relate the prior examples to the new situation, and what's called case adaptation to fit the old solution to the specifics of the new circumstance. So basically the flexibility, and potentially the creativity, of the case-based reasoning process comes from case adaptation. There are some important practical benefits to the case-based reasoning approach.
One is enabling reasoning from limited data: case-based applications have sometimes been successful with just a handful of cases, basically as long as those cases cover the space and as long as there's sufficient knowledge to adapt their solutions to fit the problems that they're likely to encounter. Another practical benefit is that when there is domain knowledge available, that knowledge can be encoded within the cases, put into the similarity knowledge, or used for the adaptation knowledge. So there are a lot of places that human knowledge can be integrated into this process when it is available. And finally, providing explainable solutions: when a case-based reasoning system presents a solution, it can also present the case on which that solution was based. That's something that people often find compelling, if they can actually examine the prior episode and decide whether they think that episode applies in the new situation or not. So the case-based reasoning process is cyclical. It starts with the input of a new case. That case, or problem, is used to retrieve an old case in memory, which is compared with the case that has to be solved, basically to form a mapping between the two; that's called the reuse phase of the process. The retrieved solution can then be applied, and potentially revised if it still has problems. Then the revised case is retained in the system's memory, so that every problem-solving episode becomes an opportunity for learning. But the case is retained as a specific example. So basically, case-based reasoning is a lazy learning method: generalization is only done when required for dealing with a particular new problem. And it means that the training of the system is extremely economical, extremely low cost, because basically all it requires is placing the case in the case memory; no other retraining is required. There are some challenges, though, for applying case-based reasoning. Three have been classic over quite some time of studying case-based reasoning. One is selecting the features used to organize cases in memory for retrieval. Another is generating the similarity criteria, or similarity rules, that are used to assess whether a prior stored case applies to a new situation. And a third is generating the rules to adapt old cases to new situations. Traditionally, these are done through human knowledge acquisition, and so it can be a very expensive process. This suggests that there is a strong opportunity to combine deep learning and case-based reasoning: deep learning, for example, can help alleviate the knowledge acquisition burden in getting the knowledge to support the reuse of cases, while the process can still use cases, which adds the advantage of explainability and easy integration of human knowledge. So we're trying to form a synergistic integration. We want to combine both approaches to reinforce the strengths and alleviate the weaknesses of each of the two. We're doing that by pursuing two avenues. One, which I'm calling a component-based approach, is basically to apply one of the methods as a support method for a piece of the other. The other, which I'm calling pairings or twin systems, is basically to have the two systems work simultaneously in parallel, and then have the processing of one provide additional information that can be used to help, for example, in explaining, interpreting, or assessing the processing of the other.
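Here is a minimal Python sketch of that retrieve, reuse/adapt, and retain cycle (hypothetical helper names, not the speakers' system), just to make the lazy-learning point concrete: "training" amounts to storing a case, and generalization happens only when a new problem arrives:

import numpy as np

class CaseMemory:
    # A tiny case base: problems are feature vectors, solutions can be anything.
    def __init__(self):
        self.problems, self.solutions = [], []

    def retain(self, problem, solution):
        # Training is just storing the case; no retraining is needed.
        self.problems.append(np.asarray(problem, dtype=float))
        self.solutions.append(solution)

    def retrieve(self, problem):
        # Return the most similar stored case (here, nearest in Euclidean distance).
        problem = np.asarray(problem, dtype=float)
        distances = [np.linalg.norm(problem - p) for p in self.problems]
        best = int(np.argmin(distances))
        return self.problems[best], self.solutions[best]

def solve(memory, new_problem, adapt):
    old_problem, old_solution = memory.retrieve(new_problem)   # retrieve
    solution = adapt(old_problem, old_solution, new_problem)   # reuse / adapt
    memory.retain(new_problem, solution)                       # retain for later
    return solution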
Concretely, in the component approach, we've begun work in two different directions. One is the use of deep learning for similarity assessment to apply in case-based reasoning, and the other is deep learning to improve the process of case adaptation. For deep learning to improve retrieval and the similarity process, basically what we're using is the fact that Siamese networks can learn similarity. By using them to learn similarity that can then be used to guide case retrieval and similarity assessment, we can refine the retrieval process and apply it to new situations with a much lower knowledge acquisition cost. We've been looking at this in a sample classification task, with a classification approach called class-to-class classification, which, instead of just looking at similarities to one particular target class, is also based on learning difference patterns between the different classes, in order to find evidence from the classes that are unlikely to apply as well as the classes that are likely to apply. An advantage of that approach is that it can explain classifications not only by referring to the similarity between the new instance and some prior class, but also by referring to relevant differences. For the task of using deep learning to learn adaptation rules, what we're basing this on is an approach that's been referred to in the case-based reasoning literature as the case difference heuristic approach. The inspiration for that approach was basically that case-based reasoning systems tend to have large amounts of knowledge present in their cases; it's much harder, though, to get case adaptation rules. So the insight, which Mark Keane had some time ago, was to generate adaptation rules from pairs of prior cases: basically comparing the two cases, looking at the differences in the problems they address and the differences in their solutions, and then forming a rule that ascribes the difference in the solution to the difference in the problems and can be applied when there are similar problems in the future. To give a very concrete example, consider two cases where we're predicting the rental price of apartments. Two apartments have different sizes, and they also have different prices. So if these two apartments differ in their descriptions by the fact that one has one more room than the other, and their prices differ by a certain amount, then based on that information we can generate a rule saying that adding one room to the size of the apartment is going to affect the price in the corresponding way. Of course, a big difficulty of this is deciding what relationships should be used; there are many ways that one could try to generalize from comparing those two apartments. And so the approach that we've been using is actually to train a network from a number of examples of case pairs, in order to determine the relationship and basically calculate automatically what would have been calculated by the adaptation rules. So here, deep learning is being used basically as the component that generates adaptation knowledge. Once that knowledge is generated, it can be used in the case-based reasoning way to extend the reach of the cases to novel situations. We've started actually doing some tests of this approach. We found that, not surprisingly, networks alone can actually provide very good results for lower-dimensional spaces with sufficient data.
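To make the apartment example concrete, here is a small Python sketch of the basic case difference heuristic (the numbers are made up for illustration; the speakers' actual work trains a network on many such case pairs rather than hand-coding the rule):

def difference_rule(case_a, case_b):
    # Build a rule from one pair of cases: problem difference -> solution difference.
    return {
        "room_delta": case_b["rooms"] - case_a["rooms"],
        "price_delta": case_b["price"] - case_a["price"],
    }

def adapt(retrieved_case, new_problem, rule):
    # Apply the rule: scale the learned price effect by how many rooms differ.
    room_delta = new_problem["rooms"] - retrieved_case["rooms"]
    return retrieved_case["price"] + rule["price_delta"] * room_delta / rule["room_delta"]

# One extra room cost an extra 200 in this pair, so the rule is roughly "+200 per room".
rule = difference_rule({"rooms": 2, "price": 900}, {"rooms": 3, "price": 1100})
# Adapting the 2-room case to a 4-room query predicts 900 + 200 * 2 = 1300.
estimate = adapt({"rooms": 2, "price": 900}, {"rooms": 4}, rule)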
Obviously there are going to be times when simply using the network approach makes the most sense. But as we get into higher-dimensional spaces and domains with more limited data, the network learning becomes more difficult. In our tests, we have simulated increasing difficulty by removing cases from the case space, resulting in a sparser case space and making it harder for the system to solve problems simply from the existing cases. As the dimensions of the space get higher and data becomes more limited, combining case-based reasoning with network-learned adaptations can provide an important benefit that increases as problems become more novel. So this is an example of such a domain, the kind of domain where this approach is most beneficial. It compares the error of approaches such as nearest-neighbor methods, three-nearest-neighbor and one-nearest-neighbor; at the bottom we see case-based reasoning with rules generated by the normal case difference heuristic approach, along with the error of the network approach, and then the new approach we're pursuing, which is case-based reasoning with a network-based case difference heuristic. Basically, as the number of cases removed increases, it shows an increasing benefit over the alternative approaches. So we think that in contexts of this sort, this approach is actually very promising. We've also been looking at the idea of twin systems. These are systems that pair case-based reasoning and network systems: both are trained on the same data and applied simultaneously, and then the results can be compared, or a case can be presented as an explanation for the neural network result. The neural network provides data-driven flexibility, while case-based reasoning enables the integration of knowledge and explainability. We've actually done some preliminary work on this for estimating confidence, where it seems like it's useful to have this combination. A new direction that we're also looking at is how we can use a combination of networks and case-based reasoning to merge expert and network features, with the goal of getting better retrieval and better explainable solutions. A future opportunity that we'd like to pursue is applying a case-based reasoning methodology to deep learning systems themselves; for example, we could explore applying case adaptation to deep learning solutions, or using case-based reasoning to support things like the design of network-based systems, for things like architecture selection. So the takeaways from this talk are, first of all, that deep learning with neural networks is a very powerful tool for many problems, but it also has some inherent weaknesses. The path forward may involve combining deep learning with other techniques that can explicitly reason with past experience, and case-based reasoning and deep learning could be one such pairing, offering complementary strengths and weaknesses. We're looking for interesting case studies; if you have an interesting application where deep learning failed, we'd love to hear from you. Thank you very much.
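As a sketch of the twin-system idea, here is a minimal Python example using scikit-learn (illustrative only; the actual twin systems described in the talk are more involved): a neural network and a nearest-neighbor case memory are trained on the same data, and the retrieved case is returned alongside the network's prediction as an explanation by example:

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import NearestNeighbors

def fit_twins(X_train, y_train):
    # Train both "twins" on the same data.
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X_train, y_train)
    memory = NearestNeighbors(n_neighbors=1).fit(X_train)
    return net, memory

def predict_with_case(net, memory, X_train, y_train, x):
    # The network makes the prediction; the case memory supplies the nearest
    # training example as supporting evidence the user can inspect.
    label = net.predict([x])[0]
    _, indices = memory.kneighbors([x])
    i = int(indices[0][0])
    return label, X_train[i], y_train[i]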
Description of the video:
The human brain is an organism that changes itself. Every moment it is changing itself. That’s how we become smarter. That’s how we learn new skills. That’s how we learn to drive cars. It changes itself. Many people have sort of assumed that computer science and the study of human intelligence, or learning, are totally separate fields. But actually, these fields have been asking the same question about the nature of intelligence. And learning is at the heart of what makes intelligence. So, in this project we are trying to explore the connections and the possible collaborations between machine learning and human learning. In the last couple of years, machine learning has gotten very good, so that in some cases it can learn to do things nearly as well as humans can. We don’t really understand how it works or how to make it better. On the other hand, we have human learning, our sort of gold standard model for what it means to be a good learner, and we don’t really understand how that works either. Our hope is that by studying these two jointly, not in two separate communities but in one community with one unified set of goals, we might really be able to have both of these areas impact and accelerate progress in the other. We think, our whole group thinks, that we can really learn about learning by looking at the development of it. So, we put these pre-school children into the fMRI scanner to measure changes in their brain before and after learning happens. We collect those measurements and then we can compare the before and after scans to see how the brain changes as children learn in different ways. Probably one of the biggest applications will of course be to inform us about how people learn, and so I think, practically speaking, hopefully that can inform education. This is such a special opportunity for Indiana University in particular because we have a strong computer science program. We have people here that are very, very good in machine learning and we have an extremely strong psychology and cognitive science program with some of the very top scientists in the world here. And that unique combination is also in a place that really, really encourages interdisciplinary collaboration in a way that I haven’t necessarily seen in many other universities. Building this team that is working toward this common goal is something that I find really exciting and that is what I am looking forward to the most. If we learn about visual learning, we can help build satellites and devices that image from cells up to the universe. We can do a lot for helping people post-stroke, for helping school learning, for helping all forms of human activity. The consequences are huge. And we’d know something deep and true about how learning works.
IU Matters: Artificial Intelligence
WTIU PBS 2017
Description of the video:
I'm David Crandall. I teach in the informatics and computer science programs here, and I represent roughly half of the work in AI and computer vision here at IU; the other half is Michael Ryoo, who's going to talk a bit about his work in a few minutes. The goal in computer vision is really easy to state and really, really difficult to do. The idea is this: given an image like this one, as soon as you or I look at it, we immediately start understanding the scene without even thinking about it. We recognize the building and the fountain, we see the people, you probably recognize where this photo was taken and maybe when it was taken, and you make inferences about what the weather was like that day, what these people are doing, and so on. This is our goal in computer vision: to be able to understand images and imagery at a semantic level, the way that people are able to do. This is a really hard problem, so we attack it from various angles and using various strategies. Today I thought I'd just show you a whirlwind tour of some of the projects that we're working on, and if you have any specific questions or are interested in any of them in particular, you can stop by or check out one of our papers for more information.
On one hand, we work on very basic building blocks that are useful potentially across a range of applications, like detecting people in images and figuring out what they're doing, or detecting particular classes of objects, including even really hard recognition problems like recognizing particular species of birds and differentiating between them. A lot of our work involves helping people organize huge collections of images, like the billions of images that are on social media these days. Among other things, we have a long-running project on trying to figure out where on Earth a photo was taken, automatically, using just the visual content of the image itself.
We've also done a lot of work in 3D reconstruction, taking images of a particular landmark and producing a 3D model of that landmark, which is useful both in its own right and also so that we can organize new images as they come into a system. We're also really excited about interdisciplinary applications and collaborations. To give a couple of examples: with Professor Geoffrey Fox and others, we're working on automatically understanding imagery taken from the polar ice sheets to monitor how climate change is affecting the ice. With Professor Chen Yu over in psychology, we have a long collaboration on using vision techniques to automatically code and model behavioral data collected from kids and parents wearing cameras on their heads, so that we can monitor what they're looking at, what they're reaching for, and so on. A lot of our recent work has been
trying to understand the challenges and opportunities of a world where cameras are everywhere. In fact, I think I count at least six cameras in this room right now that are clearly visible, including this little one that I'm wearing right here, which takes a photo every 30 seconds throughout my day. I was really skeptical at first, but it's kind of cool: at the end of my day I get this visual representation of everything that I've done. But there are two problems with this. The first is that I get thousands of photos, most of them are terrible, and it's really hard to find the good ones. The second is that I get some images that are really, really terrible, like the ones I get when I walk into the bathroom and forget to take this off. So we're interested in understanding what the privacy implications of these wearable and ubiquitous cameras are, and whether we can develop technological solutions to help people protect their privacy in this new age of photography. These seem like really diverse projects, but underneath the hood we have some common themes: probabilistic graphical models; machine learning, especially deep learning, which has become very popular in vision in the last few years; large-scale datasets and large-scale computation; and, like I mentioned, interdisciplinary applications and collaborations. With that, I'll just thank all of the people who've collaborated with us, my students, and our sponsors, and mention our websites in case you'd like to find out more. Of course, you're more than welcome to stop by if you're interested in chatting about this or anything else. Thank you.
Description of the video:
Hands appear often in first-person video and give important cues about what people are doing. However, most work in egocentric hand detection makes strong assumptions, like that no other people are in view or that the environment is carefully controlled. We present a new dataset containing 48 videos of dynamic interactions between two people wearing Google Glass, in order to build strong data-driven models that can detect as well as distinguish hands, like telling apart the observer's hands from any others in view. The dataset includes more than 15,000 labeled hand instances with pixel-level ground truth. We use a CNN model with a lightweight region proposal method to robustly detect and distinguish hands, we show how to generate accurate pixel-wise hand segmentations from these high-quality detections, and finally we investigate whether segmented hands alone can accurately distinguish between first-person activities.
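As a rough sketch of going from a high-quality hand detection to a pixel-wise segmentation, here is one simple way to do it in Python with OpenCV's GrabCut, initialized from a detection box (an illustrative approach under stated assumptions, not necessarily the exact pipeline used in the paper):

import cv2
import numpy as np

def box_to_mask(image_bgr, box, iterations=5):
    # box = (x, y, width, height) from a hand detector; returns a binary hand mask.
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, tuple(box), bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)
    # Pixels marked (probably) foreground become 1, everything else 0.
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)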

Computer vision meets high-performance computing
SPIDAL Tech Talk

Privacy Behaviors of Lifeloggers using Wearable Cameras
Ubicomp 2014 Talk
paper
Description of the video:
Wearable cameras are opening a new paradigm of photography called lifelogging, automatically capturing thousands of photos of one's day from a first-person perspective. The sudden rise in such image gathering has novel privacy implications for both individuals and society. Our challenge is to understand these privacy implications from both the sociological and technical perspectives. In a society with ubiquitous cameras, unlimited memory, and powerful data mining tools, the context of all kinds of social interaction is changing. New technologies often affect cultural expectations about privacy, not to mention individual perceptions of what should and should not be private. We seek to understand not only how lifelogging technologies affect perceptions of privacy, but also how expectations of privacy can inform technology design and development. So we're investigating computer vision techniques that can automatically find private content in images, so that users can block recording or sharing based on policies that they define. We've already created a tool that people can train to identify and respect private spaces like bathrooms or bedrooms and private objects like computer screens, but this is just a starting point in our goal of laying the foundation for understanding and protecting privacy in a camera-rich world.
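As a sketch of how such a user-trainable filter might look, here is a minimal Python example (hypothetical, not the project's actual tool; it assumes image features, e.g. from a pretrained CNN, have already been extracted) that learns to flag private scenes and applies a user-defined sharing policy:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_privacy_filter(features, is_private):
    # features: (N, D) array of image features; is_private: (N,) array of 0/1 labels
    # that the user provides by marking a few example photos.
    return LogisticRegression(max_iter=1000).fit(features, is_private)

def apply_policy(model, photo_features, share_threshold=0.5):
    # Block sharing for any photo the filter considers likely private.
    prob_private = model.predict_proba(photo_features)[:, 1]
    return prob_private < share_threshold   # True = allowed to share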
Description of the video:
Overview of our CVPR 2011 and PAMI 2013 papers on large-scale 3D reconstruction.
Description of the video:

Beyond Co-occurrence: Discovering and Visualizing Tag Relationships from Geo-spatial and Temporal Similarities
WSDM 2012 Talk
paper | project
NAML 2022 Poster
Deep Learning to Enhance Similarity Assessment for Case-Based Reasoning
CVPR 2016 Tutorial Slides
Training Diverse Ensembles of Deep Networks
More Info: Project
AGU 2016 Poster:
3D Imaging and Automated Ice Bottom Tracking of (Canadian) Arctic Archipelago Ice Sounding Data
NIPS 2016 Poster:
Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles
More Info: Paper
ICCV 2015 Poster:
Lending a Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions
More Info: Paper
ICMI 2015 Poster:
Viewpoint Integration for Hand-Based and Recognizing Activities in Complex Egocentric Interactions
More Info: Paper
ICCP 2015 Talk Slides:
Linking Past to Present: Discovering Style in Two Centuries of Architecture
More Info: Paper | Project
CVPR Big Vision 2015 Poster:
Linking Past to Present: Discovering Style in Two Centuries of Architecture
More Info: Paper | Project
WACV 2015 Poster:
Predicting Geo-informative Attributes in Large-scale Image Collections using Convolutional Neural Networks
More Info: Paper | Project
CVPR 2014 Poster
Multimodal Learning in Loosely-organized Web Images
More Info: Paper
WACV 2014 Poster:
Vehicle Recognition with Constrained Multiple Instance SVMs
More Info: Paper
CVPR EgoVision 2014 Poster:
This Hand is My Hand: A Probabilistic Approach to Hand Disambiguation in Egocentric Video
More Info: Paper
ICWSM 2013 Poster:
De-anonymizing Users Across Heterogeneous Social Computing Platforms
More Info: Paper
WSDM 2012 Talk Slides:
Beyond Co-occurrence: Discovering and Visualizing Tag Relationships from Geo-spatial and Temporal Similarities
More Info: Paper | Project
WSDM 2012 Poster:
Beyond Co-occurrence: Discovering and Visualizing Tag Relationships from Geo-spatial and Temporal Similarities
More Info: Paper | Project
BMVC 2012 Poster:
A Multi-layer Composite Model for Human Pose Estimation
More Info: Paper | Project
ICPR 2012 Talk Slides:
Layer-finding in Radar Echograms Using Probabilistic Graphical Models
More Info: Paper | Project
CVPR 2012 Poster:
Discovering Localized Attributes for Fine-grained Recognition
More Info: Paper | Project
WWW 2012 Talk Slides:
Mining Photo-sharing Websites to Study Ecological Phenomena
More Info: Paper | Project
CVPR 2011 Talk Slides:
Discrete-Continuous Optimization for Large-scale Structure from Motion
More Info: Paper | Project
ICCV 2009 Poster:
Landmark Classification in Large-scale Image Collections
More Info: Paper