2016 - 2017. I received my master's degree from the Pontifical Catholic University of Chile (PUC), working under the supervision of Karim Pichara (PUC) and Pavlos Protopapas (Harvard University). They both strongly supported me throughout my work. During my master's, I spent four months at Harvard University, which allowed me to learn from others and focus my research to conclude it successfully.
Best Computer Science Thesis Award
2017 - Pontifical Catholic University of Chile
Crowdsourcing has become widely used in supervised scenarios where training sets are scarce and hard to obtain. Most crowdsourcing models in the literature assume labelers can answer full questions. In classification contexts, a full question asks a labeler to discern among all the possible classes. Unfortunately, that discernment is not always easy in realistic scenarios: labelers may not be experts in differentiating all the classes. In this work, we provide a full probabilistic model for a shorter type of query, requiring only a "yes" or "no" response. Our model estimates a joint posterior distribution over the matrices describing the labelers' confusions and the posterior probability of each object's class. We develop an approximate inference approach using Monte Carlo sampling and Black Box Variational Inference, and we provide the derivation of the necessary gradients. We build two realistic crowdsourcing scenarios to test our model: the first queries irregular astronomical time series, and the second relies on animal image classification. Results show that we achieve performance comparable to full-query crowdsourcing. Furthermore, we show that empirical bias plays an important role in modeling the labelers' failures. Finally, we provide the community with two real datasets obtained from our crowdsourcing experiments. All our code will be publicly available.
Before my thesis defense, I submitted a paper titled "A full probabilistic model for yes/no type crowdsourcing in multi-class classification" to the journal Data Mining and Knowledge Discovery. I am still waiting for the first review.
We crowdsourced labels for an astronomical dataset. This contest allowed us to test the proposed model on non-uniformly sampled time series. Several astronomers and engineers participated in the crowdsourcing task.
We tested the model in two different scenarios. We will release the full databases once the paper is accepted. The animal contest is also available.
Our shorter queries require only a "yes" or "no" response instead of discerning among all the possible classes.
The proposed model can be applied to any crowdsourcing context, reducing the amount of effort spent by the labelers.
The model works in two stages: credibility estimation and label posterior estimation.
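The two stages can be illustrated with a minimal sketch. This is a simplified counting-and-Bayes version, not the paper's actual probabilistic inference; the class counts, answer encoding, and function names are all illustrative assumptions.

```python
import numpy as np

# Stage 1: credibility estimation on objects with known (gold) labels.
# cred[j, k] = fraction of correct yes/no answers labeler j gave when
# asked "is this object of class k?".
def estimate_credibility(answers, asked_class, gold, n_classes):
    # answers[i, j] in {0, 1}: labeler j's yes/no answer for object i
    # asked_class[i]: the class labeler j was asked about for object i
    # gold[i]: true class of object i
    n_labelers = answers.shape[1]
    cred = np.zeros((n_labelers, n_classes))
    counts = np.zeros((n_labelers, n_classes))
    for i in range(len(gold)):
        truth = 1 if gold[i] == asked_class[i] else 0
        for j in range(n_labelers):
            cred[j, asked_class[i]] += (answers[i, j] == truth)
            counts[j, asked_class[i]] += 1
    return cred / np.maximum(counts, 1)

# Stage 2: label posterior for a new object, combining yes/no answers
# with the estimated credibilities via Bayes' rule (naively assuming
# answers are independent given the true class).
def label_posterior(answers, cred, n_classes):
    post = np.ones(n_classes) / n_classes       # uniform prior
    for (j, k), a in answers.items():           # labeler j answered a to "class k?"
        for c in range(n_classes):
            correct = 1 if c == k else 0
            p = cred[j, k] if a == correct else 1 - cred[j, k]
            post[c] *= p
    return post / post.sum()
```

For instance, a single "yes" to "is it class 0?" from a labeler with credibility 0.9 shifts a uniform 3-class prior to roughly (0.82, 0.09, 0.09).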
We performed experiments for synthetic data, classifiers as annotators and crowdsourced data from actual people. Our model converged under all those scenarios.
Due to the convergence time of the MCMC implementation, we decided to compare it with a custom implementation of BBVI (Ranganath et al., 2014). This approach finds a simple probability distribution that is closest (in KL divergence) to the true posterior distribution.
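The core of BBVI is a score-function (REINFORCE) gradient estimate of the ELBO. Here is a toy sketch on a one-dimensional Gaussian, purely to illustrate the estimator; the target, variational family, and hyperparameters are made-up assumptions, not the model from the paper.

```python
import numpy as np

# Fit q(x) = N(mu, 1) to an unnormalized target p(x) = N(3, 1) using
# black-box variational inference with score-function gradients.
rng = np.random.default_rng(42)

def log_p(x):            # unnormalized log target density
    return -0.5 * (x - 3.0) ** 2

def log_q(x, mu):        # variational log density (up to a constant)
    return -0.5 * (x - mu) ** 2

mu, lr, n_samples = -2.0, 0.05, 500
for _ in range(400):
    x = rng.normal(mu, 1.0, size=n_samples)          # sample from q
    score = x - mu                                   # d/dmu log q(x; mu)
    elbo_grad = np.mean((log_p(x) - log_q(x, mu)) * score)
    mu += lr * elbo_grad                             # ascend the ELBO
# mu should now be close to the target mean 3.0
```

The estimator only needs samples from q and evaluations of log p and log q, which is what makes the approach "black box": no model-specific gradients are required.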
MSE between each original credibility matrix row and its recovered estimate.
The fast convergence does not come at the cost of accuracy, because our model can quickly detect most of the classes where the labelers have low confidence.
As future work, the model should allow labelers' expertise to increase over time, which makes much sense in real scenarios. Furthermore, objects shown in previous iterations may change how labelers perceive our questions. We may also include an Active Learning algorithm.
➢ Annotators do not need to discern among all the possible classes.
➢ Recovering labels from partial information.
➢ Not context dependent.
➢ Supervised stage for prior parameters estimation.
➢ We compare two different implementations.
➢ We present the equations needed to solve the model. Extensible to any similar model.
➢ We create two full Yes/No databases for further experiments.
➢ All our code and data will be publicly available as soon as the paper gets published.
I have been the administrator of several courses, and I own some of their repositories, which you can find on my GitHub profile.
I am a fan of Kaggle competitions. I have tried a few solutions, shared my ideas, and learned from others.
My first submission was in 2015, for San Francisco Crime Classification, when I was just starting in machine learning. I finished in the top 5% of the leaderboard. I compared a soft-margin SVM to one-vs-all logistic regression, with the latter achieving the lower log-loss.
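That competition was scored by multiclass log-loss: the average negative log of the probability each model assigns to the true class. A minimal sketch of the metric, with made-up probability tables standing in for the two models' predictions:

```python
import numpy as np

# Multiclass log-loss: lower is better; it heavily penalizes confident
# wrong predictions.
def log_loss(y_true, probs, eps=1e-15):
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 2, 1])                 # true class of each sample
probs_logreg = np.array([[0.7, 0.2, 0.1],    # hypothetical logistic regression
                         [0.1, 0.1, 0.8],
                         [0.2, 0.6, 0.2]])
probs_svm    = np.array([[0.5, 0.3, 0.2],    # hypothetical SVM probabilities
                         [0.2, 0.2, 0.6],
                         [0.3, 0.4, 0.3]])
print(log_loss(y_true, probs_logreg) < log_loss(y_true, probs_svm))  # → True
```

In this toy case the sharper, correct probabilities of the first model give it the lower loss, which is the same criterion that separated the two models in the competition.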
Analysis of 2,500 votes of the Chilean Senate. I analyzed who influenced whom through national agreements, applying data mining techniques to capture the dependent behavior among the senators and visualization techniques to represent the insights found.
This project may be approached using correlation techniques, probabilistic models, association rules, or decision trees.
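The correlation route can be sketched very simply: encode each senator's voting record as a vector and correlate the vectors pairwise. The vote data below is fabricated for illustration; it is not from the actual Senate dataset.

```python
import numpy as np

# Votes encoded as +1 (yes), -1 (no), 0 (abstain/absent), one row per senator.
votes = np.array([
    [ 1,  1, -1,  1, -1],   # senator A
    [ 1,  1, -1,  1,  0],   # senator B, usually agrees with A
    [-1, -1,  1, -1,  1],   # senator C, opposes A on every bill
])

# Pearson correlation between each pair of voting records: values near +1
# suggest aligned behavior, near -1 suggest systematic opposition.
corr = np.corrcoef(votes)
print(np.round(corr, 2))
```

Highly correlated blocks in such a matrix are a natural starting point for the influence analysis, before moving to richer models like association rules.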
With a group of five classmates, I evaluated the project of extending a public drugstore service in Peñalolén (a commune of Santiago, Chile).
The main challenge was to forecast the expected demand for the following 20 years. We built a database with information on almost every inhabitant of the commune, containing patient and illness records, competitors' prices, and the economic distribution of the population, among other data. Finally, we recommended the optimal location to open a new drugstore and maximize the commune's welfare.
Finding patterns in a database of complaints about a Chilean company. All employees were asked to respond to a survey, and I was in charge of reporting the results to the manager.
My conclusions and analysis helped the human resources manager understand the workers better. Results showed that employees' loyalty and creative thinking increased once the company gave them more space to show their skills.
The main model used here was LDA, together with preprocessing techniques such as tokenization, stop-word removal, and stemming.
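The preprocessing pipeline can be sketched in a few lines. The stop-word list and the crude suffix-stripping stemmer below are simplified placeholders, not the resources actually used in the project (which worked on Spanish text and would feed the resulting tokens to LDA).

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "was"}

def tokenize(text):
    # Lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Crude suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The employees were complaining about the working conditions"))
```

Each survey response becomes a bag of stemmed tokens, which is exactly the input format a topic model such as LDA expects.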
I developed a tool to extract the trending words of a Twitter user. Each color represents the average sentiment of the sentences containing the word. In the left image, we can see how some words change their size according to their usage trend.
Analysis of search engine queries. Using tf-idf weights trained on the Spanish Wikipedia, we could understand how a user query is composed. This visualization uses t-SNE to interpret the data. Knowing the customers' behavior allows optimizing the indexing of products.
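The tf-idf weighting behind this analysis can be shown on a toy corpus. The documents below are invented stand-ins (the project used term statistics from the Spanish Wikipedia), and the resulting weight vectors are the kind of representation one would hand to t-SNE for the 2-D visualization.

```python
import math

# Toy corpus of tokenized "queries"; contents are illustrative only.
corpus = [
    ["compra", "zapatos", "baratos"],
    ["compra", "celular"],
    ["zapatos", "deportivos"],
]

def tf_idf(doc, corpus):
    # Term frequency within the document times inverse document frequency
    # across the corpus: rare terms get higher weight.
    n_docs = len(corpus)
    weights = {}
    for term in set(doc):
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        weights[term] = tf * math.log(n_docs / df)
    return weights

w = tf_idf(corpus[0], corpus)
print(w)
```

Note how "baratos", which appears in only one document, outweighs the common term "compra": that asymmetry is what lets tf-idf reveal the distinctive part of a user's query.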