Graduate Research Assistant @ MIT Media Lab

Affective Network

VQA Visual Question Answering

Work in progress ...

Masters thesis @ PUC

Harvard - Chile

2016 - 2017. I earned my master's degree at the Pontifical Catholic University of Chile (PUC). I worked under the supervision of Karim Pichara (PUC) and Pavlos Protopapas (Harvard University). They both strongly supported me throughout my work. During my master's, I spent four months working at Harvard University. This allowed me to learn from others and focus my research so I could conclude it successfully.


Best Computer Science Thesis Award

2017 - Pontifical Catholic University of Chile


A Full Probabilistic Model for Yes/No Type Crowdsourcing in Multi-Class Classification

Crowdsourcing has become widely used in supervised scenarios where training sets are scarce and hard to obtain. Most crowdsourcing models in the literature assume labelers can provide answers to full questions. In classification contexts, a full question means that a labeler is asked to discern among all the possible classes. Unfortunately, that discernment is not always easy in realistic scenarios. Labelers may not be experts in differentiating all the classes. In this work, we provide a full probabilistic model for a shorter type of query. Our shorter queries require only a "yes" or "no" response. Our model estimates a joint posterior distribution of matrices related to the labelers' confusions and the posterior probability of the class of every object. We develop an approximate inference approach using Monte Carlo Sampling and Black Box Variational Inference, where we provide the derivation of the necessary gradients. We build two realistic crowdsourcing scenarios to test our model. The first scenario queries for irregular astronomical time series. The second scenario relies on animal image classification. Results show that we can achieve results comparable to full query crowdsourcing. Furthermore, we prove that empirical bias plays an important role in modeling the labelers' failures. Finally, we provide the community with two real datasets obtained from our crowdsourcing experiments. All our code will be publicly available.

Thesis board

Before my thesis defense, I submitted a paper titled "A full probabilistic model for yes/no type crowdsourcing in multi-class classification" to the journal Data Mining and Knowledge Discovery. I am still waiting for the first review.


This picture was taken on my defense day. From left to right: Karim Pichara, Pavlos Protopapas, Belén Saldías, Denis Parra, and Alejandro Jara.

Crowdsourcing scenarios

Classifying stars - Catalina Surveys

We crowdsourced labels for an astronomical dataset. This contest lets the proposed model classify non-uniformly sampled time series. Several astronomers and engineers participated in this crowdsourcing task.

Classifying Animals

We tested the model in two different scenarios. We will release the full databases once the paper gets accepted. The animal contest is also available.

Overview

New Query Type

Our shorter queries require only a "yes" or "no" response instead of discerning among all the possible classes.

Probabilistic Graphical Model

The proposed model can be applied to any crowdsourcing context, reducing the amount of effort spent by the labelers.

The model works in two stages: Credibility estimation and Labels' posterior estimation.

NUTS - MCMC Sampling

We performed experiments on synthetic data, on classifiers acting as annotators, and on crowdsourced data from actual people. Our model converged in all of those scenarios.

Black Box Variational Inference

Due to the convergence time of the MCMC implementation, we decided to compare it with a custom implementation of BBVI (Ranganath et al., 2014). This approach tries to find a simple probability distribution that is closest (in KL divergence) to the true posterior distribution.
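As a rough illustration of the idea, here is a minimal sketch of black box variational inference on a toy problem. It is not the thesis model. It assumes a one-dimensional latent variable z ~ N(0, 1) with a single observation x | z ~ N(z, 1), so the exact posterior is N(x/2, 1/2) and the result can be checked. The function names, the fixed step size, and the mean baseline used to reduce variance are all illustrative choices.

```python
import math
import random

def log_joint(z, x):
    """log p(x, z) for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * z ** 2
    log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * (x - z) ** 2
    return log_prior + log_lik

def log_q(z, mu, log_sigma):
    """log density of the Gaussian variational family q(z) = N(mu, sigma^2)."""
    sigma = math.exp(log_sigma)
    return -0.5 * math.log(2 * math.pi) - log_sigma - 0.5 * ((z - mu) / sigma) ** 2

def bbvi(x, n_iters=500, n_samples=200, step=0.05, seed=0):
    """Maximize the ELBO with the score-function (black box) gradient estimator."""
    rng = random.Random(seed)
    mu, log_sigma = 0.0, 0.0  # start from q = N(0, 1)
    for _ in range(n_iters):
        sigma = math.exp(log_sigma)
        zs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Instantaneous ELBO term for each sample: log p(x, z) - log q(z).
        fs = [log_joint(z, x) - log_q(z, mu, log_sigma) for z in zs]
        baseline = sum(fs) / n_samples  # simple baseline to reduce variance
        grad_mu = 0.0
        grad_log_sigma = 0.0
        for z, f in zip(zs, fs):
            eps = (z - mu) / sigma
            # Score of q with respect to mu is eps / sigma;
            # with respect to log_sigma it is eps^2 - 1.
            grad_mu += (eps / sigma) * (f - baseline)
            grad_log_sigma += (eps ** 2 - 1.0) * (f - baseline)
        # Stochastic gradient ascent on the ELBO.
        mu += step * grad_mu / n_samples
        log_sigma += step * grad_log_sigma / n_samples
    return mu, math.exp(log_sigma)

posterior_mean, posterior_std = bbvi(x=2.0)
```

For x = 2.0 the fitted mean and standard deviation should approach the exact posterior values 1.0 and sqrt(0.5). The estimator only needs to evaluate log p(x, z) and the score of q, which is what makes the method "black box".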

Annotators' error estimation

MSE between each original Credibility Matrix row and its recovered estimation.

The fast convergence does not come with a sacrifice in accuracy, because our model can quickly detect most of the classes where the labelers have low confidence.
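The per-row MSE mentioned above can be sketched as follows. The matrices are made-up 3-class examples, not values from the experiments, and `row_mse` is a hypothetical helper name.

```python
def row_mse(true_matrix, estimated_matrix):
    """Return the mean squared error of each row of two equal-shaped matrices."""
    errors = []
    for true_row, est_row in zip(true_matrix, estimated_matrix):
        squared = [(t - e) ** 2 for t, e in zip(true_row, est_row)]
        errors.append(sum(squared) / len(squared))
    return errors

# Hypothetical original credibility matrix and a recovered estimate.
true_credibility = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
estimated_credibility = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]

per_row_mse = row_mse(true_credibility, estimated_credibility)
```

A row with a large error points to a class where the annotator's behavior was recovered poorly, while a row with zero error was recovered exactly.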

Future work

As future work, the model should allow the labelers' expertise to increase over time, which makes sense in real scenarios. Furthermore, objects shown in previous iterations may modify the way the labelers perceive our questions. We may also include an Active Learning algorithm.

Main thesis' contributions

Crowdsourcing query type

➢ Annotators do not need to discern among all the possible classes.

➢ Recovering labels from partial information.

➢ Not context dependent.

Implementation & Training importance

➢ Supervised stage for prior parameters estimation.

➢ We compare two different implementations.

➢ We present the equations needed to solve the model. Extensible to any similar model.

New data released

➢ We create two full Yes/No databases for further experiments.

➢ All our code and data will be publicly available as soon as the paper gets published.

Github public projects @bcsaldias

Courses' syllabus

I have been an administrator of several courses. I own some of their repositories, which you can find on my GitHub profile.

Courses' projects

Most of my work is publicly available. My repositories contain the code I developed during my engineering studies. I have developed projects using C, Java, C# .NET, JavaScript, Python, Django, and RoR.

Other Projects

Kaggle Contest - SFO Crime Detection

I am a fan of Kaggle competitions. I have tried a few solutions, shared my ideas, and learned from others.


My first submission was in 2015 for San Francisco Crime Classification, when I was just starting out in machine learning. I was in the top 5% of the leaderboard. I compared a soft-margin SVM with one-vs-all Logistic Regression, and the latter achieved a lower log-loss.
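For reference, multi-class log-loss can be computed as in the sketch below. The labels and predicted probabilities are invented for illustration and are not the competition data, and the model names are placeholders.

```python
import math

def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0) when a model is fully confident and wrong.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)

# Hypothetical true classes and class probabilities from two models.
labels = [0, 2, 1]
model_a_probs = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]
model_b_probs = [[0.5, 0.3, 0.2], [0.3, 0.3, 0.4], [0.4, 0.4, 0.2]]

loss_a = log_loss(labels, model_a_probs)
loss_b = log_loss(labels, model_b_probs)
```

Here `loss_a` is lower than `loss_b` because the first model puts more probability on the true class of every example. Lower is better, and confident wrong predictions are penalized heavily.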

Who is deciding for us?

Analysis of 2500 votes of the Chilean Senate chamber. I analyzed who influenced whom through national agreements. I applied data mining techniques to uncover dependent behavior among the senators, and visualization techniques to represent the insights found.


This project may be solved using correlation techniques, probabilistic models, association rules or decision trees.

Demand forecasting

With a group of 5 classmates, we evaluated the project of extending a public drugstore service in Peñalolén (commune of Santiago, Chile). 


The main challenge was to forecast the expected demand for the following 20 years. We built a database with information on almost every inhabitant of this commune. This database contained patient and illness information, competitors' prices, and the economic distribution of the population, among others. Finally, we recommended the optimal location to launch a new drugstore and maximize the commune's well-being.

Text analysis and clustering

Finding patterns in a database of complaints about a Chilean company. All employees were asked to respond to a survey, and I was in charge of reporting the results to the manager.


My conclusions and analysis helped the human resources manager to understand his workers better. Results showed that employees increased loyalty and creative thinking once the company gave them more space to show their skills.


The main techniques used here were LDA, together with preprocessing steps such as tokenization, stop-word removal, and stemming.
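The preprocessing steps named above can be sketched as follows. This is only an illustration: the stop-word list, the suffix rules, and the example sentence are hypothetical, and the stemmer is a naive suffix stripper rather than an established algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were",
              "and", "or", "of", "to", "in", "my", "about"}
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    """Strip one common suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

tokens = preprocess("The managers were ignoring complaints about delayed payments")
# tokens == ["manager", "ignor", "complaint", "delay", "payment"]
```

The resulting token lists are the kind of input a topic model such as LDA consumes, typically after being converted to word counts per document.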

Tweets' sentiment analysis and evolution in time

Developed a smart tool to obtain the trending words of a Twitter user. Each color represents the average sentiment of the sentences containing the word. In the left image, we can see how some words change size according to their usage trend.

How do the customers try to find their products?

Analysis of search engine queries. Using TF-IDF trained on Spanish Wikipedia, we could understand how a user query is composed. This visualization uses t-SNE to interpret the data. Knowing the customers' behavior allows us to optimize the indexing of the products.
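A minimal sketch of how TF-IDF weights the terms of a query is shown below. It assumes one common variant, with term frequency as a relative count and idf = log(N / df). The tiny corpus stands in for a large reference collection such as Wikipedia and is entirely made up, as are the function names.

```python
import math
from collections import Counter

def inverse_document_frequency(corpus):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf(query_tokens, idf):
    """Weight each query term by relative frequency times idf (0.0 if unseen)."""
    counts = Counter(query_tokens)
    total = len(query_tokens)
    return {term: (count / total) * idf.get(term, 0.0)
            for term, count in counts.items()}

# Hypothetical tokenized reference documents.
corpus = [
    ["zapatillas", "running", "hombre"],
    ["zapatillas", "mujer"],
    ["polera", "hombre"],
    ["zapatillas", "running", "mujer"],
]
idf = inverse_document_frequency(corpus)
weights = tfidf(["zapatillas", "running"], idf)
```

In this toy corpus "running" receives a higher weight than "zapatillas" because it appears in fewer documents, so it is the more informative part of the query.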