Saturday, December 11, 2010

Using CrowdFlower for Sentiment Analysis

I recently used CrowdFlower for sentiment analysis of tweets (for CS 424P). I want to share a few tips that I learned along the way:

  1. Use gold units. This is probably the best feature of CrowdFlower. You can specify gold units (using the gold digger). The gold units are then randomly inserted into jobs. By default, a user needs to answer at least 4 gold units in order to become trusted. This is a great way to ensure the quality of the work.
  2. Target the appropriate country. For my project, I was analyzing sentiment towards Obama. Initially the jobs weren't targeted, so people outside the US were assessing the sentiment. I noticed that many people outside the US would provide bad labels, probably because the tweets were so US-centric that it was hard for outsiders to understand the context. After I targeted the US, the quality of the labels went up significantly.
  3. Use the gold report. In the gold report, workers sometimes provide feedback on the gold units. You can also find items with high disagreement there. I had made a couple of mistakes in my gold units, and I was able to find them through the gold report.
  4. Expect issues with the API. I kept hitting bugs with the API, which was the most frustrating part of CrowdFlower. For example, sometimes jobs would complete but their status would remain "Running" indefinitely. CrowdFlower confirmed that this was a bug. If you're going to use the API, you should start early so that bugs don't delay your project.
  5. Try your job as a worker on mturk.com. It's very educational to try your own job on mturk.com. You'll understand how the workers see your project, and how they get paid.
  6. Read the general Amazon Mechanical Turk tips. This is the only document I could find on best practices for crowdsourcing. I only read it after finishing my project; reading it beforehand would have saved me time. Instead, I ended up learning the same tips by trial and error.
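
To make tips 1 and 3 a bit more concrete, here is a rough Python sketch of the idea behind gold units: score each worker on the gold units, keep only the workers who pass, and then take a majority vote on the real tweets along with an agreement ratio (low agreement flags items worth a second look). This is not CrowdFlower's actual implementation or API. The function names, the `(worker_id, unit_id, label)` tuple layout, and the 0.7 accuracy threshold are all assumptions made for illustration; only the minimum of 4 gold units comes from the default mentioned above.

```python
from collections import Counter, defaultdict

MIN_GOLD_ANSWERED = 4    # default mentioned in tip 1
MIN_GOLD_ACCURACY = 0.7  # assumption: illustrative threshold, not a CrowdFlower default


def trusted_workers(judgments, gold):
    """Return the set of workers who answered enough gold units accurately.

    judgments: list of (worker_id, unit_id, label) tuples (hypothetical layout)
    gold: dict mapping unit_id -> correct label
    """
    answered = Counter()
    correct = Counter()
    for worker, unit, label in judgments:
        if unit in gold:
            answered[worker] += 1
            if label == gold[unit]:
                correct[worker] += 1
    return {
        worker
        for worker, n in answered.items()
        if n >= MIN_GOLD_ANSWERED and correct[worker] / n >= MIN_GOLD_ACCURACY
    }


def aggregate(judgments, gold):
    """Majority label and agreement ratio per non-gold unit, trusted workers only."""
    trusted = trusted_workers(judgments, gold)
    votes = defaultdict(Counter)
    for worker, unit, label in judgments:
        if worker in trusted and unit not in gold:
            votes[unit][label] += 1
    results = {}
    for unit, counts in votes.items():
        label, top = counts.most_common(1)[0]
        # Agreement is the share of trusted votes that went to the winning label.
        results[unit] = (label, top / sum(counts.values()))
    return results
```

Calling `aggregate(judgments, gold)` returns a dict from unit id to a `(label, agreement)` pair. Sorting by agreement in ascending order gives a quick list of the tweets (or, if you run the same idea over the gold units, the gold answers) that deserve a manual check.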
Overall, the experience with CrowdFlower was okay. I wish the documentation were a little better and the API more reliable, but it got the job done. Let me know if you have any questions or need any help.