iOPENER

Our project iOPENER has been featured on NSF Discoveries!

Read more here!

 


What is Collective Discourse?

My research ideas have recently been mainly about how we can characterize and exploit collective discourse.
So what is collective discourse?

With the growth of Web 2.0, millions of individuals engage in collective discourse. They participate in online discussions, share their opinions, and generate content about the same artifacts, objects, and news events on Web portals such as amazon.com, epinions.com, imdb.com, and so forth. This massive amount of text is written on the Web mainly by non-expert individuals with different perspectives, and yet it exhibits accurate knowledge as a whole.

In social media, collective discourse is often a collective reaction to an event. Such a reaction to a well-defined subject emerges in response to an event (a movie release, a breaking story, a newly published paper) in the form of independent writings (movie reviews, news headlines, citation sentences) by many individuals.

A common characteristic of collective discourse, just like many other collective behaviors, is the diversity among the individuals engaging in it. This diversity emerges in the form of the diverse perspectives that different people have about the discussed topic.

The diversity of perspectives in non-expert contributions to collective discourse can be exploited to discover various aspects of a subject that are otherwise hard to unveil.

(Read More … )

Some Useful NLP Datasets (to be completed)

Reproducibility and Scientific Findings

What I liked most about ACL submissions this year was the opportunity to upload datasets and source code. Although it seems far-fetched to me that reviewers would try to reproduce the results reported in each paper, it at least, to some extent, encourages reporting transparent, reproducible results.

Thanks to Chris Brockett, who shared an interesting and relevant article a few days ago:
“John P. A. Ioannidis. Why Most Published Research Findings Are False. PLoS Medicine.” (A summary of this paper was written by the mathematician John Allen Paulos: http://abcnews.go.com/print?id=12510202)

What I liked about this paper was its six corollaries on the validity of scientific findings:

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
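
Roughly, these corollaries follow from a simple model in the paper: if R is the pre-study odds that a probed relationship is true, α the Type I error rate, and β the Type II error rate, then the probability that a claimed finding is actually true (its positive predictive value) is PPV = (1 − β)R / (R − βR + α). Lower power (larger β, as with small studies or small effect sizes) lowers PPV, which gives Corollaries 1 and 2; the later corollaries come from extensions of the same model that add bias and multiple competing teams.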

Corollary 6, which is my favorite, implies that the competitive nature of research and the urge to publish have had a negative impact on the quality of published research.

Finally, I am excited about ACL’s move toward requesting datasets, and I think we will start to see stronger measures in the (hopefully) near future.


Paradigm Shift

It seems natural that the focus of a research community changes over time. This focus, which presumably reflects people’s main interests, is easy to observe by reading the papers that appear within a community every year.

Well, ACL is no exception, with Statistical Machine Translation considered the popular topic these days. But was that always the case? To answer this question, I collected all the papers that appeared in ACL each year and found which article those papers cited the most. This should, to some degree, show what the ACL community has written about in each year.

Interestingly, the top-cited articles are about Dialogue and Discourse until the mid-90s; the focus then shifted to Parsing, and to Machine Translation in the past decade.

1979:     A Goal Oriented Model Of Human Dialogue
1980:     Ungrammaticality And Extra-Grammaticality In Natural Language Understanding Systems 
1981:     A Snapshot Of KDS A Knowledge Delivery System 
1982:     Linguistic Analysis Of Natural Language Communication With Computers
1983:     A Practical Comparison Of Parsing Strategies 
1984:     Relaxation Techniques For Parsing Grammatically Ill-Formed Input In Natural Language Understanding Systems 
1985:     Parsing As Deduction
1986:     Providing A Unified Account Of Definite Noun Phrases In Discourse
1987:     Categorical Unification Grammars
1988:     Attention, Intentions, And The Structure Of Discourse
1989:     Attention, Intentions, And The Structure Of Discourse
1990:     A Logical Semantics For Feature Structures
1991:     Attention, Intentions, And The Structure Of Discourse
1992:     Attention, Intentions, And The Structure Of Discourse 
1993:     Attention, Intentions, And The Structure Of Discourse 
1994:     Attention, Intentions, And The Structure Of Discourse 
1995:     Attention, Intentions, And The Structure Of Discourse
1996:     A Stochastic Parts Program And Noun Phrase Parser For Unrestricted Text
1997:     The Mathematics Of Statistical Machine Translation: Parameter Estimation
1998:     Building A Large Annotated Corpus Of English: The Penn Treebank
1999:     Building A Large Annotated Corpus Of English: The Penn Treebank
2000:     Building A Large Annotated Corpus Of English: The Penn Treebank 
2001:     The Mathematics Of Statistical Machine Translation: Parameter Estimation
2002:     A Maximum-Entropy-Inspired Parser 
2003:     The Mathematics Of Statistical Machine Translation: Parameter Estimation
2004:     The Mathematics Of Statistical Machine Translation: Parameter Estimation
2005:     A Maximum-Entropy-Inspired Parser 
2006:     The Mathematics Of Statistical Machine Translation: Parameter Estimation
2007:     Minimum Error Rate Training In Statistical Machine Translation
2008:     Minimum Error Rate Training In Statistical Machine Translation
2009:     Minimum Error Rate Training In Statistical Machine Translation
2010:     Moses: Open Source Toolkit for Statistical Machine Translation
  

(Raw data from http://clair.si.umich.edu/clair/anthology/)
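
For anyone curious how such a list could be produced, here is a minimal Python sketch. It assumes two hypothetical tab-separated files derived from the AAN dump linked above (a paper metadata table and a citation edge list); the file names, columns, and the "ACL" venue label are illustrative, not the actual AAN layout.

from collections import Counter, defaultdict
import csv

# Hypothetical inputs (illustrative names and columns, not the real AAN files):
#   papers.tsv    : paper_id <TAB> year <TAB> venue <TAB> title
#   citations.tsv : citing_id <TAB> cited_id

# Load paper metadata: id -> (year, venue, title).
meta = {}
with open("papers.tsv", encoding="utf-8") as f:
    for paper_id, year, venue, title in csv.reader(f, delimiter="\t"):
        meta[paper_id] = (int(year), venue, title)

# For each year, count how often each article is cited by that year's ACL papers.
counts = defaultdict(Counter)  # year -> Counter over cited paper ids
with open("citations.tsv", encoding="utf-8") as f:
    for citing_id, cited_id in csv.reader(f, delimiter="\t"):
        if citing_id in meta and cited_id in meta and meta[citing_id][1] == "ACL":
            counts[meta[citing_id][0]][cited_id] += 1

# Report the single most-cited article per year.
for year in sorted(counts):
    cited_id, n = counts[year].most_common(1)[0]
    print(year, meta[cited_id][2], f"({n} citations)")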

Pot calling the kettle black

I knew that lexical choice leads to a diversity in the ways people talk about the same thing within one language.

However, it was interesting to find out how different cultures use the same idiom, but with different wordings. Being familiar with a few of these cultures, I can tell where, for instance, the blind in Azeri, the camel in Arabic, and the pot in English, Turkish, and Persian come from.

  • English: “Pot calling the kettle black.”
  • Arabic: “The camel cannot see the crookedness of its own neck.”
  • Azeri: “If a blind man doesn’t point out to the other blind man that he’s blind, he’ll die.”
  • Basque: “The blackbird to the crow: Black tail”
  • Burmese: “The son is one month older than the father.”
  • Hindi: “The thief scolding the magistrate in reverse.”
  • Indonesian: “The thief shouting robber.”
  • Chinese: “50 steps laugh at [those who retreated] 100 steps.”
  • Dutch: “The pot reproaches the kettle for looking black.”
  • French: “The hospital mocks the charity.”
  • German: “One donkey calls the other one long ears.”
  • Greek: “The donkey said to the rooster: ‘Your head is too big.’”
  • Hungarian: “The owl says the sparrow has a large head.”
  • Irish: “That is the pot calling the kettle black.”
  • Italian: “The ox calling the donkey horned.”
  • Persian: “The pot tells the other pot your face is black.”
  • Portuguese: “The pig talking about the bacon.”
  • Romanian: “Potsherd laughs at the cracked pot.”
  • Spanish: “The donkey talking about ears.”
  • Turkish: “One pot saying to another pot, your bottom is black.”
  • Urdu: “The thief scolding the magistrate in reverse.”
  • Vietnamese: “The dog ridicules the cat for being hairy.”

The complete list, with translations, is available at
http://en.wikipedia.org/wiki/Pot_calling_the_kettle_black

 
