How do you catch and archive deleted posts on Sina Weibo?

It’s the holy grail of any media researcher working on China: how do you quantify content removal from social media services such as Sina Weibo? Here at JMSC, we’ve been developing tools to scour and assess social media of all sorts for the purpose of researching online media in Hong Kong and mainland China. (This is the same project that generated WeiboScope, for those tuned in.)

Screenshot at 2012-02-08 11:56:04

With the extensive archive of weibos we’ve accumulated so far and mechanisms underlying its retrieval, we were able to develop a routine that finds and marks deleted posts (method explained here).

The result is an archive of deleted posts (and the CMP’s Anti-Social List). Not only can we find a large number of such posts (not an exhaustive list, we admit) within a day of their removal, but we can also presume, with clear-cut evidence, whether a post was removed by the user or deleted by system managers (we first noticed the difference in August 2011…).

As described in a post last week, the idea behind this archive is simple and straightforward to implement, once you’ve got the infrastructure.

deleted_weibo_previous
A previous copy of the user timeline, containing all posts

deleted_weibo_current
A current copy of the user timeline, with a missing post

Both copies of a user timeline (post IDs extracted from the full JSON response) are obtained from two consecutive API calls, which may be a few minutes or several hours apart. The shorter the interval between polls, the more precisely the routine can pin down the removal time, and the smaller the chance of missing something (a post published and deleted between two polls is never seen at all).

A post is found to be deleted when we could see it in the previous version, but not in the current one. Since we keep a copy of every post we see, we simply mark the post as deleted, and can then view all such posts on a custom webpage. Easy enough, right?
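As a minimal sketch of that comparison (the function name and the sample IDs are made up for illustration, and this is not necessarily how our own routine is written), the check boils down to a set difference between two lists of post IDs:

```python
def find_deleted(previous_ids, current_ids):
    """Return IDs seen in the previous snapshot but missing from the current one.

    The order of the previous snapshot is preserved in the result.
    """
    current = set(current_ids)
    return [post_id for post_id in previous_ids if post_id not in current]


# Hypothetical snapshots of one user timeline, newest post first.
previous = ["3410", "3409", "3408", "3407"]
current = ["3411", "3410", "3408", "3407"]
print(find_deleted(previous, current))
```

A new post ("3411") is ignored; only "3409", which vanished between the two polls, is reported.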

Screenshot at 2012-02-08 12:14:35

Screenshot at 2012-02-08 12:14:26

The previous two images show exactly what, from a programmer’s point of view, Sina Weibo returns for two different kinds of deleted posts. The former, with an API response “weibo does not exist”, identifies a post that was presumably deleted by the user. The latter, which returns “permission denied”, is presumably a post deleted by the system.
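A rough sketch of how the two responses could be told apart in code. The JSON shape and the "error" field name are assumptions made for illustration; only the two message strings come from the responses described above, and the user/system reading remains our presumption.

```python
import json


def classify_deletion(response_text):
    """Guess who removed a post from the API error message.

    Assumes (hypothetically) that the error text sits in an "error" field.
    """
    payload = json.loads(response_text)
    message = str(payload.get("error", "")).lower()
    if "does not exist" in message:
        return "presumably deleted by user"
    if "permission denied" in message:
        return "presumably deleted by system"
    return "unknown"


print(classify_deletion('{"error": "permission denied"}'))
```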

We don’t know the intention behind the two types of messages, but we can guess based on what the posts generally contain. The first type is mostly spam-like posts that would be deleted on any online social network in the world. The second seems to contain more legitimate content, including posts by so-called VIP users verified by Sina (we only check a sample of 2,500-odd users, so we can’t really infer how representative they are).

This feature is indeed powerful, because it finally puts a number on post removal on Sina Weibo. (We computer science majors strongly dislike conclusions not based on numbers and data.)

It is however currently impossible to tell with certainty what gets deleted (versus what doesn’t), since our user sample is strongly biased towards public commentators, and perhaps also because the number of posts found is still extremely small.

What it does give is an understanding of how post removal works, how much time it usually takes for something to be removed, and whether a post is deleted along with its reposts, or only the reposts and not the original (in fact, it happens). It’s a privileged peek, indeed, at what’s going on on the Chinese Internets, right here, right now.

Until we accumulate an archive substantial enough to do anything useful, our colleagues at China Media Project have started compiling and explaining deleted posts on their Anti-Social List.


Sina Weibo deleted posts archive

Since Sina Weibo has a pretty good API, and since we do download lots of data every day, it just makes good sense to keep an archive of deleted posts.

The strategy is very straightforward and only incurs a negligible extra number of hits against the API:
- Call the statuses/user_timeline function for each user in your list (we have 2,500 in a sub-list).
- Extract the IDs of all 200 posts in the response and save as a text file, one ID per line. They are already ordered chronologically.
- You should have a previous list of IDs. Use diff to compare both files.
- Loop through the output of the diff. Mark all the IDs that appear in the previous version, but not the new one.
- Those IDs are the deleted posts.
- Mark them, and send your alert, etc. (We also hit the API again on statuses/show to double-check if the post was really deleted.)
- Overwrite the old ID list with the new one.
- Repeat whenever you can fetch a new version of the timeline (you might be rate-limited by Sina if you do it too often).
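
The file-based steps above can be sketched in Python, here using the standard difflib module in place of the diff command. The function name and file layout are illustrative assumptions, and fetching the timeline itself is left out.

```python
import difflib
from pathlib import Path


def update_id_list(path, new_ids):
    """Diff new_ids against the ID list saved at path, then overwrite it.

    Returns the IDs present in the saved list but absent from new_ids.
    """
    path = Path(path)
    old_ids = path.read_text().split() if path.exists() else []
    matcher = difflib.SequenceMatcher(None, old_ids, new_ids, autojunk=False)
    removed = []
    for tag, old_start, old_end, _, _ in matcher.get_opcodes():
        # "delete" and "replace" both cover lines that exist only in the old file.
        if tag in ("delete", "replace"):
            removed.extend(old_ids[old_start:old_end])
    # One ID per line, as in the steps above.
    path.write_text("\n".join(new_ids) + "\n")
    return removed
```

Note that old posts pushed out of the 200-post window by newer ones also show up as removed here, which is one more reason to double-check each candidate against statuses/show before marking it.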


Searching our Sina and QQ Weibo archive

Screenshot at 2012-01-19 17:15:45

We built a search engine a while ago for our Sina Weibo archive, and since yesterday, also for the QQ Weibo archive. We use Lucene as the indexer (to do quick full-text searches) and then store all linked information in our standard database. The difference with the real search engines provided on the Sina and QQ Weibo websites is that we don’t currently implement any weighting, and the results are just everything we got, ordered by publication date.

We index every four hours, so there’s at least a 30-minute delay, and at most around 4 hours 30 minutes. There’s paging, too. Because we’re not Google, be sure to understand that queries normally take up to 1 minute to run (more if there’s lots of activity on the server). The search by region / province on the Sina search is also uber-slow.

Cool feature: you can link directly to searches! For instance, if you were interested in racing celebrity Han Han (韩寒) who has been under fire recently, you may use links such as these:
http://research.jmsc.hku.hk/social/search.py/qqweibo/?q=韩寒
http://research.jmsc.hku.hk/social/search.py/sinaweibo/?q=韩寒
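
If you build such links from a script, it is safer to percent-encode the query. A small sketch follows; the helper name is made up, and only the base URL and the q parameter come from the links above.

```python
from urllib.parse import quote

BASE_URL = "http://research.jmsc.hku.hk/social/search.py"


def search_link(network, query):
    """Build a direct search link; network is "sinaweibo" or "qqweibo"."""
    return f"{BASE_URL}/{network}/?q={quote(query)}"


print(search_link("sinaweibo", "韩寒"))
```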

Other cool feature: Google Translate! Write your search query in your language, and behind the scenes, we’ll try to send a query to the Google Translate API. You’ll know whether it worked when you get your results.


Ma Ying-jeou in pictures on WeiboScope

When you check out WeiboScope today, what you will notice are Yao Ming with sleeping delegates at the Shanghai CPPCC and Ma Ying-jeou, Taiwan’s president re-elected for a second 4-year term on Saturday (there’s also this weird meme of Zhou Qifeng, Peking University’s president, grinning uncontrollably alongside Li Keqiang, China’s Vice-Premier).

mayingjeou

But what really caught my eye were all the photos sitting at the top of our Sina Weibo data stack, which probably number in the thousands once you count down to the bottom of the stack. The image wall above was generated using the image search portion of WeiboScope. One photo in particular is making the rounds: Ma in the US with his future wife, Zhou Meiqing.


Le bogue de l’an 2012

In case people noticed, the quality of our WeiboScope declined quite a bit towards the end of last week. It was simply caused by the passage to the first ISO week of 2012 (which started on Monday). Consequently, only the most popular posts made in 2011 were counted. We didn’t lose anything, and things are back on track.
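
For the curious, here is a sketch of the general pitfall, assuming (the details of our own bug are not spelled out here) that posts are bucketed by a bare ISO week number. Week 1 of 2012 then sorts before week 52 of 2011; pairing the week with its ISO year avoids this.

```python
from datetime import date


def week_key(day):
    """Return (ISO year, ISO week), which keeps increasing across New Year."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


last_day_2011 = date(2011, 12, 31)
first_monday_2012 = date(2012, 1, 2)

# A bare week number goes backwards at the year boundary...
print(first_monday_2012.isocalendar()[1] < last_day_2011.isocalendar()[1])
# ...while the (year, week) pair still sorts in chronological order.
print(week_key(first_monday_2012) > week_key(last_day_2011))
```

Also worth noting: Sunday 1 January 2012 still belongs to ISO week 52 of 2011, so the calendar year and the ISO year of a date can differ.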

Featured today on WeiboScope:
- The rumoured coup in North Korea
- 36 years since the death of Zhou Enlai
- Some corrupt officials at the D & G in Hong Kong?


WeiboScope: image search by keyword

weibosearch

You might have heard of WeiboScope for its display of the most important images posted by a sample of users we selected. “So what?”, some users have asked. WeiboScope is a suite of visualisation tools for an archive of Sina Weibo posts that we collect and store in a local database, at a rate that may currently be in the range of 2-3 million posts per week.

But the power of WeiboScope is not this particular visualisation (because there are many of them), but rather the data underneath that sustains it. Rather than let Sina Weibo dictate the way the data produced by users should be displayed, we borrow a bit from the open data movement and repackage posts in ways that may be a bit more useful to users. This is how a WeiboScope search by image came to be.

Consider these current use case scenarios:

1- A non-Chinese reader would like to know what the Chinese Weibosphere is now thinking about the death of Kim Jong-il. They can decide to type the Chinese name of Kim Jong-un in the search bar on Weibo.com and find a list of about 25 weibos. But because they are unable to read Chinese, they rely only on images. They feel lost, and give up on Weibo (for the day).

2- A person who has a native level of Chinese is doing research on suicide. Some cases reportedly go viral on the Internet, sometimes because they are fake and attention-seeking, and sometimes because their causes provoke deep societal debates. The researcher uses the search bar on Weibo.com, finding some irony and some irrelevant news. It is hard for him to assess the importance of one such case with respect to others within a certain period of time.

Now, consider that we had a sample of all Weibos ever produced and that our search engine is neutral as to what gets shown and what does not.

Scenario 1: Using the image search on WeiboScope, you can now find that one of the most popular images used in posts was this one. But then, by visual elimination, you may also notice some more odd pictures such as this one speculating on the younger Kim’s Christmas activities.

Scenario 2: Using the image search on WeiboScope, the researcher searches the word “suicide”. In March 2011, we tried this with an early version of this tool. Just out of curiosity, we had learned of a schoolchildren suicide case in Fujian through the popular image aggregation. At that point, we had only seen one post that made it to viral level. We were curious about the impact of this case on the Chinese Internet, so we searched the characters for “suicide” on the search engine. The result? About 80% of the recent posts with the characters for “suicide” were related to the Fujian case.

The WeiboScope image search demonstrates that when you are allowed to mash, mix and remix data, it may lead to some discoveries and realizations that might not have been possible otherwise.

http://research.jmsc.hku.hk/social/obs.py/sinaweibo/#search

(For non-Chinese writers, the engine supports some automated Google Translate translation! For people searching in Chinese characters, please use quotes around your characters.)


Android Ice Cream Sandwich on a Pandaboard

I got myself a Pandaboard a few weeks ago. It’s a community-supported fan-less board sponsored by Texas Instruments. It sports their OMAP 4430 chip, which is the same CPU as found in the Kindle Fire, Motorola Droid RAZR and a bunch of other smartphones (the Galaxy Nexus has its successor, the 4460).

So, when I first got it, I installed Ubuntu Linux (Linaro) on it. I used it very little, and sort of figured out the basics, such as installing an Apache webserver, but finally noticed that it lacked support for things I wanted to try out, such as Google Video Chat (which is not yet available for the ARM architecture, the one commonly found in most smartphones today).

So, I instead followed instructions in a YouTube video from the Pandaboard website that said you could install Android 4.0. And it turns out you could, by following the instructions (you can find clearer instructions on the Web). So, now I have Android 4.0 Ice Cream Sandwich on the board… Next step is to figure out how to get (or just wait for) the Google Apps (Gmail, Gtalk, etc.), and support for basic hardware such as video and audio capture.

In terms of media and journalism, there is perhaps some potential to create new ways to interact with information, by plugging in a projector and some sensors to detect human input. In computing power, the Pandaboard is probably as powerful as a top-of-the-line smartphone, yet at a much lower price of US$178 (but then, no flashy touchscreen). The form factor is interesting for embedded systems, which is something I only discovered this year.


Google+ API crawler in Python and a few remarks to start with

We’ve started working on tools to crawl the newly released Google+ API. I got an e-mail notifying us of the availability of the API on September 16th. I think we’re the first ones to write third-party tools to download and cache some of the data.

I’ll post the database schema later when it’s more stable.

For now, the API is read-only, and we’re limited to 1000 requests/day. Since it is a first release, I was keen on collecting data right away, in case the terms change.
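
A crawler has to budget for that limit. Here is a minimal sketch of a daily request counter; the class is hypothetical and assumes the quota resets when the calendar date changes, which we have not verified for this API.

```python
from datetime import date


class DailyQuota:
    """Count requests per calendar day and refuse any beyond the limit."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_acquire(self, today=None):
        """Return True and count one request if the day's budget allows it."""
        today = today or date.today()
        if today != self.day:
            # Assumed reset rule: a new calendar day starts a fresh budget.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

The crawler would call try_acquire() before each request and go back to sleep when it returns False. Spread evenly, 1000 requests per day is one request every 86.4 seconds.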

The API is interestingly minimalistic: People, Activities and Comments are the three data types you can search, list and get. There are many other types of data, but they are attached to the aforementioned. For instance, a “People” can have several organisations, urls, placesLived and emails, although I don’t think the latter is available with the current version of the API.

As far as People are concerned, you may also get a hasApp (for the mobile app, we guess), languagesSpoken (an array of strings) and even an intriguing currentLocation (Latitude/Maps integration, someone?). It’s interesting, but it’s also scary, from a user’s point of view, how much publicly accessible information there is.


The trouble with popular users…

Screenshot at 2011-11-17 09:47:06

At some point in our research project, it was a good idea to take all the users with more than a certain arbitrarily large number of followers (say, 1000), download their posts and analyze them. This doesn’t always seem to be the case anymore. Results vary from day to day.

We are set to release WeiboSphere, but will wait a little before pushing it. Right now, we’re taking every user with 1000 or more followers and getting all their recent posts from the API. We aggregate and produce an unfiltered (at least not with a human filter) classification of the most popular posts by 24 hours, 48 hours, week, two weeks and one month.
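
To give an idea of that aggregation, here is a rough sketch. The field names, the use of repost counts as the popularity measure and the 30-day month are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# The five windows named above; "month" is approximated as 30 days.
WINDOWS = {
    "24h": timedelta(hours=24),
    "48h": timedelta(hours=48),
    "week": timedelta(days=7),
    "two_weeks": timedelta(days=14),
    "month": timedelta(days=30),
}


def top_posts(posts, now, window, limit=10):
    """Return the most reposted posts created within `window` before `now`."""
    cutoff = now - window
    recent = [post for post in posts if cutoff <= post["created_at"] <= now]
    return sorted(recent, key=lambda post: post["reposts"], reverse=True)[:limit]
```

Calling top_posts(posts, datetime.now(), WINDOWS["week"]) would give the weekly ranking. Nothing here filters on content, which is exactly why shoes and celebrities can float to the top.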

Alas, in the last two days, all we’re seeing are female body parts, shoes and celebrities who returned to an incredibly thin size after a pregnancy.

The hope for now, until we improve the filters, is that we can see posts such as this one on an abducted girl in Guangzhou, posted yesterday morning.


Spawn more overlords?

Lucene -> Daemon

One of the biggest challenges in the project has been I/O. Across the networks that we check, we deal with large amounts of data that we need to write and read at every moment.

Lucene is a quick way to search through text, including Chinese text. We used to rely on the database to do this, but it quickly turned out to be terribly inefficient. To do a search, you had to visit every row (within the parameters given) and check whether a term appeared.

We asked our HKU colleagues in the computer science department for help, namely Reza Sherkat, a former IBM employee, now a post-doctoral fellow with Nikos Mamoulis. He had previously given us advice on inverted indexes, which in a nutshell use tokens of text (from the weibos, say) as keys in a gigantic array. The value stored under each key is the list of identifiers of the rows or objects being indexed that contain the token (for weibos, these would be the weibo IDs).

So, when you search a word, you effectively only go through a list of unique words/tokens, which returns a bunch of weibo IDs.
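
A toy version of the idea, for illustration only: Lucene does far more than this, and the whitespace tokenizer used here is an assumption that would not segment Chinese text properly.

```python
from collections import defaultdict


def build_index(weibos, tokenize=str.split):
    """Map each token to the set of weibo IDs whose text contains it."""
    index = defaultdict(set)
    for weibo_id, text in weibos.items():
        for token in tokenize(text.lower()):
            index[token].add(weibo_id)
    return index


def search(index, term):
    """Look up one token; no row of the original table is visited."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical posts keyed by weibo ID.
weibos = {1: "Han Han racing", 2: "racing news", 3: "weather report"}
index = build_index(weibos)
print(search(index, "racing"))
```

The cost of a search now depends on the number of distinct tokens and matching posts, not on the total number of rows.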

The second trick Reza told us about was the use of programs running in the background, or what are commonly called “daemons”. Like daemons, they are always there, waiting for a program to call them. One use we could (or should) make of them would be, for instance, to keep a list of user IDs in memory. If you want to know whether a weibo was made by a user, no need to go to the database to check. You can do all of that in memory.
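
Here is a bare-bones sketch of such a daemon, using Python's standard socketserver module. The line-based protocol, the function names and the sample IDs are all invented for illustration; this is not what we run in production.

```python
import socket
import socketserver
import threading

# Hypothetical user IDs, loaded into memory once when the daemon starts.
USER_IDS = {"1197161814", "1182391231"}


class LookupHandler(socketserver.StreamRequestHandler):
    """Read one user ID per connection and answer "1" (known) or "0"."""

    def handle(self):
        user_id = self.rfile.readline().decode("utf-8").strip()
        answer = "1" if user_id in USER_IDS else "0"
        self.wfile.write((answer + "\n").encode("utf-8"))


def start_daemon(host="127.0.0.1", port=0):
    """Serve lookups in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), LookupHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def is_known_user(address, user_id):
    """Ask the daemon whether user_id is in its in-memory set."""
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall((user_id + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return reply.readline().strip() == "1"
```

A client would call is_known_user(server.server_address, some_id) on the object returned by start_daemon(). The set lookup itself takes constant time and never touches the database.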

There are probably some more clever uses, such as for counting or going through large numbers of items.

It is said, for instance, that Google and Facebook owe much of their efficiency to keeping the data that passes through in memory. The problem with memory is that it requires an electric current to stay alive: one power outage (which we think should never happen) and the data dies.

Operating in memory (in RAM, that is) is much, much faster than having to fetch from a disk. It should make a difference, and we shall try it on our 48 GB of RAM.