This blog has moved elsewhere!

To make content easier to share, I’ve decided to move my personal work blog to Tumblr. Thus, The Rice Cooker has now become The Electric Rice Cooker.

Additionally, I’ve started working on the Data Journalism Lab, a project at JMSC currently in its pilot phase as of Fall 2012. And yes, we also have a blog on Tumblr: http://datajournalismlab.tumblr.com.


Public services shutdown

Unfortunately, we found that our setup is not optimized to handle external requests to our search tool and the other tools mentioned in previous posts. If you need data, please contact Dr. King-wa Fu directly: kwfu@hku.hk


Weiboball

A year and a half ago, I did a project called Twitterball for one of my classes at PolyU on information architecture. It played on the idea of visualising the frequency of tweets over time.

I quickly remashed Twitterball into Weiboball, using data from our archive and the search engine built with WeiboScope Search. The result is Weiboball and Weibobubble.


Sina Weibo: Zombie accounts, slain entries, and now, ghostly posts

While inspecting a potential hiccup in our Sina Weibo deleted posts monitoring system (see the ASL / see article explaining the method), we discovered a post that had seemingly been deleted, but which still occasionally appeared on the API.

It was a repost made on March 2nd by David Bandurski of CMP, of a Hu Shuli post commenting on the Wang Lijun incident on the same day. The repost was archived, but can no longer be found on David’s timeline. Ms. Hu’s post was still alive on Weibo when I checked (and we have it archived, just in case).

The strange thing was that this post was not marked as “permission denied” and was fully available when queried with the “statuses/show” function on the Weibo API (link, requires login). On the other hand, the user_timeline function, which lists the latest 200 posts made by a given user, reported that the post no longer existed.

Even more bizarre, when you searched on Weibo.com for David’s posts in a date range containing March 2nd, the search page claimed 5 results, whereas only 4 were actually returned! This is shown in the following image:

For our monitoring tool, specifically the user_timeline function that casts the net for deleted posts, we have been using the first version (V1) of the API (because of rate limiting issues). V1 seems to me like the abandoned entrance to a data store whose layers of visibility (and invisibility) keep getting more complex. These layers don’t seem to be supported with 100% fidelity on V1, which returns copies of the user timeline that sometimes contain the “ghostly” post and sometimes don’t (probably depending on which physical server you end up accessing).
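For the curious, here is a minimal sketch of how one might spot such ghostly posts from a series of timeline snapshots. It is not the code behind our monitoring tool: the function name `find_reappearing_ids` and the snapshot format (plain lists of post IDs, oldest fetch first) are assumptions made for illustration.

```python
def find_reappearing_ids(snapshots):
    """Return IDs that were seen, then went missing, then showed up again.

    `snapshots` is a list of ID lists, one per timeline fetch, oldest first
    (a hypothetical format chosen for this sketch).
    """
    seen = set()        # every ID observed so far
    vanished = set()    # IDs seen earlier but absent from the latest fetch
    reappeared = set()  # IDs that came back after vanishing

    for snapshot in snapshots:
        current = set(snapshot)
        # Anything that had vanished and is visible again is "ghostly".
        reappeared |= vanished & current
        # Update the vanished set: add newly missing IDs, drop returned ones.
        vanished = (vanished | (seen - current)) - current
        seen |= current

    return reappeared
```

A post that comes and goes between fetches would end up in the returned set. Keep in mind that a timeline only holds the latest 200 posts, so an old post can also drop off the end and come back without anything ghostly going on.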

Not only are there “permission denied” or “weibo does not exist” posts, as described in our previous post about the methods behind the tool, but there are now posts wavering between publication and non-publication (at least from the weibo.com website’s point of view, they are dead).

Anecdotally, I noticed that when a post is marked “permission denied”, its dependent posts (by people who reposted it) are usually marked as deleted with “permission denied” as well. In the case of the Hu Shuli post, the original post was alive and well, while the repost was deleted.


China News Archive

Ever looked for an automated archive of Chinese news websites? For months, we’ve been collecting screenshots and HTML snapshots of up to 20 websites based in China or covering China. We now have a webpage for it.

http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots/

The screenshots are classified by news source, with a minimalistic (if not just minimal) interface, organised by day and grouped by month. For instance, you could go to the QQ News archive, to an archive for February 2011, or to a specific link for today.

There’s also a version for accessing navigable HTML pages when available.


Spam spam spamalot

I’ll start filtering out weibos that contain links to photo.weibo.com and event.weibo.com. Most posts containing such links seem to lead to spam-like galleries. Bad, bad, mega-bad. The amount of spam is just ridiculous today.
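A filter like this could look roughly as follows. This is only a sketch and not our production code: the names `SPAM_HOSTS`, `has_spam_link` and `drop_spam` are made up for the example, and posts are assumed to be plain text strings with the links written out in full.

```python
import re
from urllib.parse import urlparse

# Hosts named in this post; everything else here is an assumption.
SPAM_HOSTS = {"photo.weibo.com", "event.weibo.com"}

# Match http(s) URLs made of printable ASCII, so that Chinese text glued
# to the end of a link is not swallowed into the URL.
URL_PATTERN = re.compile(r"https?://[\x21-\x7e]+")


def has_spam_link(text):
    """Return True if the text contains a link to one of the spam hosts."""
    for url in URL_PATTERN.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # malformed URL, ignore it
        if host in SPAM_HOSTS:
            return True
    return False


def drop_spam(posts):
    """Keep only the posts that do not link to a spam host."""
    return [post for post in posts if not has_spam_link(post)]
```

Note that this only sees literal URLs. A link hidden behind a URL shortener would have to be expanded first.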


How do you catch and archive deleted posts on Sina Weibo?

It’s the holy grail of any media researcher working on China: how do you quantify content removal from social media services such as Sina Weibo? Here at JMSC, we’ve been developing tools to scour and assess social media of all sorts for the purpose of researching online media in Hong Kong and mainland China. (This is the same project that generated WeiboScope, for those tuned in.)

Screenshot at 2012-02-08 11:56:04

With the extensive archive of weibos we’ve accumulated so far and the mechanisms underlying its retrieval, we were able to develop a routine that finds and marks deleted posts (method explained here).

The result is an archive of deleted posts (and the CMP’s Anti-Social List). Not only can we find a large number of such posts (not an exhaustive one, we admit) within a day of their doom time, we can also presume, based on clear-cut evidence, whether a deleted post was removed by the user or deleted by system managers (we first noticed the difference in August 2011…).

As described in a post last week, the idea behind this archive is simple and straightforward to implement, once you’ve got the infrastructure.

deleted_weibo_previous
A previous copy of the user timeline, containing all posts

deleted_weibo_current
A current copy of the user timeline, with a missing post

Both copies of a user timeline (post IDs extracted from the full JSON response) are obtained during two consecutive API calls, which may be a few minutes or several hours apart. The shorter the interval between polls, the more precisely the routine can pin down the removal time (and the smaller the chance of missing something).

A post is found to be deleted when it appears in the previous version, but not in the current one. Since we keep a copy of every post we see, we simply mark this post, and can then view all marked posts on a custom webpage. Easy enough, right?
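To make the comparison concrete, here is a rough sketch of it, including the time window in which each removal must have happened. The function name `find_deleted`, the tuple format and the post IDs below are all hypothetical; our actual routine works from the archived API responses.

```python
from datetime import datetime


def find_deleted(previous_poll, current_poll):
    """Compare two polls of one user timeline.

    Each poll is a (poll_time, list_of_post_ids) tuple, a format assumed
    for this sketch. Returns one record per post that went missing.
    """
    prev_time, prev_ids = previous_poll
    curr_time, curr_ids = current_poll
    still_there = set(curr_ids)

    return [
        {
            "id": post_id,
            # The removal happened somewhere between the two polls.
            "removed_after": prev_time,
            "removed_by": curr_time,
            "window": curr_time - prev_time,
        }
        for post_id in prev_ids
        if post_id not in still_there
    ]


# Made-up example: post "104" disappears between two polls 30 minutes apart.
previous = (datetime(2012, 2, 8, 11, 0), ["105", "104", "103"])
current = (datetime(2012, 2, 8, 11, 30), ["105", "103"])
deleted = find_deleted(previous, current)
```

The `window` value shows why shorter polling intervals help: it is the best precision we can claim for the removal time. This sketch does not deal with old posts that merely scroll out of the fetched timeline.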

Screenshot at 2012-02-08 12:14:35

Screenshot at 2012-02-08 12:14:26

The previous two images show you exactly what, from a programmer’s point of view, Sina Weibo returns to us for two different kinds of deleted posts. The former, with an API response “weibo does not exist”, identifies a post that was presumably deleted by the user. The latter, which returns “permission denied”, is presumably a post deleted by the system.
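In code, telling the two kinds apart could be as simple as the sketch below. It assumes the error text has already been pulled out of the API response as a string (the exact response structure is not reproduced here), and the function name and return labels are invented for the example.

```python
def classify_deletion(error_message):
    """Guess who removed a post from the API's error text.

    The two message strings come from the API responses described above;
    the labels returned here are our own presumption, not Sina's wording.
    """
    message = error_message.lower()
    if "weibo does not exist" in message:
        return "deleted by user (presumed)"
    if "permission denied" in message:
        return "deleted by system (presumed)"
    return "unknown"
```

Anything that matches neither message is left as "unknown" rather than forced into one of the two buckets.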

We don’t know the intention behind the two types of messages, but we can guess from what their contents generally are. The first are generally spam-like posts that would be deleted on any online social network in the world. The second seem to carry more legitimate content, including some posts by so-called VIP users verified by Sina (we only check a sample of 2,500-odd users, so we can’t really infer how representative they are).

This feature is indeed powerful, because it finally puts a number on post removal on Sina Weibo. (We computer science majors strongly dislike conclusions not based on numbers and data.)

It is however currently impossible to tell with certainty what gets deleted (versus what’s not), since our user sample is strongly biased towards public commentators, and perhaps because the number of posts found is still extremely small.

What it does give is an understanding of how post removal works, how much time it usually takes for something to be removed, and whether reposts get deleted along with their original posts, or on their own while the original stays up (in fact, it happens). It’s a privileged peek, indeed, at what’s going on on the Chinese Internets, right here, right now.

Until we accumulate an archive substantial enough to do anything useful, our colleagues at China Media Project have started compiling and explaining deleted posts on their Anti-Social List.


Sina Weibo deleted posts archive

Since Sina Weibo has a pretty good API, and since we do download lots of data every day, it just makes good sense to keep an archive of deleted posts.

The strategy is very straightforward and only incurs a negligible extra number of hits against the API:
- Call the statuses/user_timeline function for each user in your list (we have 2,500 in a sub-list).
- Extract the IDs of all 200 posts in the response and save as a text file, one ID per line. They are already ordered chronologically.
- You should have a previous list of IDs. Use diff to compare both files.
- Loop through the output of the diff. Mark all the IDs that appear in the previous version, but not the new one.
- Those IDs are the deleted posts.
- Mark them, and send your alert, etc. (We also hit the API again on statuses/show to double-check if the post was really deleted.)
- Overwrite the old ID list with the new one.
- Repeat whenever you can fetch a new version of the timeline (you might be rate-limited by Sina if you do it too often).
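The comparison step above can be sketched in a few lines of Python, with the standard `difflib` module standing in for the command-line diff. Treat this as an illustration rather than our actual script: the function name `find_deleted_ids` is made up, and the last filter assumes that post IDs are numeric and grow over time, which you should verify for yourself.

```python
import difflib


def find_deleted_ids(previous_ids, current_ids):
    """Return IDs present in the previous list but gone from the new one.

    Both arguments are lists of ID strings, one per post, as they would be
    read from the text files described in the steps above.
    """
    # ndiff prefixes lines found only in the first sequence with "- ".
    removed = [
        line[2:]
        for line in difflib.ndiff(previous_ids, current_ids)
        if line.startswith("- ")
    ]
    if not current_ids:
        return removed

    # Assumption: IDs are numeric and increase over time. A missing ID that
    # is older than everything still visible probably just scrolled out of
    # the 200-post window, so it is not flagged here.
    oldest_visible = min(int(post_id) for post_id in current_ids)
    return [post_id for post_id in removed if int(post_id) > oldest_visible]
```

With a previous list of `["105", "104", "103", "102", "101"]` and a new list of `["106", "105", "103", "102"]`, only `"104"` is flagged: `"101"` is older than the oldest post still visible. The double-check against statuses/show remains the real safety net.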


Searching our Sina and QQ Weibo archive

Screenshot at 2012-01-19 17:15:45

We built a search engine a while ago for our Sina Weibo archive and, since yesterday, also for the QQ Weibo archive. We use Lucene as the indexer (to do quick full-text searches) and store all linked information in our standard database. The difference from the real search engines provided on the Sina and QQ Weibo websites is that we don’t currently implement any weighting, and the results are just everything we got, ordered by publication date.

We index every four hours, so there’s a delay of at least 30 minutes and at most around 4 hours 30 minutes. There’s paging, too. Because we’re not Google, be aware that queries normally take up to 1 minute to run (more if there’s lots of activity on the server). The search by region / province on the Sina search is also uber-slow.

Cool feature: you can link directly to searches! For instance, if you were interested in racing celebrity Han Han (韩寒), who has been under fire recently, you may use links such as these:
http://research.jmsc.hku.hk/social/search.py/qqweibo/?q=韩寒
http://research.jmsc.hku.hk/social/search.py/sinaweibo/?q=韩寒
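If you want to generate such links from a script, a small helper like the one below would do. The base URL and the `q` parameter are taken from the links above. The function name `search_link` is made up, and percent-encoding the query as UTF-8 is an assumption about what the server accepts.

```python
from urllib.parse import quote

BASE_URL = "http://research.jmsc.hku.hk/social/search.py"


def search_link(service, query):
    """Build a direct link to a search.

    `service` is "sinaweibo" or "qqweibo", as in the links above. The query
    is percent-encoded as UTF-8 (an assumption, see the note above).
    """
    return f"{BASE_URL}/{service}/?q={quote(query)}"


link = search_link("sinaweibo", "韩寒")
# link == "http://research.jmsc.hku.hk/social/search.py/sinaweibo/?q=%E9%9F%A9%E5%AF%92"
```

Browsers usually do this encoding for you when you paste a link with Chinese characters. Encoding it explicitly just makes the link safer to embed in emails or other scripts.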

Other cool feature: Google Translate! Write your search query in your language, and behind the scenes, we’ll try to send a query to the Google Translate API. You’ll know whether it worked when you get your results.


Ma Ying-jeou in pictures on WeiboScope

When you check out WeiboScope today, what you will notice are Yao Ming with sleeping delegates at the Shanghai CPPCC and Ma Ying-jeou, Taiwan’s president re-elected for a second 4-year term on Saturday (there’s also this weird meme of Zhou Qifeng, Peking University’s president, grinning uncontrollably alongside Li Keqiang, China’s Vice-Premier).

mayingjeou

But what really caught my eye were all the photos sitting at the top of our Sina Weibo data stack, which probably number in the thousands once you get to the bottom of the stack. The image wall above was generated using the image search portion of WeiboScope. One photo in particular is making the rounds: Ma in the US with his future wife, Zhou Meiqing.