<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Rice Cooker 電飯煲</title>
	<atom:link href="http://jmsc.hku.hk/blogs/ricecooker/feed/" rel="self" type="application/rss+xml" />
	<link>http://jmsc.hku.hk/blogs/ricecooker</link>
	<description>The JMSC&#039;s Computational Journalism Blog</description>
	<lastBuildDate>Thu, 03 May 2012 03:00:13 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Prototype: Completed Buildings in Hong Kong (2005-2011)</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/05/03/prototype-completed-buildings-in-hong-kong-2005-2011/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/05/03/prototype-completed-buildings-in-hong-kong-2005-2011/#comments</comments>
		<pubDate>Thu, 03 May 2012 01:06:29 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Data Visualisation]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=610</guid>
		<description><![CDATA[Screenshot of completed buildings map in Hong Kong (2005-2011) The following is a map of completed buildings in Hong Kong from 2005 to 2011, according to data from Buildings Department as processed by us (errors may occur): http://opengov.jmsc.hku.hk/datamap/#completed It is still being worked on as we speak, so bugs might also be found. Using a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://opengov.jmsc.hku.hk/datamap/#completed"><img src="http://farm8.staticflickr.com/7181/6991240400_cc5bde66d5.jpg" width="500" height="335" alt="Screenshot at 2012-05-03 08:52:01"></a><br />
<a href="http://www.flickr.com/photos/smurfmatic/6991240400/" title="Screenshot at 2012-05-03 08:52:01 by Cedric Sam, on Flickr">Screenshot of completed buildings map in Hong Kong (2005-2011)</a></p>
<p>The following is a map of completed buildings in Hong Kong from 2005 to 2011, according to data from Buildings Department as processed by us (errors may occur): <a href="http://opengov.jmsc.hku.hk/datamap/#completed">http://opengov.jmsc.hku.hk/datamap/#completed</a> It is still being worked on as we speak, so bugs might also be found.</p>
<p>Using a <a href="http://www-zeuthen.desy.de/~friebel/unix/lesspipe.html">text processing tool</a> on Linux, we extracted the text of <a href="http://opengov.jmsc.hku.hk/buildings/digests/5.6.aggr.txt">every section 5.6</a> of Buildings Department&#8217;s PDF monthly digests (<a href="http://opengov.jmsc.hku.hk/buildings/digests/Md201112e.pdf">here&#8217;s one</a> of the 80-something published between 2005 and 2011).</p>
<p>The data was cleaned with <a href="http://code.google.com/p/google-refine/">Google Refine</a> to the best of our abilities, and mapped with <a href="http://www.google.com/fusiontables/">Google Fusion Tables</a> according to the town planning units. We couldn&#8217;t map by address, because addresses were often messy, and those in the New Territories often referred to their <a href="http://www.landreg.gov.hk/pdf/AllCRT.pdf">lot number</a> only.</p>
<p>There is still a lot of semi-open data (semi-open because it is not originally in a raw text format) provided by the Hong Kong government that could use a bit of repackaging. We&#8217;ll get back to you soon on this.</p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/05/03/prototype-completed-buildings-in-hong-kong-2005-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Public services shutdown</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/04/23/public-services-shutdown/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/04/23/public-services-shutdown/#comments</comments>
		<pubDate>Mon, 23 Apr 2012 04:48:39 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=606</guid>
		<description><![CDATA[Unfortunately, we noticed that we&#8217;re not optimized to support external requests to our search tool and other tools mentioned in the previous posts. If you need data, please contact Dr. King-wa Fu directly: kwfu@hku.hk]]></description>
			<content:encoded><![CDATA[<p>Unfortunately, we noticed that we&#8217;re not optimized to support external requests to our search tool and other tools mentioned in the previous posts. If you need data, please contact Dr. King-wa Fu directly: <a href="mailto:kwfu@hku.hk">kwfu@hku.hk</a></p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/04/23/public-services-shutdown/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cleaning HK Gov Data with Google Refine and displaying it with Google Fusion Tables</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/04/15/cleaning-hk-gov-data-with-google-refine-and-displaying-it-with-google-fusion-tables/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/04/15/cleaning-hk-gov-data-with-google-refine-and-displaying-it-with-google-fusion-tables/#comments</comments>
		<pubDate>Sun, 15 Apr 2012 12:14:46 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=598</guid>
		<description><![CDATA[Last week, I started working with data from Buildings Department, concerning building permits. Despite the PDF documents being &#8220;protected&#8221; (preventing copying when opening with Acrobat), you can use a common utility for Linux called lesspipe, a pre-processor for less, that can process many file types into readable text. Readable does not necessarily mean structured. By [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, I started working with <a href="http://www.bd.gov.hk/english/documents/index_statistics.html">data from Buildings Department</a>, concerning building permits.</p>
<p>Despite the PDF documents being &#8220;protected&#8221; (preventing copying when opening with Acrobat), you can use a common utility for Linux called <a href="http://sourceforge.net/projects/lesspipe/">lesspipe</a>, a pre-processor for less, that can process many file types into readable text.</p>
<p>Readable does not necessarily mean structured. The lesspipe output is by no means usable as is (<a href="http://lamma.jmsc.hku.hk/buildings/digests/5.2.aggr.txt">it looks like this</a> after separating the sections and aggregating across different PDF files). With the fantastic <a href="http://code.google.com/p/google-refine/">Google Refine</a> tool, however, you can try your best to parse the data, clean the different fields manually and then even perform geocoding inside the tool (with &#8220;Add column by fetching URLs&#8221;).</p>
<p>After the cleaning was done (it took a few hours last Thursday, and a few more hours today), I did an export in TSV, and sent it to <a href="http://www.google.com/fusiontables">Google Fusion Tables</a>. I customized the map visualisation with the &#8220;month&#8221; field, and here is the result:</p>
<p><strong>2005-2011 data for &#8220;Table 5.2 Buildings for which building authority has issued demolition consent&#8221; from Hong Kong Buildings Department&#8217;s monthly digests (alpha)</strong><br />
<iframe width="500" height="500" scrolling="no"  src="https://www.google.com/fusiontables/embedviz?viz=MAP&#038;q=select+col5+from+3546150+&#038;h=false&#038;lat=22.31954974270474&#038;lng=114.16789386503376&#038;z=12&#038;t=1&#038;l=col5"></iframe></p>
<p>This is not even close to our final product yet, because the Google Maps JavaScript API V3 now <a href="https://developers.google.com/maps/documentation/javascript/layers#FusionTables">lets you add layers from Fusion Tables data</a>. Effectively, it means that you can build Web applications with different kinds of filters (in pull-down menus, etc.) that dynamically change how the data is displayed. The example above only shows the single view specified inside Fusion Tables by the owner of the table (me). You could possibly take the ID of the table (3546150) and make your own visualisation.</p>
<p>For now, the data hasn&#8217;t been vetted after refining (maybe the govt will provide us with raw data?), so I would recommend using it with great caution regarding its validity. It should be largely correct, but some data points may not have been geocoded properly, if at all. For this particular data, corrigenda to Buildings Department monthly digests are not yet taken into account.</p>
<p>Here is another Google Refine + Google Fusion Tables trick on Hong Kong government data:</p>
<p><strong>Map for data from &#8220;Short Term Tenancy (STT) Tender Forecast&#8221; from Hong Kong Lands Department (alpha)</strong><br />
<iframe width="500" height="500" scrolling="no"  src="https://www.google.com/fusiontables/embedviz?viz=MAP&#038;q=select+col3+from+3546151+&#038;h=false&#038;lat=22.373888638213955&#038;lng=114.1234016418457&#038;z=11&#038;t=1&#038;l=col3"></iframe></p>
<p>These are the sites listed in the <a href="http://www.landsd.gov.hk/en/stt/forecast.htm">Short Term Tenancy (STT) Tender Forecast</a> from Lands Department. They are the sites for sale on short term tenancy, for a few years, for uses such as car parks. The color code on this custom map is based on the area of each site in square metres (from purple for 0-1000 sqm to red for 5000+).</p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/04/15/cleaning-hk-gov-data-with-google-refine-and-displaying-it-with-google-fusion-tables/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weiboball</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/04/12/weiboball/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/04/12/weiboball/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 03:53:23 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Online Social Networks]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=592</guid>
		<description><![CDATA[A year and a half ago, I did a project called Twitterball for one of my classes at PolyU on information architecture. It played on the idea of visualising the frequencies of tweets in time. I quickly remashed Twitterball into Weiboball, using data from our archive and the search engine built with WeiboScope Search. The result [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://research.jmsc.hku.hk/social/twitterball/weiboball.html"><img src="http://jmsc.hku.hk/blogs/ricecooker/wp-content/uploads/2012/04/weiboball-489x500.png" alt="" title="weiboball" width="489" height="500" class="alignnone size-medium wp-image-594" /></a></p>
<p>A year and a half ago, I did a <a href="http://jmsc.hku.hk/blogs/ricecooker/2010/10/22/twitterball-and-twitterbubble-to-visualise-chinese-twitter-data-using-processing-js/">project called Twitterball</a> for one of my classes at PolyU on information architecture. It played on the idea of visualising the frequencies of tweets in time.</p>
<p>I quickly remashed Twitterball into Weiboball, using data from our archive and the search engine built with WeiboScope Search. The result is <a href="http://research.jmsc.hku.hk/social/twitterball/weiboball.html">Weiboball</a> and <a href="http://research.jmsc.hku.hk/social/twitterball/weiboball.html">Weibobubble</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/04/12/weiboball/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>No trending topics, no problem</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/04/03/no-trending-topics-no-problem/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/04/03/no-trending-topics-no-problem/#comments</comments>
		<pubDate>Tue, 03 Apr 2012 00:49:35 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Online Social Networks]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=581</guid>
		<description><![CDATA[Another side-effect of the comment system shutdown (and as documented by my colleague David at CMP), was the shutdown of the trending topics. [Edit (12:30PM): There was no shutdown of trending topics (thanks Charlie of Chinageeks), but I noticed that their weekly trending topics never seem (from visual inspection) to include posts made in the [...]]]></description>
			<content:encoded><![CDATA[<p>Another side-effect of the <a href="http://jmsc.hku.hk/blogs/ricecooker/2012/04/02/no-comment/">comment system shutdown</a> (<a href="http://cmp.hku.hk/2012/04/02/21126/">and as documented by my colleague David at CMP</a>), was the shutdown of the trending topics.</p>
<p>[<strong>Edit (12:30PM):</strong> There was no shutdown of trending topics (thanks Charlie of <a href="http://chinageeks.org/">Chinageeks</a>), but I noticed that their weekly trending topics never seem (from visual inspection) to include posts made in the past 3 days. I was also confused by the daily topics, because I wrote the entry before 9AM and only saw the topics of 2 days ago as "yesterday's" trending topics, skipping one entire day. It's possible that the trending topics are only released once or a handful of times per day.]</p>
<p>The image embedded below is a screenshot of the most commented-on weibos of the last week on Sina. They are ranked by the <a href="http://weibo.com/pub/topmblog?act=week&#038;type=cmt">number of comments</a> made on the given post that week, but are all on rather innocuous topics this time.</p>
<p><a href="http://www.flickr.com/photos/smurfmatic/7040409689/" title="熱門轉發 新浪微博-隨時隨地分享身邊的新鮮事兒 by Cedric Sam, on Flickr"><img src="http://farm8.staticflickr.com/7265/7040409689_93ca8da243.jpg" width="500" height="336" alt="熱門轉發 新浪微博-隨時隨地分享身邊的新鮮事兒"></a></p>
<p>The following is a screenshot of the page for the most reposted weibos during the last week (<a href="http://weibo.com/pub/topmblog?act=week&#038;type=re">original page</a>). Sina counts them over the week leading up to today, and a calendar lets you navigate through the archive.</p>
<p><a href="http://www.flickr.com/photos/smurfmatic/7040430153/sizes/o/in/photostream/" title="Most reposted last week by Cedric Sam, on Flickr"><img src="http://farm8.staticflickr.com/7260/7040430153_d21d77374a_b.jpg" width="204" height="1024" alt="Most reposted last week"></a></p>
<p>It&#8217;s not always inoffensive stuff, as sometimes the posts would touch on social injustice and events of political importance, <a href="http://weibo.com/pub/topmblog?act=week&#038;type=re&#038;t=2012-03-06">like here</a>. But don&#8217;t look for the Bo Xilai and Wang Lijun stuff, because you won&#8217;t find any of it. That said, not everything has to be political to be important, and celebrities&#8217; posts often occupy the microblogosphere of the majority.</p>
<p>So it is a good thing that we are keeping <a href="http://research.jmsc.hku.hk/social/sinaweibo/">our own trending topics</a>.</p>
<p>We have been making our own index of trending topics for a long while already (more than a year now) and while it chiefly depends on our capacity to collect posts, it has always given us a good indication of what&#8217;s *really* trending, among people of slightly greater influence (we have a list of 270,000 people now, each with more than 1,000 followers).</p>
<p>We solely look at popular posts based on the number of reposts (among a sample), because it&#8217;s not practical to do trending topics based on comments, when you don&#8217;t have the capacity to discover popular posts that way.</p>
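<p>Ranking a sample by repost count can be illustrated in a few lines of Python. This is only a sketch and not the code behind our index: the function name <code>top_reposted</code>, the <code>id</code> and <code>reposts</code> field names and the sample records are all made up for the example.</p>

```python
def top_reposted(posts, n=10):
    """Return the n posts with the highest repost counts, highest first."""
    return sorted(posts, key=lambda post: post["reposts"], reverse=True)[:n]


# Made-up sample records; the field names are assumptions, not our schema.
sample = [
    {"id": "a", "reposts": 120},
    {"id": "b", "reposts": 4500},
    {"id": "c", "reposts": 980},
]

for post in top_reposted(sample, n=2):
    print(post["id"], post["reposts"])
```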
<p><a href="http://www.flickr.com/photos/smurfmatic/7040409771/in/photostream/lightbox/" title="WeiboScope - published by the JMSC at HKU by Cedric Sam, on Flickr"><img src="http://farm8.staticflickr.com/7114/7040409771_9b2623f07b.jpg" width="500" height="336" alt="WeiboScope - published by the JMSC at HKU"></a></p>
<p>Our trending topics come in handy when the comments are completely gone, as was the case this week on both Sina and QQ weibo, and reportedly most of the other microblogging platforms. <a href="http://research.jmsc.hku.hk/social/obs.py/sinaweibo/">WeiboScope</a> is a visual representation of those trending posts, according to us. For instance, there is a strong representation of pictures of Buddhist monks, one of what seems to be <a href="http://research.jmsc.hku.hk/social/index.py/singleSinaWeibo?id=3430274899767224">one of their leaders</a>, and another of what looks like <a href="http://research.jmsc.hku.hk/social/index.py/singleSinaWeibo?id=3430134881486551">two monks deviating from their monastic life</a> (<a href="http://www.flickr.com/photos/smurfmatic/7040409771/in/photostream/lightbox/">see archived screenshot</a>).</p>
<p>Also a bit of <a href="http://research.jmsc.hku.hk/social/index.py/singleSinaWeibo?id=3430066581179654">Yao Ming</a>, <a href="http://research.jmsc.hku.hk/social/index.py/singleSinaWeibo?id=3430197195973648">cats growing old together</a> and <a href="http://research.jmsc.hku.hk/social/index.py/singleSinaWeibo?id=3430371226166733">some ridiculous freak incident</a> where a lady fell in a hole in the pavement where hot water pipes had ruptured.</p>
<p>The image is only indicative, as I didn&#8217;t check the actual sampling. But because of its consistency over time (in terms of matching Sina&#8217;s own trending topics or what I end up seeing in the news), I can believe that this would be what is interesting among a certain group of slightly more influential people. That&#8217;s what the chatter&#8217;s on right now on Sina Weibo.</p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/04/03/no-trending-topics-no-problem/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>No comment</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/04/02/no-comment/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/04/02/no-comment/#comments</comments>
		<pubDate>Mon, 02 Apr 2012 01:08:23 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Online Social Networks]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=563</guid>
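		<description><![CDATA[Since Saturday, comments on Sina Weibo, China&#8217;s most popular microblogging platform, have been shut down for clean-up until 8 a.m. on Tuesday, April 3. Sina Weibo announcement To all Weibo users: Recently, a considerable number of rumours and other illegal and harmful information have appeared in microblog comments. To carry out a concentrated clean-up, the microblog comment function will be suspended from 8 a.m. on March 31 to 8 a.m. on April 3. After the clean-up, we will reopen the comment function. This necessary clean-up of information is intended to provide everyone with a better environment for communication; we hope that users will understand and forgive the inconvenience. Thank you for your support. Sina Weibo March 31, 2012 Weibos are a bit more &#8220;Twitter-esque&#8221; now, since comments are a unique feature of Chinese microblogs over their Western counterparts. It is an especially critical feature, since [...]]]></description>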
			<content:encoded><![CDATA[<p>Since Saturday, comments on Sina Weibo, China&#8217;s most popular microblogging platform, have been shut down for clean-up until 8 a.m. on Tuesday, April 3.</p>
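<blockquote><p>Sina Weibo announcement</p>
<p>To all Weibo users:<br />
Recently, a considerable number of rumours and other illegal and harmful information have appeared in microblog comments. To carry out a concentrated clean-up, the microblog comment function will be suspended from 8 a.m. on March 31 to 8 a.m. on April 3. After the clean-up, we will reopen the comment function. This necessary clean-up of information is intended to provide everyone with a better environment for communication; we hope that users will understand and forgive the inconvenience. Thank you for your support.<br />
Sina Weibo</p>
<p>March 31, 2012</p></blockquote>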
<p>Weibos are a bit more &#8220;Twitter-esque&#8221; now, since comments are a unique feature of Chinese microblogs over their Western counterparts. It is an especially critical feature, since reposts are often made through a comment first (you may then decide to repost something back to your readers), and comments allow text on the same topic to be aggregated within a single stream.</p>
<p>Tencent had a similar message for their weibo, also one of the most popular ones in the nation:</p>
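<blockquote><p>Tencent Microblog announcement</p>
<p>Dear users:<br />
Recently, rumours and other illegal and harmful information spread through microblogs have had a negative social impact. Harmful information is relatively abundant in comments and needs a concentrated clean-up. For this reason, this website has decided to temporarily suspend the microblog comment function from 8 a.m. on March 31 to 8 a.m. on April 3. We apologise for any inconvenience this causes you.</p>
<p>Tencent</p>
<p>March 31, 2012</p></blockquote>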
<p>The networks do not point fingers at a specific target of this &#8220;cleanup&#8221; in their message to users, but many understand that it is to rid their weibo systems of the chatter on a supposed &#8220;coup&#8221; in Beijing.</p>
<p>As illustrated in <a href="http://www.economist.com/node/21551466">The Economist this week</a>, but understood by any Chinese speaker, 140 Chinese characters carry a lot more meaning than 140 characters in an alphabet-based language.</p>
<p>We&#8217;ll try to monitor the blackout and keep you posted on it.</p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/04/02/no-comment/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sina Weibo: Zombie accounts, slain entries, and now, ghostly posts</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/03/26/sina-weibo-zombie-accounts-slain-entries-and-now-ghostly-posts/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/03/26/sina-weibo-zombie-accounts-slain-entries-and-now-ghostly-posts/#comments</comments>
		<pubDate>Mon, 26 Mar 2012 10:04:16 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Online Social Networks]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=553</guid>
		<description><![CDATA[While inspecting a potential hiccup to our Sina Weibo deleted posts monitoring system (see the ASL / see article explaining the method), we discovered the occurrence of a post that was seemingly deleted, but which still occasionally appeared on the API. It was a repost made on March 2nd by David Bandurski of [...]]]></description>
			<content:encoded><![CDATA[<p>While inspecting a potential hiccup to our <a href="http://research.jmsc.hku.hk/social/search.py/sinaweibo/#lastpermissiondenied">Sina Weibo deleted posts monitoring system</a> (<a href="http://cmp.hku.hk/~/asl/">see the ASL</a> / <a href="http://jmsc.hku.hk/blogs/ricecooker/2012/02/08/how-do-you-catch-and-archive-deleted-posts-on-sina-weibo/">see article explaining the method</a>), we discovered the occurrence of a post that was seemingly deleted, but which still occasionally appeared on the API.</p>
<p><a href="http://jmsc.hku.hk/blogs/ricecooker/wp-content/uploads/2012/03/sinaweibo_api_3419194303931100.png"><img class="alignnone size-large wp-image-554" title="sinaweibo_api_3419194303931100" src="http://jmsc.hku.hk/blogs/ricecooker/wp-content/uploads/2012/03/sinaweibo_api_3419194303931100-601x1024.png" alt="" width="601" height="1024" /></a></p>
<p>It was a repost made on March 2nd by <a href="http://weibo.com/u/1958906460">David Bandurski of CMP</a> of a Hu Shuli post commenting on the Wang Lijun incident on the same day. <a href="http://research.jmsc.hku.hk/social/index.py/singleSinaWeibo?id=3419194303931100">The post was archived</a>, but can no longer be found on David&#8217;s timeline. Ms. Hu&#8217;s <a href="http://weibo.com/1497882593/y836jAMZC">post</a> was still alive on Weibo when I checked (<a href="http://research.jmsc.hku.hk/social/index.py/singleSinaWeibo?id=3419185478768140">and we have it archived, in case</a>).</p>
<p>The strange thing was that this post was not marked as &#8220;permission denied&#8221; and was fully available when queried with the &#8220;statuses/show&#8221; function on the Weibo API (<a href="http://api.weibo.com/2/statuses/show.json?id=3419194303931100&#038;source=4280451947">link</a> &#8212; requires login). On the other hand, the user_timeline function, which lists the latest 200 posts made by a given user, reported that the post no longer existed.</p>
<p>Even more bizarre still, when you <a href="http://weibo.com/u/1958906460?start_time=2012-03-01&#038;end_time=2012-03-03&#038;is_search=1">searched on Weibo.com</a> for David&#8217;s posts in a date range containing March 2nd, the search page claimed 5 results, whereas only 4 were actually returned! This is shown in the following image:</p>
<p><a href="http://jmsc.hku.hk/blogs/ricecooker/wp-content/uploads/2012/03/班志远-香港大学CMP的微博-新浪微博-隨時隨地分享身邊的新鮮事兒.png"><img src="http://jmsc.hku.hk/blogs/ricecooker/wp-content/uploads/2012/03/班志远-香港大学CMP的微博-新浪微博-隨時隨地分享身邊的新鮮事兒-637x1024.png" alt="" title="班志远--香港大学CMP的微博 新浪微博-隨時隨地分享身邊的新鮮事兒" width="637" height="1024" class="alignnone size-large wp-image-555" /></a></p>
<p>For our monitoring tool, specifically the user_timeline function that casts the net for deleted posts, we have been using the first version (V1) of the API (because of rate limiting issues). V1 seems to me like the abandoned entrance for a data store that is more and more complex in its layers of visibility (and invisibility). These layers don&#8217;t seem to be supported with 100% fidelity on V1, which returns copies of the user timeline that sometimes contain the &#8220;ghostly&#8221; post and sometimes don&#8217;t (probably depending on which physical server you end up accessing).</p>
<p>Not only are there &#8220;permission denied&#8221; or &#8220;weibo does not exist&#8221; posts, as described in <a href="http://jmsc.hku.hk/blogs/ricecooker/2012/02/08/how-do-you-catch-and-archive-deleted-posts-on-sina-weibo/">our previous post</a> about the methods behind the tool, but there are now posts wavering between a state of publication and non-publication (at least from the weibo.com website&#8217;s point of view, they were dead).</p>
<p>Anecdotally, I noticed that &#8220;permission denied&#8221; posts usually meant that their dependent posts (by people who repost) would also be marked as deleted with &#8220;permission denied&#8221;. In this case of the Hu Shuli post, the original post was alive and well, while the repost was deleted.</p>
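<p>The states discussed in this post can be summarised as a small check. The following is a minimal sketch in Python, assuming the two API responses have already been fetched and reduced to booleans. The function name <code>classify_post</code> and its parameters are hypothetical: they are not part of the Weibo API, and this is not the code our monitoring tool runs.</p>

```python
def classify_post(show_ok, show_permission_denied, in_user_timeline):
    """Label a post from two independent observations (hypothetical helper).

    show_ok: statuses/show returned the full post.
    show_permission_denied: statuses/show answered "permission denied".
    in_user_timeline: the post id was present in the user_timeline listing.

    Assumes the post is recent enough to fall within the latest 200 posts
    that user_timeline lists; older posts would need another check.
    """
    if show_permission_denied:
        return "permission denied"
    if show_ok and in_user_timeline:
        return "alive"
    if show_ok and not in_user_timeline:
        # Still retrievable by id, yet missing from the author's timeline.
        return "ghost"
    return "does not exist"


# The repost discussed above: available by id, absent from the timeline.
print(classify_post(True, False, False))
```

<p>Because V1 sometimes returned the ghostly post and sometimes did not, a single observation is not conclusive; repeating the check several times would be a reasonable precaution.</p>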
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/03/26/sina-weibo-zombie-accounts-slain-entries-and-now-ghostly-posts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>China News Archive</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/02/22/china-news-archive/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/02/22/china-news-archive/#comments</comments>
		<pubDate>Wed, 22 Feb 2012 08:14:25 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=546</guid>
		<description><![CDATA[Ever looked for an automated archive of Chinese news websites? For months, we&#8217;ve been collecting screenshots and HTML snapshots of up to 20 websites based in China or covering China. We now have a webpage for it. http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots/ The screenshots are classified by news source and with a minimalistic (if not just minimal) interface, organised [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots/"><img src="http://jmsc.hku.hk/blogs/ricecooker/wp-content/uploads/2012/02/qq_20120222-1605-150x150.png" alt="" title="qq_20120222-1605" width="150" height="150" class="alignnone size-thumbnail wp-image-547" /></a></p>
<p>Ever looked for an automated archive of Chinese news websites? For months, we&#8217;ve been collecting screenshots and HTML snapshots of up to 20 websites based in China or covering China. <a href="http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots/">We now have a webpage for it.</a></p>
<p><a href="http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots/">http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots/</a></p>
<p>The screenshots are classified by news source and, in a minimalistic (if not just minimal) interface, organised by day and grouped by month. For instance, you could go to the <a href="http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots?newssite=qq">QQ News archive</a>, an archive for <a href="http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots?newssite=qq#m201202">February 2012</a>, and a particular link to <a href="http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots?newssite=qq#d20120222">today</a>.</p>
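<p>The links above follow a regular pattern, so they can be built programmatically. Here is a minimal sketch in Python, assuming the pattern seen in those three links holds for other sources and dates. The helper name <code>archive_url</code> is hypothetical, and the <code>m</code> + YYYYMM and <code>d</code> + YYYYMMDD anchors are inferred from the example links only.</p>

```python
from datetime import date

BASE = "http://research.jmsc.hku.hk/social/chinanews/index.py/listScreenshots"


def archive_url(newssite, day=None, month_only=False):
    """Build a link into the screenshot archive (hypothetical helper).

    The "m" + YYYYMM and "d" + YYYYMMDD anchors are inferred from the
    example links in this post, not from any documented interface.
    """
    url = BASE + "?newssite=" + newssite
    if day is None:
        return url
    if month_only:
        return url + "#m" + day.strftime("%Y%m")
    return url + "#d" + day.strftime("%Y%m%d")


# Rebuild the link to the QQ News screenshots of 22 February 2012.
print(archive_url("qq", date(2012, 2, 22)))
```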
<p>There&#8217;s also a <a href="http://research.jmsc.hku.hk/social/chinanews/">version</a> for accessing navigable HTML pages when available.</p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/02/22/china-news-archive/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Spam spam spamalot</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/02/14/spam-spam-spamalot/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/02/14/spam-spam-spamalot/#comments</comments>
		<pubDate>Tue, 14 Feb 2012 00:45:53 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=543</guid>
		<description><![CDATA[I&#8217;ll start filtering weibos that contain links with photo.weibo.com and event.weibo.com. It seems like most containing such links go to spam-like galleries. Bad, bad, mega-bad. The amount of spam is just ridiculous today.]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll start filtering weibos that contain links to photo.weibo.com and event.weibo.com. It seems that most such links go to spam-like galleries. Bad, bad, mega-bad. The amount of spam is just ridiculous today.</p>
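<p>A filter of this kind can be sketched in a few lines of Python. This is an illustration only, assuming the link appears in full in the post text (shortened links would need to be expanded first). The function name <code>is_probable_spam</code> and the sample posts, including the example URL, are made up.</p>

```python
SPAM_HOSTS = ("photo.weibo.com", "event.weibo.com")


def is_probable_spam(text):
    """Return True if the post text mentions one of the filtered hosts."""
    return any(host in text for host in SPAM_HOSTS)


# Made-up sample posts; the URL is a placeholder, not a real gallery.
posts = [
    "Look at this gallery http://photo.weibo.com/example",
    "Ordinary post without links",
]
kept = [post for post in posts if not is_probable_spam(post)]
print(kept)
```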
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/02/14/spam-spam-spamalot/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How do you catch and archive deleted posts on Sina Weibo?</title>
		<link>http://jmsc.hku.hk/blogs/ricecooker/2012/02/08/how-do-you-catch-and-archive-deleted-posts-on-sina-weibo/</link>
		<comments>http://jmsc.hku.hk/blogs/ricecooker/2012/02/08/how-do-you-catch-and-archive-deleted-posts-on-sina-weibo/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 05:02:50 +0000</pubDate>
		<dc:creator>Cedric Sam</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Online Social Networks]]></category>

		<guid isPermaLink="false">http://jmsc.hku.hk/blogs/ricecooker/?p=525</guid>
		<description><![CDATA[It&#8217;s the holy grail for any media researcher working on China: how do you quantify content removal from social media services such as Sina Weibo? Here at JMSC, we&#8217;ve been developing tools to scour and assess social media of all sorts for the purpose of researching online media in Hong Kong and mainland China. (This [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s the holy grail for any media researcher working on China: how do you quantify content removal from social media services such as <a href="http://weibo.com/">Sina Weibo</a>? Here at JMSC, <a href="http://github.com/JMSCHKU/Social">we&#8217;ve been developing tools</a> to scour and assess social media of all sorts for the purpose of researching online media in Hong Kong and mainland China. <em>(This is the same project that generated <a href="http://research.jmsc.hku.hk/social/obs.py/sinaweibo/">WeiboScope</a>, for those tuned in.)</em></p>
<p><a href="http://research.jmsc.hku.hk/social/search.py/sinaweibo/#lastpermissiondenied" title="WeiboScope Search: deleted posts"><img src="http://farm8.staticflickr.com/7016/6839287437_ed11dd4f93.jpg" width="406" height="500" alt="Screenshot at 2012-02-08 11:56:04"></a></p>
<p>With the extensive archive of weibos we&#8217;ve accumulated so far and the mechanisms underlying its retrieval, we were able to develop a routine that finds and marks deleted posts (<a href="http://jmsc.hku.hk/blogs/ricecooker/2012/01/31/sina-weibo-deleted-posts-archive/">method explained here</a>).</p>
<p>The result is <a href="http://research.jmsc.hku.hk/social/search.py/sinaweibo/#lastdeleted">an archive of deleted posts</a> (and <a href="http://cmp.hku.hk/~/asl/">the CMP&#8217;s Anti-Social List</a>). Not only can we find a large number of such posts (not all of them, we admit) within a day of their deletion, but we can also presume, with clear-cut evidence, whether each deleted post was removed by the user or <a href="http://research.jmsc.hku.hk/social/search.py/sinaweibo/#lastpermissiondenied">simply deleted by system managers</a> (<a href="http://jmsc.hku.hk/blogs/ricecooker/2011/08/30/what-an-inconsistent-api/">we first noticed the difference in August 2011&#8230;</a>).</p>
<p><a href="http://jmsc.hku.hk/blogs/ricecooker/2012/01/31/sina-weibo-deleted-posts-archive/">As described in a post last week</a>, the idea behind this archive is simple and straightforward to implement, once you&#8217;ve got the infrastructure.</p>
<p><a href="http://www.flickr.com/photos/smurfmatic/6839346293/" title="deleted_weibo_previous by Cedric Sam, on Flickr"><img src="http://farm8.staticflickr.com/7029/6839346293_58dc274108.jpg" width="500" height="335" alt="deleted_weibo_previous"></a><br />
<em>A previous copy of the user timeline, containing all posts</em></p>
<p><a href="http://www.flickr.com/photos/smurfmatic/6839346287/" title="deleted_weibo_current by Cedric Sam, on Flickr"><img src="http://farm8.staticflickr.com/7155/6839346287_5ceb768b66.jpg" width="500" height="335" alt="deleted_weibo_current"></a><br />
<em>A current copy of the user timeline, with a missing post</em></p>
<p>Both copies of a user timeline (post IDs extracted from the full JSON response) come from two consecutive <a href="http://en.wikipedia.org/wiki/Application_programming_interface">API</a> calls, which may be a few minutes or several hours apart. The shorter the interval between polls, the more precisely the routine can pin down the removal time (and the smaller the chance of missing something).</p>
<p>A post is considered deleted when it appears in the previous copy but not in the current one. Since we keep a copy of every post we see, we simply mark the post as deleted, and can then view all marked posts on a custom webpage. Easy enough, right?</p>
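<p>The comparison can be sketched roughly as below. This is an illustration, not our production routine: the function names, the <code>deleted_seen_at</code> field and the in-memory <code>archive</code> dict are hypothetical, and the sketch assumes post IDs grow over time so that posts which merely scrolled off the end of the timeline are not flagged.</p>

```python
def find_missing_posts(previous_ids, current_ids):
    """Return IDs present in the previous snapshot but absent from the current one.

    Only IDs at least as new as the oldest post still visible in the current
    snapshot are considered (assumption: IDs increase over time), so older
    posts pushed out of the timeline window are not reported as deleted.
    """
    if not current_ids:
        # An empty snapshot gives no window to compare against; it could
        # just as well be a failed call, so nothing is flagged.
        return []
    current = set(current_ids)
    oldest_visible = min(current)
    return sorted(
        pid for pid in set(previous_ids)
        if pid >= oldest_visible and pid not in current
    )


def mark_deleted(archive, missing_ids, checked_at):
    """Mark archived posts as deleted, keeping the first time we noticed."""
    for pid in missing_ids:
        post = archive.get(pid)
        if post is not None and "deleted_seen_at" not in post:
            post["deleted_seen_at"] = checked_at


# Post 104 vanished between the two polls; 101 only scrolled out of view.
previous = [105, 104, 103, 102, 101]
current = [106, 105, 103, 102]
missing = find_missing_posts(previous, current)
```

<p>The time recorded is when the absence was noticed, not when the post was actually removed, which is why the polling interval matters so much.</p>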
<p><a href="http://www.flickr.com/photos/smurfmatic/6839360025/" title="Screenshot at 2012-02-08 12:14:35 by Cedric Sam, on Flickr"><img src="http://farm8.staticflickr.com/7014/6839360025_a5299e0b83.jpg" width="500" height="131" alt="Screenshot at 2012-02-08 12:14:35"></a></p>
<p><a href="http://www.flickr.com/photos/smurfmatic/6839359997/" title="Screenshot at 2012-02-08 12:14:26 by Cedric Sam, on Flickr"><img src="http://farm8.staticflickr.com/7030/6839359997_30432edd86.jpg" width="500" height="131" alt="Screenshot at 2012-02-08 12:14:26"></a></p>
<p>The previous two images show exactly what, from a programmer&#8217;s point of view, Sina Weibo returns for two different kinds of deleted posts. The former, with an API response of &#8220;weibo does not exist&#8221;, identifies a post that was presumably deleted by the user. The latter, which returns &#8220;permission denied&#8221;, is presumably a post deleted by the system.</p>
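<p>As a rough sketch, the two responses could be told apart like this. The function name and the returned labels are invented for this example, and matching on a substring of the error message is an assumption about how the response text is laid out; only the two quoted phrases come from what we observed.</p>

```python
def classify_removal(error_message):
    """Guess who removed a post from the API error text (hypothetical labels)."""
    msg = (error_message or "").lower()
    if "permission denied" in msg:
        return "removed_by_system"   # presumably taken down by Sina
    if "does not exist" in msg:
        return "removed_by_user"     # presumably deleted by its author
    return "unknown"                 # any other response stays unclassified
```

<p>Anything that matches neither phrase is left as unknown rather than forced into one of the two buckets, since both interpretations are presumptions to begin with.</p>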
<p>We don&#8217;t know the intention behind the two types of messages, but we can guess based on what the posts generally contain. The first type consists mostly of spam-like posts that would be deleted on any online social network in the world. The second seems to carry more legitimate content, including some posts by so-called VIP users verified by Sina (we only track a sample of some 2,500 users, so we can&#8217;t really infer how representative they are).</p>
<p>This feature is indeed powerful, because it finally puts a number on post removal on Sina Weibo. (We computer science majors strongly dislike conclusions not based on numbers and data.)</p>
<p>It is, however, currently impossible to tell with certainty what gets deleted (versus what doesn&#8217;t), since our user sample is strongly biased towards public commentators, and perhaps because the number of posts found is still extremely small.</p>
<p>What it does give is an understanding of how post removal works, how much time it usually takes for something to be removed, and whether reposts get deleted along with the original post or on their own while the original stays up (in fact, the latter happens). It&#8217;s a privileged peek, indeed, at what&#8217;s going on on the Chinese Internets, right here, right now.</p>
<p>Until we accumulate an archive substantial enough to do anything useful, our colleagues at <a href="http://cmp.hku.hk/">China Media Project</a> have started <a href="http://cmp.hku.hk/~/asl/">compiling and explaining deleted posts on their Anti-Social List</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jmsc.hku.hk/blogs/ricecooker/2012/02/08/how-do-you-catch-and-archive-deleted-posts-on-sina-weibo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

