Who are we?
The Diversity Mining Laboratory at Nagasaki University, led by Prof. Tomonari Masada, is an active research group working on data mining.
Our aim is to explore and organize the content diversity latent in massive data, to help people gain better insight into that data.
To this end, we apply various probabilistic modeling techniques, especially topic modeling.
Our Research Topics
Topic Analysis of Minutes of the National Diet of Japan
We provide this visualization only in Japanese.
Extraction of Topic Evolutions from References in Scientific Articles and Its GPU Acceleration
This work aims to extract latent topic transitions from linked documents in the form of a single transition matrix between latent topics.
The paper was accepted as a short paper at CIKM 2012.
The original long version of the paper is available at this scribd page.
This is joint work with Prof. Atsuhiro Takasu at NII.
Clustering of the MNIST training images of the digit "6"
This is a visualization of clustering [Masada+ ICONIP2013].
The result shown above was obtained on the MNIST dataset.
The number of clusters is controlled by a Chinese restaurant process.
This result corresponds to one of the three ways of clustering in [Masada+ ICONIP2013].
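To illustrate how a Chinese restaurant process lets the number of clusters grow with the data, here is a minimal sketch of the prior alone. It is not the inference procedure of [Masada+ ICONIP2013]; the function name `sample_crp` and the concentration parameter `alpha` are chosen here for illustration only.

```python
import random


def sample_crp(n_customers, alpha, seed=0):
    """Seat customers one by one according to a Chinese restaurant process.

    Returns (assignments, table_sizes), where assignments[i] is the table
    (cluster) index of customer i and table_sizes[k] is the size of table k.
    """
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # An existing table k is chosen with weight table_sizes[k];
        # a new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table (new cluster)
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = sample_crp(n_customers=100, alpha=2.0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

A larger `alpha` tends to open more tables, so the number of clusters does not have to be fixed in advance.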
Clustering of pixel columns and pixel rows of hand-written digit images
This is a visualization of clustering [Masada+ ICONIP2013].
The result shown above was obtained on the MNIST dataset.
Black pixels are the pixels detected as irrelevant by our method.
The clustering results can also be used to classify test images.
Extracting visual topics from tens of thousands of images with GPU
visual topics extracted by our GPU-based implementation of collapsed variational Bayesian inference for LDA
We implemented collapsed variational Bayesian (CVB) inference for LDA in CUDA [Masada+ IEA/AIE2009] and ran inference over 32,000 images.
These images are a subset of the Tiny Images dataset [1].
Here we show the topic extraction results for only a few dozen images. The full set of results can be browsed at this Web page.
Grayscale pixel values show the topic probability at each pixel of each image.
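As a rough illustration of the kind of per-token topic probabilities that such an inference produces, the following sketch runs CVB0, the zero-order simplification of CVB, on a toy corpus in plain Python. It is not the CUDA implementation of [Masada+ IEA/AIE2009], and it does not show how images are turned into tokens; the function name `cvb0_lda`, the hyperparameter values, and the toy documents are assumptions made for this example.

```python
import random


def cvb0_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
             n_iters=50, seed=0):
    """CVB0 for LDA. docs is a list of lists of word ids.

    Returns (gamma, n_dk, n_kw), where gamma[d][i][k] is the variational
    probability that token i of document d is assigned to topic k.
    """
    rng = random.Random(seed)
    n_dk = [[0.0] * n_topics for _ in docs]                # doc-topic counts
    n_kw = [[0.0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    n_k = [0.0] * n_topics                                 # topic totals
    gamma = []
    # Random initialization of the per-token topic probabilities.
    for d, doc in enumerate(docs):
        doc_gamma = []
        for w in doc:
            g = [rng.random() + 1e-3 for _ in range(n_topics)]
            total = sum(g)
            g = [x / total for x in g]
            for k in range(n_topics):
                n_dk[d][k] += g[k]
                n_kw[k][w] += g[k]
                n_k[k] += g[k]
            doc_gamma.append(g)
        gamma.append(doc_gamma)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = gamma[d][i]
                new = []
                for k in range(n_topics):
                    # Expected counts with the current token excluded.
                    a = n_dk[d][k] - old[k] + alpha
                    b = n_kw[k][w] - old[k] + beta
                    c = n_k[k] - old[k] + vocab_size * beta
                    new.append(a * b / c)
                total = sum(new)
                new = [x / total for x in new]
                # Update the expected counts with the new probabilities.
                for k in range(n_topics):
                    delta = new[k] - old[k]
                    n_dk[d][k] += delta
                    n_kw[k][w] += delta
                    n_k[k] += delta
                gamma[d][i] = new
    return gamma, n_dk, n_kw


if __name__ == "__main__":
    toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 3]]
    gamma, n_dk, n_kw = cvb0_lda(toy_docs, n_topics=2, vocab_size=4)
    for d, doc_gamma in enumerate(gamma):
        print(d, [[round(p, 2) for p in g] for g in doc_gamma])
```

Each `gamma[d][i]` is a probability vector over topics for one token. Values of this kind, computed per pixel, are what a grayscale rendering of topic probabilities would display.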
Segmenting citation data with latent permutations
We propose a new citation segmentation method based on the latent permutation approach of Chen et al. [2].
The generalized Mallows model is used to extend LDA so that it can mine topic sequences.
We proposed an unsupervised method [Masada+ WISS2010] and its semi-supervised version [Masada+ ICADL2011].
The figure above shows an example segmentation obtained by our semi-supervised method.
This is joint work with Prof. Atsuhiro Takasu at NII.
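The generalized Mallows model mentioned above defines a distribution over permutations through inversion counts. The sketch below samples a permutation from it; it illustrates the distribution only, not our segmentation method, and the function names and the dispersion parameters `rho` are chosen here for illustration.

```python
import math
import random


def sample_inversion_counts(rho, rng):
    """Draw v[j] in {0, ..., n-1-j} with probability proportional to
    exp(-rho[j] * v[j]), independently for each j, where n = len(rho) + 1."""
    n = len(rho) + 1
    counts = []
    for j, r in enumerate(rho):
        support = range(n - j)
        weights = [math.exp(-r * k) for k in support]
        counts.append(rng.choices(support, weights=weights)[0])
    return counts


def permutation_from_inversions(v):
    """Build the permutation of 0..n-1 in which exactly v[j] items larger
    than j appear before item j."""
    n = len(v) + 1
    perm = [n - 1]
    # Insert items from largest to smallest; every item already placed is
    # larger than j, so position v[j] leaves v[j] larger items before j.
    for j in range(n - 2, -1, -1):
        perm.insert(v[j], j)
    return perm


def inversions_from_permutation(perm):
    """Recover the inversion counts v[0..n-2] from a permutation."""
    counts = []
    for j in range(len(perm) - 1):
        pos = perm.index(j)
        counts.append(sum(1 for x in perm[:pos] if x > j))
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    v = sample_inversion_counts([2.0] * 5, rng)
    print("inversion counts:", v)
    print("permutation:", permutation_from_inversions(v))
```

Larger values in `rho` concentrate the distribution around the identity ordering, which is how the model can express a preferred order of topics while still allowing deviations.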
Extracting topical trends
Figure panels: vanilla LDA, LYNDA [Masada+ CIKM2009], and BToT [Masada+ ISNN2010].
We propose two methods, LYNDA and BToT, for extracting topical trends from a document set.
The three figures above show the results obtained by LDA, LYNDA [Masada+ CIKM2009], and BToT [Masada+ ISNN2010].
Each colored region corresponds to a different topic.
The vertical axis represents document dates, ranging from Jan. 1, 2002 to Dec. 31, 2005.
The horizontal axis represents topic popularity at each date.
The analyzed data is a set of Yomiuri newspaper articles.
You can try an interactive presentation of the results at the following places: LDA, LYNDA, and BToT.
This is joint research with Prof. Atsuhiro Takasu at NII.
Extracting per-topic temporal transitions of popular words from parallel corpora
We propose a new topic model [Masada+ PAKDD2011] for extracting temporal transitions of word probabilities for each topic.
Our model is extended for parallel corpus analysis and is applied to Chinese-English abstracts of computer science papers.
The years of the abstracts range from 2000 to 2009.
We show only five of the dozens of extracted topics.
Each topic is represented by its three most probable Chinese words and three most probable English words in each year.
No Chinese-English dictionaries are used.
This is joint research with Prof. Atsuhiro Takasu at NII. The dataset was collected and cleaned up by Haipeng Zhang.
[1] A. Torralba, R. Fergus, and W. T. Freeman. Tiny Images. MIT-CSAIL-TR-2007-024, 2007.
[2] H. Chen, S. R. K. Branavan, R. Barzilay, and D. R. Karger. Global Models of Document Structure Using Latent Permutations. NAACL/HLT 2009.