Han Xiao

Han Xiao in Taiwan 2013.09

Ph.D. candidate
Room 01.08.061
xiaoh@in.tum.de
Institute of Informatics
+49 (0)89 289-18590
Technische Universität München

Bio

I'm a Ph.D. candidate at Technische Universität München (TUM), where I received my M.Sc. in 2011. My research interests include online learning, semi-supervised learning, active learning, Gaussian processes, support vector machines and probabilistic graphical models, as well as their applications in knowledge discovery. My advisors are Claudia Eckert and René Brandenberg. From Sept. 2013 to Jan. 2014, I was a visiting scholar in Shou-De Lin's Machine Discovery and Social Network Mining Lab at National Taiwan University.

I'm currently looking for an R&D job related to data mining, knowledge discovery or machine learning. My expected graduation date is Sept. 2014.

Publications

  1. Han Xiao and Claudia Eckert, “Efficient Online Sequence Prediction with Side Information,” in 13th IEEE International Conference on Data Mining (ICDM), Dec. 2013. (AR: 19%).
    [BibTeX] [Abstract] [PDF] [Slides] [Code]

    @conference{HanXiao2013c,
      author = {Han Xiao and Claudia Eckert},
      title = {Efficient Online Sequence Prediction with Side Information},
      booktitle = {13th IEEE International Conference on Data Mining (ICDM '13)},
      year = {2013},
      address = {Dallas, TX USA},
      month = dec
    }
    
    Abstract:  Sequence prediction is a key task in machine learning and data mining. It
      involves predicting the next symbol in a sequence given its previous symbols.
      Our motivating application is predicting the execution path of a process on an
      operating system in real-time. In this case, each symbol in the sequence
      represents a system call accompanied with arguments and a return value. We
      propose a novel online algorithm for predicting the next system call by
      leveraging both context and side information. The online update of our
      algorithm is efficient in terms of time cost and memory consumption.
      Experiments on real-world data sets showed that our method outperforms
      state-of-the-art online sequence prediction methods in both accuracy and
      efficiency, and incorporation of side information does significantly improve
      the predictive accuracy.
    
  2. Han Xiao and Claudia Eckert, “Lazy Gaussian Process Committee for Real-Time Online Regression,” in 27th AAAI Conference on Artificial Intelligence (AAAI), July 2013. (AR: 29%).
    [BibTeX] [Abstract] [PDF] [Slides] [Code]

    @conference{HanXiao2013b,
      author = {Han Xiao and Claudia Eckert},
      title = {Lazy Gaussian Process Committee for Real-Time Online Regression},
      booktitle = {27th AAAI Conference on Artificial Intelligence (AAAI '13)},
      year = {2013},
      address = {Washington, USA},
      month = jul
    }
    
    Abstract:  A significant problem of Gaussian processes (GPs) is their
    unfavorable scaling with a large amount of data. To overcome this issue, we
    present a novel GP approximation scheme for online regression. Our model is
    based on a combination of multiple GPs with random hyperparameters. The model is
    trained by incrementally allocating new examples to a selected subset of GPs.
    The selection is carried out efficiently by optimizing a submodular function.
    Experiments on real-world data sets showed that our method outperforms existing
    online GP regression methods in both accuracy and efficiency. The applicability
    of the proposed method is demonstrated by the mouse-trajectory prediction in an
    Internet banking scenario.
    
  3. Han Xiao, Huang Xiao and Claudia Eckert, “Learning from Multiple Observers with Unknown Expertise,” in 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Apr. 2013. (AR: 29%).
    [BibTeX] [Abstract] [PDF] [Slides] [Code]

    @conference{HanXiao2013a,
      author = {Han Xiao and Huang Xiao and Claudia Eckert},
      title = {Learning from Multiple Observers with Unknown Expertise},
      booktitle = {17th Pacific-Asia Conference on Knowledge Discovery and Data
      Mining (PAKDD '13)},
      year = {2013},
      address = {Australia},
      month = apr
    }
    
    Abstract:  The Internet has emerged as a powerful technology for collecting
    labeled data from a large number of users around the world at very low cost.
    Consequently, each instance is often associated with a handful of labels,
    precluding any assessment of an individual user's quality. We present a
    probabilistic model for regression when there are multiple observers, some of
    them unreliable, providing continuous responses. Our approach simultaneously
    learns the regression function and the expertise of each observer, which
    allows us to predict the ground truth and the observers' responses on new
    data. Experimental results on both synthetic and real-world data sets indicate
    that the proposed method has clear advantages over the “taking the average”
    baseline and some state-of-the-art models.
    
  4. Huang Xiao, Han Xiao and Claudia Eckert, “OPARS: Objective Photo Aesthetics Ranking System,” in 34th European Conference on Information Retrieval (ECIR), Mar. 2013. (Demo paper).
    [BibTeX] [Abstract] [PDF] [Poster] [Demo]

    @conference{HuangHan2013a,
      author = {Huang Xiao and Han Xiao and Claudia Eckert},
      title = {OPARS: Objective Photo Aesthetics Ranking System},
      booktitle = {34th European Conference on Information Retrieval (ECIR'13)},
      year = {2013},
      address = {Moscow, Russia},
      month = mar
    }
    
    Abstract:  Learning to predict the aesthetics score of an image is an
      important open problem in content-based image retrieval. Even though image
      ratings can be acquired from online communities, the substantial amount of
      disagreement among users makes them inappropriate as training data.
      Motivated by this problem, we present a method that learns a regression
      function when the objective ground truth is unknown and only subjective
      responses are available. Experimental results on the semi-synthetic data set
      show the effectiveness of the proposed algorithm. Interesting results on a
      real-world image rating data set are discussed.
    
  5. Han Xiao, Huang Xiao and Claudia Eckert, “Adversarial Label Flips Attack on Support Vector Machines,” in 20th European Conference on Artificial Intelligence (ECAI), Aug. 2012. (AR: 28%).
    [BibTeX] [Abstract] [PDF] [Slides] [Code] [More]

    @inproceedings{HanXiao2012b,
      author = {Han Xiao and Huang Xiao and Claudia Eckert},
      title = {Adversarial Label Flips Attack on Support Vector Machines},
      booktitle = {20th European Conference on Artificial Intelligence},
      year = {2012}
    }
    
    Abstract:  To develop a robust learning algorithm in the adversarial
    setting, it is important to understand the adversary's strategy. We address the
    problem of label flips attack where an adversary contaminates the training data
    through flipping labels. We analyze the objective of the adversary and formulate
    an optimization problem for finding the optimal label flips under a given
    budget. An attack algorithm targeting support vector machines (SVMs) is derived.
    Experiments demonstrate that the performance of SVMs is significantly degraded
    under the attack.
    
  6. Han Xiao, T. Stibor and Claudia Eckert, “Evasion Attack of Multi-Class Linear Classifiers,” in 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), May 2012. (AR: 36%).
    [BibTeX] [Abstract] [PDF] [Slides]

    @inproceedings{HanXiao2012a,
      author = {Han Xiao and Thomas Stibor and Claudia Eckert},
      title = {Evasion Attack of Multi-Class Linear Classifiers},
      booktitle = {16th Pacific-Asia Conference on Knowledge Discovery and Data Mining},
      year = {2012}
    }
    
    Abstract:  Machine learning has yielded significant advances in
      decision-making for complex systems, but are they robust against adversarial
      attacks? We generalize the evasion attack problem to multi-class linear
      classifiers, and present an efficient algorithm for approximating the optimal
      disguised instance. Experiments on real-world data demonstrate the
      effectiveness of our method.
    
  7. Han Xiao and T. Stibor, “Supervised Topic Transition Model for Detecting Malicious System Call Sequences,” in SIGKDD Workshop on Knowledge Discovery, Modeling, and Simulation, Aug. 2011. (Best Student Paper Award).
    [BibTeX] [Abstract] [PDF]

    @inproceedings{HanXiao2011,
      author = {Han Xiao and Thomas Stibor},
      title = {Supervised Topic Transition Model for Detecting Malicious System Call Sequences},
      booktitle = {SIGKDD Workshop on Knowledge Discovery, Modeling, and Simulation},
      year = {2011},
      note = {(Best Student Paper Award)}
    }
    
    Abstract:  We propose a probabilistic model for behavior-based malware
    detection that jointly models sequential data and class labels. Given labeled
    sequences (harmless/malicious), our goal is to reveal behavior patterns and
    exploit them to predict class labels of unknown sequences. The proposed model is
    a novel extension of supervised latent Dirichlet allocation with an estimation
    algorithm that alternates between Gibbs sampling and gradient descent.
    Experiments on a real-world data set show that our model can learn meaningful
    patterns, and provides competitive performance on the malware detection task.
    Moreover, we parallelize the training algorithm and demonstrate scalability with
    varying numbers of processors.
    
  8. Han Xiao and T. Stibor, “Toward Artificial Synesthesia: Linking Images and Sounds via Words,” in NIPS Workshop on Machine Learning for next generation Computer Vision challenges, Dec. 2010.
    [BibTeX] [Abstract] [PDF]

    @inproceedings{HanXiao2010,
      author = {Han Xiao and Thomas Stibor},
      title = {Toward Artificial Synesthesia: Linking Images and Sounds via Words},
      booktitle = {NIPS workshop on Machine Learning for next generation Computer Vision challenges},
      year = {2010}
    }
    
    Abstract:  We tackle a new challenge of modeling a perceptual experience
    in which a stimulus in one modality gives rise to an experience in a different
    sensory modality, termed synesthesia. To meet the challenge, we propose a
    probabilistic framework based on graphical models that makes it possible to
    link visual and auditory modalities via natural language text. An online
    prototype system is developed to let human judgement evaluate the
    model’s performance. Experimental results indicate the usefulness and
    applicability of the framework.
    
  9. Han Xiao and T. Stibor, “Efficient Collapsed Gibbs Sampling For Latent Dirichlet Allocation,” in 2nd Asian Conference on Machine Learning (ACML), Oct. 2010. (AR: 31%)
    [BibTeX] [Abstract] [PDF]

    @inproceedings{Xiao2010,
      author = {Han Xiao and Thomas Stibor},
      title = {Efficient Collapsed Gibbs Sampling For Latent Dirichlet Allocation},
      booktitle = {2nd Asian Conference on Machine Learning},
      year = {2010}
    }
    
    Abstract:  Collapsed Gibbs sampling is a frequently applied method to
    approximate intractable integrals in probabilistic generative models such as
    latent Dirichlet allocation. However, this sampling method has the crucial
    drawback of high computational complexity, which limits its applicability to
    large data sets. We propose a novel dynamic sampling strategy to significantly
    improve the efficiency of collapsed Gibbs sampling. The strategy is explored in
    terms of efficiency, convergence and perplexity. In addition, we present a
    straightforward parallelization to further improve the efficiency. Finally, we
    underpin our proposed improvements with a comparative study on data sets of
    different scales.
    
  10. Han Xiao, Xiaojie Wang and Chao Du, “Injecting Structured Data to Generative Topic Model in Enterprise Settings,” in 1st Asian Conference on Machine Learning (ACML), Oct. 2009. (AR: 25%)
    [BibTeX] [Abstract] [PDF]

    @inproceedings{HanXiao2009,
      author = {Han Xiao and Xiaojie Wang and Chao Du},
      title = {Injecting Structured Data to Generative Topic Model in Enterprise Settings},
      booktitle = {1st Asian Conference on Machine Learning},
      year = {2009}
    }
    
    Abstract:  Enterprises have accumulated both structured and unstructured
    data steadily as computing resources improve. However, previous research on
    enterprise data mining often treats these two kinds of data independently and
    omits mutual benefits. We explore an approach to incorporating a common type of
    structured data (i.e., an organigram) into a generative topic model. Our
    approach, the Partially Observed Topic model (POT), not only considers the
    unstructured words, but also takes into account the structured information in
    its generation process. By integrating the structured data implicitly, the
    mixed topics over documents are partially observed during the Gibbs sampling
    procedure. This allows POT to learn topics pertinently and directionally,
    which makes it easy to tune and suitable for end-use applications. We evaluate
    our proposed model on a real-world dataset and show improved expressiveness
    over traditional LDA. In the task of document classification, POT also
    demonstrates more discriminative power than LDA.
    
  11. Han Xiao and Xiaojie Wang, “Constructing Parallel Corpus from Movie Subtitles,” in 22nd International Conference on the Computer Processing of Oriental Languages (ICCPOL), Mar. 2009.
    [BibTeX] [Abstract] [PDF]

    @inproceedings{Xiao2009,
      author = {Han Xiao and Xiaojie Wang},
      title = {Constructing Parallel Corpus from Movie Subtitles},
      booktitle = {22nd International Conference on the Computer Processing of Oriental Languages},
      year = {2009}
    }
    
    Abstract:  This paper describes a methodology for constructing aligned
    German-Chinese corpora from movie subtitles. The corpora will be used to train a
    special machine translation system with the intention of automatically
    translating subtitles between German and Chinese. Since the common length-based
    algorithm for alignment shows weakness on short spoken sentences, especially on
    those from different language families, this paper studies the use of dynamic
    programming based on time-shift information in subtitles, and extends it with
    statistical lexical cues to align the subtitles. In our experiment with around
    4,000 Chinese and German sentences, the proposed alignment approach yields 83.8%
    precision. Furthermore, it is language-independent, and leads to a general
    method for building parallel corpora between different language families.
    

Technical notes

Munich Data Geeks: Adversarial and Robust Learning. (PDF) July 2013
How to Write a Seminar Report. (PDF) Jan. 2013
Take-home Message of EM Algorithm and its Implementation. (PDF) July 2012
Robust Machine Learning in the Adversarial Settings. (Slides) Apr. 2012
A Tutorial of Evading the Binary Linear Classifier. (PDF) Aug. 2011
Improving Object Annotation via Adjectives and Verbs: a Margin-based Approach. (Slides) July 2011
Pair-copula Construction for Non-Gaussian Graphical Models: A case study on Benin 1996 child's undernutrition data set. (PDF) (Code) Feb. 2011
Derivation of Variational Inference for Correspondence-LDA model. (PDF) Sept. 2010
Towards Parallel and Distributed Computing in Large-Scale Data Mining. (PDF) Feb. 2010
Graphical Representation, Generative Model and Gibbs Sampling. (Slides) May 2009
Derivation of Gibbs Sampling for LDA and Topic-over-Time model (Chinese). (PDF) Mar. 2009
Thinking in EM algorithm - from GMM to general view (Chinese). (Slides) Mar. 2009

Code & tools

  • EOSP: Efficient online sequence prediction.
  • vistrace: a visualization tool for UNIX system calls (gallery).
  • MEX-compatible C code for computing sparse string kernel matrix.
  • A modified version of Neil Lawrence's Gaussian process toolbox (original). My GP implementations depend on the modified code.
  • A toy example of SVM boosting. A modified libsvm that supports instance weights is required.
  • If you use Dropbox to sync papers between a PC and an iPad, here is a Python script that links PDFs in your Download folder to Dropbox.
  • If you prefer a high-contrast theme, here is an AppleScript that switches to white-on-black when you use Preview, Finder and Matlab.
  • TSE: a search engine for theorems and statements.
  • Artificial synesthesia: a demo for predicting sounds from images, and vice versa.
  • ECGS: efficient collapsed Gibbs sampling for latent Dirichlet allocation.
  • CorrLDA: a C++ implementation of correspondence latent Dirichlet allocation.

Misc


I confess that I've been blind as a mole. But it's better to learn wisdom late than never know it at all.

– Sherlock Holmes

Date: 2014-03-08 23:47:05 CET

Author: Han Xiao
