### abstract ###
Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain)
Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects
Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned
Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync
One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets
This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies
It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm
Our method is able to scale to large data sets through the use of sampling
We validate our method on both light-curve data and other time series data sets
We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results
We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD's reported anomalies are comparable to or better than all other methods
Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena
Keywords: anomaly detection, time series data
### introduction ###
Quasars  CITATION , radio pulsars  CITATION , and cosmic gamma-ray bursts  CITATION  were all discovered by alert scientists who, while examining data for a primary purpose, encountered aberrant phenomena whose further study led to these legendary discoveries
Such discoveries were possible in an era when scientists had a close connection with their data
The advent of massive data sets renders unexpected discoveries through manual inspection improbable if not impossible
Fortunately, automated anomaly detection programs may resurrect this mode of discovery and identify atypical phenomena indicative of novel astronomical objects
Our research applies anomaly detection to photometric time series data, called light-curve data
Our specific application is to find anomalies in sets of light-curves of periodic variable stars
Most stars, like our own Sun, are of almost constant luminosity, whereas variable stars undergo significant variations
There are over 350,000 cataloged variable stars with more being discovered
The 2003 General Catalogue of Variable Stars  CITATION  lists known and suspected variable stars in our own galaxy, as well as 10,000 in other galaxies
For  periodic  variable stars, the period of the star can be established
Common types of periodic variable stars include Cepheids, Eclipsing Binaries, and RR Lyrae, details of which can be found in  CITATION
The study of periodic variable stars is of great importance to astronomy
For example, the study of Cepheids yielded the most valuable method for determining the Hubble constant, and the study of binary stars enabled the discovery of a star's true mass
Finding a new class or subclass of variable stars will be of tremendous value
Figure  shows a typical light-curve from each star class before and after we perform our pre-processing techniques (described in Section )
The y-axis measures the magnitude of brightness of the star
Magnitude is an inverse logarithmic measure of brightness (a brighter observation has a smaller magnitude), so the y-axis is plotted with descending values
The x-axis measures folded time
A folded light-curve is a light-curve in which all periods are mapped onto a single period, which is why there may be multiple magnitude values for a single time point
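As a minimal sketch of folding, assuming the period is already known: each observation time is reduced to a phase in [0, 1) and the observations are sorted by phase. The function name `fold_light_curve`, the `epoch` parameter, and the synthetic data are illustrative assumptions, not part of the paper's pre-processing.

```python
import numpy as np

def fold_light_curve(times, magnitudes, period, epoch=0.0):
    """Map observation times onto a single period and sort by phase."""
    times = np.asarray(times, dtype=float)
    magnitudes = np.asarray(magnitudes, dtype=float)
    # Phase in [0, 1): fraction of a period elapsed since the reference epoch.
    phase = ((times - epoch) / period) % 1.0
    order = np.argsort(phase, kind="stable")
    return phase[order], magnitudes[order]

# Synthetic star (assumed 2.5-day period) observed at irregular times over 40 days.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 40.0, size=200))
mag = 15.0 + 0.3 * np.sin(2 * np.pi * t / 2.5)

phase, folded_mag = fold_light_curve(t, mag, period=2.5)
```

Because observations from many cycles land on the same phase axis, a noisy star would show several magnitude values near any given phase.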
We describe light-curves and the process of folding in more detail in Section 
Our research is motivated by the challenges inherent to performing anomaly detection on large sets of periodic variable light-curves
Several of these challenges are common to many time series data sets
Each light-curve contains a large number of time-points (high dimensionality), the signal-to-noise ratio is low, and the volume of data is large
Indeed, new surveys, such as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), have the capacity to produce light-curves for billions of stars  CITATION
Any technique developed for light-curves must scale to very large data sets
A unique challenge of working with light-curve data is that the periods of the light-curves are not synchronized because each is generated by a different source (star)
To understand why phasing poses such a challenge for anomaly detection in this domain, consider Figure , which illustrates how two similar light-curves may appear dissimilar under a similarity measure like Euclidean distance if a phase adjustment is not performed
The top panel shows two similar light-curves whose phases are not synchronized
The middle panel shows the square of the correlation plotted as a function of the phase adjustment
The maximum similarity occurs at a phase shift of approximately 0.3
The bottom panel shows the two light-curves after the dotted light-curve is shifted by this amount
We define the optimal phase shift between two light-curves as the shift that yields the maximum similarity value
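The definition above can be illustrated with a brute-force search. This is a sketch under stated assumptions: both curves are folded and resampled onto the same uniform grid of phase bins, a phase shift is a circular shift by a whole number of bins, and plain Pearson correlation is the similarity measure (the figure plots its square). The name `best_phase_shift` is hypothetical.

```python
import numpy as np

def best_phase_shift(reference, curve):
    """Return (shift, correlation) for the circular shift of `curve`
    that maximizes its Pearson correlation with `reference`."""
    reference = np.asarray(reference, dtype=float)
    curve = np.asarray(curve, dtype=float)
    best_shift, best_corr = 0, -np.inf
    for shift in range(len(curve)):
        corr = np.corrcoef(reference, np.roll(curve, shift))[0, 1]
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr

n_bins = 100
phase = np.arange(n_bins) / n_bins
a = np.sin(2 * np.pi * phase) + 0.5 * np.sin(4 * np.pi * phase)
b = np.roll(a, -30)  # same shape, 0.3 of a period out of phase

shift, corr = best_phase_shift(a, b)
before = np.linalg.norm(a - b)                  # Euclidean distance, unphased
after = np.linalg.norm(a - np.roll(b, shift))   # Euclidean distance, rephased
```

Here the two curves have identical shapes, yet their unphased Euclidean distance is large; after the shift it drops to zero. Trying every shift for every pair of curves is the costly alignment step the abstract refers to.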
This phasing problem presents a challenge to both general anomaly detection techniques, and those developed specifically for time series
A general anomaly detection method, even with a metric that works for unphased data, may not work out of the box
With regard to time series anomaly detection techniques, our task of finding anomalies in  SYMBOL  distinct time series differs from most prior work, which assumes a single contiguous time series (not necessarily periodic) in which anomalous sub-regions are sought
PCAD is our solution to the problem of anomaly detection on large sets of unsynchronized periodic time series
The heart of PCAD is a modified k-means clustering algorithm, called Phased K-means (Pk-means), that runs on a sampling of the data
Pk-means differs from k-means in that it re-phases each time series prior to similarity calculation and updates centroids from these rephased curves
Because Pk-means is a modification of k-means, we provide a proof that Pk-means does not break k-means's convergence properties
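The two modifications described above, rephasing each time series before the similarity calculation and updating centroids from the rephased curves, might be sketched as follows. This is not the authors' implementation: it assumes curves on a common uniform phase grid, Euclidean distance minimized over circular shifts, and a simple centroid-change stopping rule. The names `rephase_to`, `pk_means`, and `init_idx` are hypothetical.

```python
import numpy as np

def rephase_to(centroid, curve):
    """Return the circular shift of `curve` closest (Euclidean) to `centroid`,
    together with that distance."""
    # Row s holds the curve rolled by s bins.
    shifts = np.stack([np.roll(curve, s) for s in range(len(curve))])
    dists = np.linalg.norm(shifts - centroid, axis=1)
    best = int(np.argmin(dists))
    return shifts[best], dists[best]

def pk_means(curves, k, n_iter=20, init_idx=None, seed=0):
    """Sketch of phased k-means: rephase before comparing, then average
    the rephased members of each cluster."""
    curves = np.asarray(curves, dtype=float)
    if init_idx is None:
        rng = np.random.default_rng(seed)
        init_idx = rng.choice(len(curves), size=k, replace=False)
    centroids = curves[np.asarray(init_idx)].copy()
    labels = np.zeros(len(curves), dtype=int)
    for _ in range(n_iter):
        aligned = np.empty_like(curves)
        for i, curve in enumerate(curves):
            # Compare the curve to every centroid at its best phase shift.
            candidates = [rephase_to(c, curve) for c in centroids]
            labels[i] = int(np.argmin([d for _, d in candidates]))
            aligned[i] = candidates[labels[i]][0]
        new_centroids = centroids.copy()
        for j in range(k):
            members = aligned[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return centroids, labels

# Two synthetic shapes, each copy randomly shifted in phase.
n_bins = 50
phase = np.arange(n_bins) / n_bins
sine = np.sin(2 * np.pi * phase)
saw = 2.0 * phase - 1.0
rng = np.random.default_rng(1)
curves = np.array([np.roll(shape, rng.integers(n_bins))
                   for shape in [sine, saw] * 10])

# Seed one centroid from each shape so the toy example is deterministic.
centroids, labels = pk_means(curves, k=2, init_idx=[0, 1])
```

In this toy example every sine lands in one cluster and every sawtooth in the other, regardless of the random phase each copy was given.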
The Pk-means subroutine runs offline on a sampling of the data
The use of sampling enables PCAD to scale to large data sets
The online portion of PCAD is the calculation of the anomaly score for each time series from the set of centroids produced offline by Pk-means
This operation is linear in the size of the data set
Another advantage of PCAD is its flexibility to discover two types of anomalies: local and global
We define the terms local and global anomaly and provide scoring methods for both
Once each time series is assigned an anomaly score, PCAD ranks the time series accordingly and outputs the top  SYMBOL  for review
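The online scoring and ranking step could look like the sketch below. The scoring rule used here, the distance from a curve to its nearest centroid after the best circular phase shift, is an assumption made for illustration; the text so far says only that scores are computed relative to the centroids, and this sketch does not distinguish local from global anomalies. The names `phased_distance` and `rank_anomalies` are hypothetical.

```python
import numpy as np

def phased_distance(centroid, curve):
    """Smallest Euclidean distance between `centroid` and any circular
    shift of `curve`."""
    shifts = np.stack([np.roll(curve, s) for s in range(len(curve))])
    return float(np.linalg.norm(shifts - centroid, axis=1).min())

def rank_anomalies(curves, centroids, top_n):
    """Score each curve by its phased distance to the nearest centroid
    (assumed rule) and return the indices of the top_n scores plus all scores."""
    scores = np.array([min(phased_distance(c, curve) for c in centroids)
                       for curve in curves])
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:top_n], scores

n_bins = 50
phase = np.arange(n_bins) / n_bins
# Stand-ins for centroids produced offline.
centroids = np.array([np.sin(2 * np.pi * phase), 2.0 * phase - 1.0])
curves = np.array([
    np.roll(centroids[0], 7),
    np.roll(centroids[1], 21),
    np.abs(np.sin(2 * np.pi * phase)),  # shape unlike either centroid
    np.roll(centroids[0], 33),
])

top, scores = rank_anomalies(curves, centroids, top_n=1)
```

With the centroids fixed, each curve is scored independently in a single pass, which is consistent with the linear cost in the size of the data set stated above.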
To our knowledge, PCAD is the only anomaly detection method developed specifically for unsynchronized time series data that can output both global and local outliers
Our paper presents empirical evidence on four data sets that PCAD effectively finds known anomalies and produces a better ranking of anomalies when compared to naive solutions and other state-of-the-art anomaly detection methods for time series
We discuss the effect of sample size and the parameter  SYMBOL  (used by Pk-means) on the anomaly detection results, and show experimental results on light-curve data with an unknown number of anomalies
Our paper concludes with an astrophysicist's discussion of the significance of the anomalies found by PCAD
