Big data

Article by Adam Thomas, August 30, 2016

UD researcher works to make understanding big data easier

Big data touches everything from baseball to brain imaging to biology, but two types of data at the heart of functional data analysis, multidimensional and multivariate functional data, can pose challenges because of their high dimension and complex structure.

The University of Delaware’s Xiaoke Zhang is working to remedy this situation by conducting research that will substantially narrow the gap between the handling of big data and the statistical methods and computational tools used to understand it.

The research is funded by a $137,981 grant from the National Science Foundation’s Division of Mathematical Sciences.

Zhang, assistant professor in the Department of Applied Economics and Statistics in UD’s College of Agriculture and Natural Resources, is the principal investigator on the project along with Raymond Ka Wai Wong from Iowa State University.

The findings could have major benefits in helping scientists better understand information gleaned through big data in a host of fields, including health care and the environment.

Multidimensional and multivariate functional data

Zhang said that functional data, which plays an important role in the era of big data, can be as simple as a function measured over time or space, such as the growth curve for a child.

“Right now, we are able to measure data points much faster and we have a better capability of storing that amount of data, so we are able to measure a more complicated form of functional data. That’s where multidimensional and multivariate functional data come in,” said Zhang.

Multidimensional functional data is data observed over multiple dimensions, such as time and space together, with the brain volume of a person as one case. Researchers could, for instance, measure the brain signals of several people over 10 minutes using a magnetic resonance imaging (MRI) machine and end up with one three-dimensional image every two seconds.

Multivariate functional data differs from multidimensional data in that several quantities are recorded at once. When measuring children’s growth over time, for example, researchers could record weight, blood pressure and other factors alongside height to look at how these different features interact with each other.
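The distinction between the two data types can be pictured as a difference in array layout. The sketch below is only an illustration: the 10-minute scan and the two-second interval come from the MRI example above, while the number of subjects, the voxel grid, the number of visits and all variable names are hypothetical.

```python
import numpy as np

# Multidimensional functional data: the MRI example from the text.
scan_seconds = 10 * 60            # a 10-minute scan
seconds_per_volume = 2            # one 3-D image every two seconds
n_volumes = scan_seconds // seconds_per_volume   # images per person

n_subjects = 5                    # hypothetical number of people
voxel_grid = (8, 8, 4)            # hypothetical, far coarser than a real scan

# One function per subject, observed over time AND three spatial dimensions.
brain = np.zeros((n_subjects, n_volumes) + voxel_grid)

# Multivariate functional data: the child-growth example from the text.
n_children = 5                    # hypothetical
n_visits = 12                     # hypothetical number of measurement times
variables = ["height", "weight", "blood_pressure"]

# Several functions per child, all observed over one dimension (time).
growth = np.zeros((n_children, n_visits, len(variables)))

print(n_volumes)      # 300
print(brain.shape)    # (5, 300, 8, 8, 4)
print(growth.shape)   # (5, 12, 3)
```

In the first array the extra axes are dimensions of the domain (where and when the signal is observed); in the second, the last axis indexes different quantities measured on the same domain.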

“For example, we may use one measurement to predict the other, or sometimes we are only interested in their associations,” said Zhang.

Multidimensional and multivariate functional data are now increasingly common, but their high dimension and sheer size make them extremely difficult to analyze.

“The structures inside [the data] can be very complicated, so we may not be able to analyze their underlying mechanism easily anymore. That involves a lot of statistical analysis and also some computational algorithms to analyze, so part of the overall objectives in this project is to provide statistical methods and computational tools for handling such data in practice,” said Zhang.

Understanding data

To help professionals across a number of interdisciplinary fields — such as neuroscience, climate change and engineering — better understand and apply multidimensional and multivariate functional data, Zhang said that the researchers will develop three projects within the overall proposal.

The first project concerns how to estimate the covariance function, which plays an important role in all statistical areas because it tells researchers about the variability of the data and about how strongly measurements at different points in time or space are correlated.

Covariance function estimation is also critical for subsequent analysis because it can serve as a building block for more advanced approaches.
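To make the idea of a covariance function concrete, here is a minimal sketch that computes the plain sample covariance of simulated curves observed on a common time grid. It is a textbook estimator applied to made-up data, not the estimator the project proposes; the curve model, grid size and all names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_curves, n_points = 200, 50          # hypothetical sample and grid sizes
t = np.linspace(0.0, 1.0, n_points)   # common observation times

# Simulated curves: random amounts of a sine and a cosine wave, plus noise.
scores = rng.normal(size=(n_curves, 2))
curves = (scores[:, [0]] * np.sin(2 * np.pi * t)
          + scores[:, [1]] * np.cos(2 * np.pi * t))
curves += 0.1 * rng.normal(size=curves.shape)

# Sample covariance function on the grid: cov_hat[i, j] estimates
# the covariance between the curve values at times t[i] and t[j].
centered = curves - curves.mean(axis=0)
cov_hat = centered.T @ centered / (n_curves - 1)

# The diagonal gives the variability at each time point ...
variance = np.diag(cov_hat)

# ... and rescaling gives the correlation between pairs of time points.
sd = np.sqrt(variance)
corr_hat = cov_hat / np.outer(sd, sd)

print(cov_hat.shape)   # (50, 50)
```

The diagonal of `cov_hat` answers the "how variable is the data" question, and the off-diagonal entries answer the "which time points move together" question.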

“Most of the time, we do have some properties we know and we’d like to respect these properties, however in functional data analysis most of the methods cannot do that automatically. People may have to have a raw, initial estimate, and then they tailor it afterwards. So that complicates the computational part and also the statistical analysis part,” said Zhang. “Our goal for this project is to design an automatic one-step approach to have a covariance estimator that can respect the properties we prefer and also sometimes we can reduce dimension dramatically using the data. So let the data talk and tell us what the final estimator will look like.”
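The two-step routine Zhang describes, a raw estimate that is tailored afterwards, can be sketched as follows. The article does not say which properties the researchers have in mind; this example assumes one common requirement, that a covariance estimate be positive semi-definite, and repairs a noisy raw estimate by clipping negative eigenvalues. It illustrates the workaround the project aims to replace, not the proposed one-step method, and the function name and simulated inputs are hypothetical.

```python
import numpy as np

def project_to_psd(raw_cov):
    """Second step: force a raw estimate to be symmetric and
    positive semi-definite by clipping negative eigenvalues to zero."""
    sym = (raw_cov + raw_cov.T) / 2
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)

# A valid covariance on the grid (exponential kernel, assumed for the demo).
true_cov = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.3)

# First step stand-in: a "raw" estimate, simulated here as truth plus
# symmetric noise, which no longer respects positive semi-definiteness.
noise = rng.normal(scale=0.2, size=true_cov.shape)
raw_cov = true_cov + (noise + noise.T) / 2

psd_cov = project_to_psd(raw_cov)

print(np.linalg.eigvalsh(raw_cov).min() < 0)       # raw estimate is invalid
print(np.linalg.eigvalsh(psd_cov).min() > -1e-8)   # repaired estimate is valid
```

Because the repair happens after estimation, its effect on statistical accuracy has to be analyzed separately, which is the complication the quote points to.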

In the second project, the researchers will study multivariate functional data, seeking a procedure that does not depend on any particular model, so that it avoids the restrictions of regression methods while capturing both linear and nonlinear dependency.

“Just like we have different measurements from the same subject for height, weight, blood pressure and things like that, most of the time, to study the dependency between the different components, people would like to either use a correlation measure or a regression method. The correlation doesn’t depend on any model, but for the regression model, people need to assume what the model will look like,” said Zhang, who added that there are restrictions with both approaches.

“For example, for the correlation for functional data, it’s not easy to define a good correlation measure and also the correlation measure can only tell us more about linear dependency. However, we know that if the two components are correlated in a nonlinear way, the correlation measure cannot pick them up,” said Zhang.
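The limitation Zhang mentions is easy to reproduce with ordinary numbers rather than functions. In this small sketch, which uses made-up data and the standard Pearson correlation, one variable is completely determined by the other through a quadratic relationship, yet the correlation is essentially zero.

```python
import numpy as np

# A grid that is symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

y_linear = 2.0 * x + 1.0   # perfectly linear in x
y_nonlinear = x ** 2       # perfectly determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_nonlinear = np.corrcoef(x, y_nonlinear)[0, 1]

print(round(r_linear, 6))        # 1.0
print(abs(r_nonlinear) < 1e-10)  # True: correlation misses the dependence
```

A model-free dependence measure of the kind the second project seeks would need to flag the second relationship as well as the first.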

The third project will also study multivariate functional data but will consider a more complicated form, called a multivariate functional time series, which has a special structure.

“Imagine you observe climate data. Suppose we focus on one weather station and we may be able to observe the temperature, humidity, and other measurements of weather. Right now with the technology we have, we are able to observe them like these -- we can take the measurements for every hour for each day, so we will have 24 measurements in each day. These are the measurements for one day but meanwhile we can observe the weather across days, over a couple of days. We can believe that within each day, we may have some dependency structure but between days, we may also have some dependency structures. That’s what we call a functional time series in the sense that we observe a time series of functions,” said Zhang.
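The weather-station description maps naturally onto a three-way array: days, hours within a day, and measured variables. The sketch below simulates hourly readings and reshapes them that way. The 24 measurements per day come from the quote; the number of days, the simulated signal and every variable name are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n_days, hours_per_day = 60, 24            # 60 days is a hypothetical choice
variables = ["temperature", "humidity"]

hours = np.arange(n_days * hours_per_day)
daily_cycle = np.sin(2 * np.pi * (hours % hours_per_day) / hours_per_day)

# A day-to-day level that carries over from one day to the next.
day_level = np.zeros(n_days)
for d in range(1, n_days):
    day_level[d] = 0.8 * day_level[d - 1] + rng.normal()
level = np.repeat(day_level, hours_per_day)

temperature = 15 + 5 * daily_cycle + level + rng.normal(scale=0.5, size=hours.size)
humidity = 60 - 10 * daily_cycle - 2 * level + rng.normal(scale=2.0, size=hours.size)

# Long hourly record: one row per hour, one column per variable.
hourly = np.stack([temperature, humidity], axis=-1)

# Functional time series: one multivariate daily curve per day.
fts = hourly.reshape(n_days, hours_per_day, len(variables))
print(fts.shape)   # (60, 24, 2)

# Between-day dependency: lag-1 correlation of the daily mean temperature.
daily_mean = fts.mean(axis=1)
lag1 = np.corrcoef(daily_mean[:-1, 0], daily_mean[1:, 0])[0, 1]
```

Each slice `fts[d]` is one day's pair of curves; dependence along the second axis is the within-day structure, and dependence along the first axis is the day-to-day structure.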

Because the functions here are multivariate, the goal of the third project is to use several measurements to predict the others, which is called mutual prediction.

“It tells us in the short term which measurements can predict the others, but we’re more interested in the long term and whether we can say something about the prediction,” said Zhang.

For all three projects, Zhang said the researchers would provide not only statistical methodologies and computational methods but also a theoretical guarantee: “We would like to guarantee that when the sample size is large enough, our estimator could be very close to the true underlying function.”
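The kind of guarantee described here, an estimator that approaches the truth as the sample grows, can be illustrated with a small simulation. This sketch uses the simplest possible stand-in, the pointwise average of noisy curves as an estimate of a known mean function. The true function, the noise level and the sample sizes are all assumptions, and the project's own estimators are more involved.

```python
import numpy as np

rng = np.random.default_rng(3)

t = np.linspace(0.0, 1.0, 50)
true_mean = np.sin(2 * np.pi * t)   # assumed "true underlying function"

def max_error(n_curves):
    """Largest pointwise gap between the averaged curves and the truth."""
    curves = true_mean + rng.normal(size=(n_curves, t.size))
    estimate = curves.mean(axis=0)
    return np.abs(estimate - true_mean).max()

# The error should shrink as the number of observed curves grows.
errors = {n: max_error(n) for n in (20, 2000, 20000)}
for n, err in errors.items():
    print(n, round(err, 4))
```

A simulation like this only suggests the behavior; the guarantee Zhang refers to would be established mathematically.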

Zhang said he is hoping to collaborate with researchers from UD’s Center for Biomedical and Brain Imaging (CBBI) and possibly the Department of Psychological and Brain Sciences.