Whenever a scientific experiment is conducted, the results are turned into numbers, often producing huge datasets. To reduce the size of the data, computer programmers use algorithms that find and extract the principal features, those that represent the data's most salient statistical properties.
Reza Oftadeh, a doctoral student in the Department of Computer Science and Engineering at Texas A&M University who is advised by Dylan Shell, a faculty member in the department, has developed an algorithm, applicable to large datasets, that directly orders features from most salient to least.
“There are many ad hoc ways to extract these features using machine learning algorithms, but we now have a fully rigorous theoretical proof that our model can find and extract these prominent features from the data simultaneously, doing so in one pass of the algorithm,” Oftadeh said.
“For example, if you have hundreds of thousands of dimensions and want to find only 1,000 of the most prominent and order those 1,000, it is theoretically possible to do but not feasible in practice because the model would have to be run repeatedly on the dataset 1,000 times,” Oftadeh said.
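To make the contrast in Oftadeh's example concrete, here is a minimal sketch in Python. It is not the researchers' algorithm, which the article does not describe in detail. It uses two textbook stand-ins instead: power iteration with deflation, which recovers one principal direction per run and so needs k runs for k features, and a single singular value decomposition (SVD), which returns all directions already ordered by variance. The function names `top_k_by_deflation` and `top_k_in_one_pass` and the parameters `iters` and `seed` are hypothetical, chosen only for illustration.

```python
import numpy as np


def top_k_by_deflation(X, k, iters=500, seed=0):
    """Find k principal directions one at a time (k separate runs).

    Illustrative stand-in for the repeated approach, not the authors' method.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                # center each feature
    C = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
    components = []
    for _ in range(k):
        v = rng.normal(size=C.shape[0])    # random starting direction
        for _step in range(iters):
            v = C @ v                      # power iteration step
            v /= np.linalg.norm(v)
        lam = v @ C @ v                    # variance along this direction
        components.append(v)
        C = C - lam * np.outer(v, v)       # deflate: remove what was found
    return np.array(components)


def top_k_in_one_pass(X, k):
    """Find k principal directions, already ordered, from a single SVD.

    Standard PCA via SVD, used here only as an assumed point of comparison.
    """
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s[:k] ** 2 / (len(Xc) - 1)  # NumPy returns s in descending order
    return Vt[:k], variances
```

On a small dataset both functions return the same directions, up to sign. The difference is cost: the first repeats its inner loop once per feature, which is the kind of repetition the quote describes as impractical when 1,000 features are wanted from hundreds of thousands of dimensions.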
As a next step, the researchers plan to generalize their method into a unified framework for producing other machine learning methods, ones that can find the underlying structure of a dataset, extract its features, or both, by setting a small number of specifications.