Big data: potential and pitfalls
Collecting data is easier than ever, but ensuring we unlock its secrets is more complicated. By Anna Mouton.
Humans excel at identifying patterns — we extract general principles from our experience and use those to deal with new situations. “Humans are incredible in some contexts,” says Prof. Anton Basson. “But we can’t handle very large amounts of data where the patterns are not obvious.”
Basson heads the Mechatronic, Automation and Design Research Group at the Department of Mechanical and Mechatronic Engineering at Stellenbosch University. His research includes helping companies organise their data and apply machine learning to make sense of it.
“Machine learning is pattern recognition,” he explains. “The machine says, if I have that pattern, then that is the result. And the machine can recognise very complicated patterns that humans cannot, but it requires a lot of patterns to learn from.”
Machine learning pros and cons
“Machine-learning models are very good at prediction,” says agricultural economist Prof. Jan Greyling. He leads the AgroInformatics Initiative within the Faculty of AgriSciences at Stellenbosch University.
“ChatGPT is just a statistical model that predicts the next word based on a training set,” he says. “The problem is that we don’t completely understand how it gets from the input data to the output.”
This is one reason why modern data science is a team approach. “There’s a particular role for the machine-learning specialist, but you also need the domain expert,” says Basson.
Domain experts include technical advisers such as horticulturists or soil scientists who can contextualise the data and results. Although humans may not have the raw computational power of machines, knowledgeable humans are still better at discerning when something doesn’t make sense.
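Greyling's description of a model that predicts the next word from a training set can be illustrated with a toy example. The sketch below is not how ChatGPT works internally; it is a deliberately tiny word-pair counter, and the training sentence and the function names `train_bigram_model` and `predict_next` are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical training set: one short, made-up sentence.
training_text = "the vine needs water the vine needs sun the soil needs water"
model = train_bigram_model(training_text)

print(predict_next(model, "vine"))   # the word seen most often after "vine"
print(predict_next(model, "needs"))  # the word seen most often after "needs"
print(predict_next(model, "drone"))  # never seen in training, so no prediction
```

Even in this toy, the prediction depends entirely on the training text, which is one reason a domain expert is needed to judge whether the output makes sense.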
Machine learning can be applied in a top-down or bottom-up approach. The first is closer to the traditional scientific method, where data collection follows the formulation of specific questions or hypotheses. The second — data mining — seeks useful information in data collected for another reason.
Dr Albert Strever of the Department of Viticulture and Oenology and the South African Grape and Wine Research Institute at Stellenbosch University uses machine learning and natural language processing in two ways: in research on grapevine remote sensing, and in scoping and scanning applications that unearth promising technologies.
He finds that machine learning can sometimes be a faster and cheaper alternative to traditional experiments. “I think there’s space for both,” says Strever.
“You have to distinguish the place for traditional science with experimentation and the place for scoping or scanning — throwing data in a bin and seeing patterns or discovering themes.”
The AgroInformatics Initiative
Greyling defines a data scientist as anyone who works with large datasets, but he doesn’t put himself in that category. “I’m much more of a data wrangler: I clean and integrate data,” he says. “In my experience, that is 95% of the work. Just 5% goes into fitting algorithms and so forth.”
He would like every research group within AgriSciences to do data-intensive research. As part of his AgroInformatics Initiative role, he advises research groups and organises research days to build awareness of data science.
“There’s surprisingly little interaction between departments and individuals within the Faculty,” he notes. “Some people are working on similar problems, and they don’t know about each other. So the first objective of the Initiative is to create a community of practice.”
The second objective is to skill students in data science. Greyling has observed that AgriSciences students mostly have sufficient mathematical and statistical knowledge but need more coding and implementation capabilities.
Throughout the year, the AgroInformatics Initiative holds training workshops aimed at postgraduate students from all the AgriSciences departments. However, data science skills are not only valuable for academic research. According to Strever, big growers also struggle to cope with the masses of data their operations generate.
“We need to create capacity, but to do that we need funding,” he says. “The AgroInformatics Initiative must grow, otherwise we can’t serve all the departments within the Faculty and all the different agricultural industries.”
Meanwhile, the rate of data collection is outstripping our ability to manage it.
“People tend to get very excited about physical things like drones and variable-rate spreaders. All those things collect data,” says Greyling. “The future will offer great opportunities when we integrate that data and make it functionally interoperable.”
He sees data as an asset and considers the establishment of a data repository and sharing platform to be perhaps the most important goal of the AgroInformatics Initiative.
How to keep data FAIR
Scientific data is best managed according to FAIR principles — data must be findable, accessible, interoperable, and reusable. Much of this depends on attaching metadata to the dataset.
“The metadata tells you who captured this data, where, using what process and technology, and how it was manipulated,” says Greyling. “Because if you don’t capture time and space, you can’t overlay different things.”
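The kind of metadata Greyling lists can be pictured as a small record stored alongside each dataset. The sketch below is only an illustration: the class `DatasetMetadata`, its field names, the helper `can_overlay`, the tolerances, and the two example records are all hypothetical, not a standard or the Initiative's actual schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    """Who captured the data, where, when, how, and what was done to it."""
    title: str
    captured_by: str
    captured_at: datetime      # time of capture
    latitude: float            # place of capture, decimal degrees
    longitude: float
    process: str               # how the data was collected
    technology: str            # instrument or platform used
    processing_steps: list = field(default_factory=list)  # later manipulation

    def missing_fields(self):
        """Names of fields that were left empty."""
        return [name for name, value in asdict(self).items() if value in ("", None)]


def can_overlay(a, b, max_hours=24.0, max_degrees=0.01):
    """True if two datasets are close enough in time and space to be compared.

    The tolerances are arbitrary illustrative defaults.
    """
    seconds_apart = abs((a.captured_at - b.captured_at).total_seconds())
    close_in_time = seconds_apart <= max_hours * 3600
    close_in_space = (abs(a.latitude - b.latitude) <= max_degrees
                      and abs(a.longitude - b.longitude) <= max_degrees)
    return close_in_time and close_in_space


# Two made-up records from the same hypothetical orchard block.
soil = DatasetMetadata(
    "Soil moisture, block 4", "Researcher A",
    datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc), -33.930, 18.860,
    "probe survey", "capacitance probe", ["removed sensor dropouts"])
canopy = DatasetMetadata(
    "Canopy imagery, block 4", "Researcher B",
    datetime(2024, 1, 15, 11, 30, tzinfo=timezone.utc), -33.931, 18.861,
    "drone flight", "multispectral camera", ["orthorectified"])

print(soil.missing_fields())
print(can_overlay(soil, canopy))
```

Without the time and location fields, `can_overlay` would have nothing to compare, which is the point Greyling makes about capturing time and space.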
Trying to find a dataset without metadata would be like searching through a library where the books are stored randomly, and none have covers or title pages. And once you find the dataset you seek, it has to be interoperable — the book has to be in a language you understand.
Lack of interoperability is one reason it can be so hard to migrate data between apps. “There’s commercial interests in vendor lock-in,” says Basson. “But in some cases, the problem is more fundamental.”
Obstacles range from different measurement units or time intervals to different data structures. The data structure is often determined by an initial problem statement or research question, and answering other questions can add computation and cost.
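Two of the obstacles mentioned above, different measurement units and different time intervals, can be sketched in a few lines. The example assumes two hypothetical temperature loggers, one recording degrees Fahrenheit every 30 minutes and the other degrees Celsius every hour; the readings and the function names are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def fahrenheit_to_celsius(value_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (value_f - 32.0) * 5.0 / 9.0


def hourly_means(readings):
    """Average (timestamp, value) readings into one value per clock hour."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical logger A: degrees Fahrenheit, every 30 minutes.
logger_a = [
    (datetime(2024, 1, 15, 8, 0), 68.0),
    (datetime(2024, 1, 15, 8, 30), 71.6),
    (datetime(2024, 1, 15, 9, 0), 75.2),
    (datetime(2024, 1, 15, 9, 30), 78.8),
]
# Hypothetical logger B: degrees Celsius, every hour.
logger_b = [
    (datetime(2024, 1, 15, 8, 0), 20.5),
    (datetime(2024, 1, 15, 9, 0), 24.5),
]

# Step 1: bring both sources to the same unit. Step 2: same time interval.
a_celsius = hourly_means([(t, fahrenheit_to_celsius(v)) for t, v in logger_a])
b_celsius = hourly_means(logger_b)

# Only hours present in both sources can be compared side by side.
for hour in sorted(a_celsius.keys() & b_celsius.keys()):
    print(hour.isoformat(), round(a_celsius[hour], 1), round(b_celsius[hour], 1))
```

Each conversion step is simple on its own, but it has to be written, checked, and run for every pair of sources, which is where the added computation and cost come from.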
“Another major issue is data longevity,” says Basson. “What digital information from ten years ago can you still access? It’s a massive disruptor when a business has invested its information and operational control in a system that’s reached the end of its life.”
He highlights data longevity in the context of digital twins. “If your digital system must mirror a physical system that’s going to last for 20 years, the physical system and the technology of your digital system are going to change.”
Banking data
“It’s increasingly becoming compulsory for researchers to make research data publicly available or have it in some form of repository,” says Greyling. “People often dump a lot of data somewhere, but if you don’t look after data carefully, then, in the end, it’s just numbers. It’s not data.”
Strever reckons that collecting data without thinking about its retrieval is irresponsible. “Stellenbosch University now has a data management policy, so we must consider how we work with and share data. I think it’s a good thing but difficult for scientists — it’s not how we’ve been brought up,” he jokes.
Funders are also interested in how project data will be managed. Privacy and storage — including storage costs — must be part of agreements.
“What I’ve seen with my own research is that you collect all kinds of data and analyse it, write your report or thesis, and that’s the end of it,” says Greyling. “So a lot of data gets lost and is never used again because we don’t invest time and money into looking after it.”
A shortage of high-quality data currently hampers the application of machine learning to agricultural problems. A FAIR system for managing data generated not only by researchers but also by growers, service providers, and industry bodies could help the deciduous-fruit industry reap rewards similar to those already experienced in many industrial sectors.
“I think massive opportunities open up when you use machine learning to make sense of lots of data,” concludes Basson. “The data system has to contend with many practical constraints. One mustn’t underestimate the effort and cost, but the potential is significant.”