Model or data?


Machine learning currently attracts enormous attention in both academia and industry. Research and development groups compete fiercely to raise the accuracy of machine learning and deep learning models, focusing on algorithm tuning and code optimization, and this competition drives the field's rapid progress. However, spending too much time making models more powerful raises concerns about the overall development strategy. Unlike traditional software, whose behavior depends entirely on the code that implements it, an artificial intelligence system is built on two pillars: the model and the data. Focusing only on changing the model can waste resources, because the model is not the only factor that determines accuracy. "As a rule, when an artificial intelligence system does not work well, the usual response is to improve the code and the algorithms. But for many practical applications, improving data quality yields better results," said Andrew Ng, a leading machine learning scientist, citing an experiment comparing the effectiveness of the two approaches: model-centric and data-centric.

Comparison of the effectiveness of the two approaches. Source: Deeplearning.AI

Typically, 80% of a machine learning practitioner's job is cleaning data, because "Garbage In, Garbage Out" (GIGO). Andrew Ng asks: if 80% of our job is data preparation, why don't we treat data quality as a top priority for machine learning? A typical symptom of this neglect is that most people skim arXiv to see where machine learning research is heading, then pour effort into model tuning in the hope of beating the accuracy benchmarks of popular models such as Google's BERT or OpenAI's GPT-3. However, these headline models account for only about 20% of a business problem; what separates a good deployment from a bad one is the quality of the data.
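To make the GIGO point concrete, here is a minimal data-cleaning sketch in Python with pandas. The file name and the column names ("text", "label") are hypothetical, invented for illustration; real preparation pipelines involve far more than this.

```python
# A minimal data-cleaning sketch; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("reviews.csv")

# Remove exact duplicates and rows missing the fields the model needs.
df = df.drop_duplicates()
df = df.dropna(subset=["text", "label"])

# Normalize label spelling so "Positive", "positive " and "POSITIVE"
# do not become three different classes downstream.
df["label"] = df["label"].str.strip().str.lower()

df.to_csv("reviews_clean.csv", index=False)
```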

Source: Paleyes et al.

All of the above suggests that a model-centric strategy is not an effective way to improve artificial intelligence systems, especially since pre-trained models are now readily available as open source code or through licensed APIs. The benefits of focusing on data, on the other hand, are undeniable, but this approach is not easy: collecting and processing data into a dataset of sufficient quality for training is fraught with challenges. According to a study by Cambridge scientists, the most important but often overlooked issue is data dispersion. The problem arises when data streams in from different sources, each with its own schemas, conventions, and ways of storing and accessing data. Combining that information into a single dataset suitable for machine learning is a tedious process, so most engineers are not enthusiastic about building it. Dataset size poses a further challenge: while small datasets suffer from noisy data, larger datasets are difficult to label. Labeling each sample is another important part of data collection, and it becomes hard in domains that require specialized knowledge, because access to experts such as doctors can be limited by a lack of funding. Finally, data scientists report that lack of access to high-variance data is one of the main obstacles when moving machine learning solutions from the lab to the real world.
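As an illustration of the data-dispersion problem, here is a minimal sketch that maps two hypothetical sources with different schemas onto one shared schema. All file and column names are invented for illustration; this is a sketch of the idea, not a general solution.

```python
# A minimal sketch of unifying two sources with different schemas;
# every file and column name here is a hypothetical example.
import pandas as pd

clinic_a = pd.read_csv("clinic_a.csv")    # columns: patient_id, dob, diagnosis
clinic_b = pd.read_json("clinic_b.json")  # columns: id, birth_date, dx_code

# Map each source onto one shared schema before concatenating.
unified_a = clinic_a.rename(columns={"patient_id": "id",
                                     "dob": "birth_date",
                                     "diagnosis": "dx_code"})
dataset = pd.concat([unified_a, clinic_b], ignore_index=True)

# Parse dates consistently so both sources share one representation.
dataset["birth_date"] = pd.to_datetime(dataset["birth_date"], errors="coerce")
```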

Source: Deeplearning.AI

In practice, many different scenarios can ease or hinder the deployment of AI systems. For example, an internet company whose software collects data from millions of users has a large dataset for training, a favorable starting condition for developing a machine learning model. In other settings such as agriculture or healthcare, where data samples are scarce, we cannot expect a million tractors or a million patients to boost the amount of data collected! Andrew Ng therefore directs the community's attention to MLOps, a field that focuses on building and deploying machine learning models through a standardized process. He proposes some basic rules for deploying machine learning effectively:

The most important task of MLOps is to provide high-quality data.

Consistency in data sample labels is key. For example, check how labelers draw bounding boxes: there may be several valid labeling conventions, and even if each is sound on its own, mixing them inconsistently can ruin the results (see the sketch after this list).

Systematically improving data quality on a baseline model is better than running a state-of-the-art model on low-quality data.

In case of errors during training, take a data-centric approach.

By focusing on the data, problems with smaller datasets (less than 10,000 samples) can be significantly improved.

When working with smaller datasets, tools and services to improve data quality are crucial.
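As a concrete version of the bounding-box consistency check mentioned above, here is a minimal sketch that measures agreement between two annotators with intersection-over-union (IoU). The (x1, y1, x2, y2) box format, the sample boxes, and the 0.5 agreement threshold are assumptions for illustration, not values from the article.

```python
# A minimal sketch of checking labeler agreement via IoU;
# boxes and the 0.5 threshold are illustrative assumptions.

def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two annotators label the same object; a low IoU flags an inconsistency.
annotator_1 = (10, 10, 110, 210)
annotator_2 = (15, 20, 120, 200)
if iou(annotator_1, annotator_2) < 0.5:
    print("Labeling convention mismatch - review the guidelines.")
```

In practice one would run such a check across many samples and annotator pairs to spot systematic disagreement in labeling conventions rather than judging a single pair of boxes.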

"If 80% of our work is data preparation, then ensuring data quality is the most important job of an ML development team," says Andrew Ng. Good data is consistent, covers the important edge cases, incorporates timely feedback from production data, and is appropriately sized. He advises against relying solely on engineers to figure out how to improve datasets; instead, he hopes the machine learning community will develop MLOps tools that make building high-quality, repeatable, and systematic datasets and AI systems possible. MLOps is still a new field, and he argues that the most important goal of MLOps teams should be ensuring a consistent, high-quality flow of data through all stages of a project.

