Result

AI / Machine Learning

Professional transfer of machine learning models into production

"It took me three weeks to develop the machine learning model. Now, a year has passed and it still hasn’t been put to use in production." This complaint by an AI developer describes the dilemma that many companies find themselves in when they conduct ML projects in order to be able to exploit the benefits of artificial intelligence and machine learning (AI/ ML) on a large scale.

Some CBA Lab members also believe that certain challenges remain when it comes to transferring ML solutions developed as prototypes into production operations.

Companies still have a long way to go when it comes to implementing an advanced and well-established DevOps approach of the kind that is now a normal part of the development and production process for non-AI/ML systems. DevOps describes the interconnection between the development process and IT operations.

The AI/ML workstream at CBA Lab therefore focused precisely on this transition between development and production.

The goal here, according to Workstream Coordinator Dr. Jürgen Klein, who is also Chief Architect at Carl Zeiss AG, was to develop an approach that meets all requirements relating to quality and the degree of automation in ML projects.

For this reason, the workstream also examined the so-called MLOps approach in depth. MLOps combines machine learning with the benefits offered by the development and operations model (DevOps).

The workstream addresses some of the challenges that accompany the use of ML:

  • Operating ML systems is more complex and expensive than is the case with traditional software because ML models require more extensive training, deployment, and monitoring, as well as periodic retraining. Versioning is also demanding because ML models, training programs, and validation and test data all have to be versioned in order to ensure clarity across the board (see the versioning sketch after this list).
  • Data protection issues are often more complex and unclear with ML. For example, there is the question as to whether images of people may be used for ML model training. In addition, when decisions are to be made by an ML system (e.g. loan decisions), there are no clear rules regarding how much detail about the decision must be presented to those affected by it in order to ensure they completely understand it.
  • Experienced data science and ML engineering specialists are a rare resource on the labor market. Such specialists are needed, however, in order to put together competent development teams and ensure sufficient expertise in software engineering and operations. Today's teams also often lack diversity of expertise, which can lead to pilot projects that function well on a small scale but cannot be scaled up technically and organizationally (cross-functional teams could offer a solution here).
  • Costs are often underestimated because those who calculate them fail to take into account that greater expense is involved than is the case with traditional software projects (e.g. higher costs for personnel, data preparation, special hardware, model training, waiting times on the business end, and the transfer to production).
  • There is often a lack of clarity regarding how decisions are made.
  • Unknown dependencies exist regarding the data that is used for model training.
  • There is often a lack of data, or else unreliable data is used.
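
As a follow-up to the versioning point in the first bullet, here is a minimal sketch of how a model artifact, its training code, and its data set could be versioned together in one manifest. The file names, the manifest layout, and the use of SHA-256 fingerprints are illustrative assumptions rather than results of the workstream.

```python
# Minimal sketch: version a model, its training code, and its data together,
# so it is always clear which artifacts belong to one release. File names and
# the manifest layout are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_fingerprint(path: Path) -> str:
    """Return a SHA-256 hash so any change to the artifact is detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_version_manifest(model: Path, train_script: Path, dataset: Path,
                           out: Path = Path("model_version.json")) -> dict:
    """Record which model, code, and data belong together in one release."""
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "model": {"file": model.name, "sha256": file_fingerprint(model)},
        "training_code": {"file": train_script.name, "sha256": file_fingerprint(train_script)},
        "dataset": {"file": dataset.name, "sha256": file_fingerprint(dataset)},
    }
    out.write_text(json.dumps(manifest, indent=2))
    return manifest


if __name__ == "__main__":
    # Illustrative paths; in practice these would come from the ML pipeline.
    write_version_manifest(Path("model.pkl"), Path("train.py"), Path("data.csv"))
```

Recording all three fingerprints side by side makes it possible to tell later exactly which data and code a deployed model was trained with.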

Some of these challenges can be addressed by structuring ML applications using an expanded DevOps approach – the so-called MLOps approach, which extends the familiar DevOps principles to specifically support the development and operation of ML-based solution components. The arrival of new data sets, as well as the gradual degradation of model performance, calls for continuous training (CT) in order to keep performance stable or improve it.
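
To make the continuous-training idea more concrete, the following is a minimal sketch of a CT trigger that starts a new training run when enough new data has accumulated or when monitored performance has degraded. The threshold values and the use of accuracy as the monitored metric are illustrative assumptions.

```python
# Minimal sketch of a continuous-training (CT) trigger: retrain when enough
# new data has arrived or when monitored model performance has degraded.
# Thresholds and the accuracy metric are illustrative assumptions.

def should_retrain(new_samples: int, current_accuracy: float,
                   baseline_accuracy: float,
                   min_new_samples: int = 10_000,
                   max_accuracy_drop: float = 0.02) -> bool:
    """Decide whether the CT pipeline should kick off a new training run."""
    enough_new_data = new_samples >= min_new_samples
    performance_degraded = (baseline_accuracy - current_accuracy) > max_accuracy_drop
    return enough_new_data or performance_degraded


if __name__ == "__main__":
    # Example: accuracy fell from 0.91 to 0.87 -> degradation triggers retraining.
    if should_retrain(new_samples=3_200, current_accuracy=0.87, baseline_accuracy=0.91):
        print("Trigger retraining pipeline")
```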

Because an ML model is usually only a small but nevertheless critical component of a software system, its interaction with other components must be constantly monitored and analyzed. This also means that new models need to be checked using special test procedures such as data and model validations.
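
A minimal sketch of such a check follows, assuming a few completeness checks on the input data and a simple accuracy comparison against the model currently in production; the field names, metrics, and tolerance are illustrative assumptions, not the workstream's test procedure.

```python
# Minimal sketch of a validation gate before a new model is deployed:
# the candidate must pass simple data checks and must not perform worse
# than the model currently in production. Thresholds are illustrative.
from typing import Sequence


def validate_data(rows: Sequence[dict], required_fields: Sequence[str]) -> bool:
    """Reject the batch if any row is missing a required field or is empty."""
    return bool(rows) and all(
        all(row.get(field) not in (None, "") for field in required_fields)
        for row in rows
    )


def validate_model(candidate_accuracy: float, production_accuracy: float,
                   tolerance: float = 0.005) -> bool:
    """Allow deployment only if the candidate is at least as good (within tolerance)."""
    return candidate_accuracy >= production_accuracy - tolerance


if __name__ == "__main__":
    rows = [{"age": 42, "income": 51_000}, {"age": 35, "income": 62_500}]
    ok = validate_data(rows, ["age", "income"]) and validate_model(0.912, 0.905)
    print("Deploy candidate model" if ok else "Block deployment")
```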

At the same time, the MLOps principle only works if the organization in question also has the required capabilities and expertise, as defined by the workstream in a capability framework. The framework consists of the following elements:

People and expertise

People and the required expertise are the fundamental requirement for successful MLOps. It’s not just data scientists or ML engineers who are needed, as a large number of different skills and capabilities are required. The individuals who possess such skills and capabilities need to be recruited, trained and retained by the organization.

Culture

The organization also needs to prepare its culture for new technologies. The individual participants in an MLOps initiative, as well as others throughout the organization, must all be willing to accept ML-supported processes and assist with their further development. Support from top management is also crucial here, as it is not possible for the organization to dedicate itself to MLOps until top management has indicated that it will fully support such an approach.

Processes

Changes that occur when ML is introduced always influence the processes in an organization. More specifically, processes undergo changes as a result of the systematic incorporation of data streams.

Data

Data is the fuel for an ML organization. There can be no ML without correct, high-quality data. Companies often have problems with the quality of historical data, which is why basic capabilities such as data preparation, processing, and quality assurance need to be improved in order to get companies ready for ML.
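
As an illustration of such basic capabilities, here is a minimal sketch of simple data quality checks (missing values, duplicates, value ranges) on a CSV file. The column name, the value range, and the file layout are assumptions made purely for the example.

```python
# Minimal sketch of basic data quality checks (completeness, duplicates,
# value ranges) on a CSV file used as ML training data. Column names,
# ranges, and the file layout are illustrative assumptions.
import csv
from pathlib import Path


def quality_report(csv_path: Path) -> dict:
    """Return simple quality metrics for a CSV file used as training data."""
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))

    total = len(rows)
    missing = sum(1 for r in rows if any(v in (None, "") for v in r.values()))
    duplicates = total - len({tuple(sorted(r.items())) for r in rows})
    # Assumes an "age" column with numeric values; adapt to the real schema.
    out_of_range = sum(1 for r in rows
                       if not (0 <= float(r.get("age", 0) or 0) <= 120))

    return {
        "rows": total,
        "rows_with_missing_values": missing,
        "duplicate_rows": duplicates,
        "age_out_of_range": out_of_range,
    }


if __name__ == "__main__":
    print(quality_report(Path("training_data.csv")))  # illustrative file name
```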

Technology and infrastructure

ML is based on a complex technology stack and requires high-performance infrastructure that needs to operate in a very dynamic environment. Continuous technological innovation and maintenance are also a basic requirement for ML, which means the financial and human resources needed for this must be made available.

Risk, compliance, and ethics

The use of systems that might potentially make autonomous decisions harbors certain risks. For example, unbalanced data can lead to biased results and unethical decisions, which in the worst case could endanger people and threaten the entire organization. This presents new challenges in terms of managing risks and ensuring compliance.
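
To illustrate how unbalanced data or models can be spotted early, here is a minimal sketch that compares the share of positive decisions (e.g. loan approvals) across groups. The group labels, the sample data, and the warning threshold are illustrative assumptions; real bias and compliance checks would be far more extensive.

```python
# Minimal sketch of a bias check: compare how a model's positive decisions
# (e.g. loan approvals) are distributed across groups in the data. Group
# labels, sample data, and the warning threshold are illustrative.
from collections import defaultdict
from typing import Iterable, Tuple


def approval_rates(decisions: Iterable[Tuple[str, bool]]) -> dict:
    """Compute the share of positive decisions per group."""
    counts, positives = defaultdict(int), defaultdict(int)
    for group, approved in decisions:
        counts[group] += 1
        positives[group] += int(approved)
    return {g: positives[g] / counts[g] for g in counts}


if __name__ == "__main__":
    sample = [("group_a", True), ("group_a", True), ("group_a", False),
              ("group_b", False), ("group_b", False), ("group_b", True)]
    rates = approval_rates(sample)
    print(rates)
    # A large gap between groups is a warning sign that the training data
    # or the model may be unbalanced and needs review.
    if max(rates.values()) - min(rates.values()) > 0.2:
        print("Review for potential bias")
```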

The workstream has summarized its results and findings in a detailed white paper. In addition, the workstream gave all participants insight into which ML tools are used for which purposes at the CBA Lab companies that utilize AI.