Over the past ten years, the aggregate threat of newly arrived, digital-native market disruptors has proven serious enough to push some large incumbents off the cliff and to inflict severe revenue losses on many others.
The large players now face a rapidly intensifying "do or die" mandate: replicate the secret sauce behind these smaller rivals' success, namely agility and AI-enhanced digital technology.
Artificial Intelligence (AI) and its prodigy, Machine Learning (ML), have become deciding factors for competitiveness and relevance in many market sectors, on a massively expanding scale. With governments accelerating their spending on ML projects to establish and promote smart public services, we now see ML projects defined and implemented in the public sector at an accelerated rate as well.
Statistics show that nearly 70% of organizations already use Machine Learning at some level and 97% plan to adopt it within the next three years. Yet more than 85% of Machine Learning projects fail to complete or to deliver value in the end, which translates into tens of billions of dollars in financial losses and millions of wasted expert hours each year, a trend expected to continue through 2022.
Many of the mistakes that lead ML projects to failure are predictable, and avoidable. They are easy mistakes with expensive price tags: they inflict financial and morale damage on teams, and they hamper an organization's efforts to hold its market position in an extremely competitive landscape under heavy assault from digital-native disruptors.
Based on my experience and observation of ML projects over the past few years, here are my tips for succeeding with ML projects at the enterprise level:
Data is the bloodline and should be treated as such
There is a famous quote attributed to Napoleon: "An army marches on its stomach." I would like to bring that quote into the 21st century: "A Machine Learning project marches on its data pipeline!"
Your ML project, and later your ML model serving in production, needs access to available, abundant, relevant, good-quality data, and you need a properly structured data strategy to keep it flowing.
As a body can only be as healthy as the blood that is pumped through its veins, a Machine Learning project, or product in service, is reflective of the quality of the data it is trained with.
Data can arrive at your doorstep with an array of issues. If its quality is poor, your ML teams will have to spend precious time and budget bringing it up to the needed standard through cleaning, enrichment, and formatting.
Data problems also include a variety of biases introduced by collection and sampling issues, which can not only produce inaccurate outcomes but also make the data illegal to use.
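To make the clean-up cost concrete, here is a minimal, illustrative sketch of such a quality pass in Python: it drops incomplete records, normalizes formats, and deduplicates. The field names (`customer_id`, `amount`) are hypothetical examples, not taken from any particular pipeline.

```python
def clean_records(records):
    """Drop incomplete rows, coerce types, and deduplicate a batch of records."""
    seen = set()
    cleaned = []
    for row in records:
        # Completeness check: skip rows with missing required fields.
        if row.get("customer_id") is None or row.get("amount") in (None, ""):
            continue
        # Formatting: coerce amount to float; treat unparseable values as bad data.
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue
        # Normalize the ID and deduplicate on it.
        key = str(row["customer_id"]).strip()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"customer_id": key, "amount": amount})
    return cleaned

raw = [
    {"customer_id": " 42 ", "amount": "19.99"},
    {"customer_id": "42", "amount": "19.99"},   # duplicate
    {"customer_id": None, "amount": "5.00"},    # missing ID
    {"customer_id": "7", "amount": "oops"},     # unparseable amount
]
print(clean_records(raw))  # only one valid, deduplicated record remains
```

In a real pipeline these rules would come from a data contract agreed with the producing team, but even this small gate shows how quickly quality work consumes engineering time.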
To maximize the value received from data, it should be democratized and provisioned across the organization, to all groups and teams, subject to the regulatory and compliance requirements of the enterprise and of local and international governing bodies.
When establishing data pipelines, the organization must focus on multiple horizons: the data that is readily available now (immediate), the data that can be generated or acquired within a short and reasonable timeframe (short term), and the data that will be needed within the next few months (long term).
I do not recommend planning for longer windows upfront, as market changes cause too much drift in your entire data setup over time and would render today's efforts less relevant and useful by then.
Use Machine Learning to address real market problems
As attractive and techie as having a Machine Learning pipeline may sound (even establishing your own corporate MLOps), the cost and ongoing effort of keeping your ML models at market service level mount up fast. Unless you are solving real problems for your customers, or improving their experience, the pipeline will soon run out of economic justification.
To properly establish the case for having ML models in production, and to keep your ROI reasonable, your business experts need to work closely with your data scientists to define the business problems to be addressed. Remember that not all problems need Machine Learning answers.
As per the tip above, data will be needed to feed and drive your Machine Learning solution, so the availability of data, along with its quality, refresh interval, and stability, is a major deciding factor when choosing a Machine Learning approach.
Once the ROI is established and the data supply figured out, building a functional and efficient ML team requires looking beyond simply pairing business experts with data scientists.
Form up your Machine Learning ARTs®
The Scaled Agile Framework (SAFe®), the most adopted Agile scaling framework among Fortune 500 enterprises (over 70% adoption rate), uses a highly effective approach to forming the large teams that serve an organization's Development Value Streams.
A Development Value Stream is the group and sequence of people (from all teams and segments), processes, and tools that work together to create the products and services sold to customers, or the tools that enable the sales team to sell them.
SAFe® calls these large teams Agile Release Trains (ARTs) and positions them at the heart of every program in an enterprise. ARTs are designed to group experts, who may come from several silos, into a solution-creating and solution-delivering machine that can almost independently design, develop, test, deploy, and even release its work to production.
As I mentioned at the beginning of this article, Enterprise Agility is a vital factor for organizations in their efforts to survive and thrive in the ever-changing market landscape.
Organizations cannot stay competitive unless they scale their agility all the way up to the enterprise portfolio level, which is what SAFe specializes in: it allows the aggregate power of all programs and their ARTs to combine upward toward the realization of strategic plans, and it lets strategic plans cascade down to all programs and their ARTs for agile market response and repositioning.
Onboard the leadership to champion the change
Machine Learning brings not only a major shift in the way an organization thinks about technology and pictures the final product, but also a strong demand for a cultural paradigm shift across the entire fabric of the enterprise, since it touches the people (business and technology), processes, and tools of everyone involved in its creation, service provisioning, and maintenance.
Leadership needs to learn the "what and why" of Machine Learning well enough to make educated strategic decisions about it, and then commit to championing the "how" across all the involved groups and silos. Leaders cannot be expected to answer every question, but they should lead the orchestration required to find the answers.
Machine Learning requires an investment of time and money, and the only way to weather the painstaking initial period of finding what resonates best with the market, and how to provide it to customers, is a culture of continuous exploration, learning, and training.
Invest in your people before investing in Machine Learning
There is an ongoing shortage of skilled ML experts in the market. In fact, it is quite hard and expensive to find and hire people skilled in all areas of data science, especially the MLOps pipeline. A smart way to approach this is to invest internally, training your existing teams in the new skills they will need to serve as part of the Machine Learning ART.
Since an ART is made up of people from all the groups engaged in it, you need to train everyone, based on their future role in the ML pipeline, from business (subject matter/domain) experts to technology staff.
Some big players, like Facebook, Google, Amazon, and Alibaba, have their own internal training academies and programs, which excel in training existing staff and new hires.
When training the organization, it is recommended to start with the leadership, and then to cascade down the management org chart. This way, you will have ML-Savvy executives running the cultural and technical shift, with the required authority and budgetary levels to facilitate the work.
Migrate to the cloud, or start there
All the major cloud service providers have extensive, time-tested tools for every segment of the Machine Learning pipeline, from Data extraction all the way to ML service provisioning and Monitoring.
Their high scalability, elasticity, and availability make them the best choice for maximizing the reach of your ML budget, by adjusting the resources needed to run your MLOps in a dynamic and transparent way.
The agility gained through this service structure, combined with the available cloud-based tools, reduces the effort and time your experts must invest in running Machine Learning projects and serving ML products. The time, energy, and money saved allow your organization to invest more in exploration and market testing, which in turn feeds your teams live, timely market feedback for fine-tuning your solution approach and re-training your ML model as parameters and data shift with the market over time.
The low cost of forming temporary pipelines for experimentation boosts your teams' creativity and avoids unnecessary capital spending on infrastructure.
Your teams can always lock in the core capacity needed to run the production ML models and their data pipelines, benefiting from significant discounts on reserved resources while staying safe with live, elastic capacity that automatically scales up to meet any spikes in demand your service may encounter.
The global reach of the cloud services also enables your organization to reduce service latency to unprecedentedly low numbers by replicating your services in the locations nearest to your customers.
MLOps and DevSecOps should collaborate and co-exist
DevSecOps is the infinite dual loop that integrates and pipelines software product lifecycle management, with security considerations embedded and implemented at every step.
DevSecOps uses Continuous Integration (CI) and Continuous Delivery (CD), which allow your pipeline to maintain a flow of developed and tested code that joins the existing codebase and is pushed to production, raising your organization's agility in responding to changing market trends and customer interests.
Recent enhancements to DevSecOps tools have brought in the power of ML in many aspects of its planning, development, testing and deployment stages. Both DevSecOps and MLOps can benefit from Automation in their pipeline.
MLOps is a set of techniques and processes for managing your ML model's lifecycle, including the needed data provisioning.
DevSecOps principles apply to MLOps as far as the software creation is involved, but there are some key differences between the two that would not allow MLOps to simply merge into DevSecOps:
- Continuous Experimentation: ML models require many iterations of execution and fine-tuning before a model's accuracy and performance are acceptable for the production environment. This differs from DevSecOps, where far less retesting and parameter tweaking is required. As a result, it is hard to predict how long a new ML model will need to reach the service performance level required for production.
DevSecOps is designed for agile market response through the rapid creation of software solutions (or rapid incremental updates to existing ones), whereas experimentation on ML models cannot be forced to go faster than computational power and automation allow. A large portion of the experimentation time remains manual, human work on the models.
- Continuous Monitoring: Because ML models are trained on data prepared in a certain timeframe in the past, their accuracy and performance rely on data patterns and value ranges staying consistent over time. For some applications the data may hold steady for long periods, but in most cases ongoing market changes and shifting customer demands alter the data, degrading the ML model's accuracy and performance. To catch and fix this, ML models need continuous monitoring, with triggers set to retrain the model on fresh data from the market.
DevSecOps has production performance monitoring in place, but that monitoring exists to ensure the agreed SLAs remain valid and the services maintain the needed response times; it is not usually concerned with accuracy degradation.
- Continuous Training: The continuous cycle of ML Model monitoring flows into the loop of re-training the ML Model whenever new data shows enough drift that would justify the processing cost of re-executing the training process for the ML Model. DevSecOps does not have such a cycle as part of its standard pipeline.
- Team Structure: MLOps requires a different set of specializations than your DevSecOps teams have: Data Scientists, Data Engineers, Model Risk Specialists, Machine Learning Architects, and ML Engineers. The expertise the two share is the need for Software Developers and Release Engineers.
- Testing Approach: DevSecOps uses the software testing disciplines (unit testing, functional testing, integration/regression testing, and so on), while ML models need, in addition to all of those, data validation plus model evaluation and validation in the pipeline.
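The monitoring-and-retraining trigger described under Continuous Monitoring and Continuous Training above can be sketched with a simple drift measure. The sketch below uses the Population Stability Index (PSI) with the common 0.2 rule-of-thumb threshold; the sample data and the threshold are illustrative assumptions, not values from this article.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two 1-D numeric samples.
    0 means identical distributions; larger values mean more drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        left = lo + b * width
        right = lo + (b + 1) * width
        if b == bins - 1:
            n = sum(1 for x in sample if left <= x <= hi)  # include right edge
        else:
            n = sum(1 for x in sample if left <= x < right)
        return max(n / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (frac(actual, b) - frac(expected, b))
        * math.log(frac(actual, b) / frac(expected, b))
        for b in range(bins)
    )

train_sample = [0.1 * i for i in range(100)]        # feature values at training time
live_sample = [0.1 * i + 3.0 for i in range(100)]   # shifted production data

drift = psi(train_sample, live_sample)
if drift > 0.2:  # rule-of-thumb threshold for significant drift
    print(f"PSI={drift:.2f}: drift detected, trigger model retraining")
```

In production this check would run on a schedule against each monitored feature, and a breach would queue the retraining pipeline rather than just print a message.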
Value Streams (as mentioned earlier) are the collaboration environments where DevSecOps and MLOps create aggregate value delivery, supporting each other and amplifying the power of Machine Learning with agile support and prototyping of the systems required to keep it running and performing as expected.
DevSecOps and MLOps both adhere to CI and CD disciplines, and that is where they collaborate most closely. While a DevSecOps pipeline may run code through CI/CD several times within minutes (as at tech unicorns like Amazon and Facebook), MLOps sends ML models into the staging, pre-production, and eventually production pipeline only occasionally: whenever a new model is validated and it is time to train it in production-like environments, on production-like data, for final tuning and validation. And once the ML model is in production, the orchestration of its supporting software needs to operate smoothly, within the defined SLAs.
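The data-validation and model-validation gates named in the Testing Approach point above can be illustrated with two simple pipeline checks. The schema, field names, and the 0.9 accuracy threshold below are assumptions made for the sake of the sketch.

```python
def validate_batch(rows, schema):
    """Data-validation gate: return human-readable violations.
    An empty list means the batch passes. Schema maps field -> (type, lo, hi)."""
    errors = []
    for i, row in enumerate(rows):
        for field, (ftype, lo, hi) in schema.items():
            value = row.get(field)
            if not isinstance(value, ftype):
                errors.append(f"row {i}: {field} has wrong type")
            elif not (lo <= value <= hi):
                errors.append(f"row {i}: {field}={value} out of range")
    return errors

def evaluate_candidate(y_true, y_pred, min_accuracy=0.9):
    """Model-validation gate: block promotion below a target accuracy."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy >= min_accuracy, accuracy

# Hypothetical schema and batch for illustration.
schema = {"age": (int, 0, 120), "score": (float, 0.0, 1.0)}
batch = [{"age": 34, "score": 0.72}, {"age": -5, "score": 0.4}]
print(validate_batch(batch, schema))  # flags the negative age

ok, acc = evaluate_candidate([1, 0, 1, 1], [1, 0, 1, 0])
print(ok, acc)  # 0.75 accuracy fails the 0.9 gate
```

Real pipelines typically express these checks declaratively in a validation framework, but the principle is the same: a failed data or model gate stops promotion just as a failed unit test stops a software build.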
MLOps need special Metrics at Model and Pipeline levels
ML code makes up only a small fraction (less than 5%) of the entire ML pipeline. The rest is configuration, data collection, feature extraction, data verification, analysis, process management, machine resource management, serving infrastructure, and monitoring.
This means that when selecting Metrics and trying to establish KPIs, we need to factor in all these functions as part of our measurement and tracking structure.
ML models also require their own metrics for the continuous monitoring of the accuracy and performance required in production. These are in addition to the evaluation metrics used during experimentation and later at the pre-production and production validation stages.
Fortunately, there are several time-tested, existing solutions in the market that we can choose from, customize, enhance, and fit into our pipeline. Many of these tools are already provided by the cloud platforms and offer multi-layered coverage, from the cloud service layer all the way down to each ML model's live performance.
DevSecOps shares some of these metrics, those relating to pipeline performance and production SLAs. In both cases, metrics need to be tied to action items that trigger once a threshold is crossed, so that the required staff are alerted to a rising problem and the relevant automated mitigation actions are launched as needed.
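As a rough illustration of tying metrics to action items, the sketch below fires registered actions (alerting on-call staff, queueing retraining) when a monitored metric crosses its threshold. The metric names, thresholds, and actions are hypothetical, chosen only to show the pattern.

```python
ALERTS = []  # stands in for a real paging/ticketing integration

def alert_on_call(metric, value):
    ALERTS.append(f"ALERT: {metric}={value:.2f} breached threshold")

def trigger_retraining(metric, value):
    ALERTS.append(f"ACTION: retraining pipeline queued (cause: {metric})")

RULES = {
    # metric name: (threshold, breach direction, actions to launch)
    "accuracy": (0.90, "below", [alert_on_call, trigger_retraining]),
    "p95_latency_ms": (250.0, "above", [alert_on_call]),
}

def check_metrics(observed):
    """Compare observed metric values against RULES and fire actions on breach."""
    for metric, value in observed.items():
        threshold, direction, actions = RULES[metric]
        breached = value < threshold if direction == "below" else value > threshold
        if breached:
            for action in actions:
                action(metric, value)

check_metrics({"accuracy": 0.84, "p95_latency_ms": 120.0})
print(ALERTS)  # the accuracy breach fires both the alert and the retraining action
```

The key design point is that thresholds, alerts, and mitigations live in one declarative place, so both the MLOps metrics (accuracy) and the shared DevSecOps metrics (latency SLAs) flow through the same action machinery.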
Machine Learning is now a great competitive advantage for many market players and their rising contenders, and we can expect the market to keep growing at a 44% CAGR through 2025, surpassing $100 billion.
According to McKinsey, Artificial Intelligence could add around $13 trillion to world economic output by 2030, and we can expect Machine Learning to claim the lion's share of that figure.
Today, organizations' need for Machine Learning resembles their need for software, and later a web presence, during the 1990s and 2000s. We are at the onset of an era in which no business can stay in the market without Machine Learning models, and customers will not buy from providers who do not offer them the benefits of ML-enhanced services. Yet most Machine Learning projects are failing, due to a lack of understanding of the pitfalls and best practices.
I hope the tips and recommendations shared here help you implement your ML models successfully and serve them to your customers.
Article written by Arman Kamran, CTO of Prima Recon and Enterprise Transition Expert in Scaled Agile Digital Transformation