Background

The availability of petabyte-scale structured and unstructured data has lent itself to large-scale harnessing through Data Science and AI. Advances in parallel and distributed computing, together with algorithms to process data of all types including text, video, images, speech, and structured data, have made it possible to realise computing at scale. Several interesting business applications have already been actualised. These include video/image-based authorisation systems, automated processing of scanned invoices and documents, updating product master data from images of objects on the shelf, speech sentiment mining for call centres, chatbots to manage e-commerce customers online, and the like.

It may be noted that, as exciting as these applications are, building such systems calls for utmost care in architecture and implementation in order to avoid failures and the resultant customer and brand impact. The availability of hundreds of types of algorithms in the public domain has made it easy to build quick proofs of concept. However, this has also resulted in poorly designed systems when they are used at scale, where cost-effectiveness and high business ROI are essential.

Frameworks for Data Science

The CRISP-DM framework that we covered in our other blog (DATA SCIENCE BLOG 1 REFERENCE) is a practical way to manage programs around data science. However, in order to execute and monitor these programs, a variety of end-to-end software frameworks are available for enterprise use. These include the likes of Amazon SageMaker, Microsoft CNTK, Google TensorFlow, RapidMiner, KNIME, DataRobot, Dataiku, and so on. These are similar to IDEs for software development: they provide user role-based access to a variety of activities that a team can carry out in a collaborative manner.

Apart from providing features such as version control for the ML code written by data scientists and data engineers, these tools also provide the ability to convert that code into usable components that can be deployed for production use. The functionality supported includes data discovery across internal and external data sets, data wrangling, missing data treatment, analytics model building, machine learning model building, model validation, APIs to consume the output of validated models, and enabling a large team to collaborate towards shared goals.
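
To make this concrete, here is a minimal sketch of the wrangle, impute, build, and validate flow that such frameworks automate, written with scikit-learn. The data set, column names, and model choice are illustrative assumptions, not part of any specific framework.

```python
# A minimal sketch of the wrangle -> build -> validate flow, using
# scikit-learn. The CSV path and column names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("customer_churn.csv")            # hypothetical data set
X, y = df.drop(columns=["churned"]), df["churned"]

numeric = ["tenure_months", "monthly_spend"]       # illustrative columns
categorical = ["plan_type", "region"]

preprocess = ColumnTransformer([
    # Missing data treatment and scaling for numeric features
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Imputation and one-hot encoding for categorical features
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf", RandomForestClassifier(n_estimators=200))])

# Model validation via cross-validation before any deployment decision
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean ROC AUC: {scores.mean():.3f}")
```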

As illustrated in the figure, such frameworks succeed well when they have a built-in library of models. These models may range from clustering and classification algorithms all the way to deep learning methods for a variety of problems. The model base itself would be rich enough to address strategic, tactical, and operational problems in the enterprise. The models could also span domains: financial, marketing, operations, accounting, and engineering. Ideally, for solving large-scale problems, it is important to support complex model building from model building blocks, for instance a sequential blend of descriptive, predictive, and prescriptive models.
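
As a toy illustration of such a model base, the sketch below registers a few scikit-learn building blocks under problem-type keys so that composite solutions can be assembled from them. The registry structure and all names here are our own assumptions, not any vendor's API.

```python
# A toy "model base": a registry mapping problem types to model
# constructors, so complex solutions can be assembled from blocks.
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

MODEL_BASE = {
    # descriptive: summarise and segment the data
    "segmentation": lambda: KMeans(n_clusters=5),
    # predictive: estimate outcomes
    "classification": lambda: LogisticRegression(max_iter=1000),
    "regression": lambda: GradientBoostingRegressor(),
}

def build_model(problem_type: str):
    """Look up and instantiate a model from the shared model base."""
    try:
        return MODEL_BASE[problem_type]()
    except KeyError:
        raise ValueError(f"No model registered for '{problem_type}'")

# e.g. a sequential blend: segment customers first (descriptive),
# then predict churn within each segment (predictive)
segmenter = build_model("segmentation")
classifier = build_model("classification")
```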

A model base management system would then support the creation of such models. We should keep in mind the need to integrate it at the back end with database interfaces, and at the front end with consumption systems through information-sharing interfaces. For text mining applications, a knowledge-based system would also need to be part of the data science framework.
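
A front-end consumption interface often amounts to exposing a validated model over HTTP. Below is a minimal sketch using Flask; the model artefact name and the JSON payload shape are hypothetical assumptions for illustration.

```python
# A minimal information-sharing interface: serving a validated model
# over HTTP with Flask. Artefact name and payload format are assumed.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("validated_churn_model.joblib")  # hypothetical artefact

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[...], [...]]}
    features = request.get_json()["features"]
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```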

Frameworks on Cloud

Several cloud-based options for managing the end-to-end data science and AI lifecycle are available today. The top vendors, including Azure, AWS, and Google Cloud, offer data discovery, data preparation, data wrangling, model building, model validation, and model deployment on their proprietary cloud platforms. Top open-source frameworks such as Spark, TensorFlow, and Keras are at the heart of these platforms. To implement end-use AI applications that are high performing and reliable, it is important to architect the entire lifecycle on the cloud, ideally on a single vendor's platform, as most of the software is optimised for its native systems. Such cloud-based platforms also enable a diverse and distributed team of data engineers, data scientists, business analysts, and IT stakeholders to work together. The time taken to build and deploy such solutions has shrunk considerably over the past few years with the advent of pre-built, containerised, microservices-based data, analytics, and AI frameworks offered by the cloud providers. The figure below illustrates a generic framework for cloud services to support AI.
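
As a flavour of the open-source building blocks these platforms wrap, here is a minimal Keras training script. The synthetic data and model shape are placeholders; on a managed platform, much the same script would typically run inside a pre-built, containerised training job.

```python
# A minimal Keras example of the kind of open-source building block at
# the heart of these cloud platforms. Data and shapes are placeholders.
import numpy as np
import tensorflow as tf

# Illustrative stand-in data: 1000 samples, 20 features, binary label
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)

# Saved artefact handed to the platform's deployment step
model.save("churn_model.keras")
```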

Conclusion

Large-scale adoption of data science and AI/ML is possible today thanks to the availability of computing and modelling frameworks and the software associated with them. While developing these capabilities in-house calls for significant investment, the same frameworks are also available on an as-a-service basis.