The three environments for AI Professionals — Research, Development, and Production
People who want to build skills for data roles such as Applied Engineer, Data Analyst, Data Engineer, Data Scientist, Data Solutions Architect, Machine Learning Engineer, and Research Scientist are often confused, because these fields are relatively new and the roles overlap heavily. Moreover, the definition of each role and the skills it requires vary across organizations, because each organization interprets the role according to its own requirements, culture, and budget.
This article introduces three environments for AI professionals and focuses on the tasks and skills required in each. In general, a good AI professional needs to know the basics of all three environments and be an expert in some tasks in at least one of them. Based on their interest and expertise in those tasks, they can then deepen and expand their skill set. Awareness of these environments can help individuals avoid confusion when making career decisions in the dynamic data world.
AI professionals first collaborate with stakeholders and domain experts to define the business problem. They might also present and validate assumptions related to the problem. Once the problem is defined and assumptions are validated, AI professionals can start working in the research environment.
A research environment is where AI professionals define experiments and might use tools like Jupyter Notebook and JupyterLab to collect data, clean it, perform Exploratory Data Analysis (EDA), and present findings to the team. Each experiment’s findings might help select and generate new features from the data and build models that can potentially solve the problem. Metrics are then defined to evaluate and select models across different experiments. Sometimes, an ensemble of models from different experiments results in higher performance. The code in this environment or phase might be very messy, and it might be hard for others to reproduce the results on their own machines.
The different steps in the research environment include:
- Define Experiments: Different experiments can be defined based on the definition of the problem. For example, suppose we have a classification problem. In that case, experiments might be defined based on different approaches like conventional supervised learning, weak supervision, active learning, semi-supervised learning, pre-training, etc. Experiments are prioritized based on the type, project timeline, and quantity and quality of data.
- Data Collection: For a given experiment, the AI professional might collect structured or unstructured data from existing proprietary databases, use open-source datasets, or extract data using Python scripts, for example by crawling text or images from relevant websites.
- Data Cleaning: The steps in cleaning depend on the data, problem, and experiment. For example, for a classification problem on structured data, AI professionals can impute missing values, normalize extreme values, remove duplicate samples, etc.
- Exploratory Data Analysis (EDA): The goal of EDA is to find patterns in cleaned data which helps in selecting relevant features for modeling and understanding relationships among them. EDA can also help identify how to further clean the data for modeling.
- Feature Engineering: The knowledge from domain experts and EDA patterns help AI professionals create new features that might increase the performance of models in the experiment. In short, generating relevant new features from existing features is called feature engineering.
- Data Modeling: The goal of a model is to try to replicate domain experts’ decision-making process. AI professionals come up with mathematical algorithms and build models using relevant features to automate the decision-making process.
- Tuning and Evaluation: Optimal hyperparameters can be found to maximize the model performance by comparing the metrics of each version of the model in an experiment on evaluation data.
- Experiment Tracking and Evaluation: Steps 2 to 7 (Data Collection through Tuning and Evaluation) are repeated for each experiment, and the experiments are compared at the end. Experiment tracking tools like Neptune AI and Weights & Biases can efficiently track experiment information and present it in a good user interface.
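The Data Cleaning step from the list above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a prescribed method: the function name `clean_records`, the field names, and the bounds are hypothetical, and clipping is just one of several ways to handle extreme values.

```python
from statistics import median


def clean_records(records, numeric_field, lower, upper):
    """Deduplicate, impute missing values with the median, and clip extremes."""
    # 1. Remove exact duplicate samples while preserving the original order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy so the raw data is not mutated

    # 2. Impute missing values with the median of the observed values.
    observed = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill_value = median(observed)
    for rec in unique:
        if rec[numeric_field] is None:
            rec[numeric_field] = fill_value

    # 3. Clip extreme values into the range [lower, upper].
    for rec in unique:
        rec[numeric_field] = min(max(rec[numeric_field], lower), upper)
    return unique


raw = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # duplicate sample
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 250},   # extreme value
    {"id": 4, "age": 40},
]
cleaned = clean_records(raw, "age", lower=0, upper=100)
print(cleaned)
```

In a real project the same steps are usually done with a dataframe library, and the choice of imputation and clipping rules depends on the data and the experiment.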
The best-performing model across the experiments is selected for further work in the development environment.
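Tuning and selection can be illustrated with a toy example. The sketch below assumes a tiny one-dimensional k-nearest-neighbours classifier as a stand-in for a real model; the data, the candidate values of `k`, and the helper names are all made up for illustration. It compares one metric (accuracy) for each hyperparameter value on evaluation data and keeps the best one.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def accuracy(train, eval_data, k):
    """Fraction of evaluation samples classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in eval_data)
    return hits / len(eval_data)


# Toy (feature, label) pairs; (2.5, 1) is a deliberately noisy training sample.
train_data = [(1.0, 0), (2.0, 0), (3.0, 0), (2.5, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
eval_data = [(2.4, 0), (1.5, 0), (3.2, 0), (6.5, 1), (7.5, 1)]

# Evaluate every candidate hyperparameter value on the evaluation data.
scores = {k: accuracy(train_data, eval_data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # the first best candidate wins ties

print(scores)
print("selected k:", best_k)
```

Here `k=1` is misled by the noisy sample, while larger values of `k` smooth it out. With real models the same loop is typically handled by a tuning library, and the selected configuration is logged in the experiment tracker.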
A development environment is where AI professionals create components by cleaning and modularising code from Jupyter notebooks, adding dependencies (PyTorch, NumPy, pandas, etc.), and packaging them. A component is organized, modular, maintainable, and reusable code that performs one step of the AI/ML pipeline, such as data extraction.
In applied Machine Learning, the AI/ML pipeline automates the sequence of steps performed by the components and the interactions between them, as defined by the AI/ML system design. The components include data collection, data preprocessing, model development and fine-tuning, post-processing on predictions, model evaluation, model deployment, maintenance, and monitoring.
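To make the idea of components and a pipeline concrete, here is a minimal sketch. It assumes each component is a plain function that takes the previous component's output; the `Pipeline` class, the component names, and the threshold "model" are hypothetical placeholders, not part of any specific framework.

```python
from typing import Any, Callable, List, Tuple

Component = Callable[[Any], Any]


class Pipeline:
    """Runs named components in order, passing each output to the next step."""

    def __init__(self, steps: List[Tuple[str, Component]]):
        self.steps = steps

    def run(self, data: Any = None) -> Any:
        for name, component in self.steps:
            data = component(data)
        return data


def extract(_: Any) -> List[str]:
    # Placeholder for reading from a database, file, or API.
    return [" 3", "5 ", "oops", "8"]


def preprocess(rows: List[str]) -> List[int]:
    # Keep only rows that parse as integers.
    return [int(row) for row in rows if row.strip().isdigit()]


def predict(values: List[int]) -> List[bool]:
    # Placeholder model: a fixed threshold instead of a trained model.
    return [value > 4 for value in values]


def postprocess(flags: List[bool]) -> List[str]:
    return ["high" if flag else "low" for flag in flags]


pipeline = Pipeline([
    ("data_extraction", extract),
    ("preprocessing", preprocess),
    ("model", predict),
    ("postprocessing", postprocess),
])
predictions = pipeline.run()
print(predictions)
```

Because each component does one step and has a clear input and output, it can be tested, replaced, or reused without touching the rest of the pipeline.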
Some best practices while packaging the code:
- Create a Git repository and define the code repository structure and branching strategy.
- Install an IDE like PyCharm, which can automatically create virtual environments for projects and integrates easily with Git.
- Convert Jupyter Notebook code into object-oriented code and save in .py files. Have appropriate variable names, add comments, and organize different files into components with proper hierarchy.
- Create config files containing information shared across multiple components, such as input file location, model location, output file location, cloud or external API credentials, model parameter values, hyperparameter values, etc. Config files make it easy to add, modify, and remove variables for all components across the pipeline.
- Write and automate tests for the components. Write modules to test each component individually (unit testing) and to test the interaction between components (integration testing).
- Use a logger to record each message with a timestamp. Logging makes debugging easier, especially when the code base becomes large and complex. Each message has a logging level: critical, error, warning, info, or debug, in decreasing order of severity (notset means that no level has been set on the logger). The level configured on a logger is the minimum severity it records. For example, if you set “level=logging.WARNING”, only messages logged as warning, error, or critical are recorded, and info and debug messages are ignored.
- Version control plays a crucial role in the development environment. Unlike traditional software engineering, where only changes in code are tracked (code versioning), data used for training, testing, and evaluation can also be tracked (data versioning), especially if data is large and dynamic. DVC, Delta Lake, and LakeFS are some open-source data versioning tools.
- Depending on the requirements, a server is often built using web frameworks like FastAPI, Flask, or Django to deliver predictions to other software components.
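The logging practice from the list above can be tried out with Python's standard `logging` module. In this small sketch the logger name `pipeline.data_extractor` and the messages are made up, and the output is written to an in-memory stream only so that it can be inspected; a real project would more likely log to the console or a file.

```python
import io
import logging

# Capture log output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

logger = logging.getLogger("pipeline.data_extractor")  # hypothetical component name
logger.setLevel(logging.WARNING)  # minimum severity that will be recorded
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

logger.debug("row-level details")          # ignored: below WARNING
logger.info("extraction started")          # ignored: below WARNING
logger.warning("input file is empty")      # recorded
logger.error("could not reach database")   # recorded
logger.critical("model file is missing")   # recorded

logged = stream.getvalue().splitlines()
print("\n".join(logged))
```

Only the warning, error, and critical messages appear in the output, each prefixed with the time at which it was logged.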
The packaged code is further used in the production environment.
Depending on the size and timeline of the project, the development and production environments may be the same or separate. Generally, the production environment is the phase in which the models in the pipeline are scaled, monitored, and served in real time by containers.
Some of the tasks performed in the production environment are:
- Design Optimization: In general, the number and quality of models often differ considerably between the research, development, and production environments. Hence, if required, the AI/ML system design created earlier in the development environment needs to be optimized and redesigned for production.
- Containerization using Docker: Developers might use multiple components like Data Extractor, Elasticsearch, REST API, Messaging Queues, etc. Each component has its own dependency libraries. Having components that need different versions of a library in the same environment might lead to conflicts. With the help of Docker, AI professionals can standardize environments and run different containers for different components in isolation, where each container has the libraries its component depends on. Docker creates an environment from a Dockerfile. A Dockerfile contains instructions like navigating to a folder, installing dependencies, setting environment variables, loading configuration parameters for the model, etc. Scaling is easy with containers because AI professionals can spin up new containers for the same component in seconds to satisfy the scaling requirements.
- Workflow Orchestration and Infrastructure Abstraction: Workflow orchestration tools like Kubernetes (originally developed at Google) and Red Hat’s OpenShift can quickly spin up multiple containers on different machines on demand, manage resources like memory and compute for containers, and keep containers highly available for the product. Depending on the organization, the infrastructure of the workflow orchestration tool is owned by a separate team like DevOps or by the AI professionals themselves. Some AI professionals might find it tedious to work with workflow orchestration tools directly. They can use infrastructure abstraction tools like Google’s Kubeflow and Netflix’s Metaflow, which are built on top of workflow orchestration tools and allow them to focus more on models and stop worrying about low-level infrastructure.
- Continuous Integration and Continuous Delivery (CI/CD): CI/CD enables AI professionals to work together in a shared code repository. Each individual’s update to the code is automatically built and tested when it is pushed to the repository and is then delivered and deployed, so code issues can be tracked and resolved as they arise.
- Monitoring and Maintaining the Deployed Models: Unlike traditional software, AI/ML models are dynamic and degrade over time. Hence, it is essential to measure, monitor, and govern the different metrics and tune models before they negatively impact user experience and business value.
In general, the model’s health can be measured by three types of metrics.
a. Resource Metrics: These measure incoming traffic, CPU/GPU memory usage or utilization (Does the server use resources efficiently?), prediction latency (Does the server handle requests quickly?), throughput (Does the server maintain good throughput and scale with the number of requests?), and cost (Are the hosting and inference costs of the entire ML pipeline as expected, or have they increased?).
b. Data Metrics: It is essential to check if the input data format is correct first instead of debugging the entire pipeline.
- Anomaly Checks: Simple checks like having max and minimum values for each feature (age cannot be negative or 100000) can identify and validate extreme or anomalous data points in input data. Later the team can brainstorm and find root causes for receiving these anomalies from users.
- Data Quality Issues: Users might give synonyms (“Girl” for “Female”) or incorrect values (“Mail” instead of “Male”) as input to the pipeline. In these cases, the model might fail to recognize the value in the feature “Gender” (data might be absent while training model) and assign NaN for the feature. Even though the model doesn’t break, the predictions produced by the model might be wrong. Hence, testing new data that the model hasn’t seen before is essential.
- Data Drift: When we train a model with some static data, it assumes specific patterns based on the distribution of provided data. However, real-world data is dynamic. Because of these changes, the assumptions made by the model might no longer be valid, and the model might get biased, which leads to bad performance in real-time model evaluation. For example, water consumption in hospitals during COVID-19 is very high compared to historical data. Hence, we cannot use a model built on historical water consumption data during COVID-19. This phenomenon is called “Data Drift.” Periodically detecting changes in the distribution of data using statistical tests can help to detect data drift.
c. Model Metrics: It is crucial to decide the expected performance of models before deploying them into production and to check periodically whether the expected KPIs are met. If model predictions or KPI values are poor compared to benchmarks, AI professionals might be facing Model Drift. Model Drift is a phenomenon where the relationship between the features and the quantity being predicted changes, so the model no longer gives accurate predictions. For example, the relationship between births and deaths changed during COVID-19, causing Model Drift. Model Drift can be detected by periodically analyzing feedback from feedback loops and correlating it with what is affecting the business.
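The anomaly checks and data quality checks described under Data Metrics can be sketched as a small validation function that runs before a record reaches the model. Everything here is an illustrative assumption: the allowed age range, the accepted categories, the synonym table, and the function name `validate` would all come from the team's own knowledge of the data.

```python
# Hypothetical validation rules; real values come from the training data and domain experts.
RANGES = {"age": (0, 120)}
ALLOWED = {"gender": {"male", "female"}}
SYNONYMS = {"gender": {"girl": "female", "boy": "male"}}


def validate(record):
    """Return a cleaned copy of the record and a list of detected issues."""
    cleaned, issues = dict(record), []

    # Anomaly check: numeric features must fall inside a plausible range.
    for field, (low, high) in RANGES.items():
        value = cleaned.get(field)
        if value is None or not (low <= value <= high):
            issues.append(f"{field}={value!r} outside [{low}, {high}]")

    # Data quality check: map known synonyms, flag values never seen in training.
    for field, allowed in ALLOWED.items():
        raw = str(cleaned.get(field, "")).strip().lower()
        value = SYNONYMS.get(field, {}).get(raw, raw)
        if value in allowed:
            cleaned[field] = value
        else:
            issues.append(f"{field}={cleaned.get(field)!r} not recognised")

    return cleaned, issues


ok_record, ok_issues = validate({"age": 34, "gender": "Girl"})
bad_record, bad_issues = validate({"age": 100000, "gender": "Mail"})
print(ok_record, ok_issues)
print(bad_record, bad_issues)
```

Records that come back with issues can be logged and reviewed instead of being silently turned into NaN values, which makes wrong predictions easier to trace to their cause.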
Based on Model Metrics and Data Metrics, if re-training is required, the AI professional might repeat the work in the research, development, and production environments to deploy and monitor the new model. Often, re-training is also a way to improve the model’s performance so that it reflects the change in data over time.
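One way to turn the data metrics into a re-training signal is to compare the distribution of a feature in production against the training data. The sketch below uses the Population Stability Index (PSI), a simple distribution-distance measure rather than a formal statistical test; the simulated water-consumption values and the 0.2 threshold (a commonly quoted rule of thumb) are assumptions for illustration only.

```python
import math
import random
from bisect import bisect_right


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    ordered = sorted(reference)
    # Interior bin edges at reference quantiles: each bin holds ~1/bins of the reference.
    edges = [ordered[len(ordered) * i // bins] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for value in sample:
            counts[bisect_right(edges, value)] += 1
        # eps avoids log(0) when a bin is empty.
        return [max(count / len(sample), eps) for count in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


rng = random.Random(42)  # fixed seed so the simulation is repeatable
training = [rng.gauss(50, 10) for _ in range(2000)]      # historical consumption
normal_week = [rng.gauss(50, 10) for _ in range(2000)]   # same distribution
unusual_week = [rng.gauss(65, 10) for _ in range(2000)]  # shifted distribution

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb cut-off
for name, sample in [("normal", normal_week), ("unusual", unusual_week)]:
    value = psi(training, sample)
    print(f"{name}: PSI={value:.3f}, review for re-training={value > DRIFT_THRESHOLD}")
```

Running such a check on a schedule for each important feature gives an early warning, after which the team can decide whether re-training is actually needed.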
It is good for any individual who wants to make a career in AI to be aware of the basics of all three environments. This awareness can help individuals identify skills required to work in an environment. Based on their interest, they can choose to specialize in one or more of these three environments and eventually make their career decisions in the current rapidly changing data world.