
Navigating the GenAI Tools Landscape: Unveiling the Hierarchy of Tools

Let’s review the landscape of Generative Artificial Intelligence (GenAI) tools and unravel the layers that shape this cutting-edge technology.

1️⃣ Foundational Models: The Pillars of Intelligence
At the base of the GenAI hierarchy are foundational models, the bedrock of artificial intelligence. Models such as GPT-4, BERT, Bard, Claude, PaLM 2, LLaMA, DALL-E, and Cohere's models serve as the fundamental building blocks, enabling diverse applications and breakthroughs across industries.
Foundation models distinguish themselves from traditional machine learning architectures by not aiming to excel at a specific task but rather to provide an optimal base for constructing specialized AI models. This approach goes beyond conventional transfer learning, such as that used in image classification, by creating more generalized models and significantly reducing the need to update the weights of the final layers during fine-tuning, which streamlines the process of tailoring models for domain-specific tasks. For instance, a large language model can first learn general language understanding from diverse data and then be fine-tuned to address specific inquiries, such as those related to medical records or math exams. These models are trained through self-supervised learning, where the labels are generated directly from the input data, for example by artificially removing a word from a sentence and tasking the foundation model with predicting it.
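To make the self-supervised idea concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline with BERT; the model choice and the example sentence are illustrative assumptions, not something prescribed by this article.

# A hedged sketch of masked-word prediction, the self-supervised task described above.
# Assumes the Hugging Face `transformers` package is installed; "bert-base-uncased"
# is one publicly available foundation model trained with this objective.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The label comes from the input itself: we hide one word and ask the model to predict it.
for prediction in fill_mask("The patient was prescribed a daily [MASK] for hypertension."):
    print(prediction["token_str"], round(prediction["score"], 3))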
For the majority of enterprises, building at the foundational-model layer is impractical as a business. The substantial investment required for training data and compute is prohibitively expensive, leaving this domain accessible only to a select group of major industry players. Most businesses will use, rather than build, new foundational models.

2️⃣ Memory API Layer: Empowering Contextual Understanding
Above the foundational models lies the Memory API layer, adding a nuanced dimension to AI capabilities. This layer equips models with memory and contextual understanding, fostering more intelligent and context-aware interactions.
Large Language Models (LLMs) operate in a stateless manner, lacking the ability to recall prior messages in a conversation. Developers bear the responsibility of preserving the conversation history and supplying context to the LLM, often necessitating storage in a persistent database for context retrieval in subsequent interactions. Equipping LLMs with short-term and long-term memory stands as a pivotal task for developers.
Another challenge arises from the absence of a one-size-fits-all rule for LLMs. To address various scenarios like sentiment analysis, classification, question answering, and summarization, developers may need to employ multiple specialized models. Managing these diverse LLMs introduces complexity and demands intricate handling.
Introducing a unified API layer for LLM applications: LangChain, an SDK crafted to streamline LLM integration, tackles the aforementioned challenges. Operating much like an ODBC or JDBC driver that abstracts the underlying database, LangChain hides the intricacies of the underlying LLM implementations behind a straightforward, unified API. This abstraction enables developers to swap models in and out with minimal code modifications.
Launched by Harrison Chase in late October 2022, concurrently with the ascent of ChatGPT, LangChain has rapidly evolved into a robust tool for interacting with LLMs. Embraced by an actively contributing community, it has solidified its position as one of the premier frameworks for seamless LLM interaction.
Functioning as a dynamic ecosystem, LangChain efficiently integrates with external tools, orchestrating the flow to obtain desired outcomes from LLMs.
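As a rough illustration of both points above, here is a minimal sketch of giving an LLM short-term conversational memory through LangChain. The class names ConversationChain and ConversationBufferMemory come from classic LangChain releases and may differ in newer versions; an OpenAI API key is assumed to be configured.

# A minimal sketch, assuming the classic LangChain API (import paths and class
# names have shifted across versions) and an OPENAI_API_KEY in the environment.
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# The buffer memory stores prior turns and injects them into each new prompt,
# compensating for the stateless nature of the LLM described above.
conversation = ConversationChain(
    llm=OpenAI(temperature=0),
    memory=ConversationBufferMemory(),
)

conversation.predict(input="Hi, I'm researching vector databases.")
conversation.predict(input="What was I researching again?")  # answered from memory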

3️⃣ Vector Databases: Bridging Information Gaps
Vector databases serve as repositories for structured and unstructured data, including text or images, incorporating their respective vector embeddings. These embeddings represent the numerical essence of the data, captured in an extensive list of numbers, conveying the semantic meaning of the original data object. Typically, machine learning models are employed to generate these vector embeddings.
In the realm of vector space, where akin objects cluster closely, the proximity of data objects is gauged by the distance between their vector embeddings. This paves the way for a novel search methodology known as vector search, which retrieves objects based on similarity. In contrast to conventional keyword-driven searches, semantic search introduces a more adaptable approach to item retrieval.
Imagine a three-dimensional representation of a semantic vector space where the query term “kitten” resides in close proximity to “cat,” “dog,” and “wolf,” while maintaining distance from unrelated terms like “apple” and “banana.”
While conventional databases can store vector embeddings and even support vector search, vector databases stand out as AI-native platforms optimized for rapid, large-scale vector search. Because a naive vector search computes the distance between the query and every data object, a classical K-Nearest-Neighbor scan is computationally intensive. Vector databases instead build vector indexes that organize the embeddings ahead of time, so only a fraction of the distances need to be computed at query time. Consequently, vector databases empower users to efficiently locate and retrieve similar objects at scale in production environments.
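To see why similarity search works, here is a toy brute-force version with invented three-dimensional embeddings; real embeddings have hundreds or thousands of dimensions, and a vector database would use an approximate index instead of this exhaustive scan.

# A toy illustration of vector search with made-up 3-D embeddings; real systems
# use learned, high-dimensional embeddings and an approximate index (e.g. HNSW).
import numpy as np

embeddings = {
    "cat":    np.array([0.9, 0.8, 0.1]),
    "dog":    np.array([0.8, 0.7, 0.2]),
    "wolf":   np.array([0.7, 0.6, 0.3]),
    "apple":  np.array([0.1, 0.2, 0.9]),
    "banana": np.array([0.1, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.85, 0.75, 0.15])  # pretend this is the embedding for "kitten"

# Brute-force K-Nearest-Neighbor scan: compare the query against every object.
ranked = sorted(embeddings, key=lambda w: cosine_similarity(query, embeddings[w]), reverse=True)
print(ranked)  # the animal terms rank above the fruit terms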

4️⃣ Data Lake: The Reservoir
A data lake such as Deep Lake serves as an AI-focused database, distinguished by a storage format meticulously tailored for deep-learning applications. Its versatile utility encompasses:
Storage of data and vectors during LLM application development.
Effective management of datasets in the training of deep learning models.
Facilitating the seamless deployment of enterprise-grade LLM-based products.
Deep Lake provides storage for diverse data types (including embeddings, audio, text, videos, images, PDFs, annotations, etc.), along with features such as robust querying, vector search, data streaming during scalable model training, data versioning, lineage tracking, and integrations with leading tools such as LangChain, LlamaIndex, and Weights & Biases, among others. Noteworthy is Deep Lake's compatibility with data of any size, its serverless architecture, and its capacity to consolidate all data securely in your preferred cloud environment, offering a unified storage solution.
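As a rough sketch of the LangChain integration mentioned above, older LangChain releases exposed Deep Lake as a vector store roughly as follows. The dataset path, the embedding class, and the embedding_function parameter name are assumptions drawn from those older releases and may not match current versions.

# A hedged sketch of using Deep Lake as a LangChain vector store; the import
# paths and parameter names reflect older releases and are assumptions,
# not a guaranteed current API.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

db = DeepLake(
    dataset_path="./my_deeplake_store",      # hypothetical local path
    embedding_function=OpenAIEmbeddings(),   # any embedding model could be used here
)
db.add_texts(["Deep Lake stores embeddings alongside the raw data."])

results = db.similarity_search("Where are embeddings stored?")
print(results[0].page_content)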

🌟 Harmonizing the Symphony of GenAI Tools
The true power of GenAI emerges when these tools work in harmony. Together, they create a symphony of intelligence, reshaping the possibilities of what AI can achieve.
In the dynamic landscape of GenAI, understanding the hierarchy of tools is key to unleashing their full potential. As we continue to explore and innovate, this hierarchy serves as a roadmap for harnessing the transformative capabilities of artificial intelligence.

👣 Your Next Steps
Strategically, business leaders should identify the most advantageous position within the tool hierarchy at which to establish their enterprises, while judiciously allocating limited resources for maximum business impact. For personalized guidance and advanced strategies, connect with me at peterchen@hyperplanar.com. Let's collaboratively explore this new frontier to unlock the full potential of this technology and achieve your business objectives.

#GenAI #ArtificialIntelligence #Innovation #TechTrends #Langchain #DataLake #DeepLake #LargeLanguageModels #Vectordatabases #Strategy #AIUseCases🚀


Understanding Data Science & Machine Learning Project End-to-End Workflow

 

When we are faced with a new data science project, how should we go about it? It's good to have a systematic process for doing things.

Doing data science right is not a mysterious process. We’ll review how the sausage is made for best results in data science projects.

The steps for a typical data science project are as follows:

1) Understand the Business Problem
2) Obtain the Data
3) Explore and visualize the Data
4) Prepare the Data
5) Select a model and train it
6) Fine-tune the model(s)
7) Present the solution
8) Productionalize the model (launch, monitor, and maintain your machine learning system)

Let’s examine each of these steps for an example case study.

1. Understanding the Business Problem

The most important aspect of a data science project is not the latest machine learning algorithms, hyperparameter optimizations, or the programming language it will be implemented in, but rather a deep understanding of the business problem at hand.

We need to understand the problem we are trying to solve. What are the requirements, constraints, and end use cases?

In this case study, we use sample data of car prices. While this is just an example data set, one can imagine it being used by an online car dealership or a website like Kelley Blue Book to price cars.

2. Obtain the Data

Given this is an example case study, we will get the data from a Github repo here. In real life, the process of obtaining the data can potentially involve a lot of work. If you work in a data science team, there might be team members whose sole responsibilities are to curate and manage the data, such as a database team from IT or data engineers.

Oftentimes the data is in some kind of database, and you will have to pull it using SQL. Sometimes that can be as straightforward as just selecting a table from a database. Other times, it can involve complex joins and aggregations.

This step can be as simple or as complex as it gets depending on the type of data and the environment you are in.
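For instance, a SQL pull through pandas might look something like the sketch below; the connection string and the table and column names are purely hypothetical.

# A hypothetical sketch of pulling data from a database with SQL via pandas;
# the connection string and table/column names are made up for illustration.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@dbhost:5432/sales")  # hypothetical

query = """
    SELECT c.price, c.age, c.km, c.fuel_type
    FROM cars c
    JOIN listings l ON l.car_id = c.id
    WHERE l.status = 'sold'
"""
car = pd.read_sql(query, engine)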

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
In [3]:
%matplotlib inline
In [4]:
car = pd.read_csv(r"C:\Users\Admin\Downloads\ToyotaCorolla.txt")

Let’s examine the first few rows of data.

In [5]:
car.head(5)
Out[5]:
Price Age KM FuelType HP MetColor Automatic CC Doors Weight
0 13500 23 46986 Diesel 90 1 0 2000 3 1165
1 13750 23 72937 Diesel 90 1 0 2000 3 1165
2 13950 24 41711 Diesel 90 1 0 2000 3 1165
3 14950 26 48000 Diesel 90 0 0 2000 3 1165
4 13750 30 38500 Diesel 90 0 0 2000 3 1170

3. Explore and Visualize the Data

Let’s explore and visualize our data.

There is one categorical variable named FuelType. Let’s look at the different fuel types.

In [97]:
car['FuelType'].value_counts()
Out[97]:
Petrol    1264
Diesel     155
CNG         17
Name: FuelType, dtype: int64
car.plot(kind="scatter", x="Age",y="Price")
Out[98]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fe6404de48>

This shows the relationship between Price and the Age of the car: the older the car, the cheaper it is. This makes intuitive sense.

In [99]:
car.plot(kind="scatter", x="Weight",y="Price")
Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fe6425a6a0>

This graph shows the relationship between Price and Weight of the car, and it implies that the heavier the car, the more expensive it is. That is true most of the time, but not always: a small luxury car can be more expensive than a heavy truck.

There's a better, more compact way to explore: a "scatter matrix" of all numerical variables plotted against each other.

In [100]:
from pandas.plotting import scatter_matrix

attributes = ["Price","Age","KM","Weight"]
scatter_matrix(car[attributes]);
In [101]:
car.hist(bins=50, figsize=(20,15))
plt.show()
In [102]:
car_corr_matrix = car[attributes].corr()
In [103]:
car_corr_matrix["Price"].sort_values(ascending=False)
Out[103]:
Price     1.000000
Weight    0.581198
KM       -0.569960
Age      -0.876590
Name: Price, dtype: float64

4. Prepare the Data

Data preparation is the most time-consuming aspect of any data science project. It can involve data cleanup, data transformations, renaming variables, dropping or imputing missing values, and potentially many other steps. Data preparation, or data munging, can take up to 80% of a data scientist's time.

At HyperPlanar, we provide data cleaning, data munging, and data preparation services so your data scientists don’t have to!

We have one categorical variable called FuelType. We need to convert it into a numerical variable so that we can feed it into our machine learning models. We use a one-hot encoder.

In [6]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
encoder = LabelEncoder()

FuelType_cat = car['FuelType']
FuelType_cat_encoded = encoder.fit_transform(FuelType_cat)
FuelType_cat_encoded

print(encoder.classes_)
print(FuelType_cat_encoded)
['CNG' 'Diesel' 'Petrol']
[1 1 1 ... 2 2 2]

One issue with the above encoding of categorical variables into numerical ones is that by assigning a number to each category, we make the algorithm think there is some implicit hierarchy or distance between the categories. For example, does a numerical value of 2 mean it is twice that of 1?

Hardly. We merely want to transform our categorical variable into numerical values.

A better approach is to turn the categories into dummy variables, which creates one binary attribute per category. If the category is present, the attribute is 1; otherwise it is 0. This is called one-hot encoding. Fortunately, scikit-learn provides such functionality, as shown below.

In [105]:
import warnings
warnings.filterwarnings('ignore')

encoder = OneHotEncoder()
FuelType_cat_1hot = encoder.fit_transform(FuelType_cat_encoded.reshape(-1,1))
FuelType_cat_1hot 
Out[105]:
<1436x3 sparse matrix of type '<class 'numpy.float64'>'
	with 1436 stored elements in Compressed Sparse Row format>

There are a host of things that might need to be addressed in order to prepare the data. How do we handle missing values?

Do we eliminate them? Do we impute them? Do we interpolate them?

This is a big discussion that can be driven by a number of business, practical, and implementation issues.
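For reference, the common options look roughly like this in pandas and scikit-learn; our car data happens to have no missing values, so the snippet is purely illustrative.

# Illustrative options for handling missing values; the car data has none,
# so none of these lines are actually required for this case study.
from sklearn.impute import SimpleImputer

# Option 1: eliminate rows that contain missing values
car_dropped = car.dropna()

# Option 2: interpolate gaps in the numerical columns
car_interpolated = car.select_dtypes(include="number").interpolate()

# Option 3: impute missing values, here with the column median
imputer = SimpleImputer(strategy="median")
car_num_imputed = imputer.fit_transform(car.drop("FuelType", axis=1))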

Feature Scaling

A very important transformation to apply to your data is feature scaling. Most machine learning algorithms perform better when the features are rescaled to a comparable range, such as 0 to 1.
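A minimal sketch of two common scalers from scikit-learn; we use StandardScaler inside the pipeline later, while MinMaxScaler is shown here only to illustrate the 0-to-1 rescaling.

# Illustrative feature scaling on the numerical columns of the car data.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_cols = car.select_dtypes(include="number")

# MinMaxScaler maps each feature into the [0, 1] range.
scaled_01 = MinMaxScaler().fit_transform(num_cols)

# StandardScaler rescales each feature to zero mean and unit variance.
standardized = StandardScaler().fit_transform(num_cols)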

Creating Pipelines

In real life, to scale our data preparation efforts, we need to build robust data pipelines that do the data preparation in a generalized and automated fashion.

Fortunately, scikit-learn allows one to do exactly that. You can build your own custom data pipeline to handle any data processing need.

Below we create a two-stage data pipeline for numerical data. The first stage imputes missing values. In this example we fortunately didn't have any, but if we did, the pipeline would replace them with the median value. The second stage normalizes the data using the standard scaler. (Note that car_copy, the training features used here, is created in the train/test split step shown later in this post.)

The number of stages can be arbitrarily long depending how much data cleaning and preparation are needed.

In real life, some of these data preparation pipelines are often built using industrial-strength ETL tools outside of Python. However, even after all of the ETL work, when the data finally gets to the data scientist, she might still need to do further data refinements. Thus it's important to understand this data pipeline framework provided by scikit-learn. It can come in handy.

In [130]:
# Select only numerical attributes from the car dataframe
car_num = car_copy.drop('FuelType',axis=1)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer


num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

car_num_tr = num_pipeline.fit_transform(car_num)

What about categorical variables?

We saw earlier that we can use OneHotEncoder to transform categorical variables into dummy variables.

Check the version of scikit-learn you have. If you have version 0.20 or above, then you can use the ColumnTransformer to combine both the numerical data pipeline and the categorical data pipeline into one full data pipeline.

In [131]:
print('The scikit-learn version is {}.'.format(sklearn.__version__))

from sklearn.compose import ColumnTransformer

# Select only numerical attributes from the car dataframe
car_num = car_copy.drop('FuelType',axis=1)
car_num_attribs = list(car_num)
print(car_num_attribs)

# Select only categorical attributs from the car dataframe
car_cat = ["FuelType"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, car_num_attribs),
        ("cat", OneHotEncoder(), car_cat),
    ])

car_prepared = full_pipeline.fit_transform(car_copy)
The scikit-learn version is 0.20.3.
['Age', 'KM', 'HP', 'MetColor', 'Automatic', 'CC', 'Doors', 'Weight']

Splitting your data into training and test set

In [ ]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(car, test_size=.2, random_state=99)

car_copy = train_set.drop("Price", axis=1)
car_labels = train_set["Price"].copy()

5. Select a model to train

There are many machine learning models to choose from. We will use linear regression for illustrative purposes.

In [132]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(car_prepared, car_labels)
lin_reg
Out[132]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
In [133]:
lin_reg.coef_
Out[133]:
array([-2320.41707424,  -638.71996324,   961.23676461,    27.23176155,
          89.08251266,  -835.8564215 ,    -5.52230809,  1009.52269589,
       -1410.85742461,  2038.36899598,  -627.51157137])
In [134]:
lin_reg.intercept_
Out[134]:
11076.220173445206

Let’s pick some data that is not part of the training set to see how the model performs.

In [7]:
some_data = car.iloc[:5]
some_labels = car["Price"].iloc[:5]
some_labels
some_data
Out[7]:
Price Age KM FuelType HP MetColor Automatic CC Doors Weight
0 13500 23 46986 Diesel 90 1 0 2000 3 1165
1 13750 23 72937 Diesel 90 1 0 2000 3 1165
2 13950 24 41711 Diesel 90 1 0 2000 3 1165
3 14950 26 48000 Diesel 90 0 0 2000 3 1165
4 13750 30 38500 Diesel 90 0 0 2000 3 1170
In [136]:
some_data = car.iloc[:5]
some_labels = car["Price"].iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [16602.55959497 16167.72756952 16567.12272735 16155.95161215
 15912.06421056]

We can see above that the predictions are quite off. Can we improve on this? Let's look at a common metric for model performance: the RMSE (root mean squared error).

In [137]:
from sklearn.metrics import mean_squared_error

car_predictions = lin_reg.predict(car_prepared)
lin_mse = mean_squared_error(car_labels, car_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Out[137]:
1323.9928909490125

Let's try a different model to see how it performs. Using a decision tree this time, we see a dramatic improvement in RMSE. But is this purely due to overfitting the training data set?

In [139]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(car_prepared, car_labels)

car_predictions = tree_reg.predict(car_prepared)
tree_mse = mean_squared_error(car_labels, car_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Out[139]:
4.173919355648411

Cross validation

A common approach to address this potential overfitting is K-fold cross-validation. The basic idea is to randomly split the training set into k distinct subsets called folds, then train and evaluate the model k times, picking a different fold for evaluation each time and using the remaining k-1 folds for training. With cv=10 below, the result is an array of 10 evaluation scores.

Cross Validation: Decision Tree

In [141]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, car_prepared, car_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)


def display_scores(scores):
     print("Scores:", scores)
     print("Mean:", scores.mean())
     print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
Scores: [1563.69399319 1190.40301908 1491.26645915 1375.29064438 1586.06342052
 1350.21876166 1478.27632    1395.09771202 1392.03688006 1494.0589246 ]
Mean: 1431.6406134653814
Standard deviation: 110.36980015246972

Looking at the 10 RMSE scores for the decision tree above, we now realize they are not as good as they originally appeared. That's the value of cross-validation.

Cross Validation: Linear Regression

In [142]:
lin_scores = cross_val_score(lin_reg, car_prepared, car_labels,
                             scoring="neg_mean_squared_error", cv=10)

lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
Scores: [1790.60806985 1079.4890871  1354.41645262 1156.13583165 1223.24186552
 1672.36954956 1299.98847273 1159.90326109 1391.3549945  1338.29801536]
Mean: 1346.5805599978762
Standard deviation: 215.70173075276068

Cross validation: Random Forest

In [146]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(car_prepared, car_labels)

forest_scores = cross_val_score(forest_reg, car_prepared, car_labels,
                             scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Scores: [1254.82777096  999.11895771 1105.87744269 1097.95918942 1288.27337492
 1258.0307541  1370.67418261  983.05177377 1113.47562604 1201.30878942]
Mean: 1167.259786164218
Standard deviation: 121.14272213056994

6. Fine-tune the model(s)

Now that we have a few candidate models that we like, how can we fine-tune them? One common approach is to do an exhaustive search over the parameters of the model. Fortunately, scikit-learn has such functionality built in with GridSearchCV.

In [149]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(car_prepared, car_labels)

grid_search.best_params_
Out[149]:
{'max_features': 6, 'n_estimators': 30}
In [150]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
     print(np.sqrt(-mean_score), params)
1420.4531384950621 {'max_features': 2, 'n_estimators': 3}
1328.0672920170255 {'max_features': 2, 'n_estimators': 10}
1216.4670304401113 {'max_features': 2, 'n_estimators': 30}
1297.2390660711815 {'max_features': 4, 'n_estimators': 3}
1190.8879934544348 {'max_features': 4, 'n_estimators': 10}
1126.5438814442048 {'max_features': 4, 'n_estimators': 30}
1305.1139392798302 {'max_features': 6, 'n_estimators': 3}
1150.2807036722 {'max_features': 6, 'n_estimators': 10}
1123.5588187027242 {'max_features': 6, 'n_estimators': 30}
1336.9600584441537 {'max_features': 8, 'n_estimators': 3}
1154.5831679963803 {'max_features': 8, 'n_estimators': 10}
1123.597469529665 {'max_features': 8, 'n_estimators': 30}
1439.6842690066067 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
1289.4326406195564 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
1372.2628117973209 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
1227.6852538592918 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
1279.458643051782 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
1215.2991496837428 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
In [151]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[151]:
array([0.65000746, 0.16869407, 0.02841248, 0.00488693, 0.00128537,
       0.01683511, 0.00555957, 0.11872873, 0.00068003, 0.00272335,
       0.0021869 ])
In [152]:
attributes = car_num_attribs + car_cat

sorted(zip(feature_importances, attributes), reverse=True)
Out[152]:
[(0.650007455300971, 'Age'),
 (0.1686940707330714, 'KM'),
 (0.11872872591807493, 'Weight'),
 (0.028412477973376236, 'HP'),
 (0.016835109955781705, 'CC'),
 (0.005559567904762862, 'Doors'),
 (0.004886934479704075, 'MetColor'),
 (0.0012853708182478045, 'Automatic'),
 (0.0006800340552067958, 'FuelType')]
In [153]:
final_model = grid_search.best_estimator_

X_test = test_set.drop("Price", axis=1)
y_test = test_set["Price"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)   
In [156]:
print(final_mse)
print(final_rmse)
1174400.7928086417
1083.6977405202254
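Finally, we can get a rough 95% confidence interval for this test RMSE by applying a t-interval to the squared errors: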
In [ ]:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                          loc=squared_errors.mean(),
                          scale=stats.sem(squared_errors)))

7. Present the Results

Now that we have found our best models with the best parameters, we will need to present the results to management or clients.

This part requires a different type of thinking and mindset. The stakeholders might not be interested in hearing about the 20 different models we went through, all of the data cleaning and preparation, or the latest hyperparameter optimization techniques.

This goes back to understanding the business problem. What are we trying to solve, and what is the best solution to the business problem? Explaining the solution and approach in a concise, easy-to-understand fashion will go a long way toward broad acceptance of the results. This might involve socializing the results with other teams, departments, and stakeholders as well.

8. Operationalize

Once the data science team has arrived at the optimal model, then comes the big decision: should they operationalize the model in production?

The answer is not so obvious. One would think that if the data science team has found the best solution, it's time to rock 'n' roll. Well… not quite.

Back in 2006, Netflix launched a competition whose main objective was to improve its recommendation engine. It got a lot of publicity and fanfare. The prize was one million dollars to the first team that could improve the accuracy of Netflix's recommendations by 10%, and it was awarded in 2009 to the BellKor's Pragmatic Chaos team.

Now one would think Netflix would take the winning algorithm and roll it into production. However, reality is much more complex. While the winning model had the best results, it was too complex to operationalize: the amount of engineering effort required to implement it and scale it up to a full running operation was prohibitively expensive.

You can read all about it in this Wired magazine article here.

The moral of the story is that we must balance optimal results with practical implementation costs. Sometimes, the best models don’t necessarily get operationalized. Management might still pick the simpler and easier to implement model in production.

That's why experience matters. One has to balance model complexity, implementation cost, and a host of other practical issues.

Conclusion

We hope you enjoyed this blog post and got a glimpse of the workflow of a typical data science project.

Please feel free to reach out to us at our contact page and let us know how we can be of service to you and your organization.
