Data, Technology

Machine Learning Quick Start Guide for Enthusiasts

Last Update: June 28, 2024

Contributors

nayem

To get started quickly in Machine Learning and work toward a professional level, we need solid basic skills and knowledge in several related topics. In this post we will look at those topics, where to start, and where to find resources for study and practice.

Understand your data first:

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.” —Arthur Conan Doyle.
Machine Learning is highly dependent on data, especially historical data. We have to collect, clean, and preprocess the data (ML models need numeric input), then identify the hidden patterns in the data to teach the machine. Broadly, data is either structured (organized) or unstructured (unorganized).
- Data can also be classified as quantitative or qualitative. These two data types can be defined as follows:
  - Quantitative data: This data can be described using numbers, and basic mathematical procedures, including addition, are possible on the set.
    Quantitative/Numeric data can be broken down, one step further, into discrete and continuous quantities. These can be defined as follows:
    - Discrete data: This describes data that is counted. It can only take on certain values. Examples of discrete quantitative data include a dice roll, because it can only take on six values, and the number of customers in a café, because you can’t have a fractional number of people.
    - Continuous data: This describes data that is measured. It exists on an infinite range of values, such as a person’s height or a temperature.
  - Qualitative data: This data cannot be described using numbers and basic mathematics. This data is generally thought of as being described using “natural” categories and language.
    Qualitative/Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false. Another useful type of categorical data is ordinal data, in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
- There are also four levels of data to understand:
  - The nominal level: Data at the nominal level is mostly categorical in nature. At the nominal level, we deal with data usually described using vocabulary (but sometimes with numbers), with no order, and little use of mathematics. In order to find the center of nominal data, we generally turn to the mode (the most common element) of the dataset.
  - The ordinal level: At the ordinal level, we have data that can be described with numbers and also have a “natural” order, allowing us to put one in front of the other. At the ordinal level, the median is usually an appropriate way of defining the center of the data. The mean, however, would be impossible because division is not allowed at this level.
  - The interval level: At the interval level, addition and subtraction between data points are meaningful. We can use the median and mode to describe this data; however, the most accurate description of the center of the data is usually the arithmetic mean, more commonly referred to simply as “the mean”.
  - The ratio level: The ratio level proves to be the strongest of the four. Not only can we define order and difference, the ratio level allows us to multiply and divide as well. The arithmetic mean still holds meaning at this level, as does a new type of mean called the geometric mean. Data at the ratio level is usually non-negative.
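
The four levels above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not a complete treatment: the sample lists are hypothetical, and `geometric_mean` assumes Python 3.8 or newer.

```python
import statistics

# Hypothetical sample data, one list per level of measurement.
nominal = ["LCD", "LED", "LED", "plasma", "LED"]  # categories with no order
ordinal = [1, 3, 4, 4, 5]                         # ratings with a natural order
interval = [20.0, 22.5, 19.0, 24.5]               # temperatures in degrees Celsius
ratio = [2.0, 8.0, 4.0]                           # non-negative quantities

# Each level uses the measure of center the text recommends for it.
center_nominal = statistics.mode(nominal)         # most common element
center_ordinal = statistics.median(ordinal)       # middle value
center_interval = statistics.mean(interval)       # arithmetic mean
center_ratio = statistics.geometric_mean(ratio)   # geometric mean

print(center_nominal, center_ordinal, center_interval, center_ratio)
```

Note that nothing stops Python from computing a mean of the ordinal ratings; choosing the right measure for each level is up to the analyst.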

Understanding the data helps us collect, clean, and pre-process it as needed to build the ML model.

Programming Language & Libraries:

For machine learning, Python and R are the leading languages, but from a developer’s perspective Python is a good choice for its huge community and rich libraries. We need a good understanding of the language basics, such as lists, list slicing, tuples, dictionaries, counters, sets, zip and argument unpacking, as well as data structures and algorithms. The following libraries are also worth knowing when starting out:
- NumPy is a well-known general-purpose array-processing package. An extensive collection of high-complexity mathematical functions makes NumPy powerful for processing large multi-dimensional arrays and matrices.
- SciPy is a free and open-source library that’s based on NumPy. It can be used to perform scientific and technical computing on large sets of data. SciPy comes with modules for optimization and linear algebra. It’s considered a foundational Python library due to its critical role in scientific analysis and engineering.
- scikit-learn is a free Python library based on NumPy and SciPy, that’s often considered a direct extension of SciPy. It was specifically designed for data modeling and developing machine learning algorithms, both supervised and unsupervised.
- Pandas is another Python library that is built on top of NumPy, responsible for preparing high-level data sets for machine learning and training. It relies on two types of data structures, one-dimensional (series) and two-dimensional (DataFrame).
- Seaborn is another open-source Python library, one that is based on Matplotlib (which focuses on plotting and data visualization) but features Pandas’ data structures. Seaborn is often used in ML projects because it can generate plots of learning data.
- Matplotlib is a Python library focused on data visualization and primarily used for creating graphs, plots, histograms, and bar charts. It can plot data from SciPy, NumPy, and Pandas. If you have experience using other types of graphing tools, Matplotlib might be the most intuitive choice for you.

There are many more libraries beyond these.
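
Before reaching for the libraries, the language basics mentioned earlier (list slicing, tuples, dictionaries, counters, sets, zip and argument unpacking) can be tried with the standard library alone. The following is a small sketch with made-up values:

```python
from collections import Counter

scores = [70, 85, 90, 60, 75]

# List slicing: the first three items, then every second item.
first_three = scores[:3]
every_second = scores[::2]

# Tuples are immutable and unpack directly into names.
point = (3, 4)
x, y = point

# Dictionaries map keys to values.
ages = {"ann": 31, "bob": 27}
ages["cy"] = 45

# Counter counts how often each hashable item occurs.
labels = ["spam", "ham", "spam", "spam"]
label_counts = Counter(labels)

# Sets keep only the unique values.
unique_labels = set(labels)

# zip pairs items up; argument unpacking (*) undoes the pairing.
pairs = list(zip(["a", "b", "c"], [1, 2, 3]))
letters, numbers = zip(*pairs)

print(first_three, every_second, label_counts.most_common(1), letters, numbers)
```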
Linear Algebra:
- Vectors Basics: Introduction to vectors, Vector arithmetic, Coordinate system
- Vector Projections and Basis: Dot product of vectors, Scalar and vector projection, Changing basis of vectors, Basis, linear independence, and span
- Matrices: Matrices introduction, Types of matrices, Types of matrix transformation, Composition or combination of matrix transformations
- Gaussian Elimination: Solving linear equations using Gaussian elimination, Gaussian elimination and finding the inverse matrix, Inverse and determinant
- Matrices from Orthogonality to Gram–Schmidt Process: Matrices changing basis, Transforming to the new basis, Orthogonal matrix, Gram–Schmidt process
- Eigenvalues and Eigenvectors: Calculating eigenvalues and eigenvectors
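
As a taste of the topics above, the dot product and the scalar and vector projections can be written in a few lines of plain Python. This is only a sketch for small examples; the function names are illustrative, and in practice NumPy would usually be used for this.

```python
import math

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def scalar_projection(u, v):
    """Signed length of the shadow of u along the direction of v."""
    return dot(u, v) / math.sqrt(dot(v, v))

def vector_projection(u, v):
    """The component of u that lies along v, as a vector."""
    scale = dot(u, v) / dot(v, v)
    return [scale * b for b in v]

u = [3.0, 4.0]
v = [2.0, 0.0]

print(dot(u, v), scalar_projection(u, v), vector_projection(u, v))
```

Here `v` points along the x-axis, so projecting `u` onto it keeps only the x-component of `u`.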

Resources for learning linear algebra: Khan Academy, and Essential Math for Machine Learning: Python Edition on edX.

Statistics & Probability:
- Populations and Samples
- Mean, Median, Mode
- Variance, Range, Interquartile Range (IQR), Skewness
- Correlation, Correlation and Causation
- Dependence and Independence
- Conditional Probability
- Bayes’s Theorem and Random variables
- Distributions
  - Continuous Distribution
  - Normal Distribution
- Central Limit Theorem (CLT)
- Hypothesis Testing: Null Hypothesis (H0), Alternative Hypothesis (HA)
- Errors in Hypothesis Testing: Type 1 Error, Type 2 Error
- Statistical Significance and P-Values
- The chi-square goodness of fit test
- Confusion Matrix, Precision, Accuracy, Recall, F1 Score
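
The last item in the list can be made concrete with a short sketch that builds the four confusion-matrix counts for a binary classifier and derives accuracy, precision, recall, and F1 from them. The labels below are hypothetical, and the helper names are illustrative rather than taken from any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification result."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts, metrics)
```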

Resources for learning: Statistics and Probability (Khan Academy), Introduction to Statistics (Coursera), Statistics for Data Science with Python (Coursera)

Database Knowledge: Some basics of SQL and NoSQL may help. Useful SQL basics include:
- Joins (Inner, Left/Right Outer, Full Outer, Cross)
- Pivot and Unpivot
- Window Functions
  - Aggregate Window Functions – SUM, MIN, MAX, AVG, COUNT
  - Ranking Window Functions – RANK, DENSE_RANK, ROW_NUMBER, NTILE
  - Value Window Functions – LAG and LEAD, FIRST_VALUE and LAST_VALUE
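
The three kinds of window functions can be tried without installing a database server, using Python's built-in `sqlite3` module. This sketch assumes the bundled SQLite is version 3.25 or newer (the first release with window functions); the `sales` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 300),
        ("east", "bob", 500),
        ("west", "cy", 200),
        ("west", "dee", 200),
        ("west", "eve", 100),
    ],
)

# One aggregate, one ranking, and one value window function per row.
query = """
SELECT region, rep, amount,
       SUM(amount) OVER (PARTITION BY region) AS region_total,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       LAG(amount) OVER (PARTITION BY region ORDER BY amount DESC, rep)
           AS previous_amount
FROM sales
ORDER BY region, amount DESC, rep
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Unlike `GROUP BY`, the window functions keep every input row and attach the computed value to it, which is why each sales rep still appears in the result.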
IDE:
- Spyder: The Scientific Python Development Environment (Spyder) is a free and open-source Python IDE. It is lightweight and an excellent Python IDE for data science and ML.
- Jupyter Notebook: Its simplicity has made it a popular environment among data enthusiasts. It is the descendant of IPython. One of the best things about Jupyter is that you can easily switch between different versions of Python (or other languages) by choosing a different kernel.
- Visual Studio Code: Visual Studio Code is one of the most used Python IDEs among ML and DS professionals. It works on Windows, Mac, and Linux operating systems.
Machine Learning: Supervised and unsupervised learning are the two main techniques of machine learning. They are used in different scenarios and with different datasets.
Supervised learning is a machine learning method in which models are trained using labeled data. In supervised learning, models need to find the mapping function that maps the input variable (X) to the output variable (Y).
Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. It does not need any supervision; instead, it finds patterns in the data on its own.
The Machine Learning steps are:
- Business Understanding: What do you want to achieve by implementing machine learning? Answering this is the “business understanding” phase.
  - Identify the business objective/problem
    - Regression problems: We have data that needs to be mapped onto a continuous target variable, so we need to learn a function that can do this mapping.
    - Classification problems: Here, we have data that needs to be divided into predefined classes, based on some features of the data. We need an algorithm that can use previously classified data to learn how to put unknown data into the correct class. Ex: K-Nearest Neighbor, for classification of a categorical outcome or prediction of a numerical outcome; the Naive Bayes classifier, which can be applied to data with categorical predictors.
    - Summarization problems: Suppose we have data that needs to be shortened or summarized in some way. This could be as simple as calculating basic statistics from data, or as complex as learning how to summarize text or finding a topic model for text.
    - Dependency modeling problems: For these problems, we have data that might be connected in some way, and we need to develop an algorithm that can calculate the probability of connection or describe the structure of connected data.
    - Change and deviation detection problems: In another case, we have data that has changed significantly or where some subset of the data deviates from normative values. To solve these problems, we need an algorithm that can detect these issues automatically.
  - Assess the situation
  - Determine the analytical goals
  - Produce a project plan
- Data Understanding: In the next phase, “data understanding,” we determine what kind of data can be collected to build a deployable model.
  - Collect the data
  - Describe the data
  - Explore the data
  - Verify the data quality
- Data Preprocessing:
  - Importing/Selecting the Dataset
  - Handling Missing Data
  - Handling Categorical Data
  - Splitting the Dataset into the Training set and Test set
  - Feature Scaling
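
The preprocessing steps above can be sketched in plain Python to show what each one does. The rows, column meanings, and the 80/20 split are all hypothetical choices for illustration; real projects would typically rely on Pandas and scikit-learn for the same steps.

```python
import random
import statistics

# Hypothetical raw rows: (age, region, purchased). None marks a missing value.
rows = [
    (25, "north", 0),
    (None, "south", 1),
    (35, "north", 1),
    (45, "west", 0),
    (30, "south", 1),
]

# 1. Handling missing data: replace a missing age with the mean of known ages.
known_ages = [age for age, _, _ in rows if age is not None]
mean_age = statistics.mean(known_ages)
ages = [age if age is not None else mean_age for age, _, _ in rows]

# 2. Handling categorical data: one-hot encode the region column.
regions = sorted({region for _, region, _ in rows})
one_hot = [[1 if region == r else 0 for r in regions] for _, region, _ in rows]

# 3. Build the numeric feature matrix X and the label vector y.
X = [[age] + flags for age, flags in zip(ages, one_hot)]
y = [label for _, _, label in rows]

# 4. Split into training and test sets (80/20) with a fixed seed.
indices = list(range(len(rows)))
random.Random(42).shuffle(indices)
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]
X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

# 5. Feature scaling: standardize the age column using training statistics only.
train_ages = [row[0] for row in X_train]
mu = statistics.mean(train_ages)
sigma = statistics.pstdev(train_ages)

def scale(row):
    """Standardize the age feature and leave the one-hot flags unchanged."""
    return [(row[0] - mu) / sigma] + row[1:]

X_train_scaled = [scale(row) for row in X_train]
X_test_scaled = [scale(row) for row in X_test]
print(X_train_scaled, X_test_scaled)
```

Computing `mu` and `sigma` from the training rows only is a deliberate choice: it keeps information about the test set from leaking into the model.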

- Modeling: As per the business need, we have to choose an appropriate model from the various kinds of available models:
  - Regression: Value estimation. Ex: how much will this customer use the service?
    - Linear Regression: Linear regression models the linear relationship between a scalar dependent variable, y, and one or more independent variables, X: y = Xβ + ε, where β is the vector of coefficients and ε is the error term.
      - Simple linear regression: A simple linear regression has a single independent variable, and it can be described using the following formula: y = A + Bx
      - Multiple linear regression model: Multiple linear regression occurs when more than one independent variable is used to predict a dependent variable: y = a + b1x1 + b2x2 + … + bnxn
      - Polynomial Regression: A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation: y = a + b*x^2
    - Non-Linear Regression:
      - Support Vector Regression SVR
      - Decision Tree Regression
      - Random Forest Regression
  - Classification: Will this customer purchase service S1 if given incentive I? Which service package (S1, S2, or none) will a customer likely purchase if given incentive I?
    - Decision trees: A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables.
    - Random Forest Classification: Random Forest is a trademark term for an ensemble of decision trees. In a Random Forest, we have a collection of decision trees (hence the name “Forest”). To classify a new object based on its attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
    - Logistic regression: Logistic regression extends the idea of linear regression to the situation where the dependent variable Y is categorical. We can think of a categorical variable as dividing the observations into classes. Don’t get confused by its name! It is a classification algorithm, not a regression algorithm.
    - The naive Bayes classifier: It applies Bayes’ theorem, with the “naive” assumption that the features are independent of each other, to predict the probability of each class for an unknown data point based on its attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
    - SVM (Support Vector Machine): It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. For example, if we only had two features like Height and Hair length of an individual, we’d first plot these two variables in two-dimensional space where each point has two coordinates. The algorithm then finds the boundary that best separates the classes; the points closest to that boundary are known as support vectors.
    - K-Nearest Neighbor: KNN can be used for both classification and regression predictive problems. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from.
  - Clustering:
    - k-means clustering: The k-means clustering is an unsupervised learning technique that helps in partitioning data of n observations into K buckets of similar observations.
    - Hierarchical clustering: Hierarchical clustering is an unsupervised learning technique where a hierarchy of clusters is built out of observations.
    - Divisive hierarchical clustering: This is a top-down approach where observations start off in a single cluster and then they are split into two as they go down a hierarchy.
  - Recommender Systems: Recommender systems are a way of suggesting items and ideas that match a user’s specific tastes. Recommendations help businesses monetize the user behavior data they capture, because it allows them to recommend content that users are likely to enjoy.
- Model Evaluation:
  - Evaluate the results: is the model overfitting (good performance on the training data, poor generalization to other data) or underfitting (poor performance on the training data and poor generalization to other data)?
  - Review the process
  - Determine the next steps based on evaluation score
- Test & Deployment: After validation using various metrics, the model is ready for testing. Here the test set is used. The more data we use, the more robust the model will be.

More resources for study and practice: Machine Learning with Python (Coursera), Kaggle, edX
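
To connect the modeling and evaluation steps, here is a minimal sketch of simple linear regression (y = A + Bx) fitted by ordinary least squares, then scored on separate training and test data. The data points and function names are hypothetical, and a library such as scikit-learn would normally handle this.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (A, B) that minimize the squared error of y = A + B * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mean_squared_error(xs, ys, intercept, slope):
    """Average squared difference between predictions and actual values."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data that roughly follows y = 1 + 2x.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.1, 4.9, 7.2, 8.8]
test_x = [5.0, 6.0]
test_y = [11.0, 13.1]

A, B = fit_simple_linear_regression(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, A, B)
test_mse = mean_squared_error(test_x, test_y, A, B)
print(A, B, train_mse, test_mse)
```

Comparing `train_mse` with `test_mse` is the simplest check for the overfitting and underfitting described under Model Evaluation: a large gap between the two suggests overfitting, while high error on both suggests underfitting.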
Hope this helps you get a quick start and achieve success sooner.
Thanks.