Ordinal and One-Hot Encodings for Categorical Data

Last Updated on August 17, 2020

Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an Ordinal Encoding and a One-Hot Encoding.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data.

After completing this tutorial, you will know:

  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Ordinal and One-Hot Encoding Transforms for Machine Learning

Ordinal and One-Hot Encoding Transforms for Machine Learning
Photo by Felipe Valduga, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. Nominal and Ordinal Variables
  2. Encoding Categorical Data
    1. Ordinal Encoding
    2. One-Hot Encoding
    3. Dummy Variable Encoding
  3. Breast Cancer Dataset
  4. OrdinalEncoder Transform
  5. OneHotEncoder Transform
  6. Common Questions

Nominal and Ordinal Variables

Numerical data, as its name suggests, involves features that are composed only of numbers, such as integers or floating-point values.

Categorical data are variables that contain label values rather than quantitative values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

  • A "pet" variable with the values: "dog" and "cat".
  • A "color" variable with the values: "red", "green", and "blue".
  • A "place" variable with the values: "first", "second", and "third".

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The "place" variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.

  • Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.
  • Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

  • Ordinal Encoding
  • One-Hot Encoding
  • Dummy Variable Encoding

Let's take a closer look at each in turn.

Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value.

For example, "red" is 1, "green" is 2, and "blue" is 3.

This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

For some variables, an ordinal encoding may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

It is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems and a one-hot encoding may be used instead.

This ordinal encoding transform is available in the scikit-learn Python machine learning library via the OrdinalEncoder class.

By default, it will assign integers to labels in the order that is observed in the data. If a specific order is desired, it can be specified via the "categories" argument as a list with the rank order of all expected labels.

We can demonstrate the usage of this class by converting the color categories "red", "green" and "blue" into integers. First the categories are sorted, then numbers are applied. For strings, this means the labels are sorted alphabetically and that blue=0, green=1 and red=2.

The complete example is listed below.
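The original code listing did not survive on this page; a minimal sketch along the lines described above (the variable names are my own) is:

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# define a contrived dataset of color labels
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define the ordinal encoding transform
encoder = OrdinalEncoder()
# fit the transform on the data and apply it in one step
result = encoder.fit_transform(data)
print(result)
```

Because the categories are sorted alphabetically before integers are assigned, "red" maps to 2, "green" to 1, and "blue" to 0.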

Running the example first reports the 3 rows of label data, and then the ordinal encoding.

We can see that the numbers are assigned to the labels as we expected.

This OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix.

If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the LabelEncoder class can be used. It does the same thing as the OrdinalEncoder, although it expects a one-dimensional input for the single target variable.

One-Hot Encoding

For categorical variables where no ordinal relationship exists, the integer encoding may not be enough, at best, or misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In that case, a one-hot encoding can be applied to the ordinal representation. This is where the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable.

Each bit represents a possible category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be "on." This is called one-hot encoding …

— Page 78, Feature Engineering for Machine Learning, 2018.

In the "color" variable example, there are three categories, and, therefore, three binary variables are needed. A "1" value is placed in the binary variable for the color and "0" values for the other colors.

This one-hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class.

We can demonstrate the usage of the OneHotEncoder on the color categories. First the categories are sorted, in this case alphabetically because they are strings, and then binary variables are created for each category in turn. This means blue will be represented as [1, 0, 0] with a "1" in the first binary variable, then green, then finally red.

The complete example is listed below.
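The original listing is missing here; a sketch of the example might look like the following. I call .toarray() on the result rather than passing a sparse-output argument, since the name of that argument differs between scikit-learn versions:

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

# define a contrived dataset of color labels
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define the one-hot encoding transform
encoder = OneHotEncoder()
# fit and apply the transform, converting the sparse result to a dense array
onehot = encoder.fit_transform(data).toarray()
print(onehot)
```

The binary columns appear in sorted order ("blue", "green", "red"), so the row for "red" is [0, 0, 1] and the row for "blue" is [1, 0, 0].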

Running the example first lists the three rows of label data, and then the one hot encoding matching our expectation of 3 binary variables in the order "blue", "green" and "red".

If you know all of the labels to be expected in the data, they can be specified via the "categories" argument as a list.

The encoder is fit on the training dataset, which likely contains at least one example of all expected labels for each categorical variable if you do not specify the list of labels. If new data contains categories not seen in the training dataset, the "handle_unknown" argument can be set to "ignore" to not raise an error, which will result in a zero value for each label.

Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents "blue" and [0, 1, 0] represents "green" we don't need another binary variable to represent "red", instead we could use the 0 values for both "blue" and "green" alone, e.g. [0, 0].

This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

When there are C possible values of the predictor and only C – 1 dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization

— Page 95, Feature Engineering and Selection, 2019.

In addition to being slightly less redundant, a dummy variable representation is required for some models.

For example, in the case of a linear regression model (and other regression models that have a bias term), a one hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.

If the model includes an intercept and contains dummy variables […], then the […] columns would add up (row-wise) to the intercept and this linear combination would prevent the matrix inverse from being computed (as it is singular).

— Page 95, Feature Engineering and Selection, 2019.

We rarely encounter this problem in practice when evaluating machine learning algorithms, unless we are using linear regression of course.

… there are occasions when a complete set of dummy variables is useful. For example, the splits in a tree-based model are more interpretable when the dummy variables encode all the information for that predictor. We recommend using the full set of dummy variables when working with tree-based models.

— Page 56, Applied Predictive Modeling, 2013.

We can use the OneHotEncoder class to implement a dummy encoding as well as a one hot encoding.

The "drop" argument can be set to indicate which category will become the one that is assigned all zero values, called the "baseline". We can set this to "first" so that the first category is used. When the labels are sorted alphabetically, the first "blue" label will be the first and will become the baseline.

There will always be one less dummy variable than the number of levels. The level with no dummy variable […] is called the baseline.

— Page 86, An Introduction to Statistical Learning with Applications in R, 2014.

We can demonstrate this with our color categories. The complete example is listed below.
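The listing is again missing here; a sketch of the dummy variable example, reusing the contrived color dataset from above, is:

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

# define a contrived dataset of color labels
data = asarray([['red'], ['green'], ['blue']])
print(data)
# drop='first' removes the alphabetically first category ("blue") as the baseline
encoder = OneHotEncoder(drop='first')
# fit and apply the transform, converting the sparse result to a dense array
dummy = encoder.fit_transform(data).toarray()
print(dummy)
```

With "blue" dropped, two columns remain ("green", "red"), so "blue" becomes the all-zero baseline [0, 0].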

Running the example first lists the three rows for the categorical variable, then the dummy variable encoding, showing that "green" is encoded as [1, 0], "red" is encoded as [0, 1] and "blue" is encoded as [0, 0] as we specified.

Now that we are familiar with the three approaches for encoding categorical variables, let's look at a dataset that has categorical variables.

Breast Cancer Dataset

As the basis of this tutorial, we will use the "Breast Cancer" dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68 percent and 73 percent. We will aim for this region, but note that the models in this tutorial are not optimized: they are designed to demonstrate encoding schemes.

No need to download the dataset as we will access it directly from the code examples.

  • Breast Cancer Dataset (breast-cancer.csv)
  • Breast Cancer Dataset Description (breast-cancer.names)

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings. Some variables show an obvious ordinal relationship for ranges of values (like age ranges), and some do not.

Note that this dataset has missing values marked with a "nan" value.

We will leave these values as-is in this tutorial and use the encoding schemes to encode "nan" as just another value. This is one possible and quite reasonable approach to handling missing values for categorical variables.

We can load this dataset into memory using the Pandas library.

Once loaded, we can split the columns into input (X) and output (y) for modeling.

Making use of this function, the complete example of loading and summarizing the raw categorical dataset is listed below.
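The loading code is missing from this page; a sketch of how it might look is below. The URL is an assumption (a commonly used hosted copy of the dataset) and requires network access:

```python
from pandas import read_csv

# location of a hosted copy of the dataset (assumed URL)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset as a pandas DataFrame
dataset = read_csv(url, header=None)
# retrieve the underlying numpy array of data
data = dataset.values
# separate into input columns (all but last) and output column (last)
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize the shapes
print('Input', X.shape)
print('Output', y.shape)
```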

Running the example reports the size of the input and output elements of the dataset.

We can see that we have 286 examples and nine input variables.

Now that we are familiar with the dataset, let's look at how we can encode it for modeling.

OrdinalEncoder Transform

An ordinal encoding involves mapping each unique label to an integer value.

This type of encoding is really only appropriate if there is a known relationship between the categories. This relationship does exist for some of the variables in our dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.

Once defined, we can call the fit_transform() function and pass it our dataset to create a transformed version of our dataset.

We can also prepare the target in the same manner.

Let's try it on our breast cancer dataset.

The complete example of creating an ordinal encoding transform of the breast cancer dataset and summarizing the result is listed below.
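A sketch of the example, assuming the hosted copy of the dataset used above (the URL is my assumption), is:

```python
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# load the dataset (assumed hosted copy; requires network access)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
data = read_csv(url, header=None).values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# ordinal encode the input variables
X = OrdinalEncoder().fit_transform(X)
# label encode the (one-dimensional) target variable
y = LabelEncoder().fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])
```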

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows, and in this case, the number of columns, to be unchanged, except all string values are now integer values.

As expected, in this case, we can see that the number of variables is unchanged, but all values are now ordinal encoded integers.

Next, let's evaluate machine learning on this dataset with this encoding.

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

We will first split the dataset, then prepare the encoding on the training set, and apply it to the test set.

We can then fit the OrdinalEncoder on the training dataset and use it to transform the train and test datasets.

The same approach can be used to prepare the target variable. We can then fit a logistic regression algorithm on the training dataset and evaluate it on the test dataset.

The complete example is listed below.
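A sketch of how this might look is below. The dataset URL is an assumption, and I configure handle_unknown on the encoder so that categories appearing only in the test split do not raise an error (the original may have relied on the split containing every label):

```python
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# load the dataset (assumed hosted copy; requires network access)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
data = read_csv(url, header=None).values
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# ordinal encode input variables: fit on train, apply to train and test
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# label encode the target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define, fit, and evaluate a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))
```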

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a reasonable score.

Next, let's take a closer look at the one-hot encoding.

OneHotEncoder Transform

A one-hot encoding is appropriate for categorical data where no relationship exists between categories.

The scikit-learn library provides the OneHotEncoder class to automatically one hot encode one or more variables.

By default the OneHotEncoder will output data with a sparse representation, which is efficient given that most values are 0 in the encoded representation. We will disable this feature by setting the "sparse" argument to False so that we can review the effect of the encoding.

Once defined, we can call the fit_transform() function and pass it our dataset to create a transformed version of our dataset.

As before, we must label encode the target variable.

The complete example of creating a one-hot encoding transform of the breast cancer dataset and summarizing the result is listed below.
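A sketch of the example is below (the URL is an assumption; I convert the sparse output with .toarray() rather than passing a sparse-output argument, since the argument's name varies between scikit-learn versions):

```python
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# load the dataset (assumed hosted copy; requires network access)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
data = read_csv(url, header=None).values
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# one-hot encode the input variables, as a dense array for inspection
X = OneHotEncoder().fit_transform(X).toarray()
# label encode the target variable
y = LabelEncoder().fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print('Output', y.shape)
```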

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows to stay the same, but the number of columns to dramatically increase.

As expected, in this case, we can see that the number of variables has leaped up from 9 to 43 and all values are now binary values 0 or 1.

Next, let's evaluate machine learning on this dataset with this encoding as we did in the previous section.

The encoding is fit on the training set then applied to both train and test sets as before.

Tying this together, the complete example is listed below.
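A sketch of how the full example might look is below. The URL is an assumption, and I set handle_unknown='ignore' as discussed earlier so that categories appearing only in the test split encode to all zeros instead of raising an error:

```python
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# load the dataset (assumed hosted copy; requires network access)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
data = read_csv(url, header=None).values
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# one-hot encode input variables: fit on train, apply to train and test
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train).toarray()
X_test = onehot_encoder.transform(X_test).toarray()
# label encode the target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define, fit, and evaluate a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy * 100))
```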

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved a classification accuracy of about 70.53 percent, which is slightly worse than the ordinal encoding in the previous section.

Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data?

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.

Alternately, you can use the ColumnTransformer to conditionally apply varied data transforms to different input variables.
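As an illustration of the ColumnTransformer approach, the following sketch (the toy dataset and transformer choices are my own) scales a numeric column while one-hot encoding a categorical column in a single step:

```python
from numpy import asarray
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# a toy dataset: column 0 is numeric, column 1 is categorical
data = asarray([[25.0, 'red'], [50.0, 'green'], [75.0, 'blue']], dtype=object)
# scale the numeric column and one-hot encode the categorical column
transformer = ColumnTransformer([
    ('num', MinMaxScaler(), [0]),
    ('cat', OneHotEncoder(), [1]),
])
# fit and apply both transforms; columns are concatenated in order
result = transformer.fit_transform(data)
print(result)
```

The result has one scaled numeric column followed by three one-hot columns, so each row has four values.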

Q. What if I have hundreds of categories?

Or, what if I concatenate many one-hot encoded vectors to create a many-thousand-element input vector?

You can use a one-hot encoding up to thousands and tens of thousands of categories. Also, having large vectors as input sounds intimidating, but the models can generally handle it.

Q. What encoding technique is the best?

This is unknowable.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

  • 3 Ways to Encode Categorical Variables for Deep Learning
  • Why One-Hot Encode Data in Machine Learning?
  • How to One Hot Encode Sequence Data in Python

Books

  • Feature Engineering for Machine Learning, 2018.
  • Applied Predictive Modeling, 2013.
  • An Introduction to Statistical Learning with Applications in R, 2014.

APIs

  • sklearn.preprocessing.OneHotEncoder API.
  • sklearn.preprocessing.LabelEncoder API.
  • sklearn.preprocessing.OrdinalEncoder API.

Articles

  • Categorical variable, Wikipedia.
  • Nominal category, Wikipedia.

Summary

In this tutorial, you discovered how to use encoding schemes for categorical machine learning data.

Specifically, you learned:

  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Data Preparation!

Data Preparation for Machine Learning

Prepare Your Machine Learning Data in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Data Preparation for Machine Learning

It provides self-study tutorials with full working code on:
Feature Selection, RFE, Data Cleaning, Data Transforms, Scaling, Dimensionality Reduction, and much more...

Bring Modern Data Preparation Techniques to
Your Machine Learning Projects

See What's Inside


Source: https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
