How To
Defining the Problem to be Solved
In order to build a predictive model, AlphaD3M needs a problem specification that describes the prediction problem. A problem specification includes:
- A target variable, i.e., what should be predicted by the model. In the AlphaD3M environment, the target is an attribute from the dataset.
- A task_keywords variable, which specifies the kind of prediction task and, therefore, the kind of technique that should be used to solve the prediction problem. In the AlphaD3M environment, the task_keywords parameter must be defined as a list of keywords that capture the nature of the machine learning task. A few examples of supported task keywords are: tabular, nested, multiLabel, video, linkPrediction, multivariate, graphMatching, forecasting, classification, graph, semiSupervised, text, timeSeries, clustering, collaborativeFiltering, univariate, missingMetadata, remoteSensing, multiClass, regression, multiGraph, lupi, relational, audio, grouped, objectDetection, vertexNomination, communityDetection, geospatial, image, overlapping, nonOverlapping, speech, vertexClassification, and binary. See the complete list in our API documentation.
- A metric variable, which specifies the performance metric (see Evaluation Metrics) used to evaluate the model. A few examples of supported metrics are: hammingLoss, accuracy, objectDetectionAP, rocAucMicro, f1Macro, meanSquaredError, f1, jaccardSimilarityScore, normalizedMutualInformation, rocAuc, f1Micro, hitsAtK, meanAbsoluteError, rocAucMacro, rSquared, recall, meanReciprocalRank, precision, precisionAtTopK, and rootMeanSquaredError. See the complete list in our API documentation.
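The three parts of the specification correspond to keyword arguments of search_pipelines, which is called later in this example. As a minimal sketch, the hypothetical helper below (build_problem_spec is not part of AlphaD3M) collects them into a dictionary and checks the keywords and the metric against a subset of the examples listed above. The subsets are not exhaustive, so a real check would need the complete lists from the API documentation.

```python
# Hypothetical helper, not part of AlphaD3M: it only assembles and
# sanity-checks the three pieces of a problem specification.

# Subsets copied from the examples listed above (not the complete lists).
EXAMPLE_TASK_KEYWORDS = {
    "tabular", "classification", "regression", "multiClass", "binary",
    "multiLabel", "text", "image", "audio", "video", "timeSeries",
    "forecasting", "clustering",
}
EXAMPLE_METRICS = {
    "accuracy", "f1", "f1Macro", "f1Micro", "rocAuc", "precision", "recall",
    "meanSquaredError", "rootMeanSquaredError", "meanAbsoluteError", "rSquared",
}


def build_problem_spec(target, task_keywords, metric):
    """Return a dict of keyword arguments describing the prediction problem."""
    if not isinstance(task_keywords, list) or not task_keywords:
        raise ValueError("task_keywords must be a non-empty list of keywords")
    unknown = [k for k in task_keywords if k not in EXAMPLE_TASK_KEYWORDS]
    if unknown:
        raise ValueError(f"unrecognized task keywords: {unknown}")
    if metric not in EXAMPLE_METRICS:
        raise ValueError(f"unrecognized metric: {metric!r}")
    return {"target": target, "task_keywords": task_keywords, "metric": metric}


problem_spec = build_problem_spec(
    target="Loan Status",
    task_keywords=["classification", "multiClass"],
    metric="accuracy",
)
print(problem_spec)
```

Assuming the call signature used in this example, the dictionary could then be unpacked into the search call, e.g. automl.search_pipelines(train_dataset, time_bound=5, **problem_spec).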
The Credit dataset in CSV format is used for this example.
[1]:
from alphad3m import AutoML
[3]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = '/Users/rlopez/D3M/examples/datasets/Credit/train_data.csv'
test_dataset = '/Users/rlopez/D3M/examples/datasets/Credit/test_data.csv'
automl = AutoML(output_path)
Then, we specify the problem by setting up the target, the task_keywords, and the metric. Here, we are defining a multi-class classification problem, where the goal is to predict the ‘Loan Status’. In this problem, we will use accuracy as the performance metric.
[4]:
automl.search_pipelines(train_dataset, time_bound=5, target='Loan Status', task_keywords=['classification', 'multiClass'], metric='accuracy')
INFO: Reiceving a raw dataset, converting to D3M format
INFO: Initializing AlphaD3M AutoML...
INFO: AlphaD3M AutoML initialized!
INFO: Found pipeline id=a3d6bee1-cc24-4d67-9157-d3e1879e1fee, time=0:00:26.382475, scoring...
INFO: Scored pipeline id=a3d6bee1-cc24-4d67-9157-d3e1879e1fee, accuracy=0.77833
INFO: Found pipeline id=d82e5a47-76f6-48eb-b603-56f23f695fc2, time=0:00:44.600652, scoring...
INFO: Scored pipeline id=d82e5a47-76f6-48eb-b603-56f23f695fc2, accuracy=0.77917
INFO: Found pipeline id=6acb579c-d7ec-4125-b2ce-21770e39b5b2, time=0:01:02.807630, scoring...
INFO: Scored pipeline id=6acb579c-d7ec-4125-b2ce-21770e39b5b2, accuracy=0.8125
INFO: Found pipeline id=76f92c12-c034-4498-8561-65f94b27402a, time=0:01:21.103988, scoring...
INFO: Scored pipeline id=76f92c12-c034-4498-8561-65f94b27402a, accuracy=0.77083
INFO: Found pipeline id=3a6364e3-ec91-4839-98fb-2ba0a3e49207, time=0:01:39.402115, scoring...
INFO: Scored pipeline id=3a6364e3-ec91-4839-98fb-2ba0a3e49207, accuracy=0.76583
INFO: Found pipeline id=9426f447-72d2-4dd5-9fd6-13d25630487b, time=0:05:24.959725, scoring...
INFO: Found pipeline id=362ece87-b213-49eb-aace-3ebedad890d7, time=0:05:28.202371, scoring...
INFO: Search completed, still scoring some pending pipelines...
INFO: Scored pipeline id=9426f447-72d2-4dd5-9fd6-13d25630487b, accuracy=0.76167
INFO: Scored pipeline id=362ece87-b213-49eb-aace-3ebedad890d7, accuracy=0.22917
INFO: Scoring completed for all pipelines!
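In the output above, "Found" and "Scored" messages are interleaved, and the last pipelines are scored only after the search has completed. As a rough sketch, the scored lines can be ranked with a regular expression. The pattern is inferred from the output shown here, not from a documented log format, and rank_scored_pipelines is a hypothetical helper rather than an AlphaD3M function.

```python
import re

# Pattern inferred from the "Scored pipeline" lines in the output above
# (an assumption: the log format is not a documented interface).
SCORED_LINE = re.compile(
    r"Scored pipeline id=(?P<id>[0-9a-f-]+), (?P<metric>\w+)=(?P<score>[0-9.]+)"
)


def rank_scored_pipelines(log_text):
    """Return (pipeline_id, score) pairs sorted from highest to lowest score."""
    scored = [
        (match.group("id"), float(match.group("score")))
        for match in SCORED_LINE.finditer(log_text)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# A few lines copied from the output above.
log_text = """\
INFO: Scored pipeline id=a3d6bee1-cc24-4d67-9157-d3e1879e1fee, accuracy=0.77833
INFO: Scored pipeline id=d82e5a47-76f6-48eb-b603-56f23f695fc2, accuracy=0.77917
INFO: Scored pipeline id=6acb579c-d7ec-4125-b2ce-21770e39b5b2, accuracy=0.8125
INFO: Scored pipeline id=362ece87-b213-49eb-aace-3ebedad890d7, accuracy=0.22917
"""

ranking = rank_scored_pipelines(log_text)
best_id, best_score = ranking[0]
print(best_id, best_score)
```

Sorting from highest to lowest suits accuracy. For an error metric such as meanSquaredError, the order would need to be reversed.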
Handling Collections of Files
AlphaD3M supports datasets formatted as a collection of files (e.g., jpg, txt, wav, or mp4). It supports collections of text, image, audio, video, and time-series files. The input must consist of a directory where the files are saved and a CSV file that contains the file names in one of its columns. Both must be located under the same directory. To work with this type of dataset, users must define the parameter collection, which is a dictionary with the following values:
- column: Name of the column that contains the file names.
- train_folder and test_folder: Names of the directories where the collections of files are saved.
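Before starting a search, it can help to confirm that every file named in the CSV exists in the collection folder. The sketch below is a hypothetical helper (find_missing_files is not part of AlphaD3M), and it assumes the folder sits next to the CSV file, as described above. The file names and labels in the demonstration are toy data.

```python
import csv
import tempfile
from pathlib import Path


def find_missing_files(csv_path, column, folder):
    """Return the file names in `column` of the CSV that are absent from `folder`.

    Assumes `folder` is a directory name located next to the CSV file.
    """
    csv_path = Path(csv_path)
    collection_dir = csv_path.parent / folder
    with csv_path.open(newline="") as handle:
        names = [row[column] for row in csv.DictReader(handle)]
    return [name for name in names if not (collection_dir / name).is_file()]


# Toy layout: one listed file exists, the other does not.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "train_collection").mkdir()
    (root / "train_collection" / "0.txt").write_text("great flight")
    (root / "train_data.csv").write_text(
        "txt_file,sentiment\n0.txt,positive\n1.txt,negative\n"
    )
    missing = find_missing_files(
        root / "train_data.csv", column="txt_file", folder="train_collection"
    )

print(missing)
```

An empty list means every file referenced in the CSV was found in the collection folder.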
The LL1_TXT_CLS_airline_opinion_MIN_METADATA dataset, a collection of .txt files, is used for this example.
[1]:
from alphad3m import AutoML
[3]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = '/Users/rlopez/D3M/examples/datasets/LL1_TXT_CLS_airline_opinion_MIN_METADATA/train_data.csv'
test_dataset = '/Users/rlopez/D3M/examples/datasets/LL1_TXT_CLS_airline_opinion_MIN_METADATA/test_data.csv'
automl = AutoML(output_path)
[4]:
automl.search_pipelines(train_dataset, time_bound=5, target='sentiment', task_keywords=['classification', 'multiClass', 'text'], collection={'column': 'txt_file', 'train_folder': 'train_collection', 'test_folder': 'test_collection'})
INFO: Reiceving a raw dataset, converting to D3M format
INFO: Initializing AlphaD3M AutoML...
INFO: AlphaD3M AutoML initialized!
INFO: Found pipeline id=a3d6bee1-cc24-4d67-9157-d3e1879e1fee, time=0:00:26.382475, scoring...
INFO: Scored pipeline id=a3d6bee1-cc24-4d67-9157-d3e1879e1fee, accuracy=0.77833
INFO: Found pipeline id=d82e5a47-76f6-48eb-b603-56f23f695fc2, time=0:00:44.600652, scoring...
INFO: Scored pipeline id=d82e5a47-76f6-48eb-b603-56f23f695fc2, accuracy=0.77917
INFO: Found pipeline id=6acb579c-d7ec-4125-b2ce-21770e39b5b2, time=0:01:02.807630, scoring...
INFO: Scored pipeline id=6acb579c-d7ec-4125-b2ce-21770e39b5b2, accuracy=0.8125
INFO: Found pipeline id=76f92c12-c034-4498-8561-65f94b27402a, time=0:01:21.103988, scoring...
INFO: Scored pipeline id=76f92c12-c034-4498-8561-65f94b27402a, accuracy=0.77083
INFO: Found pipeline id=3a6364e3-ec91-4839-98fb-2ba0a3e49207, time=0:01:39.402115, scoring...
INFO: Scored pipeline id=3a6364e3-ec91-4839-98fb-2ba0a3e49207, accuracy=0.76583
INFO: Found pipeline id=9426f447-72d2-4dd5-9fd6-13d25630487b, time=0:05:24.959725, scoring...
INFO: Found pipeline id=362ece87-b213-49eb-aace-3ebedad890d7, time=0:05:28.202371, scoring...
INFO: Search completed, still scoring some pending pipelines...
INFO: Scored pipeline id=9426f447-72d2-4dd5-9fd6-13d25630487b, accuracy=0.76167
INFO: Scored pipeline id=362ece87-b213-49eb-aace-3ebedad890d7, accuracy=0.22917
INFO: Scoring completed for all pipelines!
Exploring Tabular Datasets
The method plot_summary_dataset displays different views (compact, detail, and column views) to allow users to explore tabular datasets. It summarizes the column data using histograms. The column types are inferred using the datamart-profiler library. Additional column metadata, such as mean, standard deviation, and unique values, is also shown in the column view. Use the tabs above the table to switch between the different views of the dataset.
The Credit dataset in CSV format is used for this example.
Note
You can partially interact with this visualization. Try it in Jupyter Notebook to get full access to all features.
[2]:
from alphad3m import AutoML
[3]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = '/Users/rlopez/D3M/examples/datasets/Credit/train_data.csv'
test_dataset = '/Users/rlopez/D3M/examples/datasets/Credit/test_data.csv'
automl = AutoML(output_path)
[10]:
automl.plot_summary_dataset(train_dataset)
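To give a sense of the column metadata mentioned above (mean, standard deviation, and unique values), the sketch below computes these statistics with the Python standard library. It is only an illustration of the statistics themselves, not how AlphaD3M or datamart-profiler computes them. The helper summarize_columns and the toy column names are hypothetical.

```python
import csv
import io
import statistics


def summarize_columns(csv_text):
    """Return {column: summary} with unique counts and, for numeric columns, mean and std."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        info = {"unique": len(set(values))}
        try:
            numbers = [float(value) for value in values]
        except ValueError:
            numbers = None  # non-numeric column: keep only the unique count
        if numbers is not None:
            info["mean"] = statistics.mean(numbers)
            # Sample standard deviation needs at least two values.
            info["std"] = statistics.stdev(numbers) if len(numbers) > 1 else 0.0
        summary[column] = info
    return summary


# Toy data with hypothetical column names.
toy_csv = "amount,status\n100,paid\n200,paid\n300,charged off\n"
summary = summarize_columns(toy_csv)
print(summary)
```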
Exploring Text Datasets
The method plot_text_analysis displays a visualization that lets users explore and analyze text data. It includes word frequency analysis and named entity recognition, which help users explore the fundamental characteristics of the text. The visualizations are bar charts integrated with the Jupyter Notebook environment. Word frequency analysis, a common task in text analytics, identifies the most frequently occurring words in a given text. Common stop words such as ‘to’, ‘in’, and ‘for’ are removed before the word frequencies are computed. Named entity recognition is an information extraction method that classifies the entities present in the text into predefined entity types such as ‘Person’, ‘Organization’, and ‘City’. With this method, users can see which types of entities appear in the textual dataset.
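The word frequency analysis described above can be sketched in a few lines of standard-library Python. This is an illustration of the technique only. The stop-word list, the tokenization rule, and the helper word_frequencies are assumptions, and the sentences are toy data, so the result will not match what plot_text_analysis displays.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption: the list used by
# AlphaD3M is not given in this document).
STOP_WORDS = {"to", "in", "for", "the", "a", "of", "and"}


def word_frequencies(texts, top_n=3):
    """Count lowercase word occurrences across texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


# Toy documents.
docs = [
    "The flight to Boston was delayed",
    "Delayed flight and lost luggage in Boston",
    "Great service for the flight",
]
top_words = word_frequencies(docs)
print(top_words)
```

In the visualization, counts like these are what the bar charts display, computed separately for each category of documents.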
The IED attacks dataset in D3M format is used for this example.
Note
You can partially interact with this visualization. Try it in Jupyter Notebook to get full access to all features.
[11]:
from alphad3m import AutoML
[12]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset_path = '/Users/rlopez/D3M/examples/JIDO_SOHR_Articles_1061/TRAIN'
test_dataset_path = '/Users/rlopez/D3M/examples/JIDO_SOHR_Articles_1061/TEST'
automl = AutoML(output_path)
plot_text_analysis requires three parameters:
- dataset_path: Path to the dataset. It supports the D3M dataset format.
- label_column: Name of the column that contains the categories.
- text_column: Name of the column that contains the texts.
[14]:
automl.plot_text_analysis(train_dataset_path, label_column='articleofinterest', text_column='article')
Word Frequency:
Analyzing 7779 documents (positive category)
Analyzing 12004 documents (negative category)
Named Entity Recognition:
Analyzing 7779 documents (positive category)
Analyzing 12004 documents (negative category)
You can also find other Jupyter notebook examples about how to use AlphaD3M with text datasets here.