CodeBook

Course Project for Gettting & Celaning Data based on Human Activity Recognition Using Smartphones Dataset

This CodeBook that describes the variables, the data, and any transformations or work that was performed to clean up
the source data to create a tidy dataset as per requirements of course project.

Information about Source Data Experment

The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years.
Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing
a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial
linear acceleration and 3-axial angular velocity at a constant rate of 50Hz.
The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned
into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.

Original Data Source

Human Activity Recognition Using Smartphones Dataset
Data for analysis is downloaded from the below URL

Structure of Data present in source data folders

README.txt: Details of all the files in downloaded folder
features_info.txt: Shows information about the variables used on the feature vector.
features.txt: List of all features.i.e list of all measurement variables
activity_labels.txt: Lists the activity Id with their corresponding activity name.
train/X_train.txt: Training set.
train/y_train.txt: Training activity Id Labels
train/subject_train.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.
test/X_test.txt: Test set.
test/y_test.txt: Test activity Id Labels
test/subject_train.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.

The following files are available for the train and test data. Their descriptions are equivalent. - train/Inertial Signals/total_acc_x_train.txt: The acceleration signal from the smartphone accelerometer X axis in standard gravity units g. Every row shows a 128 element vector. The same description applies for the
total_acc_x_train.txt and total_acc_z_train.txt files for the Y and Z axis. - train/Inertial Signals/body_acc_x_train.txt: The body acceleration signal obtained by subtracting the gravity from
the total acceleration. - train/Inertial Signals/body_gyro_x_train.txt: The angular velocity vector measured by the gyroscope for each window sample. The units are radians/second.

Note: All the files in train/Inertial Signals and test/Inertial Signals will not be used for in this analysis

Details about Files to be used in analysis from Source Data

Common Files - features.txt: 561 rows of 2 varibles (feature Identifier and feature Name) - activity_labels.txt: 6 rows of 2 variables (activity identifier and activity name)

Test Dataset - xTest.txt: 2947 rows of 561 measurement variables. These are measurement variables listed in features.txt - yTest.txt: 2947 rows of 1 variables. This is the activity Identifier - subjectTest.txt: 2497 rows of 1 variable (subject Identifier)

Training Dataset - xTrain.txt: 7352 rows of 561 measurement variables. These are measurement variables listed in features.txt - yTrain.txt: 7352 rows of 1 variables. This is the activity Identifier - subjectTrain.txt: 7352 rows of 1 variable (subject Identifier)

Map of aggregated Data

Variable Names	subjectId	activityId	(variable names from `features.txt`)
Data	`subjectTest.txt`	`yTest.txt`	`xTest.txt`
Data	`subjectTrain.txt`	`yTrain.txt`	`xTrain.txt`

Requirements & Details of Transformations through `run_analysis.R` script

Requirements

run_analysis.R script has the following requirements to perform transformation on UCI HAR Dataset.

Merges the training and the test sets to create one data set.
Extracts only the measurements on the mean and standard deviation for each measurement.
Uses descriptive activity names to name the activities in the data set
Appropriately labels the data set with descriptive activity names.
Creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Detailed Functions of `run_analysis.R` Script

Downloads the dataset from the URL mentioned above and unzips it to create UCI HAR Dataset folder
Imports "test" and "train" datsets and creates data frames from then and then Merges the training and the test sets
to create one data frame.
Extracts a subset of data with only the measurements on the mean "mean()" and standard deviation "std()" for each measurement. Also excludes meanFreq()-X measurements or angle measurements where the term mean exists resulting in
66 measurement variables.
Updates the variable names in dataframe variable names for data fame to improve readibility
Appropriately labels the data set with descriptive activity names in place of activity Ids
Reshapes dataset to create a data frame with average of each measurement variable for each activity and each subject
Writes new tidy data frame to a text file to create the required tidy data set file of 180 observations and 68 columns (2 columns for activityName and subjectID and 66 columns for measurement variables)

Transformations Performed on the original dataset.

Merging the training and the test sets to create one data set.

Activities:

Download the dataset from the URL mentioned above and unzip it to create UCI HAR Dataset folder.
Script Imports test and train datsets and creates data frames from then and then merges the training and the test sets to create one data frame.

All files to be used as listed above are imported to created data frames and column variables names are updated as follows

data.frame	Variable Names
`featureVariables`	`"varId", "varName"`
`activityLabels`	`"activityId", "activityName"`
`xTest`	`featureVariables$varName`
`yTest`	`"activityId"`
`subjectTest`	`"subjectId"`
`xTrain`	`featureVariables$varName`
`yTrain`	`"activityId"`
`subjectTrain`	`"subjectId"`

Merge Activities

subjectTest, yTest, xTest were column bind using cbind function to create testData data frame which added "subjectId", "activityId" to dataset making it 563 column data.frame with 2947 rows subjectTrain, yTrain, xTrain were column bind using cbind function to create trainData data.frame which added "subjectId", "activityId" to dataset making it 563 column data.frame with 7352 rows

Aggregared and Merged Dataset

testData and trainData data.frame were rowbound using rbind function to create final aggregated dataset/data.frame AggregateData with 10299 rows and 563 columns

Below code shows few details of AggregateData

> head(names(AggregateData),10)
 [1] "subjectId"         "activityId"        "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y"
 [5] "tBodyAcc-mean()-Z" "tBodyAcc-std()-X"  "tBodyAcc-std()-Y"  "tBodyAcc-std()-Z" 
 [9] "tBodyAcc-mad()-X"  "tBodyAcc-mad()-Y" 
> 

> head(AggregateData[1:5])
  subjectId activityId tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1         2          5         0.2571778       -0.02328523       -0.01465376
2         2          5         0.2860267       -0.01316336       -0.11908252
3         2          5         0.2754848       -0.02605042       -0.11815167
4         2          5         0.2702982       -0.03261387       -0.11752018
5         2          5         0.2748330       -0.02784779       -0.12952716
6         2          5         0.2792199       -0.01862040       -0.11390197
>

Extraction of only the measurements on the mean and standard deviation for each measurement.

Activities:

Extract a subset of data with only the measurements on the mean mean() and standard deviation std() for
each measurement

Extraction of selected measurement values

grep functions are used to search for occurance of mean mean() and standard deviation std() in AggregateData
variable Names using escape characters.
Using escape characters to search exactly for mean() and std() occurance helps to exclude meanFreq()-X measurements
and/or angle measurements where the term mean exists
The resulting selection would have only 66 measurement variables.

Below code shows search using grep functions in column names of AggregateData data frame.

> head(featureVariables[grepl("mean\\(\\)|std\\(\\)", names(AggregateData)), ], 10)
   varId              varName
3      3    tBodyAcc-mean()-Z
4      4     tBodyAcc-std()-X
5      5     tBodyAcc-std()-Y
6      6     tBodyAcc-std()-Z
7      7     tBodyAcc-mad()-X
8      8     tBodyAcc-mad()-Y
43    43 tGravityAcc-mean()-Z
44    44  tGravityAcc-std()-X
45    45  tGravityAcc-std()-Y
46    46  tGravityAcc-std()-Z
> 
> 
> 
> tail(featureVariables[grepl("mean\\(\\)|std\\(\\)", names(AggregateData)), ])
    varId                    varName
518   518  fBodyBodyAccJerkMag-mad()
519   519  fBodyBodyAccJerkMag-max()
531   531     fBodyBodyGyroMag-mad()
532   532     fBodyBodyGyroMag-max()
544   544 fBodyBodyGyroJerkMag-mad()
545   545 fBodyBodyGyroJerkMag-max()
>

In order to extract a subset of only measurements on the mean mean() and standard deviation std() grep is used
to create index of matched column numbers.
This index of column numbers is used to create subset of data based on mean() and std() column indices and also includng first 2 columns which have subjectId and activityId values.

Below code show the Index of occurance for mean() and std()

> head(grep("mean\\(\\)|std\\(\\)", names(AggregateData)), 10)
 [1]  3  4  5  6  7  8 43 44 45 46
>

This stage crates a data.frame subsetAggregateData of 10299 observations and 68 variables

Clean up variable names to in the data set

Activities:

Update the variable names in dataframe for data fame to improve readibility

Variable Transformation activities

camelcase has been used for R objects created and variable names for data fame to improve readibility.
gsub function is used to remove instances of "-" and "()" from variables names in the extracted subset
subsetAggregateData data.frame
Instances of mean are replaced by Mean and std by Std to obtain proper camelcase in variable names
e.g. "tBodyAcc-mean()-X" is changed to "tBodyAccMeanX"

Usage of gsub function
gsub("-", "", names(subsetAggregateData))

Below code shows the changes in variable names after cleaning up variable names.

> head(names(AggregateData))
[1] "subjectId"         "activityId"        "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y"
[5] "tBodyAcc-mean()-Z" "tBodyAcc-std()-X" 
> 
> head(names(subsetAggregateData))
[1] "subjectId"     "activityId"    "tBodyAccMeanX" "tBodyAccMeanY" "tBodyAccMeanZ" "tBodyAccStdX" 
>

Label the data set with descriptive activity names.

Activities:

Appropriately label the data set with descriptive activity names in place of activity Ids

ActivityName Update

activityLabels and subsetAggregateData data frames are merged using merge function to add a new column
and create a new data.framesubFinalData with corresponding "activityName" for each "activityId" in each row of the
dataset

The below code snippet shows updated activityName in the dataset

> head(subFinalData[1:5])
  activityId activityName subjectId tBodyAccMeanX tBodyAccMeanY
1          1      WALKING        26     0.2314146  -0.017722438
2          1      WALKING        29     0.3312213  -0.018502366
3          1      WALKING        29     0.3755700  -0.024728610
4          1      WALKING        29     0.2332297  -0.034451457
5          1      WALKING        29     0.2362494  -0.014396940
6          1      WALKING        29     0.2645428   0.002484389
>

Final Dataset `finalData`

"activityId" column which is no longer needed because we mapped acitityName to activityId in the dataset
"activityId" column is dropped to create final data.frame called finalData
This data frame has 10299 observations and 68 columns.
2 columns for "activityName" and "subjectId" and remaing 66 for measurement variables with measurements
on the mean() and std()

Tidy Data Set with the average of each variable for each activity and each subject

Activities:

Reshape dataset to create a data frame with average of each measurement variable for each activity and each subject
Writes new tidy data frame to a text file to create the required tidy data

Transformation Details - melt and dcast

reshape2 package is leveraged for reshaping the dataset. library(reshape2)
melt function of reshare2 package is leveraged to reshape data based on id variables "activityName" and
"subjectId" against all measurement values variables to create finalDataMelt data frame.
melt takes wide-format data and melts it into long-format data.
finalDataMelt data frame has 679734 observations of 4 variables

Using `melt`

The below code shows data transformation done by using melt function to create finalDataMelt data frame

> head(finalDataMelt)
  activityName subjectId      variable     value
1      WALKING        26 tBodyAccMeanX 0.2314146
2      WALKING        29 tBodyAccMeanX 0.3312213
3      WALKING        29 tBodyAccMeanX 0.3755700
4      WALKING        29 tBodyAccMeanX 0.2332297
5      WALKING        29 tBodyAccMeanX 0.2362494
6      WALKING        29 tBodyAccMeanX 0.2645428
>

Using `dcast`

lastly, dcast function of reshap2 package is leveraged for creating new tidy data.frame called avgSujectActivities
This dataset provides average of each measurement variable for each activity and each subject
dcast takes long-format data and casts it into wide-format data.

Tidy data frame

The result od dcast operation is the tidy data frame avgSujectActivities
This data frame has 180 observations/rows and 68 columns/variables
68 columns(2 columns for activityName and subjectID and 66 columns for measurement variables)
Each measurement variable columns [3 to 68] is average value for each combination of subjectId and activityName

The below code shows few details of the Tidy data frame

> head(avgSujectActivities[1:7])
  activityName subjectId tBodyAccMeanX tBodyAccMeanY tBodyAccMeanZ tBodyAccStdX tBodyAccStdY
1       LAYING         1     0.2215982   -0.04051395    -0.1132036   -0.9280565   -0.8368274
2       LAYING         2     0.2813734   -0.01815874    -0.1072456   -0.9740595   -0.9802774
3       LAYING         3     0.2755169   -0.01895568    -0.1013005   -0.9827766   -0.9620575
4       LAYING         4     0.2635592   -0.01500318    -0.1106882   -0.9541937   -0.9417140
5       LAYING         5     0.2783343   -0.01830421    -0.1079376   -0.9659345   -0.9692956
6       LAYING         6     0.2486565   -0.01025292    -0.1331196   -0.9340494   -0.9246448
>

Tidy Data File

The avgSujectActivities data frame is written to a file using write.table function with "\t" separator to
create avgSujectActivities.txt file
By default column names are kept in file. Row Names have to be explicity excluded using row.names=FALSE argument
in write.table function

Description of variables in the Tidy Data

Variable Name	Details
`activityName`	Factor with 6 levels WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING
`subjectId`	Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30
`tBodyAccMeanX`	Average of Mean Value time doman Body Accelration in x direction
`tBodyAccMeanY`	Average of Mean Value time doman Body Accelration in Y direction
`tBodyAccMeanZ`	Average of Mean Value time doman Body Accelration in Z direction
`tBodyAccStdX`	Average of Standard deviation time doman Body Accelration in x direction
`tBodyAccStdY`	Average of Standard deviation time doman Body Accelration in Y direction
`tBodyAccStdZ`	Average of Standard deviation time doman Body Accelration in Z direction
`tGravityAccMeanX`	Average of Mean Value time doman Gravity Accelrationin x direction
`tGravityAccMeanY`	Average of Mean Value time doman Gravity Accelrationin Y direction
`tGravityAccMeanZ`	Average of Mean Value time doman Gravity Accelrationin Z direction
`tGravityAccStdX`	Average of Standard deviation time doman Gravity Accelrationin x direction
`tGravityAccStdY`	Average of Standard deviation time doman Gravity Accelrationin Y direction
`tGravityAccStdZ`	Average of Standard deviation time doman Gravity Accelrationin Z direction
`tBodyAccJerkMeanX`	Average of Mean Value time doman Body Accelration Jerk in x direction
`tBodyAccJerkMeanY`	Average of Mean Value time doman Body Accelration Jerk in Y direction
`tBodyAccJerkMeanZ`	Average of Mean Value time doman Body Accelration Jerk in Z direction
`tBodyAccJerkStdX`	Average of Standard deviation time doman Body Accelration Jerk in x direction
`tBodyAccJerkStdY`	Average of Standard deviation time doman Body Accelration Jerk in Y direction
`tBodyAccJerkStdZ`	Average of Standard deviation time doman Body Accelration Jerk in Z direction
`tBodyGyroMeanX`	Average of Mean Value time doman Body Gyro in x direction
`tBodyGyroMeanY`	Average of Mean Value time doman Body Gyro in Y direction
`tBodyGyroMeanZ`	Average of Mean Value time doman Body Gyro in Z direction
`tBodyGyroStdX`	Average of Standard deviation time doman Body Gyro in x direction
`tBodyGyroStdY`	Average of Standard deviation time doman Body Gyro in Y direction
`tBodyGyroStdZ`	Average of Standard deviation time doman Body Gyro in Z direction
`tBodyGyroJerkMeanX`	Average of Mean Value time doman Body Gyro Jerk signal in x direction
`tBodyGyroJerkMeanY`	Average of Mean Value time doman Body Gyro Jerk signal in Y direction
`tBodyGyroJerkMeanZ`	Average of Mean Value time doman Body Gyro Jerk signal in Z direction
`tBodyGyroJerkStdX`	Average of Standard deviation time doman Body Gyro Jerk signal in x direction
`tBodyGyroJerkStdY`	Average of Standard deviation time doman Body Gyro Jerk signal in Y direction
`tBodyGyroJerkStdZ`	Average of Standard deviation time doman Body Gyro Jerk signal in Z direction
`tBodyAccMagMean`	Average of Mean Value time doman Body Accelration magnitude
`tBodyAccMagStd`	Average of Standard deviation time doman Body Accelration magnitude
`tGravityAccMagMean`	Average of Mean Value time doman Gravity Accelration magnitude
`tGravityAccMagStd`	Average of Standard deviation time doman Gravity Accelration magnitude
`tBodyAccJerkMagMean`	Average of Mean Value time doman Body Accelration jerk magnitude
`tBodyAccJerkMagStd`	Average of Standard deviation time doman Body Accelration jerk magnitude
`tBodyGyroMagMean`	Average of Mean Value time doman Body Gyro magnitude
`tBodyGyroMagStd`	Average of Standard deviation time doman Body Gyro magnitude
`tBodyGyroJerkMagMean`	Average of Mean Value time doman Body Gyro Jerk magnitude
`tBodyGyroJerkMagStd`	Average of Standard deviation time doman Body Gyro Jerk magnitude
`fBodyAccMeanX`	Average of Mean Value frequency domainBody Accelration in x direction
`fBodyAccMeanY`	Average of Mean Value frequency domainBody Accelration in Y direction
`fBodyAccMeanZ`	Average of Mean Value frequency domainBody Accelration in Z direction
`fBodyAccStdX`	Average of Standard deviation frequency domainBody Accelration in x direction
`fBodyAccStdY`	Average of Standard deviation frequency domainBody Accelration in Y direction
`fBodyAccStdZ`	Average of Standard deviation frequency domainBody Accelration in Z direction
`fBodyAccJerkMeanX`	Average of Mean Value frequency domainBody Accelration Jerk in x direction
`fBodyAccJerkMeanY`	Average of Mean Value frequency domainBody Accelration Jerk in Y direction
`fBodyAccJerkMeanZ`	Average of Mean Value frequency domainBody Accelration Jerk in Z direction
`fBodyAccJerkStdX`	Average of Standard deviation frequency domainBody Accelration Jerk in x direction
`fBodyAccJerkStdY`	Average of Standard deviation frequency domainBody Accelration Jerk in Y direction
`fBodyAccJerkStdZ`	Average of Standard deviation frequency domainBody Accelration Jerk in Z direction
`fBodyGyroMeanX`	Average of Mean Value frequency domainBody Gyro in x direction
`fBodyGyroMeanY`	Average of Mean Value frequency domainBody Gyro in Y direction
`fBodyGyroMeanZ`	Average of Mean Value frequency domainBody Gyro in Z direction
`fBodyGyroStdX`	Average of Standard deviation frequency domainBody Gyro in x direction
`fBodyGyroStdY`	Average of Standard deviation frequency domainBody Gyro in Y direction
`fBodyGyroStdZ`	Average of Standard deviation frequency domainBody Gyro in Z direction
`fBodyAccMagMean`	Average of Mean Value frequency domainBody Accelration magnitude
`fBodyAccMagStd`	Average of Standard deviation frequency domainBody Accelration magnitude
`fBodyBodyAccJerkMagMean`	Average of Mean Value frequency domainBody Accelration jerk magnitude
`fBodyBodyAccJerkMagStd`	Average of Standard deviation frequency domainBody Accelration jerk magnitude
`fBodyBodyGyroMagMean`	Average of Mean Value frequency domainBody Body Gyro magnitude
`fBodyBodyGyroMagStd`	Average of Standard deviation frequency domainBody Body Gyro magnitude
`fBodyBodyGyroJerkMagMean`	Average of Mean Value frequency domainBody Body Gyro jerk magnitude
`fBodyBodyGyroJerkMagStd`	Average of Standard deviation frequency domainBody Body Gyro jerk magnitude

CodeBook

Information about Source Data Experment

Original Data Source

Structure of Data present in source data folders

Details about Files to be used in analysis from Source Data

Map of aggregated Data

Requirements & Details of Transformations through run_analysis.R script

Requirements

Detailed Functions of run_analysis.R Script

Transformations Performed on the original dataset.

Merging the training and the test sets to create one data set.

Activities:

Merge Activities

Aggregared and Merged Dataset

Extraction of only the measurements on the mean and standard deviation for each measurement.

Activities:

Extraction of selected measurement values

Clean up variable names to in the data set

Activities:

Variable Transformation activities

Label the data set with descriptive activity names.

Activities:

ActivityName Update

Final Dataset finalData

Tidy Data Set with the average of each variable for each activity and each subject

Activities:

Transformation Details - melt and dcast

Using melt

Using dcast

Tidy data frame

Tidy Data File

Description of variables in the Tidy Data

Requirements & Details of Transformations through `run_analysis.R` script

Detailed Functions of `run_analysis.R` Script

Final Dataset `finalData`

Using `melt`

Using `dcast`