xtenalyze: 2016

Monday, 16 May 2016

Watson Analytics 2.0

Continuing from where we left off the last time. We covered two out of the five section of the software indicated in the diagram below by the bits in grey. So we are discussing 'Assemble' to 'Refine'.

Assemble: With 'Assemble' making a dashboard has never been easier. The software has reduced making of charts to a simple 'drag and drop' process. The section has templates and layouts that users can use to make all kinds of dashboards. Once the template and layout is selected, fields are simply dragged into axes and filters to make graphs as shown in the diagram below.

There are several options of charts that can be made, it is just as simple as clicking on the chart that meets the objective and the job is done. Below is a diagram showing the types of charts available.

Social Media: This section scrapes social media for data based on criteria specified by the user. The criteria will include fields like language, source,The data collected can analyzed by topics,theme, geography,sentiment, active authors and demographics. This tool is useful for finding out the buzz around a service (or product or may be even a topic) in social media. The software gives clues from data generated around the chosen topic in form of a word cloud, just like in the diagram.

Users can include these clues as terms closely related to the search topic to get better results, they can also break down the topic by specifying a criteria of some sort, this is known as a 'Theme' in Watson Analytics. After the search is all sorted, a dataset is generated which can now be analyzed.

The diagram above shows a 'Sentiment' chart from the data collected. The dataset fields are displayed at the bottom of the chart. Other information like topics, themes, languages, dates, documents and mentions are displayed at the top. All these details will specify the search and scrape criteria for the data. Users can change charts by using the chart criteria to display charts for criteria that was earlier specified.

Refine: Watson Analytics assesses data before use, result of assessment are usually displayed on the tile that links to the data as shown in the diagram below.

It assesses data for missing values and inconsistencies that may affect use of the data and then awards marks. 100 for data that is consistent and has no missing values. The 'Refine' section does exactly what it says on the box. It is used for cleaning, munging and editing of data. The section can enable users to group data, form hierarchy, do calculations. Users can also search data, include/exclude rows and columns. Each column is assessed , marks and charts are displayed showing quality of data.

There is still much to learn in Watson Analytics, the blog is just a headstart. As always, there's more information on the internet.

Give Watson Analytics a try. Thanks for reading.

Monday, 2 May 2016

Introducing Watson Analytics

Watson Analytics

So you have 100,000 rows of data to analyze, you cannot afford to hire a Data Scientist and you do not have the required skills.Have no fear, IBM Watson Analytics have come to the rescue.

Watson Analytics is IBM's attempt to replicate what Apple has done in computing to data science. They have managed to make the subject simple enough for mere mortals (with little or no knowledge of Data Science) like you and me to understand.

The software has sections that it uses to work on data just as shown in the image above, they are namely:

Explore: This section deals with data exploration, which is just having a feel of data in an attempt to find out what the data is all about. The section has charts that will help provide insight as to what the data is all about. It's an array of charts options that will paint a good picture of the data. This section allows users to probe data for insights that will inform decision. Once the data is uploaded and refined, Watson dissects data and comes up questions that users might be interested in finding answers to.

Clicking on the question that best matches objectives, will reveal detailed answers with charts. The software will produce lots of graphs pertaining to the question asked. Clicking on a graph reveals a more detailed information. The image below shows graphs made from the data we tested with.

Predict: This section is basically linear regression in graphic user interface. It users predict values using variables in the datasets. Dependent variables are predicted using independent variables. Watson examines the quality of the dataset and scores it. This enables users to make changes that ensures accurate predictions. The software then explains that datasets using all sorts of graphs and tables showing things like skewness of data, Outliers, Box graphs etc. For users that are novices in linear regression (which is most of us), Watson is able to point out variables that predict other variables. In my dataset, Item variable drives Unit cost, as shown below.

It also shows field associations , which is another way of showing which variables are correlated just as shown below.

It also shows degree and accuracy of predictability in a circular diagram just like the one shown below. It also offers alternatives where necessary, like in my dataset Rep provides a better prediction for item, with predictive strength of 99.6%

With these kind of information, users now know what variables to tweak to get results from other variables. Will discuss other sections in subsequent blogs.

Enjoy Watson.

Sunday, 10 April 2016

Filtering in R

Filtering

In spreadsheets, filtering is done by clicking on 'Filters' found in the Data menu. Filters selects data that meets a specified criteria. Usually a drop-down box is attached to a column, unique values of all the data in the column appears in the drop-down box. Any value selected will make the software to display occurrences of the number selected, excluding all other values. In R, filtering is done by:
Slicing: Data once read into R software, can sliced to be get bits and pieces that are needed. As discussed from the previous blog, the nearest representation of tabular data (with rows and columns) is a dataframe. Using RStudio, we read sample data into R in the previous blog. To get a sense of what the data looks like, the 'str' function is used. It stands for structure. Running the command in the console produces:

From the results, we can see that the data has 43 observations of 7 variables, each variable represents a column, there is also the 'factor' term found in the results. In R, category data is called a factor, the categories are called levels. Since columns are variables in R, the sigil sign '$' is used, which also appears in the results above. To display contents of Region column will mean running the command below

df$Region

Will appear in the console as shown below:

The diagram shows that Region column has 3 categories of data namely Central, East, West.

To filter (called slicing in R). The square brackets are used. The rows,columns and filter criteria are written inside the brackets. The syntax is as shown below:

The dataframe [rows{Filter criteria},column]

To illustrate with our data, filtering data from Central Region will mean running the command below:

df2[df2$Region=='Central',]

Will appear in the console as shown below:

The importance of ',' in the syntax cannot be overemphasized because it what determines if rows or columns will be selected. In the example above, the command is for rows to be selected. If columns are needed in the filter, the command will be

df2[df2$Region=='Central',c('OrderDate','Rep','Units')]

Will run in the console to produce:

The data above shows data from the 'Central' region but with columns specified rather than the entire data. The next blog will discuss Pivot tables in R.

Enjoy filtering in R

Wednesday, 30 March 2016

Moving to R (Part 2)

Opening a spreadsheet file in R

Opening a file in R is called reading the file, as R is a command-line programming language. R like any other programming language requires modules to get stuff done and opening (reading) a spreadsheet file will only happen when the right module is attached. A module is simply a portion of a computer program that can be used independently or with other modules to meet an objective.

R Studio IDE

R Studio IDE shown above,has got to be the gold standard for what an R IDE should be. Shown below, it's got 4 windows namely:

1)Console: This window has the console which is where the compiler is, the compiler executes the R scripts

2)Workspace and history : This window has a directory-like container where resources connected to a task is kept. It also has a tab ' history 'where lines of codes compiled earlier are stored.

3)Files,plots,packages and help : This window has directory that shows all system files. that shows has a tab that shows graphs and another for packages. Packages are special R programs for doing lots of things.

4)Scripts and Data view: This window has tab for scripts and viewing data. Script written can be executed in the console. The data view tab is used for examining data.

Reading a spreadsheet file
There are several modules for reading spreadsheet file into R. This web page discusses it in details. Using R Studio, we will show how to read spreadsheet data into R. Below, are screenshots of the windows mentioned above reading file into R.

Scripts and Data view

Above the Script view, script to read spreadsheet data using two methods, Method 1 used the 'gdata' module using the 'require' keyword and read.xls function is used to load the spreadsheet file into R. Method 2 involved saving the spreadsheet file as a text or csv (Comma Seperated Value) and using read.csv function, the data was read into R. Observe that each data read has been assigned a value using the '=' assignment operator. Also see two tabs representing views of the data read in. Scripts written in the script window are executed in the Console view by pressing a combination of keys on the keyboard. The keys are :

Ctrl+Shift + Enter

Console

Pressing the keys above, makes the console to compile any codes from the script. Any operation in IDE must be translated to a code in the console for it to work. An example selecting a data view tab will be translated to 'view(the data to be viewed)'. The other two windows do not have changes.

Aggregation, Charting, Pivot tables, Filtering and so on, and so forth can be done in R, we will explore further in the next blog.

Enjoy trying R out.

Monday, 21 March 2016

Moving over to R

The problem

Data comes in all shapes and sizes. Sometimes they come with some many columns that makes it difficult for one to make sense of . A special kind tool has to be used to work on the data, that tool is R.

R is an open source software used for analytics and statistic computing. It has become quite popular because its capabilities. The software is able to do almost everything a spreadsheet does and more.

Structure

R is structured differently from spreadsheets. In spreadsheet, cells stack up to make rows and columns and data is arranged in a tabular form. Not so with R, the closest thing to tabular data representation is called a Dataframe. The closest thing to a column in R is called a vector and for row is list. For example a named column in a spreadsheet is represented as :

In R it is :Tax <- c(34,45,60,70,90,30,45), the 'c' stands for concatenate, which is also a function in spreadsheet for combining data. A named row in a spreadsheet can be:

In R becomes: Tom <- c(Tom,25,London). So a vector contains the same data type while a list do not have the same datatype.

Interface

Most spreadsheet software have a graphical user interface, that means that users interact with the program with a mouse,touchpad etc. R is a programming language, so interaction is command-line, users have to type in lines of code. R has a GUI module that has to be activated through commandline before use. It's called Rattle. Most R users use an IDE or editor of some sort to write code that R compiler will compile. A search on Google will list out most of them, but RStudio comes highly recommended. Spreadsheets operate on data in cells that stack up to make rows and columns, R being a programming language operates by using variables assigned to vectors and lists that stack up to make Dataframes.

Assignment operators
People conversant with R must have come across this symbol '<-', it's called a left side assignment operator, it assigns data to names the R software can recognize and work with, just like the example shown earlier, repeated below as :

Tom <- c(Tom,25,London)

There is a right side assignment operator, which looks like this '->', which is just the reverse of the one above, shown below as :

c(Tom,25,London) -> Tom

Finally, the assignment operator used in most programming languages, the '=' sign. This operator is my favourite because it takes just one keystroke to reach unlike the rest two that takes 3 keystrokes (the two symbols + a shift key), shown below as :

Tom = c(Tom,25,London)

Let's stop here for now. There will be another blog on R that will discuss how to do most thing that spreadsheets do in R.

Why don't you give R a go ?

Sunday, 28 February 2016

What pivot tables can do in spreadsheets

Wikipedia describes pivot table as a data summarization tool found in spreadsheets and business intelligence software. Although, the tool is used mainly for summarization, the tool has been put into other uses in various software. Pivot tables can do graphs, calculate values, percentage changes and so on and so forth.

Making a basic pivot table

To make a basic pivot table, there are few things that a spreadsheet user should

understand. The main objective is to aggregate the values in a numerical field or column. To illustrate, I'll use this web page that shows how to make a basic pivot table. In the data, the amount column has the numerical data and will be a vital component of the pivot table.

Every pivot table summarizes a numerical data from a column in a workbook. It uses filters to select data that fit a certain criteria whilst summarizing a numerical data. There are 3 components of pivot table and they are :

Rows labels: These are row filters applied to pivot tables whilst working on numerical data.

Column labels: These are column filters applied to pivot table whilst working on numerical data

Report Filters : This filter selects rows and columns that meet a set criteria whilst working on data.

Values: This is the numerical data to be worked on. Aggregation functions like average,count,sum, product can be performed on the data. Statistical functions like standard deviation can be performed as well.

Values

These values worked on can be taken directly from the data or derived from a calculation. The values can be formatted and displayed according to user requirements. They can also be displayed in special ways by using the "Show Values As' tab.

The diagram above shows the shows the 'Show Value As' tab. The columns or fields in the data can determine how values are displayed. There are are 15 different calculations that can determine data is displayed based on columns or fields in the data. This website explains the process in details. Using fields or columns as base field, it is possible to calculate percentage difference between fields, rankings between fields. Thorough planning is needed to pick a base field that will be able deliver what user intended.

Charts

This is basically the process of visualizing data from pivot tables. The advantage of this graph is the use of filters to visualize data that meets a given criteria. This website explains the process in details. Pivot chart is the quickest way of graphs that are supposed to meet multiple criteria. The diagram below is an example of a pivot chart. See the filter buttons that change the graphs depending on what filter is applied.

There's limit on how data can be manipulated with pivot tables, all is needed is planning to figure out what the end result should look like.

Enjoy working with pivot table.

Sunday, 14 February 2016

Why data lookups fail in spreadsheets

To be able to lookup data is quite an important skill to have for a spreadsheet user. However, most users struggle with this skill because of Cell Referencing.

The website spreadsheets.about.com describes cell reference as identification of a cell or a group of cells in a worksheet. So data lookup will not work without cell referencing. There are 3 types of cell referencing, namely:

Relative cell reference: This is the default setting in spreadsheets. Row and Columns change with formula with this cell reference.
Absolute cell reference: This cell reference has instructions to peg a row or column. It is done by inserting the '$' sign in front row column notation. For instance $A1, will peg reference to column A, the cell reference will change only for row numbers.
Mixed cell reference: This reference combines relative and mixed cell reference to look up data for formula. To find out more on cell referencing , check this page

Formulas that look up data
There are so many formulas that look up data. It depends on what the user's objective is for the data. Is data the final product or work-in-progress that goes into another formula. This web page provides guidelines on how to decide what formula to use and offers tutorials on how to use these formulas. I will discuss the impact of cell referencing on a few of these formulas with an example.

Let's assume that there are two tables in a worksheet, the first table is just as shown below.

The other table will be looking up data from this table above. This table has got data validation tool that enables it to pick data from a list, this means that it can look up price data for the entire item on the list. It looks as shown below.

Any item picked from the list will have it's price displayed by looking it up from the list above. The lookups has been done with formulas VLOOKUP,INDEX MATCH and OFFSET MATCH. I'll be discussing the cell referencing used to look up the data.

VLOOKUP

The syntax for VLOOKUP can be summarized as shown below:

vlookup(what is to be looked up, where to search, what column to search, the kind of match needed)

In the formula above, E1 in blue is the item to be looked up, $A$2: $B$5 in red is range of cells to search, 2 means that search should narrowed to the 2nd column which is price, 0 tells the program that exact match is required. For this formula, mixed cell referencing has been used, the character in blue has relative cell reference, while those in red have absolute cell reference. This way, any item picked (having a relative cell reference) can be looked up because search boundaries have been set with absolute cell references in red.

INDEX MATCH

The formula combines two formulas to look up data and the syntax is as follows:

index(the result area,match(what is to be looked up, the area to search, the kind of match needed)

In the formula, the area in pink shows the boundary for the search, the character in red is the value to be searched for in the search boundary. The areas in blue is where the result data can be found. Like in the formula above, mixed cell referencing is used, the item to be searched has relative cell reference, while the result and search boundaries do have absolute cell referencing.

OFFSET MATCH

Just like the formula above, this formula is also a fusion of two formulas, the syntax is as follows:

offset(the first row of search area,match(what is to be looked up, other rows in the search area , match needed), the number of adjacent rows to return)

The the first row is pegged with absolute cell reference in blue, the item to be searched for has relative cell reference ( this way, any item can be looked up), the search area is pegged with an absolute reference in red.

Cell referencing establishes search boundaries for formulas to use during look up. The rule of the thumb is to have items to be looked with relative cell references, search areas with absolute cell references so that the search areas are confined and the right data looked up.

Enjoy looking up your data

Sunday, 31 January 2016

Working with date and times in a spreadsheet.

There's quite a lot date and time in data. Knowledge of how to manipulate date and time can be very beneficial. Dates and times are made of seconds,minutes,Hours and Days. The representation in spreadsheets is not too different.

Dates in Spreadsheets.

Dates in spreadsheets are represented by serial numbers different from regular numbers. Operations like sorting,addition,subtraction on these numbers do not provide expected results. According to Microsoft, serial number 1 represents 1st January 1900 at 12:00:00 am. Increasing or decreasing these serial numbers produces different dates.The number 30010 represents 28th February, 1982. There are several functions for manipulating date in spreadsheet.

Time in Spreadsheets

Times are stored in spreadsheets as decimal numbers that goes from .0 to .99999, .0 represents 00:00:00 while .99999 is 23:59:59. As in dates, increasing and reducing these numbers will produce different times.

Manipulating Dates and Times

Above is the format that represents date and time in Spreadsheets. The 0 represents a digit. The representation is DD/MM/YY HH:MM:SS. YY can also be YYYY. The software checks the 1st two digits, if the number is equal to or less than 31, that becomes the day value, it then checks the next two, if the number is equal or less than 12, that's the month value. The last two or four becomes the year value. The software will recognize any subset from the above mentioned number pattern.

'12-03' typed into a cell will be interpreted by the software as 12th of March. '03/2004' will be March of 2004.

Did you know that?

Gives

Date and Time functions in Spreadsheets

There are a lot functions in spreadsheet that manipulate date and time. This web page from Microsoft discusses all these functions in details. A few thing to note about these functions:

Most functions work by splitting up components that make up the date and time in spreadsheets. For instance 'HOUR' function takes the hour function in date and time for calculation.
Functions can be combined or 'nested' with other function to achieve objectives whilst calculating.

Above is an example of nested functions to get the serial date for the last day of the current year.

Enjoy working with dates and times.

Saturday, 16 January 2016

The best way to capture and annotate images on the monitor

At some point, there will be a need to capture images on monitors and annotate , for use in reports, presentation, writings etc.

The basic way of getting this done is by using the 'PrtSc' button on the keyboard and annotating with a graphic software like Paint (for windows users). The process is the same for Linux user, 'PrtSc' prints the entire window. There are so many software that capture screen across the platforms, a search on Google will show most of them.

For users of Windows Vista, 7 , 8 and 10 , there's this wonderful software called Snipping Tool, it has to be in my view the best tool for screen capture. , represented by the icon shown below.

Click on the icon above (which can be found through a search or the favourite icons section), displays the window below.

Clicking on 'New' displays snipe options , which has been explained by windows.microsoft.com as :

Snips once made, can be saved as :

An Image file: Snips can be saved as (PNG,GIF and JPEG) file which be annotated and used for various purposes.

A HTML file: They can be also saved as a special type of HTML file called MHTML (Mime Hypertext Markup Language) which is optimised for emails, meaning that snips can be sent in emails

The software has one big flaw, it cannot annotate, does not have any text functions. Annotation can be done with any graphic software like Paint. However there's one graphic software that does the job quite well. It's called Draw.

The Draw software is an open source graphic software, originally part of the Openoffice Suite, that came with Writer,Calc, Impress as a component in the productivity suite from Apache. A fork of the software was named Libre was made when Apache was acquired by Oracle. The Open office and Libreoffice Draw are both available for downloads.

Open office Draw download page

Libreoffice Draw comes as part of the suite and cannot be downloaded separately, to download suite ,go to Libre Office Download page.

The reason for the choosing of 'Draw' over other graphic software is more tools to work with. It's got more connectors, symbols and font work, it's just like Microsoft Word that is optimized for graphics.

To load image , click on the 'Insert' from the horizontal menu and select image from from the drop-down menu and then choose 'File' to select the filepath to the image file , choose 'Scan' to get image from scanner, choose 'Fontwork gallery' to use click-art from the system.

Once image has been uploaded to page, annotations and symbols can be added with the tools at the bottom of the page . Completed work can be saved as PDF (Portable Document Format) but a good alternative is snipe the work directly from the monitor. This method enables for cropping to the right size to be done at the same time.

Enjoy Snipping.

Saturday, 9 January 2016

How to organize data in a spreadsheet

Spreadsheet software is an interactive software for analyzing, organizing ,charting and storing data. The software works by manipulating data stored in cells that make up rows and columns. A column is a cluster of cells arranged vertically while a row is a horizontally arranged cell cluster.

Data types in a spreadsheet software

For a spreadsheet software to work effective, it has to classify inputted data into data types that it recognizes and is able to process, the spreadsheet software recognizes 4 data types namely:

Dates and Times: Date data is usually a combination of any day,month and year data, so it can be day,month ; month, year ; day, month,year combinations . Time data is a breakdown of the day data where it combines seconds,minutes,hours. The software is able to recognize days of the week, months of the year automatically. It is able convert from one type to the other e.g seconds to minutes, minutes to hours.

Texts : These are basically combination of letters,numbers and symbols represented in data. The numbers in texts can not be used in calculations as they are enclosed in literals (which means a notation that represents a fixed value in a code) e.g '2' or "2345", numbers like this are not recognized by the software as numbers.

Values: These are regular numbers ,not enclosed in literals and so can be used in calculations. They can also be derivatives of a calculation

Formulas: These are instruction to the software, usually a combination of text and number put together in a manner that the software will decipher and have the task done just as requested.

Rules on how to organize data in a Spreadsheets.

Spreadsheets work better when data is in a tabular form. A good data table should :

a) Have a header row showing column labels

b) Have rows and columns

c) Each cell in the column must have data matching column labels

d) Each cell in the row may not have the same type of data because the row cells combine to make a record.

e) There should be no gaps between data in row and column cells

f) There should be no spaces in data inputted in the cells

The diagram below shows an annotated data table where all the guidelines mentioned above are met. The first row is the header row with the column labels, column data in line with column label, no gaps in row and column data, row cells combining to make a record.

Data Formating

This means presenting data in a manner that the spreadsheet software can interpret. Since data is made up of text and numbers, formating applies to both.

Text formating is changing the appearance of text, by changing font, position, color, weight etc.

Number Formating is merely instructing the software on how to interpret number data. Number data can be:

a) Percentage

b) Financial

c) Scientific

d) Decimal

e)Fractions

f) Rounded

Date and Tine Formating is an extension of number formating because dates and year in computing are number data, it can be:

a) Second

b)Minute

c)Hour

d)Day in numbers not letters

e)Date

f)Month

g)Year

Number formating can be done in two ways. The first way using the graphic user interface by right-clicking on a cell and selecting the 'Format Cells' option, and selecting the format of choice from the Category. Date and Time formating can be done this way as well.

The other method is to use the formula 'TEXT' to specify what format the data should be in. The syntax for the formula is:

TEXT(Cell to be formated,Format"Format Pattern")

For example, TEXT(now(),Format('ddd'), gives what day it is in full, Monday, Tuesday whatever the day is.

Formula formating points to a different cell and formats its content while GUI (Graphic User Interface) formating can convert contents of the same cell.

The software is able to choose a format from the data inputted in a cell. e.g 12/04 typed in a cell will be displayed as 12th April. Data that has been put in the right way is easier to process into Information, for Spreadsheet newbies a good knowledge of data formating is a good place to start.

Saturday, 2 January 2016

How to read emails from multiple accounts at once

Most people have more than one email account. For instance one account for work or business and another account for personal or private correspondences.

In most cases, these email accounts are attended to , one at a time by going through the web browser to the provider's web page e.g. Yahoo, Gmail and Hotmail accounts. People that can afford it get expensive email clients like the Outlook software that is part of Microsoft Office Suite. There are other email clients that can do the job, some of them are eM Client, Windows Live mail ( for window users with original copy of the software), claw mail. There are a few others that search on Google might reveal, but none does the job as well as Thunderbird email clients ( made by Mozilla Foundation, the same people that made the Firefox web browser

). This software has got to be the best email client for its price (which is free by the way). It does not come with all the functionality of high end email clients like Outlook Exchange ,it is not far off. It is so good that there's been need for comparison. Click on the link Thunderbird & Outlook Compared

Getting Thunderbird is quite easy, just google thunderbird email client and a link to the Mozilla Thunderbird website will show up in your search result. The page should look like the one below

Clicking on 'free download', selects the right software to install according to the operating system of machine. If that fails, clicking on 'Systems and Languages' brings out a page showing languages and operating systems that the software comes, just like the link says. Installation should be fairly straight forward from here on out. The English languge sections will appear just as shown below.

After installation, the next step is adding of email accounts, which is again quite straight foward. Clicking on 'File' will shows options of setting up a 'New Email Account' or an 'Existing Email Account'. The 'New Email Account' is for setting up an email account from scratch, thunderbird provides email accounts to user. The 'Existing Email Account' is for setting up an existing email account, a gmail, yahoo account or a domain email account ( the email account account that comes with a website) for example. The account set up page looks as shown below.

The setup process is simply filling out the the mail account setup form, typing in name your email account and password. The software automatically verifies the mail address and checks the validity of the password and the account is set up. Shown below is what the software looks like with Gmail, Hotmail and a domain mail account added.

Pressing the 'N' key displays the next email. There's a prompt when moving to the next mail account. Sending emails is done by clicking on the 'Write' button, it is possible to send email from any account of the account when the 'Send Mail' page is displayed. Clicking on the 'From' section of the page displays a drop- down box showing all email accounts from which one can be selected as shown below.

There are add-ons to import contacts from Google, for Calendar, for instant messaging from providers like Facebook and Google. Download this software and see what it can do for you.