Using Visual Analytics, Speech, and Search to Quantify Ethics in Everyday Life
Ethics is the unifying factor that can hold the world together so that it can continue to exist in the future. Environment and teachings combine to mold an individual's ethical behavior. Without challenging the spirituality of religions or the code of ethics that comes with each of them, it may be useful to examine where ethics prevails in the day-to-day actions of individuals.
Searchable Text Database
Open Source Options
- Full Text Search
- Search options I find interesting
- Sphinx
- MySQL Search
- SQL Server Search
- Lucene, and Elasticsearch (which is built on top of Lucene)
- Full Text search comparison
- http://full-text-search.findthebest.com/
- A very nice comparison http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
- http://taschenorakel.de/mathias/2012/04/18/fulltext-search-benchmarks/
- http://www.dbbest.com/blog/lucene-vs-sql-server-fts/
- http://beerpla.net/2009/09/03/comparison-between-solr-and-sphinx-search-servers-solr-vs-sphinx-fight/
- Sphinx http://sphinxsearch.com/
- I have personally used Sphinx in a Ruby on Rails project: Sphinx ran as a background service, a gem provided the interface to it, and the model files defined which attributes to index and how searching was done. A delta index was used to speed things up: a delayed-job style gem made a local copy of each change, and a cron job periodically (sometimes after days, sometimes after a week) merged those delta changes into the full index. The delta index is normally much smaller than the full index and contains only the recent changes that have not yet been integrated; this avoids re-indexing the whole index, which is a time-consuming process that can take a long time depending on the index size. I found the Sphinx server easy to use once I got the hang of it.
- http://en.wikipedia.org/wiki/Sphinx_(search_engine)
- Can be used stand-alone, with MySQL, MariaDB, and PostgreSQL, or via ODBC with ODBC-compliant DBMSs
- Sphinx latest release download http://sphinxsearch.com/downloads/release/
- Documentation
- Supports integration with many programming languages and is highly scalable.
- Has many natural-language-processing features, such as stopwords, tokenization, etc.
- Note that the original contents of the fields are not stored in the Sphinx index. The text that you send to Sphinx gets processed, and a full-text index (a special data structure that enables quick searches for a keyword) gets built from that text, but the original text contents are then simply discarded. Sphinx assumes that you store those contents elsewhere anyway.
- There are multiple modes of searching (see the documentation).
- http://stackoverflow.com/questions/737275/comparison-of-full-text-search-engine-lucene-sphinx-postgresql-mysql
- MySQL Full-Text Search
- Modes of search (a small example using R and MySQL follows this list):
- A boolean search interprets the search string using the rules of a special query language
- A natural language search interprets the search string as a phrase in natural human language
- A query expansion search is a modification of a natural language search
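As a rough illustration of these three modes (my own sketch, not from the original notes), the snippet below runs them from R through the DBI/RMySQL driver. The articles table, its title/body columns, and the connection details are made-up placeholders, and the table is assumed to already have a FULLTEXT index on those columns.

  # Sketch: MySQL full-text search modes called from R via DBI/RMySQL.
  # Assumes a table articles(title, body) with a FULLTEXT(title, body) index already defined.
  library(DBI)
  library(RMySQL)

  con <- dbConnect(MySQL(), dbname = "testdb", host = "localhost",
                   user = "user", password = "password")

  # Natural language mode (the default): the string is treated as a human-language phrase.
  natural <- dbGetQuery(con, "SELECT id, title FROM articles
    WHERE MATCH(title, body) AGAINST('database indexing' IN NATURAL LANGUAGE MODE)")

  # Boolean mode: +/- and other special query-language operators.
  boolean <- dbGetQuery(con, "SELECT id, title FROM articles
    WHERE MATCH(title, body) AGAINST('+database -mysql' IN BOOLEAN MODE)")

  # Query expansion: a natural language search that is re-run with words taken
  # from the most relevant documents of the first pass.
  expanded <- dbGetQuery(con, "SELECT id, title FROM articles
    WHERE MATCH(title, body) AGAINST('database' WITH QUERY EXPANSION)")

  dbDisconnect(con)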
- Sql Server Full Text Search http://technet.microsoft.com/en-us/library/ms142571.aspx
- The beginning of the article gives an overview of text search, its functionality, architecture, and modes of searching.
- An interesting section on this page is the list of related tasks at the end, which gives more detail on exactly how to do the search. The most helpful article is the first one, on how to get started with full-text search: http://technet.microsoft.com/en-us/library/ms142497.aspx
- http://lucene.apache.org/solr/ Apache Solr/Lucene
- Rest Api
- Stand alone
- Tutorial http://lucene.apache.org/solr/4_6_0/tutorial.html
- Interesting Project on top of Lucene http://www.elasticsearch.org/overview/
- Interesting because it supports real-time analytics and real-time search, is document-oriented, RESTful (like Solr), and does full-text search
- BaseX http://basex.org/
- XML database with full-text search, using XPath for search.
- DataparkSearch http://www.dataparksearch.org/ for search within a website, a group of websites, or an intranet
- Documentation http://www.dataparksearch.org/index.en.html
- ht://Dig http://www.htdig.org/
- Apache Lucy http://lucy.apache.org/
- A loose C port of Lucene (the Java search engine)
- Full Text Search
- Lemur Project http://www.lemurproject.org/
- Search for Websites http://www.searchdaimon.com/
- http://swish-e.org/ Swish-e
Oct 12-13 2013
- The domains for nData Consulting and nDataAnalytics and the nDataConsulting website are about to expire in a month or so. Need to renew them.
- Discuss the visualization status and the progress made and the next steps.
October 5 2013 Agenda
- Asad: Discuss the progress with the visualizations.
September 28
- Discuss the project meeting about Recipe client.
- Discuss the current progress on the visualizations by Asad.
September 21 – September 22 2013
- Discuss the report covering the architecture used in the analysis, the future steps for the visualization, and the steps used in the generation of the clusters. The file is here. The original iWork Pages file is here. The doc file is . I will add the PDF for all to see at the time of the meeting or when the document is finished, whichever comes first, as I am still writing the document.
- Will need to discuss whether the direction in the previous point is correct and what the next steps are. Also need to come up with some visualization for the companies, as we can't display 2500 nodes and 90000 edges.
- Discuss the angular.js project. (STATUS: I turned down the project. Neither I nor Asad can fulfill its requirements, and the requirement is a straight 8 hours, which I can't commit to since I work in multiple periods of the day, need to go for prayers, and can't work 8 hours online straight. The learning curve of angular.js is also too high right now to learn within a week and start work straight away. Asad and I will explore angular.js, but we can't commit to this project right now.)
- From Jawad: Just an update that on Monday I have a meeting with a client I have worked with in the past on a mobile development / Ruby on Rails project related to recipes. I will update everyone after Monday on what the client says.
August 31 – September 1st 2013
- Asad: Need to discuss the work done on the visualization for the Venture Capital dataset
- Show the visualizations
- Discuss the generation of data
- Discuss integration with mongodb using spring architecture.
- Discuss modifications on bubble graph
- Discuss modification on the basic graph generated when clicking on a firm.
- Discuss the main visualization for displaying the companies invested when clicked upon a firm.
- Discuss if anything is to be done if clicked upon a company and what exactly?
- Discuss next week goals/work
- Asad: Discuss the progress about demo prepared for d3.js client and any future developments.
- Fawad bhai: Discuss if there is anything to discuss.
- Jawad: Need to discuss the rubric-based assessment with personalized learning recommendations article that was shared (LINK to article)
- I have read the article and to some extent understood the architecture for the recommendation system. How to go forward from here?
- What are the next steps?
- Jawad: Share progress related to clustering
- Discuss the MasterProjectRA.pdf which contains clustering algorithms
- Discuss progress on the MasterProjectRA
- Discuss the next steps for clustering the venture capital data.
Venture Capital – Startup Network
Original Documents:
- Initial Report PDF
- Initial input data file Excel File( ), CSV File( )
- git repo: /home/ec2-user/GitRepo/R/VentureCapital.git/ , reachable as ec2:/home/ec2-user/GitRepo/R/VentureCapital.git (ec2: is an alias)
Clustering
- For Firms FirmCluster.json
- For Companies
Command to import data:
- Change directory to mongo bin path.
- ./mongoimport -d ndataconsulting -c VCDeals --type csv --file /home/ntreees/VCdeals.csv --headerline
- The above command imports the CSV file into the ndataconsulting database, into the VCDeals collection. (A sketch of reading this collection from R follows the index list below.)
- Single-column indexes:
- db.VCDeals.ensureIndex({"companyname": 1})
- db.VCDeals.ensureIndex({"datefund": 1})
- db.VCDeals.ensureIndex({"firmname": 1})
- db.VCDeals.ensureIndex({"companysituation": 1})
- db.VCDeals.ensureIndex({"companypublicstatus": 1})
- db.VCDeals.ensureIndex({"companystatecode": 1})
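To connect this collection to the R analysis, one possible way (an assumption on my part; the project itself discusses Spring for the MongoDB integration) is to read the deals into R with the mongolite package. The field names firmname and companyname are taken from the index commands above; everything else is a placeholder.

  # Sketch only: read the VCDeals collection into R and build a firm-company edge list.
  # Assumes a local MongoDB and the mongolite package; the actual project may use a different driver.
  library(mongolite)

  vcdeals <- mongo(collection = "VCDeals", db = "ndataconsulting",
                   url = "mongodb://localhost")

  deals <- vcdeals$find(fields = '{"firmname": 1, "companyname": 1, "_id": 0}')

  # Each (firmname, companyname) pair is one edge of the firm-startup network.
  edges <- unique(deals[, c("firmname", "companyname")])
  head(edges)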
iGraph R Module
- http://igraph.wikidot.com/community-detection-in-r
- IGraph Community Detection Details http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
- iGraph Documentation ()
- iGraph Tutorial (http://igraph.sourceforge.net/igraphbook/igraphbook-datamodel.html)
- Drawing Graph (http://horicky.blogspot.com/2012/04/basic-graph-analytics-using-igraph.html)
- Pseudo Inverse()
- Get Adjacency Matrix (http://stackoverflow.com/questions/14849835/how-to-calculate-adjacency-matrices-in-r)
- layout=layout.fruchterman.reingold, a force-directed layout implementation
- Drawing a graph
- Degree of a graph
- Laplacian of a graph (a short R sketch of these basic operations follows at the end of this iGraph section)
- Example to create graph from data http://igraph.sourceforge.net/igraphbook/igraphbook-creating.html
- Integration With R
- Interesting Snippets
- http://www.r-bloggers.com/network-visualization-in-r-with-the-igraph-package/
- http://markov.uc3m.es/2012/11/temporal-networks-with-igraph-and-r-with-20-lines-of-code/
- http://rdatamining.wordpress.com/2012/05/17/an-example-of-social-network-analysis-with-r-using-package-igraph/
- http://nsaunders.wordpress.com/2010/04/21/experiments-with-igraph/
- http://stackoverflow.com/questions/9876267/r-igraph-community-detection-edge-betweenness-method-count-list-members-of-e
- http://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/
- http://somelab.net/2012/11/how-to-create-a-network-animation-with-r-and-the-igraph-package/
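A minimal sketch of the basic igraph operations listed above (my own toy example, not code from the project repo); with the venture capital data, the edge list would come from the firm-company pairs instead of the made-up one below.

  # Build a small undirected graph, inspect it, and draw it with a force-directed layout.
  library(igraph)

  # Toy edge list standing in for firm-company pairs.
  edges <- data.frame(from = c("FirmA", "FirmA", "FirmB", "FirmB", "FirmC"),
                      to   = c("Co1",   "Co2",   "Co2",   "Co3",   "Co3"))

  g <- graph.data.frame(edges, directed = FALSE)  # create the graph from the data frame

  get.adjacency(g)     # adjacency matrix of the graph
  degree(g)            # degree of every vertex
  graph.laplacian(g)   # graph Laplacian (degree matrix minus adjacency matrix)

  # Force-directed (Fruchterman-Reingold) layout for drawing the network.
  plot(g, layout = layout.fruchterman.reingold, vertex.size = 20)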
Details about the algorithm (the Newman-Girvan cohesion-based clustering algorithm):
- It is an algorithm used to find community structure. In a community structure, sets of nodes that are densely connected to each other form groups, and these groups are only sparsely connected to other groups. Basically, nodes are more likely to be connected to each other if they are in the same community and less likely if they are in different communities.
- The algorithm works by finding the edges between communities and removing them, leaving behind only the communities themselves. To identify these edges it uses edge betweenness (a short R/igraph example follows this list).
- Edge betweenness assigns a large value to an edge if it lies on the shortest paths between many pairs of nodes.
- Popular but slow: it takes O(m^2 n) time on a network of n vertices and m edges, making it impractical for large networks.
- It focuses on the edges that are least central to communities, the edges that are most "between" communities. The communities are detected by progressively removing these edges from the original graph.
- If a network contains communities or groups that are only loosely connected by a few intergroup edges, then all shortest paths between different communities must go along one of these few edges. Thus, the edges connecting communities will have high edge betweenness (at least one of them). By removing these edges, the groups are separated from one another and so the underlying community structure of the network is revealed.
- http://www.sixhat.net/finding-communities-in-networks-with-r-and-igraph.html
- http://open.umich.edu/education/si/si508/fall2008/materials#Labs
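A minimal sketch of running this algorithm in R with igraph's edge.betweenness.community, reusing the toy graph from the earlier snippet rather than the real VC data. The JSON export at the end is only meant to suggest how a file in the spirit of FirmCluster.json could be produced; the rjson package and the output file name are my own assumptions.

  # Newman-Girvan (edge betweenness) community detection with igraph.
  library(igraph)

  # Toy graph standing in for the firm-startup network.
  edges <- data.frame(from = c("FirmA", "FirmA", "FirmB", "FirmB", "FirmC", "Co1"),
                      to   = c("Co1",   "Co2",   "Co2",   "Co3",   "Co3",   "Co2"))
  g <- graph.data.frame(edges, directed = FALSE)

  # Progressively remove the highest-betweenness edges and record the resulting communities.
  ebc <- edge.betweenness.community(g)

  membership(ebc)  # community id for every vertex
  sizes(ebc)       # number of vertices in each community

  # Color the drawing by detected community.
  plot(g, vertex.color = membership(ebc), layout = layout.fruchterman.reingold)

  # One (assumed) way to write the membership out as JSON for the visualization layer.
  library(rjson)
  writeLines(toJSON(as.list(membership(ebc))), "FirmClusterExample.json")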
Databases In Review
Can the new NoSQL database formats (key-value stores, graph databases, column-family datastores, and document-oriented databases) compete with already highly optimized relational databases like Oracle, MySQL, etc.?
The traditional relational databases have little room for improvement: they are highly optimized and are already in place in a large number of applications, but they do not scale well, and that is where the different types of NoSQL databases come in. They drop some things, such as transactions, or compromise on some features, but they are built to be fast and scalable. Also, with the fast pace of change in applications and the need to adapt the database quickly to changing requirements and a changing database structure, the traditional databases are more difficult to change: changing them means changing the whole database structure and all of the applications that use it, which makes it hard to accommodate change. In less critical areas, such as social network data, where change is normal and traditional databases are too difficult to use and to scale, NoSQL provides the flexibility to change the structure for new data and to merge different formats of data without changing existing applications. Sharding and replication also work well with large databases. With the growth of the internet, more and more data is collected by every organization, and existing databases fail to accommodate this big data. To accommodate it, there is a need to use technologies like NoSQL databases, Hadoop, MapReduce, and similar techniques that break the problem into smaller chunks, together with cloud computing, to do what is no longer possible in traditional databases.
In the past, if you had a lot of data with many columns and, based on those columns, you wanted to find a pattern between the variables and the output you were interested in analyzing, you would use statistical analysis. These statistical models become difficult to use when the data approaches a large scale, i.e. big data: the models become slow or outright impossible to run. So, to still use them, there is a need for algorithms that distribute the data into buckets, use Hadoop and MapReduce to apply the calculation of interest to the smaller problems, and then merge the partial results to get the overall result we want. This now also involves the use of cloud computing.
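As a rough illustration of that split, apply, and merge idea (a toy example in plain R rather than an actual Hadoop/MapReduce job), the sketch below computes a global mean by splitting the data into buckets, computing a partial sum and count per bucket (the map step), and combining the partial results (the reduce step).

  # Toy split/apply/combine (MapReduce-style) computation of a mean in plain R.
  set.seed(1)
  x <- rnorm(1e6)                              # pretend this is too big to process at once

  buckets <- split(x, cut(seq_along(x), 10))   # split: divide the data into 10 buckets

  # map: each bucket independently produces a partial (sum, count) pair.
  partials <- lapply(buckets, function(b) c(sum = sum(b), n = length(b)))

  # reduce: merge the partial results into the final answer.
  totals <- Reduce(`+`, partials)
  totals["sum"] / totals["n"]                  # same value as mean(x), computed piecewise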