Searchable Text Database

Open Source Options

  1. Full Text Search
    1. http://en.wikipedia.org/wiki/Full_text_search#Software
    2. http://www.mediawiki.org/wiki/Fulltext_search_engines
  2. Search Engines I Find Interesting
    1. Sphinx
    2. MySQL full-text search
    3. SQL Server full-text search
    4. Lucene, and Elasticsearch on top of Lucene
  3. Full-Text Search Comparisons
    1. http://full-text-search.findthebest.com/
    2. A very nice comparison http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
    3. http://taschenorakel.de/mathias/2012/04/18/fulltext-search-benchmarks/
    4. http://www.dbbest.com/blog/lucene-vs-sql-server-fts/
    5. http://beerpla.net/2009/09/03/comparison-between-solr-and-sphinx-search-servers-solr-vs-sphinx-fight/
  4. Sphinx http://sphinxsearch.com/
    1. I have personally used Sphinx in a Ruby on Rails project. The setup was: run the Sphinx server in the background, install a gem to talk to it, declare in each model file which attributes to index, and write the search calls. To speed up indexing I used a delta index together with a delayed gem, which kept a local copy of each change. A cron job then moved the delta changes into the full index periodically, sometimes after a few days and sometimes after a week. The delta index is normally much smaller than the full index and holds the most recent changes that have not yet been integrated into it. This avoids updating the whole index on every change, since re-indexing is time-consuming and takes longer as the index grows. I found the Sphinx server easy to use once I got the hang of it.
    2. http://en.wikipedia.org/wiki/Sphinx_(search_engine)
    3. Can be used stand-alone or with MySQL, MariaDB, and PostgreSQL, or through ODBC with any ODBC-compliant DBMS
    4. Sphinx latest release download http://sphinxsearch.com/downloads/release/
    5. Documentation 
    6. Integrates with many programming languages and is highly scalable.
    7. Has many features related to natural language processing, such as stopwords and tokenization.
    8. Note that the original contents of the fields are not stored in the Sphinx index. The text that you send to Sphinx gets processed, and a full-text index (a special data structure that enables quick searches for a keyword) gets built from that text. But the original text contents are then simply discarded. Sphinx assumes that you store those contents elsewhere anyway.
    9. There are multiple modes of searching which can be found
    10. http://stackoverflow.com/questions/737275/comparison-of-full-text-search-engine-lucene-sphinx-postgresql-mysql
  5. MySQL Full-Text Search
    1. Modes of search:
      1. A boolean search interprets the search string using the rules of a special query language
      2. A natural language search interprets the search string as a phrase in natural human language
      3. A query expansion search is a modification of a natural language search
  6. SQL Server Full-Text Search http://technet.microsoft.com/en-us/library/ms142571.aspx
    1. The beginning of the article gives an overview of text search: its functionality, architecture, and modes of searching.
    2. The interesting section on this page is the list of related tasks at the end, which gives more detail on how exactly to do the search. The most helpful article is the first one, on how to get started with full-text search: http://technet.microsoft.com/en-us/library/ms142497.aspx
  7. http://lucene.apache.org/solr/ Apache Solr/Lucene
    1. REST API
    2. Standalone
    3. Tutorial http://lucene.apache.org/solr/4_6_0/tutorial.html
  8. Interesting project on top of Lucene: Elasticsearch http://www.elasticsearch.org/overview/
    1. Interesting because it supports real-time analytics and real-time search, and is document-oriented, RESTful (like Solr), and built for full-text search
  9. BaseX http://basex.org/
    1. XML database with full-text search, queried with XPath/XQuery.
  10. DataparkSearch http://www.dataparksearch.org/ for search within a website, a group of websites, or an intranet
    1. Documentation http://www.dataparksearch.org/index.en.html
  11. ht://Dig http://www.htdig.org/
  12. Apache Lucy http://lucy.apache.org/
    1. Loose C port of Lucene (the Java search engine)
    2. Full Text Search
  13. Lemur Project http://www.lemurproject.org/
  14. Search for Websites http://www.searchdaimon.com/
  15. http://swish-e.org/ Swish-e
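
A rough Python sketch of two ideas from the Sphinx notes above: a full-text index that keeps only keyword postings and discards the original text, and a small delta index that is searched alongside the main index and merged into it later. This is not Sphinx's implementation or API. The class names, the tiny stopword list, and the tokenizer are all assumptions made for illustration, and updates or deletes of already-indexed documents are ignored.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


class InvertedIndex:
    """Maps each keyword to the set of document ids that contain it.

    Only the postings are kept; the original text is discarded.
    """

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every query keyword."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result

    def merge(self, other):
        """Fold another index's postings into this one."""
        for token, ids in other.postings.items():
            self.postings[token] |= ids


class MainPlusDelta:
    """A large main index plus a small delta index for recent documents."""

    def __init__(self):
        self.main = InvertedIndex()
        self.delta = InvertedIndex()

    def add(self, doc_id, text):
        # New documents only touch the small delta index.
        self.delta.add(doc_id, text)

    def search(self, query):
        # Each document lives in exactly one index, so a union is enough.
        return self.main.search(query) | self.delta.search(query)

    def merge_delta(self):
        # What the periodic (for example cron-driven) job would do.
        self.main.merge(self.delta)
        self.delta = InvertedIndex()
```

New documents stay searchable immediately through the delta index, while the expensive work on the main index happens only when `merge_delta()` runs.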

Venture Capital – Startup Network

Databases In Review

Can the new NoSQL database formats, such as key-value stores, graph databases, column-family datastores, and document-oriented databases, compete with already optimized relational databases like Oracle and MySQL?

Traditional relational databases have little room left for improvement: they are highly optimized and already in place in a large number of applications. However, they do not scale well, and that is where the different types of NoSQL databases come in. They give up some things, such as transactions, or compromise on some features, but they are built to be fast and scalable. Applications also change quickly, and the database has to adapt to changing requirements and a changing structure, which is harder with a traditional database. Changing a traditional database can mean changing the whole database structure and every application that uses it. In less critical areas such as social network data, where change is normal and a traditional database is hard to use and to scale, NoSQL provides the flexibility to change the structure for new data and to merge data in different formats without changing existing applications. Sharding and replication work well with large databases.

With the growth of the internet, organizations collect more and more data, and existing databases fail to accommodate Big Data. Accommodating it calls for technologies like NoSQL databases, Hadoop, MapReduce, and similar techniques that break problems into smaller chunks, and for cloud computing to do what is no longer possible in traditional databases.
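
To make the sharding and schema-flexibility points above concrete, here is a minimal Python sketch of a document store that spreads records over several shards by hashing the key. It is only an illustration under assumptions: `ShardedStore` and its methods are hypothetical names, each shard is an in-memory dict standing in for a separate server, and replication and re-sharding are left out.

```python
import hashlib


class ShardedStore:
    """Toy key/document store that routes each key to one of N shards."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate database server.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(num_shards=4)
# Documents do not need to share a schema: the second record adds a field
# without any change to the first record or to the store itself.
store.put("user:1", {"name": "Ada"})
store.put("user:2", {"name": "Bob", "followers": ["user:1"]})
```

Because the shard is derived from the key alone, any client can find a record without a central lookup, which is what lets this kind of store grow by adding machines.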

In the past, if you had a lot of data with many columns and wanted to find a pattern relating those variables to an output of interest, you would use statistical analysis. These statistical models become too slow, or outright impractical, once the data approaches Big Data scale. To keep using them, we need algorithms that distribute the data into buckets, use Hadoop and MapReduce to apply the calculation we are interested in to each smaller problem, and then merge the partial results into the result we want. This now involves the use of cloud computing.
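
The bucket, calculate, and merge pattern described above can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop code: the function names are hypothetical, and the statistic chosen (mean and population variance) is just one example of a calculation that can be built from per-bucket summaries.

```python
from functools import reduce


def split_into_buckets(values, num_buckets):
    """Distribute the values round-robin over num_buckets buckets."""
    return [values[i::num_buckets] for i in range(num_buckets)]


def map_bucket(bucket):
    """Map step: summarize one bucket as (count, sum, sum of squares)."""
    return (len(bucket), sum(bucket), sum(x * x for x in bucket))


def combine(a, b):
    """Reduce step: merge two partial summaries."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(values, num_buckets=4):
    """Compute the mean and population variance from per-bucket summaries.

    Assumes values is non-empty. In a real cluster each map_bucket call
    would run on a different machine.
    """
    partials = [map_bucket(b) for b in split_into_buckets(values, num_buckets)]
    count, total, total_sq = reduce(combine, partials, (0, 0, 0))
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance
```

Each bucket is reduced to three numbers, so only those small summaries need to travel between machines rather than the raw data.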

Transition from Traditional Computing to Cloud Computing

Apple is a consumer company and IBM is an enterprise company.

The cloud computing model no longer lets companies force customers to buy a new product version whenever the company wants. For example, Microsoft tries to push a new version of Word onto consumers with every release.

Companies are now moving towards cloud offerings, such as Microsoft's and Google's online services, instead of making sales only by selling new versions of their software. A consequence of the cloud model is that profit margins are decreasing. A traditional product that the consumer was forced to buy for hundreds of dollars is now cheap and easily available in the cloud. Cloud subscriptions are easy now.

An example of this shift is Adobe, which now lets consumers use its product line for a small monthly fee instead of spending hundreds of dollars to buy the whole Adobe suite.

No company can now focus on only one end of the market. To survive in this new cloud computing era, companies need to work on multiple ends of the market and provide cloud services rather than traditional services.

It is now a network market. Apple, for example, does not make its sales only by selling computers; it now offers many smaller components. It makes money from iTunes cards and from paid apps offered alongside the stock applications. Developers can make money by selling applications, but Apple takes a cut of everything sold in its App Store. Developers and publishers can make money by creating books on Apple's platform for the iBooks application on its mobile devices. Apple also provides iTunes U courses and content. All of these services are more or less integrated with cloud computing: a book can be delivered through Apple's cloud service, iCloud, and consumers can upload their documents through iCloud.

If we picture an Apple cloud, it contains multiple markets, multiple ends, and many types of focus. Apple is not focusing on one type of work; it is trying to branch into multiple areas through cloud computing. This is known as a two-sided or n-sided network, because the network has no single end or single focus and instead spans multiple areas.

IBM used to tell enterprises: if you want to do business with us and use our services, here is a contract worth millions or billions of dollars. That cannot work now, in the shift to cloud computing. IBM negotiated contracts like this instead of marketing its products for consumers to buy.

All of the models stated above will start coming into cloud computing. The nature of cloud computing is that an enterprise can be involved in the cloud directly or indirectly. The same holds for the consumer market.

The nature of data is similar. Some data is company data, some is public data, some comes from sources outside the cloud, and some is social network data. Not all data is related to enterprises. Enterprise data covers things like where the inventory is and what it contains right now, share information, the company's financials, and sales data.

There are two aspects of cloud computing. For example, Microsoft is not in a strong position in terms of its shares, whereas Apple and Google shares are doing very well right now. That said, the good thing about Microsoft is that its enterprise presence and consumer presence can be integrated, because its cloud technology and its other technology are intermixed. One example is that in Visual Studio you can develop for its cloud, web platforms, mobile platform, desktop environment, and enterprise in a centralized way. Another example is Xbox XNA, which is now starting to overlap with .NET, and there is more overlap to come between XNA and .NET. (Note: the Xbox XNA and .NET overlap needs to be validated first.) This will mean that the experience on web, mobile, and television will merge and overlap more and more. The platforms will have more in common and will be tightly integrated with each other, giving the consumer a more pleasurable overall experience.

Apple right now is not in a position to enter the enterprise market. Its governance and distribution model is not good enough for enterprise. For example, in an organization with many connected computers you would normally set an update policy so that software updates are activated on all the machines, but that is not possible with Apple: you have to manage the updates and software yourself. Apple makes a good desktop computer for consumer use, but it does not handle updates and software for you. Its iBooks and application distribution follow the same pattern. For example, there can be only one admin for a whole organization or entity, which is not feasible for an organization with thousands of people. A single admin cannot handle a whole organization of that size.

IBM once had many products, such as Lotus, that were niche and arguably better than SharePoint, but IBM no longer has the enterprise market and focus it once had. It was strong once, but in the cloud era it has so far shown no sign of adapting to the new demands. One reason products like Lotus have more or less disappeared is that IBM's marketing model is different: it asks for deals worth millions instead of selling individual low-cost products to consumers. IBM survived in the past because its contracts were lengthy and its service contracts were elaborate. Now that model has been diluted: the cloud has arrived, there is less and less demand for products costing hundreds of dollars, and the cloud model is low-cost and subscription-based. IBM will have to come into the cloud market somehow. Another concern is how exactly it will enter this market against innovative companies like Apple and Google, which are currently earning a lot, judging by their shares. People right now are interested in innovative companies because of their thought process. So IBM will have to evaluate whether it can survive without this consumer market.

Cloud Computing

  • Reference: http://en.wikipedia.org/wiki/Cloud_computing
  • Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet). Cloud computing entrusts remote services with a user’s data, software and computation.
  • What is a business model? (Answer)
  • Characteristics:
    • Agility
    • API access to software to interact with cloud systems, mostly through REST APIs.
    • Cost is reduced for some scenarios.
    • Device and location independence: use a browser to access systems from anywhere without installing additional software.
    • Virtualization allows servers and storage to be shared, increasing utilization.
    • Multitenancy allows resources and costs to be shared across a large number of users.
    • Reliability is improved if multiple redundant sites are used.
    • Scalability and elasticity
    • Performance is monitored constantly, using web services as the interface.
    • Security can be increased due to centralization of data, but the complexity of security increases when data is distributed in a public cloud. Because of this complexity, companies are moving towards private clouds.
    • Maintenance is easier.

  • Types of Cloud Computing
    • Infrastructure as a service (IaaS)
      • In this most basic cloud service model, providers offer computers, as physical or more often as virtual machines, and other resources.
      • IaaS refers not to a machine that does all the work, but simply to a facility given to businesses that offers users the leverage of extra storage space in servers and data centers.
    • Platform as a service (PaaS)
      • In the PaaS model, cloud providers deliver a computing platform, typically including operating system, programming language execution environment, database, and web server.
    • Software as a service (SaaS)
      • In this model, cloud providers install and operate application software in the cloud, and cloud users access the software from cloud clients.
      • The cloud users do not manage the cloud infrastructure and platform on which the application is running.
      • Providers provide access to application software and databases.
      • The infrastructure and platform for the software are handled by providers.
      • Advantages:
        • Lower costs, because software and hardware are handled by the cloud provider.
        • Centralized hosting.
      • Disadvantages:
        • Users’ data is stored on the provider’s server, with a risk of unauthorized access to that data.
    • Storage as a service (STaaS)
      • Storage as a service (STaaS) is a business model in which a large service provider rents space in their storage infrastructure on a subscription basis. The economy of scale in the service provider’s infrastructure allows them to provide storage much more cost effectively than most individuals or corporations can provide their own storage, when total cost of ownership is considered.
    • Security as a service (SECaaS)
      • Security as a service (SECaaS) is a business model in which a large service provider integrates their security services into a corporate infrastructure on a subscription basis more cost effectively than most individuals or corporations can provide on their own, when total cost of ownership is considered. These security services often include authentication, anti-virus, anti-malware/spyware, intrusion detection, and security event management, among others.
    • Data as a service (DaaS)
      • DaaS is based on the concept that the product, data in this case, can be provided on demand to the user regardless of geographic or organizational separation of provider and consumer.
      • Advantages
        • Agility – Customers can move quickly due to the simplicity of the data access and the fact that they don’t need extensive knowledge of the underlying data. If customers require a slightly different data structure or have location-specific requirements, the implementation is easy because the changes are minimal.
        • Cost-effectiveness – Providers can build the base with the data experts and outsource the presentation layer, which makes for very cost effective user interfaces and makes change requests at the presentation layer much more feasible.
        • Data quality – Access to the data is controlled through the data services, which tends to improve data quality because there is a single point for updates. Once those services are tested thoroughly, they only need to be regression tested if they remain unchanged for the next deployment.
      • Disadvantages
        • A common criticism is that, compared to traditional data delivery, the consumer is really just “renting” the data: it can be used to produce a graph, chart, or map, or possibly to perform analysis, but with data as a service the data is generally not available for download.
    • Database as a service (DBaaS)
    • Test environment as a service (TEaaS)
      • Sometimes referred to as an “on-demand test environment,” this is a test environment delivery model in which software and its associated data are hosted centrally (typically in the Internet cloud) and are typically accessed by users through a thin client, normally a web browser over the Internet.
    • Desktop virtualization
      • Desktop virtualization involves encapsulating and delivering either access to an entire information system environment or the environment itself to a remote client device. The client device may use an entirely different hardware architecture from that used by the projected desktop environment, and may also be based upon an entirely different operating system. The desktop virtualization model allows the use of virtual machines to let multiple network subscribers maintain individualized desktops on a single, centrally located computer or server. The central machine may operate at a residence, business, or data center. Users may be geographically scattered, but all must be connected to the central machine by a local area network, a wide area network, or the public Internet.
    • API as a service (APIaaS)
      • API as a service is a service platform that enables the creation and hosting of APIs (application programming interfaces). These APIs normally provide multiple entry points for API calls, ranging from REST and XML web services to TCP/IP.
    • Backend as a service (BaaS)
      • Web and mobile apps require a similar set of features on the backend, including push notifications, integration with social networks, and cloud storage. Each of these services has their own API that must be individually incorporated into an app, a process that can be time-consuming and complicated for app developers. BaaS providers form a bridge between the frontend of an application and various cloud-based backends via a unified API and SDK.

Data Platform

Professor Brynjolfsson says:

“Decisions will increasingly be based on data and analysis rather than on experience and intuition. ‘We can start being a lot more scientific,’ he observes.”

Mr. Smolan, an enthusiast, says:

“Big Data has the potential to be ‘humanity’s dashboard,’ an intelligent tool that can help combat poverty, crime and pollution.”

This paper proposes the creation of a SOA- and Big Data-enabled data-platform. The platform will allow us to achieve continuous improvement by regulating persistent and predictive statistics associated with activities. The governing objectives are defined by five steps: strategize, align, govern, execute, and optimize. The activities are selected according to the theory of cause and effect, to improve the drivers behind the governing objectives. This assumes that the business’s governing objectives are closely tied to the allocation of capital across activities. A data-platform that regulates the governing objectives by reevaluating selected statistics is claimed to increase productivity by 5 to 6%.

Creating data-centric, extensible architectures based on SOA to expose existing infrastructure is key to creating a data-platform. Traditional data mining and business analytics techniques can be improved upon using concepts such as Big Data and SOA to create an agile data-platform. The predictive power of Big Data needs to be explored and considered for economic development and economic forecasting.

This agile data-platform can be key in monitoring performance indicators, including financial measures such as sales growth and earnings per share and non-financial measures such as customer loyalty, perceptions, and product quality. These indicators can be used to measure and manage results in order to implement a continuous improvement process.
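
As a small illustration of the two financial indicators named above, here is a Python sketch using their standard textbook definitions. The function names and the sample figures are assumptions for illustration; they are not part of the proposed platform.

```python
def sales_growth(current_sales, previous_sales):
    """Period-over-period sales growth as a fraction (0.2 means 20%)."""
    if previous_sales == 0:
        raise ValueError("previous_sales must be non-zero")
    return (current_sales - previous_sales) / previous_sales


def earnings_per_share(net_income, preferred_dividends, weighted_avg_shares):
    """Basic EPS: income available to common shareholders per share."""
    if weighted_avg_shares <= 0:
        raise ValueError("weighted_avg_shares must be positive")
    return (net_income - preferred_dividends) / weighted_avg_shares


# Made-up sample figures, for illustration only.
growth = sales_growth(current_sales=120.0, previous_sales=100.0)
eps = earnings_per_share(
    net_income=1_000_000.0,
    preferred_dividends=100_000.0,
    weighted_avg_shares=450_000.0,
)
```

A data-platform would recompute indicators like these on a schedule and track them over time alongside the non-financial measures.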